In this post, I will detail how to calculate the entropy of
a given dataset using the map-reduce paradigm. In the next post, I will use these
methods for outlier detection based on entropy values. This post covers entropy
calculation for discrete values. It is easy to extend these techniques to continuous values,
but I will reserve that for another post. The entropy values can be used
downstream to filter outliers from a dataset.
Refer to the above wiki link for some theory behind
entropy. From a dataset perspective,
entropy is calculated at the column level: an entropy value is computed for every column,
and the sum of the entropy values of all columns is the entropy of the dataset.
Consider a dataset X = {x1, x2, …, xn}, having n tuples.
Each tuple xi = {xi1, xi2, …, xim} has m attributes.
E(xj) = entropy of attribute j, where 1 <= j <= m
The entropy of the dataset is the sum of the entropies of its
individual attributes:
E(X) = E(x1) + E(x2) + … + E(xm)
The entropy of a single attribute is E = -Sum(p(x) * log2(p(x))), where p(x) is the probability of value x occurring in that column.
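As a quick illustration of the formula, here is a minimal sketch in Python (the function name, the toy two-column dataset, and the column-splitting step are mine, for illustration only):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of one discrete attribute: -sum(p * log2(p))."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A hypothetical dataset with two attributes; the dataset entropy
# is the sum of the per-column entropies.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
columns = list(zip(*rows))
dataset_entropy = sum(entropy(col) for col in columns)
```

Each column here has two equally likely values, so each column contributes exactly 1 bit.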
I have implemented entropy calculation here as a series of
map-reduce tasks:
1. Row counter – find the number of rows, n, in the dataset.
2. Attribute instance counter – count the number of occurrences of each unique value an attribute possesses.
3. Entropy calculation – use the previous two outputs to find the entropy of each attribute.
Map(RowNumber, row)
    cols = split the row by delimiter
    for each col in cols
        Emit(colindex, 1)

Reduce(colindex, list<1>)
    sum = 0
    for each element in list<1>
        sum++
    Emit(colindex, sum)
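The row-counting job above can be sketched in Python as follows; the small in-memory driver stands in for the framework's shuffle-and-group step, and all names here are mine, not part of any map-reduce API:

```python
from collections import defaultdict

def count_map(row, delimiter=","):
    """Map: emit (colindex, 1) for every column in the row."""
    for colindex, _ in enumerate(row.split(delimiter)):
        yield colindex, 1

def count_reduce(colindex, ones):
    """Reduce: sum the 1s to get the row count for this column."""
    return colindex, sum(ones)

def run_count_job(rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in count_map(row):
            grouped[key].append(value)
    return dict(count_reduce(k, v) for k, v in grouped.items())
```

For a three-row, two-column input, each column's count comes out as 3.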
The above map-reduce job counts the number of rows per column in
the dataset.
Map(LineNumber, row)
    for each col in row
        Emit(colNumber + colValue, 1)

Reduce(key (colNumber + colValue), list)
    total = sum(all list values)
    Emit(key, total)
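A sketch of the instance-counting job, again with a toy in-memory driver replacing the framework's shuffle; I use a (column number, column value) tuple as the composite key, which is one possible encoding of the colNumber + colValue key above:

```python
from collections import defaultdict

def value_map(row, delimiter=","):
    """Map: key on (column number, column value) so the reducer
    counts occurrences of each distinct value per column."""
    for colnumber, colvalue in enumerate(row.split(delimiter)):
        yield (colnumber, colvalue), 1

def value_reduce(key, ones):
    """Reduce: total occurrences of one (column, value) pair."""
    return key, sum(ones)

def run_value_job(rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in value_map(row):
            grouped[key].append(value)
    return dict(value_reduce(k, v) for k, v in grouped.items())
```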
By combining the column number and column value into a key
in the map routine, the above map-reduce job calculates the count of each distinct
attribute value.
Output of the above two map-reduce jobs is used by the
next job to calculate the entropy
values. I am assuming the total row counts are available as a hash map called
TotalRows: given a column number, this hash map returns the number of rows.
CountDict is another hash map that stores the counts of distinct attribute values;
given a column number and value, it returns the count.
Map(LineNumber, row)
    for each col in row
        key1 = colNumber + colValue
        key2 = colNumber
        count = CountDict(key1)
        rows = TotalRows(key2)
        // each of the count rows holding this value contributes
        // (1/rows) * log2(count/rows); summed over those rows, this
        // yields p * log2(p) for the value exactly once
        contribution = (1 / rows) * log2(count / rows)
        Emit(colNumber, contribution)
Reduce(key, list)
    sum = 0
    for row in list
        sum = sum + row
    Emit(key, -1 * sum)
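The entropy job can be sketched as below. Since the map runs once per row, each row contributes (1/n) * log2(count/n), so that the count rows sharing a value together sum to p * log2(p); emitting count/n * log2(count/n) per row would count each value's term count times. The driver and dictionary shapes are mine, matching the toy jobs above:

```python
from collections import defaultdict
from math import log2

def entropy_map(row, count_dict, total_rows, delimiter=","):
    """Map: per-row entropy contribution for each column."""
    for colnumber, colvalue in enumerate(row.split(delimiter)):
        count = count_dict[(colnumber, colvalue)]   # occurrences of this value
        n = total_rows[colnumber]                   # rows in this column
        # (1/n) * log2(count/n): summed over the count rows with this
        # value, this equals p * log2(p).
        yield colnumber, (1 / n) * log2(count / n)

def entropy_reduce(colnumber, contributions):
    """Reduce: negate the summed contributions to get the entropy."""
    return colnumber, -sum(contributions)

def run_entropy_job(rows, count_dict, total_rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in entropy_map(row, count_dict, total_rows):
            grouped[key].append(value)
    return dict(entropy_reduce(k, v) for k, v in grouped.items())
```

For a single column of four rows split evenly between two values, the job returns 1.0 bit.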
The final output is the individual entropy of each
attribute. Summing these individual values gives the entropy of the
dataset.