In this post, I will detail how to calculate the entropy of
a given dataset using the map-reduce paradigm. In the next post, I will use these
methods for outlier detection based on entropy values. This post covers entropy
calculation for discrete values. It is easy to extend these techniques to continuous values,
but I will reserve that for another post. The entropy values can be used
downstream to filter outliers from a dataset.
Refer to the above wiki link for some theory behind
entropy. From a dataset perspective,
entropy is calculated at the column level: an entropy value is computed for every column,
and the sum of the entropy values of all columns is the entropy of the dataset.
Consider a dataset X = {x1, x2, …, xn}, having n tuples.
Each tuple xi = {xi1, xi2, …, xim} has m attributes.
E(xj) = entropy of attribute j, where 1 <= j <= m
The entropy of the dataset is the sum of the entropies of its
individual attributes:
E(X) = E(x1) + E(x2) + … + E(xm)
The entropy of a single attribute is E = -Sum(p(x) * log2(p(x))), where p(x) is the probability of value x occurring in that column.
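As a quick illustration of the formula, here is a minimal sketch in Python (the function name, the toy two-column dataset, and the column-splitting step are mine, for illustration only):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of one discrete attribute: -sum(p * log2(p))."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A hypothetical dataset with two attributes; the dataset entropy
# is the sum of the per-column entropies.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
columns = list(zip(*rows))
dataset_entropy = sum(entropy(col) for col in columns)
```

Each column here has two equally likely values, so each column contributes exactly 1 bit.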
I have implemented entropy calculation here as a series of
map-reduce tasks:
1. Row counter – find the number of rows, n, in the dataset.
2. Attribute instance counter – count the number of occurrences of each unique value an attribute possesses.
3. Entropy calculation – use the previous two outputs to find the entropy of each attribute.
Map(RowNumber, row)
    cols = split the row by delimiter
    for each col in cols
        Emit(colindex, 1)

Reduce(colindex, list<1>)
    sum = 0
    for each element in list<1>
        sum++
    Emit(colindex, sum)
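The row-counting job above can be sketched in Python as follows; the small in-memory driver stands in for the framework's shuffle-and-group step, and all names here are mine, not part of any map-reduce API:

```python
from collections import defaultdict

def count_map(row, delimiter=","):
    """Map: emit (colindex, 1) for every column in the row."""
    for colindex, _ in enumerate(row.split(delimiter)):
        yield colindex, 1

def count_reduce(colindex, ones):
    """Reduce: sum the 1s to get the row count for this column."""
    return colindex, sum(ones)

def run_count_job(rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in count_map(row):
            grouped[key].append(value)
    return dict(count_reduce(k, v) for k, v in grouped.items())
```

For a three-row, two-column input, each column's count comes out as 3.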
The above map-reduce job counts the number of rows per column in
the dataset.
Map(LineNumber, row)
    for each col in row
        Emit(colNumber + colValue, 1)

Reduce(key (colNumber + colValue), list)
    total = sum(all list values)
    Emit(key, total)
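A sketch of the instance-counting job, again with a toy in-memory driver replacing the framework's shuffle; I use a (column number, column value) tuple as the composite key, which is one possible encoding of the colNumber + colValue key above:

```python
from collections import defaultdict

def value_map(row, delimiter=","):
    """Map: key on (column number, column value) so the reducer
    counts occurrences of each distinct value per column."""
    for colnumber, colvalue in enumerate(row.split(delimiter)):
        yield (colnumber, colvalue), 1

def value_reduce(key, ones):
    """Reduce: total occurrences of one (column, value) pair."""
    return key, sum(ones)

def run_value_job(rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in value_map(row):
            grouped[key].append(value)
    return dict(value_reduce(k, v) for k, v in grouped.items())
```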
By combining the column number and column value into a key
in the map routine, the above map-reduce job calculates the count of each distinct
attribute value.
Output of the above two map-reduce jobs is used by the
next job to calculate the entropy
values. I am assuming the total row counts are available as a hash map called
TotalRows: given a column number, this hash map returns the number of rows.
CountDict is another hash map that stores the counts of distinct attribute values;
given a column number and value, it returns the count.
Map(LineNumber, row)
    for each col in row
        key1 = colNumber + colValue
        key2 = colNumber
        count = CountDict(key1)
        rows = TotalRows(key2)
        // each of the count rows holding this value contributes
        // (1/rows) * log2(count/rows); summed over those rows, this
        // yields p * log2(p) for the value exactly once
        contribution = (1 / rows) * log2(count / rows)
        Emit(colNumber, contribution)
Reduce(key, list)
    sum = 0
    for row in list
        sum = sum + row
    Emit(key, -1 * sum)
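The entropy job can be sketched as below. Since the map runs once per row, each row contributes (1/n) * log2(count/n), so that the count rows sharing a value together sum to p * log2(p); emitting count/n * log2(count/n) per row would count each value's term count times. The driver and dictionary shapes are mine, matching the toy jobs above:

```python
from collections import defaultdict
from math import log2

def entropy_map(row, count_dict, total_rows, delimiter=","):
    """Map: per-row entropy contribution for each column."""
    for colnumber, colvalue in enumerate(row.split(delimiter)):
        count = count_dict[(colnumber, colvalue)]   # occurrences of this value
        n = total_rows[colnumber]                   # rows in this column
        # (1/n) * log2(count/n): summed over the count rows with this
        # value, this equals p * log2(p).
        yield colnumber, (1 / n) * log2(count / n)

def entropy_reduce(colnumber, contributions):
    """Reduce: negate the summed contributions to get the entropy."""
    return colnumber, -sum(contributions)

def run_entropy_job(rows, count_dict, total_rows):
    """Toy driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in entropy_map(row, count_dict, total_rows):
            grouped[key].append(value)
    return dict(entropy_reduce(k, v) for k, v in grouped.items())
```

For a single column of four rows split evenly between two values, the job returns 1.0 bit.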
The final output is the individual entropy of each
attribute. Summing these individual values gives the entropy of the
dataset.