Monday, March 10, 2014

Entropy of a categorical data set using map reduce

In this post, I will detail how to calculate the entropy of a given dataset using the map-reduce paradigm. In the next post, I will use these methods for outlier detection, since entropy values can be used downstream to filter outliers from a dataset. This post covers entropy calculation for discrete values; it is easy to extend to continuous values, but I will reserve that for another post.
Refer to the Wikipedia article on entropy (information theory) for some of the theory behind it. From a dataset perspective, the entropy calculation is done at the column level: an entropy value is calculated for every column, and the sum of the entropy values of all columns is the entropy value for the dataset.
Consider a dataset X = {x1, x2, …, xn}, having n tuples.
Each tuple xi = {xi1, xi2, …, xim} has m attributes.
Let E(aj) = entropy of attribute j, where 1 <= j <= m.
The entropy of the dataset is the sum of the entropies of its individual attributes:
E(X) = E(a1) + E(a2) + … + E(am)
The entropy of a single attribute is E(a) = -Sum( p(v) * log2(p(v)) ), where p(v) is the fraction of rows in which the attribute takes the value v.
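As a quick check on the formula, here is a minimal Python sketch (outside the map-reduce flow) that computes the entropy of a single column; the sample column is made up for illustration:

import math
from collections import Counter

def column_entropy(values):
    # Entropy of one attribute: -sum(p(v) * log2(p(v))) over its distinct values
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A column with three 'a's and one 'b' has entropy
# -(0.75 * log2(0.75) + 0.25 * log2(0.25)), which is about 0.811 bits.
print(column_entropy(['a', 'a', 'a', 'b']))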

I have implemented the entropy calculation here as a series of map-reduce tasks:
1. Row counter – find the number of rows, n, in the dataset.
2. Attribute instance counter – count how many times each distinct value of each attribute occurs.
3. Entropy calculation – use the previous two outputs to find the entropy of each attribute.

Map(RowNumber, Row)
    Cols = split the Row by delimiter
    For each col in Cols
        Emit(colIndex, 1)

Reduce(colIndex, List<1>)
    Sum = 0
    For each element in List<1>
        Sum = Sum + 1
    Emit(colIndex, Sum)

The above segment counts the number of rows per column in the dataset.
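A minimal Python rendering of this job; the tab delimiter is my assumption, so substitute whatever your data actually uses:

def row_count_map(line):
    # Emit (column_index, 1) for every column in the row
    for idx, _ in enumerate(line.rstrip('\n').split('\t')):
        yield idx, 1

def row_count_reduce(col_index, ones):
    # The number of 1s received is the number of rows having this column
    yield col_index, sum(ones)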

Map(LineNumber, Row)
    For each col in Row
        Emit(colNumber + colValue, 1)

Reduce(key (colNumber + colValue), List)
    Total = sum(all list values)
    Emit(key, Total)
By combining the column number and column value as the key in the map routine, the above map-reduce job calculates the count of each distinct attribute value.
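The same job sketched in Python. One deviation from the pseudocode: I use a (column index, value) tuple as the key instead of string concatenation, since concatenation can collide (column 1 with value '23' versus column 12 with value '3'). The run_job harness is only a toy in-memory shuffle for local testing:

from collections import defaultdict

def value_count_map(line):
    # Emit ((column_index, value), 1) for each cell in the row
    for idx, value in enumerate(line.rstrip('\n').split('\t')):
        yield (idx, value), 1

def value_count_reduce(key, ones):
    # Total occurrences of this (column, value) pair
    yield key, sum(ones)

def run_job(lines, mapper, reducer):
    # Toy in-memory shuffle: group mapper output by key, then reduce each group
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return {k: v for key, vals in groups.items() for k, v in reducer(key, vals)}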

Output of the above two map-reduce routines is used by the next routine to calculate the entropy value. I am assuming the total number of rows is presented as a hash map called TotalRows: given a column number, it returns the number of rows. CountDict is another hash map that stores the counts of distinct attribute values: given a column number and value, it returns the count.
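One way to materialize these lookups in Python, assuming each upstream job wrote plain tab-separated key/value lines (the file layout is my assumption; on Hadoop these would typically be loaded via the distributed cache):

def load_total_rows(path):
    # {column_index: row_count}, from the row-counting job's output
    with open(path) as f:
        return {int(col): int(count)
                for col, count in (line.rstrip('\n').split('\t') for line in f)}

def load_count_dict(path):
    # {(column_index, value): count}, from the frequency-counting job's output
    with open(path) as f:
        return {(int(col), value): int(count)
                for col, value, count in (line.rstrip('\n').split('\t') for line in f)}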

Map(LineNumber, Row)
    For each col in Row
        Key1 = colNumber + colValue
        Key2 = colNumber
        Count = CountDict(Key1)
        Rows = TotalRows(Key2)
        # Each of the Count occurrences of this value contributes
        # (1/Rows) * log2(Count/Rows); together they sum to p * log2(p).
        Term = (1 / Rows) * log2(Count / Rows)
        Emit(colNumber, Term)

Reduce(Key, List)
    Sum = 0
    For each term in List
        Sum = Sum + term
    Emit(Key, -1 * Sum)


The final output is the entropy of each individual attribute. Summing these individual values gives the entropy of the dataset.
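A Python sketch of this final job, reusing the tuple keys from the earlier sketches; count_dict and total_rows are assumed to be in memory, e.g. built by the loaders shown above (in the toy run_job harness, the extra arguments can be bound with functools.partial):

import math

def entropy_map(line, count_dict, total_rows):
    # Each occurrence of a value contributes (1/n) * log2(count/n); the
    # count occurrences of that value therefore sum to p * log2(p).
    for idx, value in enumerate(line.rstrip('\n').split('\t')):
        count = count_dict[(idx, value)]
        n = total_rows[idx]
        yield idx, (1.0 / n) * math.log2(count / n)

def entropy_reduce(col_index, terms):
    # Negate the accumulated p * log2(p) terms to get the column's entropy
    yield col_index, -sum(terms)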

Thursday, March 6, 2014

Unsupervised outlier detection for categorical variables

In this article, I will discuss an outlier detection technique called Attribute Value Frequency (AVF) for categorical data. This is part of a series of posts I intend to write on outlier detection methods. The AVF method is attributed to Anna Koufakou (Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes).

The intuition behind AVF is that outliers tend to occur infrequently in a dataset. Tuples are scored based on the frequency of their individual attribute values, and tuples with low AVF scores are marked as outliers.
Assume X = {x1, x2, x3, …, xn} is the given dataset having n tuples, each tuple with m attributes:
xi = {xi1, xi2, …, xim}, where 1 <= i <= n.
AVF(xi) = (1/m) * Sum over j = 1..m of f(xij)

where f(xij) is the number of times the value xij occurs in attribute j.
The function f in the formula is the frequency count of attribute j's value. If a tuple has rare values for each of its attributes, its overall score will be low due to the summation. Thus, at the end of the exercise, after arranging the tuples in increasing order of AVF score, the first k tuples (those with the lowest scores) are identified as outliers.
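To make the scoring concrete, here is a minimal Python sketch on a made-up dataset of four rows and two attributes:

from collections import Counter

rows = [('red', 'small'), ('red', 'small'), ('red', 'small'), ('blue', 'large')]

# One frequency table per attribute
freq = [Counter(row[j] for row in rows) for j in range(len(rows[0]))]

def avf(row):
    # Mean frequency of the row's attribute values
    return sum(freq[j][v] for j, v in enumerate(row)) / len(row)

for row in sorted(rows, key=avf):
    print(row, avf(row))

# ('blue', 'large') scores (1 + 1) / 2 = 1.0, the lowest of all rows, so it
# is the strongest outlier candidate; the other rows score (3 + 3) / 2 = 3.0.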
Given the summation nature of the calculation, the algorithm can be coded using the map-reduce paradigm. The pseudo-code below implements AVF using map-reduce constructs.

Map(LineNumber, Row)
    For each col in Row
        Emit(colNumber + colValue, 1)

Reduce(key (colNumber + colValue), List)
    Total = sum(all list values)
    Emit(key, Total)

The output of this reduce is used in the next map-reduce job. Assume it is available as a hash map named CountDict: given a column number and value, it returns the count.

Map(LineNumber, Row)
    Sum = 0
    For each col in Row
        Key = colNumber + colValue
        Count = CountDict(Key)
        Sum = Sum + Count
    Emit(Sum, Row)

Reduce(Key (Sum), List)
    For each row in List
        Emit(Key, row)
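A Python rendering of this pair, again assuming CountDict is in memory with tuple keys. I divide by the number of attributes to match the AVF formula; since m is the same for every row, the raw sum in the pseudocode yields the same ordering. Note that the framework's sort on the numeric key is what delivers rows to the reducer in increasing score order:

def avf_map(line, count_dict):
    # Score the row by the mean frequency of its attribute values
    cols = line.rstrip('\n').split('\t')
    score = sum(count_dict[(idx, v)] for idx, v in enumerate(cols)) / len(cols)
    yield score, line

def avf_reduce(score, lines):
    # Identity reduce: rows arrive sorted by score, lowest (most outlying) first
    for row in lines:
        yield score, row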

References

1. Anna Koufakou, Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes.