In this article, I will discuss about an outlier detection technique
called Attribute Value Frequency (AVF) for categorical data. This is a part of
series of blogs I intend to write on outlier detection methods. AVF method is attributed
to Anna Koufakau (Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes).
The intuition behind AVF; outliers tend to occur infrequently
in a dataset. Tuples are scored based on frequency of individual attribute
values. Tuples with low AVF scores are marked as outliers.
Assume X = {x1,x2,x3…..xn} is the given dataset having n
tuples ,each tuple with m attributes,
xi = {xi1,xi2…..xim}
where 1 <= i <= n.
The function in the formula is frequency
count of attribute j’s value. If a tuple has rare values for each of its attribute,
due to summation, its overall score will be low. Thus at the end of the
exercise, by arranging the tuples in increasing order of AVF score, the top K
tuples are identified as outliers.
Given the summation nature of the calculation, the algorithm
can be coded up using map reduce paradigm. The pseudo-code below implements AVF
using map- reduce constructs.
Map (LineNumber,Row)
For each
col in Row
Emit
( colNumber + colValue,1)
Reduce ( key (colNumber + colValue ), List )
Total =
sum ( all list values)
Emit(key,Total)
The output of Reduce is used in the next map reduce. Assume it’s a hashmap named as CountDict.
Map (LineNumber,Row)
Sum = 0
For each col in Row
Key
= colNumber + colValue
Count
= CountDict(Key)
Sum
= sum + count
Emit(Sum,Row)
Reduce (Key (Sum),List)
For row
in List
Emit
(Key, row)
References
- Scalable and Efficient Outlier Detection in
Large Distributed Data Sets with Mixed-Type Attributes
No comments:
Post a Comment