
Monday, August 11, 2014

Unsupervised Feature Learning - Sparse Filtering


Sparse Filtering

Sparse filtering is an unsupervised feature learning method. Click here for the paper.
The code is available in my github.
This is a small Python implementation of the Sparse Filtering algorithm. The implementation depends on numpy and scikit-learn.
As discussed in the paper, we use a soft absolute activation function, where epsilon is a small smoothing constant (the paper uses 1e-8).
import numpy as np

epsilon = 1e-8  # small smoothing constant

def soft_absolute(v):
    return np.sqrt(v**2 + epsilon)
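The sfiltering function itself is not reproduced in this post; below is a minimal sketch of what such a transform could look like, reusing soft_absolute and epsilon from above and assuming scipy's L-BFGS-B optimizer with a numerical gradient (the paper derives an analytic gradient, which a real implementation would use for speed). The name and signature sfiltering(X, n_features) simply mirror how it is called later.

from scipy.optimize import minimize

def sfiltering(X, n_features, maxiter=200):
    # Sketch only: X has shape (n_samples, n_dims); returns (n_samples, n_features).
    n_samples, n_dims = X.shape

    def objective(w_flat):
        W = w_flat.reshape(n_features, n_dims)
        F = soft_absolute(W.dot(X.T))                                    # (n_features, n_samples)
        F = F / np.sqrt(np.sum(F**2, axis=1, keepdims=True) + epsilon)   # normalize each feature across samples
        F = F / np.sqrt(np.sum(F**2, axis=0, keepdims=True) + epsilon)   # normalize each sample across features
        return np.sum(F)                                                 # L1 penalty on the normalized features

    w0 = np.random.randn(n_features * n_dims)
    res = minimize(objective, w0, method='L-BFGS-B', options={'maxiter': maxiter})
    W = res.x.reshape(n_features, n_dims)
    return soft_absolute(W.dot(X.T)).T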
The input X has shape (n_samples, n_dimensions). For testing purposes, we use scikit-learn's make_classification function to create 500 samples with 100 features.
from sklearn.datasets import make_classification

def load_data():
    X, Y = make_classification(n_samples=500, n_features=100)
    return X, Y
We use a simple Support Vector Machine classifier (scikit-learn's SVC) for the final classification.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def simple_model(X, Y):
    clf_org_x = SVC()
    clf_org_x.fit(X, Y)
    predict = clf_org_x.predict(X)
    acc = accuracy_score(Y, predict)
    return acc
We train a two-layer network.

X, Y = load_data()
acc = simple_model(X, Y)

# layer 1: 25 sparse-filtered features learned from the raw input
X_trans = sfiltering(X, 25)
acc1 = simple_model(X_trans, Y)

# layer 2: 10 features learned on top of the layer-1 features
X_trans1 = sfiltering(X_trans, 10)
acc2 = simple_model(X_trans1, Y)

print "Without sparsefiltering, accuracy = %f "%(acc)
print "One Layer Accuracy, = %f, Increase = %f"%(acc1,acc1-acc)
print "Two Layer Accuracy,  = %f, Increase = %f"%(acc2,acc2-acc1)
At the first layer, we create 25 features; at the second layer, we reduce them to 10. Finally, a (500, 10) matrix is fed to the SVC classifier.
Without sparsefiltering, accuracy = 0.986000 
One Layer Accuracy, = 1.000000, Increase = 0.014000
Two Layer Accuracy,  = 1.000000, Increase = 0.000000
With a single layer of sparse filtering, the accuracy reaches 100%; the second layer is redundant here.
Other implementations are available on the web.

Saturday, August 2, 2014

Unsupervised key phrase extraction from document - RAKE

RAKE

RAKE is an extremely efficient keyword extraction algorithm that operates on individual documents. It is language- and domain-independent.

The literature is abundant with methods that use noun phrase chunks, POS tags, n-gram statistics, and similar features.

Given a document, a stop word list, and a list of phrase delimiters, RAKE extracts candidate phrases.

import nltk
from nltk.corpus import stopwords

phrases = []

for line in sentences:
    words = nltk.word_tokenize(line)
    phrase = ''
    for word in words:
        if word not in stopwords.words('english') and word not in [',', '.', '?', ':', ';']:
            phrase += word + ' '
        else:
            if phrase != '':
                phrases.append(phrase.strip())
                phrase = ''
    # flush any phrase left over at the end of the sentence
    if phrase != '':
        phrases.append(phrase.strip())

We take the stop word list from NLTK and use a list of special characters as phrase delimiters. Every sentence is split into chunks wherever a stop word or phrase delimiter occurs.

The next step is to find the frequency of the individual words in these phrases. Frequency is the number of times the word occurs.

from collections import defaultdict

word_freq = defaultdict(int)
word_degree = defaultdict(int)
word_score = defaultdict(float)

for phrase in phrases:
    words = phrase.split(' ')
    phrase_length = len(words)
    for word in words:
        word_freq[word] += 1
        word_degree[word] += phrase_length

The degree of a word is the sum of the lengths of all the phrases in which the word occurs. Finally, each word is scored as

score(word) = degree(word) / freq(word).

for word,freq in word_freq.items():
    degree = word_degree[word]
    score = ( 1.0 * degree ) / (1.0 * freq )
    word_score[word] = score

Phrase scores are calculated by summing the scores of the individual words in the phrase. Phrases with very large scores are considered keyphrases for the document.
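That last scoring step is not shown above; here is a minimal sketch of it, reusing the phrases list and word_score dictionary built earlier (keeping the top 10 phrases is an arbitrary cut-off):

phrase_score = defaultdict(float)
for phrase in phrases:
    # a phrase's score is the sum of its member word scores
    phrase_score[phrase] = sum(word_score[word] for word in phrase.split(' '))

# the highest-scoring candidate phrases are treated as keyphrases
keyphrases = sorted(phrase_score, key=phrase_score.get, reverse=True)[:10]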

The RAKE algorithm is explained in the book Text Mining: Applications and Theory.

There are several Python implementations available.

The current code is available on GitHub.

Thursday, March 6, 2014

Unsupervised outlier detection for categorical variable

In this article, I will discuss an outlier detection technique called Attribute Value Frequency (AVF) for categorical data. This is part of a series of blogs I intend to write on outlier detection methods. The AVF method is attributed to Anna Koufakou (Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes).

The intuition behind AVF is that outliers tend to occur infrequently in a dataset. Tuples are scored based on the frequency of their individual attribute values, and tuples with low AVF scores are marked as outliers.
Assume X = {x1, x2, x3, ..., xn} is the given dataset with n tuples, each tuple having m attributes,
xi = {xi1, xi2, ..., xim} where 1 <= i <= n.

AVF score(xi) = (1/m) * sum over j = 1..m of f(xij)

Here f(xij) is the frequency count of the value that tuple xi takes for attribute j. If a tuple has rare values for each of its attributes, its overall score will be low due to the summation. Thus, at the end of the exercise, by arranging the tuples in increasing order of AVF score, the top k tuples are identified as outliers.
Given the summation nature of the calculation, the algorithm can be coded up using the map-reduce paradigm. The pseudo-code below implements AVF using map-reduce constructs.

Map(LineNumber, Row)
    for each col in Row
        Emit(colNumber + colValue, 1)

Reduce(key = colNumber + colValue, List)
    Total = sum(all values in List)
    Emit(key, Total)

The output of this Reduce is used in the next map-reduce pass. Assume it is available as a hashmap named CountDict.

Map(LineNumber, Row)
    Sum = 0
    for each col in Row
        Key = colNumber + colValue
        Count = CountDict(Key)
        Sum = Sum + Count
    Emit(Sum, Row)

Reduce(key = Sum, List)
    for Row in List
        Emit(key, Row)
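For reference, here is a small in-memory Python sketch of the same two passes; it is not from the original post, and the helper name avf_outliers and the toy data are my own choices.

from collections import defaultdict

def avf_outliers(rows, k):
    # Pass 1: count how often each (column, value) pair occurs.
    count_dict = defaultdict(int)
    for row in rows:
        for col, value in enumerate(row):
            count_dict[(col, value)] += 1

    # Pass 2: a row's AVF score is the average frequency of its attribute values.
    m = float(len(rows[0]))
    scored = []
    for row in rows:
        score = sum(count_dict[(col, value)] for col, value in enumerate(row)) / m
        scored.append((score, row))

    # Rows with the lowest AVF scores are the outlier candidates.
    scored.sort(key=lambda pair: pair[0])
    return [row for score, row in scored[:k]]

# Example: the last tuple has rare values in both attributes.
data = [('red', 'a'), ('red', 'a'), ('red', 'b'), ('blue', 'a'), ('green', 'c')]
print(avf_outliers(data, 1))   # [('green', 'c')]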

References

1. Anna Koufakou, Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes