Showing posts with label machine learning.

Wednesday, February 4, 2015

My ICML 2014 reading list


1. Discriminative Features via Generalized Eigenvectors
The right representation of the data dictates how well the learning algorithm performs. Deep learning and dictionary learning methods address this issue in domains such as image classification and audio. This paper discusses computationally simple methods for extracting discriminative features from large data sets. The authors claim that
In this work, we explore conceptually and computationally simple ways to create discriminative features that can scale to a large number of examples, even when data is distributed across many machines. Our techniques are not a panacea. They are exploiting simple second order structure in the data and it is very easy to come up with sufficient conditions under which they will not give any advantage over learning using the raw signal. Nevertheless, they empirically work remarkably well. 
Abstract
 Representing examples in a way that is compatible with the underlying classifier can greatly enhance the performance of a learning system. In this paper we investigate scalable techniques for inducing discriminative features by taking advantage of simple second order structure in the data. We focus on multiclass classification and show that features extracted from the generalized eigenvectors of the class conditional second moments lead to classifiers with excellent empirical performance. Moreover, these features have attractive theoretical properties, such as inducing representations that are invariant to linear transformations of the input. We evaluate classifiers built from these features on three different tasks, obtaining state of the art results.
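To make the idea concrete, here is a minimal sketch (my own, not the authors' implementation) of the general idea for a binary problem: build the class-conditional second-moment matrices and take generalized eigenvectors as projection directions. The dataset, the ridge term, and the number of retained directions are all assumptions.

# Hedged sketch: generalized-eigenvector features from class-conditional
# second moments (binary case). Not the authors' code.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

def second_moment(Xc, ridge=1e-3):
    C = Xc.T.dot(Xc) / Xc.shape[0]
    return C + ridge * np.eye(C.shape[0])   # small ridge keeps the matrix positive definite

C0 = second_moment(X[y == 0])
C1 = second_moment(X[y == 1])

# Solve the generalized eigenproblem C1 v = w C0 v; the leading eigenvectors point in
# directions where class 1 carries much more energy than class 0.
w, V = eigh(C1, C0)
top = V[:, np.argsort(w)[-5:]]   # keep the 5 most discriminative directions (choice assumed)

X_feat = X.dot(top)              # induced 5-dimensional representation
print(X_feat.shape)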

2. Coding for Random Projections 
Abstract
 The method of random projections has become popular for large-scale applications in statistical learning, information retrieval, bio-informatics and other applications. Using a well-designed coding scheme for the projected data, which determines the number of bits needed for each projected value and how to allocate these bits, can significantly improve the effectiveness of the algorithm, in storage cost as well as computational speed. In this paper, we study a number of simple coding schemes, focusing on the task of similarity estimation and on an application to training linear classifiers. We demonstrate that uniform quantization outperforms the standard and influential method (Datar et al., 2004), which used a window-and-random offset scheme. Indeed, we argue that in many cases coding with just a small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit coding scheme that generally performs well in practice, as confirmed by our experiments on training linear support vector machines (SVM). Proofs and additional experiments are available at arXiv:1308.2218. In the context of using coded random projections for approximate near neighbor search by building hash tables (arXiv:1403.8144) (Li et al., 2014), we show that the step of random offset in (Datar et al., 2004) is again not needed and may hurt the performance. Furthermore, we show that, unless the target similarity level is high, it usually suffices to use only 1 or 2 bits to code each hashed value for this task. Section 7 presents some experimental results for LSH 
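As a rough illustration of the coding idea (my own sketch, not the paper's scheme or code; the dimensions, bucket width, and bit budget are assumptions), one can quantize each projected value into a small number of uniform buckets and compare codes instead of raw projections.

# Hedged sketch: Gaussian random projections followed by simple uniform
# quantization of each projected value into a few bits.
import numpy as np

np.random.seed(0)
d, k = 1000, 64                        # original and projected dimensions (assumed)
R = np.random.randn(d, k)              # Gaussian random projection matrix

def project_and_code(x, width=0.5, bits=2):
    z = x.dot(R) / np.sqrt(k)                            # projected values
    levels = 2 ** bits
    q = np.floor(z / width) + levels // 2                # uniform buckets of the given width
    return np.clip(q, 0, levels - 1).astype(np.int8)     # each value fits in a few bits

x1 = np.random.randn(d)
x2 = x1 + 0.1 * np.random.randn(d)                       # a similar vector
# The fraction of matching codes serves as a cheap similarity proxy.
print(np.mean(project_and_code(x1) == project_and_code(x2)))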

Tuesday, October 21, 2014

A cluster for Analytics


Analytics Node

Recently I set up a cluster for analytics purposes.

The machine has the following software installed:

  • Java 7
  • Base R
  • R-Studio
  • H2O Cluster

R Studio can be accessed through the browser:

http://<ip address>:8787/

In order to access the H2O cluster, SSH tunnelling has to be enabled from the client. The Install H2O section below has the details of the tunnelling. Currently there are two nodes in the cluster.

Install Java

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

Oracle JDK 7

sudo apt-get install oracle-java7-installer

Install R

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9 
echo "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu precise/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update 
sudo apt-get install r-base

Type R to check installation

Install R Studio Server

64 bit version

$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.994-amd64.deb
$ sudo gdebi rstudio-server-0.98.994-amd64.deb

Check by accessing http://<ip address>:8787/ and use the Linux username and password.

Install H2O

Download H2O

http://s3.amazonaws.com/h2o-release/h2o/rel-kramer/1/index.html?smau_=iVV4RHKSHKjqQMBh

unzip h2o-2.4.6.1.zip

Run H2O

java -Xmx2g -jar h2o.jar
http://localhost:54321/

To access H2O from a remote machine, set up an SSH tunnel on the remote machine:

ssh -L 55555:localhost:54321 tyconet@10.47.86.77

Access from remote machine using

http://localhost:55555

Install H2O cluster

Unzip the H2O zip file on all the nodes of the cluster. Create a nodes.txt file with the IP addresses of the nodes; in this case there are two entries:

<ip address>:54321
<ip address>:54321

Start the H2O cluster on all the nodes:

java -Xmx4g -jar h2o.jar -flatfile nodes.txt -port 54321
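Once the nodes are started, a quick way to confirm that each node's web port responds (a sketch I added, not part of the original setup; the IPs are placeholders and urllib2 assumes Python 2):

# Hedged check: verify that each H2O node's web port answers over HTTP.
import urllib2

nodes = ["<ip address>", "<ip address>"]   # placeholders, same as nodes.txt
for host in nodes:
    try:
        urllib2.urlopen("http://%s:54321/" % host, timeout=5)
        print("%s: H2O web UI reachable" % host)
    except Exception as err:
        print("%s: not reachable (%s)" % (host, err))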

Tuesday, September 2, 2014

Platt scaling - calibrating classifier probabilities

In many real-world cases, it is very important to predict well calibrated probabilities.
In medical domains, doctors are more interested in the probability value associated with a prediction than in a categorical output that only says whether a patient has a certain condition or not.
In document clustering, a ranked affinity of documents to a cluster label is more informative than a hard assignment.
In this two-part series, we explore how to go about calibrating the output probabilities of classifiers. In this part we describe the problem and use reliability charts to visualize the output probabilities. Reliability charts indicate whether there is a need to calibrate the probabilities.
Let us simulate some classification data for this exercise.
from sklearn.datasets import make_classification

def load_data():
    X,Y = make_classification(n_samples = 100,n_features=100, random_state = 100)
    return X,Y
We use a maximum-margin classifier (an SVM in this case) and a Naive Bayes classifier.
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB

def max_margin_model(X,Y):
    mdl = SVC()
    mdl.fit(X,Y)
    return mdl

def nb_model(X,Y):
    mdl = BernoulliNB()
    mdl.fit(X,Y)
    return mdl
SVC does not return actual probability values; it returns the distance from the separating hyperplane. We can scale these values to the range 0 to 1 and treat them as (uncalibrated) probability scores.
We fit the simulated data with these two classifiers and examine the output probabilities.
from sklearn.cross_validation import train_test_split

X,Y = load_data()
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.5)

mdl_nb = nb_model(X_train,Y_train)
predict_prob_nb = mdl_nb.predict_proba(X_test)

# Recode the negative class as -1 before training the SVM.
Y_train[Y_train==0]=-1

mdl_max_margin = max_margin_model(X_train,Y_train)
predict_prob_max_margin = mdl_max_margin.decision_function(X_test)
We need to look at the probability values assigned to the positive examples. Given the predicted probabilities (or distances) and the label vector, the following function returns the predicted scores for the positive examples.
import numpy
from sklearn import preprocessing

def getProbability(distances,Y,scale=False):
    Y_ones = numpy.where(Y == 1)[0]
    distances_ones = distances[Y_ones]
    if scale:
        min_max_scaler = preprocessing.MinMaxScaler()
        distances_ones = min_max_scaler.fit_transform(distances_ones)

    return distances_ones
For the distance output from SVC we set the scale parameter to True, so we get a number between 0 and 1. For Naive Bayes we return the probability values for the positive examples as they are.
We use the above function to get the probability values attributed to the positive examples for both models.
predict_prob_nb = getProbability(predict_prob_nb[:,1],Y_test,False)
predict_prob_max_margin = getProbability(mdl_max_margin.decision_function(X_test),Y_test,True)
We first look at the histogram of the scaled SVC scores:
bins = list(numpy.arange(0,1.1,0.1))
hist,bins = numpy.histogram(predict_prob_max_margin,bins)
>>> hist
array([ 9,  7, 12, 16, 19, 16, 24,  7, 21, 18])
It is interesting to see the spread of the scores for the SVC.
As seen in the output, the SVM pushes the predicted values away from 0 and 1.
With Naive Bayes:
>>> hist,bins = numpy.histogram(predict_prob_nb,bins)
>>> hist
array([ 25,   1,   2,  10,   9,  25,  17,  15,  12, 131])
As expected, Naive Bayes pushes the probability values towards 0 and 1, which is evident from the histogram.
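To make the miscalibration visible without a plot, here is a small reliability-table sketch I added (not from the original post): it bins the predicted scores and compares the mean predicted value in each bin with the observed fraction of positives; a well calibrated model stays close to the diagonal.

# Hedged sketch: a text-only reliability table. scores are predicted
# probabilities for the positive class, y the matching 0/1 labels.
import numpy

def reliability_table(scores, y, n_bins=10):
    edges = numpy.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.sum() == 0:
            continue
        mean_pred = scores[mask].mean()    # average predicted probability in the bin
        frac_pos = float(y[mask].mean())   # observed fraction of positives in the bin
        print("%.1f-%.1f predicted=%.3f observed=%.3f" % (lo, hi, mean_pred, frac_pos))

# e.g. reliability_table(mdl_nb.predict_proba(X_test)[:, 1], Y_test)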

References

  • http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
  • http://www.stat.cmu.edu/~fienberg/Statistics36-756/DegrootFienberg-Statistician-1983.pdf

Monday, August 11, 2014

Unsupervised Feature Learning - Sparse Filtering


Sparse Filtering

An unsupervised feature learning method, introduced by Ngiam et al. (NIPS 2011).
The code is available on my GitHub.
This is a small Python implementation of the Sparse Filtering algorithm; it depends on NumPy and scikit-learn.
As discussed in the paper, we use a soft absolute activation function.
import numpy as np

epsilon = 1e-8   # small smoothing constant (value assumed)

def soft_absolute(v):
    return np.sqrt(v**2 + epsilon)
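For context, here is a minimal sketch of the sparse filtering objective as I read it from the paper (this is not the code in the repository; the optimizer settings and initialization are assumptions, and a real implementation would supply an analytic gradient):

# Hedged sketch of the sparse filtering objective: soft-absolute responses,
# per-feature and per-example L2 normalization, then an L1 sparsity penalty.
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w_flat, X, n_out):
    W = w_flat.reshape(n_out, X.shape[1])
    F = np.sqrt(W.dot(X.T) ** 2 + epsilon)                           # (n_out, n_samples)
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + epsilon)   # normalize each feature row
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True) + epsilon)   # normalize each example column
    return F.sum()                                                   # L1 penalty on the normalized features

def sfiltering_sketch(X, n_out, n_iter=100):
    w0 = 0.01 * np.random.randn(n_out * X.shape[1])
    res = minimize(sparse_filtering_objective, w0, args=(X, n_out),
                   method="L-BFGS-B", options={"maxiter": n_iter})   # numerical gradient: slow but simple
    W = res.x.reshape(n_out, X.shape[1])
    return np.sqrt(W.dot(X.T) ** 2 + epsilon).T                      # transformed data, (n_samples, n_out)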
The input X has dimensions (nsamples, ndimensions). For testing purposes, we use scikit-learn's make_classification function to create 500 samples with 100 features.
from sklearn.datasets import make_classification

def load_data():
    X,Y = make_classification(n_samples = 500,n_features=100)
    return X,Y
We use a simple support vector machine classifier (SVC) for the final classification.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def simple_model(X,Y):
    clf_org_x = SVC()
    clf_org_x.fit(X,Y)
    predict = clf_org_x.predict(X)
    acc=  accuracy_score(Y,predict)
    return acc
We train a two layer network.
X,Y = load_data()
acc = simple_model(X,Y)

X_trans = sfiltering(X,25)

acc1= simple_model(X_trans,Y)

X_trans1 = sfiltering(X_trans,10)

acc2= simple_model(X_trans1,Y)

print "Without sparsefiltering, accuracy = %f "%(acc)
print "One Layer Accuracy, = %f, Increase = %f"%(acc1,acc1-acc)
print "Two Layer Accuracy,  = %f, Increase = %f"%(acc2,acc2-acc1)
At the first layer, we create 25 features. At the second layer we reduce them to 10. Finally a (500,10) X matrix is used by the SVC classifier.
Without sparsefiltering, accuracy = 0.986000 
One Layer Accuracy, = 1.000000, Increase = 0.014000
Two Layer Accuracy,  = 1.000000, Increase = 0.000000
With a single layer of sparse filtering the accuracy reaches 100%; the second layer is redundant here.
Other implementations of sparse filtering are also available on the web.

Monday, June 9, 2014

Multi-label classification using Vowpal Wabbit


Multi-label classification of print media articles to topics.

https://www.kaggle.com/c/wise-2014
https://github.com/subramgo/greekmedia
There are 203 class labels, and each instance can have one or more labels. We convert this problem into one binary classification problem per label so that Vowpal Wabbit can handle it.
The data is in libsvm format; we use scripts/feature_creation.py to convert it to VW format. For example, for an instance such as
103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......
we create 203 VW entries as follows:
....
-1 |LABEL_102 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17

+1 |LABEL_103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17
As seen above, we have added a new feature, NO_WORDS, to count the number of words.
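A rough sketch of this conversion step (this is not the actual scripts/feature_creation.py; the label indexing and the word count are simplifying assumptions):

# Hedged sketch: expand one libsvm line into 203 binary VW lines, one per label.
def libsvm_to_vw_lines(line, n_labels=203):
    parts = line.strip().split()
    labels = set(int(l) for l in parts[0].split(","))   # labels attached to this instance
    features = parts[1:]                                # "index:value" tokens
    no_words = len(features)                            # stand-in for the NO_WORDS feature (assumption)
    vw_lines = []
    for label in range(1, n_labels + 1):
        sign = "+1" if label in labels else "-1"
        vw_lines.append("%s |LABEL_%d %s NO_WORDS:%d"
                        % (sign, label, " ".join(features), no_words))
    return vw_lines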
Using vw, we train as follows:
vw --loss_function hinge  -d data/wise2014-train.vw --binary  -f models/model-09-13-18.bin

number of examples per pass = 13165768
passes used = 1
weighted example sum = 1.31658e+07
weighted label sum = -1.29777e+07
average loss = 0.00458097
best constant = -0.985716
total feature number = 3519146694
For prediction, we run the model in daemon mode:
vw -i models/model-09-13-18.bin --daemon --quiet -t --port 26543
scripts/prediction.py is used to predict the test set.
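For reference, here is a minimal sketch of how such a prediction script can talk to the daemon (this is not the actual scripts/prediction.py; the host, port, and one-connection-per-example approach are assumptions):

# Hedged sketch: send VW-formatted test lines to the vw daemon over TCP
# and read back one prediction per line.
import socket

def vw_predict(vw_lines, host="localhost", port=26543):
    predictions = []
    for line in vw_lines:
        conn = socket.create_connection((host, port))
        conn.sendall(line.strip() + "\n")
        reply = conn.recv(1024).strip()              # e.g. "0.84", optionally followed by a tag
        predictions.append(float(reply.split()[0]))
        conn.close()
    return predictions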
This puts us in 16th place.

Thursday, March 6, 2014

Unsupervised outlier detection for categorical variable

In this article, I will discuss an outlier detection technique called Attribute Value Frequency (AVF) for categorical data. This is part of a series of posts I intend to write on outlier detection methods. The AVF method is due to Anna Koufakou (Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes).

The intuition behind AVF is that outliers tend to occur infrequently in a dataset. Tuples are scored based on the frequency of their individual attribute values, and tuples with low AVF scores are marked as outliers.
Assume X = {x1, x2, ..., xn} is the given dataset with n tuples, each tuple having m attributes: xi = {xi1, xi2, ..., xim}, where 1 <= i <= n.

The AVF score of a tuple is the average frequency of its attribute values:

    AVFScore(xi) = (1/m) * sum_{j=1..m} f(xij)

where f(xij) is the frequency count of the value that tuple xi takes for attribute j. If a tuple has rare values for each of its attributes, the summation makes its overall score low. Thus, at the end of the exercise, by arranging the tuples in increasing order of AVF score, the top k tuples are identified as outliers.
Given the summation nature of the calculation, the algorithm can be implemented using the MapReduce paradigm. The pseudo-code below implements AVF using map-reduce constructs.

Map(LineNumber, Row)
    for each col in Row
        Emit(colNumber + colValue, 1)

Reduce(key (colNumber + colValue), List)
    Total = sum(all values in List)
    Emit(key, Total)

The output of this Reduce is used in the next map-reduce job. Assume it is available as a hashmap named CountDict.

Map(LineNumber, Row)
    Sum = 0
    for each col in Row
        Key = colNumber + colValue
        Count = CountDict(Key)
        Sum = Sum + Count
    Emit(Sum, Row)

Reduce(key (Sum), List)
    for each Row in List
        Emit(key, Row)
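For small datasets that fit on one machine, the same scoring can be written directly. Below is a minimal Python sketch (mine, not from the paper), assuming the data is a list of equal-length tuples of categorical values.

# Hedged sketch: single-machine AVF scoring; the lowest scores are the outliers.
from collections import Counter

def avf_scores(rows):
    m = len(rows[0])
    # Frequency of each value, computed per attribute (column).
    counts = [Counter(row[j] for row in rows) for j in range(m)]
    # AVF score of a tuple = average frequency of its attribute values.
    return [sum(counts[j][row[j]] for j in range(m)) / float(m) for row in rows]

data = [("red", "small"), ("red", "small"), ("red", "small"), ("blue", "large")]
scores = avf_scores(data)
k = 1
outliers = sorted(range(len(data)), key=lambda i: scores[i])[:k]
print(outliers)   # [3] -- the ("blue", "large") tuple has the rarest attribute values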

References


  1.  Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes