Monday, June 9, 2014

Multi-label classification using vowpal wabbit

README

Multi-label classification of print media articles into topics.

6/9/2014 2:13:52 PM
https://www.kaggle.com/c/wise-2014
https://github.com/subramgo/greekmedia
There are 203 class labels, and each instance can have one or more labels. We convert this problem to a binary classification problem so that vowpal wabbit can handle it.
The data is in libsvm format; we use scripts/feature_creation.py to convert it to vw format. For every instance, say
103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......
We create 203 vw entries as follows
....
-1 |LABEL_102 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17

+1 |LABEL_103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17
As seen above, we have added a new feature, NO_WORDS, which counts the number of words.
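The expansion above can be sketched in a few lines of Python. This is not the actual scripts/feature_creation.py, only a minimal illustration under two assumptions: that multiple labels on a libsvm line are comma-separated (for example "103,45"), and that NO_WORDS is the number of feature tokens on the line, as in the prediction script posted in the comments. The function name libsvm_to_vw is made up for this example. Because vw reads the token directly after "|" as a namespace, each LABEL_k line keeps its own set of weights inside the single binary model.

```python
NUM_LABELS = 203  # label ids 1..203, as stated in the post


def libsvm_to_vw(line, num_labels=NUM_LABELS):
    """Expand one libsvm line into num_labels binary vw lines (sketch)."""
    label_field, features = line.rstrip().split(" ", 1)
    # Assumption: multiple labels are comma-separated, e.g. "103,45".
    positives = {int(tok) for tok in label_field.split(",")}
    tokens = features.split()
    # Assumption: NO_WORDS is the number of feature tokens on the line.
    features = " ".join(tokens) + " NO_WORDS:%d" % len(tokens)
    return [
        "%s |LABEL_%d %s" % ("+1" if k in positives else "-1", k, features)
        for k in range(1, num_labels + 1)
    ]


rows = libsvm_to_vw("103 1123:0.003061 1967:0.250931")
print(rows[102])  # the entry for LABEL_103
```

Every document therefore turns into 203 lines, which is why the converted files are so much larger than the libsvm originals.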
We then train the model with vw as follows:
vw --loss_function hinge  -d data/wise2014-train.vw --binary  -f models/model-09-13-18.bin

number of examples per pass = 13165768
passes used = 1
weighted example sum = 1.31658e+07
weighted label sum = -1.29777e+07
average loss = 0.00458097
best constant = -0.985716
total feature number = 3519146694
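As a quick sanity check on this log, the example count should be a multiple of 203, since every document contributes one entry per label. The sketch below does that arithmetic. It also estimates the number of positive entries, assuming every example has unit weight and a label of +1 or -1, so that the weighted example sum is P + N and the weighted label sum is P - N. The log values are rounded, so the second result is approximate.

```python
examples_per_pass = 13165768  # "number of examples per pass" from the log
num_labels = 203

# Each document is expanded into one entry per label.
documents, remainder = divmod(examples_per_pass, num_labels)
print(documents, remainder)

# Assumption: unit weights and +1/-1 labels, so
#   weighted example sum = P + N and weighted label sum = P - N.
weighted_example_sum = 1.31658e+07
weighted_label_sum = -1.29777e+07
positives = (weighted_example_sum + weighted_label_sum) / 2
print(round(positives), round(positives / documents, 2))
```

The division is exact and gives 64,856 training documents. Under the stated assumptions, fewer than 1% of the entries are positive, which works out to roughly 1.45 labels per article.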
For prediction, we run the model in daemon mode:
vw -i models/model-09-13-18.bin --daemon --quiet -t --port 26543
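A daemon started this way accepts newline-terminated examples in vw format over TCP and answers with one prediction line per example. The following is a minimal Python 3 sketch of such a client, not the actual scripts/prediction.py. The function name query_vw_daemon is hypothetical; the default port matches the command above.

```python
import socket


def query_vw_daemon(examples, host="localhost", port=26543):
    """Send vw-format lines to a running daemon; return one float per line."""
    predictions = []
    with socket.create_connection((host, port)) as sock:
        # A text-mode wrapper lets us read the reply one line at a time.
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            for example in examples:
                stream.write(example.rstrip("\n") + "\n")
                stream.flush()
                reply = stream.readline()
                if not reply:
                    raise ConnectionError("daemon closed the connection")
                # The first token of the reply is the prediction.
                predictions.append(float(reply.split()[0]))
    return predictions
```

With the daemon running, a call such as query_vw_daemon(["|LABEL_1 1123:0.003061 NO_WORDS:1"]) would return a one-element list. Sending one line and waiting for its reply is the simplest approach; batching several lines before reading would likely be faster.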
scripts/prediction.py is used to predict the test set.
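Once the 203 per-label outputs for an article are available, they have to be turned into a label set and a submission row. The sketch below follows the logic of the prediction script posted in the comments: keep every label the model marks as positive, and fall back to the single highest-scoring label if none is. The function names are hypothetical. It treats any score above zero as positive, which coincides with the script's test for exactly 1.0 when the outputs are +1 or -1. On ties in the fallback it returns the lowest label id, whereas the script in the comments returns the highest.

```python
def select_labels(scores):
    """Map per-label scores (index 0 is label 1) to a list of label ids."""
    positives = [idx + 1 for idx, score in enumerate(scores) if score > 0]
    if positives:
        return positives
    # No positive label: fall back to the single best-scoring one.
    best = max(range(len(scores)), key=scores.__getitem__)
    return [best + 1]


def format_row(article_id, labels):
    """Build one 'ArticleId,Labels' row with space-separated labels."""
    return "%d,%s" % (article_id, " ".join(str(label) for label in labels))


print(format_row(64858, select_labels([-1.0, 1.0, 1.0])))
```

The fallback guarantees that every article receives at least one label, so no row in the submission file is left empty.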
This puts us in 16th place.

4 comments:

  1. hey,

    Quite an interesting post. I tried running this example. There are two major issues I am facing:

    1. The files converted from libsvm to vw format are huge, around 6.5 GB for test and 29 GB for train. Is that okay?

    2. Please say more about the execution time and your hardware.


    Also, a minor change to the predictions file:




    import socket
    import sys
    import math
    import time

    input_file = sys.argv[1]
    output_file = sys.argv[2]

    # List of labels
    label_list =[str(ii) for ii in range(1,204)]
    hostname = "127.0.0.1"
    port = 26543
    linecount = 64858
    label_header = "LABEL"
    buffer_size = 256
    i = open( input_file )
    o = open( output_file, 'wb' )
    o.write( 'ArticleId,Labels\n' )
    lines_processed = 0
    hostname="localhost"
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((hostname, port))


    """ Netcat equivalent in python """
    def netcat(contents):
        predictions =[]
        for content in contents:
            s.sendall(content)

            while 1:
                data = s.recv(buffer_size)
                if data == "":
                    break
                if data != "":
                    predictions.append(data)
                    break
        return predictions


    def getVWFormat(line):
        y, x = line.split( " ", 1 )
        x=x.rstrip()
        no_words = len(x.split(" "))
        x+=" NO_WORDS:" + str(no_words)

        vw_list = ["|" + label_header + "_" + l + " " + x + '\n' for l in label_list]
        return vw_list


    def getPrediction(line):
        vw_instances = getVWFormat(line)
        prediction_list = []
        instance_count = 0
        prediction_list=netcat(vw_instances)
        prediction_list_float = map(float,prediction_list)
        prediction_labels = [i + 1 for i,x in enumerate(prediction_list_float) if x == 1.0]
        if len(prediction_labels) > 0:
            return prediction_labels
        else:
            p_sorted = sorted(range(len(prediction_list_float)), key=prediction_list_float.__getitem__)
            return [p_sorted[len(p_sorted)-1]+1]


    lis = []
    for line in i:
        p = getPrediction(line.rstrip())
        lines_processed+=1
        if lines_processed%10000 == 0:
            for x in lis :
                o.write(x)
            lis = []
            print "%d lines finished."%(lines_processed)
        p_str = " ".join(str(e) for e in p)
        stri = str(linecount) + "," + p_str + "\n"
        lis.append(stri)
        print lines_processed
        linecount+=1

    for x in lis :
        o.write(x)
    lis = []

    s.shutdown(socket.SHUT_WR)
    s.close()




  2. Yes, it's okay. Since we are changing a multi-label problem into a binary one, we need 203 entries (one per class label) for each document.

    I am running on an 8 GB laptop, and it takes me close to 9 minutes to build the model.

    Yeah, thanks for the script change.

    Replies
    1. Thanks for acknowledging the changes. How do I run this test?

      python predictions.py data/wise2014-test.wv results.csv

      I am generating a vw file for the test set as well and running the classifier on that. Is that the correct way to do it, or should it be run on the libsvm file?

      I am quite new to the domain, and your help would be highly appreciated.
      Thanks

    2. python predictions.py data/wise2014-test.libsvm results.csv

      You need to pass the libsvm-format file to predictions.py. The getVWFormat() function converts it to vowpal wabbit format.
