Multi-label classification of print media articles into topics.

6/9/2014 2:13:52 PM
https://www.kaggle.com/c/wise-2014
https://github.com/subramgo/greekmedia
There are 203 class labels, and each instance can have one or more labels. We convert this multi-label problem into binary classification problems, one per label, so that Vowpal Wabbit (vw) can handle it.
The data is in libsvm format; we use scripts/feature_creation.py to convert it to vw format. For every instance, for example

103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......

we create 203 vw entries, one per label, as follows:

....
-1 |LABEL_102 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17

+1 |LABEL_103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17

As seen above, we have added a new feature, NO_WORDS, which holds the number of words in the instance.
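The expansion described above can be sketched as a small Python function. This is not the actual scripts/feature_creation.py; the function name expand_to_vw is hypothetical, and the sketch assumes that multiple labels in the libsvm label field are separated by commas and that NO_WORDS is simply the number of feature tokens on the line.

```python
def expand_to_vw(libsvm_line, n_labels=203, label_header="LABEL"):
    """Turn one libsvm line into n_labels binary vw lines (one per label).

    Assumption: the label field looks like "103" or "5,7" (comma-separated).
    """
    label_field, features = libsvm_line.rstrip().split(" ", 1)
    positives = {int(label) for label in label_field.split(",")}

    # Extra feature: number of feature tokens in this instance.
    no_words = len(features.split(" "))
    features += " NO_WORDS:%d" % no_words

    vw_lines = []
    for k in range(1, n_labels + 1):
        target = "+1" if k in positives else "-1"
        # The label index becomes the vw namespace, e.g. |LABEL_103
        vw_lines.append("%s |%s_%d %s" % (target, label_header, k, features))
    return vw_lines


example = "103 1123:0.003061 1967:0.250931 3039:0.074709"
for vw_line in expand_to_vw(example)[101:103]:
    print(vw_line)
```

Putting the label index in the namespace lets a single binary model serve all 203 labels, at the cost of writing every document 203 times.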
Using vw, we train the model as follows:

vw --loss_function hinge  -d data/wise2014-train.vw --binary  -f models/model-09-13-18.bin

number of examples per pass = 13165768
passes used = 1
weighted example sum = 1.31658e+07
weighted label sum = -1.29777e+07
average loss = 0.00458097
best constant = -0.985716
total feature number = 3519146694
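As a sanity check, the figures in the training log above can be related back to the one-entry-per-label expansion. The following is a rough sketch; the variable names are illustrative, and the two weighted sums are the rounded values printed by vw, so the derived figures are approximate.

```python
# Values copied from the vw training log above.
examples_per_pass = 13165768
n_labels = 203
weighted_example_sum = 1.31658e+07   # rounded in the log
weighted_label_sum = -1.29777e+07    # rounded in the log

# Each document was expanded into one example per label.
documents, remainder = divmod(examples_per_pass, n_labels)

# With labels in {-1, +1}:
#   positives + negatives = example sum
#   positives - negatives = label sum
positives = (weighted_example_sum + weighted_label_sum) / 2
labels_per_document = positives / documents

# Ratio of label sum to example sum, for comparison with "best constant".
mean_label = weighted_label_sum / weighted_example_sum

print("documents:", documents, "remainder:", remainder)
print("approx. positive examples:", round(positives))
print("approx. labels per document:", round(labels_per_document, 2))
print("mean label:", round(mean_label, 4))
```

The division is exact, giving 64,856 training documents, and the mean label lands close to the reported best constant of -0.985716. Only a small fraction of the expanded examples are positive, roughly 1.45 labels per document.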

For prediction, we run the model in daemon mode:

vw -i models/model-09-13-18.bin --daemon --quiet -t --port 26543

We use scripts/prediction.py to predict labels for the test set.
This puts us in 16th place.
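For readers who want to talk to the daemon directly, here is a minimal sketch of a client. It is not the actual scripts/prediction.py. The function name query_vw_daemon is hypothetical, and the sketch assumes the daemon answers each example with one text line whose first token is the prediction.

```python
import socket


def query_vw_daemon(sock, vw_lines):
    """Send unlabeled vw examples over a connected socket.

    Returns one float prediction per example. Assumes one reply line
    per example, with the prediction as the first token on the line.
    """
    predictions = []
    # A file-like reader gives whole lines even if TCP splits or merges packets.
    with sock.makefile("r") as reader:
        for line in vw_lines:
            sock.sendall((line.rstrip("\n") + "\n").encode())
            reply = reader.readline()
            if not reply:
                raise ConnectionError("daemon closed the connection")
            predictions.append(float(reply.split()[0]))
    return predictions


# Example use against the daemon started above (port taken from the post):
#
#   sock = socket.create_connection(("localhost", 26543))
#   scores = query_vw_daemon(sock, ["|LABEL_103 1123:0.003061 NO_WORDS:1"])
#   sock.close()
```

Reading replies line by line, rather than with a fixed-size recv, avoids depending on how the replies happen to be chunked on the wire.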

4 comments:

Unknown, June 16, 2014 at 11:43 PM
Hey,

Quite an interesting post. I tried running this example, and I have two main questions:

1. The files converted from libsvm to vw format are huge, around 6.5 GB for test and 29 GB for train. Is that okay?

2. Please say more about the execution time and your hardware.

Also, here is a minor change to the prediction script:

import socket
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

# List of labels: "1" .. "203"
label_list = [str(ii) for ii in range(1, 204)]
hostname = "localhost"
port = 26543        # must match the --port given to the vw daemon
linecount = 64858   # first ArticleId written to the output file
label_header = "LABEL"
buffer_size = 256
i = open(input_file)
o = open(output_file, 'w')
o.write('ArticleId,Labels\n')
lines_processed = 0
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((hostname, port))

def netcat(contents):
    """Netcat equivalent in Python: send each example, read back one reply."""
    predictions = []
    for content in contents:
        s.sendall(content.encode())
        # Assumes the reply to one example arrives in a single recv() call.
        data = s.recv(buffer_size)
        if data:
            predictions.append(data.decode())
    return predictions

def getVWFormat(line):
    """Build one unlabeled vw example per label from a libsvm line."""
    y, x = line.split(" ", 1)
    x = x.rstrip()
    no_words = len(x.split(" "))
    x += " NO_WORDS:" + str(no_words)

    vw_list = ["|" + label_header + "_" + l + " " + x + '\n' for l in label_list]
    return vw_list

def getPrediction(line):
    vw_instances = getVWFormat(line)
    prediction_list = netcat(vw_instances)
    prediction_list_float = list(map(float, prediction_list))
    # Labels are 1-based; keep every label predicted as +1.
    prediction_labels = [idx + 1 for idx, x in enumerate(prediction_list_float) if x == 1.0]
    if len(prediction_labels) > 0:
        return prediction_labels
    else:
        # No positive prediction: fall back to the label ranked last after sorting.
        p_sorted = sorted(range(len(prediction_list_float)), key=prediction_list_float.__getitem__)
        return [p_sorted[len(p_sorted)-1]+1]

lis = []
for line in i:
    p = getPrediction(line.rstrip())
    lines_processed += 1
    # Flush buffered output every 10000 lines.
    if lines_processed % 10000 == 0:
        for x in lis:
            o.write(x)
        lis = []
        print("%d lines finished." % (lines_processed))
    p_str = " ".join(str(e) for e in p)
    stri = str(linecount) + "," + p_str + "\n"
    lis.append(stri)
    print(lines_processed)
    linecount += 1

# Write whatever is still buffered.
for x in lis:
    o.write(x)
lis = []

s.shutdown(socket.SHUT_WR)
s.close()
i.close()
o.close()

Gopi, June 19, 2014 at 7:55 AM
Yes, it's okay. Since we are changing a multi-label problem into a binary one, we need 203 entries (one per class label) for each document.

I am running on an 8 GB laptop, and it takes close to 9 minutes to build the model.

Yeah, thanks for the script change.