Multi-label classification of print media articles to topics.
6/9/2014 2:13:52 PM
https://www.kaggle.com/c/wise-2014
https://github.com/subramgo/greekmedia
There are 203 class labels, and each instance can have one or more labels. We convert this problem to a binary classification problem so that Vowpal Wabbit can handle it.
The data is in libsvm format; we use scripts/feature_creation.py to convert it to vw format. For every instance, say
103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......
we create 203 vw entries, one per label, marked +1 if the instance carries that label and -1 otherwise:
-1 |LABEL_102 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17
+1 |LABEL_103 1123:0.003061 1967:0.250931 3039:0.074709 20411:0.025801 24432:0.229228 38215:0.081586 41700:0.139233 46004:0.007150 54301:0.074447 .......NO_WORDS:17
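The expansion from one libsvm line to 203 binary entries can be sketched in a few lines of Python 3. This is only an illustration, not the contents of scripts/feature_creation.py: the function name `to_vw_entries` is hypothetical, multiple labels are assumed to be comma-separated in the first field, and NO_WORDS is assumed to be the number of feature tokens on the line.

```python
NUM_LABELS = 203  # number of class labels in the competition data


def to_vw_entries(libsvm_line, num_labels=NUM_LABELS):
    """Expand one libsvm line into one binary vw example per label."""
    label_part, features = libsvm_line.rstrip().split(" ", 1)
    # Assumption: an instance with several labels lists them as "12,103".
    positives = {int(tok) for tok in label_part.split(",")}
    # Assumption: NO_WORDS counts the feature tokens on the line.
    no_words = len(features.split())
    entries = []
    for label in range(1, num_labels + 1):
        target = "+1" if label in positives else "-1"
        entries.append(
            f"{target} |LABEL_{label} {features} NO_WORDS:{no_words}"
        )
    return entries


entries = to_vw_entries("103 1123:0.003061 1967:0.250931")
print(entries[101])  # the entry for label 102
print(entries[102])  # the entry for label 103
```

Each label gets its own namespace (LABEL_1 to LABEL_203), so a single binary model can learn separate weights for the same word under different labels.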
As seen above, we have added a new feature, NO_WORDS, to count the number of words. Using vw, we train the model as follows:
vw --loss_function hinge -d data/wise2014-train.vw --binary -f models/model-09-13-18.bin
number of examples per pass = 13165768
passes used = 1
weighted example sum = 1.31658e+07
weighted label sum = -1.29777e+07
average loss = 0.00458097
best constant = -0.985716
total feature number = 3519146694
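The figures in this output can be sanity-checked with a little arithmetic. The sketch below assumes every example has weight 1 and every label is exactly -1 or +1. Because the label sum is rounded in the log, the derived counts are approximate.

```python
# Figures copied from the vw training output above.
NUM_LABELS = 203
examples = 13165768
weighted_label_sum = -1.29777e7  # rounded in the log

# Each document was expanded into one example per label.
documents, remainder = divmod(examples, NUM_LABELS)

# With labels in {-1, +1} and unit weights:
#   positives - negatives = label sum
#   positives + negatives = number of examples
positives = (examples + weighted_label_sum) / 2
labels_per_document = positives / documents

print(documents, remainder)
print(round(positives), round(labels_per_document, 2))
```

The example count divides evenly into 64,856 documents, with roughly 1.45 positive labels per document. That is why the label sum is almost as large in magnitude as the example count: nearly every generated entry is a negative.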
For prediction, we run the model in daemon mode:
vw -i models/model-09-13-18.bin --daemon --quiet -t --port 26543
scripts/prediction.py is used to predict the test set. This puts us in 16th place.
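A client talks to the daemon over a plain TCP socket: it sends one vw-format line and reads back one line with the prediction. Below is a minimal Python 3 sketch of that exchange. It assumes the daemon is listening on localhost:26543 and answers each example with a single newline-terminated line. The helper names `connect` and `query_vw` are hypothetical and are not taken from scripts/prediction.py.

```python
import socket


def connect(host="localhost", port=26543):
    """Open a TCP connection and wrap it in a buffered text stream."""
    sock = socket.create_connection((host, port))
    return sock, sock.makefile("rw", encoding="utf-8", newline="\n")


def query_vw(stream, example):
    """Send one unlabeled vw example and return the prediction as a float."""
    stream.write(example.rstrip("\n") + "\n")
    stream.flush()
    reply = stream.readline()  # reads up to the newline, however TCP splits it
    if not reply:
        raise ConnectionError("daemon closed the connection")
    # Assumption: the first whitespace-separated token is the prediction.
    return float(reply.split()[0])


# Example use (requires a running daemon, so it is left commented out):
# sock, stream = connect()
# print(query_vw(stream, "|LABEL_103 1123:0.003061 NO_WORDS:1"))
# sock.close()
```

Reading a whole line through the file wrapper avoids depending on how many bytes a single `recv` call happens to return.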
Hey,
Quite an interesting post. I tried running this example. There are two major issues I am facing:
1. The size of the files converted from libsvm to vw is huge: around 6.5 GB for test and 29 GB for train. Is that okay?
2. Please mention more about execution time and your hardware.
Also, a minor change in the predictions file:
# Python 2 script: sends each test instance to the vw daemon, once per label.
import socket
import sys
import math
import time

input_file = sys.argv[1]
output_file = sys.argv[2]

# List of labels
label_list = [str(ii) for ii in range(1, 204)]
hostname = "127.0.0.1"
port = 26543
linecount = 64858  # ArticleId written for the first test line
label_header = "LABEL"
buffer_size = 256

i = open(input_file)
o = open(output_file, 'wb')
o.write('ArticleId,Labels\n')
lines_processed = 0

hostname = "localhost"
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((hostname, port))


def netcat(contents):
    """ Netcat equivalent in python """
    predictions = []
    for content in contents:
        s.sendall(content)
        while 1:
            data = s.recv(buffer_size)
            if data == "":
                break
            if data != "":
                predictions.append(data)
                break
    return predictions


def getVWFormat(line):
    # Drop the libsvm label field and build one vw example per label.
    y, x = line.split(" ", 1)
    x = x.rstrip()
    no_words = len(x.split(" "))
    x += " NO_WORDS:" + str(no_words)
    vw_list = ["|" + label_header + "_" + l + " " + x + '\n' for l in label_list]
    return vw_list


def getPrediction(line):
    vw_instances = getVWFormat(line)
    prediction_list = []
    instance_count = 0
    prediction_list = netcat(vw_instances)
    prediction_list_float = map(float, prediction_list)
    # Keep every label predicted as +1.
    prediction_labels = [i + 1 for i, x in enumerate(prediction_list_float) if x == 1.0]
    if len(prediction_labels) > 0:
        return prediction_labels
    else:
        # No label predicted: fall back to the highest-scoring one.
        p_sorted = sorted(range(len(prediction_list_float)), key=prediction_list_float.__getitem__)
        return [p_sorted[len(p_sorted)-1]+1]


lis = []
for line in i:
    p = getPrediction(line.rstrip())
    lines_processed += 1
    # Flush buffered output rows every 10000 lines.
    if lines_processed % 10000 == 0:
        for x in lis:
            o.write(x)
        lis = []
        print "%d lines finished." % (lines_processed)
    p_str = " ".join(str(e) for e in p)
    stri = str(linecount) + "," + p_str + "\n"
    lis.append(stri)
    print lines_processed
    linecount += 1

# Write the remaining rows.
for x in lis:
    o.write(x)
lis = []

s.shutdown(socket.SHUT_WR)
s.close()
Yes, it's okay. Since we are converting a multi-label problem to a binary one, we need 203 entries (one per class label) for each document.
I am running on an 8 GB laptop, and it takes me close to 9 minutes to build the model.
Yeah, thanks for the script change.
Thanks for acknowledging the changes. How do I run this test?
python predictions.py data/wise2014-test.wv results.csv
I am generating a vw file for the test set as well and running the classifier on that. Is that the correct way to do things, or should it be run on the libsvm file?
I am quite new to the domain, and your help would be highly appreciated.
Thanks
python predictions.py data/wise2014-test.libsvm results.csv
You need to give the libsvm-format file to predictions.py. The function getVWFormat() will convert it to Vowpal Wabbit format.