Wednesday, December 3, 2014

Stock Sentiment Mining in R

In this post we will look at performing sentiment mining on stock news, using the R packages tm, tm.plugin.webmining and tm.plugin.tags.
Ensure that these packages are installed; tm.plugin.tags comes from a separate repository:
install.packages("tm.plugin.tags", repos = "http://datacube.wu.ac.at", type = "source") 
Include the necessary libraries
library(tm)
library(tm.plugin.webmining)
library(tm.plugin.tags)
For a given stock, say TCS in our case, we can retrieve the news associated with it from Google Finance and Yahoo Finance.
stock <- "TCS"
Googlecorpus <- WebCorpus(GoogleFinanceSource(stock))
Yahoocorpus <- WebCorpus(YahooFinanceSource(stock))
Create a combined corpus from both sources.
corpus <- c(Googlecorpus,Yahoocorpus)
We extract the headings from all the news articles associated with the stock TCS, and filter them to include only those headings that contain the word TCS.
# Get all the headings
headings <- sapply(corpus,FUN=function(x){attr(x,"Heading")})
headings.filtered <- headings[grepl(stock,headings)]
We do a similar operation for the contents of the news articles.
# Get all the descriptions
descriptions <- sapply(corpus,FUN=function(x){attr(x,"Description")})
descriptions.filtered <- descriptions[grepl(stock,descriptions)]
We fetch dictionaries of positive and negative words and stem them.
control <- list(stemming=TRUE)
neg <- tm_get_tags("Negativ",control=control)
pos <- tm_get_tags("Positiv",control=control)
Now we define a function to give us a sentiment score
# Sentiment mining
score <- function(text, pos, neg) {
  corpu <- Corpus(VectorSource(text))
  termfreq_control <- list(removePunctuation = TRUE,
                           stemming = TRUE, stopwords = TRUE,
                           wordLengths = c(2, 100))

  dtm <- DocumentTermMatrix(corpu, control = termfreq_control)

  # TF-IDF weighted document-term matrix
  tfidf <- weightTfIdf(dtm)

  # identify positive terms
  which_pos <- Terms(dtm) %in% pos

  # identify negative terms
  which_neg <- Terms(dtm) %in% neg

  # TF-IDF weighted sum of positive terms in each document
  score_pos <- rowSums(as.matrix(tfidf[, which_pos]))

  # TF-IDF weighted sum of negative terms in each document
  score_neg <- rowSums(as.matrix(tfidf[, which_neg]))

  # normalized difference; documents with no sentiment terms give NaN
  polarity <- (score_pos - score_neg) / (score_pos + score_neg)

  return(polarity)
}
The function takes as input a vector of texts and sets of positive and negative sentiment words. Let us decode the function. It first creates a document-term matrix from the texts, ignoring punctuation, stopwords and words of length 1, and then applies TF-IDF weighting to it.
corpu <- Corpus(VectorSource(text))
termfreq_control <- list(removePunctuation = TRUE,
                         stemming = TRUE, stopwords = TRUE,
                         wordLengths = c(2, 100))

dtm <- DocumentTermMatrix(corpu, control = termfreq_control)

# TF-IDF weighted document-term matrix
tfidf <- weightTfIdf(dtm)
It then creates indicator vectors marking which terms of the matrix appear in the positive and negative dictionaries.
# identify positive terms 
which_pos <- Terms(dtm) %in% pos 

# identify negative terms 
which_neg <- Terms(dtm) %in% neg 
A TF-IDF weighted sum of the positive and negative terms is calculated for each document.
# TF-IDF weighted sum of positive terms in each document
score_pos <- rowSums(as.matrix(tfidf[, which_pos]))

# TF-IDF weighted sum of negative terms in each document
score_neg <- rowSums(as.matrix(tfidf[, which_neg]))
Finally, polarity is the normalized difference of the positive and negative scores; it ranges from -1 (entirely negative) to +1 (entirely positive).
polarity <- (score_pos - score_neg) / (score_pos+score_neg)
Thus, using this function we can calculate the polarity score of any document corpus.
descriptions.polarity <- score(descriptions.filtered,pos,neg)
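The same function applies directly to the filtered headings extracted earlier (a natural extension of the code above, not part of the original run):
headings.polarity <- score(headings.filtered,pos,neg)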
This polarity score, when plotted against time, gives an idea of how the sentiment around the stock has moved. That movement can later be correlated with changes in trading volume over the same period.
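As a minimal sketch of such a plot, assuming each document in the WebCorpus exposes a "DateTimeStamp" attribute accessed the same way as Heading and Description above:
# Pair each filtered description with its article timestamp
# (assumes a "DateTimeStamp" attribute on each document)
dates <- do.call(c, lapply(corpus, function(x) attr(x, "DateTimeStamp")))
dates.filtered <- dates[grepl(stock, descriptions)]

# Order by time and plot the polarity series
ord <- order(dates.filtered)
plot(dates.filtered[ord], descriptions.polarity[ord], type = "b",
     xlab = "Date", ylab = "Polarity",
     main = paste(stock, "news sentiment over time"))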

Tuesday, December 2, 2014

Random Forest using scikit learn in IPython


Random Forest Classifier Demo

From the scikit-learn package, import the digits dataset
In [2]:
from sklearn.datasets import load_digits
In [3]:
digits_data = load_digits()
X = digits_data['data']
Y = digits_data['target'] 
In [4]:
X.shape
Out[4]:
(1797, 64)
In [5]:
Y
Y.shape
Out[5]:
(1797,)
In [6]:
import pylab as pl
pl.gray()
pl.matshow(digits_data.images[0])
pl.show()
In [7]:
X[0]
Out[7]:
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])
In [8]:
Y[0]
Out[8]:
0
In [9]:
pl.matshow(digits_data.images[1])
pl.show()
In [10]:
X[1]
Out[10]:
array([  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.,   0.,   0.,   0.,
        11.,  16.,   9.,   0.,   0.,   0.,   0.,   3.,  15.,  16.,   6.,
         0.,   0.,   0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.,   0.,
         0.,   1.,  16.,  16.,   3.,   0.,   0.,   0.,   0.,   1.,  16.,
        16.,   6.,   0.,   0.,   0.,   0.,   1.,  16.,  16.,   6.,   0.,
         0.,   0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.])
In [11]:
Y[1]
Out[11]:
1

From the dataset, let us create Train and Test datasets

In [12]:
from sklearn.cross_validation import train_test_split
In [13]:
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=42)
In [14]:
x_train.shape
Out[14]:
(1437, 64)
In [15]:
x_test.shape
Out[15]:
(360, 64)
In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=1,criterion="entropy")

clf.fit(x_train,y_train)
predictions = clf.predict(x_train)
In [25]:
predictions
Out[25]:
array([6, 0, 0, ..., 2, 7, 1])
In [26]:
print "Train Accuracy = %f "%(accuracy_score(y_train,predictions)*100)

predictions_test = clf.predict(x_test)

print "Test Accuracy = %f "%(accuracy_score(y_test,predictions_test)*100)
Train Accuracy = 92.414753 
Test Accuracy = 79.722222 
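The large gap between train and test accuracy is expected with a single tree (n_estimators=1): the model overfits the training data. As a quick sketch (the value 100 below is an arbitrary choice, not from the original notebook), refitting with more trees typically narrows that gap:
# Averaging many decorrelated trees usually improves generalization
clf = RandomForestClassifier(n_estimators=100, criterion="entropy")
clf.fit(x_train, y_train)

print "Train Accuracy = %f " % (accuracy_score(y_train, clf.predict(x_train)) * 100)
print "Test Accuracy = %f " % (accuracy_score(y_test, clf.predict(x_test)) * 100)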
