Wednesday, December 3, 2014

Stock Sentiment Mining in R

In this post we will look at performing sentiment mining on stock news. We will use the R packages tm, tm.plugin.webmining and tm.plugin.tags.
Ensure that packages tm, tm.plugin.tags and tm.plugin.webmining are installed.
install.packages("tm.plugin.tags", repos = "http://datacube.wu.ac.at", type = "source") 
Include the necessary libraries
library(tm)
library(tm.plugin.webmining)
library(tm.plugin.tags)
For a given stock, TCS in our case, we can retrieve the associated news from Google Finance and Yahoo Finance
stock <- "TCS"
Googlecorpus <- WebCorpus(GoogleFinanceSource(stock))
Yahoocorpus <- WebCorpus(YahooFinanceSource(stock))
Create a combined corpus from both the sources
corpus <- c(Googlecorpus,Yahoocorpus)
We extract the heading from every news article associated with the stock TCS, then keep only the headings that contain the word TCS
# Get all the headings
# (with tm 0.6 or later, metadata is read with meta(x, "heading") instead of attr(x, "Heading"))
headings <- sapply(corpus,FUN=function(x){attr(x,"Heading")})
headings.filtered <- headings[grepl(stock,headings)]
We do the same for the description of each news article
# Get all the descriptions
descriptions <- sapply(corpus,FUN=function(x){attr(x,"Description")})
descriptions.filtered <- descriptions[grepl(stock,descriptions)]
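Note that grepl() treats the ticker as a regular expression and matches substrings, so TCS would also match inside a longer token. A minimal sketch of a stricter whole-word filter follows. The helper name filter_by_ticker and the headlines are made up for illustration, and the ticker is assumed to contain only letters and digits:

```r
# Hypothetical helper: keep only the texts that mention the ticker as a whole word.
filter_by_ticker <- function(texts, ticker) {
  # \\b marks a word boundary, so "TCS" does not match inside "TCSX"
  pattern <- paste0("\\b", ticker, "\\b")
  texts[grepl(pattern, texts, perl = TRUE)]
}

# Made-up headlines, for illustration only
toy_headings <- c("TCS wins new contract",
                  "TCSX fund rebalances",
                  "Markets close flat")

filter_by_ticker(toy_headings, "TCS")
```

Only the first headline is kept, because the second one contains TCS only as part of a longer token.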
We get the dictionary of positive and negative words and stem them.
control <- list(stemming=TRUE)
neg <- tm_get_tags("Negativ",control=control)
pos <- tm_get_tags("Positiv",control=control)
Now we define a function to give us a sentiment score
# Sentiment mining
score <- function(text, pos, neg) {
  # build a corpus from the input character vector
  corpu <- Corpus(VectorSource(text))
  termfreq_control <- list(removePunctuation = TRUE,
                           stemming = TRUE, stopwords = TRUE, wordLengths = c(2, 100))

  dtm <- DocumentTermMatrix(corpu, control = termfreq_control)

  # TF-IDF weighted matrix (computed here, but the scores below are taken from dtm)
  tfidf <- weightTfIdf(dtm)

  # identify positive terms
  which_pos <- Terms(dtm) %in% pos

  # identify negative terms
  which_neg <- Terms(dtm) %in% neg

  # number of positive terms in each row
  score_pos <- colSums(as.data.frame(t(as.matrix(dtm[, which_pos]))))

  # number of negative terms in each row
  score_neg <- colSums(as.data.frame(t(as.matrix(dtm[, which_neg]))))

  # polarity: difference of the two scores divided by their sum
  polarity <- (score_pos - score_neg) / (score_pos + score_neg)

  return(polarity)
}
The function takes as input a character vector of texts and sets of positive and negative sentiment words. Let us walk through it. First it creates a document-term matrix from the texts, ignoring punctuation, stopwords and words of length 1. A TF-IDF weighted version of the matrix is also computed, although the scores that follow are taken from the raw counts in dtm.
corpu <- Corpus(VectorSource(text))
termfreq_control <- list(removePunctuation = TRUE, 
stemming=TRUE, stopwords=TRUE, wordLengths=c(2,100)) 

dtm <-DocumentTermMatrix(corpu, control=termfreq_control) 
# TF-IDF weighted matrix (computed here, but the scores below are taken from dtm)
tfidf <- weightTfIdf(dtm) 
It flags which terms of the matrix appear in the positive and negative dictionaries.
# identify positive terms 
which_pos <- Terms(dtm) %in% pos 

# identify negative terms 
which_neg <- Terms(dtm) %in% neg 
The occurrences of positive and negative terms are then summed for each document.
# number of positive terms in each row 
score_pos <- colSums(as.data.frame(t(as.matrix(dtm[, which_pos])))) 

# number of negative terms in each row 
score_neg <- colSums(as.data.frame(t(as.matrix(dtm[, which_neg])))) 
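The transpose-and-colSums idiom above yields one total per document. As a sketch on a hand-made count matrix (the terms, counts and the name toy_dtm are invented for illustration), the same totals can be obtained more directly with rowSums():

```r
# Toy document-term counts: rows are documents, columns are (stemmed) terms
toy_dtm <- matrix(c(2, 0, 1,
                    0, 3, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"),
                                  c("gain", "loss", "profit")))

# Stand-in for the positive dictionary
toy_pos <- c("gain", "profit")
which_pos <- colnames(toy_dtm) %in% toy_pos

# Idiom used in the post: transpose, then sum each column
via_colsums <- colSums(as.data.frame(t(toy_dtm[, which_pos, drop = FALSE])))

# More direct form: sum each row of the selected columns
via_rowsums <- rowSums(toy_dtm[, which_pos, drop = FALSE])
```

Both vectors hold the positive count per document. The drop = FALSE argument keeps the result a matrix even when only one term matches.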
Finally, polarity is the difference between the positive and negative scores divided by their sum, which gives a value between -1 and 1.
polarity <- (score_pos - score_neg) / (score_pos+score_neg)
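As a quick check of the formula, here is a sketch with invented counts. A document with no positive or negative terms gives 0/0, which R evaluates to NaN. The helper name safe_polarity and the choice to return NA in that case are assumptions for illustration, not part of the original function:

```r
# Hypothetical helper: same formula, but NA where no sentiment term was found
safe_polarity <- function(score_pos, score_neg) {
  total <- score_pos + score_neg
  # 0/0 would give NaN, so return NA for documents without sentiment terms
  ifelse(total == 0, NA_real_, (score_pos - score_neg) / total)
}

# Invented counts for three documents
safe_polarity(c(3, 1, 0), c(1, 3, 0))
```

The first document has three positive and one negative term, so its polarity is (3 - 1) / 4 = 0.5. The second is the mirror image at -0.5, and the third has no sentiment terms at all.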
Thus, using this function we can calculate the polarity score of any document corpus.
descriptions.polarity <- score(descriptions.filtered,pos,neg)
This polarity score, when plotted against time, gives an idea of how the stock sentiment has moved. This movement can later be correlated with the change in trading volume over time.
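A minimal sketch of such a plot in base R is shown here. The dates and polarity values are invented; in practice the dates would have to come from the article metadata, which is not covered in this post:

```r
# Invented example values, one row per article
toy <- data.frame(
  date = as.Date(c("2014-12-01", "2014-12-01", "2014-12-02", "2014-12-03")),
  polarity = c(0.5, -0.5, 1, NA)
)

# Mean polarity per day; rows with NA polarity are dropped
# by the default na.action of the formula interface
daily <- aggregate(polarity ~ date, data = toy, FUN = mean)

plot(daily$date, daily$polarity, type = "b",
     xlab = "Date", ylab = "Mean polarity")
```

Averaging per day is only one possible choice. A median or a count-weighted mean could serve equally well, depending on how many articles appear each day.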