Ensure that packages tm, tm.plugin.tags and tm.plugin.webmining are installed.
install.packages("tm.plugin.tags", repos = "http://datacube.wu.ac.at", type = "source")
Include the necessary librarieslibrary(tm)
library(tm.plugin.webmining)
library(tm.plugin.tags)
For a given stock, let us say TCS in our case, we can retrive the news associated with them from google and yahoostock <- "TCS"
Googlecorpus <- WebCorpus(GoogleFinanceSource(stock))
Yahoocorpus <- WebCorpus(YahooFinanceSource(stock))
Create a combined corpus from both the sourcescorpus <- c(Googlecorpus,Yahoocorpus)
We extract the heading from all the news articles associated with the stock TCS. We filter them to include only those headings which contains the word TCS# Get all the headings
headings <- sapply(corpus,FUN=function(x){attr(x,"Heading")})
headings.filtered <- headings[grepl(stock,headings)]
We do a similar operation for the contents of the news article# Get all the descriptions
descriptions <- sapply(corpus,FUN=function(x){attr(x,"Description")})
descriptions.filtered <- descriptions[grepl(stock,descriptions)]
We get the dictionary of positive and negative words and stem them.control <- list(stemming=TRUE)
neg <- tm_get_tags("Negativ",control=control)
pos <- tm_get_tags("Positiv",control=control)
Now we define a function to give us a sentiment score# Sentiment mining
score <- function(text,pos,neg) {
corpu <- Corpus(VectorSource(text))
termfreq_control <- list(removePunctuation = TRUE,
stemming=TRUE, stopwords=TRUE, wordLengths=c(2,100))
dtm <-DocumentTermMatrix(corpu, control=termfreq_control)
# term frequency matrix
tfidf <- weightTfIdf(dtm)
# identify positive terms
which_pos <- Terms(dtm) %in% pos
# identify negative terms
which_neg <- Terms(dtm) %in% neg
# number of positive terms in each row
score_pos <- colSums(as.data.frame(t(as.matrix(dtm[, which_pos]))))
# number of negative terms in each row
score_neg <- colSums(as.data.frame(t(as.matrix(dtm[, which_neg]))))
polarity <- (score_pos - score_neg) / (score_pos+score_neg)
return(polarity)
}
The function takes as input a corpus and set of positive and negative sentiment words.
Let us decode the function
It creates a term document matrix of the corpus. It ignores punctuations, stopwords and words of length 1. TFIDF score is used in the term document matrixcorpu <- Corpus(VectorSource(text))
termfreq_control <- list(removePunctuation = TRUE,
stemming=TRUE, stopwords=TRUE, wordLengths=c(2,100))
dtm <-DocumentTermMatrix(corpu, control=termfreq_control)
# term frequency matrix
tfidf <- weightTfIdf(dtm)
It creates a list of postive and negative terms present in the input corpus.# identify positive terms
which_pos <- Terms(dtm) %in% pos
# identify negative terms
which_neg <- Terms(dtm) %in% neg
A weighted sum of positive and negative words are calculated.# number of positive terms in each row
score_pos <- colSums(as.data.frame(t(as.matrix(dtm[, which_pos]))))
# number of negative terms in each row
score_neg <- colSums(as.data.frame(t(as.matrix(dtm[, which_neg]))))
Finally polarity is a ration of postive and negative scores.polarity <- (score_pos - score_neg) / (score_pos+score_neg)
Thus using this function we can calucate the polarity score of any document corpus.descriptions.polarity <- score(descriptions.filtered,pos,neg)
This polarity score when plotted against time gives an idea of how the stock sentiment has moved over time. This movement can be later correlated with volume change over time.