March 2017 – No More Hacks

So I heard from someone that they are using R to mine text, to look for sentiment in statements about the market.

I thought I’d give it a try, but instead using the Gutenberg Press text of Jane Eyre.

I used the TextMining package because I found that first, and it got me started, though I haven’t done any of the real work of analysing text (like looking for correlations between words).

But still, got me started.

#text mining

library(tm)

library(wordcloud2)

library(tidyverse)

docs <- Corpus(DirSource(pattern=”text_source*”, ignore.case = TRUE, encoding = “UTF-8”))

# don’t use this, it seems to break everything

# inspect(docs)

# clean the docs

docs <- tm_map(docs, removePunctuation)

# stopwords is a very slow step, avoid running it in demo

docs <- tm_map(docs, removeWords, stopwords(“english”))

docs <- tm_map(docs, removeWords, c(“will”, “now”, “one”, “said”, “like”, “little”))

docs <- tm_map(docs, removeNumbers)

docs <- tm_map(docs, tolower)

dtm <- DocumentTermMatrix(docs)

freq <-colSums( as.matrix(dtm))

ord <- order(freq,decreasing=TRUE)

tops <- freq[head(ord, 1000)]

wf <- tibble(word=labels(tops), count=tops) %>% filter (count < 500)

wordcloud2(data = wf)

Month: March 2017

Text Mining in R