As a way of learning new ideas and skills, I resolved to run small experiments.
An interesting one is generating a word cloud in R with RStudio.
To get started generating a word cloud,
we need to install the following packages:
install.packages("tm") #for textmining
install.packages("SnowballC") #for text stemming
install.packages("wordcloud") #word cloud generator
install.packages("RColorBrewer") # import color palettes
Load the installed packages with:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
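If you rerun the script often, reinstalling every package each time is slow. A minimal sketch of one way around that is shown here; the helper name install_if_missing is made up for this post and is not part of any package.

```r
# Hypothetical helper (not from any package): install only the packages
# that cannot be found in the current library paths.
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when the package is available
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # Return the names that had to be installed, without printing them
  invisible(to_install)
}
```

Calling install_if_missing(c("tm", "SnowballC", "wordcloud", "RColorBrewer")) would then only download whatever is not already present.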
I downloaded Theodore Roosevelt’s "Man in the Arena" speech and saved it in a text file. To access the file in R, point to its location and read its contents.
#read file
filePath <- "/home/wordcloud/man_in_arena.txt"
text <- readLines(filePath)
#load data
docs <- Corpus(VectorSource(text))
#inspect contents of the document
inspect(docs)
Next, clean up the contents of the document to reduce noise in the text, starting with special characters that would otherwise stay attached to words.
#helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#replace "@", "|" and "/" with a space
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, toSpace, "/")
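To see what this replacement does without building a corpus, the same gsub call can be tried on a plain string. This is only a sketch: the sample string and the name replace_with_space are made up for illustration.

```r
# Same replacement that toSpace wraps, applied to a plain character vector.
replace_with_space <- function(x, pattern) gsub(pattern, " ", x)

sample_text <- "credit/blame @arena win|lose"  # made-up sample string

cleaned <- replace_with_space(sample_text, "/")
cleaned <- replace_with_space(cleaned, "@")
# "|" means "or" in a regular expression, so it is escaped to match literally
cleaned <- replace_with_space(cleaned, "\\|")

print(cleaned)
```

Each replacement can leave doubled spaces behind, which is why the pipeline strips extra whitespace in a later step.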
Text stemming
Stemming is the process of reducing inflected words to their word stem.
For example, "fishing" has the stem "fish".
Before stemming, normalize the text and remove the words you do not want counted:
#convert to lowercase
docs <- tm_map(docs, content_transformer(tolower))
#remove numbers
docs <- tm_map(docs, removeNumbers)
#remove english common stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
#remove your own undesired stop words (in lowercase, since the text was lowercased above)
docs <- tm_map(docs, removeWords, c("theodore", "we", "shall"))
#remove punctuations
docs <- tm_map(docs, removePunctuation)
#eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
#text stemming
docs <- tm_map(docs, stemDocument)
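The stemming step can also be tried in isolation with wordStem from the SnowballC package loaded earlier. This is a small sketch on a made-up word list, not part of the pipeline above.

```r
library(SnowballC)

# Made-up inflections of the same word
words <- c("fishing", "fished", "fishes", "fish")

# wordStem() reduces each word to its stem using the English Snowball stemmer
stems <- wordStem(words, language = "english")

print(data.frame(word = words, stem = stems))
```

All four words collapse to the same stem, so they are counted as one term when the frequencies are computed.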
Build a term-document matrix
A term-document matrix indexes each word in the text and records how often it appears in the document.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq=v)
head(d, 10)
#sample output
word freq
great 3
actual 2
deed 2
fail 2
strive 2
achieve 1
arena 1
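The idea behind the frequency table can be illustrated in base R on a tiny vector of words. This is a sketch with made-up input; it counts with table() instead of building a term-document matrix, so it mirrors the result rather than the tm internals.

```r
# Made-up, already cleaned words standing in for the speech
words <- c("great", "deed", "great", "arena", "deed", "great")

# Count each distinct word, then sort from most to least frequent
counts <- sort(table(words), decreasing = TRUE)

# Same two-column layout as the data frame d above
toy_d <- data.frame(word = names(counts), freq = as.integer(counts))
print(toy_d)
```

The rowSums call in the tm version does the same job as table() here: it adds up how many times each term occurs across all lines of the document.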
Finally, generate the word cloud with:
set.seed(1200)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words = 200, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(9, "RdPu"))
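If you want to keep the image rather than just view it in the RStudio plot pane, one option is to draw it into a PNG device. The sketch below assumes a toy frequency table in place of the real d, and the output path is a made-up location in the session's temporary directory.

```r
library(wordcloud)
library(RColorBrewer)

# Toy frequency table standing in for the data frame d built earlier
d <- data.frame(word = c("great", "actual", "deed", "fail", "strive"),
                freq = c(3, 2, 2, 2, 2))

# Hypothetical output path; change it to wherever the image should go
out_file <- file.path(tempdir(), "wordcloud.png")

png(out_file, width = 800, height = 800)  # open the PNG graphics device
set.seed(1200)                            # fix the layout so reruns match
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(9, "RdPu"))
dev.off()                                 # close the device to write the file
```

Setting the seed before the wordcloud call matters because word placement is partly random, so the same seed gives the same picture on each run.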