Text pre-processing

Before we build our model, we need to prepare the data so it can be fed to the model. We want a feature vector and a class label. In our case, the class label can take two values, positive or negative, depending on whether the sentence has a positive or a negative sentiment. Words are our features. We will use the bag-of-words model to represent our text as features. In a bag-of-words model, the following steps are performed to transform a text into a feature vector:

  1. Extract all unique individual words from the text dataset. We call a text dataset a corpus.
  2. Process the words. Processing typically involves removing numbers and other non-letter characters, converting the words to lowercase, stemming the words, and removing unnecessary white space.
  3. Each word is assigned a unique number; together, they form the vocabulary. A special word, unknown, is added to the vocabulary to stand in for words we have not seen but may encounter in future datasets.
  4. Finally, a document term matrix is created. The rows of this matrix are the document IDs, and the columns are the words from the vocabulary.

Consider this simple example:

  • d1: Cats hate dogs
  • d2: Dogs chase cats

The binary document term matrix is now as follows:

     chase  cats  hate  dogs
d1   0      1     1     1
d2   1      1     0     1
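
To make this concrete, here is a minimal sketch that rebuilds the matrix above with the tm R package (formally introduced below); weightBin is tm's binary weighting function, and the output shows the same binary matrix with the columns in alphabetical order and the documents numbered 1 and 2:

library(tm)

# Minimal sketch: rebuild the d1/d2 example as a binary document term matrix
docs <- c("Cats hate dogs", "Dogs chase cats")
example.corpus <- VCorpus(VectorSource(docs))
example.corpus <- tm_map(example.corpus, content_transformer(tolower))
example.dtm <- DocumentTermMatrix(example.corpus,
                                  control = list(weighting = weightBin))
inspect(example.dtm)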

We need to pre-process our tweets in a similar manner. We will use the tm R package to pre-process our Twitter text.

Let us proceed to use the tm package to create a document term matrix:

library(tm)

get.dtm <- function(text.col, id.col, input.df, weighting){
  # Map the data frame columns to document content and document ID
  title.reader <- readTabular(mapping = list(content = text.col, id = id.col))
  corpus <- Corpus(DataframeSource(input.df),
                   readerControl = list(reader = title.reader))
  # Normalize the text. Lowercasing comes first so that the
  # (lowercase) stop word list also matches capitalized words.
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  # Build the document term matrix with the requested weighting function
  dtm <- DocumentTermMatrix(corpus, control = list(weighting = weighting))
  return(dtm)
}

> dtm <- get.dtm('text', 'id', tweet.final, weightTfIdf)
> dtm.mat <- as.matrix(dtm)
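
A couple of quick, illustrative checks on the resulting matrix:

dim(dtm.mat)              # number of documents x size of vocabulary
head(colnames(dtm.mat))   # a few of the terms in the vocabulary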

The get.dtm function creates a document term matrix from the given data frame. Before building the actual document term matrix, it pre-processes/normalizes the text: converting it to lowercase, removing punctuation, removing numbers, removing English stop words, and finally stripping unwanted white space.

The following is a list of English stop words stored in the tm package:

> stopwords("english")
[1] "i" "me" "my" "myself" "we" "our" "ours"
[8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
[15] "him" "his" "himself" "she" "her" "hers" "herself"
[22] "it" "its" "itself" "they" "them" "their" "theirs"
[29] "themselves" "what" "which" "who" "whom" "this" "that"
......

The output is truncated; we have not shown all of the stop words. In text mining, stop words are words that contribute little to the context of a text. Say we have two documents: stop words are of no help in finding what uniquely distinguishes one from the other, since they occur frequently in almost every document. Hence, we typically remove these words from our text as a pre-processing step.
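
The default list can also be extended when the domain calls for it. As a small, hypothetical sketch (the extra terms rt and via are our own additions, not part of the original pipeline), the removeWords step inside get.dtm could use a custom list:

# Hypothetical: extend the default English stop words with Twitter artifacts
my.stopwords <- c(stopwords("english"), "rt", "via")
corpus <- tm_map(corpus, removeWords, my.stopwords)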

Let us look at the document term matrix:

> dtm
<<DocumentTermMatrix (documents: 62, terms: 246)>>
Non-/sparse entries: 677/14575
Sparsity : 96%
Maximal term length: 17

We have 62 documents and 246 terms. Document term matrices are typically sparse, as most words do not appear in most documents. This brings us to the weighting scheme. When we explained the document term matrix, we used a binary weighting scheme: a one was placed in a cell if the word was present in the document, and a zero if it was not. We can use a different weighting scheme called term frequency-inverse document frequency (TF-IDF).
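
Under TF-IDF, a term's weight grows with its frequency inside a document and shrinks with the number of documents it appears in, roughly tf(t, d) * log(N / df(t)) over N documents, so words common to many documents are down-weighted. Since get.dtm takes the weighting as a parameter, the same matrix can be built under different schemes; a minimal sketch using the weighting functions that ship with tm:

# Build the same document term matrix under different weighting schemes
dtm.bin   <- get.dtm('text', 'id', tweet.final, weightBin)    # binary presence/absence
dtm.tf    <- get.dtm('text', 'id', tweet.final, weightTf)     # raw term frequency
dtm.tfidf <- get.dtm('text', 'id', tweet.final, weightTfIdf)  # TF-IDF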
