Understanding the Amazon reviews dataset

We use the Amazon product reviews polarity dataset for the various projects in this chapter. It is an open dataset constructed and made available by Xiang Zhang, and it is used as a text classification benchmark in the paper Character-level Convolutional Networks for Text Classification, Xiang Zhang, Junbo Zhao, Yann LeCun, Advances in Neural Information Processing Systems 28 (NIPS 2015).

The Amazon reviews polarity dataset is constructed by taking review scores 1 and 2 as negative, and scores 4 and 5 as positive. Samples with score 3 are ignored. In the dataset, class 1 is negative and class 2 is positive. The dataset has 1,800,000 training samples and 200,000 testing samples.
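The score-to-class mapping can be sketched as follows. Note that this is purely illustrative: the distributed dataset already ships with the class labels applied, and the raw star ratings below are hypothetical values:

```r
# hypothetical raw star ratings (1-5); not part of the distributed dataset
scores <- c(1, 2, 3, 4, 5, 5, 1, 3)
# scores 1 and 2 -> class 1 (negative); scores 4 and 5 -> class 2 (positive)
polarity <- ifelse(scores <= 2, 1, ifelse(scores >= 4, 2, NA))
# samples with score 3 are dropped
polarity <- polarity[!is.na(polarity)]
print(polarity)  # 1 1 2 2 2 1
```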

The train.csv and test.csv files contain all the samples as comma-separated values. There are three columns in them, corresponding to the class index (1 or 2), the review title, and the review text. The review title and text are escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is, "\n".
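Because the fields follow standard CSV quoting conventions, R's read.csv can parse them directly. The following is a minimal sketch; the sample line is illustrative and not taken from the actual file:

```r
# a sample line in the dataset's format (illustrative, not from the real file):
# class index, quoted title, quoted text; inner quotes doubled, newlines as \n
sample_csv <- '2,"Great ""value"" buy","Works as advertised.\\nWould buy again."'
# parse it with read.csv, naming the three columns
parsed <- read.csv(text = sample_csv, header = FALSE,
                   col.names = c("class", "title", "text"),
                   stringsAsFactors = FALSE)
# the doubled quotes are collapsed back to single double quotes
cat(parsed$title, "\n")
```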

To ensure that we can run our projects even with minimal infrastructure, let's restrict our dataset to the first 1,000 records only. Of course, the code we use in the projects can be extended to any number of records, as long as the required hardware infrastructure is available. Let's first read the data and visualize the records with the following code:

# reading first 1000 reviews
reviews_text<-readLines('/home/sunil/Desktop/sentiment_analysis/amazon_reviews_polarity.csv', n = 1000)
# converting the reviews_text character vector to a dataframe
reviews_text<-data.frame(reviews_text)
# visualizing the dataframe
View(reviews_text)

This will result in the following output:

After reading the file, we can see that there is only one column in the dataset, and this column has both the review text and the sentiment components in it. We will slightly modify the format of the dataset so that we can use it with the sentiment analysis projects in this chapter involving the BoW, Word2vec, and GloVe approaches. Let's modify the format of the dataset with the following code:

# separating the sentiment and the review text
# post separation the first column will have the first 4 characters
# second column will have the rest of the characters
# first column should be named "Sentiment"
# second column to be named "SentimentText"
library(tidyr)
reviews_text<-separate(data = reviews_text, col = reviews_text, into = c("Sentiment", "SentimentText"), sep = 4)
# viewing the dataset post the column split
View(reviews_text)

This will result in the following output:

Now we have two columns in our dataset. However, unnecessary punctuation exists in both columns, which may cause problems when processing the dataset further. Let's remove the punctuation with the following code:

# Retaining only alphanumeric values in the sentiment column
reviews_text$Sentiment<-gsub("[^[:alnum:] ]","",reviews_text$Sentiment)
# Retaining only alphanumeric values in the sentiment text
reviews_text$SentimentText<-gsub("[^[:alnum:] ]"," ",reviews_text$SentimentText)
# Replacing multiple spaces in the text with single space
reviews_text$SentimentText<-gsub("(?<=\\s)\\s+|^\\s+|\\s+$", "", reviews_text$SentimentText, perl=TRUE)
# Viewing the dataset
View(reviews_text)
# Writing the output to a file that can be consumed in other projects
write.table(reviews_text,file = "/home/sunil/Desktop/sentiment_analysis/Sentiment Analysis Dataset.csv",row.names = F,col.names = T,sep=',')

This will result in the following output:

From the preceding output, we see that we have a clean dataset that is ready for use. We have also written the output to a file, so when we build the sentiment analyzers, we can read the dataset directly from the Sentiment Analysis Dataset.csv file.
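Reading the cleaned file back then takes a single read.csv call. The following is a minimal round-trip sketch using a temporary file; the actual projects would simply point read.csv at the Sentiment Analysis Dataset.csv path used above:

```r
# illustrative round trip: write a small cleaned data frame, read it back
tmp <- tempfile(fileext = ".csv")
cleaned <- data.frame(Sentiment = c("1", "2"),
                      SentimentText = c("poor build quality", "works great"),
                      stringsAsFactors = FALSE)
# same write.table call shape as used for Sentiment Analysis Dataset.csv
write.table(cleaned, file = tmp, row.names = FALSE, col.names = TRUE, sep = ",")
# reading it back, keeping both columns as character
reviews <- read.csv(tmp, stringsAsFactors = FALSE, colClasses = "character")
print(identical(reviews$SentimentText, cleaned$SentimentText))  # TRUE
```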

The fastText algorithm expects the dataset to be in a different format. The data input to fastText should comply with the following format:

__label__<X>  <Text>

Here, <X> is the class name and <Text> is the actual review text that led to the rating specified by the class. Both the label and the text should be placed on one line, with no quotes. In our case, the classes are __label__1 and __label__2, and there should be only one class per row. Let's convert the data to the format required by the fastText library with the following code block:

# reading the first 1000 reviews from the dataset
reviews_text<-readLines('/home/sunil/Desktop/sentiment_analysis/amazon_reviews_polarity.csv', n = 1000)
# basic EDA to confirm that the data is read correctly
print("EDA PRIOR TO PROCESSING")
print(class(reviews_text))
print(length(reviews_text))
print(head(reviews_text,2))
# replacing the positive sentiment value 2 with __label__2
reviews_text<-gsub("\"2\",","__label__2 ",reviews_text)
# replacing the negative sentiment value 1 with __label__1
reviews_text<-gsub("\"1\",","__label__1 ",reviews_text)
# removing the unnecessary " characters
reviews_text<-gsub("\""," ",reviews_text)
# replacing multiple spaces in the text with single space
reviews_text<-gsub("(?<=\\s)\\s+|^\\s+|\\s+$", "", reviews_text, perl=TRUE)
# Basic EDA post the required processing to confirm input is as desired
print("EDA POST PROCESSING")
print(class(reviews_text))
print(length(reviews_text))
print(head(reviews_text,2))
# writing the revamped file to the directory so we could use it with
# fastText sentiment analyzer project
fileConn<-file("/home/sunil/Desktop/sentiment_analysis/Sentiment Analysis Dataset_ft.txt")
writeLines(reviews_text, fileConn)
close(fileConn)

This will result in the following output:

[1] "EDA PRIOR TO PROCESSING"
[1] "character"
[1] 1000
[1] "\"2\",\"Stuning even for the non-gamer\",\"This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^\""
[2] "\"2\",\"The best soundtrack ever to anything.\",\"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.\""
[1] "EDA POST PROCESSING"
[1] "character"
[1] 1000
[1] "__label__2 Stuning even for the non-gamer , This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^"
[2] "__label__2 The best soundtrack ever to anything. , I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."

From the output of the basic EDA code, we can see that the dataset is in the required format, so we can proceed to the next section, where we implement the sentiment analysis engine using the BoW approach. Alongside the implementation, we will delve into the concept behind the approach and explore the sub-techniques that can be leveraged to obtain better results.
