Customized sentiment analysis

As mentioned earlier, sentiment analysis is the process of identifying and extracting sentiment information related to a specified topic, domain, or entity from a set of documents. The sentiment is identified using trained sentiment classifiers, so the quality and the type of the training data have a big impact on a classifier's performance. Most pre-trained classifiers (such as VADER) are trained on general texts because they are designed to be versatile across different topics. Unfortunately, when we need to extract sentiment from very domain-specific textual data, such a general classifier might not perform well. That is why it makes sense to train our own classifier that fits our specific needs, or alternatively to train a general classifier on customized, verified, and known datasets. In short, the degree of adaptation to the domain is what separates good sentiment analysis from better sentiment analysis.

There are many sources of datasets available on the internet free of charge, but if none of them fits our domain, we can also prepare our own.

The preparation of a custom classifier requires two data sets:

  • Training data set: The data on which the classifier algorithm learns the model parameters
  • Test data set: The data used to measure the accuracy of the trained model

There is no strict rule for selecting training and test set sizes, but there is broad agreement among practitioners that 60-80% of the total data should be used for training and the remaining 20-40% for testing.
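As a minimal sketch, such a split can be produced with scikit-learn's train_test_split helper; the texts and labels here are just placeholders:

from sklearn.model_selection import train_test_split

# Placeholder data: each text is paired with a sentiment label.
texts = ['text one', 'text two', 'text three', 'text four', 'text five']
labels = ['pos', 'neg', 'neutral', 'pos', 'neg']

# Hold out 20% of the data for testing (an 80/20 split).
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)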

As this is a supervised learning task, the data (tweets) must be tagged with output categories. In our case, we want to categorize our texts into three classes: positive, neutral, and negative. Thus, each sentence (tweet) should be assigned to one of the classes:

('Kasami vs Palace is the best premier league goal ever by the way', 'pos')
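A small labeled dataset is then simply a collection of such pairs. Apart from the tweet above, the examples below are made up purely for illustration:

tweets = [
    ('Kasami vs Palace is the best premier league goal ever by the way', 'pos'),
    ('what a boring match, complete waste of an afternoon', 'neg'),  # made-up example
    ('kick off is at 3pm today', 'neutral'),  # made-up example
]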

In order to create custom-made classifiers for sentiment analysis, we will use the Python scikit-learn library. The library features many machine learning algorithms, including implementations of regression, classification, and clustering.

Our first classifier will be a simple sentiment analyzer trained on a small dataset of tweets.

To begin, we will import a few elements from the library:

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.naive_bayes import MultinomialNB 

TfidfVectorizer is needed to transform our text data into numerical features usable by the model; that is, each text will be represented as a vector of numbers. Next, we select the type of model for our classifier. Naive Bayes is a simple yet powerful technique, which makes it very popular for prototyping. Based on Bayes' theorem, it assumes that every feature contributes independently to the probability of each class (positive, neutral, and negative in our case). This machine learning technique is often used for simple classification tasks such as spam filtering or document classification, and it is also a very suitable algorithm for our "bag of words" approach to sentiment classification.
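Putting the two together, a minimal training sketch could look as follows, assuming tweets is the list of (text, label) pairs shown earlier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Separate the texts from their sentiment labels.
texts = [text for text, label in tweets]
labels = [label for text, label in tweets]

# Turn each tweet into a TF-IDF-weighted bag-of-words vector.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Fit a multinomial Naive Bayes classifier on the vectors.
classifier = MultinomialNB()
classifier.fit(features, labels)

# Classify a new, unseen tweet.
print(classifier.predict(vectorizer.transform(['what a great goal'])))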

In the next step, we import the modules needed for model evaluation:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score 
from sklearn.model_selection import cross_val_predict 

Evaluating the performance of a model is one of the key stages of the model-building process: it indicates how well a trained model predicts outputs for data it has not seen before.
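As a sketch of how these modules fit together, using the imports above and the classifier, features, and labels from the training step (the dataset must contain at least a few examples of each class for cross-validation to work):

# Predict each tweet's label while it is held out of training
# (3-fold cross-validation; each class needs at least three examples).
predicted = cross_val_predict(classifier, features, labels, cv=3)

# Compare the cross-validated predictions against the true labels.
print(accuracy_score(labels, predicted))
print(confusion_matrix(labels, predicted))
print(classification_report(labels, predicted))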
