Chapter 8. Sentiment Analysis of Twitter Data

 

"He who molds the public sentiment... makes statutes and decisions possible or impossible to make."

 
 --Abraham Lincoln.

What people think matters not only to politicians and celebrities but also to most of us social beings. This need to know opinions about ourselves has affected people for a long time and is aptly summarized by the preceding famous quote. The opinion bug not only affects our own outlook, it affects the way we use products and services as well. As discussed while learning about market basket analysis and recommender engines (see Chapter 3, Predicting Customer Shopping Trends with Market Basket Analysis and Chapter 4, Building a Product Recommendation System respectively), our behavior can be approximated or predicted by observing the behavior of a group of people with similar characteristics such as price sensitivity, color preferences, brand loyalty, and so on. We also discussed in the earlier chapters that, for a long time, we have asked our friends and relatives for their opinions before making that next big purchase. While those opinions are important to us at an individual level, there are far more valuable insights we can derive from such information.

To say that the advent of the World Wide Web has simply accelerated and widened our circle would be an understatement. Without being repetitive, it is worth mentioning that the web has opened new doors for analyzing human behavior.

In the previous chapter, social networks were the object of discussion. We not only used social networks as tools to derive insights but also discussed the fact that these platforms satisfy our inherent curiosity about what others are thinking or doing. Social networks provide us all with a platform where we can voice our opinions and be heard. The "be heard" aspect is a little tricky to define and handle. For instance, our opinions and feedback (assuming they are genuine) about someone or something on these platforms will certainly be heard by the people in our circles (directly or indirectly), but they may or may not be heard by the people or organizations they are intended for. Nevertheless, such opinions or feedback do impact the people connected to them and their behavior from then on. This impact of opinions and our general curiosity about what people think, coupled with more such use cases, is the motivation for this chapter.

In this chapter, we will:

  • Learn about sentiment analysis and its key concepts
  • Look into the applications and challenges presented by sentiment analysis
  • Understand the different approaches to perform opinion mining
  • Apply the concepts of sentiment analysis on Twitter data

Understanding Sentiment Analysis

The fact that Internet-based companies and their CEOs feature as some of the most profitable entities in the global economy says a lot about how the world is being driven by technology and shaped by the Internet. Unlike any other medium, the Internet has become ubiquitous and has penetrated every aspect of our lives. It is no surprise that we are using and relying on the Internet and Internet-based solutions for advice and recommendations, apart from using it for many other purposes.

As we saw in the previous chapters, the relationship between the Internet and domains such as e-commerce and financial institutions runs deep. But our use of and trust in the online world doesn't stop there. Be it booking a table at the new restaurant in your neighborhood or deciding which movie to see tonight, we turn to the Internet to learn what opinions others hold, or what others have to share, before we make the final call. As we will see later, such decision aids are not limited to commerce platforms but also apply to many other domains.

Opinion mining or sentiment analysis (as it is widely and interchangeably known) is the process of automatically identifying the subjectivity in text using natural language processing, text analytics, and computational linguistics. Sentiment analysis aims to identify the positive, negative, or neutral opinion, sentiment, or attitude of the speaker using said techniques. Sentiment analysis (henceforth used interchangeably with opinion mining) finds its application in areas from commerce to service domains across the world.

Key concepts of sentiment analysis

We will now examine the key terms and concepts related to sentiment analysis. These terms and concepts will help us formalize our discussions in the coming sections.

Subjectivity

Opinions or sentiments are one's own expression of views and beliefs. Furthermore, subjectivity (or subjective text) expresses our sentiments about entities such as products, people, governments, and so on. For instance, a subjective sentence could be "I love to use Twitter", which shows a person's love towards a particular social network, while an objective sentence would be "Twitter is a social network". The second example simply states a fact. Sentiment analysis revolves around subjective texts or subjectivity classification. It is also important to understand that not all subjective texts express sentiment. For example, "I just created my Twitter account" is written from a personal point of view yet carries no positive or negative sentiment.

Sentiment polarity

Once we have a piece of text which is subjective in nature (and expresses some sentiment), the next task is to classify it into one of the sentiment classes of positive or negative (sometimes neutral is also considered). The task may also involve placing the text's sentiment on a continuous (or discrete) scale of polarities, thus defining the degree of positivity (or sentiment polarity). The sentiment polarity classification may deal with a different set of classes depending upon the context. For example, in a rating system for movies, sentiment polarities may be defined as liked versus disliked, or in a debate the views may be classified as for versus against.
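To make the idea of a polarity scale concrete, here is a minimal sketch of how a continuous score could be mapped to discrete classes or to a rating. The score range of -1 to 1, the neutral_band threshold, and the function names are assumptions chosen for illustration; they do not come from any particular library.

```python
def polarity_class(score, neutral_band=0.1):
    """Map a continuous polarity score in [-1, 1] to a discrete class.

    neutral_band is a hypothetical threshold: scores within
    +/- neutral_band of zero are treated as neutral.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def star_rating(score, stars=5):
    """Place the same score on a discrete 1..stars scale."""
    # Rescale [-1, 1] to [0, 1], then spread over the available stars.
    unit = (score + 1.0) / 2.0
    return 1 + round(unit * (stars - 1))


print(polarity_class(0.8), star_rating(0.8))
print(polarity_class(0.05), star_rating(0.05))
```

The same score can therefore feed either a two- or three-class scheme or a finer-grained rating, depending on what the context calls for.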

Opinion summarization

Opinion classification or sentiment extraction from a piece of text is an important task in the process of sentiment analysis. This is often followed by a summarization of sentiments. To draw insights or conclusions from different texts related to the same topic (say, reviews of a given movie), it is important to aggregate (or summarize) the sentiments into a consumable form to draw conclusions (whether the movie is a blockbuster or a dud). This may involve the use of visualizations to infer the overall sentiment.
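As a rough illustration of summarization, the following sketch aggregates per-review labels into class shares and an overall verdict. The review labels are made up for the example, and the text bar chart merely stands in for a proper visualization.

```python
from collections import Counter


def summarize(labels):
    """Return the share of each sentiment class and the majority class."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    shares = {label: count / total for label, count in counts.items()}
    majority = counts.most_common(1)[0][0]
    return shares, majority


# Hypothetical per-review labels for a single movie.
reviews = ["positive", "positive", "negative", "positive", "neutral"]
shares, majority = summarize(reviews)
for label, share in sorted(shares.items()):
    # A crude text bar chart standing in for a visualization.
    print(f"{label:<9}{'#' * round(share * 20)} {share:.0%}")
print("overall:", majority)
```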

Feature extraction

As we have seen across the chapters, feature identification and extraction is what makes or breaks a machine learning algorithm. It is the most important factor after the data itself. Let us look at some of the feature sets utilized in solving the problem of sentiment analysis:

  • TF-IDF: Information retrieval makes heavy use of Term Frequency-Inverse Document Frequency (TF-IDF) to weight terms by how characteristic they are of a document. In the context of TF-IDF, a piece of text is represented as a feature vector containing words as its constituents. Research has also shown that, in the context of sentiment analysis, features that record only the presence of a word can give better accuracy than features that record the frequency of the word.

    Note

    Source:

    Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

    TF-IDF is given as: tfidf(t,d,D) = tf(t,d) × idf(t,D)

    Where,

    tf(t,d) is the term frequency of term t in document d.

    idf(t,D) is the inverse document frequency for term t in document set D.

    For example, consider two documents with their terms and their corresponding frequencies:

    [Figure: term frequencies for two sample documents]

    In its simplest form, TF-IDF for the term Twitter can be given as:

    [Figure: TF-IDF calculation for the term Twitter]

    Different weighting schemes can be used for calculating TF-IDF; the preceding example uses a base-10 logarithm to calculate idf.

  • n-Grams: Computational linguistics and probability consider a text corpus as a contiguous sequence of terms, which may be phonemes, letters, words, and so on. The n-gram-based modeling techniques find their roots in information theory, where the likelihood of the next character or word is based upon the n-1 previous terms. Depending upon the value of n, the feature vector or model is termed as unigram (for n=1), bigram (for n=2), trigram (for n=3), and so on. Character-level n-grams are particularly useful with out-of-vocabulary words and approximate matches. For example, considering a sequence of words, a sentence such as "A chapter on sentiment analysis" would have the bigrams "a chapter", "chapter on", "on sentiment", and "sentiment analysis".
  • Parts of Speech (POS): Understanding and making use of the underlying structure of the language for analysis has obvious advantages. Parts of speech are the grammatical categories (nouns, verbs, adjectives, and so on) into which the words of a sentence fall. In the simplest case, adjectives are usually good indicators of subjectivity (not always, though). A number of approaches make use of the polarity of adjectives while classifying subjective texts. Using phrases containing adjectives has been shown to improve performance even further. Research into using other parts of speech, such as verbs and nouns, along with adjectives has also shown positive results.

    Note

    Reference:

    Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424, 2002.

    The following example shows the parts of speech tagged in the sample sentence "We saw the yellow dog":

    We (pronoun) saw (verb) the (determiner) yellow (adjective) dog (noun)

  • Negation: In the case of sentiment analysis, negation plays an important role. For example, sentences such as "I like oranges" and "I don't like oranges" differ only in the word "don't", but the negation term flips the polarity of the sentences to opposite classes (positive and negative respectively). Negation may be used as a secondary feature set where the original feature vector is generated as is, but is flipped in polarity later on based on the negation term. There are other variants to this approach as well, and they show an improvement in the results as compared to approaches which do not take into account the effects of negation.
  • Topic specific features: Topic plays an important role in setting the context. Since sentiment analysis is about the speaker's opinion, the subjectivity is influenced by the topic being discussed. Extensive research has gone into analyzing the effects of and relationship between the topic and sentiment of the text corpus.

Note

Reference:

Tony Mullen and Nigel Collier. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 412–418, July 2004. Poster paper.
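Some of the feature sets discussed above can be sketched in a few lines. The following is a minimal illustration, not a production feature extractor: the whitespace tokenizer, the three sample documents, the NEGATORS set, and the NOT_ prefix convention are all assumptions made for the example. The TF-IDF function follows the definition given earlier, with raw counts for tf and a base-10 logarithm for idf.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(term, doc_tokens, corpus):
    """tf(t, d) * idf(t, D) with raw counts and a base-10 logarithm."""
    tf = Counter(doc_tokens)[term]
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0
    return tf * math.log10(len(corpus) / containing)


NEGATORS = {"not", "no", "never", "don't"}  # hypothetical, far from complete


def mark_negation(tokens):
    """Prefix every token after a negator with NOT_.

    The token list is treated as a single sentence, so the negation
    scope runs to the end of the list (one simple variant among many).
    """
    marked, negated = [], False
    for token in tokens:
        if token in NEGATORS:
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


# Hypothetical three-document corpus.
corpus = [tokenize(d) for d in (
    "I love Twitter",
    "Twitter is a social network",
    "I like oranges",
)]
print(tf_idf("twitter", corpus[0], corpus))   # log10(3 / 2)
print(ngrams(tokenize("A chapter on sentiment analysis"), 2))
print(mark_negation(tokenize("I don't like oranges")))
```

Marking negated tokens in this way lets a downstream classifier treat "like" and "NOT_like" as distinct features, which is one way of capturing the polarity flip described above.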

Approaches

Now that we have a basic understanding of the key concepts from the world of sentiment analysis, let us look at different approaches for tackling this problem.

Sentiment analysis is mostly performed at the following two levels of abstraction:

  • Document level: At this level of abstraction, the task is to analyze a given document to determine whether its overall sentiment is positive or negative (or neutral in certain cases). The basic assumption is that the whole document expresses opinions related to a single entity. For example, given a product review, the system analyzes it to determine whether the review is positive or negative.
  • Sentence level: Sentence level analysis is a more granular form of sentiment analysis. This level of granularity counters the fact that not all sentences in a document are subjective and thus makes better use of subjectivity classification to determine the sentiment on a per sentence basis.
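The difference between the two levels can be seen in a small sketch. The toy LEXICON, the sample review, and the naive sentence splitter below are assumptions made purely for illustration; a sentence with no lexicon hits is simply labeled neutral here, which stands in for a proper subjectivity classifier.

```python
import re

# Hypothetical toy lexicon; real lexicons are far larger.
LEXICON = {"great": 1, "love": 1, "good": 1,
           "poor": -1, "terrible": -1, "slow": -1}


def score(text):
    """Sum the lexicon polarities of the words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def label(value):
    """Turn a numeric score into a class name."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


def document_level(review):
    """One label for the whole review."""
    return label(score(review))


def sentence_level(review):
    """One label per sentence, split naively on '.', '!' and '?'."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]
    return [(s, label(score(s))) for s in sentences]


review = ("The camera is great. The battery is terrible and slow to charge. "
          "I bought it in May.")
print(document_level(review))
for sentence, verdict in sentence_level(review):
    print(f"{verdict:<9}{sentence}")
```

The document-level result hides the fact that the reviewer liked the camera, while the sentence-level result keeps the positive, negative, and purely factual sentences apart.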

Like other machine learning problems, sentiment analysis can be tackled using supervised and unsupervised methods:

  • Supervised Approach: Research on sentiment analysis has been going on for quite some time. Earlier research was constrained by the scarcity of labeled datasets and performed rather shallow analysis. Modern supervised learning approaches have seen a boost, both in the number of systems utilizing these techniques and in the overall performance of such systems, as labeled datasets have become available. Resources such as WordNet, SentiWordNet, SenticNet, newswire, Epinions, and so on enormously assist researchers in improving supervised algorithms by providing polar words, documents classified into categories, user opinions, and so on. Algorithms such as Naïve Bayes, Support Vector Machines (SVM), as discussed in Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, and Maximum-Entropy-based classification algorithms are classic examples of supervised learning approaches.
  • Unsupervised Approach: Unsupervised algorithms for sentiment analysis usually begin with building or learning a sentiment lexicon and then determining the polarity of the text input. Lexicon generation has been done through techniques such as linguistic heuristics, bootstrapping, and so on. One famous approach was detailed by Turney in his 2002 paper where he describes unsupervised sentiment analysis using some fixed syntactic patterns which were based on POS.

    Note

    Reference:

    Linguistic heuristics: Vasileios Hatzivassiloglou and Kathleen McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the Joint ACL/EACL Conference, pages 174–181, 1997.

    Bootstrapping: Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.

    Turney: Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424, 2002.
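To give a feel for the supervised approach, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The class name, the four training sentences, and their labels are hypothetical; a real system would train on a much larger labeled dataset and would normally rely on an established library rather than hand-written code.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the log likelihood of each known word;
            # words never seen in training are skipped.
            log_prob = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocab:
                    log_prob += math.log((counts[word] + 1) / denom)
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


# Hypothetical labeled training data.
texts = [
    "i love this phone",
    "great camera and great screen",
    "i hate the battery",
    "terrible screen and poor sound",
]
labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("i love the camera"))
print(model.predict("poor battery"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word seen in only one class from forcing the other class's probability to zero.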

Applications

As we have been discussing throughout, our reliance upon online opinions is greater than we might expect. We knowingly or unknowingly check these opinions or are influenced by them before buying products, downloading software, selecting apps, or choosing a restaurant. Sentiment analysis or opinion mining finds its application in many areas, which can be summarized into the following broad categories:

  • Online and Offline Commerce: Customer preferences can make or break brands in an instant. For a product to be a hot seller, everything including pricing, packaging, and marketing has to be perfect. Customers form opinions about all aspects related to products, and thus affect their sales. This is not just the case with online commerce, where customers check product reviews on multiple websites or blogs before making the actual purchase; word of mouth and other such factors affect customer opinions in the world of offline commerce as well. Sentiment is thus an important factor which brands or companies track and analyze to stay on top of the game. Analysis of social media content such as tweets, Facebook posts, blogs, and so on provides brands with an insight into how customers perceive their product. In certain cases, brands roll out specific marketing campaigns to set the general sentiment or hype about a product.
  • Governance: In a world where most activities have online counterparts, governments are no exception. There have been projects by various governments across the globe which have made use of sentiment analysis in matters of policy making and security (by analyzing and monitoring any increase in hostile or negative communications). Sentiment analysis has also been used by analysts to predict the outcomes of elections. Tools such as eRuleMaking have sentiment analysis as a key component.

Apart from the aforementioned two categories, opinion mining acts as an augmenting technology in fields such as recommendation engines and general prediction systems. For example, opinion mining may be used in conjunction with recommender engines to exclude products from recommendation lists which have opinions or sentiments below certain thresholds. Sentiment analysis may also find innovative use in predicting whether an upcoming movie will be a blockbuster or not based on sentiments related to the star cast, production house, topic of the movie, and so on.
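The recommender use case can be sketched as a simple filter. The function name, the product identifiers, the sentiment scores, and the policy of keeping products that have no sentiment data are all assumptions for illustration.

```python
def filter_recommendations(candidates, sentiment, threshold=0.0):
    """Drop candidates whose average sentiment falls below threshold.

    candidates: product ids in recommendation order.
    sentiment: product id -> average polarity in [-1, 1].
    Products with no sentiment data are kept; that policy is an assumption.
    """
    return [p for p in candidates if sentiment.get(p, threshold) >= threshold]


# Hypothetical recommender output and average sentiment per product.
recommended = ["A", "B", "C", "D"]
average_sentiment = {"A": 0.6, "B": -0.4, "C": 0.1}
print(filter_recommendations(recommended, average_sentiment))
```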

Challenges

Understanding the opinions and sentiments of others is an inherently difficult task. Handling such a problem algorithmically is equally hard. The following are some of the challenges faced while performing sentiment analysis:

  • Understanding and Modeling Natural Language Constructs: Sentiment analysis is inherently a natural language processing (NLP) problem, albeit a restricted one. Even though it involves only classification into positive, negative, or neutral, it still faces issues such as coreference resolution, word sense disambiguation, and negation handling, to name a few. Advancements in NLP, as well as in sentiment analysis, in recent years have helped in overcoming these issues to a certain extent, yet there is a long way to go before we will be able to model the rules of a natural language perfectly.
  • Sarcasm: Sentiments can be expressed in pretty subtle ways. It is not just negative sentiments; positive sentiments can also be nicely disguised within sarcastic sentences. Since understanding sarcasm is a trick only a few can master, it is not easy to model sarcasm and identify sentiment correctly. For example, the comment "Such a simple to use product, you just need to read 300 pages from the manual" contains no negative words yet has a negative flavor to it which is not easy to model.
  • Review and reviewer quality: Opinions vary from person to person. Some of us may present our opinions very strongly while others may not. Another issue is that everyone has an opinion, whether they know about a subject or not. This creates a problem of review and reviewer quality, which may affect overall analysis. For example, a person who is a casual reader may not be the most apt person to ask for a review of a new book. Similarly, it may not be advisable to get a new author's book reviewed by a critic. Both extremes may result in biased outcomes or incorrect insights.
  • Opinion data size and skew: The web has loads and loads of blogs and sites which provide users with a platform to voice and share opinions on everything possible on and beyond the planet. Still, the availability of opinion data at a granular level is an issue. As we discussed in the previous chapter, the amount of data related to a particular context (say a brand or a person) is often so limited that it affects the overall analysis. Moreover, the data available is sometimes skewed in favor of (or against) entities due to prejudices, incorrect facts, or rumors.