"He who molds the public sentiment... makes statutes and decisions possible or impossible to make." | ||
--Abraham Lincoln. |
What people think matters not only to politicians and celebrities but also to most of us social beings. This need to know opinions about ourselves has affected people for a long time and is aptly summarized by the preceding famous quote. The opinion bug not only affects our own outlook, it affects the way we use products and services as well. As discussed while learning about market basket analysis and recommender engines (see Chapter 3, Predicting Customer Shopping Trends with Market Basket Analysis and Chapter 4, Building a Product Recommendation System respectively), our behavior can be approximated or predicted by observing the behavior of a group of people with similar characteristics such as price sensitivity, color preferences, brand loyalty, and so on. We also discussed in the earlier chapters that, for a long time, we have asked our friends and relatives for their opinions before making that next big purchase. While those opinions are important to us at an individual level, there are far more valuable insights we can derive from such information.
To say that the advent of the World Wide Web has simply accelerated and widened our circle would be an understatement. Without being repetitive, it is worth mentioning that the web has opened new doors for analyzing human behavior.
In the previous chapter, social networks were the object of discussion. We not only used social networks as tools to derive insights but also discussed the fact that these platforms satisfy our inherent curiosity about what others are thinking or doing. Social networks provide us all with a platform where we can voice our opinions and be heard. The "be heard" aspect is a little tricky to define and handle. For instance, our opinions and feedback (assuming they are genuine) about someone or something on these platforms will certainly be heard by the people in our circles (directly or indirectly), but they may or may not be heard by the people or organizations they are intended for. Nevertheless, such opinions or feedback do impact the people connected to them and their behavior from then on. This impact of opinions and our general curiosity about what people think, coupled with more such use cases, is the motivation for this chapter.
In this chapter, we will:
The fact that Internet-based companies and their CEOs feature as some of the most profitable entities in the global economy says a lot about how the world is being driven by technology and shaped by the Internet. Unlike any other medium, the Internet has become ubiquitous and has penetrated every aspect of our lives. It is no surprise that we are using and relying on the Internet and Internet-based solutions for advice and recommendations, apart from using it for many other purposes.
As we saw in the previous chapters, the relationship between the Internet and domains such as e-commerce and financial institutions runs deep. But our use of and trust in the online world doesn't stop there. Whether we are booking a table at the new restaurant in the neighborhood or deciding which movie to see tonight, we turn to the Internet to learn what others think, or what they have to share, before we make the final call. As we will see later, such decision aids are not limited to commerce platforms but also apply to many other domains.
Opinion mining or sentiment analysis (as it is widely and interchangeably known) is the process of automatically identifying the subjectivity in text using natural language processing, text analytics, and computational linguistics. Sentiment analysis aims to identify the positive, negative, or neutral opinion, sentiment, or attitude of the speaker using said techniques. Sentiment analysis (henceforth used interchangeably with opinion mining) finds its application in areas from commerce to service domains across the world.
We will now examine the key terms and concepts related to sentiment analysis. These terms and concepts will help us formalize our discussions in the coming sections.
Opinions or sentiments are one's own expression of views and beliefs. Furthermore, subjectivity (or subjective text) expresses our sentiments about entities such as products, people, governments, and so on. For instance, a subjective sentence could be "I love to use Twitter", which shows a person's love towards a particular social network, while an objective sentence would be "Twitter is a social network", which simply states a fact. Sentiment analysis revolves around subjective texts or subjectivity classification. It is also important to understand that not all subjective texts express sentiment; for example, "I just created my Twitter account" does not convey a positive or negative opinion.
Once we have a piece of text which is subjective in nature (and expresses some sentiment), the next task is to classify it into one of the sentiment classes of positive or negative (sometimes neutral is also considered). The task may also involve placing the text's sentiment on a continuous (or discrete) scale of polarities, thus defining the degree of positivity (or sentiment polarity). The sentiment polarity classification may deal with a different set of classes depending upon the context. For example, in a rating system for movies, sentiment polarities may be defined as liked versus disliked, or in a debate the views may be classified as for versus against.
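As a rough illustration of polarity classification, the following minimal sketch scores a sentence against a tiny word list and maps the score to a class. The lexicon, its weights, the threshold, and the function names are all hypothetical and chosen only for this example; practical systems rely on much larger lexicons or trained classifiers.

```python
# Hypothetical mini-lexicon: word -> polarity weight in [-1, 1] (illustrative only).
LEXICON = {
    "love": 1.0, "great": 1.0, "good": 0.5,
    "bad": -0.5, "hate": -1.0, "terrible": -1.0,
}


def polarity_score(text):
    """Average the weights of the lexicon words found in the text (0.0 if none)."""
    tokens = [word.strip(".,!?").lower() for word in text.split()]
    scores = [LEXICON[token] for token in tokens if token in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def polarity_class(text, threshold=0.1):
    """Map the continuous score onto discrete classes; the threshold is an assumption."""
    score = polarity_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(polarity_class("I love to use Twitter"))
print(polarity_class("Twitter is a social network"))
```

The continuous score corresponds to the degree of positivity mentioned above, while the thresholded label corresponds to the discrete sentiment classes.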
Opinion classification or sentiment extraction from a piece of text is an important task in the process of sentiment analysis. This is often followed by a summarization of sentiments. To draw insights from different texts related to the same topic (say, reviews of a given movie), it is important to aggregate (or summarize) the sentiments into a consumable form from which conclusions can be drawn (whether the movie is a blockbuster or a dud). This may involve the use of visualizations to infer the overall sentiment.
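One simple way to summarize sentiments is to count the labels assigned to individual texts and reduce them to a single figure. The sketch below assumes each review has already been labeled as positive, negative, or neutral; the sample labels, the function name, and the "net sentiment" measure are illustrative assumptions rather than a standard definition.

```python
from collections import Counter


def summarize_sentiments(labels):
    """Aggregate per-text sentiment labels into counts and a net score in [-1, 1]."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {"counts": {}, "net_sentiment": 0.0}
    # Net sentiment: share of positive texts minus share of negative texts.
    net = (counts["positive"] - counts["negative"]) / total
    return {"counts": dict(counts), "net_sentiment": net}


# Hypothetical labels for four reviews of the same movie.
review_labels = ["positive", "positive", "negative", "neutral"]
summary = summarize_sentiments(review_labels)
print(summary)
```

The resulting counts can be fed directly into a bar chart or similar visualization.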
As we have seen across the chapters, feature identification and extraction is what makes or breaks a machine learning algorithm. It is the most important factor after the data itself. Let us look at some of the feature sets utilized in solving the problem of sentiment analysis:
TF-IDF is given as:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

Where, tf(t,d) is the term frequency of term t in document d, and idf(t,D) is the inverse document frequency for term t in document set D.
For example, we have the following screenshots of two documents with their terms and their corresponding frequencies:
In its simplest form, TF-IDF for the term Twitter can be given as:
Different weight schemes can be used for calculating tfidf; the preceding example uses log with base 10 to calculate idf.
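The calculation can be sketched in a few lines of code. The two documents below are hypothetical stand-ins, not the ones from the original example, and the function names are chosen only for illustration. The sketch uses the raw term count as tf and log base 10 for idf, which is one of several possible weighting schemes.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the documents and their contents are illustrative only.
documents = {
    "doc1": "twitter is a social network and twitter is popular".split(),
    "doc2": "facebook is a social network".split(),
}


def tf(term, doc_tokens):
    """Raw term frequency: how many times the term occurs in the document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """Inverse document frequency with log base 10; 0.0 if the term never occurs."""
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    if n_containing == 0:
        return 0.0
    return math.log10(len(corpus) / n_containing)


def tfidf(term, doc_id, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, corpus[doc_id]) * idf(term, corpus)


print(tfidf("twitter", "doc1", documents))  # term unique to doc1: positive weight
print(tfidf("social", "doc1", documents))   # term in every document: weight 0.0
```

A term that appears in every document gets an idf of zero, which is why common words contribute little to the feature vector.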
n-Grams: Computational linguistics and probability consider a text corpus as a contiguous sequence of terms, which may be phonemes, letters, words, and so on. The n-gram-based modeling techniques find their roots in information theory, where the likelihood of the next character or word is based upon the previous n-1 terms. Depending upon the value of n, the feature vector or model is termed as unigram (for n=1), bigram (for n=2), trigram (for n=3), and so on. n-grams are particularly useful with out-of-vocabulary words and approximate matches. For example, considering a sequence of words, the sentence "A chapter on sentiment analysis" has the bigrams "a chapter", "chapter on", "on sentiment", and "sentiment analysis".
Interesting work by Google on using n-grams: http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html
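Word-level n-grams can be extracted with a sliding window over the tokens. The following is a minimal sketch; the function name and the whitespace tokenization are simplifying assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a sequence of tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A window of size n can start at positions 0 .. len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "A chapter on sentiment analysis".lower().split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('a', 'chapter'), ('chapter', 'on'), ('on', 'sentiment'), ('sentiment', 'analysis')]
```

Passing n=1 or n=3 to the same function yields unigrams or trigrams respectively.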
The following example shows the parts of speech (adjectives, nouns, and so on) tagged in the sample sentence "We saw the yellow dog":
Now that we have a basic understanding of the key concepts from the world of sentiment analysis, let us look at different approaches for tackling this problem.
Sentiment analysis is mostly performed at the following two levels of abstraction:
Much like other machine learning problems, sentiment analysis can be tackled using supervised and unsupervised methods:
Reference:
Linguistic heuristics: Vasileios Hatzivassiloglou and Kathleen McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the Joint ACL/EACL Conference, pages 174–181, 1997.
Bootstrapping: Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.
Turney: Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424, 2002.
As we have been discussing throughout, the extent of our reliance upon online opinions is somewhat surprising. We knowingly or unknowingly check these opinions or are influenced by them before buying products, downloading software, selecting apps, or choosing a restaurant. Sentiment analysis or opinion mining finds its application in many areas; these can be summarized into the following broad categories:
Apart from the aforementioned two categories, opinion mining acts as an augmenting technology in fields such as recommendation engines and general prediction systems. For example, opinion mining may be used in conjunction with recommender engines to exclude products from recommendation lists which have opinions or sentiments below certain thresholds. Sentiment analysis may also find innovative use in predicting whether an upcoming movie will be a blockbuster or not based on sentiments related to the star cast, production house, topic of the movie, and so on.
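The idea of combining a recommender with opinion mining can be sketched as a simple post-filter. In the snippet below, the product names, the sentiment scores, the threshold, and the function name are all hypothetical; keeping products that have no sentiment score yet is a design assumption, not a rule.

```python
def filter_recommendations(recommended, sentiment_scores, min_score=0.0):
    """Drop recommended products whose aggregate sentiment is below min_score.

    Products without any score are kept (an assumption made for this sketch).
    """
    return [
        product
        for product in recommended
        if sentiment_scores.get(product, min_score) >= min_score
    ]


# Hypothetical recommender output and per-product sentiment scores in [-1, 1].
recommended = ["camera", "phone", "tablet", "laptop"]
sentiment_scores = {"camera": 0.6, "phone": -0.4, "tablet": 0.0}

filtered = filter_recommendations(recommended, sentiment_scores)
print(filtered)
# ['camera', 'tablet', 'laptop']
```

The order produced by the recommender is preserved, so the filter only removes items and never re-ranks them.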
Understanding the opinions and sentiments of others is an inherently difficult task, and handling such a problem algorithmically is equally hard. The following are some of the challenges faced while performing sentiment analysis: