D.I. Hernández Farías (a,b); P. Rosso (a)
(a) Technical University of Valencia, Valencia, Spain
(b) University of Turin, Turin, Italy
Irony and sarcasm are sophisticated forms of speech in which authors write the opposite of what they mean. They have been studied in linguistics, psychology, and cognitive science. While irony is often used to emphasize occurrences that deviate from the expected, sarcasm is commonly used to convey implicit criticism. However, the detection of irony and sarcasm is a complex task, even for humans. The difficulty in recognizing irony and sarcasm causes misunderstanding in everyday communication and poses problems for many natural language processing tasks such as sentiment analysis. This is particularly challenging when one is dealing with social media messages, where the language is concise, informal, and ill-formed.
Irony detection; Sarcasm detection; Figurative language processing; Social media; Twitter; User-generated content
This work was done in the framework of the SomEMBED MINECO TIN2015-71147-C2-1-P research project. The National Council for Science and Technology (CONACyT Mexico) funded the research of Delia Irazú Hernández Farias (grant no. 218109/313683 CVU-369616).
Every day, people make judgments about their environment; this is an inherent human behavior. There are different ways to express our opinions, and one of the most interesting is through figurative language devices such as irony and sarcasm. These allow us to express ourselves in a particular way, using words not only in their most salient meaning but also in a creative and funny sense. The use of words or expressions with a meaning that differs from the literal interpretation is known as figurative language.
Irony and sarcasm are two interesting and strongly related concepts. Usually people do not have a clear idea of what they are. However, from early childhood we begin to use them in our daily life. They have been a topic studied by different disciplines, such as linguistics, philosophy, psychology, psycholinguistics, cognitive science, and recently computational linguistics. Each discipline has tried to define what they are, how they are produced, and why they are used.
These figurative devices give us the opportunity to explore the interaction between cognition and language. Broadly speaking, irony and sarcasm are figurative language devices that serve to achieve different communication purposes. The commonest definition of irony refers to an utterance by which the speaker expresses a meaning opposite to what is literally said. There are different theories that attempt to explain what irony is. Grice’s theory [1] points out that the speaker intentionally violates the “maxim of quality” (the speaker does not say what he or she believes to be false) when he or she expresses an ironic utterance. Some theories, such as the one described in [2], define it beyond the literal sense of the words: for Wilson and Sperber [2] an ironic utterance is an “echoic mention” that alludes to some real or hypothetical proposition to demonstrate its absurdity. Attardo [3] considers an ironic utterance as a form of “relevant inappropriateness” in which the speaker relies on the ability of the listener to reject the literal meaning on the basis of the disparity between what is literally said and the context in which it is said. On the other hand, the “failed expectation” intention (ie, the speaker’s approval or disapproval of the entity or situation at hand) behind an ironic expression has been studied by Utsumi [4] and Kumon-Nakamura and Glucksberg [5].
Usually, irony is considered as a broader term that covers also sarcasm [6, 7]. Irony may be positive (ie, noncritical), while sarcasm usually is not [8, 9]. Sarcasm is commonly more aggressive and offensive than irony. In this work irony and sarcasm are treated as two different concepts.
Social media offer a face-saving way for people to express themselves, and they sometimes choose to use ironic or sarcastic utterances to communicate their attitude or evaluative judgment toward a particular target (eg, a public person, a product, a movie, or an event). The presence of ironic or sarcastic content in human communication may cause misunderstandings. Identification of this intention is not a trivial task even for humans: different cognitive processes are involved and knowledge of the environment is needed. For natural language processing tasks such as sentiment analysis, this kind of subjective user-generated content is a big challenge. In some cases the presence of ironic content plays a particular role: “polarity reversal.” This means, for instance, that an utterance seems to be positive but its real intention is negative (or vice versa).
We introduce the following example, extracted from an ironic set of Amazon reviews collected by Filatova [10]: “I would recomend this book to friends who have insomnia or those who I absolutely despise.”1 For a sentiment analysis system that exploits the basic approach of considering the frequency of positive and negative terms to assign a polarity, this sentence could be considered as positive. The words recomend (recommend), book, and friends are positive terms, while insomnia and despise denote a negative sense. Therefore in the sentence there are three positive terms and two negative terms, and the sentence could be identified as positive. However, this review conveys a meaning far from positive. The author expresses a negative judgment against the book in an imaginative way. On one hand, the author writes about recommending the book, which can be considered as a positive aspect about the target (the book), but at the same time there is a point about “friends who have insomnia” or “those who I absolutely despise.” Thus the author’s hidden intention could be to state that the “book” is so boring as to induce sleep (even in those who have insomnia). Research in irony could not only improve the performance of sentiment analysis systems but could also help us to understand the cognitive process involved and how humans process and produce utterances of this kind. After introducing the state of the art in irony and sarcasm detection, we investigate the impact that the use of these figurative language devices may have on sentiment analysis.
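The failure mode just described can be sketched in a few lines of code. The tiny word lists below are illustrative stand-ins for a real sentiment lexicon, not an actual resource:

```python
# Minimal sketch of the naive lexicon-counting approach described above.
# POSITIVE/NEGATIVE are toy lexicons, chosen only to mirror the example.
POSITIVE = {"recommend", "recomend", "book", "friends"}
NEGATIVE = {"insomnia", "despise"}

def naive_polarity(text):
    """Label a text by comparing counts of positive and negative terms."""
    tokens = text.lower().replace(".", "").split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

review = ("I would recomend this book to friends who have insomnia "
          "or those who I absolutely despise")
print(naive_polarity(review))  # the ironic review is labeled "positive"
```

Three positive terms against two negative ones make the counter label the review positive, even though its intended meaning is negative.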
This chapter is organized as follows. In Section 2 we describe the state of the art in irony and sarcasm detection. In Section 3 we address the impact that figurative language has on sentiment analysis. We analyze three shared tasks that have been recently organized. Section 4 discusses future trends and directions. In Section 5 we draw some conclusions.
Irony and sarcasm detection are considered as special cases of text classification, where the main goal is to distinguish ironic (or sarcastic) texts from nonironic (or nonsarcastic) ones. To analyze figurative devices of this kind, it is necessary to consider not only the syntactic and lexical textual level (to extract salient features such as word position and punctuation marks) but also semantics (literal vs. nonliteral meaning of the words), pragmatics (words matching with the appropriate context), and discourse analysis (relation between the utterance at hand and the way in which it is expressed). However, the progress so far has been a result of the use of mainly syntactic, lexical, and shallow semantics.
Dealing with social media texts is a challenging task. They have specific characteristics: they are informal and use ill-formed language. People express themselves in a face-saving way by unstructured content. Usually, social media texts contain spelling mistakes, abbreviations, and slang. In Twitter, the text should be written in a maximum of 140 characters; therefore figurative language is expressed in a very concise manner, which causes an additional issue. When people express their opinions by ironic or sarcastic utterances, they can choose how to use the language to achieve their communicative goals. There is no particular structure to construct ironic or sarcastic utterances.
Thus the main objective of irony and sarcasm detection is to discover features that allow us to discriminate ironic (or sarcastic) texts from nonironic (or nonsarcastic) ones. Interest in irony and sarcasm detection in social media requires user-generated data that capture the real use of figurative language devices of this kind. As in most natural language processing tasks, the lack of corpora is an issue. There are two main approaches for ironic/sarcastic corpus construction: self-tagging and crowdsourcing. The first considers as positive instances those texts in which the author points out her intention using an explicit label (eg, the hashtags #irony and #sarcasm); in this case we rely on the author’s definition of what irony or sarcasm is. Crowdsourcing involves human annotators labeling the content as ironic (or sarcastic). The labeling process is usually done without any strict definition or guideline; therefore it is a subjective task, where the agreement between annotators is often very low. In both ways it is possible to obtain potentially ironic and sarcastic texts produced by people in social media.
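The self-tagging approach can be sketched as a simple filter. The hashtag list and the habit of removing the label hashtag (so a classifier cannot trivially key on it) are common practice, but the exact details here are assumptions:

```python
import re

# Hedged sketch of self-tagged corpus construction: texts whose authors
# used an explicit #irony or #sarcasm label become positive instances,
# and the label hashtag itself is removed before training.
LABELS = re.compile(r"#(irony|ironic|sarcasm|sarcastic)\b", re.IGNORECASE)

def self_tagged_split(texts):
    """Split texts into self-tagged ironic/sarcastic instances and the rest."""
    ironic, other = [], []
    for t in texts:
        if LABELS.search(t):
            ironic.append(LABELS.sub("", t).strip())  # strip the label hashtag
        else:
            other.append(t)
    return ironic, other

tweets = ["Great, another Monday #sarcasm", "Lovely weather today"]
pos, neg = self_tagged_split(tweets)
```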
For computational linguistic purposes, irony and sarcasm are often considered as synonyms. The following subsections describe some proposed approaches to address irony and sarcasm detection. The first one is focused on work where the ironic intention was considered as an overall term, while the second one is focused on research where sarcasm was considered as a different concept.
One of the first studies in irony detection was by Carvalho et al. [11]. They worked on the identification of a set of surface patterns to identify ironic sentences in a Portuguese online newspaper. The most relevant features were the use of punctuation marks and emoticons. Veale and Hao [12] conducted an experiment by harvesting the web, looking for a commonly used framing device for linguistic irony: the simile (two queries “as * as *” and “about as * as *” were used to retrieve snippets from the web). They analyzed a very large corpus to identify characteristics of ironic comparisons, and presented a set of rules to classify a simile as ironic or nonironic.
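As a rough illustration of the simile framing device, the regex below approximates the "as * as *" pattern on plain text; it is an assumption for illustration, not Veale and Hao's actual web queries or classification rules:

```python
import re

# Toy matcher for the "as <adjective> as <vehicle>" simile frame.
SIMILE = re.compile(r"\bas\s+(\w+)\s+as\s+(?:(?:a|an|the)\s+)?([\w ]+)",
                    re.IGNORECASE)

def find_similes(text):
    """Return (ground, vehicle) pairs for each simile-like span in the text."""
    return [(m.group(1), m.group(2).strip()) for m in SIMILE.finditer(text)]

print(find_similes("The meeting was about as useful as a chocolate teapot"))
```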
Reyes et al. [13] analyzed tweets tagged with the hashtags #irony and #humor to identify textual features for distinguishing between them. They proposed a model that includes structural, morphosyntactic, semantic and psychological features. Additionally, they considered the polarity expressed in a tweet using the Macquarie Semantic Orientation Lexicon.2 They experimented with different feature sets and a decision tree classifier, obtaining encouraging results (F measure of approximately 0.80).
Afterward, Reyes et al. [14] collected a corpus composed of 40,000 tweets, relying on the “self-tagged” approach. Four different hashtags were selected: #irony, #education, #politics, and #humor. Their model is organized according to four types of conceptual features—signatures (such as punctuation marks, emoticons, and discursive terms), unexpectedness (opposition, incongruency, and inconsistency in a text), style (recurring sequences of textual elements), and emotional scenarios (elements that symbolize sentiment, attitude, feeling, and mood)—by exploiting the Dictionary of Affect in Language (DAL).3 They addressed the problem as a binary classification task, distinguishing ironic tweets from nonironic tweets by using naïve Bayes and decision tree classifiers. They achieved an average F measure of 0.70.
Barbieri and Saggion [15] proposed a model to detect irony using lexical features, such as the frequency of rare and common terms, punctuation marks, emoticons, synonyms, adjectives, and positive and negative terms. They compared their approach with that of Reyes et al. [14] on the same corpus using a decision tree, and obtained results slightly better than those previously obtained. They concluded that rare words, synonyms, and punctuation marks seem to be the most discriminating features. Hernández-Farías et al. [16] described an approach for irony detection that uses a set of surface text properties enriched with sentiment analysis features. They exploited two widely applied sentiment analysis lexicons: Hu&Liu4 and AFINN.5 They experimented with the same dataset used in [14, 15]. Their proposal was evaluated with use of a set of classifiers composed of naïve Bayes, decision tree, support vector machine (SVM), multilayer perceptron, and logistic regression classifiers. The proposed model improved on the previous results (F measure of approximately 0.79). The features related to sentiment analysis were the most relevant.
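A feature extractor combining surface properties with sentiment-lexicon counts, in the spirit of the approaches above, might look as follows. The two word sets are tiny stand-ins for real lexicons such as Hu&Liu or AFINN, and the feature names are illustrative:

```python
# Illustrative surface + sentiment feature extractor for a tweet.
POS_WORDS = {"love", "great", "nice"}   # stand-in for a positive lexicon
NEG_WORDS = {"late", "hate", "boring"}  # stand-in for a negative lexicon

def extract_features(tweet):
    """Map a tweet to a dictionary of surface and sentiment features."""
    tokens = tweet.lower().split()
    return {
        "exclamations": tweet.count("!"),
        "questions": tweet.count("?"),
        "ellipsis": tweet.count("..."),
        "uppercase_words": sum(w.isupper() and len(w) > 1
                               for w in tweet.split()),
        "positive_terms": sum(t.strip("!?.,#") in POS_WORDS for t in tokens),
        "negative_terms": sum(t.strip("!?.,#") in NEG_WORDS for t in tokens),
    }

feats = extract_features("I just LOVE waiting for a late bus!!!")
```

Such a vector would then feed a standard classifier (eg, a decision tree or an SVM), as in the systems described above.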
Buschmeier et al. [17] presented a classification approach using the Amazon review corpus collected by Filatova [10], which contains both ironic and nonironic reviews annotated by Mechanical Turk crowdsourcing. They proposed a model that takes into account features such as n-grams, punctuation marks, interjections, emoticons, and the star rating of each review (a particular feature of Amazon reviews that, according to the authors, seems to help achieve good performance in the task). They experimented with a set of classifiers (composed of naïve Bayes, logistic regression, decision tree, random forest, and SVM classifiers), achieving an F-measure rate of 0.74.
Wallace et al. [18] attempted to undertake the study of irony detection using contextual features, specifically by combining noun phrases and sentiment extracted from comments. They proposed exploiting information regarding the conversational threads to which comments belong. Their approach capitalizes on the intuition that members of different user communities are likely to be sarcastic about different things. A dataset of comments posted on Reddit6 was used.7
Karoui et al. [19] recently presented an approach to separate ironic from nonironic tweets written in French. They proposed a two-stage model. In the first part they addressed the irony detection as a binary classification problem. Then the misclassified instances are processed by an algorithm that tries to correct them by querying Google to check the veracity of tweets with negation. They represented each tweet with a vector composed of six groups of features: surface (such as punctuation marks, emoticons, and uppercase letters), sentiment (positive and negative words), sentiment shifter (positive and negative words in the scope of an intensifier), shifter (presence of an intensifier, a negation word, or reporting speech verbs), opposition (sentiment opposition or contrast between a subjective and an objective proposition), and internal contextual (the presence/absence of personal pronouns, topic keywords, and named entities). The authors experimented with an SVM as a classifier, achieving an F measure of 0.87.
To sum up, several approaches have been proposed to detect irony as a classification task. Many of the features employed have already been used in various tasks related to sentiment analysis, such as polarity classification. The ironic intention is captured mainly by the exploitation of surface features such as punctuation marks and emoticons. These kinds of lexical cues have been shown to be useful for distinguishing ironic content, especially in tweets. This may in some way confirm users’ need to add textual markers to compensate for the absence of paralinguistic cues. Besides, many authors point out the importance of capturing the inherent incongruity in ironic utterances. To achieve this goal, the presence of opposite polarities (positive and negative words) and the use of semantically unrelated terms (synonyms and antonyms) have been considered in many approaches. Both kinds of features seem to be relevant to distinguish ironic from nonironic utterances. Decision trees are among the classifiers that produced the best results.
To determine whether specific lexical factors (eg, the use of some part of speech or punctuation marks) play a role in sarcasm detection, Kreuz and Caucci [20] asked some college students to read excerpts from paragraphs that originally contained the “said sarcastically” sentence (removed before the task). The participants were able to distinguish sarcastic from nonsarcastic utterances. This work represents a key to consider the influence that lexical factors can have in the analysis of social media content.
One of the first approaches that considered the #sarcasm hashtag as an indicator of sarcastic content was developed by Davidov et al. [21]. They introduced a semisupervised algorithm for sarcasm detection that considers as features frequent words, punctuation marks, and syntactic patterns so as to identify sarcastic utterances. They collected a dataset from both Amazon and Twitter; their results seem to be promising, with F measures close to 0.80.
Gonzalez et al. [22] performed an experiment on two datasets: a set of self-tagged tweets and a manually annotated set. They considered as sarcastic instances a set of self-tagged tweets containing the #sarcasm or #sarcastic hashtag, and as nonsarcastic instances some positive and negative tweets (retrieved with use of different hashtags, such as #happy, #joy, and #lucky and #sadness, #angry, and #frustrated respectively). As features they considered interjections and emoticons as well as some resources such as LIWC8 and WordNet-Affect.9 They attempted to distinguish between sarcastic, positive, and negative tweets. They applied an SVM and logistic regression as classifiers. Their reported results are related to both datasets; the overall accuracy rate was around 0.57. They suggested that their results demonstrate the difficulty of sarcasm detection for both humans and machine learning methods.
According to Riloff et al. [23], a common form of sarcasm in Twitter consists of a positive sentiment contrasting with a negative situation (eg, absolutely adore it when my bus is late #sarcasm). The goal of their research was to recognize sarcasm instances containing this pattern.10 They presented a bootstrapping algorithm that automatically learns phrases corresponding to negative situations. As sarcastic instances for the learning process, tweets that contained a sarcasm hashtag were retrieved. From the bootstrapping process they collected some positive sentiment verb phrases, predicative expressions, and negative situation phrases. They also performed some binary classification experiments using an SVM classifier. They used a set of features that contain not only their list of phrases but also n-grams and three sentiment and subjectivity lexicons (Hu&Liu, AFINN, and MPQA11 ). The best result (F measure 0.51) was achieved by a hybrid approach where a tweet is considered as sarcastic if either it contains a contrast (according to their list of phrases) or it is identified as such by the SVM (with unigram and bigram features).
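The contrast pattern at the core of this approach can be sketched directly. In the actual system the two phrase lists are learned by bootstrapping; here they are tiny hand-written stand-ins:

```python
# Hedged sketch of the positive-sentiment / negative-situation contrast.
# The phrase lists are illustrative placeholders for the learned ones.
POSITIVE_VERB_PHRASES = ["absolutely adore", "love", "can't wait"]
NEGATIVE_SITUATIONS = ["bus is late", "working on weekends", "being ignored"]

def contrast_sarcasm(tweet):
    """Flag a tweet when a positive phrase precedes a negative situation."""
    t = tweet.lower()
    for pos in POSITIVE_VERB_PHRASES:
        for neg in NEGATIVE_SITUATIONS:
            p, n = t.find(pos), t.find(neg)
            if p != -1 and n != -1 and p < n:
                return True
    return False

print(contrast_sarcasm("absolutely adore it when my bus is late"))  # True
```

In the hybrid system described above, a tweet flagged by such a contrast check or by the SVM (with unigram and bigram features) is labeled sarcastic.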
Wang [24] presented a study to identify similarities and distinctions between irony and sarcasm. The study consisted of a quantitative sentiment analysis and a qualitative content analysis. A set of sarcastic and ironic tweets collected by the self-tagging approach was used. She found that sarcastic tweets were more positive than ironic ones.
Barbieri et al. [25] attempted to study the differences between ironic and sarcastic tweets. They addressed the problem as a binary classification task between tweets tagged with the #irony and #sarcasm hashtags. Their system is similar to the one presented in [15] for irony detection; they included two new features in their model: whether a tweet contains a URL and the presence of named entities. The model was evaluated with use of a decision tree as a classifier. They obtained an F measure of 0.62; this result emphasizes the difficulty of distinguishing between irony and sarcasm. Barbieri et al. mentioned the two most relevant features to distinguish between ironic and sarcastic tweets: the use of adverbs (more intense ones in sarcastic samples) and the sentiment value (sarcastic tweets are denoted by more positive words than ironic tweets).
Fersini et al. [26] addressed sarcasm detection by introducing an ensemble approach (the Bayesian model average). As features they used emoticons, punctuation marks, onomatopoeic expressions, part-of-speech labels, and a bag of words. They collected a set of tweets using the #sarcasm and #sarcastic hashtags, then three annotators were asked to determine the presence of sarcastic content in tweets. They also evaluated the ensemble method over the corpus presented in [14]. Their results, around 0.8 in F-measure terms in both corpora, seem to indicate that this strategy outperforms those that use traditional classifiers.
Rajadesingan et al. [27] developed a framework for sarcasm detection that uses a behavioral modeling approach. It defines some criteria so as to determine whether a tweet is sarcastic or not, by leveraging behavioral traits (using some of the user’s past tweets) and textual-content features (such as punctuation marks, uppercase words, and parts of speech). Rajadesingan et al. collected tweets that contain the #sarcasm and #not hashtags as sarcastic instances; as negative instances the last 80 tweets from each sarcastic sample’s author were retrieved. A binary classification task was performed between the sarcastic and nonsarcastic instances with use of decision tree, logistic regression, and SVM classifiers. Their results seem to be good, reaching rates above 0.70 in terms of accuracy.
A similar approach is that of Bamman and Smith [28], who stated that modeling the relationship between a sarcastic tweet and the author’s past tweets can improve accuracy. They presented some experiments to discern the effect of sarcasm by using features derived not only from the local context of the message itself (words in the tweet and parts of speech, among others) but also from information about the author, the relationship with his or her audience, and the immediate communicative context they both share (such as salient historical terms and topics and profile information). For evaluation purposes, all tweets with #sarcasm or #sarcastic in the GardenHose sample of tweets in the period from August 2013 to July 2014 were used as sarcastic instances, while for the nonsarcastic ones the 3200 most recent tweets from each “sarcastic author” (ie, the user who posted a tweet labeled with #sarcasm or #sarcastic in the subset) were retrieved. As a classifier a binary logistic regression was employed, achieving an accuracy of 0.851.
To sum up, there is a consistent body of work focused on sarcasm detection. It is a controversial issue whether irony and sarcasm are considered to be similar linguistic phenomena. Almost the same features used for irony detection have been employed for sarcasm detection. Among the most widely applied features we mention punctuation marks and part-of-speech labels. As classifiers, logistic regression and SVMs have been the most used ones for sarcasm detection. Recent approaches on sarcasm detection consider information beyond the text itself, exploiting contextual information and information about the user.
In recent years, interest in understanding the role of irony and sarcasm in sentiment analysis has led to different evaluation campaigns. Their main objective is not to identify ironic or sarcastic content but to develop systems that are able to correctly classify the polarity of figurative language social media texts. The presence of figurative language devices such as irony and sarcasm usually causes a polarity reversal. Irony and sarcasm detection is a necessary and important part of a sentiment analysis system because the performance of the latter is affected by the performance of the former. Maynard and Greenwood [29] performed an experiment to measure the effect of sarcasm on the polarity of tweets. They proposed a set of rules to improve the accuracy of sentiment analysis when sarcasm is present.
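A minimal polarity-reversal rule of this kind can be sketched as follows. The hashtag-based detector is a trivial placeholder, not Maynard and Greenwood's actual rule set:

```python
# Minimal sketch of a polarity-reversal rule: when a sarcasm detector
# fires, the literal positive/negative label is flipped.
def detect_sarcasm(tweet):
    """Trivial placeholder detector based on explicit hashtags."""
    t = tweet.lower()
    return "#sarcasm" in t or "#irony" in t

def adjust_polarity(tweet, literal_polarity):
    """Reverse a positive/negative label when the tweet is sarcastic."""
    if detect_sarcasm(tweet) and literal_polarity in ("positive", "negative"):
        return "negative" if literal_polarity == "positive" else "positive"
    return literal_polarity

print(adjust_polarity("Best Monday ever #sarcasm", "positive"))  # negative
```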
In the following, three different evaluation campaigns are introduced. In Section 3.1 we describe a pilot subtask to identify ironic content. A sentiment classification task in Twitter for both sarcastic and nonsarcastic social media text is presented in Section 3.2. Finally, a recent sentiment analysis task wholly dedicated to figurative language in Twitter is described in Section 3.3.
In the context of Evalita12 2014, the sentiment polarity classification task [30] was organized. Its main focus was sentiment classification at the message level of Italian tweets. The task was divided into three independent subtasks: (1) subjectivity classification, (2) polarity classification, and (3) irony detection. Participants were provided with a dataset composed of a collection of 6448 tweets in Italian (70% for training and 30% for the test) derived from two existing corpora: SENTI-TUT [31] and TWITA [32]. Each tweet in the dataset was labeled according to subjectivity (subjective or objective), polarity (positive, negative, neutral, or mixed), and the presence of ironic content. The systems were evaluated by means of the F measure for each subtask. Eleven teams participated in the sentiment polarity classification task (further information about each system can be found in [33]). Table 7.1 summarizes the results obtained13 by the teams that participated in the irony detection task.
Table 7.1
Sentiment Polarity Classification Task Results in F-Measure Terms
Team | Task 1 | Task 2 | Task 3 |
--- | --- | --- | --- |
UNIBA2930 | 0.71 | 0.67 | – |
UNITOR | 0.68 | 0.62 | 0.57 |
IRADABE | 0.67 | 0.63 | 0.54 |
SVMSLU | 0.58 | 0.60 | 0.53 |
ITGETARUNS | 0.52 | 0.51 | 0.49 |
Mind | 0.59 | 0.53 | 0.47 |
fbkshelldkm | 0.55 | 0.56 | 0.47 |
UPFtaln | 0.64 | 0.60 | 0.46 |
Baseline | 0.40 | 0.37 | 0.44 |
All the participants outperformed the established baseline. The F-measure rates for both subjectivity and polarity classification were near 0.70, while on subtask 3 the values were below 0.60. This confirms the difficulty of the ironic content–related subtask. The best ranked team for the first two subtasks (UNIBA2930 [34]) did not participate in the irony detection task (see Table 7.1). No system was developed to particularly address the irony detection subtask.
Most systems used supervised learning, and the SVM algorithm was the most popular. One further challenge for this task was the lack of Italian resources as well as natural language processing tools (such as tokenizers and part-of-speech taggers); however, some systems (eg, UNIBA2930 and IRADABE) translated some of the resources available in English into Italian. For classification purposes a variety of features were used, such as a bag of words, punctuation marks, emoticons, and Twitter language markers (such as hashtags and mentions). UNITOR [35], the best ranked system in irony detection, proposed an “ironic vector” that captures the presence of some features such as punctuation marks, emoticons, a bag of words, and a sentiment analysis resource in Italian called Sentix14 to train an SVM classifier. IRADABE [36] exploited two different sets of features: textual (eg, n-grams, emoticons, parts of speech, and uppercase words) and information extracted from the in-house Italian version of English resources such as AFINN, SentiWordNet,15 Hu&Liu, DAL, and temporal compression and counterfactuality terms,16 together with an SVM classifier. The SVMSLU [37] system addressed the problem using an SVM for classification of binary vectors of tokens together with punctuation marks, hashtags, and retweet marks. In ITGETARUNS [38] a set of linguistic rules was defined to classify the tweets; the author considered some markers such as intensifiers, diminishers, and modal verbs. The Mind system [39] is based on multilayer Bayesian ensemble learning; the authors addressed the task under a hierarchical framework. If a given sentence is detected as ironic, then its positive or negative polarity is reversed. On the other hand, if the sentence is ironic but its polarity has been classified as mixed, then it is switched to negative. The system takes into account only a vector composed of terms for which a Boolean weight was computed; no additional information was added.
A description of the fbkshelldkm system is not available in the proceedings of the task.
Finally, the UPFtaln [40] system addressed the task by a decision tree classifier. This approach is similar to the one presented in [15] for irony detection. The main difference is the use of Italian resources: Italian WordNet 1.6,17 Sentix, and the CoLFIS corpus.18
In recent years, as part of SemEval,19 a task on sentiment analysis in Twitter has been organized [41–43]. The participating systems were required to assign one of the following labels: positive, negative, or objective (neutral). The organizers provided two datasets20 for training and the test, composed of social media texts, mainly from Twitter.
In both 2014 and 2015 the participating systems were evaluated also on a subset of sarcastic tweets. In 2014 a small set of tweets that contained #sarcasm was added to the test set, whereas in 2015 a set of tweets were manually labeled as “sarcastic” by human annotators. In Table 7.2 we show the seven best performing systems among the 44 participating systems.
Table 7.2
Sentiment Analysis Task in F-Measure Terms for Both Regular and Sarcastic Tweets in the 2014 Edition of SemEval
System | Twitter 2014 | Sarcasm 2014 |
--- | --- | --- |
TeamX | 70.96 | 56.50 |
coooolll | 70.14 | 46.66 |
RTRGO | 69.95 | 47.09 |
NRC-Canada | 69.85 | 58.16 |
TUGAS | 69.00 | 52.87 |
CISUC_KIS | 67.95 | 55.49 |
SAIL | 67.77 | 57.26 |
The results obtained by the best ranked teams in the 2015 edition are shown in Table 7.3. The overall drop in the F measure between regular and sarcastic tweets is slightly smaller than in 2014. From the tables it can be seen that there is a marked drop in performance when the systems were evaluated on the sarcastic tweets. Generally, sentiment analysis systems produce good results for regular content, but when the same systems are evaluated on sarcastic content the overall performance suffers. None of the proposed approaches directly tried to capture the sarcastic intention. All systems addressed the task with a supervised approach, taking into account features widely applied in sentiment analysis tasks such as a bag of words, part-of-speech tags, and punctuation marks.
Table 7.3
Best Results in the Sentiment Analysis Task in F-Measure Terms for Both Regular and Sarcastic Tweets in the 2015 Edition of SemEval
System | Twitter 2015 | Sarcasm 2015 |
--- | --- | --- |
Webis | 64.84 | 53.59 |
unitn | 64.59 | 55.01 |
lsislif | 64.27 | 46.00 |
INESC-ID | 64.17 | 64.91 |
Splusplus | 63.73 | 60.99 |
wxiaoac | 0.63 | 52.22 |
IOA | 62.62 | 65.77 |
Some of the systems used well-known resources such as AFINN, Hu&Liu, and SentiWordNet. A more detailed description of the shared task and the participating systems can be found in [41, 42].
Task 11 at SemEval 201521 was the first sentiment analysis task addressing figurative language devices such as irony, sarcasm, and metaphors. The goal of the task was not to directly detect any of the previously mentioned devices but to perform sentiment analysis in a fine-grained scale ranging from − 5 (very negative) to +5 (very positive). Since irony and sarcasm are typically used to criticize or to mock, and thus skew the perception of sentiment toward the negative, it is not enough for a system to simply determine whether the sentiment of a given tweet is positive or negative [44]. The participants were asked to determine the degree to which a sentiment was communicated rather than to assign a more general score (such as in the previously described tasks).
A corpus composed of three subsets of tweets was supplied to the participants: trial (1025), training (8000), and test (4000). The corpus construction involved crowdsourcing and some tweets explicitly tagged with hashtags such as #irony, #sarcasm, #not, and #yeahright or that contained words commonly associated with the use of a metaphor (eg, “literally” and “virtually”). Further information can be found in [44].
Fifteen teams participated in the task on sentiment analysis of figurative language. Table 7.4 shows the results of the seven best ranked systems according to the overall cosine similarity measure.
Table 7.4
Best Results in the Task on Sentiment Analysis of Figurative Language in Twitter (Cosine Similarity Measure)
| Team | All | Irony | Sarcasm |
| --- | --- | --- | --- |
| ClaC | 0.758 | 0.904 | 0.892 |
| UPF-taln | 0.711 | 0.873 | 0.903 |
| LLT_PolyU | 0.687 | 0.918 | 0.896 |
| EliRF | 0.658 | 0.905 | 0.904 |
| LT3 | 0.658 | 0.897 | 0.891 |
| ValenTo | 0.634 | 0.901 | 0.895 |
| HLT | 0.630 | 0.907 | 0.887 |
The best ranked system, ClaC [45], showed robustness across different sentiment analysis related tasks [46]. ClaC is based on a pipeline framework that groups different phases, from preprocessing to polarity induction. It exploits resources such as the NRC lexicon, Hu&Liu, and MPQA. In addition, the authors developed a new resource called Gezi (for more details, see [45, 46]). The main difference between their submissions to the two tasks was the machine learning algorithm used for polarity assignment: an SVM for the regular task and M5P (a decision tree regressor) for the figurative language tweets. Nevertheless, ClaC did not achieve the best performance on either the ironic or the sarcastic tweets in the figurative language task. The UPF-taln system [47] presented an extended approach that considered frequent, rare, positive, and negative words and also exploited a bag of words as features. To assign the polarity degree, the authors used a regression algorithm (random subspace with M5P). Their system achieved second place in the overall ranking.
Two similar and effective approaches were those proposed by LLT_PolyU [48] and EliRF [49], which obtained the best results on ironic and sarcastic tweets, respectively. Both considered n-grams, negation scope windows, and sentiment resources as features (LLT_PolyU exploited Hu&Liu, MPQA, AFINN, and SentiWordNet, while EliRF used Pattern, AFINN, Hu&Liu, the NRC lexicon, and SentiWordNet). In both systems, regression models (REPTree in LLT_PolyU and a regression SVM in EliRF) were used to calculate the polarity value.
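A negation scope window of the kind used in such systems can be approximated by reversing the prior polarity of lexicon words that fall within a fixed number of tokens after a negation cue. The following is a toy sketch only; the cue list, the tiny lexicon, and the window size are invented, whereas the actual systems relied on full resources such as AFINN and SentiWordNet.

```python
NEGATION_CUES = {"not", "no", "never", "n't"}
# Invented mini-lexicon of prior polarities, standing in for AFINN, MPQA, etc.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
WINDOW = 3  # number of tokens after a cue whose polarity is reversed

def score_with_negation(tokens):
    """Sum lexicon polarities, flipping words inside a negation scope window."""
    total, flip_until = 0.0, -1
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            flip_until = i + WINDOW  # open a new negation scope
            continue
        polarity = LEXICON.get(tok, 0.0)
        total += -polarity if i <= flip_until else polarity
    return total

print(score_with_negation("this is not a good movie".split()))  # -1.0
print(score_with_negation("this is a good movie".split()))      # 1.0
```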
The LT3 [50] and ValenTo [51] systems included in their feature sets the presence of punctuation marks, emoticons, and hashtags. To capture potential clues of figurative content in tweets, LT3 took advantage of features to detect changes in the narrative as well as contrasting, contradictory, and polysemic words; an SVM classifier was then used to determine the polarity value of tweets. ValenTo exploited sentiment analysis resources (such as AFINN, Hu&Liu, General Inquirer, and SentiWordNet) as well as resources containing emotional and psycholinguistic information (ANEW, DAL, SenticNet, and LIWC). In addition, a feature that reverses the polarity valence of a tweet when it contains a sarcastic intention was considered, and a linear regression model was used to assign the polarity value. Finally, the HLT system [44] used an SVM approach together with lexical features such as negation and intensifiers and some markers of amusement and irony.
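The polarity-reversal idea can be illustrated schematically: when an explicit sarcasm marker is detected, the lexicon-based valence of the tweet is inverted before being passed on to the regressor. This is a simplified, invented sketch (marker list and mini-lexicon are not ValenTo's actual resources):

```python
SARCASM_MARKERS = {"#sarcasm", "#irony", "#not", "#yeahright"}
# Invented mini-lexicon standing in for AFINN, Hu&Liu, SentiWordNet, etc.
LEXICON = {"love": 2.0, "wonderful": 2.0, "hate": -2.0, "awful": -2.0}

def tweet_valence(tokens):
    """Lexicon-based valence, reversed when an explicit sarcasm marker is present."""
    valence = sum(LEXICON.get(t, 0.0) for t in tokens)
    if any(t in SARCASM_MARKERS for t in tokens):
        valence = -valence  # sarcastic intention flips the literal polarity
    return valence

print(tweet_valence("i love mondays #sarcasm".split()))  # -2.0
print(tweet_valence("i love fridays".split()))           # 2.0
```

The sketch also exposes the limitation of relying on explicit markers: sarcastic tweets without such hashtags keep their literal, misleading valence.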
Irony detection and sarcasm detection have mainly been addressed as text classification tasks, with salient features such as lexical marks used to characterize ironic and sarcastic utterances. As figurative language devices, however, irony and sarcasm need to be studied beyond the textual content of the utterance: both the context in which an utterance is expressed and common knowledge should be considered to identify the real intention behind an ironic or sarcastic expression. There have been some attempts to take advantage of this kind of information. Wallace et al. [18] exploited contextual information from the forum where a comment was posted, and information about the users who wrote sarcastic tweets (such as their past tweets) was considered by Rajadesingan et al. [27] and by Bamman and Smith [28] to distinguish between sarcastic and nonsarcastic tweets.
In addition, it is necessary to consider how affective and emotional content is implicitly embedded in irony and sarcasm. Some work in the literature has already started to exploit affective information by using sentiment and affective lexica such as DAL [14, 16], AFINN and Hu&Liu [16], and SentiWordNet [15, 25].
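A straightforward way to exploit several lexica at once, as in the cited work, is to compute one aggregate score per resource and concatenate them into a feature vector for the classifier. A toy sketch, with invented mini-lexica standing in for resources such as AFINN, Hu&Liu, and DAL:

```python
# Invented mini-lexica for illustration; real systems load the full resources.
LEXICA = {
    "afinn_like": {"love": 3, "hate": -3, "nice": 2},
    "huliu_like": {"love": 1, "hate": -1, "boring": -1},
}

def affect_features(tokens):
    """One aggregate score per lexicon, concatenated into a feature vector."""
    return [sum(lex.get(t, 0) for t in tokens) for lex in LEXICA.values()]

print(affect_features("i love this nice movie".split()))  # [5, 1]
```

Because the resources differ in coverage and score scale, keeping one feature per lexicon lets the learning algorithm weight each resource separately rather than averaging them blindly.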
With regard to the impact of sentiment analysis on irony and sarcasm detection, it would be helpful to identify whether an utterance expresses ironic or sarcastic intention before its polarity is determined. Further investigation is needed to develop approaches that can efficiently identify ironic and sarcastic content and thus avoid misclassifying the polarity of a subjective text.
People communicate their ideas in complex ways. Figurative language devices such as irony and sarcasm are often used to express evaluative judgments in an unconventional way. Irony and sarcasm are difficult concepts to define; nevertheless, they are widely used in social media, and user-generated content therefore represents a big challenge. The progress achieved so far in irony and sarcasm detection has mainly come from exploiting the syntactic, lexical, and semantic levels of natural language processing, with similar approaches addressing the task as a binary classification problem. Currently, the biggest effort concerns identifying the most salient features that allow one to determine when the intended content of an utterance is ironic or sarcastic.
From the sentiment analysis perspective, the presence of irony and sarcasm affects the performance of the task. As we pointed out, state-of-the-art systems generally achieve good results on regular content, but their overall performance degrades when they are evaluated on ironic or sarcastic content. Therefore, robust sentiment analysis systems will need to recognize when human communication in social media makes use of figurative language devices such as irony and sarcasm.