WASSA 2012 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis Proceedings of the Workshop July 12, 2012 Jeju, Republic of Korea Endorsed by SIGSEM (ACL Special Interest Group on Computational Semantics) Endorsed by SIGNLL (ACL's Special Interest Group on Natural Language Learning) Endorsed by SIGLEX (Special Interest Group on the Lexicon of the Association for Computational Linguistics) Sponsored by the Academic Institute for Research in Computer Science (Instituto Universitario de Investigación Informática), University of Alicante, Spain © 2012 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860
[email protected]ISBN 978-1-937284-33-6 ii Introduction In the past years, the quantity of contents generated by users on the Web, in social networking sites, fora and microblogs has reached an unprecedented level. All this data adds on to the contents generated in traditional media, such as newspapers, bringing additional factual, as well as a high quantity of opinionated and subjective information. In the context of the society in which we live, where sifting through the immense quantities of information to gather knowledge has become a must, the challenge of processing opinionated and subjective information is becoming more and more a focus to the Natural Language Processing (NLP) research communities worldwide. In the past decade, the interest in proposing computational methods to deal with subjectivity and sentiment in text has grown constantly from the NLP community. However, although the subjectivity and sentiment analysis research fields have been highly dynamic in this period, much remains still to be done, so that systems dealing with subjectivity, sentiment and, more generally, affect in text, can be reliably used in critical decision-making environments. Moreover, the new means of communication and user connection, in microblogs and social networks, become more and more relevant to these two tasks, as the contexts (internal and external) of the information communication process bring about new challenges and applications to be explored. Inspired by the above-mentioned issues and the objectives we aimed at in the first two editions of the Workshop on Computational Approaches to Subjectivity Analysis (WASSA 2010 and WASSA 2.011), the purpose of the third edition of the Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2012) was to create a framework for presenting and discussing the challenges related to subjectivity and sentiment analysis in NLP and its applications, in traditional and Social Media contexts, from an interdisciplinary theoretical and practical perspective. WASSA 2012 was organized in conjunction to the 50th Annual Meeting of the Association for Computational Linguistics, on July 12, 2012, in Jeju, Korea. At this third edition of the workshop, we received a total of 31 submissions, from a wide range of countries, of which 11 were accepted as full papers and another 4 as short papers. Each paper has been thoroughly reviewed by 3 members of the Program Committee. The accepted papers were all highly assessed by the reviewers, the best paper receiving an average punctuation (computed as an average of all criteria used to assess the papers) of 4.6 out of 5. The main topics of the accepted papers are the creation and evaluation of resources for subjectivity and sentiment analysis in a cross-lingual and multilingual setting, subjectivity and sentiment analysis using semi-supervised and supervised methods in different types of texts (although the accent this year has been undoubtedly on Social Media texts) and affect detection in context. Additionally, the WASSA 2012 authors have enhanced the analysis of these phenomena beyond the traditional intra- textual aspects, towards the reader and writer intentions and interpretations, and have also analyzed the application of subjectivity and sentiment reseach in NLP to real-life, relevant scenarios (such as the detection of socially unacceptable behavior in online contexts). 
The invited talks reflected the multimodal and interdisciplinary nature of the research in affect-related phenomena as well. Prof. Rada Mihalcea, from the University of North Texas, presented a talk on "Multimodal Sentiment Analysis", linking the textual aspects of affect detection to affect detection iii in para-textual contexts. Prof. Janyce Wiebe's talk concentrated on language ambiguity in the subjectivity analysis area. In her keynote on "Subjectivity Word Sense Disambiguation", she showed the importance of distinguishing between objective and subjective usages of word senses. This year's edition has shown again that there is a demonstrated and growing interest in the topics addressed by WASSA and that the knowledge disseminated through this forum and the associated publications is making an important contribution to research in subjectivity and sentiment analysis. We would like to thank the ACL 2012 Organizers for the help and support at the different stages of the workshop organization process. We are also especially grateful to the Program Committee members and the external reviewers for the time and effort spent assessing the papers. We would like to extend our thanks to our invited speakers – Prof. Rada Mihalcea and Prof. Janyce Wiebe – for agreeing to deliver the keynote talks. We would also like to express our gratitude for the official endorsement we received from SIGSEM (the ACL Special Interest Group on Computational Semantics), SIGNLL (the ACL Special Interest Group on Natural Language Learning) and SIGLEX (the Special Interest Group on the Lexicon of the Association for Computational Linguistics). Furthermore, we would like to thank the Editors of the "Computer Speech and Language Journal", published by Elsevier, for agreeing to organize a Special Issue of this journal containing the extended versions of the best full papers accepted at WASSA 2012. We would like to express our gratitude to Yaniv Steiner from the European Commission Joint Research Centre (Italy), who created the WASSA logo, and to Miguel Ángel Varo and Miguel Ángel Baeza, from the University of Alicante, for the technical support they provided. Last, but not least, we are grateful for the financial support given by the Academic Institute for Research in Computer Science of the University of Alicante (Instituto Universitario para la Investigación en Informática, Universidad de Alicante). Alexandra Balahur, Andrés Montoyo, Patricio Martínez-Barco, Ester Boldrini WASSA 2012 Chairs iv Organizers: Alexandra Balahur European Commission Joint Research Centre Institute for the Protection and Security of the Citizen Andrés Montoyo University of Alicante Department of Software and Computing Systems Patricio Martínez-Barco University of Alicante Department of Software and Computing Systems Ester Boldrini University of Alicante Department of Software and Computing Systems Program Committee: Khurshid Ahmad, Trinity College Dublin (Ireland) Sivaji Bandyopadhyay, Jadavpur University (India) Nicoletta Calzolari, CNR Pisa (Italy) Erik Cambria, University of Stirling (U.K.) José Carlos Cortizo, European University Madrid (Spain) Michael Gamon, Microsoft (U.S.A.) Jesús M. Hermida, University of Alicante (Spain) Veronique Hoste, University of Ghent (Belgium) Mijail Kabadjov (Mexico) Zornitsa Kozareva, Information Sciences Institute California (U.S.A.) Rada Mihalcea, University of North Texas (U.S.A.) Saif Mohammad, National Research Council (Canada) Karo Moilanen, University of Oxford (U.K.)
Rafael Muñoz, University of Alicante (Spain) Günter Neumann, DFKI (Germany) Alena Neviarouskaya, University of Tokyo (Japan) Manabu Okumura, Tokyo Institute of Technology (Japan) v Constantin Orasan, University of Wolverhampton (U.K.) Manuel Palomar, University of Alicante (Spain) Viktor Pekar, University of Wolverhampton (U.K.) Paolo Rosso, Technical University of Valencia (Spain) Josef Steinberger, European Commission Joint Research Centre (Italy) Ralf Steinberger, European Commission Joint Research Centre (Italy) Veselin Stoyanov, Johns Hopkins University (U.S.A.) Hristo Tanev, European Commission Joint Research Centre (Italy) Maite Taboada, Simon Fraser University (Canada) Mike Thelwall, University of Wolverhampton (U.K.) José Antonio Troyano, University of Seville (Spain) Dan Tufis, RACAI (Romania) Alfonso Ureña, University of Jaén (Spain) Erik van der Goot, European Commission Joint Research Centre (Italy) Piek Vossen, Vrije Universiteit Amsterdam (The Netherlands) Marilyn Walker, University of California Santa Cruz (U.S.A.) Janyce Wiebe, University of Pittsburgh (U.S.A.) Michael Wiegand, Saarland University (Germany) Theresa Wilson, Johns Hopkins University (U.S.A.) Taras Zagibalov, Brandwatch (U.K.) Additional Reviewers: Elena Lloret, University of Alicante (Spain) Invited Speakers: Prof. Dr. Rada Mihalcea, University of North Texas (U.S.A.) Prof. Dr. Janyce Wiebe, University of Pittsburgh (U.S.A.) vi Table of Contents Multimodal Sentiment Analysis Rada Mihalcea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Subjectivity Word Sense Disambiguation Janyce Wiebe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Random Walk Weighting over SentiWordNet for Sentiment Polarity Detection on Twitter Arturo Montejo-Ráez, Eugenio Martínez-Cámara, M. Teresa Martín-Valdivia and L. Alfonso Ureña-López . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Mining Sentiments from Tweets Akshat Bakliwal, Piyush Arora, Senthil Madhappan, Nikhil Kapre, Mukesh Singh and Vasudeva Varma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media Muhammad Abdul-Mageed, Sandra Kuebler and Mona Diab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Opinum: statistical sentiment analysis for opinion classification Boyan Bonev, Gema Ramírez Sánchez and Sergio Ortiz Rojas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Sentimantics: Conceptual Spaces for Lexical Sentiment Polarity Representation with Contextuality Amitava Das and Gambäck Björn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Analysis of Travel Review Data from Reader's Point of View Maya Ando and Shun Ishizaki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Multilingual Sentiment Analysis using Machine Translation? Alexandra Balahur and Marco Turchi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52 Unifying Local and Global Agreement and Disagreement Classification in Online Debates Jie Yin, Nalin Narang, Paul Thomas and Cecile Paris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Prior versus Contextual Emotion of a Word in a Sentence Diman Ghazi, Diana Inkpen and Stan Szpakowicz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Cross-discourse Development of Supervised Sentiment Analysis in the Clinical Domain Phillip Smith and Mark Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 POLITICAL-ADS: An annotated corpus for modeling event-level evaluativity Kevin Reschke and Pranav Anand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Automatically Annotating A Five-Billion-Word Corpus of Japanese Blogs for Affect and Sentiment Anal- ysis Michal Ptaszynski, Rafal Rzepka, Kenji Araki and Yoshio Momouchi . . . . . . . . . . . . . . . . . . . . . . . 89 vii How to Evaluate Opinionated Keyphrase Extraction? Gábor Berend and Veronika Vincze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Semantic frames as an anchor representation for sentiment analysis Josef Ruppenhofer and Ines Rehbein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators Dasha Bogdanova, Paolo Rosso and Thamar Solorio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 viii Conference Program Thursday July 12, 2012 (8:30) Opening Remarks (8:40) Invited talk (I): Prof. Dr. Rada Mihalcea Multimodal Sentiment Analysis Rada Mihalcea (9:35) Invited talk (II): Prof. Dr. Janyce Wiebe Subjectivity Word Sense Disambiguation Janyce Wiebe (10:30) Break (11:00) Session 1: Subjectivity and Sentiment Analysis in Social Media Random Walk Weighting over SentiWordNet for Sentiment Polarity Detection on Twitter Arturo Montejo-Ráez, Eugenio Martı́nez-Cámara, M. Teresa Martı́n-Valdivia and L. Alfonso Ureña-López Mining Sentiments from Tweets Akshat Bakliwal, Piyush Arora, Senthil Madhappan, Nikhil Kapre, Mukesh Singh and Vasudeva Varma SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media Muhammad Abdul-Mageed, Sandra Kuebler and Mona Diab ix Thursday July 12, 2012 (continued) (12:30) Lunch Break (13:30) Session 2: Affect Detection and Classification (I) Opinum: statistical sentiment analysis for opinion classification Boyan Bonev, Gema Ramı́rez Sánchez and Sergio Ortiz Rojas Sentimantics: Conceptual Spaces for Lexical Sentiment Polarity Representation with Con- textuality Amitava Das and Gambäck Björn Analysis of Travel Review Data from Reader’s Point of View Maya Ando and Shun Ishizaki Multilingual Sentiment Analysis using Machine Translation? 
Alexandra Balahur and Marco Turchi (15:30) Break (16:00) Session 3: Affect Detection and Classification (II) Unifying Local and Global Agreement and Disagreement Classification in Online Debates Jie Yin, Nalin Narang, Paul Thomas and Cecile Paris Prior versus Contextual Emotion of a Word in a Sentence Diman Ghazi, Diana Inkpen and Stan Szpakowicz Cross-discourse Development of Supervised Sentiment Analysis in the Clinical Domain Phillip Smith and Mark Lee POLITICAL-ADS: An annotated corpus for modeling event-level evaluativity Kevin Reschke and Pranav Anand x Thursday July 12, 2012 (continued) (17:30) Session 4: Applications of Subjectivity and Sentiment Analysis Automatically Annotating A Five-Billion-Word Corpus of Japanese Blogs for Affect and Sentiment Analysis Michal Ptaszynski, Rafal Rzepka, Kenji Araki and Yoshio Momouchi How to Evaluate Opinionated Keyphrase Extraction? Gábor Berend and Veronika Vincze Semantic frames as an anchor representation for sentiment analysis Josef Ruppenhofer and Ines Rehbein On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators Dasha Bogdanova, Paolo Rosso and Thamar Solorio xi Multimodal Sentiment Analysis (Abstract of Invited Talk) Rada Mihalcea Department of Computer Science and Engineering University of North Texas P. O. Box 311366 Denton, TX 76203-6886, U.S.A.
[email protected]Abstract With more than 10,000 new videos posted online every day on social websites such as YouTube and Facebook, the internet is be- coming an almost infinite source of informa- tion. One important challenge for the com- ing decade is to be able to harvest relevant information from this constant flow of mul- timodal data. In this talk, I will introduce the task of multimodal sentiment analysis, and present a method that integrates linguistic, au- dio, and visual features for the purpose of identifying sentiment in online videos. I will first describe a novel dataset consisting of videos collected from the social media web- site YouTube, which were annotated for senti- ment polarity. I will then show, through com- parative experiments, that the joint use of vi- sual, audio, and textual features greatly im- proves over the use of only one modality at a time. Finally, by running evaluations on datasets in English and Spanish, I will show that the method is portable and works equally well when applied to different languages. This is joint work with Veronica Perez-Rosas and Louis-Philippe Morency. 1 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, page 1, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics Subjectivity Word Sense Disambiguation (Abstract of Invited Talk) Janyce Wiebe Department of Computer Science University of Pittsburgh Sennott Square Building, Room 5409 210 S. Bouquet St., Pittsburgh, PA 15260, U.S.A.
[email protected]Abstract Many approaches to opinion and sentiment analysis rely on lexicons of words that may be used to express subjectivity. These are com- piled as lists of keywords, rather than word meanings (senses). However, many keywords have both subjective and objective senses. False hits – subjectivity clues used with objec- tive senses – are a significant source of error in subjectivity and sentiment analysis. This talk will focus on sense-level opinion and sen- timent analysis. First, I will give the results of a study showing that even words judged in previous work to be reliable opinion clues have significant degrees of subjectivity sense ambiguity. Then, we will consider the task of distinguishing between the subjective and objective senses of words in a dictionary, and the related task of creating “usage inventories” of opinion clues. Given such distinctions, the next step is to automatically determine which word instances in a corpus are being used with subjective senses, and which are being used with objective senses (we call this task “SWSD”). We will see evidence that SWSD is more feasible than full word sense disam- biguation, because it is more coarse grained – often, the exact sense need not be pin- pointed, and that SWSD can be exploited to improve the performance of opinion and sen- timent analysis systems via sense-aware clas- sification. Finally, I will discuss experiments in acquiring SWSD data, via token-based con- text discrimination where the context vector representation is adapted to distinguish be- tween subjective and objective contexts, and the clustering process is enriched by pair-wise constraints, making it semi-supervised. 2 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, page 2, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics Random Walk Weighting over SentiWordNet for Sentiment Polarity Detection on Twitter A. Montejo-Ráez, E. Martı́nez-Cámara, M. T. Martı́n-Valdivia, L. A. Ureña-López University of Jaén E-23071, Jaén (Spain) {amontejo, emcamara, maite, laurena}@ujaen.es Abstract to publish any information in a simple way and to share it with their network of contacts or “friends”. This paper presents a novel approach in Sen- These social networks have also evolved and be- timent Polarity Detection on Twitter posts, by come a continuous flow of information. A clear ex- extracting a vector of weighted nodes from the ample is the microblogging platform Twitter4 . Twit- graph of WordNet. These weights are used ter publishes all kinds of information, disseminating on SentiWordNet to compute a final estima- tion of the polarity. Therefore, the method views on many different topics: politics, business, proposes a non-supervised solution that is economics and so on. Twitter users regularly pub- domain-independent. The evaluation over a lish their comments on a particular news item, a re- generated corpus of tweets shows that this cently purchased product or service, and ultimately technique is promising. on everything that happens around them. This has aroused the interest of the Natural Language Pro- cessing (NLP) community, which has begun to study 1 Introduction the texts posted on Twitter, and more specifically re- The birth of Web 2.0 supposed a breaking down of lated to Sentiment Analysis (SA) challenges. the barrier between the consumers and producers of In this manuscript we present a new approach to information, i.e. 
the Web has changed from a static resolve the scoring of posts according to the ex- container of information into a live environment in pressed positive or negative degree in the text. This which any user, in a very simple manner, can pub- polarity detection problem is resolved by combin- lish any type of information. This simplified means ing SentiWordNet scores with a random walk analy- of publication has led to the rise of several differ- sis of the concepts found in the text over the Word- ent websites specialized in the publication of users Net graph. In order to validate our non-supervised opinions. Some of the most well-known sites in- approach, several experiments have been performed clude Epinions1 , RottenTomatoes2 and Muchocine3 , to analyze major issues in our method and to com- where users express their opinions or criticisms on a pare it with other approaches like plain SentiWord- wide range of topics. Opinions published on the In- Net scoring or machine learning solutions such as ternet are not limited to certain sites, but rather can Support Vector Machines in a supervised approach. be found in a blog, forum, commercial website or The paper is structured as follows: first, an introduc- any other site allowing posts from visitors. tion to the polarity detection problem is provided, On of the most representative tools of the Web 2.0 followed by the description of our approach. Then, are social networks, which allow millions of users the experimental setup is given with a description of the generated corpus and the results obtained. Fi- 1 http://epinions.com nally, conclusions and further work are discussed. 2 http://rottentomatoes.com 3 4 http://muchocine.net http://twitter.com 3 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 3–10, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics 2 The polarity detection problem ity (which is very common, indeed): a real value in the interval [-1, 1] would be sufficient. Values In the literature related to the SA in long text a dis- over zero would reflect a positive emotion expressed tinction is made between studies of texts where we in the tweet, while values below zero would rather assume that the text is a opinion and therefore solely correspond to negative opinions. The closer to the need to calculate its polarity, and those in which be- zero value a post is, the more its neutrality would fore measuring polarity it is necessary to determine be. Therefore, a polarity detection system could be whether the text is subjective or objective. A wide represented as a function p on a text t such as: study on SA can be found in (Pang and Lee, 2008), (Liu, 2010) and (Tsytsarau and Palpanas, 2011). p : RN → R Concerning the study of the polarity in Twitter, most experiments assume that tweets5 are subjective. One so that p(t) ∈ [−1, 1]. We will define how to of the first studies on the classification of the polar- compute this function, but before an explanation of ity in tweets was published in 2009 by (Go et al., the techniques implied in such a computation is pro- 2009), in which the authors conducted a supervised vided. classification study of tweets in English. Zhang et al. (Zhang et al., 2011) proposed a hy- 3 The approach: Random Walk and brid method for the classification of the polarity in SentiWordNet Twitter, and they demonstrated the validity of their 3.1 The Random Walk algorithm method over an English corpus on Twitter. 
The clas- Personalized Page Rank vectors (PPVs) consists on sification is divided into two phases. The first one a ranked sequence of WordNet (Fellbaum, 1998) consists on applying a lexicon-based method. In synsets weighted according to a random walk algo- the second one the authors used the SVM algorithm rithm. Taking the graph of WordNet, where nodes to determine the polarity. For the machine learning are synsets and axes are the different semantic re- phase, it is needed a labelled corpus, so the purpose lations among them, and the terms contained in a of the lexicon-method is to tag the corpus. Thus, the tweet, we can select those synsets that correspond to authors selected a set of subjective words from all the closest sense for each term and. Then, it starts those available in English and added hash-tags with an iterative process so more nodes are selected if a subjective meaning. After labelling the corpus, it they are not far from these “seeds”. After a num- is used SVM for classifying new tweets. ber of iterations or a convergence of the weights, a In (Agarwal et al., 2011) a study was conducted final list of valued nodes can be retrieved. A simi- on a reduced corpus of tweets labelled manually. lar approach has been used recently by (Ramage et The experiment tests different methods of polarity al., 2009) to compute text semantic similarity in rec- classification and starts with a base case consisting ognizing textual entailment, and also as a solution on the simple use of unigrams. Then a tree-based for word sense disambiguation (Agirre and Soroa, model is generated. In a third step, several linguis- 2009). We have used the UKB software from this tic features are extracted and finally a final model last citation to generate the PPVs used in our system. learned as combination of the different models pro- Random walk algorithms are inspired originally by posed is computed. A common feature used both in the Google PageRank algorithm (Page et al., 1999). the tree-based model and in the feature-based one is The idea behind it is to represent each tweet as a vec- the polarity of the words appearing in each tweet. In tor weighted synsets that are semantically close to order to calculate this polarity the authors used DAL the terms included in the post. In some way, we are dictionary (Whissell, 1989). expanding these sort texts by a set of disambiguated Most of the proposed systems for polarity detec- concepts related to the terms included in the text. tion compute a value of negativeness or positiveness. As an example of a PPV,the text ”Overall, we’re Some of them even produce a neutrality value. We still having a hard time with it, mainly because we’re will consider the following measurement of polar- not finding it in an early phase.” becomes the vector 5 The name of posts in Twitter. of weighted synsets: 4 [02190088-a:0.0016, 12613907-n:0.0004, responding to those synset nodes which have been 01680996-a:0.0002, 00745831-a:0.0002, ...] activated during the random walk process. There- Here, the synset 02190088-a has a weight of fore, terms like dog and bite (both mainly neutral 0.0016, for example. in SentiWordNet) appearing in the same tweet could eventually be expanded with a more emotional term 3.2 SentiWordNet like hurt, which holds, in SentiWordNet, a negative SentiWordNet (Baccianella et al., 2008) is a lexi- score of 0.75. cal resource based on the well know WordNet (Fell- baum, 1998). 
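As a rough illustration of the Personalized PageRank vector (PPV) idea described above, the following sketch runs an off-the-shelf personalized PageRank over a toy synset graph. This is not the authors' UKB pipeline: the graph, the synset identifiers and the damping factor are assumptions made only for the example, and the real system operates over the full WordNet graph.

import networkx as nx

# Toy stand-in for the WordNet graph: nodes are synsets, edges are semantic relations.
G = nx.Graph()
G.add_edges_from([
    ("hard.a.01", "difficult.a.01"),
    ("difficult.a.01", "problem.n.01"),
    ("time.n.01", "period.n.01"),
    ("hard.a.01", "hurt.v.01"),   # a semantically close, more emotionally loaded sense
])

# Seed synsets: the senses matched against the terms of one tweet.
seeds = ["hard.a.01", "time.n.01"]
personalization = {n: (1.0 if n in seeds else 0.0) for n in G.nodes}

# Random walk with restarts towards the seeds; the resulting scores sum to 1 (L1-normalized).
ppv = nx.pagerank(G, alpha=0.85, personalization=personalization)

# The top-weighted synsets form the expanded representation of the tweet.
print(sorted(ppv.items(), key=lambda kv: -kv[1])[:4])

In the paper, a vector of this kind is then combined with the SentiWordNet positivity and negativity scores of the same synsets to obtain the tweet-level polarity estimate.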
It provides additional information on 4 Experiments and results synsets related to sentiment orientation. A synset is the basic item of information in WordNet and it Our experiments are focused in testing the validity represents a “concept” that is unambiguous. Most of applying this unsupervised approach compared to of the relations over the lexical graph use synsets a classical supervised one based on Support Vector as nodes (hyperonymy, synonymy, homonymy and Machines (Joachims, 1998). To this end, the corpus more). SentiWordNet returns from every synset a has been processed obtaining lemmas, as this is the set of three scores representing the notions of “pos- preferred input for the UKB software. The algorithm itivity”, “negativity” and “neutrality”. Therefore, takes the whole WordNet graph and performs a dis- every concept in the graph is weighting accord- ambiguation process of the terms as a natural con- ing to its subjectivity and polarity. The last ver- sequence of applying random walk over the graph. sion of SentiWordNet (3.0) has been constructed In this way, the synsets that are associated to these starting from manual annotations of previous ver- terms are all of them initialized. Then, the iterative sions, populating the whole graph by applying a ran- process of the algorithm (similar to Page Rank but dom walk algorithm. This resource has been used optimized according to an stochastic solution) will by the opinion mining community, as it provides a change these initial values and propagate weights to domain-independent resource to get certain informa- closer synsets. An interesting effect of this process is tion about the degree of emotional charge of its con- that we can actually obtain more concepts that those cepts (Denecke, 2008; Ogawa et al., 2011). contained in the tweet, as all the related ones will also finalize with a certain value due to the propaga- 3.3 Computing the final estimation tion of weights across the graph. We believe that our As a combination of SentiWordNet scores with ran- approach benefits from this effect, as texts in tweets dom walk weights is wanted, it is important that use to suffer from a very sort length, allowing us to the final equation leads to comparable values. To expand short posts. this end, the weights associated to synsets after the Another concern is, therefore, the final size of the random walk process are L1 normalized so vectors PPV vector. If too many concepts are taken into ac- of “concepts” sum up the unit as maximum value. count we may introduce noise in the understanding The final polarity score is obtained by the product of of the latent semantic of the text. In order to study this vector with associated SentiWordNet vector of this fact, different sizes of the vector have been ex- scores, as expressed in equation 1. plored and evaluated. r·s p= (1) 4.1 Our Twitter corpus |t| where p is the final score, r is the vector of The analysis of the polarity on microblogging is a weighted synsets computed by the random walk al- very recent task, so there are few free resources gorithm of the tweet text over WordNet, s is the vec- (Saša et al., 2010). Thus, we have collected our tor of polarity scores from SentiWordNet, t is the own English corpus in order to accomplish the ex- set of concepts derived from the tweet. The idea be- periments. 
The work of downloading tweets is not hind it is to “expand” the set of concepts with addi- nearly difficult due to the fact that Twitter offers two tional ones that are close in the WordNet graph, cor- kinds of API to those purposes. We have used the 5 Search API of Twitter6 for automatically accessing Emoticons mapped to :) :) :) :-) tweets through a query. For a supervised polarity (positive tweets) study and to evaluate our approach, we need to gen- ;) ;-) =) erate a labelled corpus. We have built a corpus of ˆˆ :-D :D tweets written in English following the procedure :d =D C: described in (Read, 2005) and (Go et al., 2009). Xd XD xD According to (Read, 2005), when authors of an Xd (x (= electronic communication use an emotion, they are ˆˆ ˆoˆ ’u’ effectively marking up their own text with an emo- nn *-* *O* tional state. The main feature of Twitter is that the *o* ** length of the messages must be 140 characters, so Emoticons mapped to :( :-( :( :(( the users have to express their opinions, thoughts, (negative tweets) and emotional states with few words. Therefore, :( D: Dx frequently users write “smileys” in their tweets. ’n’ :\ /: Thus, we have used positive emoticons to label pos- ):-/ :’ =’[ itive tweets and negative emoticons to tag negative :( /T T TOT tweets. The full list of emoticons that we have con- ;; sidered to label the retrieved tweets can be found in Table 1: Emoticons considered as positives and negatives Table 1. So, following (Go et al., 2009), the pre- sumption in the construction of the corpus is that the query “:)” returns tweets with positive smileys, and to another one, he or she introduces a Mention. the query “:(” retrieves negative emotions. We have A Mention is easily recognizable because all of collected a set of 376,296 tweets (181,492 labelled them start with the symbol “@” followed by the as positive tweets and 194,804 labelled as negative user name. We consider that this feature does tweets), which were published on Twitter’s public not provide any relevance information, so we message board from September 14th 2010 to March have removed the mentions in all the tweets. 19th 2011. Table 2 lists other characteristics of the 3. Links: It is very common that tweets include corpus. web directions. In our approach we do not ana- On the other hand, the language used in Twit- lyze the documents that links those urls, so we ter has some unique attributes, which have been re- have eliminated them from all tweets. moved because they do not provide relevant infor- mation for the polarity detection process. These spe- 4. Hash-tags: A hash-tag is the name of a topic cific features are: in Twitter. Anybody can begin a new topic by typing the name of the topic preceded by the 1. Retweets: A retweet is the way to repeat a mes- symbol “#”. For this work we do not classify sage that users consider interesting. Retweets topics so we have neglected all the hash-tags. can be done through the web interface using the Retweet option, or as the old way writing Due to the fact that users usually write tweets RT, the user name and the post to retweet. The with a very casual language, it is necessary to pre- first way is not a problem because is the same process the raw tweets before feeding the sentiment tweet, so the API only return it once, but old analyzer. For that purpose we have applied the fol- way retweets are different tweets but with the lowing filters: same content, so we removed them to avoid pit- 1. 
Remove new lines: Some users write tweets ting extra weight on any particular tweet. in two or three different lines, so all newlines 2. Mentions: Other feature of Twitter is the so symbols were removed. called Mentions. When a user wants to refer 2. Opposite emoticons: Twitter sometimes con- 6 https://dev.twitter.com/docs/api/1/get/search siders positive or negative a tweet with smileys 6 Total as a different word. Thus, we have normalized Positive tweets 181,492 all the repeated letters, and any letter occurring Negative tweets 194,804 376,296 more than two times in a word is replaced with Unique users in positive 157,579 two occurrences. The example above would be tweets converted into: Unique users in negative 167,479 325,058 blood drive todayy :) everyone tweets donatee!! Words in positive tweets 418,234 Words in negative tweets 334,687 752,921 5. Laugh: There is not a unique manner to ex- Average number of 9 press laugh. Therefore, we have normalized words per positive tweet the way to write laugh. Table 4 lists the con- Average number of 10 versions. words per negative tweet Laugh Conversion Table 2: Statistical description of the corpus. hahahaha... haha hehehehe... hehe hihihihi... hihi that have opposite senses. For example: hohohoho... hoho @Harry Styles I have all day to try huhuhuhu... huhu get a tweet off you :) when are Lol haha you coming back to dublin i missed Huashuashuas huas you last time,I was in spain :( muahahaha Buaha The tweet has two parts one positive and the buahahaha Buaha other one negative, so the post cannot be con- Table 4: Normalization for expressions considered as sidered as positive, but the search API returns “Laugh” as a positive tweet because it has the positive smiley “:)”. We have removed this kind of Finally, although the emoticons have been used tweets in order to avoid ambiguity. to tag the positive and negative samples, the fi- 3. Emoticons with no clear sentiment: The nal corpora does not include these emoticons. Twitter Search API considers some emoticons In addition, all the punctuation characters have like “:P” or “:PP” as negative. However, some been neglected in order to reduce the noise in users do not type them to express a negative the data. Figure 1 shows the process to gener- sentiment. Thus, we have got rid of all tweets ate our Twitter corpus. with this kind of smileys (see Table 3). 4.2 Results obtained Fuzzy emoticons :-P :P :PP \( Our first experiment consisted on evaluating a super- vised approach, like Support Vector Machines, us- Table 3: Emoticons considered as fuzzy sentiments ing the well know vector space model to build the vector of features. Each feature corresponds to the 4. Repeated letters: Users frequently repeat sev- TF.IDF weight of a lemma. Stop words have not eral times letters of some words to emphasize been removed and the minimal document frequency their messages. For example: required was two, that is, if the lemma is not present Blood drive todayyyy!!!!! :) in two o more tweets, then it is discarded as a di- Everyone donateeeee!! mension in the vectors. The SVM-Light7 software was used to compute support vectors and to evaluate This can be a problem for the classification pro- them using a random leave-one-out strategy. From cess, because the same word with different rep- 7 etitions of the same letter would be considered http://svmlight.joachims.org/ 7 of the graphs, the size of the PPV vectors affects the performance. 
Sizes above 10 presents an sta- ble behavior, that is, considering a large number of synsets does not improves the performance of the system, but it gets worse neither. The WordNet graph considered for the random walk algorithm in- cludes antonyms relations, so we wanted to check whether discarding these connections would affect the system. From these graphs we can extract the conclusion that antonyms relations are worth keep- ing. Figure 1: Corpus generation work-flow a total of 376,284 valid samples 85,423 leave-one- out evaluations were computed. This reported the following measurements: Figure 2: Precision values against PPV sizes Precision Recall F1 0.6429 0.6147 0.6285 In our first implementation of our method, the fi- nal polarity score is computed as described in equa- tion 1. More precisely, it is the average of the prod- uct between the difference of positive and negative SentiWordNet scores, and the weight obtained with the random walk algorithm, as unveiled in equa- tion 2. P − rws · (swn+ s − swns ) p = ∀s∈t (2) |t| Where s is a synset in the tweet t, rws is the weight of the synset s after the random walk pro- Figure 3: Recall values against PPV sizes cess over WordNet, swn+ − s and swns ) are positive and negative scores for the synset s retrieved from SentiWordNet. Comparing our best configuration to the SVM ap- The results obtained are graphically shown in fig- proach, the results are not better, but quite close (ta- ures 2, 3 and 4 for precision, recall and F1 values ble 5). Therefore, this unsupervised solution is an respectively. As can be noticed from the shapes interesting alternative to the supervised one. 8 just final scores. As an additional task, the process- ing of original texts is important. The numerous grammatical and spelling errors found in this fast way of publication demand for a better sanitization of the incoming data. An automatic spell checker is under development. As final conclusion, we believe that this first at- tempt is very promising and that it has arose many relevant questions on the subject of sentiment analy- sis. More extensive research and experimentation is being undertaken from the starting point introduced in this paper. Figure 4: F1 values against PPV sizes Acknowledgments Precision Recall F1 This work has been partially supported by a grant SVM 0.6429 0.6147 0.6285 from the Fondo Europeo de Desarrollo Regional RW·SWN 0.6259 0.6207 0.6233 (FEDER), TEXT-COOL 2.0 project (TIN2009- 13391-C04-02) from the Spanish Government. This Table 5: Approaches comparative table paper is partially funded by the European Commis- sion under the Seventh (FP7 - 2007-2013) Frame- work Programme for Research and Technologi- 5 Conclusions and further work cal Development through the FIRST project (FP7- A new unsupervised approach to the polarity detec- 287607). This publication reflects the views only tion problem in Twitter posts has been proposed. By of the author, and the Commission cannot be held combining a random walk algorithm that weights responsible for any use which may be made of the synsets from the text with polarity scores provided information contained therein. by SentiWordNet, it is possible to build a system comparable to a SVM based supervised approach in References terms of performance. Our solution is a general ap- proach that do not suffer from the disadvantages as- Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, sociated to supervised ones: need of a training cor- and Rebecca Passonneau. 2011. 
Sentiment analysis pus and dependence on the domain where the model of twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38, was obtained. Portland, Oregon, jun. Association for Computational Many issues remain open and they will drive our Linguistics. future work. How to deal with negation is a ma- Eneko Agirre and Aitor Soroa. 2009. Personalizing jor concern, as the score from SentiWordNet should pagerank for word sense disambiguation. In EACL be considered in a different way in the final com- ’09: Proceedings of the 12th Conference of the Eu- putation if the original term comes from a negated ropean Chapter of the Association for Computational phrase. Our “golden rules” must be taken carefully, Linguistics, pages 33–41, Morristown, NJ, USA. As- because emoticons are a rough way to classify the sociation for Computational Linguistics. polarity of tweets. Actually, we are working in the Stefano Baccianella, Andrea Esuli, and Fabrizio Sebas- tiani. 2008. Sentiwordnet 3.0 : An enhanced lexical generation of a new corpus in the politics domain resource for sentiment analysis and opinion mining. that is now under a manual labeling process. An- Proceedings of the Seventh conference on Interna- other step is to face certain flaws in the computation tional Language Resources and Evaluation LREC10, of the final score. In this sense, we plan to study 0:2200–2204. the context of a tweet among the time line of tweets K. Denecke. 2008. Using sentiwordnet for multilingual from that user to identify publisher’s mood and ad- sentiment analysis. In Data Engineering Workshop, 9 2008. ICDEW 2008. IEEE 24th International Confer- ence on, pages 507 –512, april. Christiane Fellbaum, editor. 1998. WordNet: An Elec- tronic Lexical Database. MIT Press, Cambridge, MA. Alec Go, Richa Bhayani, and Lei Huang. 2009. Twit- ter sentiment classification using distant supervision. Processing, pages 1–6. T. Joachims. 1998. Text categorization with support vec- tor machines: learning with many relevant features. In European Conference on Machine Learning (ECML). Bing Liu. 2010. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, 2nd ed. Tatsuya Ogawa, Qiang Ma, and Masatoshi Yoshikawa. 2011. News Bias Analysis Based on Stakeholder Min- ing. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, E94D(3):578–586, MAR. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University. Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1– 135. Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random walks for text seman- tic similarity. In TextGraphs-4: Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 23–31, Morristown, NJ, USA. Association for Computational Linguistics. Jonathon Read. 2005. Using emoticons to reduce de- pendency in machine learning techniques for senti- ment classification. In Proceedings of the ACL Stu- dent Research Workshop, ACLstudent ’05, pages 43– 48, Stroudsburg, PA, USA. Association for Computa- tional Linguistics. Petrović Saša, Miles Osborne, and Victor Lavrenko. 2010. The edinburgh twitter corpus. In Proceed- ings of the NAACL HLT 2010 Workshop on Compu- tational Linguistics in a World of Social Media, WSA ’10, pages 25–26, Stroudsburg, PA, USA. Association for Computational Linguistics. Mikalai Tsytsarau and Themis Palpanas. 
2011. Survey on mining subjective data on the web. Data Mining and Knowledge Discovery, pages 1–37, October. C M Whissell, 1989. The dictionary of affect in lan- guage, volume 4, pages 113–131. Academic Press. Ley Zhang, Riddhiman Ghosh, Mohamed Dekhil, Me- ichun Hsu, and Bing Liu. 2011. Combining lexicon- based and learning-based methods for twitter senti- ment analysis. Technical Report HPL-2011-89, HP, 21/06/2011. 10 Mining Sentiments from Tweets Akshat Bakliwal, Piyush Arora, Senthil Madhappan Nikhil Kapre, Mukesh Singh and Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Hyderabad. {akshat.bakliwal, piyush.arora}@research.iiit.ac.in, {senthil.m, nikhil.kapre, mukeshkumar.singh}@students.iiit.ac.in,
[email protected]Abstract and thus is a useful source of information. Users often discuss on current affairs and share their per- Twitter is a micro blogging website, where sonals views on various subjects via tweets. users can post messages in very short text Out of all the popular social media’s like Face- called Tweets. Tweets contain user opin- book, Google+, Myspace and Twitter, we choose ion and sentiment towards an object or per- Twitter because 1) tweets are small in length, thus son. This sentiment information is very use- ful in various aspects for business and gov- less ambigious; 2) unbiased; 3) are easily accessible ernments. In this paper, we present a method via API; 4) from various socio-cultural domains. which performs the task of tweet sentiment In this paper, we introduce an approach which can identification using a corpus of pre-annotated be used to find the opinion in an aggregated col- tweets. We present a sentiment scoring func- lection of tweets. In this approach, we used two tion which uses prior information to classify different datasets which are build using emoticons (binary classification ) and weight various sen- and list of suggestive words respectively as noisy la- timent bearing words/phrases in tweets. Us- ing this scoring function we achieve classifi- bels. We give a new method of scoring “Popularity cation accuracy of 87% on Stanford Dataset Score”, which allows determination of the popular- and 88% on Mejaj dataset. Using supervised ity score at the level of individual words of a tweet machine learning approach, we achieve classi- text. We also emphasis on various types and levels fication accuracy of 88% on Stanford dataset. of pre-processing required for better performance. Roadmap for rest of the paper: Related work is discussed in Section 2. In Section 3, we describe 1 Introduction our approach to address the problem of Twitter With enormous increase in web technologies, num- sentiment classification along with pre-processing ber of people expressing their views and opinions steps.Datasets used in this research are discussed in via web are increasing. This information is very Section 4. Experiments and Results are presented in useful for businesses, governments and individuals. Section 5. In Section 6, we present the feature vector With over 340+ million Tweets (short text messages) approach to twitter sentiment classification. Section per day, Twitter is becoming a major source of infor- 7 presents as discussion on the methods and we con- mation. clude the paper with future work in Section 8. Twitter is a micro-blogging site, which is popular because of its short text messages popularly known 2 Related Work as “Tweets”. Tweets have a limit of 140 characters. Research in Sentiment Analysis of user generated Twitter has a user base of 140+ million active users1 content can be categorized into Reviews (Turney, 1 As on March 21, 2012. Source: 2002; Pang et al., 2002; Hu and Liu, 2004), Blogs http://en.wikipedia.org/wiki/Twitter (Draya et al., 2009; Chesley, 2006; He et al., 2008), 11 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 11–18, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics News (Godbole et al., 2007), etc. All these cat- tion method and achieved the highest accuracy of egories deal with large text. On the other hand, 83.33% on a hand labeled test dataset. 
Tweets are shorter length text and are difficult to (Agarwal et al., 2011) performed three class (pos- analyse because of its unique language and struc- itive, negative and neutral) classification of tweets. ture. They collected their dataset using Twitter stream (Turney, 2002) worked on product reviews. Tur- API and asked human judges to annotate the data ney used adjectives and adverbs for performing into three classes. They had 1709 tweets of each opinion classification on reviews. He used PMI-IR class making a total of 5127 in all. In their research, algorithm to estimate the semantic orientation of the they introduced POS-specific prior polarity features sentiment phrase. He achieved an average accuracy along with twitter specific features. They achieved of 74% on 410 reviews of different domains col- max accuracy of 75.39% for unigram + senti fea- lected from Epinion. (Hu and Liu, 2004) performed tures. feature based sentiment analysis. Using Noun-Noun phrases they identified the features of the products Our work uses (Go et al., 2009) and (Bora, 2012) and determined the sentiment orientation towards datasets for this research. We use Naive Bayes each feature. (Pang et al., 2002) tested various ma- method to decide the polarity of tokens in the tweets. chine learning algorithms on Movie Reviews. He Along with that we provide an useful insight on how achieved 81% accuracy in unigram presence feature preprocessing should be done on tweet. Our method set on Naive Bayes classifier. of Senti Feature Identification and Popularity Score (Draya et al., 2009) tried to identify domain spe- perform well on both the datasets. In feature vec- cific adjectives to perform blog sentiment analysis. tor approach, we show the contribution of individual They considered the fact that opinions are mainly NLP and Twitter specific features. expressed by adjectives and pre-defined lexicons fail to identify domain information. (Chesley, 2006) per- formed topic and genre independent blog classifica- 3 Approach tion, making novel use of linguistic features. Each post from the blog is classified as positive, negative and objective. Our approach can be divided into various steps. To the best of our knowledge, there is very less Each of these steps are independent of the other but amount of work done in twitter sentiment analy- important at the same time. sis. (Go et al., 2009) performed sentiment analy- sis on twitter. They identified the tweet polarity us- ing emoticons as noisy labels and collected a train- 3.1 Baseline ing dataset of 1.6 million tweets. They reported an accuracy of 81.34% for their Naive Bayes classi- In the baseline approach, we first clean the tweets. fier. (Davidov et al., 2010) used 50 hashtags and 15 We remove all the special characters, targets (@), emoticons as noisy labels to create a dataset for twit- hashtags (#), URLs, emoticons, etc and learn the ter sentiment classification. They evaluate the effect positive & negative frequencies of unigrams in train- of different types of features for sentiment extrac- ing. Every unigram token is given two probability tion. (Diakopoulos and Shamma, 2010) worked on scores: Positive Probability (Pp ) and Negative Prob- political tweets to identify the general sentiments of ability (Np ) (Refer Equation 1). We follow the same the people on first U.S. presidential debate in 2008. cleaning process for the test tweets. After clean- (Bora, 2012) also created their dataset based on ing the test tweets, we form all the possible uni- noisy labels. 
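To make the baseline of Section 3.1 concrete, the sketch below strips Twitter-specific tokens and scores a tweet by summing, over its unigrams, the difference between the positive and negative probabilities of Equation 1. The regular expressions and the toy training tweets are illustrative assumptions, not the authors' exact cleaning rules.

import re
from collections import Counter

def clean(tweet):
    tweet = re.sub(r"http\S+", " ", tweet)           # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)           # targets and hashtags
    tweet = re.sub(r"[^a-z\s]", " ", tweet.lower())  # emoticons, punctuation, digits
    return tweet.split()

pos_train = ["I love this phone :)", "what a wonderful day"]   # toy labelled corpus
neg_train = ["I hate waiting :(", "this is awful"]

pos_freq = Counter(w for t in pos_train for w in clean(t))
neg_freq = Counter(w for t in neg_train for w in clean(t))

def tweet_score(tweet):
    score = 0.0
    for w in clean(tweet):
        pf, nf = pos_freq[w], neg_freq[w]
        if pf + nf == 0:
            continue                                  # unseen unigram
        score += pf / (pf + nf) - nf / (pf + nf)      # P_p - N_p for this token
    return score                                      # > 0 -> positive, otherwise negative

print(tweet_score("I hate this awful phone"))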
They created a list of 40 words (pos- grams and check for their frequencies in the training itive and negative) which were used to identify the model. We sum up the positive and negative proba- polarity of tweet. They used a combination of bility scores of all the constituent unigrams, and use a minimum word frequency threshold and Cate- their difference (positive - negative) to find the over- gorical Proportional Difference as a feature selec- all score of the tweet. If tweet score is > 0 then it is 12 positive otherwise negative. words don’t carry any sentiment information and thus are of no use to us. We create a list of stop Pf = F requency in P ositive T raining Set words like he, she, at, on, a, the, etc. and ignore Nf = F requency in N egative T raining Set them while scoring. We also discard words which Pp = P ositive P robability of the token. are of length ≤ 2 for scoring the tweet. = Pf /(Pf + Nf ) 3.5 Spell Correction Np = N egative P robability of the token. = Nf /(Pf + Nf ) Tweets are written in random form, without any fo- cus given to correct structure and spelling. Spell (1) correction is an important part in sentiment analy- 3.2 Emoticons and Punctuations Handling sis of user- generated content. Users type certain characters arbitrary number of times to put more em- We make slight changes in the pre-processing mod- phasis on that. We use the spell correction algo- ule for handling emoticons and punctuations. We rithm from (Bora, 2012). In their algorithm, they use the emoticons list provided by (Agarwal et al., replace a word with any character repeating more 2011) in their research. This list2 is built from than twice with two words, one in which the re- wikipedia list of emoticons3 and is hand tagged into peated character is placed once and second in which five classes (extremely positive, positive, neutral, the repeated character is placed twice. For example negative and extremely negative). In this experi- the word ‘swwweeeetttt’ is replaced with 8 words ment, we replace all the emoticons which are tagged ‘swet’, ‘swwet’, ‘sweet’, ‘swett’, ‘swweet’, and so positive or extremely positive with ‘zzhappyzz’ and on. rest all other emoticons with ‘zzsadzz’. We append and prepend ‘zz’ to happy and sad in order to pre- Another common type of spelling mistakes oc- vent them from mixing into tweet text. At the end, cur because of skipping some of characters from the ‘zzhappyzz’ is scored +1 and ‘zzsadzz’ is scored -1. spelling. like “there” is generally written as “thr”. Exclamation marks (!) and question marks (?) Such types of spelling mistakes are not currently also carry some sentiment. In general, ‘!’ is used handled by our system. We propose to use phonetic when we have to emphasis on a positive word and level spell correction method in future. ‘?’ is used to highlight the state of confusion or disagreement. We replace all the occurrences of ‘!’ 3.6 Senti Features with ‘zzexclaimzz’ and of ‘?’ with ‘zzquestzz’. We At this step, we try to reduce the effect of non- add 0.1 to the total tweet score for each ‘!’ and sub- sentiment bearing tokens on our classification sys- tract 0.1 from the total tweet score for each ‘?’. 0.1 tem. In the baseline method, we considered all the is chosen by trial and error method. unigram tokens equally and scored them using the Naive Bayes formula (Refer Equation 1). Here, we 3.3 Stemming try to boost the scores of sentiment bearing words. We use Porter Stemmer4 to stem the tweet words. 
In this step, we look for each token in a pre-defined We modify porter stemmer and restrict it to step 1 list of positive and negative words. We use the list of only. Step 1 gets rid of plurals and -ed or -ing. of most commonly used positive and negative words provided by Twitrratr5 . When we come across a to- 3.4 Stop Word Removal ken in this list, instead of scoring it using the Naive Stop words play a negative role in the task of senti- Bayes formula (Refer Equation 1), we score the to- ment classification. Stop words occur in both pos- ken +/- 1 depending on the list in which it exist. All itive and negative training set, thus adding more the tokens which are missing from this list went un- ambiguity in the model formation. And also, stop der step 3.3, 3.4, 3.5 and were checked for their oc- 2 http://goo.gl/oCSnQ currence after each step. 3 http://en.wikipedia.org/wiki/List of emoticons 4 5 http://tartarus.org/m̃artin/PorterStemmer/ http://twitrratr.com/ 13 3.7 Noun Identification After doing all the corrections (3.3 - 3.6) on a word, we look at the reduced word if it is being converted to a Noun or not. We identify the word as a Noun word by looking at its part of speech tag in English WordNet(Miller, 1995). If the majority sense (most commonly used sense) of that word is Noun, we discard the word while scoring. Noun words don’t carry sentiment and thus are of no use in our experi- ments. 3.8 Popularity Score This scoring method boosts the scores of the most commonly used words, which are domain specific. Figure 1: Flow Chart of our Algorithm For example, happy is used predominantly for ex- pressing the positive sentiment. In this method, we multiple its popularity factor (pF) to the score of 4 Datasets each unigram token which has been scored in the In this section, we explain the two datasets used in previous steps. We use the occurrence frequency of this research. Both of these datasets are built using a token in positive and negative dataset to decide on noisy labels. the weight of popularity score. Equation 2 shows how the popularity factor is calculated for each to- 4.1 Stanford Dataset ken. We selected a threshold 0.01 min support as the This dataset(Go et al., 2009) was built automat- cut-off criteria and reduced it by half at every level. ically using emoticons as noisy labels. All the Support of a word is defined as the proportion of tweets which contain ‘:)’ were marked positive and tweets in the dataset which contain this token. The tweets containing ‘:(’ were marked negative. Tweets value 0.01 is chosen such that we cover a large num- that did not have any of these labels or had both ber of tokens without missing important tokens, at were discarded. The training dataset has ∼1.6 mil- the same time pruning less frequent tokens. lion tweets, equal number of positive and negative tweets. The training dataset was annotated into two Pf = F requency in P ositive T raining Set classes (positive and negative) while the testing data was hand annotated into three classes (positive, neg- Nf = F requency in N egative T raining Set ative and neutral). For our experimentation, we use if (Pf − Nf ) > 1000) only positive and negative class tweets from the test- pF = 0.9; ing dataset for our experimentation. Table 1 gives elseif ((Pf − Nf ) > 500) the details of dataset. 
4 Datasets

In this section, we explain the two datasets used in this research. Both of these datasets were built using noisy labels.

4.1 Stanford Dataset

This dataset (Go et al., 2009) was built automatically using emoticons as noisy labels. All the tweets which contain ':)' were marked positive and tweets containing ':(' were marked negative. Tweets that did not have any of these labels or had both were discarded. The training dataset has ~1.6 million tweets, with an equal number of positive and negative tweets. The training dataset was annotated into two classes (positive and negative), while the testing data was hand annotated into three classes (positive, negative and neutral). We use only the positive and negative class tweets from the testing dataset for our experimentation. Table 1 gives the details of the dataset.

Training Tweets
  Positive      800,000
  Negative      800,000
  Total       1,600,000
Testing Tweets
  Positive          180
  Negative          180
  Objective         138
  Total             498

Table 1: Stanford Twitter Dataset

4.2 Mejaj

The Mejaj dataset (Bora, 2012) was built using noisy labels. They collected a set of 40 words and manually categorized them into positive and negative. They label a tweet as positive if it contains any of the positive sentiment words and as negative if it contains any of the negative sentiment words. Tweets which do not contain any of these noisy labels and tweets which have both positive and negative words were discarded. Table 2 gives the list of words which were used as noisy labels. This dataset contains only two-class data. Table 3 gives the details of the dataset.

Positive Labels: amazed, amused, attracted, cheerful, delighted, elated, excited, festive, funny, hilarious, joyful, lively, loving, overjoyed, passion, pleasant, pleased, pleasure, thrilled, wonderful
Negative Labels: annoyed, ashamed, awful, defeated, depressed, disappointed, discouraged, displeased, embarrassed, furious, gloomy, greedy, guilty, hurt, lonely, mad, miserable, shocked, unhappy, upset

Table 2: Noisy Labels for annotating Mejaj Dataset

Training Tweets
  Positive      668,975
  Negative      795,661
  Total       1,464,638
Testing Tweets
  Positive          198
  Negative          204
  Total             402

Table 3: Mejaj Dataset

5 Experiment

In this section, we explain the experiments carried out using the above proposed approach.

5.1 Stanford Dataset

On this dataset (Go et al., 2009), we perform a series of experiments. In the first series of experiments, we train on the given training data and test on the testing data. In the second series of experiments, we perform 5-fold cross validation using the training data. Table 4 shows the results of each of these experiments for the steps explained in the Approach section (Section 3).

In Table 4, we give results for each of the steps (emoticons and punctuations handling, spell correction, stemming and stop word removal) mentioned in the Approach section (Section 3). The Baseline + All Combined result refers to the combination of these steps performed together. Series 2 results are the average accuracy over the folds.
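For clarity, the two experimental series can be written down as a small evaluation harness. The sketch below is illustrative only; the `train_fn` and `predict_fn` callables are placeholders for whichever classifier (the scoring method or the SVM of Section 6) is being evaluated.

```python
import random

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def series1(train_fn, predict_fn, train_x, train_y, test_x, test_y):
    """Series 1: train on the given training data, test on the given test data."""
    model = train_fn(train_x, train_y)
    return accuracy(test_y, predict_fn(model, test_x))

def series2(train_fn, predict_fn, train_x, train_y, k=5, seed=0):
    """Series 2: k-fold cross validation on the training data;
    the reported figure is the average accuracy over the folds."""
    idx = list(range(len(train_x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        model = train_fn([train_x[i] for i in tr], [train_y[i] for i in tr])
        pred = predict_fn(model, [train_x[i] for i in fold])
        scores.append(accuracy([train_y[i] for i in fold], pred))
    return sum(scores) / k
```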
5.2 Mejaj Dataset

A similar series of experiments was performed on this dataset (Bora, 2012) too. In the first series of experiments, training and testing were done on the respective given datasets. In the second series of experiments, we perform 5-fold cross validation on the training data. Table 5 shows the results of each of these experiments.

In Table 5, we give results for each of the steps (emoticons and punctuations handling, spell correction, stemming and stop word removal) mentioned in the Approach section (Section 3). The Baseline + All Combined result refers to the combination of these steps performed together. Series 2 results are the average accuracy over the folds.

5.3 Cross Dataset

To validate the robustness of our approach, we experimented with cross-dataset training and testing. We trained our system on one dataset and tested on the other dataset. Table 6 reports the results of the cross-dataset evaluations.

Method                                  Series 1 (%)   Series 2 (%)
Baseline                                    78.8           80.1
Baseline + Emoticons + Punctuations         81.3           82.1
Baseline + Spell Correction                 81.3           81.6
Baseline + Stemming                         81.9           81.7
Baseline + Stop Word Removal                81.7           82.3
Baseline + All Combined (AC)                83.5           85.4
AC + Senti Features (wSF)                   85.5           86.2
wSF + Noun Identification (wNI)             85.8           87.1
wNI + Popularity Score                      87.2           88.4

Table 4: Results on Stanford Dataset

Method                                  Series 1 (%)   Series 2 (%)
Baseline                                    77.1           78.6
Baseline + Emoticons + Punctuations         80.3           80.4
Baseline + Spell Correction                 80.1           80.0
Baseline + Stemming                         79.1           79.7
Baseline + Stop Word Removal                80.2           81.7
Baseline + All Combined (AC)                82.9           84.1
AC + Senti Features (wSF)                   86.8           87.3
wSF + Noun Identification (wNI)             87.6           88.2
wNI + Popularity Score                      88.1           88.1

Table 5: Results on Mejaj Dataset

Method                    Training Dataset   Testing Dataset   Accuracy
wNI + Popularity Score    Stanford           Mejaj             86.4%
wNI + Popularity Score    Mejaj              Stanford          84.7%

Table 6: Results on Cross Dataset evaluation

6 Feature Vector Approach

In this feature vector approach, we form features using unigrams, bigrams, hashtags (#), targets (@), emoticons and the special symbol '!', and use a semi-supervised SVM classifier. Our feature vector comprises 11 features. We divide the features into two groups, NLP features and Twitter-specific features. The NLP features include the frequency of positive unigrams matched, negative unigrams matched, positive bigrams matched, negative bigrams matched, etc., and the Twitter-specific features include emoticons, targets, hashtags, URLs, etc. Table 7 shows the features we have considered.

NLP
  Unigram (f1)            # of positive and negative unigrams
  Bigram (f2)             # of positive and negative bigrams
Twitter Specific
  Hashtags (f3)           # of positive and negative hashtags
  Emoticons (f4)          # of positive and negative emoticons
  URLs (f5)               binary feature: presence of URLs
  Targets (f6)            binary feature: presence of targets
  Special Symbols (f7)    binary feature: presence of '!'

Table 7: Features and Description

Hashtag polarity is decided based on the constituent words of the hashtag. Using the list of positive and negative words from Twitrratr (http://twitrratr.com/), we check whether a hashtag contains any of these words. If so, we assign their polarity to the hashtag. For example, "#imsohappy" contains the positive word "happy", thus this hashtag is considered a positive hashtag. We use the emoticons list provided by (Agarwal et al., 2011) in their research. This list (http://goo.gl/oCSnQ) is built from the Wikipedia list of emoticons (http://en.wikipedia.org/wiki/List_of_emoticons) and is hand tagged into five classes (extremely positive, positive, neutral, negative and extremely negative). We reduce this five-class list to two classes by merging the extremely positive and positive classes into a single positive class, and the remaining classes (extremely negative, negative and neutral) into a single negative class. Table 8 reports the accuracy of our machine learning classifier on the Stanford dataset.

Feature Set                         Accuracy (Stanford)
f1 + f2                                  85.34%
f3 + f4 + f7                             53.77%
f3 + f4 + f5 + f6 + f7                   60.12%
f1 + f2 + f3 + f4 + f7                   85.89%
f1 + f2 + f3 + f4 + f5 + f6 + f7         87.64%

Table 8: Results of Feature Vector Classifier on Stanford Dataset
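The 11-dimensional feature vector of Table 7 can be assembled roughly as follows. This is not the authors' implementation; it is a hedged Python sketch in which `pos_words`, `neg_words`, `pos_bigrams`, `neg_bigrams`, `pos_emoticons` and `neg_emoticons` are assumed lexicons (e.g. the Twitrratr word lists and the two-class emoticon list described above), and tweets are assumed to be pre-tokenized.

```python
import re

def lexicon_hashtag_polarity(tag, pos_words, neg_words):
    """A hashtag is positive/negative if its body contains a listed word,
    e.g. '#imsohappy' contains 'happy'."""
    body = tag.lstrip("#").lower()
    if any(w in body for w in pos_words):
        return "pos"
    if any(w in body for w in neg_words):
        return "neg"
    return None

def feature_vector(tokens, pos_words, neg_words, pos_bigrams, neg_bigrams,
                   pos_emoticons, neg_emoticons):
    bigrams = list(zip(tokens, tokens[1:]))
    hashtags = [t for t in tokens if t.startswith("#")]
    return {
        # NLP features (f1, f2): matched unigram / bigram counts
        "f1_pos": sum(t in pos_words for t in tokens),
        "f1_neg": sum(t in neg_words for t in tokens),
        "f2_pos": sum(b in pos_bigrams for b in bigrams),
        "f2_neg": sum(b in neg_bigrams for b in bigrams),
        # Twitter-specific features (f3-f7)
        "f3_pos": sum(lexicon_hashtag_polarity(h, pos_words, neg_words) == "pos" for h in hashtags),
        "f3_neg": sum(lexicon_hashtag_polarity(h, pos_words, neg_words) == "neg" for h in hashtags),
        "f4_pos": sum(t in pos_emoticons for t in tokens),
        "f4_neg": sum(t in neg_emoticons for t in tokens),
        "f5_url": int(any(re.match(r"https?://", t) for t in tokens)),
        "f6_target": int(any(t.startswith("@") for t in tokens)),
        "f7_exclaim": int("!" in tokens),   # assumes '!' is kept as its own token
    }
```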
7 Discussion

In this section, we present a few examples evaluated using our system. The following example shows the effect of incorporating the contribution of emoticons into tweet classification. Consider the example "Ahhh I can't move it but hey w/e its on hell I'm elated right now :-D". This tweet contains two opinion words, "hell" and "elated". Using the unigram scoring method, this tweet is classified as neutral, but it is actually positive. If we incorporate the effect of the emoticon ":-D", then this tweet is tagged positive; ":-D" is a strong positive emoticon.

Consider this example: "Bill Clinton Fail - Obama Win?". In this example, there are two sentiment bearing words, "Fail" and "Win". Ideally this tweet should be neutral, but it is tagged as a positive tweet in the dataset as well as by our system. In this tweet, if we calculate the popularity factor (pF) for "Win" and "Fail", they come out to be 0.9 and 0.8 respectively. Because of the popularity factor weight, the positive score dominates the negative score and thus the tweet is tagged as positive. It is important to identify the context flow in the text and also how each of these words modifies or depends on the other words of the tweet.

For calculating the system performance, we assume that the dataset used here is correct. Most of the time this assumption is true, but there are a few cases where it fails. For example, the tweet "My wrist still hurts. I have to get it looked at. I HATE the dr/dentist/scary places. :( Time to watch Eagle eye. If you want to join, txt!" is tagged as positive, but it should actually have been tagged negative. Such erroneous tweets also affect the system performance.

There are a few limitations of the currently proposed approach which are also open research problems.

1. Spell Correction: In the above proposed approach, we gave a solution to spell correction which works only when extra characters are entered by the user. It fails when users skip some characters, e.g. "there" spelled as "thr". We propose the use of phonetic-level spell correction to handle this problem.

2. Hashtag Segmentation: For handling hashtags, we looked for the existence of positive or negative words (word list taken from http://twitrratr.com/) in the hashtag. But there can be cases where this does not work correctly. For example, for the hashtag "#thisisnotgood", if we only consider the presence of positive and negative words, then this hashtag is tagged positive ("good"). We fail to capture the presence and effect of "not", which makes this hashtag negative. We propose to devise and use some logic to segment the hashtags to get the correct constituent words (a possible segmentation sketch is given after this list).

3. Context Dependency: As discussed in one of the examples above, even tweet text, which is limited to 140 characters, can have context dependency. One possible method to address this problem is to identify the objects in the tweet and then find the opinion towards those objects.
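As referenced in limitation 2 above, one possible direction is a simple longest-match segmentation of the hashtag body against a vocabulary, followed by an explicit negation check before the polarity lookup. This is only an illustrative sketch of the proposed idea, not part of the system described in this paper; the vocabulary and word lists below are toy examples.

```python
def segment_hashtag(tag, vocab, max_len=20):
    """Greedy longest-match segmentation of a hashtag body, e.g.
    '#thisisnotgood' -> ['this', 'is', 'not', 'good'] given a suitable vocabulary."""
    body = tag.lstrip("#").lower()
    words, i = [], 0
    while i < len(body):
        for j in range(min(len(body), i + max_len), i, -1):
            if body[i:j] in vocab:
                words.append(body[i:j])
                i = j
                break
        else:                       # no dictionary word found: emit one character and move on
            words.append(body[i])
            i += 1
    return words

def segmented_hashtag_polarity(tag, vocab, pos_words, neg_words):
    """Polarity from segmented words, flipping the sign if 'not' precedes a sentiment word."""
    words = segment_hashtag(tag, vocab)
    polarity = 0
    for k, w in enumerate(words):
        if w in pos_words or w in neg_words:
            sign = 1 if w in pos_words else -1
            if "not" in words[max(0, k - 2):k]:   # crude negation check in a 2-word window
                sign = -sign
            polarity += sign
    return "pos" if polarity > 0 else "neg" if polarity < 0 else None

vocab = {"this", "is", "not", "good"}
print(segment_hashtag("#thisisnotgood", vocab))                              # ['this', 'is', 'not', 'good']
print(segmented_hashtag_polarity("#thisisnotgood", vocab, {"good"}, set()))  # 'neg'
```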
8 Conclusion and Future Work

Twitter sentiment analysis is a very important and challenging task. Twitter, being a microblog, suffers from various linguistic and grammatical errors. In this research, we proposed a method which incorporates the popularity effect of words on tweet sentiment classification and also emphasizes how to pre-process Twitter data for maximum information extraction from such small content. On the Stanford dataset, we achieved 87% accuracy using the scoring method and 88% using the SVM classifier. On the Mejaj dataset, we showed an improvement of 4.77% as compared to their (Bora, 2012) accuracy of 83.33%. In the future, this work can be extended through the incorporation of better spell correction mechanisms (possibly at the phonetic level) and word sense disambiguation. We can also identify the targets and entities in the tweet and the orientation of the user towards them.

Acknowledgement

We would like to thank Vibhor Goel, Sourav Dutta and Sonil Yadav for helping us with running the SVM classifier on such a large amount of data.

References

Agarwal, A., Xie, B., Vovsha, I., Rambow, O. and Passonneau, R. (2011). Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media LSM '11.
Bora, N. N. (2012). Summarizing Public Opinions in Tweets. In Journal Proceedings of CICLing 2012, New Delhi, India.
Chesley, P. (2006). Using verbs and adjectives to automatically classify blog sentiment. In Proceedings of AAAI-CAAW-06, the Spring Symposia on Computational Approaches.
Davidov, D., Tsur, O. and Rappoport, A. (2010). Enhanced sentiment learning using Twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters COLING '10.
Diakopoulos, N. and Shamma, D. (2010). Characterizing debate performance via aggregated twitter sentiment. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, ACM.
Draya, G., Planti, M., Harb, A., Poncelet, P., Roche, M. and Trousset, F. (2009). Opinion Mining from Blogs. In International Journal of Computer Information Systems and Industrial Management Applications (IJCISIM).
Go, A., Bhayani, R. and Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision. In CS224N Project Report, Stanford University.
Godbole, N., Srinivasaiah, M. and Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM).
He, B., Macdonald, C., He, J. and Ounis, I. (2008). An effective statistical approach to blog post opinion retrieval. In Proceedings of the 17th ACM Conference on Information and Knowledge Management CIKM '08.
Hu, M. and Liu, B. (2004). Mining Opinion Features in Customer Reviews. In AAAI.
Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM 38, 39-41.
Pang, B., Lee, L. and Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques.
Turney, P. D. (2002). Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In ACL.

SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media

Muhammad Abdul-Mageed, Sandra Kübler (Indiana University, Bloomington, IN, USA)
Mona Diab (Columbia University, New York, NY, USA)
{mabdulma,skuebler}@indiana.edu
[email protected]Abstract and blogs. This excludes, for example, social me- dia genres (such as Wikipedia Talk Pages). Second, In this work, we present SAMAR, a sys- despite increased interest in the area of SSA, only tem for Subjectivity and Sentiment Analysis few attempts have been made to build SSA systems (SSA) for Arabic social media genres. We for morphologically-rich languages (Abbasi et al., investigate: how to best represent lexical in- 2008; Abdul-Mageed et al., 2011b), i.e. languages formation; whether standard features are use- in which a significant amount of information con- ful; how to treat Arabic dialects; and, whether genre specific features have a measurable im- cerning syntactic units and relations is expressed at pact on performance. Our results suggest that the word-level, such as Finnish or Arabic. We thus we need individualized solutions for each do- aim at partially bridging these two gaps in research main and task, but that lemmatization is a fea- by developing an SSA system for Arabic, a mor- ture in all the best approaches. phologically highly complex languages (Diab et al., 2007; Habash et al., 2009). We present SAMAR, a sentence-level SSA system for Arabic social media 1 Introduction texts. We explore the SSA task on four different gen- In natural language, subjectivity refers to aspects of res: chat, Twitter, Web forums, and Wikipedia Talk language used to express opinions, feelings, eval- Pages. These genres vary considerably in terms of uations, and speculations (Banfield, 1982) and, as their functions and the language variety employed. such, it incorporates sentiment. The process of sub- While the chat genre is overridingly in dialectal Ara- jectivity classification refers to the task of classify- bic (DA), the other genres are mixed between Mod- ing texts as either objective (e.g., The new iPhone ern Standard Arabic (MSA) and DA in varying de- was released.) or subjective. Subjective text can grees. In addition to working on multiple genres, further be classified with sentiment or polarity. For SAMAR handles Arabic that goes beyond MSA. sentiment classification, the task consists of iden- 1.1 Research Questions tifying whether a subjective text is positive (e.g., In the current work, we focus on investigating four The Syrians continue to inspire the world with their main research questions: courage!), negative (e.g., The bloodbaths in Syria are horrifying!), neutral (e.g., Obama may sign the • RQ1: How can morphological richness be bill.), or, sometimes, mixed (e.g., The iPad is cool, treated in the context of Arabic SSA? but way too expensive). • RQ2: Can standard features be used for SSA In this work, we address two main issues in Sub- for social media despite the inherently short jectivity and Sentiment Analysis (SSA): First, SSA texts typically used in these genres? has mainly been conducted on a small number of genres such as newspaper text, customer reports, • RQ3: How do we treat dialects? 19 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 19–28, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics • RQ4: Which features specific to social media performance. can we leverage? 
RQ4 is concerned with attempting to improve SSA performance, which suffers from the problems RQ1 is concerned with the fact that SSA has described above, by leveraging information that is mainly been conducted for English, which has lit- typical for social media genres, such as author or tle morphological variation. Since the features used gender information. in machine learning experiments for SSA are highly The rest of the paper is organized as follows: In lexicalized, a direct application of these methods is Section 2, we review related work. Section 3 de- not possible for a language such as Arabic, in which scribes the social media corpora and the polarity lex- one lemma can be associated with thousands of sur- icon used in the experiments, Section 4 describes face forms. For this reason, we need to investigate SAMAR, the SSA system and the features used in how to avoid data sparseness resulting from using the experiments. Section 5 describes the experi- lexical features without losing information that is ments and discusses the results. In Section 6, we important for SSA. More specifically, we concen- give an overview of the best settings for the differ- trate on two questions: Since we need to reduce ent corpora, followed by a conclusion in Section 7. word forms to base forms to combat data sparseness, is it more useful to use tokenization or lemmatiza- 2 Related Work tion? And given that the part-of-speech (POS) tagset for Arabic contains a fair amount of morphological The bulk of SSA work has focused on movie and information, how much of this information is useful product reviews (Dave et al., 2003; Hu and Liu, for SSA? More specifically, we investigate two dif- 2004; Turney, 2002). A number of sentence- and ferent reduced tagsets, the RTS and the ERTS. For phrase-level classifiers have been built: For exam- more detailed information see section 4. ple, whereas Yi et al. (2003) present a system that RQ2 addresses the impact of using two stan- detects sentiment toward a given subject, Kim and dard features, frequently employed in SSA studies Hovy’s (2004) system detects sentiment towards a (Wiebe et al., 2004; Turney, 2002), on social media specific, predefined topic. Our work is similar to Yu data, which exhibit DA usage and text length vari- and Hatzivassiloglou (2003) and Wiebe et al. (1999) ations, e.g. in twitter data. First, we investigate the in that we use lexical and POS features. utility of applying a UNIQUE feature (Wiebe et al., Only few studies have been performed on Arabic. 2004) where low frequency words below a thresh- Abbasi et al. (2008) use a genetic algorithm for both old are replaced with the token ”UNIQUE”. Given English and Arabic Web forums sentiment detection that our data includes very short posts (e.g., twitter on the document level. They exploit both syntactic data has a limit of only 140 characters per tweet), and stylistic features, but do not use morphological it is questionable whether the UNIQUE feature will features. Their system is not directly comparable to be useful or whether it replaces too many content ours due to the difference in data sets.More related to words. Second, we test whether a polarity lexicon our work is our previous effort (2011b) in which we extracted in a standard domain using Modern Stan- built an SSA system that exploits newswire data. We dard Arabic (MSA) transfers to social media data. 
report a slight system improvement using the gold- Third, given the inherent lack of a standardized or- labeled morphological features and a significant im- thography for DA, the problem of replacing content provement when we use features based on a polarity words is expected to be increased since many DA lexicon from the news domain. In that work, our content words would be spelled in different ways. system performs at 71.54% F for subjectivity classi- RQ3 is concerned with the fact that for Arabic, fication and 95.52% F for sentiment detection. This there are significant differences between dialects. current work is an extension on our previous work However, existing NLP tools such as tokenizers and however it differs in that we use automatically pre- POS taggers are exclusively trained on and for MSA. dicted morphological features and work on data be- We thus investigate whether using an explicit feature longing to more genres and DA varieties, hence ad- that identifies the dialect of the text improves SSA dressing a more challenging task. 20 Data set SUBJ GEN LV UID DID 3 Data Sets and Annotation DAR X X MONT X X X To our knowledge, no gold-labeled social media TRGD X X X X SSA data exist. Thereby, we create annotated data THR X X comprising a variety of data sets: Table 1: Types of annotation labels (features) manually DARDASHA (DAR): (Arabic for “chat”) com- assigned to the data. prises the first 2798 chat turns collected from a ran- domly selected chat session from “Egypt’s room” in Maktoob chat chat.mymaktoob.com. Maktoob tweet as MSA if it mainly employs MSA words and is a popular Arabic portal. DAR is an Egyptian Ara- adheres syntactically to MSA rules, otherwise it is bic subset of a larger chat corpus that was harvested treated as dialectal. Table 1 shows the annotations between December 2008 and February 2010. for each data set. Data statistics, distribution of TAGREED (TGRD): (“tweeting”) is a corpus classes, and inter-annotator agreement in terms of of 3015 Arabic tweets collected during May 2010. Kappa (K) are provided in Table 2. TRGD has a mixture of MSA and DA. The MSA Polarity Lexicon: We manually created a lexicon part (TRGD-MSA) has 1466 tweets, and the dialec- of 3982 adjectives labeled with one of the following tal part (TRGD-DA) has 1549 tweets. tags {positive, negative, neutral}, as is reported in TAHRIR (THR): (“editing”) is a corpus of 3008 our previous work (2011b). We focus on adjectives sentences sampled from a larger pool of 30 MSA since they are primary sentiment bearers. The ad- Wikipedia Talk Pages that we harvested. jectives pertain to the newswire domain, and were MONTADA (MONT): (“forum”) comprises of extracted from the first four parts of the Penn Arabic 3097 Web forum sentences collected from a larger Treebank (Maamouri et al., 2004). pool of threaded conversations pertaining to differ- ent varieties of Arabic, including both MSA and DA, 4 SAMAR from the COLABA data set (Diab et al., 2010). The discussions covered in the forums pertain to social 4.1 Automatic Classification issues, religion or politics. The sentences were au- SAMAR is a machine learning system for Arabic tomatically filtered to exclude non-MSA threads. SSA. For classification, we use SVMlight (Joachims, Each of the data sets was labeled at the sentence 2008). In our experiments, we found that linear ker- level by two college-educated native speakers of nels yield the best performance. We perform all ex- Arabic. 
For each sentence, the annotators assigned periments with presence vectors: In each sentence one of 3 possible labels: (1) objective (OBJ), (2) vector, the value of each dimension is binary, regard- subjective-positive (S-POS), (3) subjective-negative less of how many times a feature occurs. (S-NEG), and (3) subjective-mixed (S-MIXED). In the current study, we adopt a two-stage clas- Following (Wiebe et al., 1999), if the primary goal sification approach. In the first stage (i.e., Subjec- of a sentence is judged as the objective reporting tivity), we build a binary classifier to separate objec- of information, it was labeled as OBJ. Otherwise, a tive from subjective cases. For the second stage (i.e., sentence was a candidate for one of the three SUBJ Sentiment) we apply binary classification that distin- classes. We also labeled the data with a number of guishes S-POS from S-NEG cases. We disregard the other metadata1 tags. Metadata labels included the neutral and mixed classes for this study. SAMAR user gender (GEN), the user identity (UID) (e.g. the uses different feature sets, each of which is designed user could be a person or an organization), and the to address an individual research question: source document ID (DID). We also mark the lan- guage variety (LV) (i.e., MSA or DA) used, tagged 4.2 Morphological Features at the level of each unit of analysis (i.e., sentence, Word forms: In order to minimize data sparse- tweet, etc.). Annotators were instructed to label a ness as a result of the morphological richness of 1 We use the term ’metadata’ as an approximation, as some Arabic, we tokenize the text automatically. We features are more related to social interaction phenomena. use AMIRA (Diab, 2009), a suite for automatic 21 Data set # instances # types # tokens # OBJ # S-POS # S-NEG # S-MIXED Kappa (K) DAR 2,798 11,810 3,133 328 1647 726 97 0.89 MONT 3,097 82,545 20,003 576 1,101 1,027 393 0.88 TRGD 3,015 63,383 16,894 1,428 483 759 345 0.85 TRGD-MSA 1,466 31,771 9,802 960 226 186 94 0.85 TRGD-DIA 1,549 31,940 10,398 468 257 573 251 0.82 THR 3,008 49,425 10,489 1,206 652 1,014 136 0.85 Table 2: Data and inter-annotator agreement statistics. processing of MSA, trained on Penn Arabic Tree- Unique: Following Wiebe et al. (2004), we ap- bank (Maamouri et al., 2004) data, which consists ply a UNIQUE (Q) feature: We replace low fre- of newswire text. We experiment with two different quency words with the token ”UNIQUE”. Exper- configurations to extract base forms of words: (1) iments showed that setting the frequency threshold Token (TOK), where the stems are left as is with no to 3 yields the best results. further processing of the morpho-tactics that result Polarity Lexicon (PL): The lexicon (cf. section from the segmentation of clitics; (2) Lemma (LEM), 3) is used in two different forms for the two tasks: where the words are reduced to their lemma forms, For subjectivity classification, we follow Bruce and (citation forms): for verbs, this is the 3rd person Wiebe (1999; 2011b) and add a binary has adjective masculine singular perfective form and for nouns, feature indicating whether or not any of the ad- this corresponds to the singular default form (typi- Jm ' ð jectives in the sentence is part of our manually cally masculine). For example, the word ÑîEA . + Ñë created polarity lexicon. For sentiment classifica- (wbHsnAtHm) is tokenized as ð + H . + HA Jk tion, we apply two features, has POS adjective and (w+b+HsnAt+Hm) (note that in TOK, AMIRA does has NEG adjective. 
These binary features indicate not split off the pluralizing suffix H@ (At) from the whether a POS or NEG adjective from the lexicon (Hsn)), while in the lemmatization step stem ák occurs in a sentence. by AMIRA, the lemma rendered is éJk (Hsnp). 4.4 Dialectal Arabic Features Thus, SAMAR uses the form of the word as Hsnp in the LEM setting, and HsnAt in the TOK setting. Dialect: We apply the two gold language variety features, {MSA, DA}, on the Twitter data set to rep- POS tagging: Since we use only the base forms resent whether the tweet is in MSA or in a dialect. of words, the question arises whether we lose mean- ingful morphological information and consequently 4.5 Genre Specific Features whether we could represent this information in the Gender: Inspired by gender variation research ex- POS tags instead. Thus, we use two sets of POS ploiting social media data (e.g., (Herring, 1996)), features that are specific to Arabic: the reduced we apply three gender (GEN) features correspond- tag set (RTS) and the extended reduced tag set ing to the set {MALE, FEMALE, UNKNOWN}. (ERTS) (Diab, 2009). The RTS is composed of 42 Abdul-Mageed and Diab (2012a) suggest that there tags and reflects only number for nouns and some is a relationship between politeness strategies and tense information for verbs whereas the ERTS com- sentiment expression. And gender variation research prises 115 tags and enriches the RTS with gender, in social media shows that expression of linguistic number, and definiteness information. Diab (2007b; politeness (Brown and Levinson, 1987) differs based 2007a) shows that using the ERTS improves re- on the gender of the user. sults for higher processing tasks such as base phrase User ID: The user ID (UID) labels are inspired chunking of Arabic. by research on Arabic Twitter showing that a consid- erable share of tweets is produced by organizations 4.3 Standard Features such as news agencies (Abdul-Mageed et al., 2011a) This group includes two features that have been em- as opposed to lay users. We hence employ two fea- ployed in various SSA studies. tures from the set {PERSON, ORGANIZATION} to 22 classification of the Twitter data set. The assumption SUBJ SENTI Data Cond. Acc F-O F-S Acc F-P F-N is that tweets by persons will have a higher correla- DAR Base 84.75 0.00 91.24 63.02 77.32 0.00 tion with expression of sentiment. TOK 83.90 0.00 91.24 67.71 77.04 45.61 LEM 83.76 0.00 91.16 70.16 78.65 50.43 TRGD Base 61.59 0.00 76.23 56.45 0.00 72.16 Document ID: Projecting a document ID (DID) TOK 69.54 64.06 73.56 65.32 49.41 73.62 feature to the paragraph level was shown to im- LEM 71.19 64.78 75.63 62.10 41.98 71.86 prove subjectivity classification on data from the THR Base 52.92 0.00 69.21 75.00 0.00 85.71 TOK 58.44 28.09 70.78 60.47 37.04 71.19 health policy domain (Abdul-Mageed et al., 2011c). LEM 57.79 26.97 70.32 63.37 38.83 73.86 Hence, by employing DID at the instance level, we MONT Base 83.44 0.00 90.97 86.82 92.94 0.00 TOK 83.44 0.00 90.97 74.55 83.63 42.86 are investigating the utility of this feature for social LEM 83.44 0.00 90.97 72.27 81.68 42.99 media as well as at a finer level of analysis, i.e., the sentence level. Table 3: SSA results with preprocessing TOK and LEM. 5 Empirical Evaluation TOK. TGRD: Both preprocessing schemes outper- For each data set, we divide the data into 80% train- form Base on all metrics with TOK outperforming ing (TRAIN), 10% for development (DEV), and LEM across the board. THR: LEM outperforms 10% for testing (TEST). 
The classifier was opti- TOK for all metrics of sentiment, yet they are be- mized on the DEV set; all results that we report be- low Base performance. MONT: TOK outperforms low are on TEST. In each case, our baseline is the LEM in terms of accuracy, and positive sentiment, majority class in the training set. We report accu- yet LEM slightly outperforms TOK for negative sen- racy as well as the F scores for the individual classes timent classification. Both TOK and LEM are beat (objective vs. subjective and positive vs. negative). by Base in terms of accuracy and positive classifica- tion. Given the observed results, we observe no clear 5.1 Impact of Morphology on SSA trends for the impact for morphological preprocess- We run two experimental conditions: 1. A compari- ing alone on performance. son of TOK to LEM (cf. sec. 4.2); 2. A combination of RTS and ERTS with TOK and LEM. Adding POS tags: Table 4 shows the results of adding POS tags based on the two tagsets RTS TOK vs. LEM: Table 3 shows the results for the and ERTS. Subjectivity classification: The results morphological preprocessing conditions. The base- show that adding POS information improves ac- line, Base, is the majority class in the training data. curacy and F score for all the data sets except For all data sets, Subjective is the majority class. MONT which is still at Base performance. RTS For subjectivity classification we see varying per- outperforms ERTS with TOK, and the opposite with formance. DAR: TOK outperforms LEM for all LEM where ERTS outperforms RTS, however, over- metrics, yet performance is below Base. TGRD: all TOK+RTS yields the highest performance of LEM preprocessing yields better accuracy results 91.49% F score on subjectivity classification for the than Base. LEM is consistently better than TOK DAR dataset. For the TGRD and THR data sets, we for all metrics. THR: We see the opposite perfor- note that TOK+ERTS is equal to or outperforms the mance compared to the TGRD data set where TOK other conditions on subjectivity classification. For outperforms LEM and also outperforming Base. Fi- MONT there is no difference between experimental nally for MONT: the performance of LEM and TOK conditions and no impact for adding the POS tag in- are exactly the same yielding the same results as in formation. In the sentiment classification task: Base. The sentiment task shows a different trend: here, For sentiment classification, the majority class the highest performing systems do not use POS tags. is positive for DAR and MONT and negative for This is attributed to the variation in genre between TGRD and THR. We note that there are no obvi- the training data on which AMIRA is trained (MSA ous trends between TOK and LEM. DAR: we ob- newswire) and the data sets we are experimenting serve better performance of LEM over Base and with in this work. However in relative compari- 23 SUBJ SENTI Data Cond. 
Acc F-O F-S Acc F-P F-N DAR Base 84.75 91.24 63.02 77.32 TOK+RTS 84.32 0.00 91.49 66.15 76.36 40.37 TOK+ERTS 83.90 0.00 91.24 67.19 77.09 42.20 LEM+RTS 83.47 0.00 90.99 67.71 77.21 44.64 LEM+ERTS 83.47 0.00 90.99 68.75 77.94 46.43 TGRD Base 61.59 76.23 56.45 72.16 TOK+RTS 70.20 64.57 74.29 62.90 43.90 72.29 TOK+ERTS 71.19 65.06 75.49 62.90 42.50 72.62 LEM+RTS 70.20 64.57 74.29 62.90 46.51 71.60 LEM+ERTS 72.19 76.54 71.19 65.32 48.19 73.94 THR Base 52.92 69.21 75.00 85.71 TOK+RTS 57.47 28.42 69.75 59.30 33.96 70.59 TOK+ERTS 59.42 28.57 71.66 59.88 38.94 70.13 LEM+RTS 59.42 28.57 71.66 59.88 33.01 71.37 LEM+ERTS 58.77 25.73 71.46 60.47 37.04 71.19 MONT Base 83.44 90.97 86.82 92.94 TOK+RTS 83.44 0.00 90.97 69.09 79.27 39.29 TOK+ERTS 83.44 0.00 90.97 71.82 81.55 40.38 LEM+RTS 83.44 0.00 90.97 70.00 80.36 36.54 LEM+ERTS 83.44 0.00 90.97 69.55 79.64 39.64 Table 4: SSA results with different morphological preprocessing and POS features. son between RTS and ERTS for sentiment shows and 6.81% for MONT. The deterioration on THR is that in a majority of the cases, ERTS outperforms surprising and may be a result of the nature of sen- RTS, thus indicating that the additional morpholog- timent as expressed in the THR data set: Wikipedia ical features are helpful. One possible explanation has a ’Neutral Point of View’ policy based on which may be that variations of some of the morphologi- users are required to focus their contributions not cal features (e.g., existence of a gender, person, ad- on other users but content, and as such sentiment is jective feature) may correlate more frequently with expressed in nuanced indirect ways in THR. While positive or negative sentiment. the subjectivity results show that it is feasible to use the combination of the UNIQUE feature and the po- 5.2 Standard Features for Social Media Data larity lexicon features successfully, even for shorter RQ2 concerns the question whether standard fea- texts, such as in the twitter data (TGRD), this con- tures can be used successfully for classifying social clusion does not always hold for sentiment classi- media text characterized by the usage of dialect and fication. However, we assume that the use of the by differing text lengths. We add the standard fea- polarity lexicon would result in higher gains if the tures, polarity (PL) and UNIQUE (Q), to the two to- lexicon were adapted to the new domains. kenization schemes and the POS tag sets. We report only the best performing conditions here. 5.3 SSA Given Arabic Dialects Table 5 shows the best performing settings per RQ3 investigates how much the results of SSA are corpus from the previous section as well as the best affected by the presence or absence of dialectal Ara- performing setting given the new features. The re- bic in the data. For this question, we focus on the sults show that apart from THR and TGRD for sen- TGRD data set because it contains a non-negligible timent, all corpora gain in accuracy for both sub- amount (i.e., 48.62%) of tweets in dialect. jectivity and sentiment. In the case of subjectiv- First, we investigate how our results change when ity, while considerable improvements are gained for we split the TGRD data set into two subsets, one both DAR (11.51% accuracy) and THR (32.90% ac- containing only MSA, the other one containing only curacy), only slight improvements (< 1% accuracy) DA. We extract the 80-10-10% data split, then train are reached for both TGRD and MONT. 
For sen- and test the classifier exclusively on either MSA or timent classification, the improvements in accuracy dialect data. The subjectivity results for this exper- are less than the case of subjectivity: 1.84% for DAR iment are shown in Table 6, and the sentiment re- 24 SUBJ SENTI Data Best condition Acc F-O F-S Best condition Acc F-P F-N DAR TOK+RTS 84.32 0.00 91.49 LEM+ERTS 68.75 77.94 46.43 TOK+ERTS+PL+Q3 95.83 0.00 97.87 LEM+ERTS+PL+Q3 70.59 79.51 47.92 TGRD LEM+ERTS 72.19 76.54 71.19 LEM+ERTS 65.32 73.94 48.19 LEM+ERTS+PL 72.52 65.84 77.01 LEM+ERTS+PL 65.32 73.94 48.19 THR L./T.+ERTS 59.42 28.57 71.66 LEM+ERTS 63.37 38.83 73.86 TOK+ERTS +PL+Q3 83.33 0.00 90.91 LEM+RTS+PL+Q3 61.05 34.95 72.20 MONT LEM+ERTS 83.44 0.00 90.97 TOK 74.55 83.63 42.86 LEM+RTS+PL+Q3 84.19 3.92 91.39 TOK+PL+Q3 81.36 88.64 48.10 Table 5: SSA results with standard features. Number in bold signify improvements over the best results in section 5.1. TGRD TGRD-MSA TGRD-DA Cond. Acc F-O F-S Acc F-O F-S Acc F-O F-S Base 61.59 0.00 76.23 51.68 68.14 0.00 78.40 0.00 87.89 TOK 69.54 64.06 73.56 61.74 70.16 46.73 78.40 5.41 87.80 LEM 71.19 64.78 75.63 65.10 72.04 53.57 79.01 15.00 88.03 Table 6: Dialect-specific subjectivity experiments. sults are shown in Table 7. For both tasks, the re- The results for both subjectivity and sentiment on sults show considerable differences between MSA the MSA and DA sets suggest that processing errors and DA: For TGRD-MSA, the results are lower than by AMIRA trained exclusively on MSA newswire for TGRD-DA, which is a direct consequence of data) result in deteriorated performance. However the difference in distribution of subjectivity between we do not observe such trends on the TGRD-DA the two subcorpora. TGRD-DA is mostly subjective data sets. This is not surprising since the TGRD- while TGRD-MSA is more balanced. With regard DA is not very different from the newswire data on to sentiment, TGRD-DA consists of mostly negative which AMIRA was trained: Twitter users discuss tweets while TGRD-MSA again is more balanced. current events topics also discussed in newswire. These results suggest that knowing whether a tweet There is also a considerable lexical overlap between is in dialect would help classification. MSA and DA. Furthermore, dialectal data may be For subjectivity, we can see that TGRD-MSA im- loci for more sentiment cues like emoticons, certain proves by 13.5% over the baseline while for TGRD- punctuation marks (e.g. exclamation marks), etc. DA, the improvement is more moderate, < 3%. We Such clues are usually absent (or less frequent) in assume that this is partly due to the higher skew in MSA data and hence the better sentiment classifica- TGRD-DA, moreover, it is known that our prepro- tion on TGRD-DA. cessing tools yield better performance on MSA data We also experimented with adding POS tags and leading to better tokenization and lemmatization. standard features. These did not have any positive For sentiment classification on TGRD-MSA, nei- effect on the results with one exception, which is ther tokenization nor lemmatization improve over shown in Table 8: For sentiment, adding the RTS the baseline. This is somewhat surprising since we tagset has a positive effect on the two data sets. expect AMIRA to work well on this data set and thus In a second experiment, we used the original to lead to better classification results. However, a TGRD corpus but added the language variety (LV) considerable extent of the MSA tweets are expected (i.e., MSA and DA) features. 
For both subjectiv- to come from news headlines (Abdul-Mageed et ity and sentiment, the best results are acquired us- al., 2011a), and headlines usually are not loci of ex- ing the LEM+PL+LV settings. However, for subjec- plicitly subjective content and hence are difficult to tivity, we observe a drop in accuracy from 72.52% classify and in essence harder to preprocess since (LEM+ERTS+PL) to 69.54%. For sentiment, we the genre is different from regular newswire even if also observe a performance drop in accuracy, from MSA. For the TGRD-DA data set, both lemmatiza- 65.32% (LEM+ERTS+PL) to 64.52%. This means tion and tokenization improve over the baseline. that knowing the language variety does not provide 25 TGRD TGRD-MSA TGRD-DA Cond. Acc F-P F-N Acc F-P F-N Acc F-P F-N Base 56.45 0.00 72.16 53.49 69.70 0.00 67.47 0.00 80.58 TOK 65.32 49.41 73.62 53.49 56.52 50.00 68.67 23.53 80.30 LEM 62.10 41.98 71.86 48.84 52.17 45.00 73.49 38.89 83.08 TOK+RTS 70.20 64.57 74.29 55.81 61.22 48.65 71.08 29.41 81.82 Table 7: Dialect-specific sentiment experiments. SUBJ SENTI Data Condition Acc F-O F-S Condition Acc F-P F-N DAR TOK+ERTS+PL+Q3 95.83 0.00 97.87 LEM+PL+GEN 71.28 79.86 50.00 TGRD LEM+ERTS+PL 72.52 65.84 77.01 TOK+ERTS+PL+GEN+LV+UID 65.87 49.41 74.25 THR TOK+ERTS+PL+Q3 83.33 0.00 90.91 TOK+PL+GEN+UID 67.44 39.13 77.78 MONT LEM+RTS+PL+Q3 84.19 3.92 91.39 TOK+PL+Q3 81.36 88.64 48.10 Table 8: Overall best SAMAR performance. Numbers in bold show improvement over the baseline. Data Condition Acc F-O F-S (0.52%) improves classification over previous best DAR TOK+ERTS+PL+GEN 84.30 0.00 91.48 TGRD LEM+RTS+PL+UID 71.85 65.31 76.32 settings. For THR, adding the gender and user ID THR LEM+RTS+PL+GEN+UID 66.67 0.00 80.00 information improves classification by 4.07%. MONT LEM+RTS+PL+DID 83.17 0.00 90.81 Our results thus show the utility of the gender, LV, and user ID features for sentiment classification. Table 9: Subjectivity results with genre features. The results for both subjectivity and sentiment show Data Condition Acc F-P F-N that the document ID feature is not a useful feature. DAR LEM+PL+GEN 71.28 79.86 50.00 TGRD TOK+ERTS+PL+GEN+LV 65.87 49.41 74.25 6 Overall Performance +UID THR TOK+PL+GEN+UID 67.44 39.13 77.78 Table 8 provides the best results reached by MONT LEM+PL+DID 76.82 47.42 85.13 SAMAR. For subjectivity classification, SAMAR improves on all data sets when the POS features are Table 10: Sentiment results with genre features. Numbers in bold show improvement over table 5. combined with the standard features. For sentiment classification, SAMAR also improves over the base- line on all the data sets, except MONT. The results enough information for successfully conquering the also show that all optimal feature settings for sub- differences between those varieties. jectivity, except with the MONT data set, include the ERTS POS tags while the results in Section 5.1 5.4 Leveraging Genre Specific Features showed that adding POS information without addi- RQ4 investigates the question whether we can lever- tional features, while helping in most cases with sub- age features typical for social media for classifica- jectivity, does not help with sentiment classification. tion. We apply all GENRE features exhaustively.We report the best performance on each data set. 7 Conclusion and Future Work Table 9 shows the results of adding the genre fea- In this paper, we presented SAMAR, an SSA system tures to the subjectivity classifier. For this task, no for Arabic social media. 
We explained the rich fea- data sets profit from these features. ture set SAMAR exploits and showed how complex Table 10 shows the results of adding the genre fea- morphology characteristic of Arabic can be handled tures to the sentiment classifier. Here, all the data in the context of SSA. For the future, we plan to sets, with the exception of MONT, profit from the carry out a detailed error analysis of SAMAR in an new features. In the case of DAR, adding gender attempt to improve its performance, use a recently- information improves classification by 1.73% in ac- developed wider coverage polarity lexicon (Abdul- curacy. For TGRD, the combination of the gender Mageed and Diab, 2012b) together with another DA (GN), language variety (LV), and user ID slightly lexicon that we are currently developing. 26 References Workshop on Semitic Language Processing, pages 66– 74, Valetta, Malta. Ahmed Abbasi, Hsinchun Chen, and Arab Salem. 2008. Mona Diab. 2007a. Improved Arabic base phrase chunk- Sentiment analysis in multiple languages: Feature se- ing with a new enriched POS tag set. In Proceedings lection for opinion classification in Web forums. ACM of the 2007 Workshop on Computational Approaches Transactions on Information Systems, 26:1–34. to Semitic Languages: Common Issues and Resources, Muhammad Abdul-Mageed and Mona Diab. 2012a. pages 89–96, Prague, Czech Republic. AWATIF: A multi-genre corpus for Modern Standard Mona Diab. 2007b. Towards an optimal POS tag set for Arabic subjectivity and sentiment analysis. In Pro- Modern Standard Arabic processing. In Proceedings ceedings of LREC, Istanbul, Turkey. of Recent Advances in Natural Language Processing Muhammad Abdul-Mageed and Mona Diab. 2012b. To- (RANLP), Borovets, Bulgaria. ward building a large-scale Arabic sentiment lexicon. Mona Diab. 2009. Second generation AMIRA tools In Proceedings of the 6th International Global Word- for Arabic processing: Fast and robust tokenization, Net Conference, Matsue, Japan. POS tagging, and base phrase chunking. In Proceed- Muhammad Abdul-Mageed, Hamdan Albogmi, Abdul- ings of the Second International Conference on Arabic rahman Gerrio, Emhamed Hamed, and Omar Aldibasi. Language Resources and Tools, pages 285–288, Cairo, 2011a. Tweeting in Arabic: What, how and whither. Egypt. Presented at the 12th Annual Conference of the As- Nizar Habash, Owen Rambow, and Ryan Roth. 2009. sociation of Internet Researchers (Internet Research MADA+TOKAN: A toolkit for Arabic tokenization, 12.0, Performance and Participation), Seattle, WA. diacritization, morphological disambiguation, POS Muhammad Abdul-Mageed, Mona Diab, and Mohamed tagging, stemming and lemmatization. In Proceed- Korayem. 2011b. Subjectivity and sentiment analy- ings of the Second International Conference on Arabic sis of Modern Standard Arabic. In Proceedings of the Language Resources and Tools, pages 102–109, Cairo, 49th Annual Meeting of the Association for Compu- Egypt. tational Linguistics: Human Language Technologies, Susan Herring. 1996. Bringing familiar baggage to pages 587–591, Portland, OR. the new frontier: Gender differences in computer- Muhammad Abdul-Mageed, Mohamed Korayem, and mediated communication. In J. Selzer, editor, Con- Ahmed YoussefAgha. 2011c. ”Yes we can?”: Sub- versations. Allyn & Bacon. jectivity annotation and tagging for the health domain. Minqing Hu and Bing Liu. 2004. Mining and summa- In Proceedings of RANLP2011, Hissar, Bulgaria. rizing customer reviews. In Proceedings of the Tenth Ann Banfield. 1982. 
Unspeakable Sentences: Narration ACM SIGKDD International Conference on Knowl- and Representation in the Language of Fiction. Rout- edge Discovery and Data Mining, pages 168–177, ledge, Boston. Seattle, WA. Penelope Brown and Stephen Levinson. 1987. Polite- Thorsten Joachims. 2008. Svmlight: Support vector ma- ness: Some Universals in Language Usage. Cam- chine. http://svmlight.joachims.org/, Cornell Univer- bridge University Press. sity, 2008. Rebecca Bruce and Janyce Wiebe. 1999. Recognizing Soo-Min Kim and Eduard Hovy. 2004. Determining the subjectivity. A case study of manual tagging. Natural sentiment of opinions. In Proceedings of the 20th In- Language Engineering, 5(2):187–205. ternational Conference on Computational Linguistics, Kushal Dave, Steve Lawrence, and David Pennock. pages 1367–1373, Geneva, Switzerland. 2003. Mining the peanut gallery: Opinion extrac- Mohamed Maamouri, Anne Bies, Tim Buckwalter, and tion and semantic classification of product reviews. In W. Mekki. 2004. The Penn Arabic Treebank: Build- Proceedings of the 12th International Conference on ing a large-scale annotated Arabic corpus. In NEM- World Wide Web, pages 519–528, Budapest, Hungary. LAR Conference on Arabic Language Resources and ACM. Tools, pages 102–109, Cairo, Egypt. Mona Diab, Dan Jurafsky, and Kadri Hacioglu. 2007. Peter Turney. 2002. Thumbs up or thumbs down? Se- Automatic processing of Modern Standard Arabic text. mantic orientation applied to unsupervised classifica- In Abdelhadi Soudi, Antal van den Bosch, and Günter tion of reviews. In Proceedings of the 40th Annual Neumann, editors, Arabic Computational Morphol- Meeting of the Association for Computational Linguis- ogy. Springer. tics (ACL’02), Philadelphia, PA. Mona Diab, Nizar Habash, Owen Rambow, Mohamed Janyce Wiebe, Rebecca Bruce, and Tim O’Hara. 1999. Altantawy, and Yassin Benajiba. 2010. COLABA: Development and use of a gold standard data set for Arabic dialect annotation and processing. In LREC subjectivity classifications. In Proceedings of the 37th 27 Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 246–253, University of Maryland. Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning subjective language. Computational Linguistics, 30:227–308. Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne Niblack. 2003. Sentiment analyzer: Extract- ing sentiments about a given topic using natural lan- guage processing techniques. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 427–434, Melbourne, FL. Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sen- tences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sapporo, Japan. 28 Opinum: statistical sentiment analysis for opinion classification Boyan Bonev, Gema Ramı́rez-Sánchez, Sergio Ortiz Rojas Prompsit Language Engineering Avenida Universidad, s/n. Edificio Quorum III. 03202 Elche, Alicante (Spain) {boyan,gramirez,sortiz}@prompsit.com Abstract in two classes: those expressing positive sentiment (the author is in favour of something) and those ex- The classification of opinion texts in positive pressing negative sentiment, and we will refer to and negative can be tackled by evaluating sep- them as positive opinions and negative opinions. 
arate key words but this is a very limited ap- Sentiment analysis is possible thanks to the opin- proach. We propose an approach based on the order of the words without using any syntac- ions available online. There are vast amounts of text tic and semantic information. It consists of in fora, user reviews, comments in blogs and social building one probabilistic model for the posi- networks. It is valuable for marketing and sociolog- tive and another one for the negative opinions. ical studies to analyse these freely available data on Then the test opinions are compared to both some definite subject or entity. Some of the texts models and a decision and confidence mea- available do include opinion information like stars, sure are calculated. In order to reduce the or recommend-or-not, but most of them do not. A complexity of the training corpus we first lem- good corpus for building sentiment analysis systems matize the texts and we replace most named- entities with wildcards. We present an accu- would be a set of opinions separated by domains. It racy above 81% for Spanish opinions in the should include some information about the cultural financial products domain. origin of authors and their job, and each opinion should be sentiment-evaluated not only by its own author, but by many other readers as well. It would 1 Introduction also be good to have a marking of the subjective and Most of the texts written by humans reflect some objective parts of the text. Unfortunately this kind kind of sentiment. The interpretation of these sen- of corpora are not available at the moment. timents depend on the linguistic skills and emo- In the present work we place our attention at the tional intelligence of both the author and the reader, supervised classification of opinions in positive and but above all, this interpretation is subjective to the negative. Our system, which we call Opinum1 , is reader. They don’t really exist in a string of charac- trained from a corpus labeled with a value indicat- ters, for they are subjective states of mind. Therefore ing whether an opinion is positive or negative. The sentiment analysis is a prediction of how most read- corpus was crawled from the web and it consists of a ers would react to a given text. 160MB collection of Spanish opinions about finan- There are texts which intend to be objective and cial products. Opinum’s approach is general enough texts which are intentionally subjective. The latter is and it is not limited to this corpus nor to the financial the case of opinion texts, in which the authors inten- domain. tionally use an appropriate language to express their There are state-of-the-art works on sentiment positive or negative sentiments about something. In 1 An Opinum installation can be tested from a web interface this paper we work on the classification of opinions at http://aplica.prompsit.com/en/opinum 29 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 29–37, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics analysis which care about differentiating between tivation of our approach in Section 2, then in Sec- the objective and the subjective part of a text. For tion 3 we describe in detail the Opinum approach. instance, in the review of a film there is an objec- In Section 4 we present our experiments with Span- tive part and then the opinion (Raaijmakers et al., ish financial opinions and we state some conclusions 2008). 
In our case we work directly with opinion and future work in Section 5. texts and we do not make such difference. We have noticed that in customer reviews, even when stating 2 Hypothesis objective facts, some positive or negative sentiment is usually expressed. When humans read an opinion, even if they do Many works in the literature of sentiment anal- not understand it completely because of the techni- ysis take lexicon-based approaches (Taboada et al., cal details or domain-specific terminology, in most 2011). For instance (Hu and Liu, 2004; Blair- cases they can notice whether it is positive or nega- Goldensohn et al., 2008) use WordNet to extend tive. The reason for this is that the author of the opin- the relation of positive and negative words to other ion, consciously or not, uses nuances and structures related lexical units. However the combination of which show a positive or negative feeling. Usually, which words appear together may also be impor- when a user writes an opinion about a product, the tant and there are comparisons of different Ma- intention is to communicate that subjective feeling, chine learning approaches (Pang et al., 2002) in apart from describing the experience with the prod- the literature, like Support Vector Machines, k- uct and giving some technical details. Nearest Neighbours, Naive-Bayes, and other classi- The hypothesis underlying the traditional fiers based on global features. In (McDonald et al., keyword or lexicon-based approaches (Blair- 2007) structured models are used to infer the senti- Goldensohn et al., 2008; Hu and Liu, 2004) consist ment from different levels of granularity. They score in looking for some specific positive or negative cliques of text based on a high-dimensional feature words. For instance, “great” should be positive and vector. “disgusting” should be negative. Of course there In the Opinum approach we score each sentence are some exceptions like “not great”, and some based on its n-gram probabilites. For a complete approaches detect negation to invert the meaning of opinion we sum the scores of all its sentences. Thus, the word. More elaborate cases are constructions if an opinion has several positive sentences and it fi- like “an offer you can’t refuse” or “the best way to nally concludes with a negative sentence which set- lose your money”. tles the whole opinion as negative, Opinum would There are domains in which the authors of the probably fail. The n-gram sequences are good at opinions might not use these explicit keywords. In capturing phrasemes (multiwords), the motivation the financial domain we can notice that many of the for which is stated in Section 2. Basically, there opinions which express the author’s insecurity are are phrasemes which bear sentiment. They may actually negative, even though the words are mostly be different depending on the domain and it is rec- neutral. For example, “I am not sure if I would get ommendable to build the models with opinions be- a loan from this bank” has a negative meaning. An- longing to the target domain, for instance, financial other difficulty is that the same words could be posi- products, computers, airlines, etc. A study of do- tive or negative depending on other words of the sen- main adaptation for sentiment analysis is presented tence: “A loan with high interests” is negative while in (Blitzer et al., 2007). In Opinum different clas- “A savings account with high interests” is positive. sifiers would be built for different domains. 
Build- In general more complex products have more com- ing the models does not require the aid of experts, plex and subtle opinions. The opinion about a cud- only a labeled set of opinions is necessary. Another dly toy would contain many keywords and would be contribution of Opinum is that it applies some sim- much more explicit than the opinion about the con- plifications on the original text of the opinions for ditions of a loan. Even so, the human readers can improving the performance of the models. get the positive or negative feeling at a glance. In the remainder of the paper we first state the mo- The hypothesis of our approach is that it is pos- 30 sible to classify opinions in negative and positive guage models to some extent, their actual purpose based on canonical (lemmatized) word sequences. is to avoid some negative constructions to be as- Given a set of positive opinions Op and a set of sociated to concrete entities. For instance, we do negative opinions On , the probability distributions not care that “do not trust John Doe Bank” is neg- of their n-gram word sequences are different and ative, instead we prefer to know that “do not trust can be compared to the n-grams of a new opin- company entity” is negative regardless of the entity. ion in order to classify it. In terms of statistical This generality allows us to better evaluate opinions language models, given the language models M p about new entities. Also, in the cases when all the and M n obtained from Op and On , the probability opinions about some entity E1 are good and all the ppo = P (o|Op ) that a new opinion would be gener- opinions about some other entity E2 are bad, entity ated by the positive model is smaller or greater than replacement prevents the models from acquiring this the probability pno = P (o|ON ) that a new opinion kind of bias. would be generated by the negative model. Following we detail the lemmatization process, We build the models based on sequences of the named entities detection and how we build and canonical words in order to simplify the text, as ex- evaluate the positive and negative language models. plained in the following section. We also replace some named entities like names of banks, organiza- 3.1 Lemmatization tions and people by wildcards so that the models do Working with the words in their canonical form is not depend on specific entities. for the sake of generality and simplification of the language model. Removing the morphological in- 3 The Opinum approach formation does not change the semantics of most The proposed approach is based on n-gram language phrasemes (or multiwords). models. Therefore building a consistent model is the There are some lexical forms for which we keep key for its success. In the field of machine transla- the surface form or we add some morphological in- tion a corpus with size of 500MB is usually enough formation to the token. These exceptions are the for building a 5-gram language model, depending on subject pronouns, the object pronouns and the pos- the morphological complexity of the language. sessive forms. The reason for this is that for some In the field of sentiment analysis it is very diffi- phrasemes the personal information is the key for cult to find a big corpus of context-specific opinions. deciding the positive or negative sense. For instance, Opinions labeled with stars or a positive/negative la- let us suppose that some opinions contain the se- bel can be automatically downloaded from differ- quences ent customers’ opinion websites. 
The sizes of the corpora collected that way range between 1MB and ot = “They made money from me”, 20MB for both positive and negative opinions. oi = “I made money from them”. Such a small amount of text would be suitable for bigrams and would capture the difference between Their lemmatization, referred to as L0 (·), would be2 “not good” and “really good”, but this is not enough for longer sequences like “offer you can’t refuse”. L0 (ot ) = L0 (oi ) = “SubjectPronoun make money In order to build consistent 5-gram language mod- from ObjectPronoun”, els we need to simplify the language complexity by removing all the morphology and replacing the sur- Therefore we would have equally probable face forms by their canonical forms. Therefore we P (ot |M p ) = P (oi |M p ) and P (ot |M n ) = make no difference between “offer you can’t refuse” P (oi |M n ), which does not express the actual and “offers you couldn’t refuse”. sentiment of the phrasemes. In order to capture this We also replace named entities by wildcards: per- son entity, organization entity and company entity. 2 The notation we use here is for the sake of readability and Although these replacements also simplify the lan- it slightly differs from the one we use in Opinum. 31 kind of differences we prefer to have The named entity recognition task is integrated within the lemmatization process. We collected a L1 (ot ) = “SubjectPronoun 3p make money list of names of people, places, companies and orga- from ObjectPronoun 1p”, nizations to complete the morphological dictionary L1 (oi ) = “SubjectPronoun 1p make money of Apertium. The morphological analysis module is still very fast, as the dictionary is first compiled and from ObjectPronoun 3p”. transformed to the minimal deterministic finite au- tomaton. For the dates, phone numbers, e-mails, IP The probabilities still depend on how many times do and URL we use regular expressions which are also these lexical sequences appear in opinions labeled as supported by the same Apertium module. positive or negative, but with L1 (·) we would have Regarding the list of named entities, for a given that language (Spanish in our experiments) we download P (ot |M p ) < P (oi |M p ), its Wikipedia database which is a freely available re- source. We heuristically search it for organizations, P (ot |M n ) > P (oi |M n ), companies, places and people. Based on the number that is, oi fits better the positive model than ot does, of references a given entity has in Wikipedia’s arti- and vice versa for the negative model. cles, we keep the first 1.500.000 most relevant en- In our implementation lemmatization is per- tities, which cover the entities with 4 references or more (the popular entities are referenced from tens formed with Apertium, which is an open-source rule-based machine translation engine. Thanks to to thousands of times). its modularized architecture (described in (Tyers et Finally, unknown surface forms are replaced by al., 2010)) we use its morphological analyser and the “Unknown” lemma (the known lemmas are low- its part-of-speech disambiguation module in order ercase). These would usually correspond to strange to take one lexical form as the most probable one, names of products, erroneous words and finally to in case there are several possibilities for a given sur- words which are not covered by the monolingual face. Apertium currently has morphological anal- dictionary of Apertium. 
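The following sketch imitates this preprocessing on a toy scale. It is not the Apertium-based pipeline the paper uses: the lemma dictionary, the entity gazetteer and the regular expressions below are small invented stand-ins, and only a few wildcard types are shown. It only illustrates the two ideas just described: pronouns keep their person feature while other words are reduced to lemmas, and entities, numbers and dates are replaced by wildcards, with out-of-dictionary tokens mapped to "Unknown".

import re

# Toy stand-ins for Apertium's morphological dictionary and the
# Wikipedia-derived gazetteer; the real resources are far larger.
LEMMAS = {"made": "make", "money": "money", "from": "from", "bought": "buy",
          "shares": "share", "of": "of", "in": "in"}
PRONOUNS = {"i": "SubjectPronoun_1p", "they": "SubjectPronoun_3p",
            "me": "ObjectPronoun_1p", "them": "ObjectPronoun_3p"}
PEOPLE = {"joe"}
COMPANIES = {"acme"}
DATE_RE = re.compile(r"^\d{4}$")        # very rough date stand-in, e.g. "2012"
NUM_RE = re.compile(r"^\d+([.,]\d+)?$")

def preprocess(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in PRONOUNS:             # keep person information on pronouns
            out.append(PRONOUNS[low])
        elif low in PEOPLE:
            out.append("person_entity")
        elif low in COMPANIES:
            out.append("company_entity")
        elif DATE_RE.match(low):
            out.append("date_entity")
        elif NUM_RE.match(low):
            out.append("Num")
        elif low in LEMMAS:
            out.append(LEMMAS[low])
        else:
            out.append("Unknown")       # token not covered by the dictionary
    return out

print(preprocess("They made money from me".split()))
# ['SubjectPronoun_3p', 'make', 'money', 'from', 'ObjectPronoun_1p']
print(preprocess("Joe bought 300 shares of Acme in 2012".split()))
# ['person_entity', 'buy', 'Num', 'share', 'of', 'company_entity', 'in', 'date_entity']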
Therefore our approach is ysers for 30 languages (most of them European), suitable for opinions written in a rather correct lan- which allows us to adapt Opinum to other languages guage. If unknown surfaces were not replaced, the without much effort. frequently misspelled words would not be excluded, which is useful in some domains. This is at the cost 3.2 Named entities replacement of increasing the complexity of the model, as all mis- spelled words would be included. Alternatively, the The corpora with labeled opinions are usually lim- frequently misspelled words could be added to the ited to a number of enterprises and organizations. dictionary. For a generalization purpose we make the texts in- dependent of concrete entities. We do make a differ- 3.3 Language models ence between names of places, people and organiza- The language models we build are based on n-gram tions/companies. We also detect dates, phone num- word sequences. They model the likelihood of a bers, e-mails and URL/IP. We substitute them all by word wi given the sequence of n−1 previous words, different wildcards. All the rest of the numbers are P (wi |wi−(n−1) , . . . , wi−1 ). This kind of models as- substituted by a “Num” wildcard. For instance, the sume independence between the word wi and the following subsequence would have a L2 (oe ) lemma- words not belonging to the n-gram, wj , j < i − n. tization + named entity substitution: This is a drawback for unbounded dependencies oe = “Joe bought 300 shares but we are not interested in capturing the complete grammatical relationships. We intend to capture of Acme Corp. in 2012” the probabilities of smaller constructions which may L2 (oe ) = “Person buy Num share hold positive/negative sentiment. Another assump- of Company in Date” tion we make is independence between different sen- 32 tences. models. For a given opinion ot , the log-probability In Opinum the words are lemmas (or wildcards sums can be taken: replacing entities), and the number of words among which we assume dependence is n = 5. A max- X X dot = log P (s|M p ) − log P (s|M n ) ≷ 0 imum n of 5 or 6 is common in machine transla- s∈ot s∈ot ? tion where huge amounts of text are used for build- ing a language model (Kohen et al., 2007). In our If this difference is close to zero, |dot |/wot < ε0 , case we have at our disposal a small amount of data it can be considered that the classification is neutral. but the language is drastically simplified by remov- The number of words wot is used as a normalization ing the morphology and entities, as previously ex- factor. If it is large, |dot |/wot > ε1 , it can be con- plained. We have experimentally found that n > 5 sidered that the opinion has a very positive or very does not improve the classification performance of negative sentiment. Therefore Opinum classifies the lemmatized opinions and could incur over-fitting. opinions with qualifiers: very/somewhat/little posi- In our setup we use the IRSTLM open-source li- tive/negative depending on the magnitude |dot |/wot brary for building the language model. It performs and sign(dot ), respectively. an n-gram count for all n-grams from n = 1 to The previous assessment is also accompanied by a n = 5 in our case. To deal with data sparseness confidence measure given by the level of agreement a redistribution of the zero-frequency probabilities among the different sentences of an opinion. 
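The decision rule and confidence measure of Section 3.4 can be made concrete with a small sketch. Assuming the per-sentence scores log P(s|Mp) - log P(s|Mn) have already been obtained by querying the two language models, the function below sums them into d(o), normalizes by the opinion length |d(o)|/w(o) to pick a very/somewhat/little qualifier, and estimates confidence from the Shannon entropy of the per-sentence scores over a few bins. The thresholds, the bin count and the assumed score range are invented placeholders, since the paper tunes these values experimentally on the training set.

import math

def classify_opinion(sentence_scores, word_count, eps0=0.5, eps1=3.0,
                     bins=5, score_range=(-10.0, 10.0), eta0=0.5, eta1=1.0):
    # sentence_scores: per-sentence log P(s|Mp) - log P(s|Mn) differences,
    # as obtained from the positive and negative language models.
    d = sum(sentence_scores)
    strength = abs(d) / word_count                # |d(o)| / w(o)
    polarity = "positive" if d > 0 else "negative"
    qualifier = ("little" if strength < eps0
                 else "very" if strength > eps1 else "somewhat")

    # Confidence: entropy of the per-sentence score distribution over bins;
    # low entropy means the sentences agree, high entropy means they conflict.
    lo, hi = score_range
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in sentence_scores:
        b = int((min(max(s, lo), hi) - lo) / width)
        counts[min(b, bins - 1)] += 1
    probs = [c / len(sentence_scores) for c in counts if c]
    entropy = -sum(p * math.log(p) for p in probs)
    confidence = ("high" if entropy < eta0
                  else "normal" if entropy < eta1 else "low")
    return polarity, qualifier, confidence

# Toy example: three clearly negative sentences and one mildly positive one.
print(classify_opinion([-4.0, -3.5, -5.0, 0.8], word_count=20))
# ('negative', 'somewhat', 'normal')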
If all its is performed for those sets of words which have not sentences have the same positivity/negativity, mea- been observed in the training set L(O). Relative fre- sured by sign(dsj ), sj ∈ o, with large magnitudes quencies are discounted to assign positive probabil- then the confidence is the highest. In the opposite ities to every possible n-gram. Finally a smoothing case in which there is the same number of positive method is applied. Details about the process can be and negative sentences with similar magnitudes the found in (Federico et al., 2007). For Opinum we run confidence is the lowest. The intermediate cases are IRSTLM twice during the training phase: once tak- those with sentences agreeing in sign but some of ing as input the opinions labeled as positive and once them with very low magnitude, and those with most taking the negatives: sentences of the same sign and some with different sign. We use Shannon’s entropy measure H(·) to Mp ← Irstlm (L (Op )) quantify the amount of disagreement. For its esti- Mn ← Irstlm (L (On )) mation we divide the range of possible values of d in B ranges, referred to as bins: These two models are further used for querying new B opinions on them and deciding whether it is positive X 1 or negative, as detailed in the next subsection. Hot = p(db ) log . p(db ) b=1 3.4 Evaluation and confidence The number of bins should be low (less than 10), In the Opinum system we query the M p , M n mod- otherwise it is difficult to get a low entropy mea- els with the KenLM (Heafield, 2011) open-source sure because of the sparse values of db . We set two library because it answers the queries very quickly thresholds η0 and η1 such that the confidence is said and has a short loading time, which is suitable for to be high/normal/low if Hot < η0 , η0 < Hot < η1 a web application. It also has an efficient mem- or Hot > η1 , respectively ory management which is positive for simultaneous The thresholds ε, η and the number of bins B queries to the server. are experimentally set. The reason for this is that The queries are performed at sentence level. Each they are used to tune subjective qualifiers (very/little, sentence s ∈ ot is assigned a score which is the log high/low confidence) and will usually depend on the probability of the sentence being generated by the training set and on the requirements of the applica- language model. The decision is taken by compar- tion. Note that the classification in positive or neg- ing its scores for the positive and for the negative ative sentiment is not affected by these parameters. 33 From a human point of view it is also a subjective play a different role in this simplification. The “Un- assessment but in our setup it is looked at as a fea- known” wildcard represents a 7,13% of the origi- ture implicitly given by the labeled opinions of the nal text. Entities were detected and replaced 33858 training set. times (7807 locations, 5409 people, 19049 com- panies, 502 e-mails addresses and phone numbers, 4 Experiments and results 2055 URLs, 1136 dates) which is a 4,77% of the text. There are also 46780 number substitutions, a In our experimental setup we have a set of positive 7% of the text. The rest of complexity reduction is and negative opinions in Spanish, collected from a due to the removal of the morphology as explained web site for user reviews and opinions. The opin- in Subsection 3.1. 
ions are constrained to the financial field including In our experiments, the training of Opinum con- banks, savings accounts, loans, mortgages, invest- sisted of lemmatizing and susbstituting entities of ments, credit cards, and all other related topics. The the 6990 opinions belonging the training set and authors of the opinions are not professionals, they building the language models. The positive model are mainly customers. There is no structure required is built from 4403 positive opinions and the neg- for their opinions, and they are free to tell their ex- ative model is built from 2587 negative opinions. perience, their opinion or their feeling about the en- Balancing the amount of positive and negative sam- tity or the product. The users meant to communicate ples does not improve the performance. Instead, it their review to other humans and they don’t bear in obliges us to remove an important amount of pos- mind any natural language processing tools. The au- itive opinions and the classification results are de- thors decide whether their own opinion is positive or creased by approximately 2%. This is why we use negative and this field is mandatory. all the opinions available in the training set. Both The users provide a number of stars as well: from language models are n-grams with n ∈ [1, 5]. Hav- one to five, but we have not used this information. It ing a 37% less samples for the negative opinions is interesting to note that there are 66 opinions with is not a problem thank to the smoothing techniques only one star which are marked as positive. There applied by IRSTLM. Nonetheless if the amount of are also 67 opinions with five stars which are marked training texts is too low we would recommend tak- as negative. This is partially due to human errors, ing a lower n. A simple way to set n is to take a human can notice when reading them. However the lowest value of n for which classification perfor- we have not filtered these noisy data, as removing mance is improved. An unnecessarily high n could human errors could be regarded as biasing the data overfit the models. set with our own subjective criteria. The tests are performed with 2330 opinions (not Regarding the size of the corpus, it consists of involved in building the models). For measuring the 9320 opinions about 180 different Spanish banks accuracy we do not use the qualifiers information and financial products. From these opinions 5877 but only the decision about the positive or negative are positive and 3443 are negative. There is a total of class. In Figure 1 we show the scores of the opin- 709741 words and the mean length of the opinions ions for the positive and negative models. The score is 282 words for the positive and 300 words for the is the sum of scores of the sentences, thus it can be negative ones. In the experiments we present in this seen that longer opinions (bigger markers) have big- work, we randomly divide the data set in 75% for ger scores. Independence of the size is not necessary training and 25% for testing. We check that the dis- for classifying in positive and negative. In the diag- tribution of positive and negative remains the same onal it can be seen that positive samples are close among test and train. to the negative ones, this is to be expected: both After the L2 (·) lemmatization and entity substitu- positive and negative language models are built for tion, the number of different words in the data set is the same language. 
However the small difference 13067 in contrast with the 78470 different words in in their scores yields an 81,98% success rate in the the original texts. In other words, the lexical com- classification. An improvement of this rate would be plexity is reduced by 83%. Different substitutions difficult to achieve taking into account that there is 34 Test Original Spanish text Meaning in English Result “Al tener la web, no pierdes As you have the website you Positive el tiempo por teléfono.” don’t waste time on the phone. “En el telfono os hacen perder They waste your time on the phone Similar Negative el tiempo y no tienen web.” and they don’t have a website. words, “De todas formas me different Anyway, they solved my problem. Positive solucionaron el problema.” meaning “No hay forma de que There is no way to make them Negative me solucionen el problema.” solve my problem. “Con XXXXXX me fue muy bien.” I was fine with XXXXXX. Positive “Hasta que surgieron los problemas.” Until the problems began. Negative A negative “Por hacerme cliente me regalaban They gave me 100 euros for Positive opinion 100 euros.” becoming a client. of several “Pero una vez que eres cliente But once you are a client, they Negative sentences no te aportan nada bueno.” they do not offer anything good. I am considering switching to “Estoy pensando cambiar de banco.” Negative another bank. The complete “Con XXXXXX me fue muy I was fine with XXXXXX Negative opinion [. . .] cambiar de banco.” [. . .] switching to another bank. Table 1: Some tests on Opinum for financial opinions in Spanish. noise in the training set and that there are opinions pared to the 69% baseline given by a classifier based without a clear positive or negative feeling. A larger on the frequencies of single words. corpus would also contribute to a better result. Even Similarity to the Language Models and text sizes (Test set) though we have placed many efforts in simplifying 0 the text, this does not help in the cases in which a construction of words is never found in the corpus. −500 A construction could even be present in the corpus Similarity to negative LM but in the wrong class. For instance, in our corpus −1000 “no estoy satisfecho” (meaning “I am not satisfied”) appears 3 times among the positive opinions and 0 −1500 times among the negative ones. This weakness of the corpus is due to sentences referring to a money −2000 back guarantee: “si no esta satisfecho le devolvemos el dinero” which are used in a positive context. −2500 Usually in long opinions a single sentence does −2500 −2000 −1500 −1000 −500 0 Similarity to positive LM not change the positiveness score. For some exam- ples see Table 4. In long opinions every sentence is Figure 1: Relation between similarity to the models (x prone to show the sentiment except for the cases of and y axis) and the relative size of the opinions (size of irony or opinions with an objective part. The per- the points). formance of Opinum depending on the size of the opinions of the test set is shown in Figure 2. In Fig- The query time of Opinum on a standard com- ure 3 the ROC curve of the classifier shows its sta- puter ranges from 1, 63 s for the shortest opinions to bility against changing the true-positive versus false- 1, 67 s for those with more than 1000 words. In our negative rates. A comparison with other methods setup, most of the time is spent in loading the mor- would be a valuable source of evaluation. 
It is not phological dictionary, few milliseconds are spent in feasible at this moment because of the lack of free the morphological analysis of the opinion and the customers opinions databases and opionion classi- named entity substitution, and less than a millisec- fiers as well. The success rate we obtain can be com- ond is spent in querying each model. In a batch 35 Distribution of test−text sizes ROC on test set 160 1 Successes 140 Errors 120 0.8 True positive rate 100 events 0.6 80 60 0.4 40 20 0.2 0 0 500 1000 1500 0 Opinion size (characters) 0 0.2 0.4 0.6 0.8 1 False positive rate Figure 2: Number of successful and erroneous classifi- cations (vertical axis) depending on the size of the test Figure 3: Receiver Operating Characteristic (ROC) curve opinions (horizontal axis). of the Opinum classifier for financial opinions. mode, the morphological analysis could be done the data and the subjectivity of the labeling in posi- for all the opinions together and thousands of them tive and negative. The next steps would be to study could be evaluated in seconds. In Opinum’s web in- the possibility to classify in more than two classes terface we only provide the single opinion queries by using several language models. The use of an and we output the decision, the qualifiers informa- external neutral corpus should also be considered in tion and the confidence measure. the future. It is necessary to perform a deeper analysis of the 5 Conclusions and future work impact of lexical simplification on the accuracy of the language models. It is also very important to Opinum is a sentiment analysis system designed for establish the limitations of this approach for differ- classifying customer opinions in positive and neg- ent domains. Is it equally successful for a wider do- ative. Its approach based on morphological sim- main? For instance, trying to build the models from plification, entity substitution and n-gram language a mixed set of opinions of the financial domain and models, makes it easily adaptable to other classifica- the IT domain. Would it work for a general domain? tion targets different from positive/negative. In this Regarding applications, Opinum could be trained work we present experiments for Spanish in the fi- for a given domain without expert knowledge. Its nancial domain but Opinum could easily be trained queries are very fast which makes it feasible for free for a different language or domain. To this end an on-line services. An interesting application would Apertium morphological analyser would be neces- be to exploit the named entity recognition and as- sary (30 languages are currently available) as well sociate positive/negative scores to the entities based as a labeled data set of opinions. Setting n for the n- on their surrounding text. If several domains were gram models depends on the size of the corpus but available, then the same entities would have differ- it would usually range from 4 to 6, 5 in our case. ent scores depending on the domain, which would There are other parameters which have to be exper- be a valuable analysis. imentally tuned and they are not related to the pos- itive or negative classification but to the subjective qualifier very/somewhat/little and to the confidence References measure. Philipp Koehn, H. Hoang, A. Birch, C. Callison-Burch, The classification performance of Opinum in M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. our financial-domain experiments is 81,98% which Moran, R. Zens, C. Dyer, O. Bojar, A. 
Constantin and would be difficult to improve because of the noise in E. Herbst. 2007. Moses: open source toolkit for sta- 36 tistical machine translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Prague, Czech Republic, 2007. Sasha Blair-Goldensohn, Tyler Neylon, Kerry Hannan, George A. Reis, Ryan Mcdonald and Jeff Reynar. 2008. Building a sentiment summarizer for local ser- vice reviews. In NLP in the Information Explosion Era, NLPIX2008, Beiging, China, April 22nd, 2008. Hu, Minqing and Liu, Bing. 2004. Mining and sum- marizing customer reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowl- edge discovery and data mining, Seattle, WA, USA, 2004. Ryan McDonald, Kerry Hannan, Tyler Neylon, Mike Wells and Jeff Reynar. 2007. Structured Models for Fine-to-Coarse Sentiment Analysis. Proceedings of the 45th Annual Meeting of the Association of Com- putational Linguistics, 2007. John Blitzer, Mark Dredze and Fernando Pereira. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. ACL, 2007. Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 2002. Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, Manfred Stede. 2011. Lexicon-based meth- ods for sentiment analysis. Computational Linguis- tics, Vol. 37, Nr. 2, pp. 267–307, June 2011. Stephan Raaijmakers, Khiet P. Truong and Theresa Wil- son. 2008. Multimodal Subjectivity Analysis of Mul- tiparty Conversation. EMNLP, 2008. Tyers, F. M., Snchez-Martnez, F., Ortiz-Rojas, S. and Forcada, M. L. 2010. Free/open-source resources in the Apertium platform for machine translation re- search and development. The Prague Bulletin of Mathematical Linguistics, No. 93, pp. 67–76, 2010. Marcello Federico and Mauro Cettolo. 2007. Efficient Handling of N-gram Language Models for Statistical Machine Translation. ACL 2007 Workshop on SMT, Prague, Czech Republic, 2007. Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. ACL 6th Workshop on SMT, Edinburgh, Scotland, UK, July 30–31, 2011. 37 Sentimantics: Conceptual Spaces for Lexical Sentiment Polarity Representation with Contextuality Amitava Das Björn Gambäck Department of Computer and Information Science Norwegian University of Science and Technology Sem Sælands vei 7-9, NO-7094 Trondheim, Norway
[email protected] [email protected](1997) proved the effectiveness of empirically Abstract building a sentiment lexicon. Turney (2002) suggested review classification by Thumbs Up and Current sentiment analysis systems rely on Thumbs Down, while the concept of prior polarity static (context independent) sentiment lexica was firmly established with the introduction lexica with proximity based fixed-point of SentiWordNet (Esuli et al., 2004). prior polarities. However, sentiment- More or less all sentiment analysis researchers orientation changes with context and these agree that prior polarity lexica are necessary for lexical resources give no indication of polarity classification, and prior polarity lexicon which value to pick at what context. The development has been attempted for other general trend is to pick the highest one, but languages than English as well, including for which that is may vary at context. To Chinese (He et al., 2010), Japanese (Torii et al., overcome the problems of the present 2010), Thai (Haruechaiyasak et al., 2010), and proximity-based static sentiment lexicon Indian languages (Das and Bandyopadhyay, 2010). techniques, the paper proposes a new way Polarity Classification Using the Lexicon: High to represent sentiment knowledge in a accuracy for prior polarity identification is very Vector Space Model. This model can store hard to achieve, as prior polarity values are dynamic prior polarity with varying approximations only. Therefore the prior polarity contextual information. The representation method may not excel alone; additional techniques of the sentiment knowledge in the are required for contextual polarity Conceptual Spaces of distributional disambiguation. The use of other NLP methods or Semantics is termed Sentimantics. machine learning techniques over human produced prior polarity lexica was pioneered by Pang et al. 1 Introduction (2002). Several researches then tried syntactic- Polarity classification is the classical problem statistical techniques for polarity classification, from where the cultivation of Sentiment Analysis reporting good accuracy (Seeker et al., 2009; (SA) started. It involves sentiment / opinion Moilanen et al., 2010), making the two-step classification into semantic classes such as methodology (sentiment lexicon followed by positive, negative or neutral and/or other fine- further NLP techniques) the standard method for grained emotional classes like happy, sad, anger, polarity classification. disgust,surprise and similar. However, for the Incorporating Human Psychology: The present task we stick to the standard binary existing reported solutions or available systems are classification, i.e., positive and/or negative. still far from perfect or fail to meet the satisfaction The Concept of Prior Polarity: Sentiment level of the end users. The main issue may be that polarity classification (“The text is positive or there are many conceptual rules that govern negative?”) started as a semantic orientation sentiment and there are even more clues (possibly determination problem: by identifying the semantic unlimited) that can convey these concepts from orientation of adjectives, Hatzivassiloglou et al. realization to verbalization of a human being (Liu, 38 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 38–46, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics 2010). 
The most recent trends in prior polarity For English we choose the widely used MPQA3 adopt an approach to sentiment knowledge corpus, but for the Bengali we had to create our representation which lets the mental lexicon model own corpus as discussed in the following section. hold the contextual polarity, as in human mental The remainder of the paper then concentrates on knowledge representation. the problems with using prior polarity values only, Cambria et al. (2011) made an important in Section 4, while the Sentimantics concept proper contribution in this direction by introducing a new is discussed in Section 5. Finally, some initial paradigm: Sentic Computing1, in which they use an conclusions are presented in Section 6. emotion representation and a Common Sense- based approach to infer affective states from short 2 Bengali Corpus texts over the web. Grassi (2009) conceived the Human Emotion Ontology as a high level ontology News text can be divided into two main types: (1) supplying the most significant concepts and news reports that aim to objectively present factual properties constituting the centerpiece for the information, and (2) opinionated articles that description of human emotions. clearly present authors’ and readers’ views, The Proposed Sentimantics: The present paper evaluation or judgment about some specific events introduces the concept of Sentimantics which is or persons (and appear in sections such as related to the existing prior polarity concept, but ‘Editorial’, ‘Forum’ and ‘Letters to the editor’). A differs from it philosophically in terms of Bengali news corpus has been acquired for the contextual dynamicity. It ideologically follows the present task, based on 100 documents from the path of Minsky (2006), Cambria et al. (2011) and ‘Reader’s opinion’ section (‘Letters to the Editor’) (Grassi, 2009), but with a different notion. from the web archive of a popular Bengali Sentiment analysis research started years ago, newspaper. 4 In total, the corpus contains 2,235 but still the question “What is sentiment or sentences (28,805 word forms, of which 3,435 are opinion?” remains unanswered! It is very hard to distinct). The corpus has been annotated with define sentiment or opinion, and to identify the positive and negative phrase polarities using regulating or the controlling factors of sentiment; Sanchay5, the standard annotation tool for Indian an analytic definition of opinion might even be languages. The annotation was done semi- impossible (Kim and Hovy, 2004). Moreover, no automatically: a module marked the sentiment concise set of psychological forces could be words from SentiWordNet (Bengali)6 and then the defined that really affect the writers’ sentiments, corpus was corrected manually. i.e., broadly the human sentiment. Sentimantics tries to solve the problem with a 3 The Syntactic Polarity Classifier practical necessity and to overcome the problems Adhering to the standard two-step methodology of the present proximity-based static sentiment (i.e., prior polarity lexicon followed by any NLP lexicon techniques. technique), a Syntactic-Statistical polarity As discussed earlier, the two-step methodology classifier based on Support Vector Machines is the most common one in practice. 
As described (SVMs) has been quickly developed using in Section 3, a syntactic-polarity classifier was SVMTool.7 The intension behind the development therefore developed, to examine the impact of of this syntactic polarity classifier was to examine proposed Sentimantics concept, by comparing it to the effectiveness and the limitations of the standard the standard polarity classification technique. The two-step methodology at the same time. strategy was tested on both English and Bengali. The selection of an appropriate feature set is The intension behind choosing two distinct crucial when working with Machine Learning language families is to establish the credibility of techniques such as SVM. We decided on a feature the proposed methods. 3 http://www.cs.pitt.edu/mpqa/ 4 http://www.anandabazar.com/ 5 http://ltrc.iiit.ac.in/nlpai_contest07/Sanchay/ 6 http://www.amitavadas.com/sentiwordnet.php 1 7 http://sentic.net/sentics/ http://www.lsi.upc.edu/~nlp/SVMTool/ 39 Polarity Precision Recall Precision Features Eng. Bng. Eng. Bng. Eng. Bng. Sentiment Lexicon 50.50% 47.60% Total 76.03% 70.04% 65.8% 63.02% +Negative Words 55.10% 50.40% Positive 58.6% 56.59% 54.0% 52.89% +Stemming 59.30% 56.02% Negative 76.3% 75.57% 69.4% 65.87% + Function Words 63.10% 58.23% Table 1: Overall and class-wise results of + Part of Speech 66.56% 61.90% syntactic polarity classification +Chunking 68.66% 66.80% set including Sentiment Lexicon, Negative Words, +Dependency Relations 76.03% 70.04% Stems, Function Words, Part of Speech and Dependency Relations, as most previous research Table 2: Performance of the syntactic polarity agree that these are the prime features to detect the classifier by feature ablation sentimental polarity from text (see, e.g., Pang and Lee, 2005; Seeker et al., 2009; Moilanen et al., (Eng.) and 47.60% (Bng.) which can be considered 2010; Liu et. al., 2005). as baselines. As seen in Table 2, incremental use of Sentiment Lexicon: SentiWordNet 3.0 8 for other features like negative words, function words, English and SentiWordNet (Bengali) for Bengali. part of speech, chunks and tools like stemming Negative Words: Manually created. Contains improved the precision of the system to 68.66% 80 entries collected semi-automatically from both (Eng.) and 66.80% (Bng.). Further use of syntactic the MPQA9 corpus and the Movie Review dataset10 features in terms of dependency relations improved by Cornell for English. 50 negative words were the system precision to 76.03% (Eng.) and 70.04% collected manually for Bengali. (Bng.). The feature ablation proves the Stems: The Porter Stemmer11 for English. The accountability of the two-step polarity Bengali Shallow Parser12 was used to extract root classification technique. The prior polarity lexicon words (from morphological analysis output). (completely dictionary-based) approach gives Function Words: Collected from the web. 13 about 50% precision; the further improvements of Only personal pronouns are dropped for the the system are obtained by other NLP techniques. present task. A list of 253 entries was collected To support our argumentation for choosing manually from the Bengali corpus. SVM, we tested the same classification problem POS, Chunking and Dependency with another machine learning technique, Relations:The Stanford Dependency parser 14 for Conditional Random Fields (CRF)15 with the same English. The Bengali Shallow Parser was used to data and setup. The performance of the CRF-based extract POS, chunks and dependency relations. 
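A compressed sketch of this two-step setup is given below. It substitutes scikit-learn's LinearSVC for SVMTool and replaces the real resources (SentiWordNet, the negation list, the Porter stemmer, the parsers) with tiny invented dictionaries and a crude suffix-stripping "stemmer", so it only shows how lexical and shallow features are assembled into one vector per sentence before the SVM is trained; it is not the authors' implementation, and the training sentences are made up.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the paper's lexical resources.
LEXICON = {"good": 0.8, "great": 0.9, "bad": -0.7, "waste": -0.8, "problem": -0.6}
NEGATIONS = {"not", "no", "never"}
FUNCTION_WORDS = {"the", "a", "of", "on", "is", "was", "there", "to"}

def stem(word):
    # Crude suffix stripping in place of the Porter stemmer.
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def features(sentence):
    toks = [stem(t.lower()) for t in sentence.split()]
    feats = {"lex_score": sum(LEXICON.get(t, 0.0) for t in toks),
             "has_negation": any(t in NEGATIONS for t in toks),
             "n_function": sum(t in FUNCTION_WORDS for t in toks)}
    for t in toks:                         # unigram (stemmed) word features
        feats["w=" + t] = 1.0
    return feats

# Invented training data; the paper uses MPQA (English) and a Bengali corpus.
train = [("the service was great", "pos"), ("good support on the phone", "pos"),
         ("they waste my time", "neg"), ("there is no solution to the problem", "neg")]
vec = DictVectorizer()
X = vec.fit_transform([features(s) for s, _ in train])
y = [label for _, label in train]
clf = LinearSVC().fit(X, y)
print(clf.predict(vec.transform([features("not a good bank, they waste your money")])))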
model is much worse than the SVM, with a precision of 70.04% and recall of 67.02% for The results of SVM-based syntactic classification English, resp. 61.23% precision and 55.00% recall for English and Bengali are presented in Table 1, for Bengali. The feature ablation method was also both in total and for each polarity class separately. tested for the CRF model and the performance was To understand the effects of various features on more or less the same when the dictionary features the performance of the system, we used the feature and lexical features were used (i.e., SentiWordNet ablation method. The dictionary-based approach + Negative Words + Stemming + Function Words using only SentiWordNet gave a 50.50% precision + Part of Speech). But it was difficult to increase the performance level for the CRF by using 8 syntactic features like chunking and dependency http://sentiwordnet.isti.cnr.it/ 9 relations. SVMs work excellent to normalize this http://www.cs.pitt.edu/mpqa/ 10 http://www.cs.cornell.edu/People/pabo/movie-review-data/ dynamic situation. 11 http://tartarus.org/martin/PorterStemmer/java.txt It has previously been noticed that multi-engine 12 ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_par ser.php based methods work well for this type of 13 http://www.flesl.net/Vocabulary/Single- heterogeneous tagging task, e.g., in Named Entity word_Lists/function_word_list.php 14 15 http://nlp.stanford.edu/software/lex-parser.shtml http://crfpp.googlecode.com/svn/trunk/doc/index.html 40 Recognition (Ekbal and Bandyopadhyay, 2010) Eng. Bng. and POS tagging (Shulamit et al., 2010). We have not tested with that kind of setup, but rather looked Types Numbers (%) English: n/28,430 at the problem from a different perspective, Bengali: n/30,000 questioning the basics: Is the two-step methodology Total Token 115,424 30,000 for the classification task ideal or should we look Positivity > 0 ∨ Negativity > 0 28,430 30,000 for other alternatives? Positivity > 0 ∧ Negativity > 0 6619 7,654 (23.28 %) (25.51 %) 4 What Knowledge at What Level? Positivity > 0 ∧ Negativity = 0 10,484 8,934 (36.87 %) (29.78 %) In this section we address some limitations Positivity = 0 ∧ Negativity > 0 11,327 11,780 (39.84 %) (39.26 %) regarding the usage of prior polarity values from Positivity > 0 ∧ Negativity > 0 ∧ 3,187 2,677 existing of prior polarity lexical resources. Dealing |Positivity-Negativity| ≥ 0.2 (11.20 %) (8.92 %) with unknown/new words is a common problem. It becomes more difficult for sentiment analysis Table 3: SentiWordNet(s) statistics because it is very hard to find out any contextual The main concern of the present task is the clue to predict the sentimental orientation of any ambiguous entries from SentiWordNet(s). The unknown/new word. There is another problem: basic hypothesis is that if we can add some sort of word sense disambiguation, which is indeed a contextual information with the prior polarity significant subtask when applying a resource like scores in the sentiment lexicon, the updated rich SentiWordNet (Cem et al., 2011). lexicon network will serve better than the existing A prior polarity lexicon is attached with two one, and reduce or even remove the need for probabilistic values (positivity and negativity), but further processing to disambiguate the contextual according to the best of our knowledge no previous polarity. How much contextual information would research clarifies which value to pick in what be needed and how this knowledge should be context? 
– and there is no information about this in represented could be a perpetual debate. To answer SentiWordNet. The general trend is to pick the these questions we introduce Sentimantics: highest one, but which may vary by context. An Distributed Semantic Lexical Models to hold the example may illustrate the problem better: Suppose sentiment knowledge with context. a word “high” (Positivity: 0.25, Negativity: 0.125 from SentiWordNet) is attached with a positive 5 Technical Solutions for Sentimantics polarity (its positivity value is higher than its negativity value) in the sentiment lexicon, but the In order to propose a model of Sentimantics we polarity of the word may vary in any particular use. started with existing resources such as Sensex reaches high+. ConceptNet 16 (Havasi et al., 2007) and Prices go high-. SentiWordNet for English, and SemanticNet (Das Hence further processing is required to and Bandyopadhyay, 2010) and SentiWordNet disambiguate these types of words. Table 3 shows (Bengali) for Bengali. The common sense lexica how many words in the SentiWordNet(s) are like ConceptNet and SemanticNet are developed ambiguous and need special care. There are 6,619 for general purposes, and to formalize (Eng.) and 7,654 (Bng.) lexicon entries in Sentimantics from these resources is problematic SentiWordNet(s) where both the positivity and the due to lack of dimensionality. Section 5.1 presents negativity values are greater than zero. Therefore a more rational explanation with empirical results. these entries are ambiguous because there is no In the end we developed a Syntactic Co- clue in the SentiWordNet which value to pick in Occurrence Based Vector Space Model to hold the what context. Similarly, there are 3,187 (Eng.) and Sentimantics from scratch by a corpus driven semi- 2,677 (Bng.) lexical entries in SentiWordNet(s) supervised method (Section 5.2). This model whose positivity and negativity value difference is performs better than the previous one and quite less than 0.2. These are also ambiguous words. satisfactory. Generally extracting knowledge from 16 http://csc.media.mit.edu/conceptnet 41 this kind of VSM is very expensive algorithmically because it is a very high dimensional network. Another important limitation of this type of model is that it demands very well defined processed input to extract knowledge, e.g., Input: (high) Context: (sensex, share market, point). Philosophically, the motivation of Sentimantics is to provide a rich lexicon network which will serve better than the existing one and reduce the requirement of further language processing techniques to disambiguate the contextual polarity. This model consists of relatively fewer dimensions. The final model is the best performing lexicon network model, which could be described as the acceptable solution for the Sentimantics Figure 1: The Sentimantics Network problem. The details of the proposed models are necessary to understand the root form of any word described in the following. and for dictionary comparison. The corpus-driven method assigns each sentiment word in the 5.1 Semantic Network Overlap, SNO developed lexical network a contextual prior We started experimentation with network overlap polarity, as shown in Figure 1. techniques. The network overlap technique finds overlaps of nodes between two lexical networks: Semantic network-based polarity calculation namely ConceptNet-SentiWordNet for English and Once the desired lexical semantic network to hold SemanticNet-SentiWordNet (Bengali) for Bengali. 
the Sentimantics has been developed, we look The working principle of the network overlap further to leverage the developed knowledge for technique is very simple. The algorithm starts with the polarity classification task. The methodology any SentiWordNet node and finds its closest of contextual polarity extraction from the network neighbours from the commonsense networks is very simple, and only a dependency parser and (ConceptNet or SemanticNet). If, for example, a stemmer are required. For example, consider the node chosen from SentiWordNet is “long/ ”, the following sentence. closest neighbours of this concept extracted from We have been waiting in a long queue. the commonsense networks are: “road (40%) / To extract the contextual polarity from this waiting (62%) / car (35%) / building (54%) / queue sentence it must be known that waiting-long-queue (70%) …” The association scores (as the previous are interconnected with dependency relations, and example) are also extracted to understand the stemming is a necessary pre-processing step for semantic similarity association. Hence the desired dictionary matching. To extract contextual polarity Sentimantics lexical network is developed by this from the developed network the desired input is network overlap technique. The next prime (long) with its context (waiting, queue). The challenge is to assign contextual polarity to each accumulated contextual polarity will be Neg: association. For this a corpus-based method was (0.50+0.35)=0.85. For comparison if the score was used; based on the MPQA17 corpus for English and extracted from SentiWordNet (English) it would be the corpus developed by us for. The corpora are Pos: 0.25 as this is higher than the negative score pre-processed with dependency relations and (long: Pos: 0.25, Neg: 0.125 in SentiWordNet). stemming using the same parsers and stemmers as in Section 3. The dependency relations are SNO performance and limitations necessary to understand the relations between the An evaluation proves that the present Network evaluative expression and other modifier-modified Overlap technique outperforms the previous chunks in any subjective sentence. Stemming is syntactic polarity classification technique. The precision scores for this technique are 62.3% for 17 http://www.cs.pitt.edu/mpqa/ English and 59.7% for Bengali on the MPQA and 42 Solved By 5.2 Starting from Scratch: Syntactic Co- Semantic Occurrence Network Construction Type Number Overlap Technique A syntactic word co-occurrence network was Positivity > 0 ∧ Eng. 6,619 2,304 (34.80 %) constructed for only the sentimental words from Negativity > 0 Bng. 7,654 2,450 (32 %) the corpora. The syntactic network is defined in a |Positivity - Eng. 3,187 957 (30 %) way similar to previous work such the Spin Model Negativity| ≥ 0.2 Bng. 2,677 830 (31.5 %) (Takamura et al., 2005) and Latent Semantic Analysis to compute the association strength with Table 4: Results of Semantic Overlap seed words (Turney and Litman, 2003). The Bengali corpora: clearly higher than the baselines hypothesis is that all the words occurring in the based on SentiWordNet (50.5 and 47.6%; Table 2). syntactic territory tend to have similar semantic Still, the overall goal to “reduce/remove the orientation. In order to reduce dimensionality requirement to use further NLP techniques to when constructing the network, only the open word disambiguate the contextual polarity” could not be classes noun, verb, adjective and adverb are established empirically. 
To understand why, we included, as those classes tend to have maximized performed an analysis of the errors and missed sentiment properties. Involving fewer features cases of the semantic network overlap technique: generates VSMs with fewer dimensions. most of the errors were caused by lack of coverage. For the network creation we again started with ConceptNet and SemanticNet were both developed SentiWordNet 3.0 to mark the sentiment words in from the news domain and for a different task. The the MPQA corpus. As the MPQA corpus is marked comparative coverage of SentiWordNet (English) at expression level, SentiWordNet was used to and MPQA is 74%, i.e., if we make a complete set mark only the lexical entries of the subjective of sentiment words from MPQA then altogether expressions in the corpus. As before, the Stanford 74% of that set is covered by SentiWordNet, which POS tagger and the Porter Stemmer were used to is very good and an acceptable coverage. For get POS classes and stems of the English terms, Bengali the comparative coverage is 72%, which is while SentiWordNet (Bengali), the Bengali corpus also very good. However, the comparative and the Bengali processors were used for Bengali. coverage of SentiWordNet (English)-ConceptNet Features were extracted from a ±4 word window and SentiWordNet (Bengali)-SemanticNet is very around the target terms. To normalize the extracted low: 54% and 50% respectively: only half of the words from the corpus we used CF-IOF, concept sentiment words in the SentiWordNets are covered frequency-inverse opinion frequency (Cambria et by ConceptNet (Eng) resp. SemanticNet (Bng). al., 2011), while a Spectral Clustering technique Now look at the evaluation in Table 4 which we (Dasgupta and Ng, 2009) was used for the in-depth report to support our empirical reasoning behind analysis of word co-occurrence patterns and their the question “What knowledge to keep at what relationships at discourse level. The clustering level?” It shows how much fixed point-based static algorithm partitions a set of lexica into a finite prior polarity is being resolved by the Semantic number of groups or clusters in terms of their Network Overlap technique. The comparative syntactic co-occurrence relatedness. results are noteworthy but not satisfactory: only Numerical weights were assigned to the words 34% (Eng.) and 32% (Bng.) of the cases of and then the cosine similarity measure was used to “Positivity > 0 ∧ Negativity > 0” resp. 30% (Eng.) calculate vector similarity: and 31.5 % (Bng.) of the cases of “|Positivity - → → → → N s qk ,d j = qk .d j = ∑ wi ,k × wi , j -----(1) Negativity| ≥ 0.2” are resolved by this technique. i=1 The results are presented in Table 4. When the lexicon collection is relatively static, it As a result of the error analysis, we instead makes sense to normalize the vectors once and decided to develop a Vector Space Model from store them, rather than include the normalization in scratch in order to solve the Sentimantics problem the similarity metric (as in Equation 2). ∑ N and to reach a satisfactory level of coverage. The → → wi ,k × wi , j s qk ,d j = i=1 -------(2) experiments in this direction are reported below. 
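The cosine-similarity step (Equations 1-2, i.e. s(w_k, w_j) = sum_i w_{i,k} w_{i,j} / (sqrt(sum_i w_{i,k}^2) * sqrt(sum_i w_{i,j}^2))) can be illustrated in a few lines. The sketch below builds co-occurrence vectors from a +/-4-token window over an invented toy corpus and compares them with the normalized dot product; the real model additionally restricts itself to sentiment-bearing open-class words, weights terms by CF-IOF and feeds the similarities to spectral clustering, none of which is reproduced here.

import math
from collections import defaultdict

def cooccurrence_vectors(sentences, window=4):
    # vectors[w][c] = times context word c occurs within +/-window of w.
    vectors = defaultdict(lambda: defaultdict(int))
    for toks in sentences:
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    vectors[w][toks[j]] += 1
    return vectors

def cosine(u, v):
    # Normalized dot product over the shared context words (cf. Equation 2).
    dot = sum(u[c] * v.get(c, 0) for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented toy sentences echoing the paper's "high" example.
corpus = [["sensex", "reaches", "high", "broker", "happy"],
          ["nasdaq", "trades", "high", "broker", "gains"],
          ["prices", "go", "high", "population", "suffers"]]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["sensex"], vecs["nasdaq"]))       # higher similarity
print(cosine(vecs["sensex"], vecs["population"]))   # lower similarity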
∑ w ∑ w N 2 N 2 × i=1 i ,k j =1 j ,k 43 ID Lexicon 1 2 3 1 Broker 0.63 0.12 0.04 1 NASDAQ 0.58 0.11 0.06 1 Sensex 0.58 0.12 0.03 1 High 0.55 0.14 0.08 2 India 0.11 0.59 0.02 2 Population 0.15 0.55 0.01 2 High 0.12 0.66 0.01 Market Figure 2: Semantic affinity graph for contextual 3 0.13 0.05 0.58 prior polarity 3 Petroleum 0.05 0.01 0.86 3 UAE 0.12 0.04 As an example, the lexicon level semantic 0.65 High orientation from Figure 2 could be calculated as 3 0.03 0.01 0.93 follows: Table 5: Five example cluster centroids ∑ n v S (w ,w ) = d i j k=0 k * w jp ----(3) or After calculating the similarity measures and using k ∑ n a predefined threshold value (experimentally set to v m =∑ *∏ l c * w jp ---(4) m k=0 k 0.5), the lexica are classified using a standard c=0 k c=0 spectral clustering technique: Starting from a set of Where Sd(wi,wj) is the semantic orientation of wi initial cluster centers, each document is assigned to with wj given as context. Equations (3) and (4) are the cluster whose center is closest to the document. for intra-cluster and inter-cluster semantic distance After all documents have been assigned, the center measure respectively. k is the number of weighted of each cluster is recomputed as the centroid or → → vertices between two lexica wi and wj. vk the mean µ j (where µ j is the clustering coefficient) weighted vertex between two lexica, m the number of its members: of cluster centers between them, lc the distance between their cluster centers, and wpj the polarity ( )∑ → → µ = 1/ c j x∈c j x of the known word wj. This network was created and used in particular Table 5 gives an example of cluster centroids by to handle unknown words. For the prediction of spectral clustering. Bold words in the lexicon name semantic orientation of an unknown word, a bag- column are cluster centers. Comparing two of-words method was adopted: the bag-of-words members of Cluster2, ‘India’ and ‘Population’, it chain was formed with most of the known words, can be seen that ‘India’ is strongly associated with syntactically co-located. Cluster2 (p=0.59), but has some affinity with the A classifier based on Conditional Random other clusters as well (e.g., p=0.11 with Cluster1). Fields was then trained on the corpus with a small These non-zero values are still useful for set of features: co-occurrence distance, ConceptNet calculating vertex weights during the contextual similarity scores, known or unknown based on polarity calculation. SentiWordNet. With the help of these very simple Polarity Calculation using the Syntactic Co- features, the CRF classifier identifies the most Occurrence Network probable bag-of-words to predict the semantic orientation of an unknown word. As an example: The relevance of the semantic lexicon nodes was Suppose X marks the unknown words and that the computed by summing up the edge scores of those probable bag-of-words are: edges connecting a node with other nodes in the same cluster. As the cluster centers also are 9_11-X-Pentagon-USA-Bush interconnected with weighted vertices, inter-cluster Discuss-Terrorism-X-President relations could be calculated in terms of weighted Middle_East-X-Osama network distance between two nodes within two separate clusters. 44 Once the target bag-of-words has been identified, Solved By Syntactic the following equation can be used to calculate the Type Number Co-Occurrence polarity of the unknown word X. Network Positivity>0 && Eng. 6,619 2978 (45 %) Discuss-0.012-Terrorism-0.0-X-0.23- Negativity>0 Bng. 
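To make the unknown-word handling concrete, a minimal sketch is given below. It assumes the CRF has already produced the bag of known co-located words, and that for each of them an association weight (e.g. a ConceptNet edge score) and a prior polarity are available; the numbers are invented, and the weighted average used here is only a simplified reading of the combination scheme the paper describes.

def unknown_word_polarity(neighbors):
    # neighbors: (association_weight, prior_polarity) pairs for the known words
    # that were placed in the unknown word's bag-of-words.
    total_weight = sum(w for w, _ in neighbors)
    if total_weight == 0:
        return 0.0
    return sum(w * p for w, p in neighbors) / total_weight

# Unknown word X co-located with "Terrorism"- and "President"-like neighbors;
# weights and polarities below are illustrative only.
bag = [(0.012, -0.6),
       (0.23, -0.1),
       (0.15, -0.8)]
print(unknown_word_polarity(bag))   # negative value, so X is predicted negative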
7,654 3138 (41 %) President |Positivity- Eng. 3,187 1370 (43 %) Negativity|>=0.2 Bng. 2,677 1017 (38 %) The scores are extracted from ConceptNet and the equation is: Table 6: Results of the syntactic co-occurrence n n based technique wxp = ∑ ei * ∑p i -----(5) i=0 j =1 Where ei is the edge distances extracted from ConceptNet and Pi is the polarity information of 6 Conclusions the lexicon in the bag-of-words. The paper has introduced Sentimantics, a new way The syntactic co-occurrence network gives to represent sentiment knowledge in the reasonable performance increment over the normal Conceptual Spaces of distributional Semantics by linear sentiment lexicon and the Semantic Network using in a Vector Space Model. This model can Overlap technique, but it has some limitations: it is store dynamic prior polarity with varying difficult to formulate a good equation to calculate contextual information. It is clear from the semantic orientation within the network. The experiments presented that developing the Vector formulation we use produced a less distinguishing Space Model from scratch is the best solution to value for different bag of words. As example in solving the Sentimantics problem and to reach a Figure 2: satisfactory level of coverage. Although it could 0.3 + 0.3 = 0.3 not be claimed that the two issues “What (High, Sensex)= 2 0.22 + 0.35 knowledge to keep at what level?” and = 0.29 “reduce/remove the requirement of using further (Price, High)= 2 NLP techniques to disambiguate the contextual The main problem is that it is nearly impossible polarity” were fully solved, our experiments show to predict polarity for an unknown word. Standard that a proper treatment of Sentimantics can polarity classifiers generally degrade in radically increase sentiment analysis performance. performance in the presence of unknown words, As we showed by the syntactic classification but the Syntactic Co-Occurrence Network is very technique the lexicon model only provides 50% good at handling unknown or new words. accuracy and further NLP techniques increase it to The performance of the syntactic co-occurrence 70%, whereas by the VSM based technique it measure on the corpora is shown in Table 6, with a reaches 70% accuracy while utilizing fewer 70.0% performance for English and 68.0% for language processing resources and techniques. Bengali; a good increment over the Semantic To the best of our knowledge this is the first Network Overlap technique: about 45% (Eng.) and research endeavor which enlightens the necessity 41% (Bng.) of the “Positivity > 0 ∧ Negativity > 0” of using the dynamic prior polarity with context. It cases and 43% (Eng.) and 38% (Bng.) of the is an ongoing task and presently we are exploring “|Positivity – Negativity| ≥ 0.2” cases were resolved its possible applications to multiple domains and by the Syntactic co-occurrence based technique. languages. The term Sentimantics may or may not remain in spotlight with time, but we do believe To better aid our understanding of the developed that this is high time to move on for the dynamic lexical network to hold Sentimantics we visualized prior polarity lexica. this network using the Fruchterman Reingold force directed graph layout algorithm (Fruchterman and Reingold, 1991) and the NodeXL 18 network analysis tool (Smith et al., 2009). 18 http://www.codeplex.com/NodeXL 45 References Liu Hugo, Henry Lieberman and Ted Selker. 2003. A Model of Textual Affect Sensing using Real-World Cambria Erik, Amir Hussain and Chris Eckl. 2011. Knowledge. 
IUI, pp. 125-132. Taking Refuge in Your Personal Sentic Corner. SAAIP, IJCNLP, pp. 35-43. Minsky Marvin. 2006. The Emotion Machine. Simon and Schuster, New York. Cem Akkaya, Janyce Wiebe, Conrad Alexander and Mihalcea Rada. 2011. Improving the Impact of Moilanen Karo, Pulman Stephen and Zhang Yue. 2010. Subjectivity Word Sense Disambiguation on Packed Feelings and Ordered Sentiments: Sentiment Contextual Opinion Analysis. CoNLL. Parsing with Quasi-compositional Polarity Sequencing and Compression. WASSA, pp. 36--43. Das Amitava and Bandyopadhyay S. 2010. SemanticNet-Perception of Human Pragmatics. Ohana Bruno and Brendan Tierney. 2009. Sentiment COGALEX-II, COLING, pp 2-11. classification of reviews using SentiWordNet. In the 9th IT&T Conference. Das Amitava Bandyopadhyay S. 2010. SentiWordNet for Indian Languages. ALR, COLING, pp 56-63. Pang Bo, Lillian Lee and Vaithyanathan Shivakumar. 2002. Thumbs up? Sentiment Classification using Dasgupta, Sajib and Vincent Ng. 2009. Topic-wise, Machine Learning Techniques. EMNLP, pp 79-86. Sentiment-wise, or Otherwise? Identifying the Hidden Dimension for Unsupervised Text Pang, Bo and Lillian Lee. 2005. Seeing stars: Exploiting Classification. EMNLP. class relationships for sentiment categorization with respect to rating scales. ACL, pp. 115-124. Ekbal A. and Bandyopadhyay S. 2010. Voted NER System using Appropriate Unlabeled Data. Seeker Wolfgang, Adam Bermingham, Jennifer Foster Lingvisticae Investigationes Journal. and Deirdre Hogan. 2009. Exploiting Syntax in Sentiment Polarity Classification. National Centre for Esuli Andrea and Fabrizio Sebastiani. 2006. Language Technology Dublin City University, SentiWordNet: A Publicly Available Lexical Ireland. Resource for Opinion Mining. LREC, pp. 417-422. Shulamit Umansky-Pesin, Roi Reichart and Ari Fruchterman Thomas M. J. and Edward M. Reingold. Rappoport. 2010. A Multi-Domain Web-Based 1991. Graph drawing by force-directed placement. Algorithm for POS Tagging of Unknown Words. Software: Practice and Experience, 21(11):1129– COLING. 1164. Smith Marc, Ben Shneiderman, Natasa Milic-Frayling, Grassi, Marco. 2009. Developing HEO Human Eduarda Mendes Rodrigues, Vladimir Barash, Cody Emotions Ontology. Joint International Conference Dunne, Tony Capone, Adam Perer, and Eric Gleave. on Biometric ID management and Multimodal 2009. Analyzing (social media) networks with Communication, vol. 5707 of LNCS, pp 244–251. NodeXL. 4th International Conference on Haruechaiyasak Choochart, Alisa Kongthon, Palingoon Communities and Technologies, pp. 255-264. Pornpimon and Sangkeettrakarn Chatchawal. 2010. Takamura Hiroya, Inui Takashi and Okumura Manabu. Constructing Thai Opinion Mining Resource: A Case 2005. Extracting Semantic Orientations of Words Study on Hotel Reviews. ALR, pp 64–71. using Spin Model. ACL, pp. 133-140. Hatzivassiloglou Vasileios and Kathleen R. McKeown. Torii Yoshimitsu, Das Dipankar, Bandyopadhyay Sivaji 1997. Predicting the Semantic Orientation of and Okumura Manabu. 2011. Developing Japanese Adjectives. ACL, pp. 174–181. WordNet Affect for Analyzing Emotions. WASSA, Havasi, C., Speer, R., Alonso, J. 2007. ConceptNet 3: a ACL, pp. 80-86 Flexible, Multilingual Semantic Network for Turney Peter and Michael Littman. 2003. Measuring Common Sense Knowledge. RANLP. praise and criticism: Inference of semantic He Yulan, Alani Harith and Zhou Deyu. 2010. orientation from association. ACM Transactions on Exploring English Lexicon Knowledge for Chinese Information Systems, 21(4):315–346. Sentiment Analysis. 
CIPS-SIGHAN, pp 28-29. Turney Peter. 2002. Thumbs up or thumbs down? Kim Soo-Min and Eduard Hovy. 2004. Determining the Semantic orientation applied to unsupervised Sentiment of Opinions. COLING, pp. 1367-1373. classification of reviews. ACL, pp. 417–424. Liu Bing. 2010. NLP Handbook. Chapter: Sentiment Turney Peter. 2006. Similarity of Semantic Relations. Analysis and Subjectivity, 2nd Edition. Computational Linguistics, 32(3):379-416. 46 Analysis of Travel Review Data from Reader’s Point of View Maya Ando Shun Ishizaki Graduate School of Media and Governance Keio University 5322 Endo, Fujisawa-shi, Kanagawa 252-0882, Japan
[email protected] [email protected]So, the business value of the review lies on the Abstract customer’s point of view, rather than the reviewer’s point of view. The reviews which give In the NLP field, there have been a lot of works which focus on the reviewer’s point of a great influence to the customers should have the view conducted on sentiment analyses, which highest value, rather than the reviews to which ranges from trying to estimate the reviewer’s were assigned the highest score by the writer. We score. However the reviews are used by the defined customers as readers and reviewers as readers. The reviews that give a big influence writers. We found the differences between the to the readers should have the highest value, writer’s view and the reader’s one using scores rather than the reviews to which was assigned given by reviewers. Especially the negative the highest score by the writer. In this paper, information is found much more influential to the we conducted the analyses using the reader’s readers than the positive one (Ando et al., 2012). point of view. We asked 20 subjects to read We conducted the analyses using the reader’s 500 sentences in the reviews of Rakuten travel point of view. We asked 20 subjects to read 500 and extracted the sentences that gave a big review sentences in Rakuten travel reviews1 and influence to the subjects. We analyze the extract the sentences from them that gave a great influential sentences from the following two influence. We analyzed the influential sentences points of view, 1) targets and evaluations and from the following two points of view, 1) targets 2) personal tastes. We found that “room”, and evaluations (Chap. 4) and 2) Personal tastes “service”, “meal” and “scenery” are important (Chap. 5). targets which are items included in the reviews, and that “features” and “human senses” are important evaluations which express sentiment 2 Previous Study or explain targets. Also we showed personal There have been a lot of works on sentiment tastes appeared on “meal” and “service”. analysis in the past decade. Some of them were classifying reviews into positive, negative, or neutral (Turney, 2002; Pang et al., 2002; Koppel 1 Introduction et al., 2006; Pang, 2005; Okanohara et al., 2006; Reviews are indispensable in the current e- Thelwall et al., 2010). These works were commerce business. In the NLP field, there have conducted based on the writer’s point of view, i.e. been a lot of works conducted on sentiment the targets are mainly assigned by the writers. In analyses, which ranges from trying to estimate the our research, we will describe reader’s point of reviewer’s score or analyzing them by the aspects view. of reviewer’s evaluations. However the reviews 1 are used by the customers, not by the reviewers. Rakuten Travel Inc. http://travel.rakuten.co.jp/ (Japanese) 47 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 47–51, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics In some reviews, there is information called contains the Chi-square test results for each class. helpfulness which is given by readers. Ghose et al. It indicates how significantly each class appears (2007) used it as one of the features in order to in the influential sentences compared to the non- rank the reviews. Passos (2010) also used it to influential sentences. “Less than 1%” means that identify authoritativeness of reviews. 
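The per-class significance test just described can be sketched as follows. This is a rough illustration rather than the authors' code: the per-class counts below are invented, and only the totals of 84 influential and 416 non-influential sentences follow the paper.

# Chi-square test of whether one class (e.g. the target "room") appears
# significantly more often in influential than in non-influential sentences.
from scipy.stats import chi2_contingency

influential_with, influential_without = 30, 54          # 84 influential sentences
non_influential_with, non_influential_without = 60, 356  # 416 non-influential sentences

table = [[influential_with, influential_without],
         [non_influential_with, non_influential_without]]

chi2, p_value, dof, expected = chi2_contingency(table)
# Under this reading, "less than 1%" in Table 1 corresponds to p_value < 0.01.
print(f"chi2={chi2:.2f}, p={p_value:.4f}")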
They didn’t the chance having the number of classes in the conduct any detailed analysis like what we influential sentences and that in the non- conducted in this paper. So far, the usage of the influential sentences is less than 1%, if random helpfulness information is limited, and indeed the distribution is assumed. “None” means there is no information is too obscure to be used in the significant influence. The results of Chi-square analyses we are trying to conduct. test show that the three classes of target, “room”, “meal” and “service” give influence to the readers 3 Data Preparation (less than 1%), and “scenery” is also influential (less than 5%). Two classes of the evaluations, We use hotel’s reviews of Rakuten travel Inc. We “human senses” and “features” are influential defined influential sentences as those that (less than 1%). “Features” are expressions influence readers to make them book the hotel. In describing the writer’s view about particular practice, influential sentences are very sparse. So, targets in the hotel. in order to collect them efficiently, we used a We found that some particular combinations heuristic that it is relatively more likely to find of a target and an evaluation are influential them in the sentences with exclamation marks (Table 2). “-” indicates infrequence (less than 6). (“!”) located at their ends. We randomly extract We will discuss the combinations of “meal + 500 sentences which have more than one “!” at human senses”, “service + feelings” and “room/ the end, and used for the analyses. Note that meal/ service/ scenery + features”. exclamation mark doesn’t change the meaning of In the combination of “meal + human the sentence. We conducted a preliminary survey senses”, “human senses” are all about taste. The and found that our assumption works well. number of the influential sentences is 12, and the We asked 20 subjects to extract influential non-influential sentences are 19. We analyze sentences from the 500 sentences. The task is to each set of sentences, and found that the extract sentences by which each subject thinks it influential sentences include particular name of influential enough to decide he/she wants to book dish like “sukiyaki” much more often (less than or never to do the hotel. We asked them not to 1%). Non-influential sentences include more include their personal tastes. There are 84 abstract expressions, like “breakfast”. The influential sentences on which more than 4 readers are influenced by particular food. subjects agreed. In the following sections, these The combination of “feeling + service” 84 sentences will be called the influential appeared in influential sentences relatively more sentences and the other sentences are regarded as often(less than 2.5%). “Service” includes service the non-influential sentences. of the hotel like “welcome fruit” or “staff’s service”. “Feeling” is influential only when it 4 Analysis of Target and Evaluation combines with “service” (ex. 1). Ex. 1: …there was happy surprise service at the We analyze classes of targets and evaluations dinner!! which are most influential to the readers. Here, “Features” is very frequent. Investigating the the targets are such as meals or locations of the combination with targets, we found that “room”, hotels, and the evaluations are the reader’s “meal” and “service” are the ones which made impressions about the targets such as good or significant difference (less than 1%) by convenient. We allow duplication of the combining with “features”. 
These are the key to classification, i.e. if a sentence contains more than make “features” more influential for readers. one target or evaluation then we extract all the “Scenery" is a target originally created and has a target or evaluation terms. significant difference less than 5%. It is a bit We categorized the targets into 11 classes and unexpected, but was useful information for some the evaluations into 7 classes (Table1). The table readers. 48 Table 1. Target and Evaluation with Chi-square test Result of Chi-square test Target evaluation Less than 1% Room, meal, service Human sense (e.g. delicious, stink), Features (e.g. marvelous, bad) Less than 5% Scenery - Location, staff, recommendation (e.g. This is my recommendation) None facility, hotel, bath, next visiting (e.g. I’ll never use this hotel), feeling (e.g. happy) plan, price request (e.g. I want you to…), others (e.g. Thank you) Table 2. Combination of Target and Evaluation with Chi-square test room meal bath service facility scenery features less than 1% less than 1% NO less than 1% NO less than 5% feelings NO - - less than 2.5% - - human senses - less than 1% - - - - sentences which 7 or more subjects judged 5 Personal tastes in the influential influential (we will call them as a popular group) sentences and sentences less than 7 subjects judged influential (unpopular group). Although we instructed the subjects not to include particular personal tastes, we observed the selections of the influential sentences are different among the subject. 289 sentences are selected as influential sentences by at least one subject, and 94 sentences are selected by only one subject. The personal tastes often appear on the target, Figure 1. “Service” type Figure 2. “Meal” type so we analyzed differences of targets among the subject. We clustered the subjects based on their choice of the targets. For each subject, we create a frequency vector whose elements are including the most popular 7 targets, namely “location”, “room”, “meal”, “bath”, “service”, “facility”, and “scenery”. Then the cosine metrics is applied to Figure 3. “Service & meal” type calculate the similarity between any pair of the subjects. Next, we run the hierarchical Table 3: the number of influential sentences judged by agglomerative clustering with the farthest certain number of subjects on “service” 10 or more 9 8 7 6 5 Less than 5 neighbor method to form their clusters. Three Positive 3 2 1 0 1 5 33 figures, Figures 1 to 3, show the results of three Negative 3 3 1 0 0 2 4 clusters in Rader charts. Each of three clusters has In the “service” target, 63 sentences are a typical personal taste, namely groups who are selected as influential by at least one subject. influenced more by “service” very strongly (Fig. Among them, 45 sentences are positive, 13 1), by “meal” (Fig. 2) or by both “service” and sentences are negative and 5 sentences are “meal “(Fig. 3). classified other (i.e. neither positive nor negative). We analyze influential sentences by using the There are four sets of data by combining positive- number of sentences including “service”. Table 3 negative axis and axis. We will analyze them one shows the numbers of sentences that were judged by one. influential by certain numbers of subjects on [Negative & Popular] “service”. In this analysis, we categorize the There are 7 sentences in this group and we found influential sentences into positive and negative that 3 of them include “feeling” evaluation, such ones. 
For example, there were 2 positively as “surprised” or “angry”. In contrast, there is no influential sentences that were judged influential sentence including feeling in the negative & by 9 subjects. From Table 3, we can observe that unpopular group. Also, very unpleasant events the sentences can clearly be grouped into two; 49 like “arrogant attitude of hotel staff,” “lost the [Positive & Unpopular] luggage” and “payment trouble” are found The sentences including "cost performance" and negatively influential by many subjects. "large portion" only appear in the unpopular [Negative & Unpopular] group. We believe that the size might be There are sentences about staff’s attitude in this influential to people who like to eat a lot, but group, too, but it is less important compared to the people who might not be interested in them. ones in the popular group. For example, staff’s The analyses show that there is personal taste attitude is about greetings or conversation by the and we analyzed it in detail by examining the hotel staff. We believe it is depending on people if examples. It indicates that personalization is very they care those issues or not. important for the readers to find the reviews that [Positive & Popular] might satisfy readers. In this group, there are 2 sentences that show unexpected warm service (ex. 2). Also, there are 6 Conclusion sentences that express high satisfactions not only The main focus of our study is on the reader’s in service but also in other targets, such as meal. Ex. 2: …they kept the electric carpet on point view to evaluate reviews, compared to the because it was cold. We, with my elderly writer’s point of view that was the major focus in farther, were so glad and impressed!! the previous studies. We defined the influential sentences as those that could make the reader’s [Positive & Unpopular] decision. We analyzed the 84 influential sentences, All sentences include some positive descriptions based on the selection by the 20 subjects from the about services, such as “carrying the luggage” or 500 sentences. We conducted the following two “welcome fruit”. Some subjects are influenced, analyses. but the others aren’t. We believe it is because 1) We analyzed targets and evaluations in some people think that these are just usual influential sentences. We found that “room”, services to be provided. “service”, “meal” and “scenery” are important Now, we describe analyses on the “meal” target. targets, and “features” and “human senses” are There are 68 influential sentences selected by at important evaluations. We also analyzed least one subject. There are 58 positive sentences, combinations of the targets and evaluations. 5 negative sentences and 4 sentences otherwise. We find that some combinations make it more We analyze the four groups, just like what we did influential than each of them. for “service”. 2) We analyzed the personal tastes. The subjects [Negative & Popular] can be categorized into three clusters, which We find strong negative opinion about meal itself can be explained intuitively. We found that the like “Their rice was cooked terrible”, which are most important targets to characterize the not found in the unpopular group. Many people clusters are ”service” and “meal”. are influenced when the meal is described badly. There are many directions in our future work. [Negative & Unpopular] One of the important topics is to conduct There are 2 sentences about the situation of the cognitive analysis on the influential sentences. 
restaurant, such as "crowded" or "existence of a We found that expressions can be very influential large group of people". We believe that the most by adding a simple modifier (“really delicious”). important feature of meal is taste, not the situation. Furthermore, many metaphorical expressions are Many people might know such situation happens found in influential sentences (this topic was not by chance, so only some people cares about this covered in this paper). We would like to conduct kind of issue. the cognitive analyses on these topics to clarify [Positive & Popular] the characteristics of the reader’s point of view. The sentences in both popular and unpopular We believe it will reveal new types of information groups include “delicious”, but “delicious” with in reviews that is also useful for applications. emphasizing adjectives, like “really delicious” were found only in the popular group. 50 References Alexandre Passos and Jacques Wainer, 2010, What do you know? A topic-model approach to authority identification, Proc. Of Computational Social Science and Wisdom of Crowds(NIP2010). Anindya Ghose and Panagiotis G. Ipeirotis. 2007. Designing novel review ranking systems: Predicting usefulness and impact of reviews. Proc. of the International Conference on Electronic Commerce (ICEC), pp. 303-309. Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86. Bo Pang and Lillian Lee. 2005. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales". Proceedings of the Association for Computational Linguistics (ACL). pp. 115–124. Daisuke Okanohara and Jun’ichi Tsujii. 2007.Assigning Polarity Scores to Reviews Using Machine Learning Techniques. Journal of Natural Language Processing. 14(3). pp. 273-295. Koppel, M. and Schler, J. 2006. “The Importance of Neutral Examples in Learning Sentiment”. Computational Intelligence. 22(2). pp.100-109. Maya Ando and Shun Ishizaki. 2012, Analysis of influencial reviews on Web(in Japanese), Proc. Of the 18th Annual Conference of the Association for Natural Language Processing, pp. 731-734. P. Victor, C. Cornelis, M. De Cock, and A. Teredesai. 2009. “Trust- and distrustbased recommendations for controversial reviews.” in Proceedings of the WebSci’09, Society On-Line. Peter Turney. 2002. "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews". Proceedings of the Association for Computational Linguistics. pp. 417- 424. Thelwall, Mike; Buckley, Kevan; Paltoglou, Georgios; Cai, Di; Kappas, Arvid. 2010. "Sentiment strength detection in short informal text". Journal of the American Society for Information Science and Technology 61 (12). pp. 2544-2558. 51 Multilingual Sentiment Analysis using Machine Translation? Alexandra Balahur and Marco Turchi European Commission Joint Research Centre Institute for the Protection and Security of the Citizen Via E. Fermi 2749, Ispra, Italy alexandra.balahur,
[email protected]Abstract standing these proven advantages, the high quan- tity of user-generated contents makes this informa- The past years have shown a steady growth tion hard to access and employ without the use of in interest in the Natural Language Process- automatic mechanisms. This issue motivated the ing task of sentiment analysis. The research rapid and steady growth in interest from the Natural community in this field has actively proposed and improved methods to detect and classify Language Processing (NLP) community to develop the opinions and sentiments expressed in dif- computational methods to analyze subjectivity and ferent types of text - from traditional press ar- sentiment in text. Different methods have been pro- ticles, to blogs, reviews, fora or tweets. A less posed to deal with these phenomena for the distinct explored aspect has remained, however, the types of text and domains, reaching satisfactory lev- issue of dealing with sentiment expressed in els of performance for English. Nevertheless, for texts in languages other than English. To this certain applications, such as news monitoring, the aim, the present article deals with the prob- information in languages other than English is also lem of sentiment detection in three different languages - French, German and Spanish - us- highly relevant and cannot be disregarded. Addi- ing three distinct Machine Translation (MT) tionally, systems dealing with sentiment analysis in systems - Bing, Google and Moses. Our ex- the context of monitoring must be reliable and per- tensive evaluation scenarios show that SMT form at similar levels as the ones implemented for systems are mature enough to be reliably em- English. ployed to obtain training data for languages Although the most obvious solution to these is- other than English and that sentiment analysis sues of multilingual sentiment analysis would be to systems can obtain comparable performances to the one obtained for English. use machine translation systems, researchers in sen- timent analysis have been reluctant to using such technologies due to the low performance they used 1 Introduction to have. However, in the past years, the performance Together with the increase in the access to tech- of Machine Translation systems has steadily im- nology and the Internet, the past years have shown proved. Open access solutions (e.g. Google Trans- a steady growth of the volume of user-generated late1 , Bing Translator2 ) offer more and more accu- contents on the Web. The diversity of topics cov- rate translations for frequently used languages. ered by this data (mostly containing subjective and Bearing these thoughts in mind, in this article opinionated content) in the new textual types such we study the manner in which sentiment analysis as blogs, fora, microblogs, has been proven to be can be done for languages other than English, using of tremendous value to a whole range of applica- Machine Translation. In particular, we will study tions, in Economics, Social Science, Political Sci- 1 http://translate.google.it/ 2 ence, Marketing, to mention just a few. Notwith- http://www.microsofttranslator.com/ 52 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 52–60, Jeju, Republic of Korea, 12 July 2012. 
2012 c Association for Computational Linguistics this issue in three languages - French, German and ond approach, they use the automatically translated Spanish - using three different Machine Translation entries in the Opinion Finder lexicon to annotate a systems - Google Translate, Bing Translator and set of sentences in Romanian. In the last experi- Moses (Koehn et al., 2007). ment, they reverse the direction of translation and We employ these systems to obtain training and verify the assumption that subjective language can test data for these three languages and subsequently be translated and thus new subjectivity lexicons can extract features that we employ to build machine be obtained for languages with no such resources. learning models using Support Vector Machines Se- Further on, another approach to building lexicons quential Minimal Optimization. We additionally for languages with scarce resources is presented by employ meta-classifiers to test the possibility to min- Banea et al. (Banea et al., 2008a). In this research, imize the impact of noise (incorrect translations) in the authors apply bootstrapping to build a subjectiv- the obtained data. ity lexicon for Romanian, starting with a set of seed Our experiments show that machine translation subjective entries, using electronic bilingual dictio- systems are mature enough to be employed for mul- naries and a training set of words. They start with tilingual sentiment analysis and that for some lan- a set of 60 words pertaining to the categories of guages (for which the translation quality is high noun, verb, adjective and adverb from the transla- enough) the performance that can be attained is sim- tions of words in the Opinion Finder lexicon. Trans- ilar to that of systems implemented for English. lations are filtered using a measure of similarity to the original words, based on Latent Semantic Anal- 2 Related Work ysis (LSA) (Deerwester et al., 1990) scores. Yet another approach to mapping subjectivity lexica to Most of the research in subjectivity and sentiment other languages is proposed by Wan (2009), who analysis was done for English. However, there were uses co-training to classify un-annotated Chinese re- some authors who developed methods for the map- views using a corpus of annotated English reviews. ping of subjectivity lexicons to other languages. To He first translates the English reviews into Chinese this aim, (Kim and Hovy, 2006) use a machine trans- and subsequently back to English. He then performs lation system and subsequently use a subjectivity co-training using all generated corpora. (Kim et al., analysis system that was developed for English to 2010) create a number of systems consisting of dif- create subjectivity analysis resources in other lan- ferent subsystems, each classifying the subjectivity guages. (Mihalcea et al., 2009) propose a method of texts in a different language. They translate a cor- to learn multilingual subjective language via cross- pus annotated for subjectivity analysis (MPQA), the language projections. They use the Opinion Finder subjectivity clues (Opinion finder) lexicon and re- lexicon (Wilson et al., 2005) and use two bilin- train a Nave Bayes classifier that is implemented in gual English-Romanian dictionaries to translate the the Opinion Finder system using the newly gener- words in the lexicon. Since word ambiguity can ap- ated resources for all the languages considered. 
Fi- pear (Opinion Finder does not mark word senses), nally, (Banea et al., 2010) translate the MPQA cor- they filter as correct translations only the most fre- pus into five other languages (some with a similar quent words. The problem of translating multi-word ethimology, others with a very different structure). expressions is solved by translating word-by-word Subsequently, they expand the feature space used in and filtering those translations that occur at least a Nave Bayes classifier using the same data trans- three times on the Web. Another approach in obtain- lated to 2 or 3 other languages. Their conclusion is ing subjectivity lexicons for other languages than that by expanding the feature space with data from English was explored by Banea et al. (Banea et al., other languages performs almost as well as training 2008b). To this aim, the authors perform three dif- a classifier for just one language on a large set of ferent experiments, obtaining promising results. In training data. the first one, they automatically translate the anno- Attempts of using machine translation in differ- tations of the MPQA corpus and thus obtain subjec- ent natural language processing tasks have not been tivity annotated sentences in Romanian. In the sec- widely used due to poor quality of translated texts, 53 but recent advances in Machine Translation have tems, with varying levels of translation quality. In motivated such attempts. In Information Retrieval, this sense, we employ three different systems - Bing (Savoy and Dolamic, 2009) proposed a comparison Translator, Google Translate and Moses to translate between Web searches using monolingual and trans- data from English to three languages - French, Ger- lated queries. On average, the results show a drop man and Spanish. We subsequently study the perfor- in performance when translated queries are used, mance of classifying sentiment from the translated but it is quite limited, around 15%. For some lan- data and different methods to minimize the effect of guage pairs, the average result obtained is around noise in the data. 10% lower than that of a monolingual search while Our comparative results show, on the one hand, for other pairs, the retrieval performance is clearly that machine translation can be reliably used for lower. In cross-language document summarization, multilingual sentiment analysis and, on the other (Wan et al., 2010; Boudin et al., 2010) combined hand, which are the main characteristics of the data the MT quality score with the informativeness score for such approaches to be successfully employed. of each sentence in a set of documents to automat- ically produce summary in a target language using 4 Dataset Presentation and Analysis a source language texts. In (Wan et al., 2010), each For our experiments, we employed the data provided sentence of the source document is ranked accord- for English in the NTCIR 8 Multilingual Opinion ing both the scores, the summary is extracted and Analysis Task (MOAT)3 . In this task, the organiz- then the selected sentences translated to the target ers provided the participants with a set of 20 top- language. Differently, in (Boudin et al., 2010), sen- ics (questions) and a set of documents in which sen- tences are first translated, then ranked and selected. tences relevant to these questions could be found, Both approaches enhance the readability of the gen- taken from the New York Times Text (2002-2005) erated summaries without degrading their content. corpus. 
The documents were given in two differ- ent forms, which had to be used correspondingly, 3 Motivation and Contribution depending on the task to which they participated. The main motivation for the experiments we present The first variant contained the documents split into in this article is the known lack of resources and ap- sentences (6165 in total) and had to be used for proaches for sentiment analysos in languages other the task of opinionatedness, relevance and answer- than English. Although, as we have seen in the ness. In the second form, the sentences were also Related Work section, a few attempts were made split into opinion units (6223 in total) for the opin- to build systems that deal with sentiment analysis ion polarity and the opinion holder and target tasks. in other languages, they mostly employed bilingual For each of the sentences, the participants had to dictionaries and used unsupervised approaches. The provide judgments on the opinionatedness (whether very few that employed supervised learning using they contained opinions), relevance (whether they translated data have, in change, concentrated only are relevant to the topic). For the task of polar- on the issue of sentiment classification and have dis- ity classification, the participants had to employ the regarded the impact of the translation quality and dataset containing the sentences that were also split the difference that the use of distinct translation sys- into opinion units (i.e. one sentences could contain tems can make in this settings. Moreover, such ap- two/more opinions, on two/more different targets or proaches have usually employed only simple ma- from two/more different opinion holders). chine learning algorithms. No attempt has been For our experiments, we employed the latter rep- made to study the use of meta-classifiers to enhance resentation. From this set, we randomly chose 600 the performance of the classification through the re- opinion units, to serve as test set. The rest of opin- moval of noise in the data. ion units will be employed as training set. Subse- Our main contribution in this article is the com- quently, we employed the Google Translate, Bing parative study of multilingual sentiment analysis 3 http://research.nii.ac.jp/ntcir/ntcir- performance using distinct machine translation sys- ws8/permission/ntcir8xinhua-nyt-moat.html 54 Translator and Moses systems to translate, on the into a sequence of I phrases f I = {f1 , f2 , . . . fI } one hand, the training set and on the other hand and the same is done for the target sentence e, where the test set, to French, German and Spanish. Ad- the notion of phrase is not related to any grammat- ditionally, we employed the Yahoo system to trans- ical assumption; a phrase is an n-gram. The best late only the test set into these three languages. Fur- translation ebest of f is obtained by: ther on, this translation of the test set by the Yahoo service has been corrected by a person for all the ebest = arg max p(e|f ) = arg max p(f |e)pLM (e) e e languages. This corrected data serves as Gold Stan- I dard4 . Most of these sentences, however, contained Y = arg max φ(fi |ei )λφ d(ai − bi−1 )λd e no opinion (were neutral). Due to the fact that the i=1 neutral examples are majoritary and can produce a Y|e| large bias when classifying, we decided to eliminate pLM (ei |e1 . . . ei−1 )λLM these examples and employ only the positive and i=1 negative sentences in both the training, as well as the test sets. 
After this elimination, the training set where φ(fi |ei ) is the probability of translating a contains 943 examples (333 positive and 610 nega- phrase ei into a phrase fi . d(ai − bi−1 ) is the tive) and the test set and Gold Standard contain 357 distance-based reordering model that drives the sys- examples (107 positive and 250 negative). tem to penalise significant reorderings of words dur- ing translation, while allowing some flexibility. In 5 Machine Translation the reordering model, ai denotes the start position of the source phrase that is translated into the ith During the 1990’s the research community on Ma- target phrase, and bi−1 denotes the end position of chine Translation proposed a new approach that the source phrase translated into the (i − 1)th target made use of statistical tools based on a noisy chan- phrase. pLM (ei |e1 . . . ei−1 ) is the language model nel model originally developed for speech recogni- probability that is based on the Markov’s chain as- tion (Brown et al., 1994). In the simplest form, Sta- sumption. It assigns a higher probability to flu- tistical Machine Translation (SMT) can be formu- ent/grammatical sentences. λφ , λLM and λd are lated as follows. Given a source sentence written used to give a different weight to each element. For in a foreign language f , the Bayes rule is applied more details see (Koehn et al., 2003). to reformulate the probability of translating f into a Three different SMT systems were used to trans- sentence e written in a target language: late the human annotated sentences: two existing ebest = arg max p(e|f ) = arg max p(f |e)pLM (e) online services such as Google Translate and Bing e e Translator5 and an instance of the open source where p(f |e) is the probability of translating e to f phrase-based statistical machine translation toolkit and pLM (e) is the probability of producing a fluent Moses (Koehn et al., 2007). sentence e. For a full description of the model see To train our models based on Moses we used the (Koehn, 2010). freely available corpora: Europarl (Koehn, 2005), The noisy channel model was extended in differ- JRC-Acquis (Steinberger et al., 2006), Opus (Tiede- ent directions. In this work, we analyse the most mann, 2009), News Corpus (Callison-Burch et al., popular class of SMT systems: PBSMT. It is an ex- 2009). This results in 2.7 million sentence pairs for tension of the noisy channel model using phrases English-French, 3.8 for German and 4.1 for Span- rather than words. A source sentence f is segmented ish. All the modes are optimized running the MERT 4 algorithm (Och, 2003) on the development part of Please note that each sentence may contain more than one opinion unit. In order to ensure a contextual translation, we the News Corpus. The translated sentences are re- translated the whole sentences, not the opinion units separately. cased and detokonized (for more details on the sys- In the end, we eliminate duplicates of sentences (due to the fact tem, please see (Turchi et al., 2012). that they contained multiple opinion units), resulting in around 5 400 sentences in the test and Gold Standard sets and 5700 sen- http://translate.google.com/ and http:// tences in the training set www.microsofttranslator.com/ 55 Performances of a SMT system are automati- the initial set of examples that is classified. If we are cally evaluated comparing the output of the system to employ machine translation, the errors in translat- against human produced translations. 
Bleu score ing this small initial set would have a high negative (Papineni et al., 2001) is the most used metric and it impact on the subsequently learned examples. The is based on averaging n-gram precisions, combined challenge of using statistical methods is that they re- with a length penalty which penalizes short transla- quire training data (e.g. annotated corpora) and that tions containing only sure words. It ranges between this data must be reliable (i.e. not contain mistakes 0 and 1, and larger value identifies better translation. or “noise”). However, the larger this dataset is, the less influence the translation errors have. 6 Sentiment Analysis Since we want to study whether machine transla- tion can be employed to perform sentiment analy- In the field of sentiment analysis, most work has sis for different languages, we employed statistical concentrated on creating and evaluating methods, methods in our experiments. More specifically, we tools and resources to discover whether a specific used Support Vector Machines Sequential Minimal “target”or “object” (person, product, organization, Optimization (SVM SMO) since the literature in the event, etc.) is “regarded” in a positive or negative field has confirmed it as the most appropriate ma- manner by a specific “holder” or “source” (i.e. a per- chine learning algorithm for this task. son, an organization, a community, people in gen- In the case of statistical methods, the most impor- eral, etc.). This task has been given many names, tant aspect to take into consideration is the manner from opinion mining, to sentiment analysis, review in which texts are represented - i.e. the features that mining, attitude analysis, appraisal extraction and are extracted from it. For our experiments, we repre- many others. sented the sentences based on the unigrams and the The issue of extracting and classifying sentiment bigrams that were found in the training data. Al- in text has been approached using different methods, though there is an ongoing debate on whether bi- depending on the type of text, the domain and the grams are useful in the context of sentiment classi- language considered. Broadly speaking, the meth- fication, we considered that the quality of the trans- ods employed can be classified into unsupervised lation can also be best quantified in the process by (knowledge-based), supervised and semi-supervised using these features (because they give us a measure methods. The first usually employ lexica or dictio- of the translation correctness, both regarding words, naries of words with associated polarities (and val- as well as word order). Higher level n-grams, on the ues - e.g. 1, -1) and a set of rules to compute the other hand, would only produce more sparse feature final result. The second category of approaches em- vectors, due to the high language variability and the ploy statistical methods to learn classification mod- mistakes in the traslation. els from training data, based on which the test data is then classified. Finally, semi-supervised methods 7 Experiments employ knowledge-based approaches to classify an In order to test the performance of sentiment classi- initial set of examples, after which they use different fication when using translated data, we performed a machine learning methods to bootstrap new training series of experiments: examples, which they subsequently use with super- vised methods. 
• In the first set of experiments, we trained an The main issue with the first approach is that ob- SVM SMO classifier on the training data ob- taining large-enough lexica to deal with the vari- tained for each language, with each of the three ability of language is very expensive (if it is done machine translations, separately (i.e. we gen- manually) and generally not reliable (if it is done erated a model for each of the languages con- automatically). Additionally, the main problem of sidered, for each of the machine translation such approaches is that words outside contexts are systems employed). Subsequently, we tested highly ambiguous. Semi-supervised approaches, on the models thus obtained on the correspond- the other hand, highly depend on the performance of ing test set (e.g. training on the Spanish train- 56 ing set obtained using Google Translate and approach on the Gold Standard (for each language), testing on the Spanish test set obtained using we represented this set using the corresponding un- Google Translate) and on the Gold Standard for igram and bigram features extracted from the cor- the corresponding language (e.g. training on responding training set (for the example given, we the Spanish training set obtained using Google represented each sentence in the Gold Standard by Translate and testing on the Spanish Gold Stan- marking the presence/absence of the unigrams and dard). Additionally, in order to study the man- bigrams from the training data for Spanish using ner in which the noise in the training data can Google Translate). be removed, we employed two meta-classifiers The results of these experiments are presented in - AdaBoost and Bagging (with varying sizes of Table 2, in terms of weighted F1 measure. the bag). 7.2 Joint Training with Translated Data • In the second set of experiments, we combined In the second set of experiments, we added together the translated data from all three machine trans- all the translations of the training data obtained for lation systems for the same language and cre- the same language, with the three different MT sys- ated a model based on the unigram and bigram tems. Subsequently, we represented, for each lan- features extracted from this data (e.g. we cre- guage in part, each of the sentences in the joint train- ated a Spanish training model using the uni- ing corpus as vectors, whose features represented grams and bigrams present in the training sets the presence/absence of the unigrams and bigrams generated by the translation of the training set contained in this corpus. In order to test the perfor- to Spanish by Google Translate, Bing Trans- mance of the sentiment classification, we employed lator and Moses). We subsequently tested the the Gold Standard for the corresponding language, performance of the sentiment classification us- representing each sentence it contains according to ing the Gold Standard for the corresponding the presence or absence of the unigrams and bigrams language, represented using the features of this in the corresponding joint training corpus for that model. language. Finally, we applied SVM SMO to classify Table 1 presents the number of unigram and bi- the sentences according to the polarity of the senti- gram features employed in each of the cases. ment they contained. Additionally, we applied the In the following subsections, we present the re- AdaBoost and Bagging meta-classifiers to test the sults of these experiments. possibilities to minimize the impact of noise in the data. 
The results are presented in Tables 3 and 4, 7.1 Individual Training with Translated Data again, in terms of weighter F1 measure. In the first experiment, we translated the training Language SMO AdaBoost M1 Bagging and test data from English to all the three other languages considered, using each of the three ma- To German 0.565∗ 0.563∗ 0.565∗ chine translation systems. Subsequently, we rep- To Spanish 0.419 0.494 0.511 resented, for each of the languages and translation To French 0.25 0.255 0.23 systems, the sentences as vectors, whose features marked the presence/absence (1 or 0) of the uni- Table 3: For each language, each classifier has been trained merging the translated data coming form differ- grams and bigrams contained in the corresponding ent SMT systems, and tested using the Gold Standard. trainig set (e.g. we obtained the unigrams and bi- ∗ Classifier is not able to discriminate between positive grams in all the sentences in the training set ob- and negative classes, and assigns most of the test points tained by translating the English training data to to one class, and zero to the other. Spanish using Google and subsequently represented each sentence in this training set, as well as the test 8 Results and Discussion set obtained by translating the test data in English to Spanish using Google marking the presence of the Generally speaking, from our experiments using unigram and bigram features). In order to test the SVM, we could see that incorrect translations imply 57 Bing Google T. Moses correct information for the positive and the negative To German 0.57∗ 0.572∗ 0.562∗ classes, this results in the assignment of most of To Spanish 0.392 0.511 0.448 the test points to one class and zero to the other. In To French 0.612∗ 0.571∗ 0.575∗ Table 3, for the French language we have significant drop in performance, but the classifier is still able Table 4: For each language, the SMO classifiers have to learn something from the training and assign the been trained merging the translated data coming form dif- test points to both the classes. ferent SMT systems, and tested using independently the c) The results for Spanish presented in Table 3 translated test sets. ∗ Classifier is not able to discriminate confirm the capability of Bagging to reduce the between positive and negative classes, and assigns most model variance and increase the performance in of the test points to one class, and zero to the other. classification. d) At system level in Table 4, there is no evidence an increment of the features, sparseness and more that better translated test set allows better classifica- difficulties in identifying a hyperplane which sepa- tion performance. rates the positive and negative examples in the train- ing phase. Therefore, a low quality of the translation leads to a drop in performance, as the features ex- 9 Conclusions and Future Work tracted are not informative enough to allow for the classifier to learn. In this work we propose an extensive evaluation of From Table 2, we can see that: the use of translated data in the context of sentiment a) There is a small difference between performances analysis. Our findings show that SMT systems are of the sentiment analysis system using the English mature enough to produce reliably training data for and translated data, respectively. In the worst case, languages other than English. The gap in classifi- there is a maximum drop of 8 percentages. 
cation performance between systems trained on En- b) Adaboost is sensitive to noisy data, and it is glish and translated data is minimal, with a maxi- evident in our experiments where in general it does mum of 8 not modify the SMO performances or there is a Working with translated data implies an incre- drop. Vice versa, Bagging, reducing the variance ment number of features, sparseness and noise in the in the estimated models, produces a positive effect data points in the classification task. To limit these on the performances increasing the F-score. These problems, we test three different classification ap- improvements are larger using the German data, proaches showing that bagging has a positive impact this is due to the poor quality of the translated data, in the results. which increases the variance in the data. In future work, we plan to investigate different document representations, in particular we believe Looking at the results in Tables 3 and 4, we can that the projection of our documents in space where see that: the features belong to a sentiment lexical and in- a) Adding all the translated training data together clude syntax information can reduce the impact of drastically increases the noise level in the training the translation errors. As well we are interested to data, creating harmful effects in terms of clas- evaluate different term weights such as tf-idf. sification performance: each classifier loses its discriminative capability. Acknowledgments b) At language level, clearly the results depend on the translation performance. Only for Spanish The authors would like to thank Ivano Azzini, from (for which we have the highest Bleu score), each the BriLeMa Artificial Intelligence Studies, for the classifies is able to properly learn from the training advice and support on using meta-classifiers. We data and try to properly assign the test samples. For would also like to thank the reviewers for their use- the other languages, translated data are so noisy ful comments and suggestions on the paper. that the classifier is not able to properly learn the 58 References P. Koehn. 2010. Statistical Machine Translation. Cam- bridge University Press. Turchi, M. and Atkinson, M. and Wilcox, A. and Craw- P. Koehn and F. J. Och and D. Marcu. 2003. Statistical ley, B. and Bucci, S. and Steinberger, R. and Van der Phrase-Based Translation, Proceedings of the North Goot, E. 2012. ONTS: “Optima” News Translation America Meeting on Association for Computational System.. Proceedings of EACL 2012. Linguistics, 48–54. Banea, C., Mihalcea, R., and Wiebe, J. 2008. A boot- P. Koehn and H. Hoang and A. Birch and C. Callison- strapping method for building subjectivity lexicons for Burch and M. Federico and N. Bertoldi and B. Cowan languages with scarce resources.. Proceedings of the and W. Shen and C. Moran and R. Zens and C. Dyer Conference on Language Resources and Evaluations and O. Bojar and A. Constantin and E. Herbst 2007. (LREC 2008), Maraakesh, Marocco. Moses: Open source toolkit for statistical machine Banea, C., Mihalcea, R., Wiebe, J., and Hassan, S. translation. Proceedings of the Annual Meeting of the 2008. Multilingual subjectivity analysis using ma- Association for Computational Linguistics, demon- chine translation. Proceedings of the Conference on stration session, pages 177–180. Columbus, Oh, USA. Empirical Methods in Natural Language Processing Mihalcea, R., Banea, C., and Wiebe, J. 2009. Learn- (EMNLP 2008), 127-135, Honolulu, Hawaii. 
ing multilingual subjective language via cross-lingual Banea, C., Mihalcea, R. and Wiebe, J. 2010. Multilin- projections. Proceedings of the Conference of the An- gual subjectivity: are more languages better?. Pro- nual Meeting of the Association for Computational ceedings of the International Conference on Computa- Linguistics 2007, pp.976-983, Prague, Czech Repub- tional Linguistics (COLING 2010), p. 28-36, Beijing, lic. China. F. J. Och 2003. Minimum error rate training in statisti- Boudin, F. and Huet, S. and Torres-Moreno, J.M. and cal machine translation. Proceedings of the 41st An- Torres-Moreno, J.M. 2010. A Graph-based Ap- nual Meeting on Association for Computational Lin- proach to Cross-language Multi-document Summa- guistics, pages 160–167. Sapporo, Japan. rization. Research journal on Computer science K. Papineni and S. Roukos and T. Ward and W. J. Zhu and computer engineering with applications (Polibits), 2001. BLEU: a method for automatic evaluation of 43:113–118. machine translation. Proceedings of the 40th Annual P. F. Brown, S. Della Pietra, V. J. Della Pietra and R. L. Meeting on Association for Computational Linguis- Mercer. 1994. The Mathematics of Statistical Ma- tics, pages 311–318. Philadelphia, Pennsylvania. chine Translation: Parameter Estimation, Computa- J. Savoy, and L. Dolamic. 2009. How effective is tional Linguistics 19:263–311. Google’s translation service in search?. Communi- C. Callison-Burch, and P. Koehn and C. Monz and J. cations of the ACM, 52(10):139–143. Schroeder. 2009. Findings of the 2009 Workshop on R. Steinberger and B. Pouliquen and A. Widiger and C. Statistical Machine Translation. Proceedings of the Ignat and T. Erjavec and D. Tufiş and D. Varga. 2006. Fourth Workshop on Statistical Machine Translation, The JRC-Acquis: A multilingual aligned parallel cor- pages 1–28. Athens, Greece. pus with 20+ languages. Proceedings of the 5th Inter- Deerwester, S., Dumais, S., Furnas, G. W., Landauer, T. national Conference on Language Resources and Eval- K., and Harshman, R. 1990. Indexing by latent se- uation, pages 2142–2147. Genova, Italy. mantic analysis. Journal of the American Society for J. Tiedemann. 2009. News from OPUS-A Collection of Information Science, 3(41). Multilingual Parallel Corpora with Tools and Inter- faces. Recent advances in natural language processing Kim, S.-M. and Hovy, E. 2006. Automatic identification V: selected papers from RANLP 2007, pages 309:237. of pro and con reasons in online reviews. Proceedings of the COLING/ACL Main Conference Poster Ses- Wan, X. and Li, H. and Xiao, J. 2010. Cross-language sions, pages 483490. document summarization based on machine transla- tion quality prediction. Proceedings of the 48th An- Kim, J., Li, J.-J. and Lee, J.-H. 2006. Evaluating nual Meeting of the Association for Computational Multilanguage-Comparability of Subjectivity Analysis Linguistics, pages 917–926. Systems. Proceedings of the 48th Annual Meeting of Wilson, T., Wiebe, J., and Hoffmann, P. 2005. Recogniz- the Association for Computational Linguistics, pages ing contextual polarity in phrase-level sentiment anal- 595603, Uppsala, Sweden, 11-16 July 2010. ysis. Proceedings of HLT-EMNLP 2005, pp.347-354, P. Koehn. 2005. Europarl: A Parallel Corpus for Vancouver, Canada. Statistical Machine Translation. Proceedings of the Machine Translation Summit X, pages 79-86. Phuket, Thailand. 59 Language SMT system Nr. of unigrams Nr. 
of bigrams

Table 1: Features employed.

Language  SMT system          Nr. of unigrams  Nr. of bigrams
French    Bing                7441             17870
French    Google              7540             18448
French    Moses               6938             18814
French    Bing+Google+Moses   9082             40977
German    Bing                7817             16216
German    Google              7900             16078
German    Moses               7429             16078
German    Bing+Google+Moses   9371             36556
Spanish   Bing                7388             17579
Spanish   Google              7803             18895
Spanish   Moses               7528             18354
Spanish   Bing+Google+Moses   8993             39034

Table 2: Results obtained using the individual training sets obtained by translating with each of the three considered MT systems, to each of the three languages considered.

Language    SMT        Test Set  SMO    AdaBoost M1  Bagging  Bleu Score
English     -          GS        0.685  0.685        0.686    -
To German   Bing       GS        0.641  0.631        0.648
To German   Bing       Tr        0.658  0.636        0.662    0.227
To German   Google T.  GS        0.646  0.623        0.674
To German   Google T.  Tr        0.687  0.645        0.661    0.209
To German   Moses      GS        0.644  0.644        0.676
To German   Moses      Tr        0.667  0.667        0.674    0.17
To Spanish  Bing       GS        0.656  0.658        0.646
To Spanish  Bing       Tr        0.633  0.633        0.633    0.316
To Spanish  Google T.  GS        0.653  0.653        0.665
To Spanish  Google T.  Tr        0.636  0.667        0.636    0.341
To Spanish  Moses      GS        0.664  0.664        0.671
To Spanish  Moses      Tr        0.649  0.649        0.663    0.298
To French   Bing       GS        0.644  0.645        0.664
To French   Bing       Tr        0.644  0.649        0.652    0.243
To French   Google T.  GS        0.64   0.64         0.659
To French   Google T.  Tr        0.652  0.652        0.678    0.274
To French   Moses      GS        0.633  0.633        0.645
To French   Moses      Tr        0.666  0.666        0.674    0.227

Unifying Local and Global Agreement and Disagreement Classification in Online Debates

Jie Yin (CSIRO ICT Centre, NSW, Australia)    Nalin Narang (University of New South Wales, NSW, Australia)
[email protected], [email protected]
Paul Thomas (CSIRO ICT Centre, ACT, Australia)
Cecile Paris (CSIRO ICT Centre, NSW, Australia)
[email protected], [email protected]

Abstract

Online debate forums provide a powerful communication platform for individual users to share information, exchange ideas and express opinions on a variety of topics. Understanding people's opinions in such forums is an important task as its results can be used in many ways. It is, however, a challenging task because of the informal language use and the dynamic nature of online conversations. In this paper, we propose a new method for identifying participants' agreement or disagreement on an issue by exploiting information contained in each of the posts. Our proposed method first regards each post in its local context, then aggregates posts to estimate a participant's overall position. We have explored the use of sentiment, emotional and durational features to improve the accuracy of automatic agreement and disagreement classification. Our experimental results have shown that aggregating local positions over posts yields better performance than non-aggregation baselines when identifying users' global positions on an issue.

1 Introduction

With their increasing popularity, social media applications provide a powerful communication channel for individuals to share information, exchange ideas and express their opinions on a wide variety of topics. An online debate is an open forum where a participant starts a discussion by posting his opinion on a particular topic, such as regional politics, health or the military, while other participants state their support or opposition by posting their opinions.

Understanding participants' opinions in online debates has become an increasingly important task as its results can be used in many ways. For example, by analysing customers' online discussions, companies can better understand customers' reviews about their products or services. For government agencies, it could help gather public opinions about policies, legislation, laws, or elections. For social science, it can assist scientists to understand a breadth of social phenomena from online observations of large numbers of individuals.

Despite the potentially wide range of applications, understanding participants' positions in online debates remains a difficult task. One reason is that online conversations are very dynamic in nature. Unlike spoken conversations (Thomas et al., 2006; Wang et al., 2011), users in online debates are not guaranteed to participate in a discussion at all times. They may enter or exit the online discussion at any point, so it is not appropriate to use models assuming continued conversation. In addition, most discussions in online debates are essentially dialogic; participants could choose to implicitly respond to a previous post, or explicitly quote some content from an earlier post and make a response. Therefore, an assumption has to be made about what a participant's post is in response to, particularly when an explicit quote is not present; in most cases, a post is assumed to be in response to the most recent post in the thread (Murakami and Raymond, 2010).

In this paper, we address the problem of detecting users' positions with respect to the main topic in online debates; we call this the global position of users on an issue. It is inappropriate to identify each user's global position with respect to a main topic directly,
because most expressions of opinion are made not 61 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 61–69, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics for the main topic but for posts in a local context. agree and disagree. The advantage of our proposed This poses a difficulty in directly building a global method is that it builds a unified framework which classifier for agreement and disagreement. We illus- enables the classification of participants’ local and trate this with the example below. Here, the topic of global positions in online debates; the aggregation the thread is “Beijing starts gating, locking migrant of local estimates also tends to reduce error in the villages” and the discussion is started with a seed global classification. post criticising the Chinese government1 . In order to evaluate the performance of our method, we have conducted experiments on data sets Seed post: I’m most sure there will be some collected from two online debate forums. We have China sympathisers here justifying these ac- explored the use of sentiment, emotional and du- tions imposed by the Communist Chinese gov- rational features for automatic agreement and dis- ernment. . . . agreement classification, and our feature analysis Reply 1: Not really seeing a problem there. suggests that they can significantly improve the per- From you article. They can come and go. Peo- formance of baselines using only word features. Ex- ple in my country pay hundreds of thousands perimental results have also demonstrated that ag- of pounds for security like that in their gated gregating local positions over posts yields better per- communities.. formance for identifying users’ global positions on an issue. Reply 2: So, you are OK with living in a Police The rest of the paper is organised as follows. Sec- State? . . . tion 2 discusses previous work on agreement and disagreement classification. Section 3 presents our The author of Reply 1 argues that the Chinese pol- proposed method for both local and global position icy is not as presented, and is in fact defensible. This classification, which we validate in Section 4 with opposes the seed post, so that the author’s global po- experiments on two real-world data sets. Section 5 sition for the main topic is “disagree”. The opin- concludes the paper and discusses possible direc- ion expressed in Reply 2, however, is not a response tions for future work. to the seed post: it relates to Reply 1. It indicates that the author of Reply 2 disagrees with the opinion made in Reply 1, and thus indirectly implies agree- 2 Related Work ment with the seed post. From this example, we can Previous work in automatic identification of agree- see that it is hard to infer the global position of Re- ment and disagreement has mainly focused on ply 2’s author only from the text of their post. How- analysing conversational speech. Thomas et al. ever, we can exploit information in the local context, (2006) presented a method based on support vector such as the relationship between Replies 1 and 2, to machines to determine whether the speeches made indirectly infer the author’s opinion with regard to by participants represent support or opposition to the seed post. proposed legislation, using transcripts of U.S. con- Motivated by this observation, we propose a gressional floor debates. 
This method showed that three-step method for detecting participants’ global the classification of participants’ positions can be agreement or disagreement positions by exploiting improved by introducing the constraint that a sin- local information in the posts within the debate. gle speaker retains the same position during one First, we build a local classifier to determine whether debate. Wang et al. (2011) presented a condi- a pair of posts agree with each other or not. Sec- tional random field based approach for detecting ond, we aggregate over posts for each pair of partic- agreement/disagreement between speakers in En- ipants in one discussion to determine whether they glish broadcast conversations. Galley et al. (2004) agree with each other. Third, we infer the global po- proposed the use of Bayesian networks to model sitions of participants with respect to the main topic, pragmatic dependencies of previous agreement or so that participants can be classified into two classes: disagreement on the current utterance. These differ 1 Spelling of the posts is per original on the website. from our work in that the speakers are assumed to 62 be present all the time during the conversation, and gregate over posts for each pair of participants in therefore, user speech models can be built, and their one discussion to determine whether they agree with dependencies can be explored to facilitate agreement each other. Third, we infer global positions of par- and disagreement classification. Our aggregation ticipants with respect to the seed post based on the technique does, however, presuppose consistency of thread structure. opinions, in a similar way to Thomas et al. (2006). There has been other related work which aims 3.1 Classifying Local Positions between Posts to analyse informal texts for opinion mining and To classify local positions between posts, we need to (dis)agreement classification in online discussions. extract the reply-to pairs of posts from the threading Agrawal et al. (2003) described an observation that structure. The web forums we work with tend not to reply-to activities always show disagreement with present thread structure, so we consider two types previous authors in newsgroup discussions, and pre- of reply-to relationships between individual posts. sented a clustering approach to group users into two When a post explicitly quotes the content from an parties: support and opposition, based on reply- earlier post, we create an explicit link between the to graphs between users. Murakami and Raymond post and the quoted post. When a post does not (2010) proposed a method for deriving simple rules contain a quote, we assume that it is a reply to the to extract opinion expressions from the content of preceding post, and thus create an implicit link be- posts and then applied a similar graph clustering al- tween the two adjacent posts. After obtaining ex- gorithm for partitioning participants into supporting plicit/implicit links, we build a classifier to classify and opposing parties. By combining both text and each pair of posts as agreeing or disagreeing with link information, this approach was demonstrated to each other. outperform the method proposed by Agrawal et al. 3.1.1 Features (2003). Due to the nature of clustering mechanisms, the output of these methods are two user parties, in To build a classifier for identifying local agree- each of which users most agree or disagree with each ment and disagreement, we explored different types other. 
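As a rough illustration of the reply-to extraction described in Section 3.1 above, the sketch below builds an explicit link when a post quotes an earlier one and an implicit link to the immediately preceding post otherwise. The Post structure and its quoted_post_id field are assumptions made for this example, not the authors' implementation.

# Hypothetical sketch of the explicit/implicit reply-to link extraction (Section 3.1).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Post:
    post_id: int
    author: str
    text: str
    quoted_post_id: Optional[int] = None  # set when the post explicitly quotes an earlier post

def extract_reply_pairs(thread: List[Post]) -> List[Tuple[Post, Post]]:
    """Return (reply, parent) pairs: explicit links via quotes, otherwise an
    implicit link to the immediately preceding post in the thread."""
    by_id = {p.post_id: p for p in thread}
    pairs = []
    for i, post in enumerate(thread[1:], start=1):
        if post.quoted_post_id is not None and post.quoted_post_id in by_id:
            parent = by_id[post.quoted_post_id]   # explicit link to the quoted post
        else:
            parent = thread[i - 1]                # implicit link to the preceding post
        pairs.append((post, parent))
    return pairs

Each resulting pair of posts is then passed to the local agreement/disagreement classifier described in Section 3.1.2.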
However, users’ positions in the two parties of features from individual posts with the aim to un- do not necessarily correspond to the global position derstand which have predictive power for our agree- with respect to the main issue in a debate, which ment/disagreement classification task. is our interest here. Balasubramanyan and Cohen Words We extract unigram and bigram features (2011) proposed a computational method to classify to capture the lexical information from each post. sentiment polarity in blog comments and predict the Since many words are topic related and might be polarity based on the topics discussed in a blog post. used by both parties in a debate, we mainly use un- Finally, Somasundaran and Wiebe (2010) explored igrams for adjectives, verbs and adverbs because the utility of sentiment and arguing opinions in ideo- they have been demonstrated to possess discrimi- logical debates and applied a support vector machine native power for sentiment classification (Benamara based approach for classifying stances of individual et al., 2007; Subrahmanian and Regorgiato, 2008). posts. In our work, we focus on classifying people’s Typical examples of such unigrams include “agree”, global positions on a main issue by exploiting and “glad”, “indeed”, and “wrong”. In addition, we ex- aggregating local positions expressed in individual tract bigrams to capture phrases expressing argu- posts. ments, for example, “don’t think” and “how odd” could indicate disagreement, while “I concur” could 3 Our Proposed Method indicate agreement. To infer support or opposition positions with respect Sentiment features In order to detect sentiment to the seed post, we propose a three-step method. opinions, we use a sentiment lexicon referred to as First, we consider each post in its local context and SentiWordNet (Baccianella et al., 2010). This lexi- build a local classifier to classify each pair of posts con assigns a positive and negative score to a large as agreeing with each other or not. Second, we ag- number of words in WordNet. For example, the 63 seed A B C A A L(A,B)=agree agree L(A,C)=disagree disagree C D B L(B,C)=disagree C B C L(C,D)=disagree agree B D D (a) Estimate P (y|x) for each (b) Aggregate these over (c) Infer the global position post pairs of users to get local of each user by walking the agreement L(m, n) tree Figure 1: Local agreement/disagreement and participants’ global positions. We first estimate P (y|xi , xj ), the prob- ability of two posts xi and xj being in agreement or disagreement with each other, then aggregate over posts to determine L(m, n), the position between two users. Finally, we infer the global position for any user by walking this graph back to the seed. word “odd” has a positive score of 1.125, and a neg- extract durational features, such as the length of a ative score of 1.625. To aggregate the sentiment post in words and in characters. These features are polarity of each post, we calculate the overall pos- analogous to the ones used to capture the duration of itive and negative scores for all the words that can a speech for conversation analysis. Intuitively, peo- be found in SentiWordNet, and use these two sums ple tend to respond with a short post if they agree as two features for each post. with a previous opinion. Otherwise, when there is a strong argument, people tend to use a longer post to Emotional features We observe that personal state and defend their own opinions. 
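The two sentiment features described above, the summed positive and negative SentiWordNet scores of a post, could be computed along the following lines. The flattened word-to-score lexicon is an assumption made for illustration: SentiWordNet itself is organised by synset, so a real implementation would also need tokenisation, lemmatisation and sense handling.

# Hedged sketch of the two sentiment features: summed positive and negative scores.
from typing import Dict, Iterable, Tuple

def sentiment_features(tokens: Iterable[str],
                       lexicon: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Sum the positive and negative scores of all lexicon words in a post."""
    pos_sum, neg_sum = 0.0, 0.0
    for tok in tokens:
        if tok.lower() in lexicon:
            pos, neg = lexicon[tok.lower()]
            pos_sum += pos
            neg_sum += neg
    return pos_sum, neg_sum

# Example using the scores for "odd" quoted in the text.
lexicon = {"odd": (1.125, 1.625)}
print(sentiment_features("how odd".split(), lexicon))  # (1.125, 1.625)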
Moreover, we also consider the time difference between adjacent posts as additional durational features, again inspired by conversation analysis (Galley et al., 2004; Wang et al., 2011). Presumably, when a debate is controversial, participants would be actively involved in the discussions, and the thread would unfold quickly over time. Thus, the time difference between adjacent posts would be smaller in the debate.

Personal emotions could also be a good indicator of agreement/disagreement expression in online debates. Therefore, we include a set of emotional features, including occurrences of emoticons, the number of capital letters, the number of foul words, the number of exclamation marks, and the number of question marks contained in a post. Intuitively, the use of foul words might be linked to emotion in a visceral way and, if used, could be a sign of strong argument and disagreement. The presence of question marks could be indicative of disagreement, and the use of exclamation marks and capital letters could be an emphasis placed on opinions.

3.1.2 Classification Model

We use logistic regression as the basic classifier for local position classification because it has been demonstrated to provide good predictive performance across a range of text classification tasks, such as document classification and sentiment analysis (Zhang and Oles, 2001; Pan et al., 2010). In addition to the predicted class, logistic regression can also generate probabilities of class membership, which are quite useful in our case for aggregating local positions between participants.

Formally, logistic regression estimates the conditional probability of y given x in the form

P_w(y = \pm 1 \mid x) = \frac{1}{1 + e^{-y w^T x}},    (1)

where x is the feature vector, y is the class label, and w \in R^n is the weight vector. Given the training data \{x_i, y_i\}_{i=1}^{l}, x_i \in R^n, y_i \in \{1, -1\}, we consider the following form of regularised logistic regression:

\min_w f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right),    (2)

which aims to minimise the regularised negative log-likelihood of the training data. Above, w^T w / 2 is used as a regularisation term to achieve good generalisation abilities. Parameter C > 0 is a penalty factor which controls the balance of the two terms in Equation 2. The above optimisation problem can be solved using different iterative methods, such as conjugate gradient and Newton methods (Lin et al., 2008). As a result, an optimal estimate of w can be obtained.

The overall agreement score r(i, j) between a pair of users i and j is calculated by aggregating over the posts they exchange, as follows:

r(i, j) = \sum_{k=1}^{N(i,j)} P(\text{agree} \mid x_k) - \sum_{k=1}^{N(i,j)} P(\text{disagree} \mid x_k).    (3)

Here, N(i, j) denotes the number of post exchanges between users i and j, and r(i, j) indicates the degree of agreement between users i and j. Let L(i, j) denote the local position between two users i and j. If r(i, j) > 0, we have L(i, j) = agree, that is, user i agrees with user j. Otherwise, if r(i, j) <= 0, we have L(i, j) = disagree, that is, user i disagrees with user j.

Let us consider the example in Figure 1(a) and 1(b). There are two posts exchanged between users B and C. For each of these posts, two probabilities of class membership can be obtained: P(agree | x_1) = 0.1, P(disagree | x_1) = 0.9, and P(agree | x_2) = 0.3, P(disagree | x_2) = 0.7. Then we can calculate the agreement score r(B, C) between users B and C by aggregating over the two posts, that is, r(B, C) = (0.1 + 0.3) - (0.9 + 0.7) = -1.2 < 0. We can conclude that user B disagrees with user C in the threaded discussion and that L(B, C) = disagree.
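A minimal sketch of Equations 1 and 3 follows, using the worked example above. Variable names are illustrative; in the actual system the feature vectors would come from Section 3.1.1 and the weights from the regularised training problem in Equation 2.

# Hedged sketch of Equation 1 (logistic probability) and Equation 3 (agreement score).
import math
from typing import List, Sequence

def p_agree(w: Sequence[float], x: Sequence[float]) -> float:
    """Equation 1 with y = +1 (the 'agree' class): 1 / (1 + exp(-w^T x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def local_position(p_agree_per_post: List[float]) -> str:
    """Equation 3: r(i, j) = sum_k P(agree|x_k) - sum_k P(disagree|x_k),
    with P(disagree|x_k) = 1 - P(agree|x_k) in the binary case."""
    r = sum(p for p in p_agree_per_post) - sum(1.0 - p for p in p_agree_per_post)
    return "agree" if r > 0 else "disagree"

print(round(p_agree([0.5, -1.0], [1.0, 2.0]), 3))  # about 0.182 for this toy weight/feature pair
# Worked example from the text: posts exchanged between users B and C.
print(local_position([0.1, 0.3]))  # 'disagree', since r(B, C) = (0.1+0.3) - (0.9+0.7) = -1.2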
3.3 Identifying Participants’ Global Positions Given a representation of a post xm , we can use Equation 1 to estimate its membership probabil- After estimating local positions between partici- ity of belonging to each class, P (agree|xm ) and pants, we now can infer a participant’s global sup- P (disagree|xm ), respectively. port or opposition position with regards to the seed post. For this purpose, a thread structure must be 3.2 Estimating Local Positions between considered. A thread begins with a seed post, which Participants is further followed by other response posts. Of these responses, many employ a quote mechanism to ex- After obtaining local position between posts, this plicitly state which post they reply to, whereas oth- step aims to aggregate over posts to determine ers are assumed to be in response to the most recent whether each pair of participants agree with each post in the thread. We construct a tree-like thread other. The intuition is that, in one threaded dis- structure by examining all the posts in a thread and cussion, most of the participants tend to retain their determining the parent of each post. Then, travers- positions in the course of their arguments. This as- ing through the thread structure from top to bottom sumption holds for the ground-truth annotations we allows us to infer the global position of each user have obtained in our data sets. Given local predic- with respect to the seed post. When there is more tions obtained from the previous step, we adopt the than one path from the seed to a user, the shortest weighted voting scheme to determine the local posi- path is used to infer the user’s global position on the tion for each pair of participants. Specifically, given main issue. a pair of users i and j, we aggregate over all the We illustrate this inference process using Figure reply-to posts between them to calculate the overall 1, an example thread with four users and six posts. 65 Let L(m, n) denote the local position between two posts in a thread, then label each post with agree if users m and n. In the figure, the local position be- the author agreed with the seed post; disagree if they tween user B and user A (the author of the seed disagreed; or neutral if opinions were mixed or un- post), L(A, B), is in agreement, while users B and clear. The annotators used training data until they C, A and C, as well as C and D each disagree. reached 85% agreement, then annotated posts sepa- Walking the shortest path between D and the seed rately. At no time were they allowed to confer. Lo- in Figure 1(a), we have L(C, D) = disagree and cal annotations were reverse-engineered from these L(A, C) = disagree, so we can infer that the global global annotations. The ratio of posts annotated as position between user D and user A is in agreement. agree to those as disagree is about 2 to 1 on both That is, user D agrees with the seed post. Had the datasets. local position between user A and user C, L(A, C), For our proposed three-stage method, local an- been in agreement, then we would have concluded notations were taken as input to train the classi- that user D disagrees with the seed post. fier and then used as ground truth to evaluate the performance of local agreement/disagreement clas- 4 Experiments sification, while the global annotations were only In this section, we describe our experiments on two used to evaluate our final accuracy of global agree- real-world data sets and report our experimental re- ment/disagreement identification. 
In contrast, the sults for local and global (dis)agreement classifica- baseline classifiers that we compare against for tion. global classification were directly trained and evalu- ated using global annotations. 4.1 Data Sets 4.2 Evaluation Metrics We used two data sets to evaluate our pro- posed method in our experiments. They were We used two evaluation metrics to evaluate the per- crawled from the U.S. Message Board (www. formance of agreement/disagreement classification. usmessageboard.com) and the Political Forum The first metric is accuracy, which is computed as (www.politicalforum.com). The two data the percentage of correctly classified examples over sets are referred to as usmb and pf, respectively, in all the test data: our discussion. The detailed characteristics of the |{x : x ∈ Dtest h(x) = y}| T two data sets are given in Table 1. accuracy = , |Dtest | Table 1: Characteristics of data sets where Dtest denotes the test data, y is the ground truth annotation label and h(x) is the predicted class usmb pf label. # of threads 88 33 Accuracy can be biased in situations with un- # of posts 818 170 even division between classes, so we also evaluate # of participants 270 103 our classifiers with the F-measure. For each class Mean # of posts per thread 9.3 5.2 i ∈ {agree, disagree}, we first calculate precision Mean # of participants per thread 3.1 3.1 P (i) and recall R(i), and the F-measure is computed Mean # of posts per participant 3.0 1.7 as 2P (i)R(i) F1(i) = . For the evaluation, each post was labelled with P (i) + R(i) two annotations. The first was a global annotation For our binary task, we report the average F-measure with respect to the thread’s seed post, and the other over both classes. was a local annotation with respect to the immediate parent. Seed posts themselves were not annotated, 4.3 Local Agree/Disagree Classification nor were they classified by our algorithms. In our experiments, we used the implementation Global annotations were made by two postgrad- of L2-regularised logistic regression in Fan et al. uate students. Each was instructed to read all the (2008) as our local classifier. For each data set, 66 Table 2: Classification performance for local (dis)agreement usmb pf Accuracy F-measure Accuracy F-measure Naive Bayes, all features 0.46 0.42 0.52 0.51 SVM, all features 0.56 0.60 0.55 0.52 Logistic regression, all features 0.62 0.65 0.68 0.77 Table 3: Feature analysis for local (dis)agreement using logistic regression usmb pf Accuracy F-measure Accuracy F-measure words 0.50 0.55 0.55 0.63 words, sentiment 0.53 0.59 0.61 0.71 words, sentiment, emotional 0.54 0.51 0.55 0.65 words, sentiment, durational 0.58 0.61 0.64 0.72 words, sentiment, emotional, durational 0.62 0.65 0.68 0.77 we used 70% of posts as training and the other sion using different types of features on the two data 30% were held out for testing. We compared reg- sets. We can see from the table that using both words ularised logistic regression against two baselines: and sentiment features can improve the performance naive Bayes and support vector machines (SVMs), as compared to using only words features. On the which have been used for (dis)agreement classifica- usmb dataset, adding emotional features slightly im- tion in previous works (Thomas et al., 2006; Soma- proves the accuracy but degrades F-measure, while sundaran and Wiebe, 2010). For SVMs, we used on the pf dataset, it degrades on accuracy and F- the toolbox LIBSVM in Chang and Lin (2011) to measure. 
In addition, durational features substan- implement the classification and probability estima- tially improve the classification performance on the tion. We tuned the parameter C in regularised logis- two metrics. Overall, the highest classification ac- tic regression and SVM, using cross-validation on curacy and F-measure can be achieved by using all the training data, and thereafter the optimal C was four types of features. used on the test data for evaluation. Table 2 compares the local classification accuracy 4.4 Global Support/Opposition Identification of the three methods on data sets usmb and pf, re- We also conducted experiments to validate the ef- spectively. We can see from the table that logistic fectiveness of our proposed method for global posi- regression outperforms naive Bayes and SVM on tion identification. Table 4 reports the performance the two evaluation metrics for local classification. of global classification using the three methods on Although logistic regression and SVM have been the two data sets. Classifiers “without aggregation” shown to yield comparable performance on some were trained directly on global annotations, with- text categorisation tasks Li and Yang (2003), in out considering local positions at all; those “with our problem, regularised logistic regression was ob- aggregation” were developed with our three-stage served to outperform SVM for local (dis)agreement method, estimating global positions by aggregating classification. local positions L(m, n). Experiments were also carried out to investigate As before, logistic regression generally outper- how the performance of local classification would be forms SVM or naive Bayes classifiers, although changed by using different types of features. Table 3 SVM does well on usmb when aggregation (via shows the classification accuracy of logistic regres- L(m, n)) is used. Although SVM scores well for 67 Table 4: Classification performance for global (dis)agreement usmb pf Accuracy F-measure Accuracy F-measure Without aggregation Naive Bayes, all features 0.42 0.41 0.48 0.47 SVM, all features 0.62 0.46 0.68 0.40 Logistic regression, all features 0.60 0.63 0.65 0.77 With aggregation Naive Bayes, all features 0.54 0.67 0.65 0.70 SVM, all features 0.64 0.77 0.48 0.60 Logistic regression, all features 0.64 0.77 0.68 0.76 classification accuracy without aggregation, it has by assigning a coefficient to each specific feature, degraded and classifies everything as the majority and directly estimates probabilities of class mem- class in these cases. The F-measure is correspond- berships, which is quite useful for aggregating local ingly poor due to a low recall. This observation is positions between users. Our feature analysis has consistent with the findings reported in Agrawal et suggested that using sentiment, emotional and du- al. (2003). rational features can significantly improve the per- In all cases — bar logistic regression on the pf set formance over only using word features. Experi- — aggregation of local classifications improves the mental results have also shown that, for identifying performance of global classification. This is more users’ global positions on an issue, aggregating lo- marked in the usmb data set, which has slightly cal positions over posts results in better performance more exchanges between each pair of users (mean than no-aggregation baselines and that more benefit 1.33 per pair per topic, vs. 1.19 for the pf data seems to accrue as users exchange more posts. 
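As a concrete sketch of the aggregation-plus-tree-walk pipeline evaluated above (Sections 3.2 and 3.3), the code below walks a shortest path from a user back to the seed author and composes the local positions along it; an even number of disagreements implies global agreement, which mirrors the Figure 1 walk-through. The graph representation is an assumption made for illustration, not the authors' data structure.

# Hedged sketch of global position inference by walking local positions back to the seed.
from collections import deque
from typing import Dict, Tuple

def global_position(user: str, seed_author: str,
                    local: Dict[Tuple[str, str], str]) -> str:
    """local maps user pairs to a local position, e.g. ('A', 'C') -> 'disagree'."""
    # Build an undirected adjacency list from the pairwise local positions L(m, n).
    adj: Dict[str, list] = {}
    for (u, v), rel in local.items():
        adj.setdefault(u, []).append((v, rel))
        adj.setdefault(v, []).append((u, rel))
    # Breadth-first search finds a shortest path; count the 'disagree' edges along it.
    queue = deque([(user, 0)])
    seen = {user}
    while queue:
        node, flips = queue.popleft()
        if node == seed_author:
            return "agree" if flips % 2 == 0 else "disagree"
        for nxt, rel in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, flips + (rel == "disagree")))
    return "unknown"

# Figure 1 example: L(A,B)=agree, L(A,C)=disagree, L(B,C)=disagree, L(C,D)=disagree.
local = {("A", "B"): "agree", ("A", "C"): "disagree",
         ("B", "C"): "disagree", ("C", "D"): "disagree"}
print(global_position("D", "A", local))  # 'agree': the two disagreements cancel out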
set) and therefore more potential for aggregation. We believe that this improvement is because local We consider extending this work along several di- classification is sometimes error prone, especially rections. First, we would like to examine what other when opinions are not expressed clearly in individ- factors would have predictive power in online de- ual posts. If so, and assuming that users tend to re- bates and thus could be utilised to improve the per- tain their stances within a debate, aggregation can formance of agreement/disagreement classification. “wash out” local classification errors. Second, we have so far focused on classifying users’ positions into two categories: agree and disagree. 5 Conclusion and Future Work However, there do exist a portion of posts falling into the neutral category; that means posts/users do not In this paper, we have proposed a new method for express any position towards an issue. We will ex- identifying participants’ agreement or disagreement plore how to extend our computational framework to on an issue by exploiting local information con- classify the neutral class. Finally, in online debates, tained in individual posts. Our proposed method it is not uncommon to have off-topic or topic-drift builds a unified framework which enables the clas- posts, especially for long threaded discussions. Off- sification of participants’ local and global positions topic posts are the ones totally irrelevant to the main in online debates. To evaluate the performance of issue being discussed, and topic-drift posts usually our proposed method, we conducted experiments on exist when the topic of a debate has shifted over two real-world data sets collected from two online time. Taking these posts into consideration would debate forums. Our experiments have shown that increase the difficulty of automatic agreement and regularised logistic regression is useful for this type disagreement classification, and therefore it is an- of task; it has a built-in automatic feature selection other important issue we plan to investigate. 68 References tational Linguistics, pages 869–875, Beijing, China, August. Rakesh Agrawal, Sridhar Rajagopalan, Ramakrishnan Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Srikant, and Yirong Xu. 2003. Mining newsgroups Yang, and Zheng Chen. 2010. Cross-domain senti- using networks arising from social bahavior. In Pro- ment classification via spectral feature alignment. In ceedings of the 12th International World Wide Web Proceedings of the 19th International World Wide Web Conference, pages 529–535, Budapest, Hungary, May. Conference, pages 751–760, Raleigh, NC, USA, April. Stefano Baccianella, Andrea Esuli, , and Fabrizio Sebas- Swapna Somasundaran and Janyce Wiebe. 2010. Recog- tiani. 2010. SENTIWORDNET 3.0: An enhanced nizing stances in ideological on-line debates. In Pro- lexical resource for sentiment analysis and opinion ceedings of the NAACL HLT 2010 Workshop on Com- mining. In Proceedings of the 7th Conference on In- putational Approaches to Analysis and Generation of ternatinal Language Resources and Evaluation, pages Emotion in Text, pages 116–124, Los Angeles, CA, 2200–2204, Valletta, Malta, May. USA, June. Ramnath Balasubramanyan and William W. Cohen. V. S. Subrahmanian and Diego Regorgiato. 2008. 2011. What pushes their buttons? Predicting com- AVA: Adjective-verb-adverb combinations for senti- ment polarity from the content of political blog posts. ment analysis. Intelligent Systems, 23(4):43–50. 
In Proceedings of the ACL Workshop on Language in Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out Social Media, pages 12–19, Porland, Oregon, USA, the vote: Determining support or opposition from con- June. gressional floor-debate transcripts. In Proceedings of Farah Benamara, Carmine Cesarano, Antonio Picariello, the Conference on Empirical Methods in Natural Lan- Diego Reforgiato, and V. S. Subrahmanian. 2007. guage Processing, pages 327–335, Sydney, Australia, Sentiment analysis: Adjectives and adverbs are better July. than adjectives alone. In Proceedings of the Interna- Wen Wang, Sibel Yaman, Kristin Precoda, Colleen tional AAAI Conference on Weblogs and Social Media, Richey, and Geoffrey Raymond. 2011. Detection Boulder, CO, USA, March. of agreement and disagreement in broadcast conver- Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: sations. In Proceedings of the 49th Annual Meeting of A library for support vector machines. ACM Transac- the Association for Computational Linguistics, pages tions on Intelligent Systems and Technology, 2(27):1– 374–378, Porland, Oregon, USA, June. 27. Tong Zhang and Frank J. Oles. 2001. Text categorisation Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui based on regularised linear classification methods. In- Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A li- formation Retrieval, 4(1):5–31. brary for large linear classification. Journal of Ma- chine Learning Research, 9:1871–1874. Michel Galley, Kathleen McKeown, Julia Hirschberg, and Elizabeth Shriberg. 2004. Identifying agree- ment and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependen- cies. In Proceedings of the 42nd Meeting of the Asso- ciation for Computational Linguistics, pages 669–676, Barcelona, Spain, July. Fan Li and Yiming Yang. 2003. A loss function analy- sis for classification methods in text categorisation. In Proceedings of the 20th International Conference on Machine Learning, pages 472–479, Washington, DC, USA, July. Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. 2008. Trust region Newton method for large-scale lo- gistic regression. Journal of Machine Learning Re- search, 9:627–650. Akiko Murakami and Rudy Raymond. 2010. Support or oppose? Classifying positions in online debates from reply activities and opinion expressions. In Proceed- ings of the 23rd International Conference on Compu- 69 Prior versus Contextual Emotion of a Word in a Sentence Diman Ghazi Diana Inkpen Stan Szpakowicz EECS, University of Ottawa EECS, University of Ottawa EECS, University of Ottawa &
ICS, Polish Academy of Sciences
[email protected], [email protected]
[email protected]Abstract determining the presence of sentiment in the given text, and on determining its polarity – the positive or A set of words labelled with their prior emo- negative orientation. The applications of sentiment tion is an obvious place to start on the auto- analysis range from classifying positive and nega- matic discovery of the emotion of a sentence, but it is clear that context must also be con- tive movie reviews (Pang, Lee, and Vaithyanathan, sidered. No simple function of the labels on 2002; Turney, 2002) to opinion question-answering the individual words may capture the overall (Yu and Hatzivassiloglou, 2003; Stoyanov, Cardie, emotion of the sentence; words are interre- and Wiebe, 2005). The analysis of sentiment must, lated and they mutually influence their affect- however, go beyond differentiating positive from related interpretation. We present a method negative emotions to give a systematic account of which enables us to take the contextual emo- the qualitative differences among individual emo- tion of a word and the syntactic structure of the tion (Ortony, Collins, and Clore, 1988). sentence into account to classify sentences by emotion classes. We show that this promising In this work, we deal with assigning fine-grained method outperforms both a method based on emotion classes to sentences in text. It might seem a Bag-of-Words representation and a system that these two tasks are strongly tied, but the higher based only on the prior emotions of words. level of classification in emotion recognition task The goal of this work is to distinguish auto- and the presence of certain degrees of similarities matically between prior and contextual emo- tion, with a focus on exploring features impor- between some emotion labels make categorization tant for this task. into distinct emotion classes more challenging and difficult. Particularly notable in this regard are two classes, anger and disgust, which human annotators 1 Introduction often find hard to distinguish (Aman and Szpakow- Recognition, interpretation and representation of af- icz, 2007). In order to recognize and analyze affect fect have been investigated by researchers in the in written text – seldom explicitly marked for emo- field of affective computing (Picard 1997). They tions – NLP researchers have come up with a variety consider a wide range of modalities such as affect in of techniques, including the use of machine learn- speech, facial display, posture and physiological ac- ing, rule-based methods and the lexical approach tivity. It is only recently that there has been a grow- (Neviarouskaya, Prendinger, and Ishizuka, 2011). ing interest in automatic identification and extraction There has been previous work using statistical of sentiment, opinions and emotions in text. methods and supervised machine learning applied to Sentiment analysis is the task of identifying posi- corpus-based features, mainly unigrams, combined tive and negative opinions, emotions and evaluations with lexical features (Alm, Roth, and Sproat, 2005; (Wilson, Wiebe, and Hoffmann, 2005). Most of the Aman and Szpakowicz, 2007; Katz, Singleton, and current work in sentiment analysis has focused on Wicentowski, 2007). The weakness of such methods 70 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 70–78, Jeju, Republic of Korea, 12 July 2012. 
2012 c Association for Computational Linguistics is that they neglect negation, syntactic relations and happi- sad- anger dis- sur- fear total semantic dependencies. They also require large (an- ness ness gust prise notated) corpora for meaningful statistics and good 398 201 252 53 71 141 1116 performance. Processing may take time, and anno- Table 1: The distribution of labels in the WordNet- tation effort is inevitably high. Rule-based meth- Affect Lexicon. ods (Chaumartin, 2007; Neviarouskaya, Prendinger, and Ishizuka, 2011) require manual creation of rules. That is an expensive process with weak guaran- sider their effect on the emotion expressed by the tee of consistency and coverage, and likely very sentence. Finally, we use machine learning to clas- task-dependent; the set of rules of rule-based af- sify the sentences, represented by the chosen fea- fect analysis task (Neviarouskaya, Prendinger, and tures, by their contextual emotion. Ishizuka, 2011) can differ drastically from what un- We categorize sentences into six basic emotions derlies other tasks such as rule-based part-of-speech defined by Ekman (1992); that has been the choice tagger, discourse parsers, word sense disambigua- of most of previous related work. These emotions tion and machine translation. are happiness, sadness, fear, anger, disgust and sur- The study of emotions in lexical semantics was prise. There also may, naturally, be no emotion in a the theme of a SemEval 2007 task (Strapparava and sentence; that is tagged as neutral/non-emotional. Mihalcea, 2007), carried out in an unsupervised set- We evaluate our results by comparing our method ting (Strapparava and Mihalcea, 2008; Chaumartin, applied to our set of features with Support Vec- 2007; Kozareva et al., 2007; Katz, Singleton, and tor Machine (SVM) applied to Bag-of-Words, which Wicentowski, 2007). The participants were encour- was found to give the best performance among su- aged to work with WordNet-Affect (Strapparava and pervised methods (Yang and Liu, 1999; Pang, Lee, Valitutti, 2004) and SentiWordNet (Esuli and Sebas- and Vaithyanathan, 2002; Aman and Szpakowicz, tiani, 2006). Word-level analysis, however, will not 2007; Ghazi, Inkpen, and Szpakowicz, 2010). We suffice when affect is expressed by phrases which re- show that our method is promising and that it out- quire complex phrase- and sentence-level analyses: performs both a system which works only with prior words are interrelated and they mutually influence emotions of words, ignoring context, and a system their affect-related interpretation. On the other hand, which applies SVM to Bag-of-Words. words can have more than one sense, and they can Section 2 of this paper describes the dataset and only be disambiguated in context. Consequently, the resources used. Section 3 discusses the features emotion conveyed by a word in a sentence can differ which we use for recognizing contextual emotion. drastically from the emotion of the word on its own. Experiments and results are presented in Section 4. For example, according to the WordNet-Affect lex- In Section 5, we conclude and discuss future work. icon, the word ”afraid” is listed in the ”fear” cate- gory, but in the sentence “I am afraid it is going to 2 Dataset and Resources rain.” the word ”afraid” does not convey fear. We refer to the emotion listed for a word in an Supervised statistical methods typically require emotion lexicon as the word’s prior emotion. 
A training data and test data, manually annotated word’s contextual emotion is the emotion of the sen- with respect to each language-processing task to be tence in which that word appears, taking the context learned. In this section, we explain the dataset and into account. lexicons used in our experiments. Our method combines several way of tackling the WordNet-Affect Lexicon (Strapparava and Vali- problem. First, we find keywords listed in WordNet- tutti, 2004). The first resource we require is an Affect and select the sentences which include emo- emotional lexicon, a set of words which indicate tional words from that lexicon. Next, we study the the presence of a particular emotion. In our exper- syntactic structure and semantic relations in the text iments, we use WordNet-Affect, which contains six surrounding the emotional word. We explore fea- lists of words corresponding to the six basic emo- tures important in emotion recognition, and we con- tion categories. It is the result of assigning a variety 71 Neutral Negative Positive Both hp sd ag dg sr fr ne total 6.9% 59.7% 31.1% 0.3% 536 173 179 172 115 115 800 2090 Table 2: The distribution of labels in the Prior-Polarity Table 3: The distribution of labels in Aman’s modified Lexicon. dataset. The labels are happiness, sadness, anger, dis- gust, surprise, fear, no emotion. of affect labels to each synset in WordNet. Table 1 3 Features shows the distribution of words in WordNet-Affect. The features used in our experiments were motivated Prior-Polarity Lexicon (Wilson, Wiebe, and both by the literature (Wilson, Wiebe, and Hoff- Hoffmann, 2009). The prior-polarity subjectivity mann, 2009; Choi et al., 2005) and by the explo- lexicon contains over 8000 subjectivity clues col- ration of contextual emotion of words in the anno- lected from a number of sources. To create this tated data. All of the features are counted based on lexicon, the authors began with the list of subjec- the emotional word from the lexicon which occurs in tivity clues extracted by Riloff (2003). The list the sentence. For ease of description, we group the was expanded using a dictionary and a thesaurus, features into four distinct sets: emotion-word fea- and adding positive and negative word lists from tures, part-of-speech features, sentence features and the General Inquirer.1 Words are grouped into dependency-tree features. strong subjective and weak subjective clues; Table 2 Emotion-word features. This set of features are presents the distribution of their polarity. based on the emotion-word itself. Intensifier Lexicon (Neviarouskaya, Prendinger, and Ishizuka, 2010). It is a list of 112 modifiers (ad- • The emotion of a word according to WordNet- verbs). Two annotators gave coefficients for inten- Affect (Strapparava and Valitutti, 2004). sity degree – strengthening or weakening, from 0.0 • The polarity of a word according to the prior- to 2.0 – and the result was averaged. polarity lexicon (Wilson, Wiebe, and Hoff- mann, 2009). Emotion Dataset (Aman and Szpakowicz, • The presence of a word in a small list of modi- 2007). The main consideration in the selection of fiers (Neviarouskaya, Prendinger, and Ishizuka, data for emotional classification task is that the data 2010). should be rich in emotion expressions. That is why we chose for our experiments a corpus of blog sen- Part-of-speech features. Based on the Stanford tences annotated with emotion labels, discussed by tagger’s output (Toutanova et al., 2003), every word Aman and Szpakowicz (2007). 
Each sentence is in a sentence gets one of the Penn Treebank tags. tagged by its dominant emotion, or as non-emotional if it does not include any emotion. The annotation is • The part-of-speech of the emotional word it- based on Ekman’s six emotions at the sentence level. self, both according to the emotion lexicon and The dataset contains 4090 annotated sentences, 68% Stanford tagger. of which were marked as non-emotional. The highly • The POS of neighbouring words in the same unbalanced dataset with non-emotional sentences as sentence. We choose a window of [-2,2], as it by far the largest class, and merely 3% in the fear is usually suggested by the literature (Choi et and surprise classes, prompted us to remove 2000 of al., 2005). the non-emotional sentences. We lowered the num- ber of non-emotional sentences to 38% of all the Sentence features. For now we only consider the sentences, and thus reduced the imbalance. Table 3 number of words in the sentence. shows the details of the chosen dataset. Dependency-tree features. For each emotional word, we create features based on the parse tree and its dependencies produced by the Stanford parser 1 www.wjh.harvard.edu/∼inquirer/ (Marneffe, Maccartney, and Manning, 2006). The 72 dependencies are all binary relations: a grammati- hp sd ag dg sr fr ne total cal relation holds between a governor (head) and a part 1 196 64 64 63 36 52 150 625 dependent (modifier). part 2 51 18 22 18 9 14 26 158 According to Mohammad and Turney (2010),2 part 1+ 247 82 86 81 45 66 176 783 part 2 adverbs and adjectives are some of the most emotion-inspiring terms. This is not surprising con- Table 4: The distribution of labels in the portions of sidering that they are used to qualify a noun or a Aman’s dataset used in our experiments, named part 1, verb; therefore to keep the number of features small, part 2 and part 1+part 2. The labels are happiness, sad- among all the 52 different type of dependencies, we ness, anger, disgust, surprise, fear, no emotion. only chose the negation, adverb and adjective modi- fier dependencies. After parsing the sentence and getting the de- • Modifies-intensifier-strengthen: w modifies a pendencies, we count the following dependency-tree strengthening intensifier from the intensifier Boolean features for the emotional word. lexicon. • Modifies-intensifier-weaken: w modifies a • Whether the word is in a “neg” dependency weakening intensifier from the intensifier lex- (negation modifier): true when there is a nega- icon. tion word which modifies the emotional word. • Modified-by-intensifier-strengthen: w is the • Whether the word is in a “amod” dependency head of the dependency, which is modified by (adjectival modifier): true if the emotional a strengthening intensifier from the intensifier word is (i) a noun modified by an adjective or lexicon. (ii) an adjective modifying a noun. • Modified-by-intensifier-weaken: w is the head • Whether the word is in a “advmod” depen- of the dependency, which is modified by a dency (adverbial modifier): true if the emo- weakening intensifier from the intensifiers lex- tional word (i) is a non-clausal adverb or adver- icon. bial phrase which serves to modify the meaning of a word, or (ii) has been modified by an ad- 4 Experiments verb. In the experiments, we use the emotion dataset pre- We also have several modification features based sented in Section 2. Our main consideration is to on the dependency tree. 
These Boolean features cap- classify a sentence based on the contextual emotion ture different types of relationships involving the cue of the words (known as emotional in the lexicon). word.3 We list the feature name and the condition on That is why in the dataset we only choose sentences the cue word w which makes the feature true. which contain at least one emotional word accord- • Modifies-positive: w modifies a positive word ing to WordNet-Affect. As a result, the number of from the prior-polarity lexicon. sentences chosen from the dataset will decrease to • Modifies-negative: w modifies a negative word 783 sentences, 625 of which contain only one emo- from the prior-polarity lexicon. tional word and 158 sentences which contain more • Modified-by-positive: w is the head of the de- than one emotional word. Their details are shown in pendency, which is modified by a positive word Table 4. from the prior-polarity lexicon. Next, we represent the data with the features pre- • Modified-by-negative: w is the head of the sented in Section 3. Those features, however, were dependency, which is modified by a negative defined for each emotional word based on their con- word from the prior-polarity lexicon. text, so we will proceed differently for sentences 2 with one emotional word and sentences with more In their paper, they also explain how they created an emo- tion lexicon by crowd-sourcing, but – to the best of our knowl- than one emotional word. edge – it is not publicly available yet. 3 The terms “emotional word” and “cue word” are used in- • In sentences with one emotional word, we as- terchangeably. sume the contextual emotion of the emotional 73 word is the same as the emotion assigned to the Precision Recall F sentence by the human annotators; therefore all Happiness 0.59 0.67 0.63 the 625 sentences with one emotional word are Sadness 0.38 0.45 0.41 SVM + Anger 0.40 0.31 0.35 represented with the set of features presented Bag-of- Surprise 0.41 0.33 0.37 in Section 3 and the sentence’s emotion will be Disgust 0.51 0.43 0.47 Words considered as their contextual emotion. Fear 0.55 0.50 0.52 • For sentences with more than one emotional Non-emo 0.49 0.48 0.48 word, the emotion of the sentence depends on Accuracy 50.72% all emotional words and their syntactic and se- Happiness 0.68 0.78 0.73 mantic relations. We have 158 sentences where Sadness 0.49 0.58 0.53 no emotion can be assigned to the contextual SVM Anger 0.66 0.48 0.56 + our Surprise 0.61 0.31 0.41 emotion of their emotional words, and all we Disgust 0.43 0.38 0.40 features know is the dominant emotion of the sentence. Fear 0.67 0.63 0.65 Non-emo 0.51 0.53 0.52 We will, therefore, have two different sets of ex- Accuracy 58.88% periments. For the first set of sentences, the data are Happiness 0.78 0.82 0.80 all annotated, so we will take a supervised approach. Sadness 0.53 0.64 0.58 Logistic For the second set of sentences, we combine super- Anger 0.69 0.62 0.66 Regres- vised and unsupervised learning. We train a clas- Surprise 0.89 0.47 0.62 sion + our sifier on the first set of data and we use the model Disgust 0.81 0.41 0.55 features Fear 0.71 0.71 0.71 to classify the emotional words into their contextual Non-emo 0.53 0.64 0.58 emotion in the second set of data. Finally, we pro- Accuracy 66.88% pose an unsupervised method to combine the con- textual emotion of all the emotional words in a sen- Table 5: Classification experiments on the dataset with tence and calculate the emotion of the sentence. 
one emotional word in each sentence. Each experiment For evaluation, we report precision, recall, F- is marked by the method and the feature set. measure and accuracy to compare the results. We also define two baselines for each set of experiments to compare our results with. The experiments are experiment is 51%, remarkably higher than the first presented in the next two subsections. baseline’s accuracy. The second baseline is particu- larly designed to address the emotion of the sentence 4.1 Experiments on sentences with one only based on the prior emotion of the emotional emotional word words; therefore it will allow us to assess the dif- ference between the emotion of the sentence based In these experiments, we explain first the baselines on the prior emotion of the words in the sentence and then the results of our experiments on the sen- versus the case when we consider the context and its tences with only one emotional word. effect on the emotion of the sentence. Baseline Learning Experiments We develop two baseline systems to assess the dif- ficulty of our task. The first baseline labels the sen- In this part, we use two classification algorithms, tences the same as the most frequent class’s emo- Support Vector Machines (SVM) and Logistic Re- tion, which is a typical baseline in machine learning gression (LR), and two different set of features, tasks (Aman and Szpakowicz, 2007; Alm, Roth, and the set of features from Section 3 and Bag-of- Sproat, 2005). This baseline will result in 31% ac- Words (unigram). Unigram models have been curacy. widely used in text classification and shown to pro- The second baseline labels the emotion of the sen- vide good results in sentiment classification tasks. tence the same as the prior emotion of the only emo- In general, SVM has long been a method of tional word in the sentence. The accuracy of this choice for sentiment recognition in text. SVM has 74 been shown to give good performance in text clas- 4.2 Experiments on sentences with more than sification experiments as it scales well to the large one emotional word numbers of features (Yang and Liu, 1999; Pang, Lee, In these experiments, we combine supervised and and Vaithyanathan, 2002; Aman and Szpakowicz, unsupervised learning. We train a classifier on the 2007). For the classification, we use the SMO al- first set of data, which is annotated, and we use the gorithm (Platt, 1998) from Weka (Hall et al., 2009), model to classify the emotional words in the sec- setting 10-fold cross validation as a testing option. ond group of sentences. We propose an unsuper- We compare applying SMO to two sets of features, vised method to combine the contextual emotion of (i) Bag-of-Words, which are binary features defin- the emotional words and calculate the emotion of the ing whether a unigram exists in a sentence and (ii) sentence. our set of features. In our experiments we use uni- grams from the corpus, selected using feature selec- Baseline tion methods from Weka. We develop two baseline systems. The first base- line labels all the sentences the same: as the emo- We also compare those two results with the third tion of the most frequent class, giving 32% accu- experiment: apply SimpleLogistic (Sumner, Frank, racy. The second baseline labels the emotion of the and Hall, 2005) from Weka to our set of features, sentence the same as the most frequently occurring again setting 10-fold cross validation as a testing op- prior-emotion of the emotional words in the sen- tion. 
Logistic regression is a discriminative prob- tence. In the case of a tie, we randomly pick one abilistic classification model which operates over of the emotions. The accuracy of this experiment real-valued vector inputs. It is relatively slow to train is 45%. Again, as a second baseline we choose a compared to the other classifiers. It also requires ex- baseline that is based on the prior emotion of the tensive tuning in the form of feature selection and emotional words so that we can compare it with the implementation to achieve state-of-the-art classifica- results based on contextual emotion of the emotional tion performance. Logistic regression models with words in the sentence. large numbers of features and limited amounts of Learning Experiments training data are highly prone to over-fitting (Alias- i, 2008). Besides, logistic regression is really slow For sentences with more than one emotional and it is known to only work on data represented word, we represent each emotional word and its con- by a small set of features. That is why we do not text by the set of features explained in section 3. We apply SimpleLogistic to Bag-of-Words features. On do not have the contextual emotion label for each the other hand, the number of our features is rela- emotional word, so we cannot train the classifier on tively low, so we find logistic regression to be a good these data. Consequently, we train the classifier on choice of classifier for our representation method. the part of the dataset which only includes sentences The classification results are shown in Table 5. with one emotional word. In these sentences, each emotional word is labeled with their contextual emo- tion – the same as the sentence’s emotion. We note consistent improvement. The results of Once we have the classifier model, we get the both experiments using our set of features signifi- probability distribution of emotional classes for each cantly outperform (on the basis of a paired t-test, emotional word (calculated by the logistic regres- p=0.005) both the baselines and SVM applied to sion function learned from the annotated data). We Bag-of-Words features. We get the best result, how- add up the probabilities of each class for all emo- ever, by applying logistic regression to our feature tional words. Finally, we select the class with the set. The number of our features and the nature of maximum probability. The result, shown in Table 6, the features we introduce make them an appropriate is compared using supervised learning, SVM, with choice of data representation for logistic regression Bag-of-Words features, explained in previous sec- methods. tion, with setting 10-fold cross validation as a testing 75 Precision Recall F Precision Recall F Happiness 0.52 0.60 0.54 Happiness 0.813 0.698 0.751 Sadness 0.35 0.33 0.34 Sadness 0.605 0.416 0.493 SVM + Anger 0.30 0.27 0.29 Anger 0.650 0.436 0.522 Bag-of- Surprise 0.14 0.11 0.12 Surprise 0.723 0.409 0.522 Words Disgust 0.30 0.17 0.21 Disgust 0.672 0.488 0.566 Fear 0.44 0.29 0.35 Fear 0.868 0.513 0.645 Non-emo 0.23 0.35 0.28 Non-emo 0.587 0.625 0.605 Accuracy 36.71% Logistic Happiness 0.63 0.71 0.67 Table 7: Aman’s best result on the dataset explained in Regres- Sadness 0.67 0.44 0.53 Section 2. sion + Anger 0.50 0.41 0.45 unsu- Surprise 1.00 0.22 0.36 pervised Disgust 0.80 0.22 0.34 the results of the two experiments on all 758 sen- + our Fear 0.60 0.64 0.62 tences with at least one emotional word. 
Table 6: Classification experiments on the dataset with more than one emotional word in each sentence. Each experiment is marked by the method and the feature set.

SVM + Bag-of-Words                       Precision  Recall  F
  Happiness                              0.52       0.60    0.54
  Sadness                                0.35       0.33    0.34
  Anger                                  0.30       0.27    0.29
  Surprise                               0.14       0.11    0.12
  Disgust                                0.30       0.17    0.21
  Fear                                   0.44       0.29    0.35
  Non-emo                                0.23       0.35    0.28
  Accuracy                               36.71%

Logistic Regression + unsupervised + our features
  Happiness                              0.63       0.71    0.67
  Sadness                                0.67       0.44    0.53
  Anger                                  0.50       0.41    0.45
  Surprise                               1.00       0.22    0.36
  Disgust                                0.80       0.22    0.34
  Fear                                   0.60       0.64    0.62
  Non-emo                                0.37       0.69    0.48
  Accuracy                               54.43%

Table 7: Aman's best result on the dataset explained in Section 2.

            Precision  Recall  F
  Happiness 0.813      0.698   0.751
  Sadness   0.605      0.416   0.493
  Anger     0.650      0.436   0.522
  Surprise  0.723      0.409   0.522
  Disgust   0.672      0.488   0.566
  Fear      0.868      0.513   0.645
  Non-emo   0.587      0.625   0.605

Comparing the results in Table 6, we can see that learning applied to our set of features significantly outperforms (on the basis of a paired t-test, p=0.005) both baselines and the result of the SVM algorithm applied to Bag-of-Words features.

4.3 Discussion

We cannot directly compare our results with the previous results of Aman and Szpakowicz (2007), because the datasets differ: their per-class F-measure, precision and recall are reported on the whole dataset, whereas we only used part of it. To show how hard this task is, and to see where we stand, the best result from (Aman and Szpakowicz, 2007) is shown in Table 7.

Our experiments showed that our approach and our features significantly outperform the baselines and the SVM result applied to Bag-of-Words. For a final comparison, we note from Table 6 that the accuracy of SVM applied to Bag-of-Words is very low. Because supervised methods scale well on large datasets, one reason could be the size of the data used in this experiment; we therefore compare the results of the two experiments on all 758 sentences with at least one emotional word. For this comparison, we apply SVM with Bag-of-Words features to all 758 sentences and obtain an accuracy of 55.17%. Given our features and methodology, we cannot apply logistic regression with our features to the whole dataset at once; we therefore calculate its accuracy as the percentage of correctly classified instances over both parts of the dataset used in the two experiments, which gives 64.36%. We also compare these results with the baselines: the first baseline, the proportion of the most frequent class (happiness in this case), gives 31.5% accuracy, and the second baseline, based on the prior emotion of the emotional words, gives 50.13%. Notably, the result of applying LR to our set of features remains significantly better than the result of applying SVM to Bag-of-Words and than both baselines, which supports our earlier conclusion. Because the results mentioned so far are hard to compare directly, we have combined them in Figure 1, which displays the accuracy obtained by each experiment.

We also inspected the cases where the contextual emotion differs from the prior emotion of the emotional word. Consider the sentence "Joe said it does not happen that often so it does not bother him." Based on the emotion lexicon, the word "bother" is classified as angry, so anger would be the emotion of the sentence if we only considered the prior emotion of words. Our set of features, however, takes the negation in the sentence into account, so the sentence is classified as non-emotional rather than angry. Another interesting sentence is the rather simple "You look like her I guess." Based on the lexicon, the word "like" is in the happy category, while the sentence is non-emotional. In this case, the part-of-speech features play an important role: they capture the fact that "like" is not a verb here, so it does not convey a happy emotion, and the sentence is classified as non-emotional.
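The toy rule below illustrates, in simplified form, how such contextual cues (negation and part of speech) can override the prior emotion from the lexicon. The lexicon entries and the rule itself are illustrative stand-ins, not the authors' actual features or system.

```python
# Prior emotions come from an emotion lexicon; negation and POS act as
# contextual overrides, as in the two example sentences discussed above.
PRIOR_EMOTION = {"bother": "anger", "like": "happiness"}

def contextual_emotion(word, pos, negated):
    prior = PRIOR_EMOTION.get(word, "non-emo")
    if negated:                            # "does not bother him"
        return "non-emo"
    if word == "like" and pos != "VERB":   # "You look like her" (comparison, not liking)
        return "non-emo"
    return prior

print(contextual_emotion("bother", "VERB", negated=True))   # non-emo
print(contextual_emotion("like", "ADP", negated=False))     # non-emo
print(contextual_emotion("bother", "VERB", negated=False))  # anger
```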
Figure 1: Comparison of accuracy results of all experiments for sentences with one emotional word (part 1), sentences with more than one emotional word (part 2), and sentences with at least one emotional word (part 1 + part 2).

We also analyzed the errors and found some common causes:

- complex or unstructured sentences, which cause the parser to fail or return incorrect analyses, resulting in incorrect dependency-tree information;
- the limited coverage of the emotion lexicon.

These are issues we would like to address in our future work.

5 Conclusion and Future Directions

The focus of this study was a comparison of the prior emotion of a word with its contextual emotion, and their effect on the emotion expressed by the sentence. We also studied features important in recognizing contextual emotion. We experimented with a wide variety of linguistically motivated features and evaluated their performance using logistic regression. We showed that our approach and features significantly outperform the baselines and the SVM result applied to Bag-of-Words.

Even though the features we presented did quite well on the chosen dataset, in the future we would like to show their robustness by applying them to different datasets. Another direction for future work is to expand our emotion lexicon using existing techniques for automatically acquiring the prior emotion of words. Based on the number of instances in each emotion class, we noticed a tight relation between the number of words in each emotion list of the lexicon and the number of sentences derived for each emotion class; a larger lexicon should therefore give greater coverage of emotional expressions.

Last but not least, one weakness of our approach is that we could not use all the instances in the dataset. The main reason was again the low coverage of the emotion lexicon; the other was a limitation of our method, which required choosing only sentences with one or more emotional words. As future work, we would like to relax this restriction by using the root of the sentence (based on the dependency-tree result) as the cue word rather than an emotional word from the lexicon; for sentences with no emotional word, all the features could then be calculated with respect to the root word.

References

Alias-i. 2008. LingPipe 4.1.0., October.

Alm, Cecilia Ovesdotter, Dan Roth, and Richard Sproat. 2005. Emotions from Text: Machine Learning for Text-based Emotion Prediction. In HLT/EMNLP.

Aman, Saima and Stan Szpakowicz. 2007. Identifying expressions of emotion in text. In Proc. 10th International Conf. Text, Speech and Dialogue, pages 196–205. Springer-Verlag.

Chaumartin, François-Regis. 2007. UPAR7: a knowledge-based system for headline sentiment tagging. In Proc. 4th International Workshop on Semantic Evaluations, SemEval '07, pages 422–425.

Choi, Yejin, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying sources of opinions with conditional random fields and extraction patterns. In Proc. Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 355–362.

Ekman, Paul. 1992. An argument for basic emotions. Cognition & Emotion, 6(3):169–200.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In Proc. 5th Conf. on Language Resources and Evaluation, LREC 2006, pages 417–422.
Ghazi, Diman, Diana Inkpen, and Stan Szpakowicz. 2010. Hierarchical approach to emotion recognition and classification in texts. In Canadian Conference on AI, pages 40–50.

Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11:10–18, November.

Katz, Phil, Matthew Singleton, and Richard Wicentowski. 2007. SWAT-MP: the SemEval-2007 systems for task 5 and task 14. In Proc. 4th International Workshop on Semantic Evaluations, SemEval '07, pages 308–313.

Kozareva, Zornitsa, Borja Navarro, Sonia Vázquez, and Andrés Montoyo. 2007. UA-ZBSA: a headline emotion classification through web information. In Proc. 4th International Workshop on Semantic Evaluations, SemEval '07, pages 334–337.

Marneffe, Marie-Catherine De, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proc. LREC 2006.

Mohammad, Saif M. and Peter D. Turney. 2010. Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. In Proc. NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, CAAGET '10, pages 26–34.

Neviarouskaya, Alena, Helmut Prendinger, and Mitsuru Ishizuka. 2010. AM: textual attitude analysis model. In Proc. NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 80–88.

Neviarouskaya, Alena, Helmut Prendinger, and Mitsuru Ishizuka. 2011. Affect Analysis Model: novel rule-based approach to affect sensing from text. Natural Language Engineering, 17(1):95–135.
Ortony, Andrew, Allan Collins, and Gerald L. Clore. 1988. The cognitive structure of emotions. Cambridge University Press.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proc. ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP '02, pages 79–86.

Platt, John C. 1998. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.

Riloff, Ellen. 2003. Learning extraction patterns for subjective expressions. In Proc. 2003 Conf. on Empirical Methods in Natural Language Processing, pages 105–112.

Stoyanov, Veselin, Claire Cardie, and Janyce Wiebe. 2005. Multi-perspective question answering using the OpQA corpus. In Proc. Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 923–930.

Strapparava, Carlo and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. In Proc. Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, Prague, Czech Republic, June.

Strapparava, Carlo and Rada Mihalcea. 2008. Learning to identify emotions in text. In Proc. 2008 ACM Symposium on Applied Computing, SAC '08, pages 1556–1560.

Strapparava, Carlo and Alessandro Valitutti. 2004. WordNet-Affect: an Affective Extension of WordNet. In Proc. 4th International Conf. on Language Resources and Evaluation, pages 1083–1086.

Sumner, Marc, Eibe Frank, and Mark A. Hall. 2005. Speeding Up Logistic Model Tree Induction. In Proc. 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 675–683.

Toutanova, Kristina, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proc. HLT-NAACL, pages 252–259.

Turney, Peter D. 2002. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proc. 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 417–424.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proc. HLT-EMNLP, pages 347–354.

Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis. Computational Linguistics, 35(3):399–433.

Yang, Yiming and Xin Liu. 1999. A re-examination of text categorization methods. In Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 42–49.

Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. In Proc. 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, pages 129–136.

Cross-discourse Development of Supervised Sentiment Analysis in the Clinical Domain

Phillip Smith and Mark Lee
School of Computer Science, University of Birmingham
[email protected]  [email protected]

Abstract

Current approaches to sentiment analysis assume that the sole discourse function of sentiment-bearing texts is expressivity. However, the persuasive discourse function also utilises expressive language. In this work, we present the results of training supervised classifiers on a new corpus of clinical texts containing documents with an expressive discourse function, and we test the learned models on a subset of the same corpus containing persuasive texts. The results indicate that despite the difference in discourse function, the learned models perform favourably.

1 Introduction

Examining the role that discourse function holds is a critical part of an in-depth analysis of the capabilities of supervised sentiment classification techniques. However, it has not been comprehensively examined within the domain of sentiment analysis, due to the lack of suitable cross-discourse corpora on which to train and test machine learning methods.

To carry out such an investigation, this study focuses on the relationship between sentiment classification and two types of discourse function: Expressive and Persuasive. The expressive function denotes the feelings or attitudes of the author of a document, as in the following examples:

1. "I didn't like the attitude of the nursing staff."
2. "The doctors treated me with such care."

Intuitively, the associated polarity of each example is trivial to determine in these explicit cases. However, expressive statements do not operate in isolation from other discourse functions. As Biber (1988) notes, a persuasive statement incorporates elements of the expressive function in order to advise an external party of a proposed action that should be taken. The following example shows how persuasive statements make use of expressive functions:

1. "The clumsy nurse who wrongly diagnosed me should be fired."

The role of a persuasive statement is to incite an action in the target, dependent upon the intention that the author communicates. If plain, sentiment-neutral language is used, the reader may misinterpret why the request for action is being given and, in the worst case, not carry it out. Incorporating expressive language increases the weight of the persuasive statement: it enables the speaker to emphasise the underlying sentiment, thereby increasing the likelihood of the intended action being undertaken and their goals being accomplished. In the above example, the intention communicated by the author is the firing of the nurse. This in itself holds negative connotations, but through the use of the word 'clumsy', the negative sentiment of the statement becomes clearer.

The inclusion of expressive aspects in the language of the persuasive discourse function enables us to identify the sentiment of a persuasive comment. Since there is this cross-over in the language of the two discourse functions, we can hypothesise that if we train a supervised classifier on an expressive corpus, the learned model, when applied to a corpus of persuasive documents, will classify those texts to an adequate standard.
As the corpus we developed is in the clinical domain, it is worth noting the important role that sentiment analysis can play for health practitioners, which has unfortunately not received a great deal of attention. In assessing the effectiveness of treatments given by the health service for a curable condition, the clinical outcomes themselves indicate the effectiveness of the process. For palliative treatments, however, which merely alleviate the symptoms of an illness or relieve pain, it is vital to discover the extent to which they are effective. Feedback has progressed from the filling in of paper forms to feedback given through web pages and mobile phones. Such text is stored in a highly accessible way and can now be efficiently processed by sentiment classification algorithms to determine the opinions that patients are expressing. This in turn should enable health services to make informed decisions about the palliative care they provide.

2 Patient Feedback Corpus

NHS Choices (http://nhs.uk) is a website run by the National Health Service (NHS), which acts as an extensive knowledge base for health-related queries. The website not only provides comprehensive articles about various ailments, but also gives users the option to rate and comment on the services provided to them at hospitals and GP surgeries. This user feedback provides an excellent basis for the sentiment classification experiments in this work.

The reviews are typically submitted by a patient, or a close relative, who has experienced the healthcare system within a hospital. When submitting feedback, the user is asked to split it into several fields, rather than submitting a single document containing all of their comments. During corpus compilation, each comment was extracted verbatim, so spelling mistakes remain in the developed corpus. All punctuation also remains, to enable future experiments at the sentence or phrase level within each comment.

In developing the corpus, we leverage the fact that the data is separated into subfields rather than merged into one long review. We extracted comments from three categories of the NHS Patient Feedback dataset: Likes, Dislikes and Advice. The Likes were assumed to express positive sentiment and to highlight elements of the health service that patients appreciated; conversely, documents under the Dislikes header were assumed to convey negative sentiment. These two subsets make up the Expressive portion of the compiled corpus. The Advice documents have no initial sentiment associated with them, so each comment was labelled at the document level by two independent annotators as either positive or negative; these Advice comments form the Persuasive subcorpus. In compiling the persuasive document sets, we automatically discarded comments that contained the term "N/A" or any of its derivative forms.

Table 1: Persuasive and expressive corpus statistics (D = documents, W = words, avg. length = average document length in words, V = vocabulary size).

Corpus                 D      W        Avg. length   V
Expressive, positive   1152   75052    65.15         6107
Expressive, negative   1108   76062    68.65         6791
Persuasive, positive   768    46642    60.73         4679
Persuasive, negative   864    113632   131.52        7943
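The sketch below shows how such a compilation step might look; the record format and field names ("likes", "dislikes", "advice") are assumptions for illustration, not the actual NHS Choices data layout.

```python
# Build the expressive (Likes/Dislikes) and persuasive (Advice) subcorpora,
# discarding "N/A"-style placeholder comments from the persuasive set.
import re

NA_PATTERN = re.compile(r"^\s*n/?a\b", re.IGNORECASE)

def build_subcorpora(records):
    expressive, persuasive = [], []
    for record in records:
        if record.get("likes"):
            expressive.append(("positive", record["likes"]))
        if record.get("dislikes"):
            expressive.append(("negative", record["dislikes"]))
        advice = record.get("advice", "")
        if advice and not NA_PATTERN.match(advice):
            persuasive.append(advice)   # polarity assigned later by annotators
    return expressive, persuasive

records = [
    {"likes": "The staff were wonderful", "dislikes": "Long waiting times", "advice": "N/A"},
    {"likes": "", "dislikes": "Noisy ward at night", "advice": "Hire more nurses"},
]
print(build_subcorpora(records))
```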
3 Method

The aim of this work was to examine the effect of testing a supervised classifier on a corpus whose discourse function differs from that of the training set. We experimented with three standard supervised machine learning algorithms: standard Naive Bayes (NB), multinomial Naive Bayes (MN NB) and Support Vector Machine (SVM) classification. Each has proven effective in previous sentiment analysis studies (Pang et al., 2002), so, as this experiment is rooted in sentiment classification, these methods were also expected to perform well in the cross-discourse setting.

Two variants of the Naive Bayes algorithm are used. The difference between standard NB and MN NB lies in how the classification features, the words, are modelled: standard NB takes a binary presence approach to modelling the words of the training documents, whereas MN NB takes term frequency into account. Each has proven to be a high-performing classifier across various sentiment analysis domains, but no clear guidance exists as to which is preferable, so both were implemented here.

In the literature, SVMs have outperformed other algorithms in classification experiments (Joachims, 1998; Pang et al., 2002). For these cross-discourse experiments we use the Sequential Minimal Optimization training algorithm (Platt, 1998) to find the maximal-margin hyperplane and maximise the potential of the created classifier. SVMs have traditionally performed well in text classification, but their behaviour across discourse domains has not been examined.

Each document in the corpus was modelled as a bag of words. The features used within this representation were unigrams, bigrams, and bigrams augmented with part-of-speech information. Based on preliminary experimentation that included rare features, any feature that did not occur more than 5 times in the training set was removed. A stopword list and a stemmer were also used. Each supervised classification technique was then trained on a random sample of 1,100 documents from each of the positive and negative subsections of the expressive corpus, and tested on a set of 1,500 randomly selected persuasive documents, 750 from each of the positive and negative subcorpora.

The results of tenfold cross-validation on the expressive corpus (Table 2) suggested that unigram features may outperform both bigram and part-of-speech-augmented bigram features for all learning methods. In particular, the accuracy results produced by the NB algorithm surpassed those of the other classifiers in the tenfold cross-validation, suggesting that within a single discourse domain, presence-based features are preferable to frequency-based features when generating a machine learning model.

Table 2: Average tenfold cross-validation accuracies on the expressive corpus only (the best result per classifier is shown in boldface in the original).

Features        NB      Multinomial NB   SVM
Unigrams        79.65   78.14            76.11
Bigrams         57.79   60.84            63.36
Bigrams + POS   74.25   75.71            72.83
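A minimal sketch of the presence/frequency distinction and the three classifiers is given below, using scikit-learn rather than the authors' actual toolkit; the documents are toy examples, and the rare-feature cut-off (min_df=5) is only meaningful on a realistically sized corpus.

```python
# BernoulliNB models binary word presence (the "standard NB" above),
# MultinomialNB models term frequency, and LinearSVC stands in for an
# SMO-trained SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

train_docs = ["the staff were kind and helpful", "clean ward and caring doctors",
              "rude receptionist and long waits", "the food was awful and cold"]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["the nurses should be more caring"]      # persuasive-style comment

vectorizer = CountVectorizer(ngram_range=(1, 1))      # unigrams; add min_df=5 on real data
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

for model in (BernoulliNB(), MultinomialNB(), LinearSVC()):
    model.fit(X_train, train_labels)
    print(type(model).__name__, model.predict(X_test))
```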
4 Results

Table 3 shows the classification accuracies achieved in all experiments.

Table 3: Results of experimentation, with the expressive corpus as the training set and the persuasive corpus as the test set (the best performance per metric is shown in boldface in the original).

                 Accuracy   Positive P / R / F1       Negative P / R / F1
NB Uni           76.07      78.29 / 72.13 / 75.09     74.17 / 80.00 / 76.97
NB Bi            58.93      55.19 / 94.93 / 69.80     81.90 / 22.93 / 35.83
NB Bi + POS      65.00      71.84 / 49.33 / 58.50     61.42 / 80.67 / 69.74
MN NB Uni        83.53      82.04 / 85.87 / 83.91     85.17 / 81.20 / 83.14
MN NB Bi         57.00      63.78 / 32.40 / 42.97     54.69 / 81.60 / 65.49
MN NB Bi + POS   69.97      69.59 / 69.87 / 69.73     69.75 / 69.47 / 69.61
SVM Uni          69.00      68.43 / 70.53 / 69.47     69.60 / 67.47 / 68.52
SVM Bi           55.40      60.98 / 30.00 / 40.21     53.58 / 80.80 / 64.43
SVM Bi + POS     63.27      63.11 / 63.87 / 63.49     63.43 / 62.67 / 63.04

For every classifier and feature set, if we take the most basic baseline for the two-class (positive/negative) problem to be the random baseline of 50% accuracy, this is clearly exceeded. However, if we instead take the tenfold cross-validation results as a baseline for each classifier, only the MN NB classifier with unigram and bigram features surpasses it.

The results from the NB and MN NB classifiers imply that frequency-based features are preferable to presence-based features when performing cross-discourse sentiment classification. MN NB is one of the few classifiers tested that exceeds the result of its cross-validated model. These results support the topic-classification experiments with Bayesian classifiers of McCallum and Nigam (1998), but differ from the sentiment classification results of Pang et al. (2002), which suggest that presence-based models perform better than the frequency-based alternative. They also differ from the results returned during cross-validation of the classifiers, where presence-based features produced the greatest classification accuracy.

In our tests, the feature set yielding the highest classification accuracy across all classifiers is the unigram bag-of-words model. Tan et al. (2002) suggest that using bigrams enhances text classification, but as sentiment classification goes beyond that task, the assumption does not hold here, as our results show. The difference in discourse function could also contribute to bigrams yielding the lowest accuracy results.
How- sentiments, and clearly communicate their thoughts. ever for good recall, using bigram based features It is of interest to note that the cross-discourse ac- produces excellent results, at the sacrifice of ade- curacy should surpass the cross-validation accuracy quate precision, which suggests that bigram mod- on the training set. This was not to be expected, due els overfit when they are used as features in such a to the differences in discourse function, and there- learned model. fore features used. However, where just the presence The SVM classifier with a variety of features does of a particular word may have made the difference not perform as well as the multinomial Naı̈ve Bayes in a single domain, across domains, taking into ac- classifier. Joachims (1998) suggests that for text count the frequency of a word in the learned model categorization, the SVM algorithm regularly outper- is effective in correctly classifying a comment by forms other classifiers, but unfortunately the out- its sentiment. Unigram features outperform both the come of our experiments do not correlate with these bigram and bigrams augmented with part-of-speech results. This suggests that SVMs struggle with text features in our experiments. By using single tokens classification when the discourse function between as features, each word is taken out of the context the training and test domains differ. that its neighbours provide. In doing so the language contributing to the relative sentiment is generalised 5 Discussion enough to form a robust model which can then be applied across discourse domains. The results produced through training supervised machine learning methods on an expressive corpus, 6 Related Work and testing on a corpus which contains documents with a persuasive discourse function indicate that A number of studies (Cambria at al. , 2011; Xia et cross-discourse sentiment classification is feasible. al. , 2009) have used patient feedback as the domain The best performance occurred when the classi- for their sentiment classification experiments. How- fier took frequency based features into account, as ever our work differs to these studies as we consider 82 the effect that cross-discourse evaluation has on the tured Health-Care Data through Semantics and classification outcome. Other work that has consid- Sentics. In Proceedings of ACM WebSci, Koblenz. ered different discourse functions in sentiment anal- ysis, have experimented on detecting arguments (So- Andrea Esuli and Fabrizio Sebastiani. 2006. Senti- WordNet: A Publicly Available Lexical Resource masundaran et al. , 2007) and the stance of political for Opinion Mining In Proceedings of Language debates (Thomas et al. , 2006). Resources and Evaluation (LREC), pp 417–422. Machine learning approaches to text classification have typically performed well when using a Sup- Thorsten Joachims. 1998. Text categorization with port Vector Machine (Joachims, 1998) classifier or support vector machines: learning with many relevant a Naı̈ve Bayes (McCallum and Nigam, 1998) based features. In Proceedings of ECML-98, 10th European classifier. Pang et al. (2002) applied these classi- Conference on Machine Learning, pp. 137–142. fiers to the movie review domain, which produced good results. However the difference in domain, Andrew McCallum and Kamal Nigam. 1998. A Comparison of Event Models for Naive Bayes Text and singularity of discourse function differentiates Classification. In Proceedings of the AAAI/ICML-98 the scope of this work from theirs. 
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–87.

John Platt. 1998. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In Advances in Kernel Methods - Support Vector Learning.

Swapna Somasundaran, Josef Ruppenhofer and Janyce Wiebe. 2007. Detecting Arguing and Sentiment in Meetings. In Proceedings of the SIGdial Workshop on Discourse and Dialogue, pp. 26–34.

Chade-Meng Tan, Yuan-Fang Wang and Chan-Do Lee. 2002. The use of bigrams to enhance text categorization. Information Processing & Management, 38(4), pp. 529–546.

Matt Thomas, Bo Pang and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 327–335.

Lei Xia, Anna Lisa Gentile, James Munro and José Iria. 2009. Improving Patient Opinion Mining through Multi-step Classification. In Proceedings of the 12th International Conference on Text, Speech and Dialogue (TSD'09), pp. 70–76.

POLITICAL-ADS: An annotated corpus of event-level evaluativity

Kevin Reschke, Department of Computer Science, Stanford University, Palo Alto, CA 94305, USA
Pranav Anand, Department of Linguistics, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
[email protected]  [email protected]

Abstract

This paper presents a corpus targeting evaluative meaning as it pertains to descriptions of events. The corpus, POLITICAL-ADS, is drawn from 141 television ads from the 2008 U.S. presidential race and contains 3945 NPs and 1549 VPs annotated for scalar sentiment from three different perspectives: the narrator, the annotator, and general society. We show that annotators can distinguish these perspectives reliably and that the correlation between the annotator's own perspective and that of a generic individual is higher than either's correlation with the narrator. Finally, as a sample application, we demonstrate that a simple compositional model built from lexical resources outperforms a lexical baseline.

1 Introduction

In the past decade, the semantics of evaluative language has received renewed attention in both formal and computational linguistics (Martin and White, 2005; Potts, 2005; Pang and Lee, 2008; Jackendoff, 2007). This work has focused on evaluativity at either the lexical level or the phrasal/event level, without bridging between the two. A parallel tradition of compositional event polarity (Nasukawa and Yi, 2003; Moilanen and Pulman, 2007; Choi and Cardie, 2008; Neviarouskaya et al., 2010) has grown up analogous to approaches to compositionality in formal semantics: event predicates are not of constant polarity, but provide functions from the polarities of their arguments to event polarities. Little work exists assessing the relative advantages of a compositional account, in part because no resource annotating both NP-level and event-level polarity in context exists. This paper introduces such a corpus, POLITICAL-ADS, a collection of 2008 U.S. presidential race television ads with scalar sentiment annotations at the NP and VP level. After describing the corpus creation and characteristics in sections 3 and 4, we show in section 5 that a compositional system achieves an accuracy of 84.2%, above a lexical baseline of 65.1%.

2 Background

While many sentiment models handle negation quasi-compositionally (Pang and Lee, 2008; Polanyi and Zaenen, 2005), Nasukawa and Yi (2003) first noted that predicates like prevent are "flippers", conveying that their subject and object have opposite polarity: since trouble is negative, something that prevents trouble is good. Recent work has expanded that idea into fully compositional systems (Moilanen and Pulman, 2007; Neviarouskaya et al., 2010). Moilanen and Pulman construct a system of compositional rules that builds polarity in terms of a hand-built lexicon of predicates as flippers or preservers. However, this system conflates two different assessment perspectives: that of the Narrator and that of some mentioned NP (the NP-to-NP perspective). The latter includes psychological predicates such as love and hate, and predicates of admiration or censure (e.g., admonish, praise). Thus, they would mark John dislikes scary movies as negative, a correct NP-to-NP claim, but not necessarily correct for the Narrator. Recognizing this, Neviarouskaya et al. (2010) develop a pair of compositional rules over both perspectives.
Importantly, neither of these approaches has been validated against a sufficiently nuanced dataset. Moilanen and Pulman test against the SemEval-07 Headlines Corpus, which asks annotators to give an overall impression of sentiment. This approach allows a headline such as Outcry in N Korea over 'nuclear test' to be marked negative, even though outcry over military provocations is arguably good. Similarly, Neviarouskaya et al. evaluate only against NP-to-NP data. While the MPQA corpus (Wiebe et al., 2005), which annotates the source of each sentiment annotation, separates these two sentiment sources, work trained on it has not (Choi and Cardie, 2008; Moilanen et al., 2010). In addition, existing annotation schemes are not designed to tease apart perspectival differences. For example, MPQA includes a notion of Narrator-oriented evaluativity, but it does not include the perspectives of "you" and the general public.

3 The corpus

POLITICAL-ADS is drawn from politics, a rich and recently evolving domain for evaluativity research that we hypothesized would involve a high volume of sentiment claims subject to perspectival differences. POLITICAL-ADS is a collection of 141 television ads that ran during the 2008 U.S. presidential race between Democratic candidate Barack Obama and Republican candidate John McCain: 81 ads from the Democratic side and 60 from the Republican side. Figure 1 provides a sample transcript.

Figure 1: Transcript of POLITICAL-ADS ad #57.
Announcer: "In tough times, who will help Michigan's auto industry? Barack Obama favors loan guarantees to help Detroit retool and revitalize. But John McCain refused to support loan guarantees for the auto industry. Now he's just paying lip service. Not talking straight. And McCain voted repeatedly for tax breaks for companies that ship jobs overseas, selling out American workers. We just can't afford more of the same."

Each transcript was parsed using the Stanford Parser, and all NPs and VPs excluding those headed by auxiliaries were extracted. VP annotations were assumed to represent phrasal/event-level polarity and NP annotations argument-level polarity. The annotation interface is shown in Figure 2.

Figure 2: POLITICAL-ADS annotation interface.

Annotators were shown a transcript and a movie clip, and navigated through the NPs and VPs within the document. At each point they were asked to rate their response on a [-1, 1] scale for four questions about the highlighted expression: 1) how the narrator wants them to feel; 2) how they feel; 3) how people in general feel; 4) how controversial the issue is (included to test whether a sense of controversy yields sharper differences between the various assessment perspectives). Finally, because phrases were not prefiltered, a 'Doesn't Make Sense' button was provided for each question.

206 annotators on Mechanical Turk completed 985 transcripts at $0.40 per transcript; each transcript was annotated by an average of 4.8 different annotators living in the U.S. We then filtered annotators using 200 phrases, from 20 randomly selected transcripts, that we deemed relatively uncontroversial. To do this, we scored each annotator by the absolute difference between their mean response and the median (each annotator's scores were first normalized by mean absolute value) on the Narrator question. When we thresholded annotators at a score above 0.5, agreement with our gold standard was 83.5% and dropped substantially afterwards. This threshold excluded 74 annotators, leaving 132 high-quality (HQ) annotators; the full data is available in the corpus.
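The sketch below shows one way the annotator filtering step could be implemented; the exact normalisation and aggregation details are assumptions, since the text above only outlines the procedure.

```python
# Score each annotator by how far their normalised Narrator-question responses
# on the control phrases sit from the per-phrase median; annotators scoring
# above the threshold are excluded, the rest are kept as HQ annotators.
import statistics

def hq_annotators(responses, threshold=0.5):
    """responses: {annotator: {phrase_id: score in [-1, 1]}} on the control phrases."""
    by_phrase = {}
    for scores in responses.values():
        for phrase, value in scores.items():
            by_phrase.setdefault(phrase, []).append(value)
    medians = {p: statistics.median(v) for p, v in by_phrase.items()}

    keep = []
    for annotator, scores in responses.items():
        mav = statistics.mean(abs(v) for v in scores.values()) or 1.0
        deviation = statistics.mean(abs(v / mav - medians[p]) for p, v in scores.items())
        if deviation <= threshold:
            keep.append(annotator)
    return keep

toy = {"a1": {"p1": 0.9, "p2": -0.8}, "a2": {"p1": 0.8, "p2": -0.9},
       "a3": {"p1": -0.7, "p2": 0.9}}
print(hq_annotators(toy))  # a3 disagrees with the others and is filtered out
```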
The corpus consists of 5494 phrases (1549 VPs and 3945 NPs), annotated 6.3 times on average, for a total of 34,692 annotations (9800 VP and 24,892 NP). Each phrase was annotated by at least 3 HQ annotators (3.9 on average), and HQ annotators contributed 5960 VP and 15,238 NP annotations. Of these, 12.1% of HQ NP and 5.4% of HQ VP responses were marked as 'Doesn't Make Sense' (DMS) for the narrator question. In general, the controversy and narrator questions had the highest and lowest rates of DMS, respectively; NPs showed higher DMS rates than VPs; and HQ annotators pressed the button more often. (Footnote 1: in a QUESTION + PHRASE TYPE + ANNOTATOR TYPE linear model with annotator as a random effect, all of the above effects are significant; this was the simplest model according to χ² model comparison.) In sections 4 and 5, we ignore the DMS responses.

4 Corpus Findings

Table 1 provides summary statistics for the corpus. Across the board, the three perspective questions averaged close to 0, and HQ annotators are in general closer to 0 (non-HQ annotators tended to give positive responses). VPs had slightly higher variance than NPs, at marginal probability (p < .04), suggesting that VP responses were more extreme than NP ones. You and Generic assessments are highly correlated (Pearson's ρ = 0.85), but Narrator is less so (ρ = .76/.74). All three are weakly correlated with Controversy (ρ = .25/.26/.29 for Narrator, You and Generic, respectively). Narrator has the highest standard deviations in the raw data but the lowest in the normed data: in the raw data, many annotators recognized the narrator's intensely partisan views and rated accordingly (|x| > 0.8), but were more tempered when providing their own perspective (|x| ~ 0.35), leading to lower σ. This intensity difference is factored out by normalization, yielding the opposite pattern.

Table 1: Mean response (standard deviation) by category and worker type.

              All, raw     HQ only, raw   HQ only, normed
Narrator      .10 (.45)    .05 (.62)      .08 (.87)
You           .10 (.34)    .06 (.46)      .09 (.85)
Generic       .10 (.33)    .05 (.45)      .08 (.86)
Controversy   .17 (.22)    .13 (.30)      .17 (.60)

The response data was collected in scalar form, but for applications (e.g., evaluative polarity classification) it is the polarity of the response that matters. Ignoring magnitude, Table 3 shows the polarity breakdown for all HQ phrasal annotations. Positive responses are the dominant class across the board; neutral responses are less frequent for Narrator than for the other question types; and NPs have fewer negatives and more neutrals than VPs.

Table 2 shows average standard deviations (i.e., agreement) by worker type, question, and XP type. Note both that NPs show less variance than VPs and that non-HQ annotators show less variance than HQ annotators (non-HQ annotators gave more 0 responses).

Table 2: Average standard deviations for HQ and all annotators.

HQ annotators     Raw: All / VP / NP     Normed: All / VP / NP
Narrator          .69 / .75 / .67        .96 / 1.06 / .93
You               .57 / .63 / .55        .99 / 1.12 / .94
Generic           .53 / .58 / .51        .99 / 1.13 / .94
Controversy       .53 / .58 / .51        1.01 / 1.15 / .96

All annotators    All / VP / NP
Narrator          .63 / .68 / .62
You               .54 / .59 / .53
Generic           .52 / .56 / .51
Controversy       .54 / .56
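The summary statistics and correlations above can be computed directly from the raw annotation matrix; the sketch below shows the idea with invented numbers and an assumed column ordering.

```python
# Per-question means and standard deviations (as in Tables 1-2) and Pearson
# correlations between the Narrator, You and Generic perspectives.
import numpy as np

# rows: annotations; columns: Narrator, You, Generic scores on [-1, 1]
annotations = np.array([[0.8, 0.4, 0.3],
                        [-0.9, -0.3, -0.4],
                        [0.7, 0.2, 0.2],
                        [-0.6, -0.4, -0.3]])

means = annotations.mean(axis=0)
stds = annotations.std(axis=0, ddof=1)
rho_you_generic = np.corrcoef(annotations[:, 1], annotations[:, 2])[0, 1]
rho_narr_you = np.corrcoef(annotations[:, 0], annotations[:, 1])[0, 1]
print(means, stds, rho_you_generic, rho_narr_you)
```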
5 Comparing lexical and compositional treatments

While compositional models of event-level evaluativity are logically defensible, the extent to which they apply in the wild is an open question. Because other compositional lexicons are not freely available, we used the system described in Reschke and Anand (2011), which induces flippers and preservers from the MPQA subjectivity lexicon and FrameNet (Ruppenhofer et al., 2005). The MPQA lexicon is a collection of over 8,000 words marked for polarity. Our functor lexicon uses the following heuristic: verbs marked positive in MPQA are preservers; verbs marked negative are flippers. For example, dislike has negative MPQA polarity and is therefore marked as a flipper in our lexicon. This gives us 1249 predicates: 869 flippers and 380 preservers. A further 329 verbs were added from FrameNet according to their membership in five entailment classes (Reschke and Anand, 2011): verbs of injury/destruction, lacking, benefit, creation, and having. 124 frames across these classes were identified, and verbs of benefit, creation, and having (aid, generate, have) were marked as preservers, with the complement set (forget, arrest, lack) marked as flippers.

Table 3: Polarity breakdowns for HQ annotations.

Question      Polarity   VP            NP
Narrator      +          2874 (51%)    6877 (51%)
              -          2654 (47%)    5590 (42%)
              0          111 (2%)      932 (7%)
You           +          2714 (49%)    6573 (50%)
              -          2466 (45%)    4967 (38%)
              0          337 (6%)      1575 (12%)
Generic       +          2615 (48%)    6350 (49%)
              -          2541 (48%)    5125 (39%)
              0          332 (6%)      1558 (12%)
Controversy   +          3095 (57%)    6522 (51%)
              -          1755 (32%)    4159 (33%)
              0          558 (10%)     2051 (16%)
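The core of the flipper/preserver computation is small; the sketch below, with an illustrative mini-lexicon, mirrors the heuristic just described (negative MPQA verbs flip the object NP's polarity, positive ones preserve it).

```python
# Event polarity as a function of the verb's functor type and the object NP's polarity.
VERB_POLARITY = {"prevent": "neg", "break": "neg", "hate": "neg",
                 "want": "pos", "trust": "pos", "help": "pos"}

def functor_type(verb):
    return "flipper" if VERB_POLARITY.get(verb) == "neg" else "preserver"

def event_polarity(verb, object_polarity):
    """object_polarity: +1 for a positive NP, -1 for a negative NP."""
    return -object_polarity if functor_type(verb) == "flipper" else object_polarity

print(event_polarity("prevent", -1))  # prevent trouble -> +1
print(event_polarity("break", -1))    # break the grip of foreign oil -> +1
print(event_polarity("want", -1))     # want a massive government -> -1
```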
As a lexical baseline, the MPQA polarity of each verb itself was used: flippers correspond to baseline negative events and preservers to positive ones.

A 635-VP test subset of POLITICAL-ADS was constructed by omitting intransitive VPs and VPs with non-NP complements. Gold-standard labels were determined from the averaged normed HQ annotator data, yielding 329 positive, 284 negative, and 2 neutral events. NP labels, determined similarly, divided into 393 positive, 230 negative, and 12 neutral. Of the 635 VPs in the test set, only 272 (43.5%) are in our FrameNet/MPQA lexicon, so we compare the two systems on this subset. On it, the compositional system has an accuracy of 84.2%, while the lexical baseline has an accuracy of 65.1%; there were 72 instances where the compositional model outperformed the lexical baseline and 22 where the lexical baseline outperformed the compositional model. Typical examples where the compositional system won involve MPQA negatives like break, cut, and hate and positives like want and trust: the lexical model marks VPs like breaks the grip of foreign oil and want a massive government as negative and positive, respectively, but because the NPs in question are negative, the answers should be reversed. In contrast, the lexical model gets cases like grow the economy and reform Wall Street correct. These exemplify a robust pattern in the errors: cases where the event is marked positive while the NP is marked negative. In examples like grow Washington, the idea that grow is a preserver is reasonable; in grow the economy, however, the negativity of the economy is arguably measuring the state of some constant entity. And while reform is marked positive in MPQA, it is arguably a reverser; this shows the limits of our lexicon induction.

At an intuitive level, we expect agent evaluativity to mirror event-level evaluativity, because positive/negative entities tend to commit positive/negative acts, and this is borne out: for both flippers and preservers, average VP evaluativity is correlated with average subject evaluativity (0.57 for flippers, 0.52 for preservers). Although our model ignored subject evaluativity, we performed a generalized linear regression with subject and object evaluativity as predictors and event-level evaluativity as outcome. For flippers the regression coefficients were 0.52 for subject (p < 4e-4) and -0.52 for object (p < 1e-5); for preservers they were 0.27 for subject (p < 1e-5) and 0.93 for object (p < 2e-7). Thus, subject polarity is an important factor for flipper events (e.g., the hero/villain defeated the enemy), but less so for preservers (e.g., the hero/villain helped the enemy).

6 Conclusion

In this paper we have presented POLITICAL-ADS, a new resource for systematically investigating the relationships between NP sentiment and VP sentiment. We have demonstrated that annotators can reliably annotate political data with sentiment at the phrasal level from multiple perspectives. We have shown that in the present dataset self-reported and generic judgments are highly correlated, while correlation with the narrator is appreciably weaker, as narrators are seen as more extreme, and that the controversy of a phrase does not correlate with annotators' disagreement with the narrator. Finally, as a sample application, we demonstrated that a simple compositional model built from lexical resources outperforms a purely lexical baseline.

References

Y. Choi and C. Cardie. 2008. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of EMNLP 2008.

Ray Jackendoff. 2007. Language, consciousness, culture. MIT Press.

J. R. Martin and P. R. R. White. 2005. Language of Evaluation: Appraisal in English. Palgrave Macmillan.

Karo Moilanen and Stephen Pulman. 2007. Sentiment composition. In Proceedings of RANLP 2007.

K. Moilanen, S. Pulman, and Y. Zhang. 2010. Packed feelings and ordered sentiments: Sentiment parsing with quasi-compositional polarity sequencing and compression. In Proceedings of WASSA 2010, ECAI 2010.

T. Nasukawa and J. Yi. 2003. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture.

A. Neviarouskaya, H. Prendinger, and M. Ishizuka. 2010. Recognition of affect, judgment, and appreciation in text. In Proceedings of COLING 2010.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

L. Polanyi and A. Zaenen. 2005. Contextual valence shifters. In James G. Shanahan, Yan Qu, and Janyce Wiebe, editors, Computing Attitude and Affect in Text: Theory and Application. Springer Verlag, Dordrecht, The Netherlands.

Chris Potts. 2005. The Logic of Conventional Implicature. Oxford University Press.

K. Reschke and P. Anand. 2011. Extracting contextual evaluativity. In Proceedings of ICWS 2011.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, and Christopher R. Johnson. 2005. FrameNet II: Extended theory and practice. Technical report, ICSI.

J. Wiebe, T. Wilson, and C. Cardie. 2005. Annotating expressions of opinions and emotions in language. In Proceedings of LREC 2005.

Automatically Annotating A Five-Billion-Word Corpus of Japanese Blogs for Affect and Sentiment Analysis

Michal Ptaszynski†, Rafal Rzepka‡, Kenji Araki‡, Yoshio Momouchi§
† JSPS Research Fellow / High-Tech Research Center, Hokkai-Gakuen University
[email protected]
‡ Graduate School of Information Science and Technology, Hokkaido University
{kabura,araki}@media.eng.hokudai.ac.jp
§ Department of Electronics and Information Engineering, Faculty of Engineering, Hokkai-Gakuen University
[email protected]

Abstract

This paper presents our research on the automatic annotation of a five-billion-word corpus of Japanese blogs with information on affect and sentiment. We first survey emotion blog corpora and find that no large-scale emotion corpus has been available for the Japanese language. We choose the largest blog corpus for the language and annotate it with two systems for affect analysis: ML-Ask for word- and sentence-level affect analysis and CAO for detailed analysis of emoticons. The annotated information includes affective features such as sentence subjectivity (emotive/non-emotive) and emotion classes (joy, sadness, etc.), useful in affect analysis. The annotations are also generalized on a two-dimensional model of affect to obtain information on sentence valence/polarity (positive/negative), useful in sentiment analysis. The annotations are evaluated in several ways: first, on a test set of a thousand sentences extracted randomly and evaluated by over forty respondents; second, by comparing the annotation statistics to other existing emotion blog corpora; and finally by applying the corpus in several tasks, such as generation of an emotion object ontology and retrieval of emotional and moral consequences of actions.

1 Introduction

There is a lack of large corpora for Japanese applicable to sentiment and affect analysis. Although there are large corpora of newspaper articles, like the Mainichi Shinbun Corpus (http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html), and corpora of classic literature, like Aozora Bunko (http://www.aozora.gr.jp/), these are usually unsuitable for research on emotions, since spontaneous emotive expressions either appear rarely in such texts (newspapers) or the vocabulary is not up to date (classic literature). Speech corpora such as the Corpus of Spontaneous Japanese (http://www.ninjal.ac.jp/products-k/katsudo/seika/corpus/public/) could become suitable for this kind of research, but because of the difficulties involved in compiling them they are relatively small. Research such as that of Abbasi and Chen (2007) has shown that public Internet services, such as forums or blogs, are good material for affect analysis because of their richness in evaluative and emotive information. Blogs in particular are open diaries in which people encapsulate their own experiences, opinions and feelings to be read and commented on by other people, and they have recently come into the focus of opinion mining and sentiment and affect analysis (Aman and Szpakowicz, 2007; Quan and Ren, 2010). Creating a large blog-based emotion corpus could therefore help overcome both problems: the lack of corpora of sufficient size and their applicability to sentiment and affect analysis. Only a few small Japanese emotion corpora have been developed so far (Hashimoto et al., 2011), and although large Web-based corpora exist (Erjavec et al., 2008; Baroni and Ueyama, 2006), access to them is usually allowed only through a Web interface, which makes additional annotation with affective information difficult. In this paper we present the first attempt to automatically annotate affect on YACIS, a large-scale corpus of Japanese blogs. To do that we use two systems for affect analysis of Japanese, one for word- and sentence-level affect analysis and one especially for detailed analysis of emoticons, to annotate different kinds of affective information on the corpus (emotive expressions, emotion classes, etc.).
The outline of the paper is as follows. Section 2 describes related research on emotion corpora. Section 3 presents our choice of corpus for the annotation of affect- and sentiment-related information. Section 4 describes the tools used in annotation. Section 5 presents detailed data and evaluation of the annotations. Section 6 presents tasks in which the corpus has already been applied. Finally, the paper is concluded and future applications are discussed.

2 Emotion Corpora

Research on affect analysis has produced a number of systems over the years (Aman and Szpakowicz, 2007; Ptaszynski et al., 2009c; Matsumoto et al., 2011). Unfortunately, most such research ends with proposing and evaluating a system; the desirable real-world application, annotating affective information on linguistic data, is usually limited to processing a small test sample during evaluation. The few annotated emotion corpora that exist are mostly of limited scale and annotated manually. Below we describe and compare some of the most notable emotion corpora; interestingly, six of the eight corpora described are created from blogs. The comparison is summarized in Table 1, where we also include the work described in this paper (YACIS) for better comparison.

Aman and Szpakowicz (2007) constructed a small-scale English blog corpus. They did not include any grammatical information, but focused on affect-related annotations. Notably, they were among the first to recognize the task of distinguishing between emotive and non-emotive sentences, a problem that is usually one of the most difficult in text-based affect analysis and is therefore often omitted; in our research we apply a system shown to handle this task with high accuracy for Japanese.

Wiebe et al. (2005) report on the MPQA corpus of news articles, which contains 10,657 sentences in 535 documents (the newer MPQA Opinion Corpus version 2.0 contains an additional 157 documents, 692 in total). Its annotation schema includes a variety of emotion-related information, such as emotive expressions, emotion valence and intensity. However, Wiebe et al. focused on detecting subjective (emotive) sentences, which do not necessarily convey emotions, and on classifying them into positive and negative; their annotation schema, although one of the richest, therefore does not include emotion classes.

A corpus of Japanese blogs called KNB, rich in the amount and diversity of annotated information, was developed by Hashimoto et al. (2011). It contains 67 thousand words in 249 blog articles. Although not large, it set a certain standard for preparing blog corpora for sentiment and affect-related studies in Japan. The corpus contains all relevant grammatical annotations, including POS tagging, dependency parsing and Named Entity Recognition, as well as sentiment-related information: words and phrases expressing emotional attitude were annotated by laypeople as either positive or negative. One disadvantage of the corpus, apart from its small scale, is the way it was created: eighty-one students were employed to write blogs about different topics especially for this research, and it could be argued that, since the students knew their blogs would be read mostly by their teachers, they selected their words more carefully than they would in private.
As an interesting remark, blog articles from various Chinese blog services, they were some of the first to recognize the task such as sina blog (http://blog.sina.com.cn/), qq blog of distinguishing between emotive and non-emotive (http://blog.qq.com/), etc., and annotated them with sentences. This problem is usually one of the most a large variety of information, such as emotion class, difficult in text-based Affect Analysis and is there- emotive expressions or polarity level. Although syn- fore often omitted in such research. In our research tactic annotations were simplified to tokenization we applied a system proved to deal with this task and POS tagging, this corpus can be considered a with high accuracy for Japanese. state-of-the-art emotion blog corpus. The motiva- Das and Bandyopadhyay (2010) constructed an tion for Quan and Ren is also similar to ours - deal- emotion annotated corpus of blogs in Bengali. The ing with the lack of large corpora for sentiment anal- corpus contains 12,149 sentences within 123 blog ysis in Chinese (in our case - Japanese). posts extracted from Bengali web blog archive Wiebe et al. (2005) report on creating the MPQA (http://www.amarblog.com/). It is annotated with corpus of news articles. The corpus contains 10,657 face recognition annotation standard (Ekman, 1992). sentences in 535 documents4 . The annotation Matsumoto et al. (2011) created Wakamono Ko- schema includes a variety of emotion-related infor- toba (Slang of the Youth) corpus. It contains un- 4 The new MPQA Opinion Corpus version 2.0 contains ad- related sentences extracted manually from Yahoo! ditional 157 documents, 692 documents in total. blogs (http://blog-search.yahoo.co.jp/). Each sen- 90 Table 1: Comparison of emotion corpora ordered by the amount of annotations (abbreviations: T=tokenization, POS=part-of-speech tagging, L=lemmatization, DP=dependency parsing, NER=Named Entity Recognition). corpus scale language annotated affective information syntactic (in senten- emotion class emotive emotive/ valence/ emotion emotion annota- name ces / docs) standard expressions non-emot. activation intensity objects tions 354 mil. 10 (language and YACIS Japanese ⃝ ⃝ ⃝/⃝ ⃝ ⃝ T,POS,L,DP,NER; /13 mil. culture based) Ren-CECps1.0 12,724/500 Chinese 8 (Yahoo! news) ⃝ ⃝ ⃝/× ⃝ ⃝ T,POS; MPQA 10,657/535 English none (no standard) ⃝ ⃝ ⃝/× ⃝ ⃝ T,POS; KNB 4,186/249 Japanese none (no standard) ⃝ × ⃝/× × ⃝ T,POS,L,DP,NER; Minato et al. 1,191sent. Japanese 8 (chosen subjectively) ⃝ ⃝ ×/× × × POS; Aman&Szpak. 5,205/173 English 6 (face recognition) ⃝ ⃝ ×/× ⃝ × × Das&Bandyo. 12,149/123 Bengali 6 (face recognition) ⃝ × ×/× ⃝ × × Wakamono 4773sen- Japanese 9 (face recognition + ⃝ × ×/× × × × Kotoba tences 3 added subjectively) Mishne ?/815,494 English 132 (LiveJournal) × × ×/× × × × tence contains at least one word from a slang lexicon All of the above corpora were annotated manu- and one word from an emotion lexicon, with addi- ally or semi-automatically. In this research we per- tional emotion class tags added per sentence. The formed the first attempt to annotate a large scale blog emotion class set used for annotation was chosen corpus (YACIS) with affective information fully au- subjectively, by applying the 6 class face recogni- tomatically. We did this with systems based on pos- tion standard and adding 3 classes of their choice. itively evaluated affect annotation schema, perfor- Mishne (2005) collected a corpus of English blogs mance, and standardized emotion class typology. 
from LiveJournal (http://www.livejournal.com/) blogs. The corpus contains 815,494 blog posts, 3 Choice of Blog Corpus from which many are annotated with emotions (moods) by the blog authors themselves. The Although Japanese is a well recognized and de- LiveJournal service offers an option for its users to scribed world language, there have been only few annotate their mood while writing the blog. The large corpora for this language. For example, Er- list of 132 moods include words like “amused”, or javec et al. (2008) gathered a 400-million-word scale “angry”. The LiveJournal mood annotation standard Web corpus JpWaC, or Baroni and Ueyama (2006) offers a rich vocabulary to describe the writer’s developed a medium-sized corpus of Japanese blogs mood. However, this richness has been considered jBlogs containing 62 million words. However, both troublesome to generalize the data in a meaningful research faced several problems, such as character manner (Quan and Ren, 2010). encoding, or web page metadata extraction, such as Finally, Minato et al. (2006) collected a 14,195 the page title or author which differ between do- word, 1,191 sentence corpus. The corpus was a col- mains. Apart from the above mentioned medium lection of sentence examples from a dictionary of sized corpora at present the largest Web based blog emotional expressions (Hiejima, 1995). The dictio- corpus available for Japanese is YACIS or Yet nary was created for the need of Japanese language Another Corpus of Internet Sentences. We chose learners. Differently to the dictionary applied in our this corpus for the annotation of affective informa- research (Nakamura, 1993), in Hiejima (1995) sen- tion for several reasons. It was collected automati- tence examples were mostly written by the author of cally by Maciejewski et al. (2010) from the pages of the dictionary himself. The dictionary also does not Ameba blog service. It contains 5.6 billion words propose any coherent emotion class list, but rather within 350 million sentences. Maciejewski et al. the emotion concepts are chosen subjectively. Al- were able to extract only pages containing Japanese though the corpus by Minato et al. is the smallest posts (pages with legal disclaimers or written in lan- of all mentioned above, its statistics is described in guages other than Japanese were omitted). In the detail. Therefore in this paper we use it as one of the initial phase they provided their crawler, optimized Japanese emotion corpora to compare our work to. to crawl only Ameba blog service, with 1000 links 91 Figure 2: Output examples for ML-Ask and CAO. Figure 1: The example of YACIS XML structure. Table 3: Distribution of separate expressions across emo- Table 2: General Statistics of YACIS. tion classes in Nakamura’s dictionary (overall 2100 ex.). # of web pages 12,938,606 emotion nunber of emotion nunber of # of unique bloggers 60,658 class expressions class expressions average # of pages/blogger 213.3 dislike 532 fondness 197 # of pages with comments 6,421,577 excitement 269 fear 147 # of comments 50,560,024 sadness 232 surprise 129 average # of comment/page 7.873 joy 224 relief 106 # of words 5,600,597,095 anger 199 shame 65 # of all sentences 354,288,529 sum 2100 # of words per sentence (average) 15 # of characters per sentence (average) 77 was converted into an emotive expression database by Ptaszynski et al. (2009c). Since YACIS is a taken from Google (response to one simple query: Japanese language corpus, for the affect annotation ‘site:ameblo.jp’). 
They saved all pages to disk as we needed the most appropriate lexicon for the lan- raw HTML files (each page in a separate file) and guage. The dictionary, developed for over 20 years afterward extracted all the posts and comments and by Akira Nakamura, is a state-of-the art example divided them into sentences. The original structure of a hand-crafted emotive expression lexicon. It (blog post and comments) was preserved, thanks to also proposes a classification of emotions that re- which semantic relations between posts and com- flects the Japanese culture: ki/yorokobi5 (joy), ments were retained. The blog service from which dō/ikari (anger), ai/aware (sorrow, sadness, the corpus was extracted (Ameba) is encoded by de- gloom), fu/kowagari (fear), chi/haji (shame, fault in Unicode, thus there was no problem with shyness), kō/suki (fondness), en/iya (dislike), character encoding. It also has a clear and stable kō/takaburi (excitement), an/yasuragi (relief), HTML meta-structure, thanks to which they man- and kyō/odoroki (surprise). All expressions in the aged to extract metadata such as blog title and au- dictionary are annotated with one emotion class or thor. The corpus was first presented as an unanno- more if applicable. The distribution of expressions tated corpus. Recently Ptaszynski et al. (2012b) an- across all emotion classes is represented in Table 3. notated it with syntactic information, such as POS, dependency structure or named entity recognition. ML-Ask (Ptaszynski et al., 2009a; Ptaszynski et al., An example of the original blog structure in XML 2009c) is a keyword-based language-dependent sys- is represented in Figure 1. Some statistics about the tem for affect annotation on sentences in Japanese. corpus are represented in Table 2. It uses a two-step procedure: 1) specifying whether an utterance is emotive, and 2) annotating the partic- 4 Affective Information Annotation Tools ular emotion classes in utterances described as emo- Emotive Expression Dictionary (Nakamura, 1993) tive. The emotive sentences are detected on the ba- is a collection of over two thousand expressions de- sis of emotemes, emotive features like: interjections, scribing emotional states collected manually from a mimetic expressions, vulgar language, emoticons 5 wide range of literature. It is not a tool per se, but Separation by “/” represents two possible readings of the character. 92 Table 4: Evaluation results of ML-Ask and CAO. Table 5: Statistics of emotive sentences. emotive/ emotion 2D (valence # of emotive sentences 233,591,502 non-emotive classes and activation) # of non-emotive sentence 120,408,023 ratio (emotive/non-emotive) 1.94 ML-Ask 98.8% 73.4% 88.6% CAO 97.6% 80.2% 94.6% # of sentences containing emoteme class: ML-Ask+CAO 100.0% 89.9% 97.5% - interjections 171,734,464 - exclamative marks 89,626,215 - emoticons 49,095,123 - endearments 12,935,510 - vulgarities 1,686,943 and emotive markers. The examples in Japanese ratio (emoteme classes in emotive sentence) 1.39 are respectively: sugee (great!), wakuwaku (heart pounding), -yagaru (syntactic morpheme used in co-occurrence in the database. The performance of verb vulgarization), (ˆ ˆ) (emoticon expressing joy) CAO was evaluated as close to ideal (Ptaszynski et and ‘!’, ‘??’ (markers indicating emotive engage- al., 2010b) (over 97%). In this research we used ment). Emotion class annotation is based on Naka- CAO as a supporting procedure in ML-Ask to im- mura’s dictionary. 
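The two-step procedure of ML-Ask described above lends itself to a compact sketch. The following is a minimal, hypothetical Python illustration of such a keyword-based pipeline, not the authors' implementation; the tiny lexicons are invented stand-ins for the full emoteme database and Nakamura's dictionary.

```python
# Hypothetical, simplified sketch of a two-step, keyword-based affect
# annotator in the spirit of ML-Ask. The small lexicons below are
# illustrative stand-ins, not the real emoteme/Nakamura databases.

# Step 1 resources: emotemes that mark a sentence as emotive.
EMOTEMES = {
    "interjection":   ["sugee", "waa"],
    "mimetic":        ["wakuwaku", "dokidoki"],
    "emotive_marker": ["!", "??"],
    "emoticon":       ["(^ ^)", "(T_T)"],
}

# Step 2 resources: emotive expressions mapped to emotion classes.
EMOTION_LEXICON = {
    "kanashii": "sadness",
    "ureshii":  "joy",
    "iya":      "dislike",
}

def annotate(sentence: str) -> dict:
    """Return an affect annotation for a single sentence."""
    found_emotemes = [tok for toks in EMOTEMES.values()
                      for tok in toks if tok in sentence]
    # Step 1: a sentence is 'emotive' if it contains at least one emoteme.
    if not found_emotemes:
        return {"emotive": False}
    # Step 2: annotate emotion classes only in emotive sentences.
    classes = sorted({cls for word, cls in EMOTION_LEXICON.items()
                      if word in sentence})
    return {"emotive": True, "emotemes": found_emotemes, "classes": classes}

print(annotate("kanashii yo !"))    # emotive, class: sadness
print(annotate("kyou wa ame da"))   # non-emotive
```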
ML-Ask is also the only present prove the overall performance and add detailed in- system for Japanese recognized to implement the formation about emoticons. idea of Contextual Valence Shifters (CVS) (Zaenen and Polanyi, 2005) (words and phrases like “not”, 5 Annotation Results and Evaluation or “never”, which change the valence of an evalua- tive word). The last distinguishable feature of ML- It is physically impossible to manually evaluate all Ask is implementation of Russell’s two dimensional annotations on the corpus6 . Therefore we applied affect model (Russell, 1980), in which emotions three different types of evaluation. First was based are represented in two dimensions: valence (posi- on a sample of 1000 sentences randomly extracted tive/negative) and activation (activated/deactivated). from the corpus and annotated by laypeople. In sec- An example of negative-activated emotion could ond we compared YACIS annotations to other emo- be “anger”; a positive-deactivated emotion is, e.g., tion corpora. The third evaluation was application “relief”. The mapping of Nakamura’s emotion based and is be described in section 6. classes on Russell’s two dimensions was proved re- Evaluation of Affective Annotations: Firstly, we liable in several research (Ptaszynski et al., 2009b; needed to confirm the performance of affect anal- Ptaszynski et al., 2009c; Ptaszynski et al., 2010b). ysis systems on YACIS, since the performance is With these settings ML-Ask detects emotive sen- often related to the type of test set used in evalu- tences with a high accuracy (90%) and annotates af- ation. ML-Ask was positively evaluated on sepa- fect on utterances with a sufficiently high Precision rate sentences and on an online forum (Ptaszynski (85.7%), but low Recall (54.7%). Although low Re- et al., 2009c). However, it was not yet evaluated call is a disadvantage, we assumed that in a corpus on blogs. Moreover, the version of ML-Ask sup- as big as YACIS there should still be plenty of data. ported by CAO has not been evaluated thoroughly CAO (Ptaszynski et al., 2010b) is a system for as well. In the evaluation we used a test set cre- affect analysis of Japanese emoticons, called kao- ated by Ptaszynski et al. (2010b) for the evaluation moji. Emoticons are sets of symbols used to con- of CAO. It consists of thousand sentences randomly vey emotions in text-based online communication, extracted from YACIS and manually annotated with such as blogs. CAO extracts emoticons from in- emotion classes by 42 layperson annotators in an put and determines specific emotions expressed by anonymous survey. There are 418 emotive and 582 them. Firstly, it matches the input to a predeter- non-emotive sentences. We compared the results mined raw emoticon database (with over ten thou- on those sentences for ML-Ask, CAO (described in sand emoticons). The emoticons, which could not be detail by Ptaszynski et al. (2010b)), and both sys- estimated with this database are divided into seman- tems combined. The results showing accuracy, cal- tic areas (representations of “mouth” or “eyes”). The 6 Having one sec. to evaluate one sentence, one evaluator areas are automatically annotated according to their would need 11.2 years to verify the whole corpus (354 mil.s.). 93 Table 6: Emotion class annotations with percentage. Table 7: Comparison of positive and negative sentences emotion # of % emotion # of % between KNB and YACIS. 
class sentences class sentences positive negative ratio joy 16,728,452 31% excitement 2,833,388 5% dislike 10,806,765 20% surprize 2,398,535 5% KNB* emotional 317 208 1.52 fondness 9,861,466 19% gloom 2,144,492 4% attitude fear 3,308,288 6% anger 1,140,865 2% opinion 489 289 1.69 relief 3,104,774 6% shame 952,188 2% merit 449 264 1.70 acceptation 125 41 3.05 or rejection event 43 63 0.68 culated as a ratio of success to the overall number sum 1,423 865 1.65 of samples, are summarized in Table 4. The perfor- YACIS** only 22,381,992 12,837,728 1.74 only+mostly 23,753,762 13,605,514 1.75 mance of discrimination between emotive and non- * p<.05, ** p<.01 emotive sentences of ML-Ask baseline was a high 98.8%, which is much higher than in original eval- uation of ML-Ask (around 90%). This could indi- search. When it comes to statistics of each emo- cate that sentences with which the system was not tive feature (emoteme), the most frequent class were able to deal with appear much less frequently on interjections. Second frequent was the exclamative Ameblo. As for CAO, it is capable of detecting the marks class, which includes punctuation marks sug- presence of emoticons in a sentence, which is par- gesting emotive engagement (such as “!”, or “??”). tially equivalent to detecting emotive sentences in Third frequent emoteme class was emoticons, fol- ML-Ask, since emoticons are one type of features lowed by endearments. As an interesting remark, determining sentence as emotive. The performance emoteme class that was the least frequent were vul- of CAO was also high, 97.6%. This was due to the garities. As one possible interpretation of this re- fact that grand majority of emotive sentences con- sult we propose the following. Blogs are social tained emoticons. Finally, ML-Ask supported with space, where people describe their experiences to CAO achieved remarkable 100% accuracy. This was be read and commented by other people (friends, a surprisingly good result, although it must be re- colleagues). The use of vulgar language could dis- membered that the test sample contained only 1000 courage potential readers from further reading, mak- sentences (less than 0.0003% of the whole corpus). ing the blog less popular. Next, we checked the Next we verified emotion class annotations on sen- statistics of emotion classes annotated on emotive tences. The baseline of ML-Ask achieved slightly sentences. The results are represented in Table 6. better results (73.4%) than in its primary evalua- The most frequent emotions were joy (31%), dislike tion (Ptaszynski et al., 2009c) (67% of balanced F- (20%) and fondness (19%), which covered over 70% score with P=85.7% and R=54.7%). CAO achieved of all annotations. However, it could happen that 80.2%. Interestingly, this makes CAO a better affect the number of expressions included in each emotion analysis system than ML-Ask. However, the condi- class database influenced the number of annotations tion is that a sentence must contain an emoticon. The (database containing many expressions has higher best result, close to 90%, was achieved by ML-Ask probability to gather more annotations). Therefore supported with CAO. We also checked the results we verified if there was a correlation between the when only the dimensions of valence and activation number of annotations and the number of emotive were taken into account. ML-Ask achieved 88.6%, expressions in each emotion class database. The CAO nearly 95%. 
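The generalization from emotion classes to Russell's two dimensions, which underlies these valence/activation figures, can be pictured as a simple lookup followed by a vote over the classes annotated in a sentence. In the sketch below only anger (negative-activated) and relief (positive-deactivated) are taken from the text; the remaining assignments are assumptions made for the sake of the example and need not match the mapping actually used by the authors.

```python
# Hypothetical mapping of Nakamura's emotion classes onto Russell's two
# dimensions. Only "anger" and "relief" are stated in the text; the other
# assignments are assumptions for illustration only.
VALENCE_ACTIVATION = {
    "joy":      ("positive", "activated"),
    "fondness": ("positive", "deactivated"),
    "relief":   ("positive", "deactivated"),
    "anger":    ("negative", "activated"),
    "fear":     ("negative", "activated"),
    "dislike":  ("negative", "deactivated"),
    "sadness":  ("negative", "deactivated"),
}

def polarity(emotion_classes):
    """Derive sentence polarity from the annotated emotion classes."""
    votes = [VALENCE_ACTIVATION[c][0] for c in emotion_classes
             if c in VALENCE_ACTIVATION]
    if not votes:
        return None
    pos, neg = votes.count("positive"), votes.count("negative")
    return "positive" if pos > neg else "negative" if neg > pos else "mixed"

print(polarity(["joy", "fondness"]))  # positive
print(polarity(["anger"]))            # negative
```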
Support of CAO to ML-Ask again verification was based on Spearman’s rank corre- resulted in the best score, 97.5%. lation test between the two sets of numbers. The Statistics of Affective Annotations: There were test revealed no statistically significant correlation nearly twice as many emotive sentences than non- between the two types of data, with ρ=0.38. emotive (ratio 1.94). This suggests that the cor- Comparison with Other Emotion Corpora: pus is biased in favor of emotive contents, which Firstly, we compared YACIS with KNB. The KNB could be considered as a proof for the assumption corpus was annotated mostly for the need of sen- that blogs make a good base for emotion related re- timent analysis and therefore does not contain any 94 example, they use class name “hate” to describe Table 8: Comparison of number of emotive expressions in three different corpora including ratio within this set of what in YACIS is called “dislike”. Moreover, they emotions and results of Spearman’s rank correlation test. have no classes such as excitement, relief or shame. Minato et al. YACIS Nakamura To make the comparison possible we used only the dislike 355 (26%) 14,184,697 (23%) 532 (32%) emotion classes appearing in both cases and unified joy 295 (21%) 22,100,500 (36%) 224 (13%) all class names. The results are summarized in Ta- fondness 205 (15%) 13,817,116 (22%) 197 (12%) sorrow 205 (15%) 2,881,166 (5%) 232 (14%) ble 8. There was no correlation between YACIS and anger 160 (12%) 1,564,059 (3%) 199 (12%) Nakamura (ρ=0.25), which confirms the results cal- fear 145 (10%) 4,496,250 (7%) 147 (9%) surprise 25 (2%) 3,108,017 (5%) 129 (8%) culated in previous paragraph. A medium correla- Minato et al. Minato et al. YACIS and tion was observed between YACIS and Minato et al. and Nakamura and YACIS Nakamura (ρ=0.63). Finally, a strong correlation was observed Spearman’s ρ 0.88 0.63 0.25 between Minato et al. and Nakamura (ρ=0.88), which is the most interesting observation. Both Mi- nato et al. and Nakamura are in fact dictionaries of information on specific emotion classes. However, emotive expressions. However, the dictionaries were it is annotated with emotion valence for different collected in different times (difference of about 20 categories valence is expressed in Japanese, such years), by people with different background (lexi- as emotional attitude (e.g., “to feel sad about X” cographer vs. language teacher), based on differ- [NEG], “to like X” [POS]), opinion (e.g., “X is won- ent data (literature vs. conversation) assumptions derful” [POS]), or positive/negative event (e.g., “X and goals (creating a lexicon vs. Japanese language broke down” [NEG], “X was awarded” [POS]). We teaching). The only similarity is in the methodol- compared the ratios of sentences expressing posi- ogy. In both cases the dictionary authors collected tive to negative valence. The comparison was made expressions considered to be emotion-related. The for all KNB valence categories separately and as a fact that they correlate so strongly suggests that for sum. 
In our research we do not make additional sub- the compared emotion classes there could be a ten- categorization of valence types, but used in the com- dency in language to create more expressions to de- parison ratios of sentences in which the expressed scribe some emotions rather than the others (dislike, emotions were of only positive/negative valence and joy and fondness are often some of the most frequent including the sentences which were mostly (in ma- emotion classes). This phenomenon needs to be ver- jority) positive/negative. The comparison is pre- ified more thoroughly in the future. sented in table 7. In KNB for all valence categories except one the ratio of positive to negative sentences 6 Applications was biased in favor of positive sentences. Moreover, for most cases, including the ratio taken from the 6.1 Extraction of Evaluation Datasets sums of sentences, the ratio was similar to the one in In evaluation of sentiment and affect analysis sys- YACIS (around 1.7). Although the numbers of com- tems it is very important to provide a statistically pared sentences differ greatly, the fact that the ratio reliable random sample of sentences or documents remains similar across the two different corpora sug- as a test set (to be further annotated by laypeople). gests that the Japanese express in blogs more posi- The larger is the source, the more statistically reli- tive than negative emotions. able is the test set. Since YACIS contains 354 mil. Next, we compared the corpus created by Minato sentences in 13 mil. documents, it can be considered et al. (2006). This corpus was prepared on the ba- sufficiently reliable for the task of test set extraction, sis of an emotive expression dictionary. Therefore as probability of extracting twice the same sentence we compared its statistics not only to YACIS, but is close to zero. Ptaszynski et al. (2010b) already also to the emotive lexicon used in our research (see used YACIS to randomly extract a 1000 sentence section 4 for details). Emotion classes used in Mi- sample and used it in their evaluation of emoticon nato et al. differ slightly to those used in our re- analysis system. The sample was also used in this search (YACIS and Nakamura’s dictionary). For research and is described in more detail in section 5. 95 6.2 Generation of Emotion Object Ontology technique to gather consequences of actions apply- One of the applications of large corpora is to ing causality relations, like in the research described extract from them smaller sub-corpora for specified in section 6.2, but with a reversed algorithm and tasks. Ptaszynski et al. (2012a) applied YACIS lexicon containing not only emotional but also eth- for their task of generating an robust emotion ical notions. They cross-referenced emotional and object ontology. They used cross-reference of ethical information about a certain phrase (such as annotations of emotional information described “To kill a person.”) to obtain statistical probability in this paper and syntactic annotations done for emotional (“feeling sad”, “being in joy”, etc.) by Ptaszynski et al. (2012b) to extract only and ethical consequences (“being punished”, “being sentences in which expression of emotion was praised”, etc.). Initially, the moral agent was based proceeded by its cause, like in the example below. on the whole Internet contents. However, multiple queries to search engine APIs made by the agent caused constant blocking of IP address an in effect Kanojo ni furareta kara kanashii... 
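Returning to the cross-corpus comparison of emotion class distributions, the Spearman correlations reported in Table 8 can be approximately reproduced from the per-class expression counts printed there. The snippet below is a sketch using SciPy's `spearmanr`; the resulting coefficients come out close to the reported 0.88, 0.63 and 0.25, with small deviations possible depending on how ties are handled.

```python
# Sketch: reproducing the Spearman rank correlations of Table 8 from the
# per-class expression counts printed there. Requires scipy.
from scipy.stats import spearmanr

classes  = ["dislike", "joy", "fondness", "sorrow", "anger", "fear", "surprise"]
minato   = [355, 295, 205, 205, 160, 145, 25]
yacis    = [14_184_697, 22_100_500, 13_817_116, 2_881_166,
            1_564_059, 4_496_250, 3_108_017]
nakamura = [532, 224, 197, 232, 199, 147, 129]

pairs = {
    "Minato et al. vs. Nakamura": (minato, nakamura),
    "Minato et al. vs. YACIS":    (minato, yacis),
    "YACIS vs. Nakamura":         (yacis, nakamura),
}
for name, (a, b) in pairs.items():
    rho, p = spearmanr(a, b)
    print(f"{name}: rho = {rho:.2f} (p = {p:.2f})")
# Expected output close to the paper's values: 0.88, 0.63, 0.25
```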
hindered the development of the agent. Girlfriend DAT dump PAS CAUS sad ... The agent was tested on over 100 ethically- I’m sad because my girlfriend dumped me... significant real world problems, such as “killing a The example can be analyzed in the following way. man”, “stealing money”, “bribing someone”, “help- Emotive expression (kanashii, “sad”) is related with ing people” or “saving environment”. In result 86% the sentence contents (Kanojo ni furareta, “my of recognitions were correct. Some examples of the girlfriend dumped me”) with a causality morpheme results are presented in the Appendix on the end of (kara, “because”). In such situation the sentence this paper. contents represent the object of emotion. This can be generalized to the following meta-structure, 7 Conclusions OE CAU S XE , We performed automatic annotation of a five- where OE =[Emotion object], CAU S=[causal billion-word corpus of Japanese blogs with informa- form], and XE =[expression of emotion]. tion on affect and sentiment. A survey in emotion The cause phrases were cleaned of irrelevant blog corpora showed there has been no large scale words like stop words to leave only the object emotion corpus available for the Japanese language. phrases. The evaluation showed they were able to We chose YACIS, a large-scale blog corpus and extract nearly 20 mil. object phrases, from which annotated it using two systems for affect analysis 80% was extracted correctly with a reliable signifi- for word- and sentence-level affect analysis and for cance. Thanks to rich annotations on YACIS corpus analysis of emoticons. The annotated information the ontology included such features as emotion class included affective features like sentence subjectivity (joy, anger, etc.), dimensions (valence/activation), (emotive/non-emotive) or emotion classes (joy, sad- POS or semantic categories (hypernyms, etc.). ness, etc.), useful in affect analysis and information on sentence valence/polarity (positive/negative) use- 6.3 Retrieval of Moral Consequence of Actions ful in sentiment analysis obtained as generalizations Third application of the YACIS corpus annotated of those features on a 2-dimensional model of af- with affect- and sentiment-related information has fect. We evaluated the annotations in several ways. been in a novel research on retrieval of moral con- Firstly, on a test set of thousand sentences extracted sequences of actions, first proposed by Rzepka and and evaluated by over forty respondents. Secondly, Araki (2005) and recently developed by Komuda et we compared the statistics of annotations to other al. (2010)7 . The moral consequence retrieval agent existing emotion corpora. Finally, we showed sev- was based on the idea of Wisdom of Crowd. In eral tasks the corpus has already been applied in, particular Komuda et al. (2010) used a Web-mining such as generation of emotion object ontology or re- 7 See also a mention in Scientific American, by Anderson and trieval of emotional and moral consequences of ac- Anderson (2010). tions. 96 Acknowledgments Paul J. Hopper and Sandra A. Thompson. 1985. “The Iconic- ity of the Universal Categories ’Noun’ and ’Verbs’”. In Ty- This research was supported by (JSPS) KAKENHI pological Studies in Language: Iconicity and Syntax. John Grant-in-Aid for JSPS Fellows (Project Number: 22- Haiman (ed.), Vol. 6, pp. 151-183, Amsterdam: John Ben- 00358). jamins Publishing Company. Daisuke Kawahara and Sadao Kurohashi. 2006. 
“A Fully- References Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis”, Proceedings of the Human Lan- Ahmed Abbasi and Hsinchun Chen. ”Affect Intensity Analysis guage Technology Conference of the North American Chap- of Dark Web Forums”, Intelligence and Security Informatics ter of the ACL, pp. 176-183. 2007, pp. 282-288, 2007 Radoslaw Komuda, Michal Ptaszynski, Yoshio Momouchi, Saima Aman and Stan Szpakowicz. 2007. “Identifying Ex- Rafal Rzepka, and Kenji Araki. 2010. “Machine Moral De- pressions of Emotion in Text”. In Proceedings of the 10th velopment: Moral Reasoning Agent Based on Wisdom of International Conference on Text, Speech, and Dialogue Web-Crowd and Emotions”, Int. Journal of Computational (TSD-2007), Lecture Notes in Computer Science (LNCS), Linguistics Research, Vol. 1 , Issue 3, pp. 155-163. Springer-Verlag. Taku Kudo and Hideto Kazawa. 2009. “Japanese Web N-gram Michael Anderson and Susan Leigh Anderson. 2010. “Robot be Version 1”, Linguistic Data Consortium, Philadelphia, Good”, Scientific American, October, pp. 72-77. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalog Dipankar Das, Sivaji Bandyopadhyay, “Labeling Emotion in Id=LDC2009T08 Bengali Blog Corpus ? A Fine Grained Tagging at Sentence Vinci Liu and James R. Curran. 2006. “Web Text Corpus for Level”, Proceedings of the 8th Workshop on Asian Language Natural Language Processing”, In Proceedings of the 11th Resources, pages 47?55, 2010. Meeting of the European Chapter of the Association for Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Computational Linguistics (EACL), pp. 233-240. Zanchetta. 2008. “The WaCky Wide Web: A Collection Maciejewski, J., Ptaszynski, M., Dybala, P. 2010. “Developing of Very Large Linguistically Processed Web-Crawled Cor- a Large-Scale Corpus for Natural Language Processing and pora”, Kluwer Academic Publishers, Netherlands. Emotion Processing Research in Japanese”, In Proceedings Marco Baroni and Motoko Ueyama. 2006. “Building General- of the International Workshop on Modern Science and Tech- and Special-Purpose Corpora by Web Crawling”, In Pro- nology (IWMST), pp. 192-195. ceedings of the 13th NIJL International Symposium on Kazuyuki Matsumoto, Yusuke Konishi, Hidemichi Sayama, Language Corpora: Their Compilation and Application, Fuji Ren. 2011. “Analysis of Wakamono Kotoba Emotion www.tokuteicorpus.jp/result/pdf/2006 004.pdf Corpus and Its Application in Emotion Estimation”, Interna- Jürgen Broschart. 1997. “Why Tongan does it differently: Cate- tional Journal of Advanced Intelligence, Vol.3,No.1,pp.1-24. gorial Distinctions in a Language without Nouns and Verbs.” Junko Minato, David B. Bracewell, Fuji Ren and Shingo Linguistic Typology, Vol. 1, No. 2, pp. 123-165. Kuroiwa. 2006. “Statistical Analysis of a Japanese Emotion Corpus for Natural Language Processing”, LNCS 4114. Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale Gilad Mishne. 2005. “Experiments with Mood Classification in and Mark Johnson. 2000. “BLLIP 1987-89 WSJ Corpus Blog Posts”. In The 1st Workshop on Stylistic Analysis of Release 1”, Linguistic Data Consortium, Philadelphia, Text for Information Access, at SIGIR 2005, August 2005. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalog Akira Nakamura. 1993. “Kanjo hyogen jiten” [Dictionary of Id=LDC2000T43 Emotive Expressions] (in Japanese), Tokyodo Publishing, Paul Ekman. 1992. “An Argument for Basic Emotions”. Cogni- Tokyo, 1993. tion and Emotion, Vol. 6, pp. 169-200. 
Jan Pomikálek, Pavel Rychlý and Adam Kilgarriff. 2009. “Scal- Irena Srdanovic Erjavec, Tomaz Erjavec and Adam Kilgarriff. ing to Billion-plus Word Corpora”, In Advances in Computa- 2008. “A web corpus and word sketches for Japanese”, Infor- tional Linguistics, Research in Computing Science, Vol. 41, mation and Media Technologies, Vol. 3, No. 3, pp.529-551. pp. 3-14. Katarzyna Głowińska and Adam Przepiórkowski. 2010. “The Michal Ptaszynski, Pawel Dybala, Wenhan Shi, Rafal Rzepka Design of Syntactic Annotation Levels in the National Cor- and Kenji Araki. 2009. “A System for Affect Analysis of Ut- pus of Polish”, In Proceedings of LREC 2010. terances in Japanese Supported with Web Mining”, Journal Peter Halacsy, Andras Kornai, Laszlo Nemeth, Andras Rung, of Japan Society for Fuzzy Theory and Intelligent Informat- Istvan Szakadat and Vikto Tron. 2004. “Creating open lan- ics, Vol. 21, No. 2, pp. 30-49 (194-213). guage resources for Hungarian”. In Proceedings of the Michal Ptaszynski, Pawel Dybala, Wenhan Shi, Rafal Rzepka LREC, Lisbon, Portugal. and Kenji Araki. 2009. “Towards Context Aware Emotional Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Intelligence in Machines: Computing Contextual Appro- Keiji Shinzato and Masaaki Nagata. 2011. “Construction of a priateness of Affective States”. In Proceedings of Twenty- Blog Corpus with Syntactic, Anaphoric, and Sentiment An- first International Joint Conference on Artificial Intelligence notations” [in Japanese], Journal of Natural Language Pro- (IJCAI-09), Pasadena, California, USA, pp. 1469-1474. cessing, Vol 18, No. 2, pp. 175-201. Michal Ptaszynski, Pawel Dybala, Rafal Rzepka and Kenji Ichiro Hiejima. 1995. A short dictionary of feelings and emo- Araki. 2009. “Affecting Corpora: Experiments with Au- tions in English and Japanese, Tokyodo Shuppan. tomatic Affect Annotation System - A Case Study of 97 the 2channel Forum -”, In Proceedings of the Conference Appendix. Examples of emotional and of the Pacific Association for Computational Linguistics ethical consequence retrieval. (PACLING-09), pp. 223-228. SUCCESS CASES Michal Ptaszynski, Rafal Rzepka and Kenji Araki. 2010a. “On the Need for Context Processing in Affective Computing”, emotional ethical results score results score In Proceedings of Fuzzy System Symposium (FSS2010), Or- conseq. conseq. ganized Session on Emotions, September 13-15. “To hurt somebody.” Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal anger 13.01/54.1 0.24 penalty/ 4.01/7.1 0.565 Rzepka and Kenji Araki. 2010b. “CAO: Fully Automatic fear 12.01/54.1 0.22 punishment sadness 11.01/54.1 0.2 Emoticon Analysis System”, In Proc. of the 24th AAAI Con- “To kill one’s own mother.” ference on Artificial Intelligence (AAAI-10), pp. 1026-1032. sadness 9.01/35.1 0.26 penalty/ 5.01/5.1 0.982 Michal Ptaszynski, Rafal Rzepka, Kenji Araki and Yoshio Mo- surprise 6.01/35.1 0.17 punishment mouchi. 2012a. “A Robust Ontology of Emotion Objects”, In anger 5.01/35.1 0.14 Proceedings of The Eighteenth Annual Meeting of The Asso- “To steal an apple.” ciation for Natural Language Processing (NLP-2012), pp. surprise 2.01/6.1 0.33 reprimand/ 3.01/3.1 0.971 719-722. anger 2.01/6.1 0.33 scold “To steal money.” Michal Ptaszynski, Rafal Rzepka, Kenji Araki and Yoshio Mo- anger 3.01/9.1 0.33 penalty/punish.3.01/6.1 0.493 mouchi. 2012b. “Annotating Syntactic Information on 5.5 sadness 2.01/9.1 0.22 reprimand/sco. 
2.01/6.1 0.330 Billion Word Corpus of Japanese Blogs”, In Proceedings “To kill an animal.” of The 18th Annual Meeting of The Association for Natural dislike 7.01/23.1 0.3 penalty/ 36.01/45.1 0.798 Language Processing (NLP-2012), pp. 385-388. sadness 5.01/23.1 0.22 punishment Changqin Quan and Fuji Ren. 2010. “A blog emotion corpus “To drive after drinking.” for emotional expression analysis in Chinese”, Computer fear 6.01/19.1 0.31 penalty/punish.24.01/36.1 0.665 “To cause a war.” Speech & Language, Vol. 24, Issue 4, pp. 726-749. dislike 7.01/15.1 0.46 illegal 2.01/3.1 0.648 Rafal Rzepka, Kenji Araki. 2005. “What Statistics Could Do fear 3.01/15.1 0.2 for Ethics? - The Idea of Common Sense Processing Based “To stop a war.” Safety Valve”, AAAI Fall Symposium on Machine Ethics, joy 6.01/13.1 0.46 forgiven 1.01/1.1 0.918 Technical Report FS-05-06, pp. 85-87. surprise 2.01/13.1 0.15 James A. Russell. 1980. “A circumplex model of affect”. J. of “To prostitute oneself.” Personality and Social Psychology, Vol. 39, No. 6, pp. 1161- anger 6.01/19.1 0.31 illegal 12.01/19.1 0.629 sadness 5.01/19.1 0.26 1178. “To have an affair.” Peter D. Turney and Michael L. Littman. 2002. “Unsupervised sadness 10,01/35.1 0.29 penalty/punish.8.01/11.1 0.722 Learning of Semantic Orientation from a Hundred-Billion- anger 9.01/35.1 0.26 Word Corpus”, National Research Council, Institute for In- INCONSISTENCY BETWEEN EMOTIONS AND ETHICS formation Technology, Technical Report ERB-1094. (NRC #44929). “To kill a president.” Masao Utiyama and Hitoshi Isahara. 2003. “Reliable Mea- joy 2.01/4.1 0.49 penalty/ 2.01/2.1 0.957 likeness 1.01/4.1 0.25 punishment sures for Aligning Japanese-English News Articles and Sen- “To kill a criminal.” tences”. ACL-2003, pp. 72-79. joy 8.01/39.1 0.2 penalty/ 556/561 0.991 Janyce Wiebe, Theresa Wilson and Claire Cardie. 2005. “An- excite 8.01/39.1 0.2 punishment notating expressions of opinions and emotions in language”. anger 7.01/39.1 0.18 Language Resources and Evaluation, Vol. 39, Issue 2-3, pp. CONTEXT DEPENDENT 165-210. Theresa Wilson and Janyce Wiebe. 2005. “Annotating Attribu- “To act violently.” anger 4.01/11.1 0.36 penalty/punish.1.01/2.1 0.481 tions and Private States”, In Proceedings of the ACL Work- fear 2.01/11.1 0.18 agreement 1.01/2.1 0.481 shop on Frontiers in Corpus Annotation II, pp. 53-60. Annie Zaenen and Livia Polanyi. 2006. “Contextual Valence NO ETHICAL CONSEQUENCES Shifters”. In Computing Attitude and Affect in Text, J. G. “Sky is blue.” Shanahan, Y. Qu, J. Wiebe (eds.), Springer Verlag, Dor- joy 51.01/110,1 0.46 none 0 0 drecht, The Netherlands, pp. 1-10. sadness 21.01/110,1 0.19 98 How to Evaluate Opinionated Keyphrase Extraction? Gábor Berend Veronika Vincze University of Szeged Hungarian Academy of Sciences Department of Informatics Research Group on Artificial Intelligence Árpád tér 2., Szeged, Hungary Tisza Lajos krt. 103., Szeged, Hungary
[email protected] [email protected]Abstract 2 Related Work Evaluation often denotes a key issue in As the task we aim at involves extracting keyphrases semantics- or subjectivity-related tasks. Here we discuss the difficulties of evaluating opin- that are responsible for the author’s opinion toward ionated keyphrase extraction. We present our the product, aspects of both keyphrase extraction method to reduce the subjectivity of the task and opinion mining determine our methodology and and to alleviate the evaluation process and evaluation procedure. There are several sentiment we also compare the results of human and analysis approaches that make use of manually an- machine-based evaluation. notated review datasets (Zhuang et al., 2006; Li et al., 2010; Jang and Shin, 2010) and Wei and 1 Introduction Gulla (2010) constructed a sentiment ontology tree Evaluation is a key issue in natural language pro- in which attributes of the product and sentiments cessing (NLP) tasks. Although for more basic tasks were paired. such as tokenization or morphological parsing, the For evaluating scientific keyphrase extraction, level of ambiguity and subjectivity is essentially several methods have traditionally been applied. In lower than for higher-level tasks such as question the case of exact match, the gold standard key- answering or machine translation, it is still an open words must be in perfect overlap with the ex- question to find a satisfactory solution for the (auto- tracted keywords (Witten et al., 1999; Frank et al., matic) evaluation of certain tasks. Here we present 1999) – also followed in the SemEval-2010 task the difficulties of finding an appropriate way of eval- on keyphrase extraction (Kim et al., 2010), while uating a highly semantics- and subjectivity-related in other cases, approximate matches or semanti- task, namely opinionated keyphrase extraction. cally similar keyphrases are also accepted (Zesch There has been a growing interest in the NLP and Gurevych, 2009; Medelyan et al., 2009). In this treatment of subjectivity and sentiment analysis – work we applied the former approach for the evalu- see e.g. Balahur et al. (2011) – on the one hand and ation of opinion phrases and made a thorough com- on keyphrase extraction (Kim et al., 2010) on the parison with the human judgement. other hand. The tasks themselves are demanding for Here, we use the framework introduced in Berend automatic systems due to the variety of the linguis- (2011) and conducted further experiments based on tic ways people can express the same linguistic con- it to point out the characteristics of the evaluation tent. Here we focus on the evaluation of subjective of opinionated keyphrase extraction. Here we pin- information mining through the example of assign- point the severe differences in performance mea- ing opinionated keyphrases to product reviews and sures when the output is evaluated by humans com- compare the results of human- and machine-based pared to strict exact match principles and also exam- evaluation on finding opinionated keyphrases. ine the benefit of hand-annotated corpus as opposed 99 Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 99–103, Jeju, Republic of Korea, 12 July 2012. 2012 c Association for Computational Linguistics to an automatically crawled one. In addition, the Auth. Ann1 Ann2 Ann3 extent to which original author keyphrases resemble Auth. 
– 0.415 0.324 0.396 those of independent readers’ is also investigated in Ann1 0.601 – 0.679 0.708 this paper. Ann2 0.454 0.702 – 0.713 Ann3 0.525 0.690 0.688 – 3 Methodology Table 1: Inter-annotator agreement among the author’s In our experiments, we used the methodology de- and annotators’ sets of opinion phrases. Elements above and under the main diagonal refer to the agreement rates scribed in Berend (2011) to extract opinionated in Dice coefficient for pro and con phrases, respectively. keyphrase candidates from the reviews. The sys- tem treats it as a supervised classification task us- ing Maximum Entropy classifier, in which certain lent, yet much simpler forms, e.g. instead of ‘even I n-grams of the product reviews are treated as classi- found the phones menus to be confusing’, we would fication instances and the task is to classify them as like to have ‘confusing phones menus’. Refinement proper or improper ones. It incorporates a rich fea- was carried out both automatically by using hand- ture set, relying on the usage of SentiWordNet (Esuli crafted transformation rules (based on POS patterns et al., 2010) and further orthological, morphological and parse trees) and manual inspection. The an- and syntactic features. Next, we present the diffi- notation guidelines for the human refinement and culties of opinionated keyphrase extraction and offer various statistics on the dataset can be accessed at our solutions to the emerging problems. http://rgai.inf.u-szeged.hu/proCon. 3.1 Author keyphrases 3.2 Annotator keyphrases In order to find relevant keyphrases in the texts, The second problem with regard to opinionated first the reviews have to be segmented into ana- keyphrase extraction is the subjectivity of the task. lyzable parts. We made use of the dataset de- Different people may have different opinions on the scribed in Berend (2011), which contains 2000 prod- very same product, which is often reflected in their uct reviews each from two quite different domains, reviews. On the other hand, people can gather dif- i.e. mobile phone and video film reviews from the re- ferent information from the very same review due view portal epinions.com. In the free-text parts to differences in interpretation, which again compli- of the reviews, the author describes his subjective cates the way of proper evaluation. feelings and views towards the product, and in the In order to evaluate the difficulty of identifying sections Pros and cons and Bottomline he summa- opinion-related keyphrases, we decided to apply the rizes the advantages and disadvantages of the prod- following methodology. We selected 25 reviews re- uct, usually by providing some keyphrases or short lated to the mobile phone Nokia 6610, which were sentences. However, these pros and cons are noisy also collected from the website epinions.com. since some authors entered full sentences while oth- The task for three linguists was to write positive ers just wrote phrases or keywords. Furthermore, and negative aspects of the product in the form of the segmentation also differs from review to review keyphrases, similar to the original pros and cons. In or even within the same review (comma, semicolon, order not to be influenced by the keyphrases given ampersand etc.). There are also non-informative by the author of the review, the annotators were only comments such as none among cons. For the above given the free-text part of the review, i.e. 
the origi- reasons, the identification of the appropriate gold nal Pros and cons and Bottomline sections were re- standard phrases is not unequivocal. moved. In this way, three different pro and con an- We had to refine the pros and cons of the re- notations were produced for each review, besides, views so that we could have access to a less noisy those of the original author were also at hand. The database. Refinement included segmenting pros inter-annotator agreement rate is in Table 1. and cons into keyphrase-like units and also bring- Concerning the subjectivity of the task, pro and ing complex phrases into their semantically equiva- con phrases provided by the three annotators and 100 Eval Ref Top-5 Top-10 Top-15 purely on the phrases of the annotators excluding the 3Ann∪ man 32.14 44.66 53.92 original phrases of the author or including them. The 3Ann∪ auto 27.68 38.17 45.78 following example illustrates the way new sets were M erged∪ man 28.52 41.09 52.18 created based on the input sets (in italics): M erged∪ auto 27.39 37.67 46.34 Pro1 : radio, organizer, phone book 3Ann∩ man 34.89 43.31 44.92 Pro2 : radio, organizer, loudspeaker 3Ann∩ auto 29.96 34.34 35.54 Pro3 : radio, organizer, calendar M erged∩ man 24.75 26.12 22.22 Union: radio, organizer, calendar, loud- M erged∩ auto 21.39 20.94 21.89 speaker, phone book Author man 27.14 33.5 35.24 Intersection: radio, organizer Author auto 20.61 22.34 25.03 Proauthor : clear, fun Table 2: F-scores of the human evaluation of the automat- Merged Union: radio, organizer, calen- ically extracted opinion phrases. Columns Eval and Ref dar, loudspeaker, phone book, clear, fun show the way gold standard phrases were obtained and if Merged Intersection: ∅ they were refined manually or automatically. The reason behind this methodology was that it made it possible to evaluate our automatic meth- the original author showed a great degree of variety ods in two different ways. Comparing the automatic although they had access to the very same review. keyphrases to the union of human annotations means Sometimes it happened that one annotator did not that a bigger number of keyphrases is to be identi- give any pro or con phrases for a review whereas the fied, however, with a bigger number of gold standard others listed a bunch of them, which reflects that the keywords it is more probable that the automatic key- very same feature can be judged as still tolerable, words occur among them. At the same time having a neutral or absolutely negative for different people. larger set of gold standard tags might affect the recall Thus, as even human annotations may differ from negatively since there are more keyphrases to return. each other to a great extent, it is not unequivocal to On the other hand, in the case of intersection it can decide which human annotation should be regarded be measured whether the most important features as the gold standard upon evaluation. (i.e. those that every annotator felt relevant) can be extracted from the text. Note that our strategy is sim- 3.3 Evaluation methodology ilar to the one applied in the case of BLEU/ROUGE score (Papineni et al., 2002; Lin, 2004) with respect Since the comparison of annotations highlighted to the fact that multiple good solutions are taken the subjectivity of the task, we voted for smooth- into account whereas the application of union and ing the divergences of annotations. 
We wanted to intersection is determined by the nature of the task: take into account all the available annotations which different annotators may attach several outputs (in were manually prepared and regarded as acceptable. other words, different numbers of keyphrases) to the Thus, an annotator formed the union and the inter- same document in the case of keyphrase extraction, section of the pro and con features given by each an- which is not realistic in the case of machine trans- notator either including or excluding those defined lation or summarization (only one output is offered by the original author. With this, we aimed at elim- for each sentence / text). inating subjectivity since in the case of union, every keyphrase mentioned by at least one annotator was 3.4 Results taken into consideration while in the case of inter- In our experiments, we used the opinion phrase ex- section, it is possible to detect keyphrases that seem traction system based on the paper of Berend (2011). to be the most salient for the annotators as regards Results vary whether the manually or the automat- the given document. Thus, four sets of pros and cons ically refined set of the original sets of pros and were finally yielded for each review depending on cons were regarded as positive training examples whether the unions or intersections were determined and also whether the evaluation was carried out 101 Mobiles Movies gold standard opinion phrases. Note, however, that A/A 9.95 9.55 8.61 7.58 7.1 6.24 even though results obtained with the automatic re- A/M 13.51 12.73 11.2 9.95 9.05 7.72 finement of training instances tend to stay below the M/A 10.15 9.7 8.69 7.52 6.92 5.97 results that are obtained with the manual refinement M/M 15.27 14.11 12.17 12.22 10.63 8.67 of gold standard phrases, they are still comparable, Table 3: F-scores achieved with different keyphrase re- which implies that with more sophisticated rules, finement strategies. A and M as the first (second) charac- training data could be automatically generated. ter indicate the fact that the training (testing) was based If the inter-annotator agreement rates are com- on the automatically and manually defined sets of gold pared, it can be seen that the agreement rates be- standard expressions, respectively. tween the annotators are considerably higher than those between a linguist and the author of the prod- uct review. This may be due to the fact that the against purely the original set of author-assigned linguists were to conform to the annotation guide- keyphrases or the intersection/union of the man- lines whereas the keyphrases given by the authors ual annotations including and excluding the author- of the reviews were not limited in any way. Still, assigned keyphrases on the 25 mobile phone re- it can be observed that among the author-annotator views. Results of the various combinations in the agreement rates, the con phrases could reach higher experiments for the top 5, 10 and 15 keyphrases agreement than the pro phrases. This can be due to are reported in Table 2 containing both cases when psychological reasons: people usually expect things human and automatic refinement of the gold stan- to be good hence they do not list all the features that dard opinion phrases were carried out. Automatic are good (since they should be good by nature), in keyphrases were manually compared to the above contrast, they list negative features because this is mentioned sets of keyphrases, i.e. human annotators what deviates from the normal expectations. 
judged them as acceptable or not. Human evaluation In this paper, we discussed the difficulties of eval- had the advantage over automated ones, that they uating opinionated keyphrase extraction and also could accept the extracted term ‘MP3’ when there conducted experiments to investigate the extent of was only its mistyped version ‘MP+’ in the set of overlap between the keyphrases determined by the gold standard phrases (as found in the dataset). original author of a review and those assigned by Table 3 presents the results of our experiments on independent readers. To reduce the subjectivity of keyphrase refinement on the mobiles and movies do- the task and to alleviate the evaluation process, we mains. In these settings strict matches were required presented our method that employs several indepen- instead of human evaluation. Results differ with re- dent annotators and we also compared the results of spect to the fact whether the automatically or manu- human and machine-based evaluation. Our results ally refined sets of the original author phrases were reveal that for now, human evaluation leads to bet- utilized for training and during the strict evaluation. ter results, however, we believe that the proper treat- Having conducted these experiments, we could ex- ment of polar expressions and ambiguous adjectives amine the possibility of a fully automatic system that might improve automatic evaluation among others. needs no manually inspected training data, but it can Besides describing the difficulties of the auto- create it automatically as well. matic evaluation of opinionated keyphrase extrac- 4 Discussion and conclusions tion, the impact of training on automatically crawled gold standard opinionated phrases was investigated. Both human and automatic evaluation reveal that Although not surprisingly they lag behind the ones the results yielded when the system was trained on obtained based on manually refined training data, manually refined keyphrases are better. The usage the automatic creation of gold standard keyphrases of manually refined keyphrases as the training set can be a much cheaper, yet feasible option to manu- leads to better results (the difference being 5.9 F- ally refined opinion phrases. In the future, we plan to score on average), which argues for human annota- reduce the gap between manual and automatic eval- tion as opposed to automatic normalization of the uation of opinionated keyphrase extraction. 102 Acknowledgments Olena Medelyan, Eibe Frank, and Ian H. Witten. 2009. Human-competitive tagging using automatic This work was supported in part by the NIH grant keyphrase extraction. In Proceedings of the 2009 (project codename MASZEKER) of the Hungarian Conference on Empirical Methods in Natural Lan- government. guage Processing, pages 1318–1327, Singapore, Au- gust. ACL. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- References Jing Zhu. 2002. Bleu: a method for automatic eval- Alexandra Balahur, Ester Boldrini, Andres Montoyo, and uation of machine translation. In Proceedings of 40th Patricio Martinez-Barco, editors. 2011. Proceedings Annual Meeting of the ACL, pages 311–318, Philadel- of the 2nd Workshop on Computational Approaches to phia, Pennsylvania, USA, July. ACL. Subjectivity and Sentiment Analysis (WASSA 2.011). Wei Wei and Jon Atle Gulla. 2010. Sentiment learn- ACL, Portland, Oregon, June. ing on product reviews via sentiment ontology tree. In Gábor Berend. 2011. 
Opinion expression mining by ex- Proceedings of the 48th Annual Meeting of the ACL, ploiting keyphrase extraction. In Proceedings of 5th pages 404–413, Uppsala, Sweden, July. ACL. International Joint Conference on Natural Language Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Processing, pages 1162–1170, Chiang Mai, Thailand, Gutwin, and Craig G. Nevill-Manning. 1999. Kea: November. Asian Federation of Natural Language Pro- Practical automatic keyphrase extraction. In ACM DL, cessing. pages 254–255. Andrea Esuli, Stefano Baccianella, and Fabrizio Se- Torsten Zesch and Iryna Gurevych. 2009. Approxi- bastiani. 2010. Sentiwordnet 3.0: An enhanced mate Matching for Evaluating Keyphrase Extraction. lexical resource for sentiment analysis and opinion In Proceedings of the 7th International Conference mining. In Proceedings of the Seventh conference on Recent Advances in Natural Language Processing, on International Language Resources and Evaluation pages 484–489, September. (LREC’10), Valletta, Malta, May. European Language Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie Resources Association (ELRA). review mining and summarization. In Proceedings of Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl the 15th ACM international conference on Information Gutwin, and Craig G. Nevill-Manning. 1999. and knowledge management, CIKM ’06, pages 43–50, Domain-specific keyphrase extraction. In Proceed- New York, NY, USA. ACM. ing of 16th International Joint Conference on Artifi- cial Intelligence, pages 668–673. Morgan Kaufmann Publishers. Hayeon Jang and Hyopil Shin. 2010. Language-specific sentiment analysis in morphologically rich languages. In Coling 2010: Posters, pages 498–506, Beijing, China, August. Coling 2010 Organizing Committee. Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Tim- othy Baldwin. 2010. Semeval-2010 task 5: Auto- matic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Se- mantic Evaluation, SemEval ’10, pages 21–26, Mor- ristown, NJ, USA. ACL. Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 653– 661, Beijing, China, August. Coling 2010 Organizing Committee. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Stan Szpakowicz Marie- Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74– 81, Barcelona, Spain, July. ACL. 103 Semantic frames as an anchor representation for sentiment analysis Josef Ruppenhofer Ines Rehbein Department of Information Science SFB 632: Information Structure and Natural Language Processing German Department University of Hildesheim, Germany Potsdam University, Germany
Abstract

Current work on sentiment analysis is characterized by approaches with a pragmatic focus, which use shallow techniques in the interest of robustness but often rely on ad-hoc creation of data sets and methods. We argue that progress towards deep analysis depends on a) enriching shallow representations with linguistically motivated, rich information, and b) focusing different branches of research and combining resources to create synergies with related work in NLP. In the paper, we propose SentiFrameNet, an extension to FrameNet, as a novel representation for sentiment analysis that is tailored to these aims.

1 Introduction

Sentiment analysis has made a lot of progress on more coarse-grained analysis levels using shallow techniques. However, recent years have seen a trend towards more fine-grained and ambitious analyses requiring more linguistic knowledge and more complex statistical models. Recent work has tried to produce relatively detailed summaries of opinions expressed in news texts (Stoyanov and Cardie, 2011); to assess the impact of quotations from business leaders on stock prices (Drury et al., 2011); to detect implicit sentiment (Balahur et al., 2011); etc. Accordingly, we can expect that greater demands will be made on the amount of linguistic knowledge, its representation, and the evaluation of systems.

Against this background, we argue that it is worthwhile to complement the existing shallow and pragmatic approaches with a deep, lexical-semantics based one in order to enable deeper analysis. We report on ongoing work in constructing SentiFrameNet, an extension of FrameNet (Baker et al., 1998) offering a novel representation for sentiment analysis based on frame semantics.

2 Shallow and pragmatic approaches

Current approaches to sentiment analysis are mainly pragmatically oriented, without giving equal weight to semantics. One aspect concerns the identification of sentiment-bearing expressions. The annotations in the MPQA corpus (Wiebe et al., 2005), for instance, were created without limiting what annotators can annotate in terms of syntax or lexicon. While this serves the spirit of discovering the variety of opinion expressions in actual contexts, it makes it difficult to match opinion expressions when using the corpus as an evaluation dataset, as the same or similar structures may be treated differently.

A similar challenge lies in distinguishing so-called polar facts from inherently sentiment-bearing expressions. For example, out of context, one would not associate any of the words in the sentence Wages are high in Switzerland with a particular evaluative meaning. In specific contexts, however, we may take the sentence as a reason to think either positively or negatively of Switzerland: employees receiving wages may be drawn to Switzerland, while employers paying wages may view this state of affairs negatively. As shown by the inter-annotator agreement results reported by Toprak et al. (2010), agreement on distinguishing polar facts from inherently evaluative language is low. Unsurprisingly, many efforts at automatically building up sentiment lexica simply harvest expressions that frequently occur as part of polar facts, without resolving whether the subjectivity clues extracted are inherently evaluative or merely associated with statements of polar fact.
Pragmatic considerations also lead to certain expressions of sentiment or opinion being excluded from analysis. Seki (2007), for instance, annotated sentences as "not opinionated" if they contain indirect hearsay evidence or widely held opinions.

In the case of targets, the work by Stoyanov and Cardie (2008) exhibits a pragmatic focus as well. These authors distinguish between (a) the topic of a fine-grained opinion, defined as the real-world object, event or abstract entity that is the subject of the opinion as intended by the opinion holder; (b) the topic span associated with an opinion expression, the closest, minimal span of text that mentions the topic; and (c) the target span, defined as the span of text that covers the syntactic surface form comprising the contents of the opinion. As the definitions show, Stoyanov and Cardie (2008) focus on text-level, pragmatic relevance by paying attention to what the author intends, rather than concentrating on the explicit syntactic dependent (their target span) as the topic. This pragmatic focus is also in evidence in Wilson (2008)'s work on contextual polarity classification, which uses features that are syntactically independent of the opinion expression, such as the number of subjectivity clues in adjoining sentences.

Among lexicon-driven approaches, we find that despite arguments that word sense distinctions are important to sentiment analysis (Wiebe and Mihalcea, 2006), often-used resources do not take them into account, and new resources are still being created which operate on the more shallow lemma level (e.g. Neviarouskaya et al. (2009)). Further, most lexical resources do not adequately represent cases where multiple opinions are tied to one expression and where presuppositions and temporal structure come into play. An example is the verb despoil: there is a positive opinion by the reporter about the despoiled entity in its former state, a negative opinion about its present state, and (inferrable) negative sentiment towards the despoiler. In most resources, the positive opinion will not be represented.

The most common approach to the task is an information extraction-like pipeline. Expressions of opinion, sources and targets are often dealt with separately, possibly using separate resources. Some work such as Kim and Hovy (2006) has explored the connection to role labeling. One reason not to pursue this is that "in many practical situations, the annotation beyond opinion holder labeling is too expensive" (Wiegand, 2010, p. 121). Shaikh et al. (2007) use semantic dependencies and composition rules for sentence-level sentiment scoring but do not deal with source and target extraction. The focus on robust partial solutions, however, prevents the creation of an integrated high-quality resource.

3 The extended frame-semantic approach

We now sketch a view of sentiment analysis on the basis of an appropriately extended model of frame semantic representation. (A fuller account of our ideas is presented in an unpublished longer version of this paper, available from the authors' websites.)

Link to semantic frames and roles. Since the possible sources and targets of opinion are usually identical to a predicate's semantic roles, we add opinion frames with slots for Source, Target, Polarity and Intensity to the FrameNet database. We map the Source and Target opinion roles to semantic roles as appropriate, which enables us to use semantic role labeling systems in the identification of opinion roles (Ruppenhofer et al., 2008). In SentiFrameNet all lexical units (LUs) that are inherently evaluative are associated with opinion frames. The language of polar facts is not associated with opinion frames. However, we show in the longer version of this paper how we support certain types of inferred sentiment. With regard to targets, our representation selects as targets of opinion the target spans of Stoyanov and Cardie (2008) rather than their opinion topics (see Section 2). For us, opinion topics that do not coincide with target spans are inferential opinion targets.
Formal diversity of opinion expressions. For fine-grained sentiment analysis, handling the full variety of opinion expressions is indispensable. While adjectives in particular have often been found to be very useful cues for automatic sentiment analysis (Wiebe, 2000; Benamara et al., 2007), evaluative meaning pervades all major lexical classes. There are many subjective multi-words and idioms such as give away the store, and evaluative meaning also attaches to grammatical constructions, even ones without obligatory lexical material. An example is the construction exemplified by Him be a doctor? This so-called What, me worry?-construction (Fillmore, 1989) consists only of an NP and an infinitive phrase. Its rhetorical effect is to express the speaker's surprise or incredulity about the proposition under consideration. The FrameNet database schema accommodates not only single and multi-words but also handles data for a constructicon (Fillmore et al., to appear) that pairs grammatical constructions with meanings.

Multiple opinions. We need to accommodate multiple opinions relating to the same predicate, as in the case of despoil mentioned above. Predicates with multiple opinions are not uncommon: in a 100-item random sample taken from the Pittsburgh subjectivity clues, 17 involved multiple opinions. The use of opinion frames as described above enables us to readily represent multiple opinions. For instance, the verb brag in the modified Bragging frame has two opinion frames. The first one has positive polarity and represents the frame-internal point of view: the SPEAKER is the Source relative to the TOPIC as the Target. The second opinion frame has negative polarity, representing the reporter's point of view: the SPEAKER is the Target, but the Source is unspecified, indicating that it needs to be resolved to an embedded source. For a similar representation of multiple opinions in a Dutch lexical resource, see Maks and Vossen (2011).
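A minimal sketch of how such opinion frames might be stored alongside a FrameNet-style lexical unit is given below; the class and field names are our own illustration and do not reproduce the actual SentiFrameNet database schema.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class OpinionFrame:
        source: Optional[str]   # frame element serving as Source; None = unspecified
        target: Optional[str]   # frame element serving as Target
        polarity: str           # "positive" or "negative"
        intensity: Optional[float] = None

    # The verb 'brag' in the modified Bragging frame carries two opinion
    # frames: the frame-internal view (Speaker positive about the Topic) and
    # the reporter's view (negative about the Speaker, with the Source left
    # unspecified so it can be resolved to an embedded source in context).
    brag_opinions: List[OpinionFrame] = [
        OpinionFrame(source="Speaker", target="Topic", polarity="positive"),
        OpinionFrame(source=None, target="Speaker", polarity="negative"),
    ]

    lexicon = {("brag.v", "Bragging"): brag_opinions}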
Event structure and presuppositions. A complete representation of subjectivity needs to include event and presuppositional structure. This is necessary, for instance, for predicates like come around (on) in (1), which involve changes of opinion relative to the same target by the same source. Without the possibility of distinguishing between attitudes held at different times, the sentiment associated with these predicates cannot be modeled adequately.

(1) Newsom is still against extending weekday metering to evenings, but has COME AROUND on Sunday enforcement.

For come around (on), we want to distinguish its semantics from that of predicates such as ambivalent and conflicted, where a COGNIZER simultaneously holds opposing valuations of (aspects of) a target. Following FrameNet's practice, we model presupposed knowledge explicitly in SentiFrameNet by using additional frames and frame relations. A partial analysis of come around is sketched in Figure 1.

[Figure 1: Frame analysis for "Come around"]

We use the newly added Come around scenario frame as a background frame that ties together all the information we have about instances of coming around. Indicated by the dashed lines in the figure are the SUBFRAMES of the scenario. Among them are three instances of the Deciding frame (solid lines), all related temporally (dash-dotted lines) and in terms of content to an ongoing Discussion. The initial difference of opinion is encoded by the fact that Deciding1 and Deciding2 share the same POSSIBILITIES but differ in the DECISION. The occurrence of Come_around leads to Deciding3, which has the same COGNIZER as Deciding1, but its DECISION is now identical to that in Deciding2, which has remained unchanged. The sentiment information we need is encoded by simply stating that there is a sentiment of positive polarity of the COGNIZER (as source) towards the DECISION (as target) in the Deciding frame. (This opinion frame is not displayed in the graphic.) The Come around frame itself is not associated with sentiment information, which seems right given that it does not include a DECISION as a frame element but only the ISSUE.
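The scenario just described can be paraphrased in code as three time-ordered Deciding instances; this is only a reading aid, and the concrete fillers below are assumed for illustration rather than taken from the analysis above.

    from dataclasses import dataclass

    @dataclass
    class Deciding:
        cognizer: str
        possibilities: frozenset
        decision: str
        time: int  # relative temporal index within the scenario

    options = frozenset({"for Sunday enforcement", "against Sunday enforcement"})
    deciding1 = Deciding("Newsom", options, "against Sunday enforcement", time=0)
    deciding2 = Deciding("other party", options, "for Sunday enforcement", time=0)
    deciding3 = Deciding("Newsom", options, deciding2.decision, time=1)

    # Deciding3 keeps Deciding1's COGNIZER but now matches Deciding2's
    # DECISION; the sentiment needed is simply a positive-polarity opinion
    # of the COGNIZER (source) towards the DECISION (target).
    assert deciding3.cognizer == deciding1.cognizer
    assert deciding3.decision == deciding2.decision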
For a discussion of how SentiFrameNet captures factuality presuppositions by building on Saurí (2008)'s work on event factuality, we refer the interested reader to the longer version of the paper.

Modulation, coercion and composition. Speakers can shift the valence or polarity of sentiment-bearing expressions through some kind of negation operator, or intensify or attenuate the impact of an expression. Despite these interacting influences, it is desirable to have at least a partial ordering among predicates related to the same semantic scale; we want to be able to find out from our resource that good is less positive than excellent, while there may be no ordering between terrific and excellent. In SentiFrameNet, an ordering between the polarity strength values of different lexical units is added on the level of frames.

The frame semantic approach also offers new perspectives on sentiment composition. We can, for instance, recognize cases of presupposed sentiment, as in the case of the noun revenge, which are not amenable to shifting by negation: She did not take revenge does not imply that there is no negative evaluation of some injury inflicted by an offender.

Further, many cases of what has been called valence shifting are for us cases where the evaluation is wholly contained in a predicate.

(2) Just barely AVOIDED an accident today.
(3) I had served the bank for 22 years and had AVOIDED a promotion since I feared that I would be transferred out of Chennai city.

If we viewed avoid as a polarity shifter and further treated nouns like promotion and accident as sentiment-bearing (rather than treating them as denoting events that affect somebody positively or negatively), we should expect that while (2) has positive sentiment, (3) has negative sentiment. But that is not so: accomplished intentional avoiding is always positive for the avoider. Also, the reversal analysis for avoid cannot deal with complements that have no inherent polarity. It readily follows from the coercion analysis that I avoid running into her is negative, but that cannot be derived in, e.g., Moilanen and Pulman (2007)'s compositional model, which takes into account inherent lexical polarity, which run (into) lacks. The fact that avoid imposes a negative evaluation by its subject on its object can easily be modeled using opinion frames.
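As an aside, the partial ordering of polarity strengths mentioned above (good below excellent; terrific and excellent unordered) can be encoded as a simple relation over lexical units; the representation below is our own sketch, not the SentiFrameNet format.

    def transitive_closure(pairs):
        # Close the (weaker, stronger) relation under transitivity.
        closure = set(pairs)
        changed = True
        while changed:
            changed = False
            for a, b in list(closure):
                for c, d in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        changed = True
        return closure

    # (weaker, stronger) pairs for LUs on the same semantic scale.
    ORDER = transitive_closure({("good.a", "excellent.a")})

    def compare(lu1, lu2):
        if (lu1, lu2) in ORDER:
            return "weaker"
        if (lu2, lu1) in ORDER:
            return "stronger"
        return "no ordering"

    print(compare("good.a", "excellent.a"))      # weaker
    print(compare("terrific.a", "excellent.a"))  # no ordering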
4 Impact and Conclusions

Deep analysis. Tying sentiment analysis to frame semantics enables immediate access to a deeper lexical semantics. Given particular application interests, for instance identifying statements of uncertainty, frames and lexical units relevant to the task can be pulled out easily from the general resource. A frame-based treatment also improves over resources such as SentiWordNet (Baccianella et al., 2008), which, while representing word meanings, lacks any representation of semantic roles.

Theoretical insights. New research questions await, among them: whether predicates with multiple opinions can be distinguished automatically from ones with only one, and whether predicates carrying factivity or other sentiment-related presuppositions can be discovered automatically. Further, our approach lets us ask how contextual sentiment really is, and how much of the analysis of pragmatic annotations can be derived from lexical and syntactic knowledge.

Evaluation. With a frame-based representation, the units of annotation are pre-defined by a general frame semantic inventory, and systems can readily know what kind of units to target as potential opinion-bearing expressions. Once inherent semantics and pragmatics are distinguished, the correctness of inferred (pragmatic) targets and the polarity towards them can be weighted differently from that of immediate (semantic) targets and their polarity.

Synergy. On our approach, lexically inherent sentiment information need not be annotated; it can be imported automatically once the semantic frame's roles are annotated. Only pragmatic information needs to be labeled manually. By expanding the FrameNet inventory and creating annotations, we improve a lexical resource and create role-semantic annotations as well as doing sentiment analysis.

We have proposed SentiFrameNet as a linguistically sound, deep representation for sentiment analysis, extending an existing resource. Our approach complements pragmatic approaches, allows us to join forces with related work in NLP (e.g. role labeling, event factuality) and enables new insights into the theoretical foundations of sentiment analysis.

References

S. Baccianella, A. Esuli, and F. Sebastiani. 2008. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC'10, pages 2200–2204. ELRA.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL 1998, pages 86–90. ACL.
Alexandra Balahur, Jesús M. Hermida, and Andrés Montoyo. 2011. Detecting implicit expressions of sentiment in text based on commonsense knowledge. In Proceedings of WASSA 2.011, pages 53–60, Portland, Oregon. ACL.
Farah Benamara, Sabatier Irit, Carmine Cesarano, Napoli Federico, and Diego Reforgiato. 2007. Sentiment analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings of the International Conference on Weblogs and Social Media, pages 1–4.
Brett Drury, Gaël Dias, and Luís Torgo. 2011. A contextual classification strategy for polarity analysis of direct quotations from financial news. In Proceedings of RANLP 2011, pages 434–440, Hissar, Bulgaria.
Charles J. Fillmore, Russell Lee-Goldman, and Russell Rhodes. To appear. The FrameNet Constructicon. In Sign-based Construction Grammar. CSLI, Stanford, CA.
Charles J. Fillmore. 1989. Grammatical construction theory and the familiar dichotomies. In R. Dietrich and C.F. Graumann, editors, Language processing in social context, pages 17–38. North-Holland/Elsevier, Amsterdam.
S.M. Kim and E. Hovy. 2006. Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of the Workshop on Sentiment and Subjectivity in Text, pages 1–8. ACL.
Isa Maks and Piek Vossen. 2011. A verb lexicon model for deep sentiment analysis and opinion mining applications. In Proceedings of WASSA 2.011, pages 10–18, Portland, Oregon. ACL.
Karo Moilanen and Stephen Pulman. 2007. Sentiment composition. In Proceedings of RANLP 2007, Borovets, Bulgaria.
A. Neviarouskaya, H. Prendinger, and M. Ishizuka. 2009. SentiFul: Generating a reliable lexicon for sentiment analysis. In Proceedings of ACII 2009, pages 1–6. IEEE.
J. Ruppenhofer, S. Somasundaran, and J. Wiebe. 2008. Finding the sources and targets of subjective expressions. In Proceedings of LREC 2008, Marrakech, Morocco.
Roser Saurí. 2008. A Factuality Profiler for Eventualities in Text. Ph.D. thesis, Brandeis University.
Yohei Seki. 2007. Crosslingual opinion extraction from author and authority viewpoints at NTCIR-6. In Proceedings of the NTCIR-6 Workshop Meeting, Tokyo, Japan.
Mostafa Shaikh, Helmut Prendinger, and Mitsuru Ishizuka. 2007. Assessing sentiment of text by semantic dependency and contextual valence analysis. Affective Computing and Intelligent Interaction, pages 191–202.
Veselin Stoyanov and Claire Cardie. 2008. Topic identification for fine-grained opinion analysis. In Proceedings of COLING 2008, pages 817–824. ACL.
Veselin Stoyanov and Claire Cardie. 2011. Automatically creating general-purpose opinion summaries from text. In Proceedings of RANLP 2011, pages 202–209, Hissar, Bulgaria.
Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. In Proceedings of ACL-10, the 48th Annual Meeting of the ACL.
Janyce Wiebe and Rada Mihalcea. 2006. Word sense and subjectivity. In Proceedings of COLING-ACL 2006, pages 1065–1072. ACL.
Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2/3):164–210.
Janyce Wiebe. 2000. Learning subjective adjectives from corpora. In Proceedings of AAAI-2000, pages 735–740, Austin, Texas.
Michael Wiegand. 2010. Hybrid approaches to sentiment analysis. Ph.D. thesis, Saarland University, Saarbrücken.
Theresa Ann Wilson. 2008. Fine-grained Subjectivity and Sentiment Analysis: Recognizing the Intensity, Polarity, and Attitudes of Private States. Ph.D. thesis, University of Pittsburgh.
On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators

Dasha Bogdanova (University of Saint Petersburg, [email protected])
Paolo Rosso (NLE Lab - ELiRF, Universitat Politècnica de València)
Thamar Solorio (CoRAL Lab, University of Alabama at Birmingham)
Abstract

According to previous work on pedophile psychology and cyberpedophilia, sentiments and emotions in texts could be a good clue to detect online sexual predation. In this paper, we suggest a list of high-level features, including sentiment and emotion based ones, for the detection of online sexual predation. In particular, since pedophiles are known to be emotionally unstable, we were interested in investigating whether emotion-based features could help in their detection. We have used a corpus of predators' chats with pseudo-victims downloaded from www.perverted-justice.com and two negative datasets of different nature: cybersex logs available online and the NPS chat corpus. Naive Bayes classification based on the proposed features achieves accuracies of up to 94%, while baseline systems of word and character n-grams can only reach up to 72%.

1 Introduction

Child sexual abuse and pedophilia are both problems of great social concern. On the one hand, law enforcement is working on prosecuting and preventing child sexual abuse. On the other hand, psychologists and mental health specialists are investigating the phenomenon of pedophilia. Even though pedophilia has been studied from different research angles, it remains a very important problem which requires further research, especially from the automatic detection point of view.

Previous studies report that in the majority of cases of sexual assault the victims are underaged (Snyder, 2000). On the Internet, attempts to solicit children have become common as well: Mitchell (2001) found that 19% of children have been sexually approached online. However, manual monitoring of each conversation is impossible, due to the massive amount of data and to privacy issues, so the development of reliable tools for detecting pedophilia in online social media is of great importance.

In this paper, we address the problem of detecting pedophiles with natural language processing (NLP) techniques. This problem becomes even more challenging because of the specificity of chat data. Chat conversations are very different not only from written text but also from other types of social media interactions, such as blogs and forums, since chatting on the Internet usually involves very fast typing. The data usually contains a large amount of mistakes, misspellings, specific slang, character flooding, etc. Therefore, accurate processing of this data with automated syntactic analyzers is rather challenging.

Previous research on pedophilia reports that the expression of certain emotions in text could be helpful to detect pedophiles in social media (Egan et al., 2011). Following these insights, we suggest a list of features including sentiments as well as other content-based features, and we investigate the impact of these features on the problem of automatic detection of online sexual predation. Our experimental results show that classification based on such features discriminates pedophiles from non-pedophiles with high accuracy.

The remainder of the paper is structured as follows: Section 2 overviews related work on the topic, and Section 3 outlines the profile of a pedophile based on previous research. Our approach to the problem of detecting pedophiles in social media on the basis of high-level features is presented in Section 4. Experimental data is described in Section 5. We show the results of the conducted experiments in Section 6; they are followed by discussion and plans for future research in Section 7. We finally draw some conclusions in Section 8.
2 Related Research

The problem of automatic detection of pedophiles in social media has rarely been addressed so far. In part, this is due to the difficulties involved in having access to useful data. There is an American foundation called Perverted Justice (PJ). It investigates cases of online sexual predation: adult volunteers enter chat rooms as juveniles (usually 12-15 years old) and, if they are sexually solicited by adults, they work with the police to prosecute the offenders. Some chat conversations with online sexual predators are available at www.perverted-justice.com, and they have been the subject of analysis in recent research on this topic.

Pendar (2007) experimented with PJ data. He separated the lines written by pedophiles from those written by pseudo-victims and used a kNN classifier based on word n-grams to distinguish between them.

Another related piece of research was carried out by McGhee et al. (2011). The chat lines from PJ were manually classified into the following categories:

1. Exchange of personal information
2. Grooming
3. Approach
4. None of the above classes

Their experiments have shown that kNN classification achieves up to 83% accuracy and outperforms a rule-based approach.

As already mentioned, pedophiles often create false profiles and pretend to be younger or of another gender. Moreover, they try to copy children's behavior. Automatically detecting age and gender in chat conversations could therefore be a first step in detecting online predators. Peersman et al. (2011) have analyzed chats from the Belgian Netlog social network. Discriminating those who are older than 16 from those who are younger with a Support Vector Machine classifier yields 71.3% accuracy. The accuracy is even higher when the age gap is increased (e.g. the accuracy of separating those who are less than 16 from those who are older than 25 is 88.2%). They have also investigated how much training data is needed: their experiments have shown that with 50% of the original dataset the accuracy remains almost the same, and with only 10% it is still much better than the random baseline performance.

NLP techniques have also been applied to capture child sexual abuse data in P2P networks (Panchenko et al., 2012). The proposed text classification system is able to predict with high accuracy whether a file contains child pornography by analyzing its name and textual description.

Our work aims neither at classifying chat lines into categories, as was done by McGhee et al. (2011), nor at discriminating between victim and predator, as was done by Pendar (2007), but at distinguishing between pedophiles' and non-pedophiles' chats, in particular by utilizing clues provided by psychology and sentiment analysis.

3 Profiling the Pedophile

Pedophilia is a "disorder of adult personality and behavior" which is characterized by sexual interest in prepubescent children (International statistical classification of diseases and related health problems, 1988). Even though solicitation of children is not a medical diagnosis, Abel and Harlow (2001) reported that 88% of child sexual abuse cases are committed by pedophiles. Therefore, we believe that understanding the behavior of pedophiles could help to detect and prevent online sexual predation. Even though an online sexual offender is not always a pedophile, in this paper we use these terms as synonyms.

Previous research reports that about 94% of sexual offenders are males. With respect to female sexual molesters, it is reported that they tend to be young and that, in these cases, men are often involved as well (Vandiver and Kercher, 2004). Sexual assault offenders are more often adults (77%), though in 23% of cases children are solicited by other juveniles.

Analyses of pedophiles' personality characterize them by feelings of inferiority, isolation, loneliness, low self-esteem and emotional immaturity. Moreover, 60%-80% of them suffer from other psychiatric illnesses (Hall and Hall, 2007). In general, pedophiles are less emotionally stable than mentally healthy people.
3.1 Profile of the Online Sexual Predator

Hall and Hall (2007) noticed that five main types of computer-based sexual offenders can be distinguished: (1) the stalkers, who approach children in chat rooms in order to get physical access to them; (2) the cruisers, who are interested in online sexual molestation and not willing to meet children offline; (3) the masturbators, who watch child pornography; (4) the networkers or swappers, who trade information, pornography, and children; and (5) a combination of the four types. In this study we are interested in detecting stalkers (type (1)) and cruisers (type (2)).

The language sexual offenders use was analyzed by Egan et al. (2011). The authors considered the chats available from PJ, and their analysis revealed several characteristics of predators' language:

• Implicit/explicit content. On the one hand, predators shift gradually to the sexual conversation, starting with more ordinary compliments:

Predator: hey you are really cute
Predator: u are pretty
Predator: hi sexy

On the other hand, the conversation then becomes overtly related to sex. They do not hide their intentions:

Predator: i know
Predator: can we have sex?
Predator: you ok with sex with me and drinking?

• Fixated discourse. Predators are not willing to step aside from the sexual conversation. For example, in this conversation the predator almost ignores the question of the pseudo-victim and comes back to the sex-related conversation:

Predator: licking dont hurt
Predator: its like u lick ice cream
Pseudo-victim: do u care that im 13 in march and not yet? i lied a little bit b4
Predator: its all cool
Predator: i can lick hard

• Offenders often understand that what they are doing is not moral:

Predator: i would help but its not moral

• They transfer responsibility to the victim:

Pseudo-victim: what ya wanta do when u come over
Predator: whatever–movies, games, drink, play around–it's up to you–what would you like to do?
Pseudo-victim: that all sounds good
Pseudo-victim: lol
Predator: maybe get some sexy pics of you :-P
Predator: would you let me take pictures of you? of you naked? of me and you playing? :-D

• Predators often behave as children, copying their linguistic style. Colloquialisms appear often in their messages:

Predator: howwwww dy
...
Predator: PITY MEEEE

• They try to minimize the risk of being prosecuted: they ask to delete chat logs and warn victims not to tell anyone about the talk:

Predator: don't tell anyone we have been talking
Pseudo-victim: k
Pseudo-victim: lol who would i tell? no one's here.
Predator: well I want it to be our secret

• Though they finally stop being cautious and insist on meeting offline:

Predator: well let me come see you
Pseudo-victim: why u want 2 come over so bad?
Predator: i wanna see you

In general, Egan et al. (2011) have found online solicitation to be more direct, while in real life the seduction of children is more deceitful.
4 Our Approach

We address the problem of automatic detection of online sexual predation. While previous studies focused on classifying chat lines into different categories (McGhee et al., 2011) or on distinguishing between offender and victim (Pendar, 2007), in this work we address the problem of detecting sexual predators. We formulate the problem of detecting pedophiles in social media as a task of binary text categorization: given a text (a set of chat lines), the aim is to predict whether it is a case of cyberpedophilia or not.

4.1 Features

On the basis of the previous analysis of pedophiles' personality (described in the previous section), we consider as features those emotional markers that could unveil a certain degree of emotional instability, such as feelings of inferiority, isolation, loneliness, low self-esteem and emotional immaturity.

On the one hand, pedophiles try to be nice with a victim and make compliments, at least in the beginning of a conversation; therefore, the use of positive words is expected. On the other hand, as described earlier, pedophiles tend to be emotionally unstable and prone to losing their temper, hence they might start using words expressing anger and a negative lexicon. Other emotions can be a clue to detect pedophiles as well. For example, offenders often demonstrate fear, especially with respect to being prosecuted, and they often lose their temper and express anger:

Pseudo-victim: u sad didnt car if im 13. now u car.
Predator: well, I am just scared about being in trouble or going to jail
Pseudo-victim: u sad run away now u say no. i gues i dont no what u doin
Predator: I got scared
Predator: we would get caugth sometime

In this example the pseudo-victim is not answering:

Predator: hello
Predator: r u there
Predator: thnx a lot
Predator: thanx a lot
Predator: u just wast my time
Predator: drive down there
Predator: can u not im any more

Here the offender is angry because the pseudo-victim did not call him:

Predator: u didnt call
Predator: i m angry with u

Therefore, we have decided to use markers of basic emotions as features. At the SemEval 2007 task on "Affective Text" (Strapparava and Mihalcea, 2007), the problem of fine-grained emotion annotation was defined: given a set of news titles, the system is to label each title with the appropriate emotion out of the following list: ANGER, DISGUST, FEAR, JOY, SADNESS, SURPRISE. In this research work we only use the percentages of the markers of each emotion.
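A rough sketch of how such per-emotion marker percentages can be computed for a chat is shown below; the tiny word lists are illustrative stand-ins for the full WordNet-Affect categories, of which only a few example entries are quoted in this section.

    import re

    EMOTION_MARKERS = {
        "JOY": {"happy", "cheer"},
        "SADNESS": {"bored", "sad"},
        "ANGER": {"annoying", "furious", "angry"},
        "SURPRISE": {"astonished", "wonder"},
        "DISGUST": {"yucky", "nausea"},
        "FEAR": {"scared", "panic"},
    }

    def emotion_percentages(chat_text):
        # Percentage of tokens that belong to each emotion's marker list.
        tokens = re.findall(r"[a-z']+", chat_text.lower())
        total = len(tokens) or 1
        return {emotion: 100.0 * sum(t in markers for t in tokens) / total
                for emotion, markers in EMOTION_MARKERS.items()}

    print(emotion_percentages("i m angry with u ... well, I am just scared"))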
We have also borrowed several features from McGhee et al. (2011):

• Percentage of approach words. Approach words include verbs such as come and meet and nouns such as car and hotel.
• Percentage of relationship words. These words refer to dating (e.g. boyfriend, date).
• Percentage of family words. These are the names of family members (e.g. mum, dad, brother).
• Percentage of communicative desensitization words. These are explicit sexual terms offenders use in order to desensitize the victim (e.g. penis, sex).
• Percentage of words expressing sharing of information. This covers sharing basic information, such as age, gender and location, and sending photos; the words include asl and pic.

Since pedophiles are known to be emotionally unstable and to suffer from psychological problems, we also consider features reported by Argamon et al. (2009) to be helpful for detecting the level of neuroticism. In particular, these features include the percentages of personal and reflexive pronouns and of modal obligation verbs (have to, has to, had to, must, should, mustn't, and shouldn't).

We consider the use of imperative sentences and emoticons to capture the predators' tendencies to be dominant and to copy children's behaviour, respectively.

The study of Egan et al. (2011) has revealed several recurrent themes that appear in PJ chats. Among them is fixated discourse: the unwillingness of the predator to change the topic. In Bogdanova et al. (2012) we present experiments on modeling fixated discourse. We have constructed lexical chains (Morris and Hirst, 1991) starting with the anchor word "sex" in its first WordNet meaning: "sexual activity, sexual practice, sex, sex activity (activities associated with sexual intercourse)". As a feature we use the length of the lexical chain constructed with the Resnik similarity measure (Resnik, 1995) with a threshold of 0.7.

The full list of features is presented in Table 1.

Table 1: Features used in the experiments.
- Emotional markers: positive words (cute, pretty) and negative words (dangerous, annoying), from SentiWordNet (Baccianella et al., 2010); JOY words (happy, cheer), SADNESS words (bored, sad), ANGER words (annoying, furious), SURPRISE words (astonished, wonder), DISGUST words (yucky, nausea) and FEAR words (scared, panic), from WordNet-Affect (Strapparava and Valitutti, 2004).
- Features borrowed from McGhee et al. (2011): approach words (meet, car), relationship nouns (boyfriend, date), family words (mum, dad), communicative desensitization words (sex, penis), information words (asl, home).
- Features helpful to detect neuroticism level (Argamon et al., 2009): personal pronouns (I, you), reflexive pronouns (myself, yourself), obligation verbs (must, have to).
- Features derived from the pedophile's psychological profile (Bogdanova et al., 2012): fixated discourse (see Section 3.1).
- Other: emoticons (8), :( ) and imperative sentences (Do it!).
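Returning to the fixated-discourse entry in Table 1, a rough sketch of the lexical-chain feature follows. The greedy linking strategy and the use of Brown-corpus information content are assumptions added here for illustration; the description above fixes only the Resnik measure and the 0.7 threshold (NLTK with the WordNet, SentiWordNet and wordnet_ic data is assumed throughout).

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')
    ANCHOR = wn.synsets('sex', pos=wn.NOUN)[0]  # first sense: 'sexual activity'

    def chain_length(chat_tokens, threshold=0.7):
        # Grow a chain from the anchor; a token joins the chain if one of its
        # noun senses is Resnik-similar enough to the anchor synset.
        chain = [ANCHOR]
        for token in chat_tokens:
            for synset in wn.synsets(token, pos=wn.NOUN):
                try:
                    if ANCHOR.res_similarity(synset, brown_ic) >= threshold:
                        chain.append(synset)
                        break
                except Exception:  # skip synsets without usable IC values
                    continue
        return len(chain)

    print(chain_length(["movie", "kiss", "homework", "sex"]))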
5 Datasets

Pendar (2007) has summarized the possible types of chat interactions with sexually explicit content:

1. Predator/Other
   (a) Predator/Victim (victim is underaged)
   (b) Predator/Volunteer posing as a child
   (c) Predator/Law enforcement officer posing as a child
2. Adult/Adult (consensual relationship)

The most interesting data from our research point of view is data of type 1a, but obtaining such data is not easy. However, data of type 1b is freely available at the website www.perverted-justice.com. For our study, we have extracted chat logs from the perverted-justice website. Since the victim is not real, we considered only the chat lines written by predators.

Since our goal is to distinguish sex-related chat conversations in which one of the parties involved is a pedophile, the ideal negative dataset would be chat conversations of type 2 (consensual relations among adults), and the PJ data does not meet this condition for the negative instances. We therefore need additional chat logs to build the negative dataset, and we used two negative datasets in our experiments: cybersex chat logs and the NPS chat corpus.

We downloaded the cybersex chat logs available at www.oocities.org/urgrl21f/. The archive contains 34 one-on-one cybersex logs. We have separated the lines of different authors, thereby obtaining 68 files. We have also used a subset of the NPS chat corpus (Forsyth and Martell, 2007), though it is not of type 2. We have extracted chat lines only for those adult authors who had written more than 30 lines; the resulting dataset consists of 65 authors. From each dataset we have left 20 files for testing.

6 Experiments

To distinguish between predators and non-predators we used a Naive Bayes classifier, already successfully utilized for analyzing chats in previous research (Lin, 2007). To extract positive and negative words, we used SentiWordNet (Baccianella et al., 2010). The features borrowed from McGhee et al. (2011) were detected with the list of words the authors made available to us. Imperative sentences were detected as affirmative sentences starting with verbs. Emoticons were captured with simple regular expressions.

Our dataset is imbalanced: the majority of the chat logs are from PJ. To make the experimental data more balanced, we have created 5 subsets of the PJ corpus, each of which contains chat lines from 60 randomly selected predators. For the cybersex logs, half of the chat sessions belong to the same author; we used this author for training and the rest for testing, in order to prevent the classification algorithm from learning to distinguish this author from pedophiles.
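A minimal sketch of this feature-extraction step appears below. The first-sense heuristic for SentiWordNet polarity, the emoticon pattern and the tiny verb list for the imperative test are all assumptions added here for illustration; the description above leaves those details open.

    import re
    from nltk.corpus import sentiwordnet as swn

    EMOTICON_RE = re.compile(r"[:;8][-]?[()DP]")
    VERBS = {"do", "come", "call", "tell", "send", "meet"}  # illustrative only

    def polarity_of(word):
        # Label a word via its first SentiWordNet sense (an assumption; the
        # paper names word sense disambiguation as future work).
        senses = list(swn.senti_synsets(word))
        if not senses:
            return None
        first = senses[0]
        if first.pos_score() > first.neg_score():
            return "positive"
        if first.neg_score() > first.pos_score():
            return "negative"
        return None

    def high_level_features(chat_lines):
        text = " ".join(chat_lines)
        tokens = re.findall(r"[a-zA-Z']+", text.lower())
        total = len(tokens) or 1
        imperatives = sum(1 for line in chat_lines
                          if line.split()
                          and line.split()[0].strip("!.?").lower() in VERBS)
        return {
            "pct_positive": 100.0 * sum(polarity_of(t) == "positive" for t in tokens) / total,
            "pct_negative": 100.0 * sum(polarity_of(t) == "negative" for t in tokens) / total,
            "n_emoticons": len(EMOTICON_RE.findall(text)),
            "n_imperatives": imperatives,
        }

    print(high_level_features(["Do it!", "you are really cute :)"]))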
For comparison purposes, we experimented with several baseline systems using low-level features based on n-grams at the word and character level, which have been reported as useful features in related research (Peersman et al., 2011). We trained Naive Bayes classifiers using word-level unigrams, bigrams and trigrams, as well as character-level bigrams and trigrams.

The classification results are presented in Tables 2 and 3. The high-level features outperform all the low-level ones on both the cybersex logs and the NPS chat datasets, achieving 94% and 90% accuracy on these datasets respectively.

Table 2: Results of Naive Bayes classification applied to perverted-justice data and cybersex chat logs (accuracy).
Run      High-level  Bag of  Term     Term      Character  Character
         features    words   bigrams  trigrams  bigrams    trigrams
Run 1    0.93        0.38    0.55     0.60      0.73       0.78
Run 2    0.95        0.40    0.50     0.53      0.75       0.45
Run 3    0.95        0.70    0.45     0.53      0.48       0.50
Run 4    0.98        0.43    0.53     0.53      0.50       0.38
Run 5    0.90        0.50    0.48     0.53      0.45       0.50
Average  0.94        0.48    0.50     0.54      0.58       0.52

Table 3: Results of Naive Bayes classification applied to perverted-justice data and NPS chats (accuracy).
Run      High-level  Bag of  Term     Term      Character  Character
         features    words   bigrams  trigrams  bigrams    trigrams
Run 1    0.93        0.73    0.60     0.60      0.68       0.75
Run 2    0.95        0.68    0.53     0.53      0.48       0.45
Run 3    0.95        0.58    0.53     0.53      0.48       0.85
Run 4    0.98        0.53    0.53     0.53      0.23       0.80
Run 5    0.90        0.53    0.53     0.53      0.25       0.75
Average  0.92        0.61    0.54     0.54      0.42       0.72

Cybersex chat logs are data of type 2 (see the previous section): they contain sexual content and therefore share some of the same vocabulary with the perverted-justice data, whilst the NPS data generally is not sex-related. Therefore, we expected the low-level features to provide better results on the NPS data. The experiments have shown that, except for the character bigrams, all low-level features considered indeed work worse in the case of the cybersex logs (see the average rows in both tables); the average accuracy in this case varies between 48% and 58%. Surprisingly, the low-level features do not work as well as we expected on the NPS chat dataset either: bag of words provides only 61% accuracy. Among the other low-level features, character trigrams provide the highest accuracy, 72%, which is still much lower than that of the high-level features (90%). The high-level features yield a lower accuracy on the PJ-NPS dataset (90%) than on the PJ-cybersex logs (94%). This is probably due to data diversity: cybersex chat is a very particular type of conversation, whereas the NPS chat corpus can contain any type of conversation, up to and including sexual predation.
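For reference, a character-trigram Naive Bayes baseline of the kind compared against above can be assembled in a few lines; scikit-learn is used here purely for convenience, as neither the toolkit nor the exact preprocessing is specified in the description above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = ["can we have sex?", "the movie was great fun"]   # toy data
    train_labels = ["predator", "not_predator"]

    baseline = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=True),
        MultinomialNB(),
    )
    baseline.fit(train_texts, train_labels)
    print(baseline.predict(["do u wanna meet for some fun?"]))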
7 Discussion and Future Work

We have conducted experiments on detecting pedophiles in social media with a binary classification algorithm. In the experiments we used two negative datasets of different nature: the first one is more appropriate, as it contains one-on-one cybersex conversations, while the second is extracted from the NPS chat corpus and contains logs from chat rooms, and is therefore less appropriate, since the conversations are not even one-on-one.

It is reasonable to expect that, with negative data consisting of cybersex logs, distinguishing cyberpedophiles is a harder task than with the NPS data. The results obtained with the baseline systems support this assumption: we obtain higher accuracy for the NPS chats with all features but character bigrams. The interesting insight from these results is that our proposed higher-level features are able to boost accuracy to 94% on the seemingly more challenging task.

Our error analysis showed that the NPS logs misclassified with the high-level features are also misclassified by the baseline systems. These instances either share the same lexicon or are about the same topics; they are therefore more similar to the cyberpedophile training data than to the training data of the NPS corpus, which is very diverse. The following examples are taken from misclassified NPS chat logs:

User: love me like a bomb baby come on get it on
...
User: ryaon so sexy
User: you are so anal
User: obviously i didn't get it
User: just loosen up babe
...
User: i want to make love to him

User: right field wrong park lol j/k
User: not me i put them in the jail lol
User: or at least tell the cops where to go to get the bad guys lol

In the future we plan to further investigate the misclassified data. The feature extraction we have implemented does not use any word sense disambiguation. This can also cause mistakes, since the markers are not just lemmas but words in particular senses: for example, the lemma "fit" can be either a positive marker ("a fit candidate") or a negative one ("a fit of epilepsy"), depending on the context. Therefore we plan to employ word sense disambiguation techniques during the feature extraction phase.

So far we have only seen that the list of features we have suggested provides good results as a whole: these features outperform all the low-level features considered. Among those low-level features, character trigrams provide the best results on the NPS data (72% accuracy), though on the cybersex logs they achieve only 54%. We plan to merge low-level and high-level features in order to see whether this could improve the results. In the future we also plan to explore the impact of each high-level feature, to better understand which ones carry more discriminative power and whether we can reduce the number of features. All these experiments will be done employing Naive Bayes as well as Support Vector Machines as classifiers.

8 Conclusions

This paper presents some results of an ongoing research project on the detection of online sexual predation, a problem the research community is interested in, as the PAN task on Sexual Predator Identification (http://pan.webis.de/) suggests.

Following the clues given by psychological research, we have suggested a list of high-level features that take into account the level of emotional instability of pedophiles, as well as their feelings of inferiority, isolation, loneliness, low self-esteem, etc. We have also considered such low-level features as character bigrams and trigrams and word unigrams, bigrams and trigrams. Naive Bayes classification based on the high-level features achieves 90% and 94% accuracy when using the NPS chat corpus and the cybersex chat logs as a negative dataset, respectively, whereas the low-level features achieve only 42%-72% and 48%-58% accuracy on the same data.

Acknowledgements

The research of Dasha Bogdanova was carried out during a 3-month internship at the Universitat Politècnica de València (scholarship of the University of St. Petersburg). Her research was partially supported by a Google Research Award. The collaboration with Thamar Solorio was possible thanks to her one-month research visit at the Universitat Politècnica de València (program PAID-02-11, award n. 1932). The research work of Paolo Rosso was done in the framework of the European Commission WIQ-EI IRSES project (grant no. 269180) within the FP7 Marie Curie People programme, the MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i), and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
References

Gene G. Abel and Nora Harlow. 2001. The Abel and Harlow child molestation prevention study. Xlibris, Philadelphia.
Shlomo Argamon, Moshe Koppel, James Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123.
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC 2010.
Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop.
Dasha Bogdanova, Paolo Rosso, and Thamar Solorio. 2012. Modelling fixated discourse in chats with cyberpedophiles. In Proceedings of the EACL Workshop on Computational Approaches to Deception Detection.
Vincent Egan, James Hoskinson, and David Shewan. 2011. Perverted justice: A content analysis of the language used by offenders detected attempting to solicit children for sex. Antisocial Behavior: Causes, Correlations and Treatments.
Eric N. Forsyth and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In Proceedings of the International Conference on Semantic Computing (ICSC 2007), pages 19–26.
Michel Galley and Kathleen McKeown. 2003. Improving word sense disambiguation in lexical chaining. In Proceedings of IJCAI-2003.
Ryan C. W. Hall and Richard C. W. Hall. 2007. A profile of pedophilia: Definition, characteristics of offenders, recidivism, treatment outcomes, and forensic issues. Mayo Clinic Proceedings.
David Hope. Java WordNet Similarity Library. http://www.cogs.susx.ac.uk/users/drh21.
Claudia Leacock and Martin Chodorow. 2003. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405.
Timothy Leary. 1957. Interpersonal diagnosis of personality: a functional theory and methodology for personality evaluation. Ronald Press, Oxford, England.
Jane Lin. 2007. Automatic author profiling of online chat logs. PhD thesis.
India McGhee, Jennifer Bayzick, April Kontostathis, Lynne Edwards, Alexandra McBride, and Emma Jakubowski. 2011. Learning to identify Internet sexual predation. International Journal on Electronic Commerce.
Kimberly J. Mitchell, David Finkelhor, and Janis Wolak. 2001. Risk factors for and impact of online sexual solicitation of youth. Journal of the American Medical Association, 285:3011–3014.
Jane Morris and Graeme Hirst. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21–43.
Alexander Panchenko, Richard Beaufort, and Cédrick Fairon. 2012. Detection of child sexual abuse media on P2P networks: Normalization and classification of associated filenames. In Proceedings of the LREC Workshop on Language Resources for Public Security Applications.
Ted Pedersen, Siddharth Patwardhan, Jason Michelizzi, and Satanjeev Banerjee. WordNet::Similarity. http://wn-similarity.sourceforge.net/.
Claudia Peersman, Walter Daelemans, and Leona Van Vaerenbergh. 2011. Predicting age and gender in online social networks. In Proceedings of the 3rd Workshop on Search and Mining User-Generated Contents.
Nick Pendar. 2007. Toward spotting the pedophile: Telling victim from predator in text chats. In Proceedings of the International Conference on Semantic Computing, pages 235–241, Irvine, California.
Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI, pages 448–453.
Howard N. Snyder. 2000. Sexual assault of young children as reported to law enforcement: Victim, incident, and offender characteristics. A NIBRS statistical report. Bureau of Justice Statistics Clearinghouse.
Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74.
Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation.
Frederik Vaassen and Walter Daelemans. 2011. Automatic emotion classification for interpersonal communication. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), pages 104–110. ACL.
Donna M. Vandiver and Glen Kercher. 2004. Offender and victim characteristics of registered female sexual offenders in Texas: A proposed typology of female sexual offenders. Sex Abuse, 16:121–137.
World Health Organization. 1988. International statistical classification of diseases and related health problems: ICD-10, section F65.4: Paedophilia.
Author Index

Abdul-Mageed, Muhammad, 19
Anand, Pranav, 84
Ando, Maya, 47
Araki, Kenji, 89
Arora, Piyush, 11
Bakliwal, Akshat, 11
Balahur, Alexandra, 52
Berend, Gábor, 99
Björn, Gambäck, 38
Bogdanova, Dasha, 110
Bonev, Boyan, 29
Das, Amitava, 38
Diab, Mona, 19
Ghazi, Diman, 70
Inkpen, Diana, 70
Kapre, Nikhil, 11
Kuebler, Sandra, 19
Lee, Mark, 79
Madhappan, Senthil, 11
Martín-Valdivia, M. Teresa, 3
Martínez-Cámara, Eugenio, 3
Mihalcea, Rada, 1
Momouchi, Yoshio, 89
Montejo-Ráez, Arturo, 3
Narang, Nalin, 61
Ortiz Rojas, Sergio, 29
Paris, Cecile, 61
Ptaszynski, Michal, 89
Ramírez Sánchez, Gema, 29
Rehbein, Ines, 104
Reschke, Kevin, 84
Rosso, Paolo, 110
Ruppenhofer, Josef, 104
Rzepka, Rafal, 89
Singh, Mukesh, 11
Smith, Phillip, 79
Solorio, Thamar, 110
Szpakowicz, Stan, 70
Thomas, Paul, 61
Turchi, Marco, 52
Ureña-López, L. Alfonso, 3
Varma, Vasudeva, 11
Vincze, Veronika, 99
Wiebe, Janyce, 2
Yin, Jie, 61