EACL 2012
13th Conference of the European Chapter of the Association for Computational Linguistics
Proceedings of the Conference
April 23-27, 2012
Avignon, France

© 2012 The Association for Computational Linguistics
ISBN 978-1-937284-19-0

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860

[email protected]

Preface: General Chair

Welcome to EACL 2012, the 13th Conference of the European Chapter of the Association for Computational Linguistics. We are happy that despite strong competition from other Computational Linguistics events and economic turmoil in many European countries, this EACL is comparable to the successful previous ones, both in terms of the number of papers submitted and in terms of attendance. We have a strong scientific program, including ten workshops, four tutorials, a demo session and a student research workshop. I am convinced that you will appreciate our program.

What does a General Chair at EACL have to do? Not much, it turns out. My job was to act as a liaison between the local organizing team, the scientific committees, and the EACL board, and to give advice when needed. Looking back at the thousands of e-mails I was copied on reminded me of the Jerome K. Jerome quote: "I like work. I can sit and look at it for hours." It has been an enjoyable experience to cooperate with the many people who made this conference happen, and to see them work. I have learned a lot from them.

The Program Committee at an ACL conference is a trained army of Area Chairs, Program Committee members, and additional reviewers. Mirella Lapata and Lluís Màrquez commanded this particular one. It is thanks to the voluntary peer reviewing work, year after year, of this large group of people, formed by the top researchers in our field, that you will find a high-quality program. It is thanks to Mirella and Lluís that you will find not only the quality we expect from EACL, but also innovation, coherence, breadth, and depth. I can't thank them enough for their work on all aspects of the scientific program and for their advice on virtually any other aspect of the organization. Many thanks also to Regina Barzilay, Raymond Mooney, and Martin Cooke for agreeing to present invited lectures and thereby increasing the appeal of this event even more.
As in previous years, the selection of the workshops of all ACL conferences in the same year is coordinated by a single committee. For EACL, Kristiina Jokinen and Alessandro Moschitti collaborated with the NAACL and ACL chairs in reviewing and selecting the workshops. As EACL is the first conference of the three, they had to initiate the call for proposals and activate their colleagues long before they were planning to. Thanks to their professionalism and efficiency, the process went very smoothly, and the resulting workshop program reflects the diversity and maturity of the field.

For even more variety during the first two days of the conference, we also have a strong tutorial program. Tutorial Chairs Lieve Macken and Eneko Agirre managed to attract an impressive list of high-quality submissions and performed a thorough and thoughtful review and selection. It is truly a pity only four could be accommodated in the program, but their quality and timeliness are inspiring. Many thanks to Kristiina, Alessandro, Lieve, and Eneko for making this important part of the scientific program such a success.

As in previous editions of EACL, the Student Research Workshop was organized by the student members of the EACL board: Pierre Lison, Mattias Nilsson, and Marta Recasens, with help from faculty advisor Laurence Danlos. Their task was a huge one: to organize a mini-conference within the conference. This included finding reviewers, selecting papers, setting up a program for the student session, finding mentors for the accepted papers, selecting a best paper award, and more. The amount of work they did cannot be overestimated, and the result is brilliant. Thank you!

To round off the scientific program, we have stimulating demonstration sessions, selected and coordinated by Demonstrations Chair Frédérique Segond. Thank you for showing so clearly the rapid progress application-oriented computational linguistics is making.
Thanks also to Gertjan van Noord and Caroline Sporleder for accepting the role of coordinators of the mentoring service. In the end, they did not have to assign mentors, but it is important that such a service is available when needed.

For EACL 2012 we decided to switch to digital-only proceedings. They were available before the conference from the website, during the conference on the memory stick you received with your registration material, and afterwards from the website and the ACL Anthology. An exception was made for the tutorial notes, which are available to participants on paper as well. I warned the Publications Chairs, Adrià de Gispert and Fabrice Lefèvre, beforehand that theirs was probably the most demanding and stressful task of the conference: making sure that huge volumes of material from so many sources are available in time and in the right format, incorporating last-minute corrections, and handling unavoidable glitches in the publications software. It is a formidable task, but they completed it without flinching. We all owe them our gratitude.

EACL seems to follow economic crises; let us hope it does not become a habit. Both the previous conference in 2009 and the current one happened in grim economic times. Being a Sponsorship Chair is not a happy occasion in such times. Nevertheless, both the international ACL Sponsorship Committee (with Massimiliano Ciaramita as EACL member) and the local Sponsorship Chairs (Eric SanJuan and Stéphane Huet) left no stone unturned looking for sponsors. We would have ended up in a much worse financial situation if it hadn't been for their efforts. Thank you! And of course also many thanks to our sponsors who, despite the economic situation, decided to support the conference financially. I am convinced their investment will be rewarded.

Organizing large conferences like this is a complex undertaking, even with the help of extensive material (the ACL conference handbook).
Whenever in doubt, I have had the opportunity to interact with the EACL Board, and occasionally with the ACL Board and with Priscilla Rasmussen. This has always been a pleasure. I have learned that the people running our associations are dedicated, know everything, and never sleep.

Last but not least, the local organizing team has had to carry the largest burden in the organization. The sheer number of tasks and actions the local organizers of a conference like EACL have to assume is astonishing. Marc El-Beze has been a wonderful chair, and his team (Frederic Bechet, Yann Fernandez, Stéphane Huet, Tania Jimenez, Fabrice Lefevre, Georges Linares, Alexis Nasr, Eric SanJuan, and Iria Da Cunha) has done outstanding work. I cannot even begin to list the many tasks they had to fulfill to make this a top conference. I am very grateful for all the work they put into the event and for the stress-free and friendly cooperation. I am also grateful for the support of the University of Avignon.

I hope you will have many fond memories of EACL 2012, organized in these stunning surroundings in Avignon, both of the exciting scientific program and of the superb social program and local arrangements.

Walter Daelemans
General Chair
March 2012

Preface: Program Chairs

We are delighted to present you with this volume containing the papers accepted for presentation at the 13th Conference of the European Chapter of the Association for Computational Linguistics, held in Avignon, France, from April 23 to April 27, 2012. EACL 2012 received 326 submissions. We were able to accept 85 papers in total (an acceptance rate of 26%). 48 of the papers (14.7%) were accepted for oral presentation, and 34 (10.4%) for poster presentation. One oral paper was subsequently withdrawn after acceptance. The papers were selected by a program committee of 28 area chairs, from Asia, Europe, and North America, assisted by a panel of 471 reviewers.
Each submission was reviewed by three reviewers, who were furthermore encouraged to discuss any divergences they might have, and the papers in each area were ranked by the area chairs. The final selection was made by the program co-chairs after an independent check of all reviews and discussions with the area chairs.

This year EACL introduced an author response period. Authors were able to read and respond to the reviews of their paper before the program committee made a final decision. They were asked to correct factual errors in the reviews and answer questions raised in the reviewers' comments. The intention was to help produce more accurate reviews. In some cases, reviewers changed their scores in view of the authors' response, and the area chairs read all responses carefully prior to making recommendations for acceptance. Another new feature was to allow authors to include optional supplementary material in addition to the paper itself (e.g., code, data sets, and resources). Finally, in an attempt to eliminate any bias from the reviewing process, we put in place a double-blind reviewing system where the identity of the authors was not revealed to the area chairs.

After the program was selected, each of the area chairs was asked to nominate the best paper from his or her area, or to explicitly decline to nominate any. This resulted in several nominations, out of which three stood out and were considered in more detail by a dedicated committee chaired by Stephen Clark. This independent committee selected the best paper of the conference, which will also be awarded a prize sponsored by Google. The best paper and the other two finalists will be presented in plenary sessions at the conference.

In addition to the main conference program, EACL 2012 will feature the now traditional Student Research Workshop, 10 workshops, 4 tutorials, and a demo session with 21 presentations.
We are also fortunate to have three invited speakers: Martin Cooke, Ikerbasque (Basque Foundation for Science); Regina Barzilay, Massachusetts Institute of Technology; and Raymond Mooney, University of Texas at Austin. Martin Cooke will speak about "Speech Communication in the Wild", Regina Barzilay will discuss the topic of "Learning to Behave by Reading", and Raymond Mooney will present on "Learning Language from Perceptual Context".

First and foremost, we would like to thank the authors who submitted their work to EACL. The sheer number of submissions reflects how broad and active our field is. We are deeply indebted to the area chairs and the reviewers for their hard work. They enabled us to select an exciting program and to provide valuable feedback to the authors. We are grateful to our invited speakers who graciously agreed to give talks at EACL. Additional thanks to the Publications Chairs, Adrià de Gispert and Fabrice Lefèvre, who put this volume together. We are grateful to Rich Gerber and the START team, who always responded to our questions quickly and helped us manage the large number of submissions smoothly. Thanks are due to the local organizing committee chair, Marc El-Beze, for his cooperation with us over many organisational issues. We are also grateful to the Student Research Workshop chairs, Pierre Lison, Mattias Nilsson, and Marta Recasens, and the NAACL-HLT (Srinivas Bangalore, Eric Fosler-Lussier and Ellen Riloff) and ACL (Chin-Yew Lin and Miles Osborne) program chairs for their smooth collaboration in the handling of double submissions. Last but not least, we are indebted to the General Chair, Walter Daelemans, for his guidance and support throughout the whole process.

We hope you enjoy the conference!
Mirella Lapata and Lluís Màrquez
EACL 2012 Program Chairs

Organizing Committee

General Chair:
Walter Daelemans, University of Antwerp, Belgium

Programme Committee Chairs:
Mirella Lapata, University of Edinburgh, UK
Lluís Màrquez, Universitat Politècnica de Catalunya, Spain

Area Chairs:
Katja Filippova, Google
Min-Yen Kan, National University of Singapore
Charles Sutton, University of Edinburgh
Ivan Titov, Saarland University
Xavier Carreras, Universitat Politècnica de Catalunya (UPC)
Kenji Sagae, University of Southern California
Kallirroi Georgila, Institute for Creative Technologies, University of Southern California
Michael Strube, HITS gGmbH
Pascale Fung, The Hong Kong University of Science and Technology
Bing Liu, University of Illinois at Chicago
Theresa Wilson, Johns Hopkins University
David McClosky, Stanford University
Sebastian Riedel, University of Massachusetts
Phil Blunsom, University of Oxford
Mikel L. Forcada, Universitat d'Alacant
Christof Monz, University of Amsterdam
Sharon Goldwater, University of Edinburgh
Richard Wicentowski, Swarthmore College
Patrick Pantel, Microsoft Research
Hiroya Takamura, Tokyo Institute of Technology
Alexander Koller, University of Potsdam
Sebastian Padó, Universität Heidelberg
Maarten de Rijke, University of Amsterdam
Julio Gonzalo, UNED
Lori Levin, Carnegie Mellon University
Piek Vossen, VU University Amsterdam
Afra Alishahi, Tilburg University, The Netherlands
John Hale, Cornell University

Workshop Committee Chairs:
Kristiina Jokinen, University of Helsinki, Finland
Alessandro Moschitti, University of Trento, Italy

Tutorials Committee Chairs:
Eneko Agirre, University of the Basque Country, Spain
Lieve Macken, University College Ghent, Belgium

Student Research Workshop Chairs:
Pierre Lison, University of Oslo, Norway
Mattias Nilsson, Uppsala University, Sweden
Marta Recasens, University of Barcelona, Spain

Student Research Workshop Faculty Advisor:
Laurence Danlos, Université Paris 7

System Demonstrations Committee:
Frédérique Segond

Publications Committee:
Adrià de Gispert, University of Cambridge, UK
Fabrice Lefèvre, University of Avignon, France

Sponsorship Committee:
Massimiliano Ciaramita

Mentoring Service:
Caroline Sporleder, Saarland University, Germany
Gertjan van Noord, University of Groningen, The Netherlands

Local Organising Committee:
Marc El-Beze (Chair), University of Avignon, France
Frederic Bechet (Publicity chair), University Aix-Marseille 2, France
Yann Fernandez, University of Avignon, France
Stéphane Huet (Exhibits local chair), University of Avignon, France
Tania Jimenez, University of Avignon, France
Fabrice Lefevre, University of Avignon, France
Georges Linares, University of Avignon, France
Alexis Nasr, University Aix-Marseille 2, France
Eric SanJuan (Sponsorship local chair), University of Avignon, France
Iria Da Cunha, Pompeu Fabra University, Spain

Table of Contents

Speech Communication in the Wild
Martin Cooke . . . 1

Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora
Spyros Martzoukos and Christof Monz . . . 2

A Bayesian Approach to Unsupervised Semantic Role Induction
Ivan Titov and Alexandre Klementiev . . . 12

Entailment above the word level in distributional semantics
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do and Chung-chieh Shan . . . 23

Evaluating Distributional Models of Semantics for Syntactically Invariant Inference
Jackie Chi Kit Cheung and Gerald Penn . . .
33

Cross-Framework Evaluation for Statistical Parsing
Reut Tsarfaty, Joakim Nivre and Evelina Andersson . . . 44

Dependency Parsing of Hungarian: Baseline Results and Challenges
Richárd Farkas, Veronika Vincze and Helmut Schmid . . . 55

Dependency Parsing with Undirected Graphs
Carlos Gómez-Rodríguez and Daniel Fernández-González . . . 66

The Best of Both Worlds – A Graph-based Completion Model for Transition-based Parsers
Bernd Bohnet and Jonas Kuhn . . . 77

Answer Sentence Retrieval by Matching Dependency Paths acquired from Question/Answer Sentence Pairs
Michael Kaisser . . . 88

Can Click Patterns across User's Query Logs Predict Answers to Definition Questions?
Alejandro Figueroa . . . 99

Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Service Context
Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz . . . 109

Computing Lattice BLEU Oracle Scores for Machine Translation
Artem Sokolov, Guillaume Wisniewski and Francois Yvon . . . 120

Toward Statistical Machine Translation without Parallel Corpora
Alexandre Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky . . . 130

Character-Based Pivot Translation for Under-Resourced Languages and Domains
Jörg Tiedemann . . .
. . . 141

Does more data always yield better translations?
Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer and Francisco Casacuberta . . . 152

Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer and Noah A. Smith . . . 162

Tree Representations in Probabilistic Models for Extended Named Entities Detection
Marco Dinarelli and Sophie Rosset . . . 174

When Did that Happen? — Linking Events and Relations to Timestamps
Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Christopher Welty . . . 185

Compensating for Annotation Errors in Training a Relation Extractor
Bonan Min and Ralph Grishman . . . 194

Incorporating Lexical Priors into Topic Models
Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa . . . 204

DualSum: a Topic-Model based approach for update summarization
Jean-Yves Delort and Enrique Alfonseca . . . 214

Large-Margin Learning of Submodular Summarization Models
Ruben Sipos, Pannaga Shivaswamy and Thorsten Joachims . . . 224

A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings
Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman . . .
234

Active learning for interactive machine translation
Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta . . . 245

Adapting Translation Models to Translationese Improves SMT
Gennadi Lembersky, Noam Ordan and Shuly Wintner . . . 255

Aspectual Type and Temporal Relation Classification
Francisco Costa and António Branco . . . 266

Automatic generation of short informative sentiment summaries
Andrea Glaser and Hinrich Schütze . . . 276

Bootstrapped Training of Event Extraction Classifiers
Ruihong Huang and Ellen Riloff . . . 286

Bootstrapping Events and Relations from Text
Ting Liu and Tomek Strzalkowski . . . 296

CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Svitlana Volkova, William B. Dolan and Theresa Wilson . . . 306

Extending the Entity-based Coherence Model with Multiple Ranks
Vanessa Wei Feng and Graeme Hirst . . . 315

Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction
Michael Wiegand and Dietrich Klakow . . . 325

Skip N-grams and Ranking Functions for Predicting Script Events
Bram Jans, Steven Bethard, Ivan Vulić and Marie-Francine Moens . . .
336

The Problem with Kappa
David Martin Ward Powers . . . 345

User Edits Classification Using Document Revision Histories
Amit Bronner and Christof Monz . . . 356

User Participation Prediction in Online Forums
Zhonghua Qu and Yang Liu . . . 367

Inferring Selectional Preferences from Part-Of-Speech N-grams
Hyeju Jang and Jack Mostow . . . 377

WebCAGe – A Web-Harvested Corpus Annotated with GermaNet Senses
Verena Henrich, Erhard Hinrichs and Tatiana Vodolazova . . . 387

Learning to Behave by Reading
Regina Barzilay . . . 397

Lexical surprisal as a general predictor of reading time
Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco . . . 398

Spectral Learning for Non-Deterministic Dependency Parsing
Franco M. Luque, Ariadna Quattoni, Borja Balle and Xavier Carreras . . . 409

Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury and Alberto Lavelli . . . 420

Coordination Structure Analysis using Dual Decomposition
Atsushi Hanamoto, Takuya Matsuzaki and Jun'ichi Tsujii . . .
430

Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico . . . 439

Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge
Ivan Vulić and Marie-Francine Moens . . . 449

Efficient parsing with Linear Context-Free Rewriting Systems
Andreas van Cranenburgh . . . 460

Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system
Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore . . . 471

Experimenting with Distant Supervision for Emotion Classification
Matthew Purver and Stuart Battersby . . . 482

Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev, Valentin Zhikov, Kiril Simov, Petya Osenova and Preslav Nakov . . . 492

Instance-Driven Attachment of Semantic Annotations over Conceptual Hierarchies
Janara Christensen and Marius Pasca . . . 503

Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding
Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen . . . 514

Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
Liviu P. Dinu, Vlad Niculae and Octavia-Maria Sulea . . .
524

Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
Torsten Zesch . . . 529

Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation
Rico Sennrich . . . 539

Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability
Judith Eckle-Kohler and Iryna Gurevych . . . 550

The effect of domain and text type on text prediction quality
Suzan Verberne, Antal van den Bosch, Helmer Strik and Lou Boves . . . 561

The Impact of Spelling Errors on Patent Search
Benno Stein, Dennis Hoppe and Tim Gollub . . . 570

UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth . . . 580

Word Sense Induction for Novel Sense Detection
Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin . . . 591

Learning Language from Perceptual Context
Raymond Mooney . . . 602

Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick and Nathanael Chambers . . .
. . . 603

Learning from evolving data streams: online triage of bug reports
Grzegorz Chrupala . . . 613

Towards a model of formal and informal address in English
Manaal Faruqui and Sebastian Padó . . . 623

Character-based kernels for novelistic plot structure
Micha Elsner . . . 634

Smart Paradigms and the Predictability and Complexity of Inflectional Morphology
Grégoire Détrez and Aarne Ranta . . . 645

Probabilistic Hierarchical Clustering of Morphological Paradigms
Burcu Can and Suresh Manandhar . . . 654

Modeling Inflection and Word-Formation in SMT
Alexander Fraser, Marion Weller, Aoife Cahill and Fabienne Cap . . . 664

Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text
Sarah Alkuhlani and Nizar Habash . . . 675

Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison with VerbNet and FrameNet
Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa . . . 686

Unsupervised Detection of Downward-Entailing Operators By Maximizing Classification Certainty
Jackie Chi Kit Cheung and Gerald Penn . . .
696

Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish
Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov . . . 706

Validation of sub-sentential paraphrases acquired from parallel monolingual corpora
Houda Bouamor, Aurélien Max and Anne Vilnat . . . 716

Determining the placement of German verbs in English–to–German SMT
Anita Gojun and Alexander Fraser . . . 726

Syntax-Based Word Ordering Incorporating a Large-Scale Language Model
Yue Zhang, Graeme Blackwood and Stephen Clark . . . 736

Midge: Generating Image Descriptions From Computer Vision Detections
Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg and Hal Daume III . . . 747

Generation of landmark-based navigation instructions from open-source data
Markus Dräger and Alexander Koller . . . 757

To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß, Aoife Cahill and Jonas Kuhn . . . 767

Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar . . . 777

An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Accommodation
Mahaveer Jain, John McDonough, Gahgene Gweon, Bhiksha Raj and Carolyn Penstein Rosé . . .
787

Learning the Fine-Grained Information Status of Discourse Entities
Altaf Rahman and Vincent Ng . . . 798

Composing extended top-down tree transducers
Aurelie Lagoutte, Fabienne Braune, Daniel Quernheim and Andreas Maletti . . . 808

Structural and Topical Dimensions in Multi-Task Patent Translation
Katharina Waeschle and Stefan Riezler . . . 818

Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Construction Grammar
Remi van Trijp . . . 829

Managing Uncertainty in Semantic Tagging
Silvie Cinková, Martin Holub and Vincent Kríž . . . 840

Parallel and Nested Decomposition for Factoid Questions
Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and Adam Lally . . .
851

Conference Program

Wednesday April 25, 2012

(8:45) Session 1: Plenary Session
9:00 Speech Communication in the Wild
  Martin Cooke

(10:30) Session 2a: Semantics
10:30 Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora
  Spyros Martzoukos and Christof Monz
10:55 A Bayesian Approach to Unsupervised Semantic Role Induction
  Ivan Titov and Alexandre Klementiev
11:20 Entailment above the word level in distributional semantics
  Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do and Chung-chieh Shan
11:45 Evaluating Distributional Models of Semantics for Syntactically Invariant Inference
  Jackie Chi Kit Cheung and Gerald Penn

(10:30) Session 2b: Parsing
10:30 Cross-Framework Evaluation for Statistical Parsing
  Reut Tsarfaty, Joakim Nivre and Evelina Andersson
10:55 Dependency Parsing of Hungarian: Baseline Results and Challenges
  Richárd Farkas, Veronika Vincze and Helmut Schmid
11:20 Dependency Parsing with Undirected Graphs
  Carlos Gómez-Rodríguez and Daniel Fernández-González
11:45 The Best of Both Worlds – A Graph-based Completion Model for Transition-based Parsers
  Bernd Bohnet and Jonas Kuhn

(10:30) Session 2c: QA and IR
10:30 Answer Sentence Retrieval by Matching Dependency Paths acquired from Question/Answer Sentence Pairs
  Michael Kaisser
10:55 Can Click Patterns across User's Query Logs Predict Answers to Definition Questions?
  Alejandro Figueroa
11:20 Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Service Context
  Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz

(14:00) Session 3a: Machine Translation
14:00 Computing Lattice BLEU Oracle Scores for Machine Translation
  Artem Sokolov, Guillaume Wisniewski and François Yvon
14:25 Toward Statistical Machine Translation without Parallel Corpora
  Alexandre Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky
14:50 Character-Based Pivot Translation for Under-Resourced Languages and Domains
  Jörg Tiedemann
15:15 Does more data always yield better translations?
  Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer and Francisco Casacuberta

(14:00) Session 3b: Information Extraction
14:00 Recall-Oriented Learning of Named Entities in Arabic Wikipedia
  Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer and Noah A. Smith
14:25 Tree Representations in Probabilistic Models for Extended Named Entities Detection
  Marco Dinarelli and Sophie Rosset
14:50 When Did that Happen?
— Linking Events and Relations to Timestamps
  Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Christopher Welty
15:15 Compensating for Annotation Errors in Training a Relation Extractor
  Bonan Min and Ralph Grishman

(14:00) Session 3c: Machine Learning and Summarization
14:00 Incorporating Lexical Priors into Topic Models
  Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa
14:25 DualSum: a Topic-Model based approach for update summarization
  Jean-Yves Delort and Enrique Alfonseca
14:50 Large-Margin Learning of Submodular Summarization Models
  Ruben Sipos, Pannaga Shivaswamy and Thorsten Joachims

(16:10) Session 4: Posters (1) and Demos (1)
16:10 A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings
  Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman
16:10 Active learning for interactive machine translation
  Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta
16:10 Adapting Translation Models to Translationese Improves SMT
  Gennadi Lembersky, Noam Ordan and Shuly Wintner
16:10 Aspectual Type and Temporal Relation Classification
  Francisco Costa and António Branco
16:10 Automatic generation of short informative sentiment summaries
  Andrea Glaser and Hinrich Schütze
16:10 Bootstrapped Training of Event Extraction Classifiers
  Ruihong Huang and Ellen Riloff
16:10 Bootstrapping Events and Relations from Text
  Ting Liu and Tomek Strzalkowski
16:10 CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
  Svitlana Volkova, William B.
Dolan and Theresa Wilson
16:10 Extending the Entity-based Coherence Model with Multiple Ranks
  Vanessa Wei Feng and Graeme Hirst
16:10 Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction
  Michael Wiegand and Dietrich Klakow
16:10 Skip N-grams and Ranking Functions for Predicting Script Events
  Bram Jans, Steven Bethard, Ivan Vulić and Marie-Francine Moens
16:10 The Problem with Kappa
  David Martin Ward Powers
16:10 User Edits Classification Using Document Revision Histories
  Amit Bronner and Christof Monz
16:10 User Participation Prediction in Online Forums
  Zhonghua Qu and Yang Liu
16:10 Inferring Selectional Preferences from Part-Of-Speech N-grams
  Hyeju Jang and Jack Mostow
16:10 WebCAGe – A Web-Harvested Corpus Annotated with GermaNet Senses
  Verena Henrich, Erhard Hinrichs and Tatiana Vodolazova

Thursday April 26, 2012

(9:00) Session 5: Plenary Session
9:00 Learning to Behave by Reading
  Regina Barzilay

(10:30) Session 6a: Student Workshop
(10:30) Session 6b: Student Workshop
(10:30) Session 6c: Student Workshop
(14:00) Session 7: EACL business meeting

(14:50) Session 8: Plenary Session
14:50 Lexical surprisal as a general predictor of reading time
  Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco
15:15 Spectral Learning for Non-Deterministic Dependency Parsing
  Franco M. Luque, Ariadna Quattoni, Borja Balle and Xavier Carreras

(16:10) Session 9: Posters (2) and Demos (2)
16:10 Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
  Md.
Faisal Mahbub Chowdhury and Alberto Lavelli
16:10 Coordination Structure Analysis using Dual Decomposition
  Atsushi Hanamoto, Takuya Matsuzaki and Jun'ichi Tsujii
16:10 Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
  Arianna Bisazza and Marcello Federico
16:10 Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge
  Ivan Vulić and Marie-Francine Moens
16:10 Efficient parsing with Linear Context-Free Rewriting Systems
  Andreas van Cranenburgh
16:10 Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system
  Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore
16:10 Experimenting with Distant Supervision for Emotion Classification
  Matthew Purver and Stuart Battersby
16:10 Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
  Georgi Georgiev, Valentin Zhikov, Kiril Simov, Petya Osenova and Preslav Nakov
16:10 Instance-Driven Attachment of Semantic Annotations over Conceptual Hierarchies
  Janara Christensen and Marius Pasca
16:10 Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding
  Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen
16:10 Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
  Liviu P.
Dinu, Vlad Niculae and Octavia-Maria Sulea
16:10 Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
  Torsten Zesch
16:10 Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation
  Rico Sennrich
16:10 Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability
  Judith Eckle-Kohler and Iryna Gurevych
16:10 The effect of domain and text type on text prediction quality
  Suzan Verberne, Antal van den Bosch, Helmer Strik and Lou Boves
16:10 The Impact of Spelling Errors on Patent Search
  Benno Stein, Dennis Hoppe and Tim Gollub
16:10 UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF
  Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth
16:10 Word Sense Induction for Novel Sense Detection
  Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin

Friday April 27, 2012

(9:00) Session 10: Plenary Session
9:00 Learning Language from Perceptual Context
  Raymond Mooney

(10:30) Session 11a: Data Mining and Discourse
10:30 Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
  Micol Marchetti-Bowick and Nathanael Chambers
10:55 Learning from evolving data streams: online triage of bug reports
  Grzegorz Chrupala
11:20 Towards a model of formal and informal address in English
  Manaal Faruqui and Sebastian Pado
11:45 Character-based kernels for novelistic plot structure
  Micha Elsner

(10:30) Session 11b: Morphology
10:30 Smart Paradigms and the Predictability and Complexity of Inflectional Morphology
  Grégoire Détrez and Aarne Ranta
10:55 Probabilistic Hierarchical Clustering of Morphological Paradigms
  Burcu Can and Suresh Manandhar
11:20 Modeling Inflection and Word-Formation in SMT
  Alexander Fraser, Marion Weller, Aoife Cahill and Fabienne Cap
11:45 Identifying Broken
Plurals, Irregular Gender, and Rationality in Arabic Text
  Sarah Alkuhlani and Nizar Habash

(10:30) Session 11c: Semantics
10:30 Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison with VerbNet and FrameNet
  Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa
10:55 Unsupervised Detection of Downward-Entailing Operators By Maximizing Classification Certainty
  Jackie Chi Kit Cheung and Gerald Penn
11:20 Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish
  Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov
11:45 Validation of sub-sentential paraphrases acquired from parallel monolingual corpora
  Houda Bouamor, Aurélien Max and Anne Vilnat

(14:00) Session 12a: Generation and Word Ordering
14:00 Determining the placement of German verbs in English–to–German SMT
  Anita Gojun and Alexander Fraser
14:25 Syntax-Based Word Ordering Incorporating a Large-Scale Language Model
  Yue Zhang, Graeme Blackwood and Stephen Clark
14:50 Midge: Generating Image Descriptions From Computer Vision Detections
  Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg and Hal Daume III
15:15 Generation of landmark-based navigation instructions from open-source data
  Markus Dräger and Alexander Koller

(14:00) Session 12b: Discourse and Dialogue
14:00 To what extent does sentence-internal realisation reflect discourse context?
A study on word order
  Sina Zarrieß, Aoife Cahill and Jonas Kuhn
14:25 Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
  Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar
14:50 An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Accommodation
  Mahaveer Jain, John McDonough, Gahgene Gweon, Bhiksha Raj and Carolyn Penstein Rosé
15:15 Learning the Fine-Grained Information Status of Discourse Entities
  Altaf Rahman and Vincent Ng

(14:00) Session 12c: Parsing and MT
14:00 Composing extended top-down tree transducers
  Aurelie Lagoutte, Fabienne Braune, Daniel Quernheim and Andreas Maletti
14:25 Structural and Topical Dimensions in Multi-Task Patent Translation
  Katharina Waeschle and Stefan Riezler
14:50 Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Construction Grammar
  Remi van Trijp

(15:45) Session 13: Plenary Session
15:45 Managing Uncertainty in Semantic Tagging
  Silvie Cinková, Martin Holub and Vincent Kříž
Parallel and Nested Decomposition for Factoid Questions
  Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and Adam Lally

Speech Communication in the Wild
Martin Cooke
Language and Speech Laboratory
University of the Basque Country
Ikerbasque (Basque Science Foundation)

[email protected]

Abstract

Much of what we know about speech perception comes from laboratory studies with clean, canonical speech, ideal listeners and artificial tasks. But how do interlocutors manage to communicate effectively in the seemingly less-than-ideal conditions of everyday listening, which frequently involve trying to make sense of speech while listening in a non-native language, or in the presence of competing sound sources, or while multitasking? In this talk I'll examine the effect of real-world conditions on speech perception and quantify the contributions made by factors such as binaural hearing, visual information and prior knowledge to speech communication in noise. I'll present a computational model which trades stimulus-related cues with information from learnt speech models, and examine how well it handles both energetic and informational masking in a two-sentence separation task. Speech communication also involves listening-while-talking. In the final part of the talk I'll describe some ways in which speakers might be making communication easier for their interlocutors, and demonstrate the application of these principles to improving the intelligibility of natural and synthetic speech in adverse conditions.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 1, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora

Spyros Martzoukos    Christof Monz
Informatics Institute, University of Amsterdam
Science Park 904, 1098 XH Amsterdam, The Netherlands
{s.martzoukos, c.monz}@uva.nl

Abstract

We describe a novel method that extracts paraphrases from a bitext, for both the source and target languages. In order to reduce the search space, we decompose the phrase-table into sub-phrase-tables and construct separate clusters for source and target phrases. We convert the clusters into graphs, add smoothing/syntactic-information-carrier vertices, and compute the similarity between phrases with a random walk-based measure, the commute time. The resulting phrase-paraphrase probabilities are built upon the conversion of the commute times into artificial co-occurrence counts with a novel technique. The co-occurrence count distribution belongs to the power-law family.

1 Introduction

Paraphrase extraction has emerged as an important problem in NLP. Currently, there exists an abundance of methods for extracting paraphrases from monolingual, comparable and bilingual corpora (Madnani and Dorr, 2010; Androutsopoulos and Malakasiotis, 2010); we focus on the latter and specifically on the phrase-table that is extracted from a bitext during the training stage of Statistical Machine Translation (SMT). Bannard and Callison-Burch (2005) introduced the pivoting approach, which relies on a 2-step transition from a phrase, via its translations, to a paraphrase candidate. By incorporating the syntactic structure of phrases (Callison-Burch, 2005), the quality of the paraphrases extracted with pivoting can be improved. Kok and Brockett (2010) (henceforth KB) used a random walk framework to determine the similarity between phrases, which was shown to outperform pivoting with syntactic information, when multiple phrase-tables are used. In SMT, extracted paraphrases with associated pivot-based (Callison-Burch et al., 2006; Onishi et al., 2010) and cluster-based (Kuhn et al., 2010) probabilities have been found to improve the quality of translation. Pivoting has also been employed in the extraction of syntactic paraphrases, which are a mixture of phrases and non-terminals (Zhao et al., 2008; Ganitkevitch et al., 2011).

We develop a method for extracting paraphrases from a bitext for both the source and target languages. Emphasis is placed on the quality of the phrase-paraphrase probabilities as well as on providing a stepping stone for extracting syntactic paraphrases with equally reliable probabilities. In line with previous work, our method depends on the connectivity of the phrase-table, but the resulting construction treats each side separately, and can thus potentially benefit from additional monolingual data.

The initial problem in harvesting paraphrases from a phrase-table is the identification of the search space. Previous work has relied on breadth-first search from the query phrase with a depth of 2 (pivoting) and 6 (KB). The former can be too restrictive and the latter can lead to excessive noise contamination when taking shallow syntactic information features into account. Instead, we choose to cluster the phrase-table into separate source and target clusters, and, in order to make this task computationally feasible, we decompose the phrase-table into sub-phrase-tables. We propose a novel heuristic algorithm for the decomposition of the phrase-table (Section 2.1), and use a well-established co-clustering algorithm for clustering each sub-phrase-table (Section 2.2).

The underlying connectivity of the source and target clusters gives rise to a natural graph representation for each cluster (Section 3.1). The vertices of the graphs consist of phrases and features with a dual smoothing/syntactic-information-carrier role. The latter allow (a) redistribution of the mass for phrases with no appropriate paraphrases and (b) the extraction of syntactic paraphrases. The proximity among vertices of a graph is measured by means of a random walk distance measure, the commute time (Aldous and Fill, 2001). This measure is known to perform well in identifying similar words on the graph of WordNet (Rao et al., 2008), and a related measure, the hitting time, is known to perform well in harvesting paraphrases on a graph constructed from multiple phrase-tables (KB).

Generally in NLP, power-law distributions are typically encountered in the collection of counts during the training stage. The distances of Section 3.1 are converted into artificial co-occurrence counts with a novel technique (Section 3.2). Although they need not be integers, the main challenge is the type of the underlying distributions; it should ideally emulate the resulting count distributions from the phrase extraction stage of a monolingual parallel corpus (Dolan et al., 2004). These counts give rise to the desired probability distributions by means of relative frequencies.

2 Sub-phrase-tables & Clustering

2.1 Extracting Connected Components

For the decomposition of the phrase-table into sub-phrase-tables it is convenient to view the phrase-table as an undirected, unweighted graph P with the vertex set being the source and target phrases and the edge set being the phrase-table entries. For the rest of this section, we do not distinguish between source and target phrases, i.e. both types are treated equally as vertices of P. When referring to the size of a graph, we mean the number of vertices it contains.

A trivial initial decomposition of P is achieved by identifying all its connected components (components for brevity), i.e. the mutually disjoint connected subgraphs, {P0, P1, ..., Pn}. It turns out (see Section 4.1) that the largest component, say P0, is of significant size. We call P0 giant and it needs to be further decomposed. This is done by identifying all vertices such that, upon removal, the component becomes disconnected. Such vertices are called articulation points or cut-vertices. Cut-vertices of high connectivity degree are removed from the giant component (see Section 4.1). For the remaining vertices of the giant component, new components are identified and we proceed iteratively, while keeping track of the cut-vertices that are removed at each iteration, until the size of the largest component is less than a certain threshold θ (see Section 4.1).

Note that at each iteration, when removing cut-vertices from a giant component, the resulting collection of components may include graphs consisting of a single vertex. We refer to such vertices as residues. They are excluded from the resulting collection and are considered for separate treatment, as explained later in this section.

The cut-vertices need to be inserted appropriately back into the components: Starting from the last iteration step, the respective cut-vertices are added to all the components of P0 which they used to 'glue' together; this process is performed iteratively, until there are no more cut-vertices to add. By 'addition' of a cut-vertex to a component, we mean the re-establishment of edges between the former and other vertices of the latter. The result is a collection of components whose total number of unique vertices is less than the number of vertices of the initial giant component P0. These remaining vertices are the residues. We then construct the graph R which consists of the residues together with all their translations (even those that are included in components of the above collection) and then identify its components {R0, ..., Rm}. It turns out that the largest component, say R0, is giant and we repeat the decomposition process that was performed on P0. This results in a new collection of components as well as new residues: The components need to be pruned (see Section 4.1) and the residues give rise to a new graph R' which is constructed in the same way as R. We proceed iteratively until the number of residues stops changing. For each remaining residue u, we identify its translations, and for each translation v we identify the largest component of which v is a member and add u to that component.

The final result is a collection C = D ∪ F, where D is the collection of components emerging from the entire iterative decomposition of P0 and R, and F = {P1, ..., Pn}. Figure 1 shows the decomposition of a connected graph G0; for simplicity we assume that only one cut-vertex is removed at each iteration and ties are resolved arbitrarily. In Figure 2 the residue graph is constructed and its two components are identified. The iterative insertion of the cut-vertices is also depicted. The resulting two components together with those from R form the collection D for G0.

[Figure 1: The decomposition of G0 with vertices si and tj: the cut-vertex of the ith iteration is denoted by ci, and r collects the residues after each iteration. The task is completed in Figure 2.]

[Figure 2: Top: Residue graph with its components (no further decomposition is required). Bottom: Adding cut-vertices back to their components.]

The addition of cut-vertices into multiple components, as well as the construction method of the residue-based graph R, can yield the occurrence of a vertex in multiple components in D. We exploit this property in two ways:

(a) In order to mitigate the risk of excessive decomposition (which implies greater risk of good paraphrases being in different components), as well as to reduce the size of D, a conservative merging algorithm for components is employed. Suppose that the elements of D are ranked according to size in ascending order as D = {D1, ..., Dk, Dk+1, ..., D|D|}, where |Di| ≤ δ, for i = 1, ..., k, and some threshold δ (see Section 4.1). Each component Di with i ∈ {1, ..., k} is examined as follows: For each vertex of Di the number of its occurrences in D is inspected; this is done in order to identify an appropriate vertex b to act as a bridge between Di and other components of which b is a member. Note that translations of a vertex b with a smaller number of occurrences in D are less likely to capture their full spectrum of paraphrases. We thus choose a vertex b from Di with the smallest number of occurrences in D, resolving ties arbitrarily, and proceed with merging Di with the largest component, say Dj with j ∈ {1, ..., |D| − 1}, of which b is also a member. The resulting merged component Dj′ contains all vertices and edges of Di and Dj and new edges, which are formed according to the rule: if u is a vertex of Di and v is a vertex of Dj and (u, v) is a phrase-table entry, then (u, v) is an edge in Dj′. As long as no connected component has identified Di as the component with which it should be merged, Di is deleted from the collection D.

(b) We define an idf-inspired measure for each phrase pair (x, x′) of the same type (source or target) as

  idf(x, x') = \frac{1}{\log |D|} \log \frac{2\,c(x, x')\,|D|}{c(x) + c(x')},    (1)

where c(x, x′) is the number of components in which the phrases x and x′ co-occur, and equivalently for c(·). The purpose of this measure is the pruning of paraphrase candidates and its use is explained in Section 3.1. Note that idf(x, x′) ∈ [0, 1].

The merging process and the idf measure are irrelevant for phrases belonging to the components of F, since the vertex set of each component of F is mutually disjoint with the vertex set of any other component in C.

2.2 Clustering Connected Components

The aim of this subsection is to generate separate clusters for the source and target phrases of each sub-phrase-table (component) C in the collection C. For this purpose the Information-Theoretic Co-Clustering (ITC) algorithm (Dhillon et al., 2003) is employed, which is a general principled clustering algorithm that generates hard clusters (i.e. every element belongs to exactly one cluster) of two interdependent quantities and is known to perform well on high-dimensional and sparse data. In our case, the interdependent quantities are the source and target phrases and the sparse data is the phrase-table.

ITC is a search algorithm similar to K-means, in the sense that a cost function is minimized at each iteration step, and the numbers of clusters for both quantities are meta-parameters. The number of clusters is set to the most conservative initialization for both source and target phrases, namely to as many clusters as there are phrases. At each iteration, new clusters are constructed based on the identification of the argmin of the cost function for each phrase, which gradually reduces the number of clusters.

We observe that conservative choices for the meta-parameters often result in good paraphrases being in different clusters. To overcome this problem, the hard clusters are converted into soft ones (i.e. an element may belong to several clusters): One step before the stopping criterion is met, we modify the algorithm so that instead of assigning a phrase to the cluster with the smallest cost we select the bottom-X clusters ranked by cost. Additionally, only a certain number of phrases is chosen for soft clustering. Both selections are done conservatively with criteria based on the properties of the cost functions.

The formation of clusters leads to a natural refinement of the idf measure defined in eqn. (1): The quantity c(x, x′) is redefined as the number of components in which the phrases x and x′ co-occur in at least one cluster.

3 Monolingual Graphs & Counts

We proceed with converting the clusters into directed, weighted graphs and then extract paraphrases for both the source and target side. For brevity we explain the process restricted to the source clusters of a sub-phrase-table, but the same method applies for the target side and for all sub-phrase-tables in the collection C.

3.1 Monolingual graphs

Each source cluster is converted into a graph G as follows: The vertex set consists of the phrases of the cluster and an edge between s and s′ exists if (a) s and s′ have at least one translation from the same target cluster, and (b) idf(s, s′) is greater than some threshold σ (see Section 4.1). If two phrases that satisfy condition (b) have translations in more than one common target cluster, a distinct such edge is established. All edges are bi-directional with distinct weights for both directions.

Figure 3 depicts an example of such a construction; a link between a phrase si and a target cluster implies the existence of at least one translation for si in that cluster. We are not interested in the target phrases and they are thus not shown. For simplicity we assume that condition (b) is always satisfied and the extracted graph contains the maximum possible edges. Observe that phrases s3 and s4 have two edges connecting them (due to target clusters Tc and Td), and that the target cluster Ta is irrelevant to the construction of the graph, since s1 is the only phrase with translations in it. This conversion of a source cluster into a graph G results in the formation of subgraphs in G, where each subgraph is generated by a target cluster. In general, if condition (b) is not always satisfied, then G need not be connected and each connected component is treated as a distinct graph.

[Figure 3: Top: A source cluster containing phrases s1, ..., s8 and the associated target clusters Ta, ..., Tf. Bottom: The extracted graph from the source cluster. All edges are bi-directional.]

Analogous to KB, we introduce feature vertices to G: For each phrase vertex s, its part-of-speech (POS) tag sequence and stem sequence are identified and inserted into G as new vertices with bi-directional weighted edges connected to s. If phrase vertices s and s′ have the same POS tag sequence, then they are connected to the same POS tag feature vertex. Similarly for stem feature vertices. See Figure 4 for an example. Note that we do not allow edges between POS tag and stem feature vertices. The purpose of the feature vertices, unlike KB, is primarily smoothing and secondarily the identification of paraphrases with the same syntactic information, as will become clear in the description of the computation of weights.

[Figure 4: Adding feature vertices to the extracted graph (has) ↔ (owns) ↔ (i have) ↔ (i had). Phrase, POS tag feature and stem feature vertices are drawn in circles, dotted rectangles and solid rectangles respectively. All edges are bi-directional.]

The set of all phrase vertices that are adjacent to s is written as Γ(s), and referred to as the neighborhood of s. Let n(s, t) denote the co-occurrence count of a phrase-table entry (s, t) (Koehn, 2009). We define the strength of s in the subgraph generated by cluster T as

  n(s; T) = \sum_{t \in T} n(s, t),    (2)

which is simply a partial occurrence count for s. We proceed with computing weights for all edges of G:

Phrase-phrase weights: Inspired by the notion of preferential attachment (Yule, 1925), which is known to produce power-law weight distributions for evolving weighted networks (Barrat et al., 2004), we set the weight of a directed edge from s to s′ to be proportional to the strengths of s′ in all subgraphs in which both s and s′ are members. Thus, in the random walk framework, s is more likely to visit a stronger (more reliable) neighbor. If T_{s,s'} = {T | s and s′ coexist in the subgraph generated by T}, then the weight w(s → s′) of the directed edge from s to s′ is given by

  w(s \to s') = \sum_{T \in T_{s,s'}} n(s'; T),    (3)

if s′ ∈ Γ(s), and 0 otherwise.

Phrase-feature weights: As mentioned above, feature vertices have the dual role of carrying syntactic information and smoothing. From eqn. (3) it can be deduced that, if for a phrase s the amount of its outgoing weights is close to the amount of its incoming weights, then this is an indication that at least a significant part of its neighborhood is reliable; the larger the strengths, the more certain the indication. Otherwise, either s or a significant part of its neighborhood is unreliable. The amount of weight from s to its feature vertices should depend on this observation and we thus let

  net(s) = \Big| \sum_{s' \in \Gamma(s)} \big( w(s \to s') - w(s' \to s) \big) \Big| + \epsilon,    (4)

where ε prevents net(s) from becoming 0 (see Section 4.1). The net weight of a phrase vertex s is distributed over its feature vertices as

  w(s \to f_X) = \langle w(s \to s') \rangle + net(s),    (5)

where the first summand is the average weight from s to its neighboring phrase vertices and X = POS, STEM. If s has multiple POS tag sequences, we distribute the weight of eqn. (5) relative to the co-occurrences of s with the respective POS tag feature vertices. The quantity ⟨w(s → s′)⟩ accounts for the basic smoothing and is augmented by the value net(s), which measures the reliability of s's neighborhood; the more unreliable the neighborhood, the larger the net weight and thus the larger the overall weights to the feature vertices.

The choice for the opposite direction is trivial:

  w(f_X \to s) = \frac{1}{|\{ s' : (f_X, s') \text{ is an edge} \}|},    (6)

where X = POS, STEM. Note the effect of eqns. (4)-(6) in the case where the neighborhood of s has unreliable strengths: In a random walk the feature vertices of s will be preferred and the resulting similarities between s and other phrase vertices will be small, as desired. Nonetheless, if the syntactic information is the same as that of any other phrase vertex in G, then the paraphrases will be captured.

The transition probability from any vertex u to any other vertex v in G, i.e., the probability of hopping from u to v in one step, is given by

  p(u \to v) = \frac{w(u \to v)}{\sum_{v'} w(u \to v')},    (7)

where we sum over all vertices adjacent to u in G. We can thus compute the similarity between any two vertices u and v in G by their commute time, i.e., the expected number of steps in a round trip, in a random walk from u to v and then back to u, which is denoted by κ(u, v) (see Section 4.1 for the method of computation of κ). Since κ(u, v) is a distance measure, the smaller its value, the more similar u and v are.

3.2 Counts

We convert the distance κ(u, v) of a vertex pair u, v in a graph G into a co-occurrence count n_G(u, v) with a novel technique: In order to assess the quality of the pair u, v with respect to G we compare κ(u, v) with κ(u, x) and κ(v, x) for all other vertices x in G. We thus consider the average distance of u to the vertices of G other than v, and similarly for v. This quantity is denoted by κ(u; v) and κ(v; u) respectively, and by definition it is given by

  \kappa(i; j) = \sum_{x \in G,\, x \neq j} \kappa(i, x)\, p_G(x|i),    (8)

where p_G(x|i) ≡ p(x|G, i) is a yet unknown probability distribution with respect to G. The quantity (κ(u; v) + κ(v; u))/2 can then be viewed as the average distance of the pair u, v to the rest of the graph G. The co-occurrence count of u and v in G is thus defined by

  n_G(u, v) = \frac{\kappa(u; v) + \kappa(v; u)}{2\,\kappa(u, v)}.    (9)

In order to calculate the probabilities p_G(·|·) we employ the following heuristic: Starting with a uniform distribution p_G^{(0)}(·|·) at timestep t = 0, we iterate

  \kappa^{(t)}(i; j) = \sum_{x \in G,\, x \neq j} \kappa(i, x)\, p_G^{(t)}(x|i)    (10)

for all pairs of vertices u, v in G until convergence. Experimentally, we find that convergence is always achieved. After the execution of this iterative process we divide each count by the smallest count in order to achieve a lower bound of 1.

A pair u, v may appear in multiple graphs in the same sub-phrase-table C. The total co-occurrence count of u and v in C and the associated conditional probabilities are thus given by

  n_C(u, v) = \sum_{G \in C} n_G(u, v)    (13)

  p_C(v|u) = \frac{n_C(u, v)}{\sum_{x \in C} n_C(u, x)}.    (14)

A pair u, v may appear in multiple sub-phrase-tables, and for the calculation of the final count n(u, v) we need to average over the associated counts from all sub-phrase-tables. Moreover, we have to take into account the type of the vertices: For the simplest case where both u and v represent phrase vertices, their expected count is, by definition, given by

  n(s, s') = \sum_C n_C(s, s')\, p(C|s, s').    (15)

On the other hand, if at least one of u or v is a feature vertex, then we have to consider the phrase vertex that generates this feature: Suppose that u is the phrase vertex s = 'acquire' and v the POS tag vertex f = 'NN', and that they co-occur in two sub-phrase-tables C and C′ with positive counts n_C(s, f) and n_{C′}(s, f) respectively; the feature vertex f is generated by the phrase vertex 'ownership' in C and by 'possession' in C′. In that case, an interpolation of the counts n_C(s, f) and n_{C′}(s, f) as in eqn. (15) would be incorrect and a direct sum n_C(s, f) + n_{C′}(s, f) would provide the true count. As a result we have

  n(s, f) = \sum_{s'} \sum_C n_C(s, f(s'))\, p(C|s, f(s')),    (16)

where the first summation is over all phrase vertices s′ such that f(s′) = f.
With a similar argu- ment we can write (t) κ(t) (u; v) + κ(t) (v; u) nG (u, v) = (11) 2κ(u, v) XX n(f, f 0 ) = nC (f (s), f (s0 ))× (t) (t+1) nG (u, v) s,s0 C pG (v|u) = (t) (12) × p(C|f (s), f (s0 )). P x∈G nG (u, v) (17) 7 For the interpolants, from standard probability we 7 10 find 6 5 10 10 pC (v|u)p(C|u) P0 p(C|u, v) = P 0 , (18) 5 10 C 0 pC 0 (v|u)p(C |u) 4 10 size 0 where the probabilities p(C|u) can be computed 3 10 10 0 10 10 2 4 10 6 10 by considering the likelihood function 2 10 N Y N X Y 1 10 `(u) = p(xi |u) = pC (xi |u)p(C|u) 0 i=1 i=1 C 10 0 2 4 6 10 10 10 10 rank and by maximizing the average log-likelihood 1 N log `(u), where N is the total number of ver- Figure 5: Log-log plot of ranked components ac- tices with which u co-occurs with positive counts cording to their size (number of source and target in all sub-phrase-tables. phrases) for: (a) Components extracted from P . Finally, the desired probability distributions are ‘1-1’ components are not shown. (b) Components given by the relative frequencies extracted from the decomposition of P0 . n(u, v) p(v|u) = P , (19) x n(u, x) In the components emerging from the decompo- for all pairs of vertices u, v. sition of R0 , we observe an excessive number of cut-vertices. Note that vertices that consist 4 Experiments these components can be of two types: i) for- 4.1 Setup mer residues, i.e., residues that emerged from the The data for building the phrase-table P decomposition of P0 , and ii) other vertices of is drawn from DE-EN bitexts crawled from P0 . Cut-vertices can be of either type. For each www.project-syndicate.org, which is component, we remove cut-vertices that are not a standard resource provider for the WMT translations of the former residues of that com- campaigns (News Commentary bitexts, see, ponent. Following this pruning strategy, the de- e.g. (Callison-Burch et al., 2007) ). 
The filtered generacy of excessive cut-vertices does not reap- bitext consists of 125K sentences; word align- pear in the subsequent iterations of decompos- ment was performed running GIZA++ in both di- ing components generated by new residues, but rections and generating the symmetric alignments the emergence of two giant components was ob- using the ‘grow-diag-final-and’ heuristics. The served: One consisting mostly of source type ver- resulting P has 7.7M entries, 30% of which are tices and one of target type vertices. Without go- ‘1-1’, i.e. entries (s, t) that satisfy p(s|t) = ing into further details, the algorithm can extend p(t|s) = 1. These entries are irrelevant for para- to multiple giant components straightforwardly. phrase harvesting for both the baseline and our For the merging process of the collection D we method, and are thus excluded from the process. set δ = 5000, to avoid the emergence of a giant The initial giant component P0 contains 1.7M component. The sizes of the resulting sub-phrase- vertices (Figure 5), of which 30% become tables are shown in Figure 6. For the ITC algo- residues and are used to construct R. At each it- rithm we use the smoothing technique discussed eration of the decomposition of a giant compo- in (Dhillon and Guan, 2003) with α = 106 . nent, we remove the top 0.5% · size cut-vertices For the monolingual graphs, we set σ = 0.65 ranked by degree of connectivity, where size is and discard graphs with more than 20 phrase ver- the number of vertices of the giant component and tices, as they contain mostly noise. Thus, the sizes set θ = 2500 as the stopping criterion. The latter of the graphs allow us to use analytical methods choice is appropriate for the subsequent step of to compute the commute times: For a graph G, co-clustering the components, for both time com- we form the transition matrix P , whose entries plexity and performance of the ITC algorithm. P (u, v) are given by eqn. (7), and the fundamen- 8 10 6 the graph. 
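Concretely, this analytical computation can be sketched in code. The following is a minimal numeric illustration, assuming numpy and a hypothetical 3-vertex weighted graph (not the paper's data), using the fundamental matrix Z = (I − P + 1πᵀ)⁻¹ and the commute-time formula of eqn. (20):

```python
import numpy as np

# Hypothetical symmetric weighted graph on 3 phrase vertices (not the paper's data).
W = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])

P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix, eqn. (7)
pi = W.sum(axis=1) / W.sum()           # stationary distribution: pi^T P = pi^T, pi^T 1 = 1

n = len(W)
# Fundamental matrix Z = (I - P + 1 pi^T)^{-1}
Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), pi))

def commute_time(u, v):
    """kappa(u, v) of eqn. (20): expected round-trip length of a walk u -> v -> u."""
    return (Z[v, v] - Z[u, v]) / pi[v] + (Z[u, u] - Z[v, u]) / pi[u]

print(round(commute_time(0, 1), 6))  # 3.2 = total degree of G times the effective resistance between 0 and 1
```

For an undirected weighted graph such as this toy example, the result agrees with the classical identity "commute time = graph volume × effective resistance", which is a convenient sanity check on the formula.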
Figure 8 depicts the new graph, where the lengths of the edges represent the magnitude 5 10 before merging of commute times. Observe that the quality of after merging 10 4 the probabilities is preserved but the counts are inflated, as required. size 3 10 In general, if a source phrase vertex s has at 10 2 least one translation t such that n(s, t) ≥ 3, then a 1 triplet (is , f (is ), g(is )) is added to the graph as in 10 Figure 8. The inflation vertex is establishes edges 0 10 0 10 10 2 10 4 10 6 with all other phrase and inflation vertices in the rank graph and weights are computed as in Section 3.1. The pipeline remains the same up to eqn. (13), Figure 6: Log-log plot of ranked sub-phrase- where all counts that include inflation vertices are tables according to their size (number of source ignored. and target phrases). f a f b tal matrix (Grinstead and Snell, 2006; Boley et al., a b 2011) Z = (I − P + 1π T )−1 , where I is the iden- tity matrix, 1 denotes the vector of all ones and π g a g b is the vector of stationary probabilities (Aldous and Fill, 2001) which is such that π T P = π T and π T 1 = 1 and can be computed as in (Hunter, na , b = 2.0 p b∣a = .20 2000). The commute time between any vertices u na , f a = 2.6 p  f a∣a = .27 na , g a = 2.6 p g a∣a = .27 and v in G is then given by (Grinstead and Snell, na , f b = 1.3 p  f b∣a = .13 2006) na , g b = 1.3 p g b∣a = .13 κ(u, v) = (Z(v, v) − Z(u, v))/π(v) + Figure 7: Top: A graph with source phrase ver- + (Z(u, u) − Z(v, u))/π(u). (20) tices a and b, both of strength 40, with accom- panying distinct POS sequence vertices f (·) and For the parameter of eqn. (4), an appropriate stem sequence vertices g(·). Bottom: The result- choice is = |Γ(s)| + 1; for reliable neighbor- ing co-occurrence counts and conditional proba- hoods, this quantity is insignificant. POS tags and bilities for a. lemmata are generated with TreeTagger1 . 
Figure 7 depicts the most basic type of graph that can be extracted from a cluster; it includes two source phrase vertices a, b, of different syn- tactic information. Suppose that both a and g a g i a  b are highly reliable with strengths n(a; T ) = f a ia f i a  a n(b; T ) = 40, for some target cluster T . The re- b ib sulting conditional probabilities adequately repre- f b f i b  sent the proximity of the involved vertices. On g b g i b  the other hand, the range of the co-occurrence na , b = 11.3 p b∣a = .22 counts is not compatible with that of the strengths. na , f a = 13.5 p  f a∣a = .26 This is because i) there are no phrase vertices with p g a∣a = .26 na , g a = 13.5 small strength in the graph, and ii) eqn. (9) is es- na , f b = 6.7 p  f b∣a = .13 sentially a comparison between a pair of vertices na , g b = 6.7 p g b∣a = .13 and the rest of the graph. To overcome this prob- lem inflation vertices ia and ib of strength 1 with Figure 8: The inflated version of Figure 7. accompanying feature vertices are introduced to 1 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ 9 4.2 Results Lenient MEP Strict MEP Method Our method generates conditional probabilities @1 @5 @10 @1 @5 @10 for any pair chosen from {phrase, POS sequence, Baseline .58 .47 .41 .43 .33 .28 stem sequence}, but for this evaluation we restrict Graphs .72 .61 .52 .53 .40 .33 ourselves to phrase pairs. For a phrase s, the qual- Table 1: Mean Expected Precision (MEP) at k un- ity of a paraphrase s0 is assessed by der lenient and strict evaluation criteria. P (s0 |s) ∝ p(s0 |s) + p(f1 (s0 )|s) + p(f2 (s0 )|s), (21) by eqns. (15)–(17)) for all vertices u and v, be- where f1 (s0 ) and f2 (s0 ) denote the POS tag se- longs to the power-law family (Figure 9). This is quence and stem sequence of s0 respectively. All evidence that the monolingual graphs can simu- three summands of eqn. 
(21) are computed from late the phrase extraction process of a monolin- eqn. (19). The baseline is given by pivoting (Ban- gual parallel corpus. Intuitively, we may think of nard and Callison-Burch, 2005), X the German side of the DE–EN parallel corpus as P (s0 |s) = p(t|s)p(s0 |t), (22) the ‘English’ approximation to a ‘EN’–EN par- t allel corpus, and the monolingual graphs as the where p(t|s) and p(s0 |t) are the phrase-based rel- word alignment process. ative frequencies of the translation model. 5 We select 150 phrases (an equal number for 10 unigrams, bigrams and trigrams), for which we 4 expect to see paraphrases, and keep the top-10 10 co−occurrence count paraphrases for each phrase, ranked by the above 3 measures. We follow (Kok and Brockett, 2010; 10 Metzler et al., 2011) in the evaluation of the ex- 2 tracted paraphrases: Each phrase-paraphrase pair 10 is manually annotated with the following options: 0) Different meaning; 1) (i) Same meaning, but 1 10 potential replacement of the phrase with the para- phrase in a sentence ruins the grammatical struc- 0 10 0 2 4 6 8 10 10 10 10 10 ture of the sentence. (ii) Tokens of the paraphrase rank are morphological inflections of the phrase’s to- kens. 2) Same meaning. Although useful for SMT Figure 9: Log-log plot of ranked pairs of English purposes, ‘super/substrings of’ are annotated with vertices according to their counts 0 to achieve an objective evaluation. Both methods are evaluated in terms of the Mean Expected Precision (MEP) at k; the Ex- 5 Conclusions & Future Work pected Precision for each selected phrase P s at We have described a new method that harvests rank k is computed by Es [p@k] = k1 ki=1 pi , paraphrases from a bitext, generates artificial where pi is the proportion of positive annotations co-occurrence counts for any pair chosen from for item i. The P desired metric is thus given by 1 {phrase, POS sequence, stem sequence}, and po- MEP@k = 150 s Es [p@k]. 
The contribution tentially identifies patterns for the syntactic infor- to pi can be restricted to perfect paraphrases only, mation of the phrases. The quality of the para- which leads to a strict strategy for harvesting para- phrases’ ranked lists outperforms that of a stan- phrases. Table 1 summarizes the results of our dard baseline. The quality of the resulting condi- evaluation and tional probabilities is promising and will be eval- uated implicitly via an application to SMT. we deduce that our method can lead to improve- This research was funded by the European ments over the baseline. Commission through the CoSyne project FP7- An important accomplishment of our method ICT- 4-248531. is that the distribution of counts n(u, v), (as given 10 References Stanley Kok and Chris Brockett. 2010. Hitting the Right Paraphrases in Good Time. Proc. NAACL, David Aldous and James A. Fill. 2001. Reversible pp.145–153. Markov Chains and Random Walks on Graphs. Roland Kuhn, Boxing Chen, George Foster, and Evan http://www.stat.berkeley.edu/∼aldous/RWG/ Stratford. 2010. Phrase Clustering for Smoothing book.html TM Probabilities: or, how to Extract Paraphrases Ion Androutsopoulos and Prodromos Malakasiotis. from Phrase Tables. Proc. COLING, pp.608–616. 2010. A Survey of Paraphrasing and Textual En- Nitin Madnani and Bonnie Dorr. 2010. Generating tailment Methods. Journal of Artificial Intelligence Phrasal and Sentential Paraphrases: A Survey of Research, 38:135–187. Data-Driven Methods. Computational Linguistics, Colin Bannard and Chris Callison-Burch. 2005. Para- 36(3):341–388. phrasing with Bilingual Parallel Corpora. Proc. Donald Metzler, Eduard Hovy, and Chunliang ACL, pp. 597–604. Zhang. 2011. An Empirical Evaluation of Data- Alain Barrat, Marc Barthlemy, and Alessandro Vespig- Driven Paraphrase Generation Techniques. Proc. nani. 2004. Modeling the Evolution of Weighted ACL:Short Papers, pp. 546–551. Networks. Phys. Rev. Lett., 92. 
Daniel Boley, Gyan Ranjan, and Zhi-Li Zhang. 2011. Commute Times for a Directed Graph using an Asymmetric Laplacian. Linear Algebra and its Applications, Issue 2, pp. 224–242.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. Proc. EMNLP, pp. 196–205.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) Evaluation of Machine Translation. Proc. Workshop on Statistical Machine Translation, pp. 136–158.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. Proc. HLT/NAACL, pp. 17–24.

Inderjit S. Dhillon and Yuqiang Guan. 2003. Information Theoretic Clustering of Sparse Co-Occurrence Data. Proc. IEEE Int'l Conf. Data Mining, pp. 517–520.

Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. 2003. Information-Theoretic Co-clustering. Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 89–98.

William Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proc. COLING, pp. 350–356.

Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation. Proc. EMNLP, pp. 1168–1179.

Charles Grinstead and Laurie Snell. 2006. Introduction to Probability. Second ed., American Mathematical Society.

Jeffrey J. Hunter. 2000. A Survey of Generalized Inverses and their Use in Stochastic Modelling. Res. Lett. Inf. Math. Sci., Vol. 1, pp. 25–36.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press, Cambridge, UK.

Takashi Onishi, Masao Utiyama, and Eiichiro Sumita. 2010. Paraphrase Lattice for Statistical Machine Translation. Proc. ACL: Short Papers, pp. 1–5.

Delip Rao, David Yarowsky, and Chris Callison-Burch. 2008. Affinity Measures based on the Graph Laplacian. Proc. Textgraphs Workshop on Graph-based Algorithms for NLP at COLING, pp. 41–48.

George U. Yule. 1925. A Mathematical Theory of Evolution, based on the Conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. London, B 213, pp. 21–87.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. Proc. ACL, pp. 780–788.

A Bayesian Approach to Unsupervised Semantic Role Induction

Ivan Titov    Alexandre Klementiev
Saarland University, Saarbrücken, Germany
{titov|aklement}@mmci.uni-saarland.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 12–22, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

Abstract

We introduce two Bayesian models for the unsupervised semantic role labeling (SRL) task. The models treat SRL as clustering of syntactic signatures of arguments, with clusters corresponding to semantic roles. The first model induces these clusterings independently for each predicate, exploiting the Chinese Restaurant Process (CRP) as a prior. In a more refined hierarchical model, we inject the intuition that the clusterings are similar across different predicates, even though they are not necessarily identical. This intuition is encoded as a distance-dependent CRP with a distance between two syntactic signatures indicating how likely they are to correspond to a single semantic role. These distances are automatically induced within the model and shared across predicates. Both models achieve state-of-the-art results when evaluated on PropBank, with the coupled model consistently outperforming the factored counterpart in all experimental set-ups.

1 Introduction

Semantic role labeling (SRL) (Gildea and Jurafsky, 2002), a shallow semantic parsing task, has recently attracted a lot of attention in the computational linguistics community (Carreras and Màrquez, 2005; Surdeanu et al., 2008; Hajič et al., 2009). The task involves prediction of predicate-argument structure, i.e. both identification of arguments as well as assignment of labels according to their underlying semantic role. For example, in the following sentences:

(a) [A0 Mary] opened [A1 the door].
(b) [A0 Mary] is expected to open [A1 the door].
(c) [A1 The door] opened.
(d) [A1 The door] was opened [A0 by Mary].

Mary always takes an agent role (A0) for the predicate open, and door is always a patient (A1). SRL representations have many potential applications in natural language processing and have recently been shown to be beneficial in question answering (Shen and Lapata, 2007; Kaisser and Webber, 2007), textual entailment (Sammons et al., 2009), machine translation (Wu and Fung, 2009; Liu and Gildea, 2010; Wu et al., 2011; Gao and Vogel, 2011), and dialogue systems (Basili et al., 2009; van der Plas et al., 2011), among others. Though syntactic representations are often predictive of semantic roles (Levin, 1993), the interface between syntactic and semantic representations is far from trivial. The lack of simple deterministic rules for mapping syntax to shallow semantics motivates the use of statistical methods.

Although current statistical approaches have been successful in predicting shallow semantic representations, they typically require large amounts of annotated data to estimate model parameters. These resources are scarce and expensive to create, and even the largest of them have low coverage (Palmer and Sporleder, 2010). Moreover, these models are domain-specific, and their performance drops substantially when they are used in a new domain (Pradhan et al., 2008). Such domain specificity is arguably unavoidable for a semantic analyzer, as even the definitions of semantic roles are typically predicate-specific, and different domains can have radically different distributions of predicates (and their senses). The necessity for large amounts of human-annotated data for every language and domain is one of the major obstacles to the wide-spread adoption of semantic role representations.

These challenges motivate the need for unsupervised methods which, instead of relying on labeled data, can exploit large amounts of unlabeled texts. In this paper, we propose simple and efficient hierarchical Bayesian models for this task.

It is natural to split the SRL task into two stages: the identification of arguments (the identification stage) and the assignment of semantic roles (the labeling stage). In this and in much of the previous work on unsupervised techniques, the focus is on the labeling stage. Identification, though an important problem, can be tackled with heuristics (Lang and Lapata, 2011a; Grenager and Manning, 2006) or, potentially, by using a supervised classifier trained on a small amount of data. We follow (Lang and Lapata, 2011a) and regard the labeling stage as clustering of syntactic signatures of argument realizations for every predicate. In our first model, as in most of the previous work on unsupervised SRL, we define an independent model for each predicate. We use the Chinese Restaurant Process (CRP) (Ferguson, 1973) as a prior for the clustering of syntactic signatures. The resulting model achieves state-of-the-art results, substantially outperforming previous methods evaluated in the same setting.

In the first model, for each predicate we independently induce a linking between syntax and semantics, encoded as a clustering of syntactic signatures. The clustering implicitly defines the set of permissible alternations, or changes in the syntactic realization of the argument structure of the verb. Though different verbs admit different alternations, some alternations are shared across multiple verbs and are very frequent (e.g., passivization, example sentences (a) vs. (d), or dativization: John gave a book to Mary vs. John gave Mary a book) (Levin, 1993). Therefore, it is natural to assume that the clusterings should be similar, though not identical, across verbs.

Our second model encodes this intuition by replacing the CRP prior for each predicate with a distance-dependent CRP (dd-CRP) prior (Blei and Frazier, 2011) shared across predicates. The distance between two syntactic signatures encodes how likely they are to correspond to a single semantic role. Unlike most of the previous work exploiting distance-dependent CRPs (Blei and Frazier, 2011; Socher et al., 2011; Duan et al., 2007), we do not encode prior or external knowledge in the distance function, but rather induce it automatically within our Bayesian model. The coupled dd-CRP model consistently outperforms the factored CRP counterpart across all the experimental settings (with gold and predicted syntactic parses, and with gold and automatically identified arguments).

Both models admit efficient inference: the estimation time on the Penn Treebank WSJ corpus does not exceed 30 minutes on a single processor, and the inference algorithm is highly parallelizable, reducing inference time down to several minutes on multiple processors. This suggests that the models scale to much larger corpora, which is an important property for a successful unsupervised learning method, as unlabeled data is abundant.

The rest of the paper is structured as follows. Section 2 begins with a definition of the semantic role labeling task and discusses some specifics of the unsupervised setting. In Section 3, we describe CRPs and dd-CRPs, the key components of our models. In Sections 4–6, we describe our factored and coupled models and the inference method. Section 7 provides both evaluation and analysis. Finally, additional related work is presented in Section 8.

2 Task Definition

In this work, instead of assuming the availability of role-annotated data, we rely only on automatically generated syntactic dependency graphs. While we cannot expect that syntactic structure can trivially map to a semantic representation (Palmer et al., 2005),¹ we can use syntactic cues to help us in both stages of unsupervised SRL. Before defining our task, let us consider the two stages separately.

¹ Although it provides a strong baseline which is difficult to beat (Grenager and Manning, 2006; Lang and Lapata, 2010; Lang and Lapata, 2011a).

In the argument identification stage, we implement a heuristic proposed in (Lang and Lapata, 2011a) comprised of a list of 8 rules, which use nonlexicalized properties of syntactic paths between a predicate and a candidate argument to iteratively discard non-arguments from the list of all words in a sentence. Note that inducing these rules for a new language would require some linguistic expertise. One alternative may be to annotate a small number of arguments and train a classifier with nonlexicalized features instead.

In the argument labeling stage, semantic roles are represented by clusters of arguments, and labeling a particular argument corresponds to deciding on its role cluster. However, instead of dealing with argument occurrences directly, we represent them as predicate-specific syntactic signatures, and refer to them as argument keys. This representation aids our models in inducing high-purity clusters (of argument keys) while reducing their granularity. We follow (Lang and Lapata, 2011a) and use the following syntactic features to form the argument key representation:

• Active or passive verb voice (ACT/PASS).
• Argument position relative to predicate (LEFT/RIGHT).
• Syntactic relation to its governor.
• Preposition used for argument realization.

In the example sentences in Section 1, the argument keys for the candidate argument Mary in sentences (a) and (d) would be ACT:LEFT:SBJ and PASS:RIGHT:LGS->by,² respectively. While aiming to increase the purity of argument key clusters, this particular representation will not always produce a good match: e.g. the door in sentence (c) will have the same key as Mary in sentence (a). Increasing the expressiveness of the argument key representation by flagging intransitive constructions would distinguish that pair of arguments. However, we keep this particular representation, in part to compare with the previous work.

² LGS denotes a logical subject in a passive construction (Surdeanu et al., 2008).

In this work, we treat the unsupervised semantic role labeling task as clustering of argument keys. Thus, argument occurrences in the corpus whose keys are clustered together are assigned the same semantic role. Note that some adjunct-like modifier arguments are already explicitly represented in syntax and thus do not need to be clustered (the modifiers AM-TMP, AM-MNR, AM-LOC, and AM-DIR are encoded as 'syntactic' relations TMP, MNR, LOC, and DIR, respectively (Surdeanu et al., 2008)); instead we directly use the syntactic labels as semantic roles.

3 Traditional and Distance-dependent CRPs

The central components of our non-parametric Bayesian models are the Chinese Restaurant Processes (CRPs) and the closely related Dirichlet Processes (DPs) (Ferguson, 1973).

CRPs define probability distributions over partitions of a set of objects. An intuitive metaphor for describing CRPs is the assignment of tables to restaurant customers. Assume a restaurant with a sequence of tables, and customers who walk into the restaurant one at a time and choose a table to join. The first customer to enter is assigned the first table. Suppose that when customer number i enters the restaurant, i − 1 customers are sitting at the tables k ∈ (1, . . . , K) occupied so far. The new customer is then either seated at one of the K tables with probability N_k/(i − 1 + α), where N_k is the number of customers already sitting at table k, or assigned to a new table with probability α/(i − 1 + α). The concentration parameter α encodes the granularity of the drawn partitions: the larger α, the larger the expected number of occupied tables. Though it is convenient to describe the CRP in a sequential manner, the probability of a seating arrangement is invariant of the order of the customers' arrival, i.e. the process is exchangeable. In our factored model, we use CRPs as a prior for clustering argument keys, as we explain in Section 4.

Often the CRP is used as a part of the Dirichlet Process mixture model, where each subset in the partition (each table) selects a parameter (a meal) from some base distribution over parameters. This parameter is then used to generate all data points corresponding to the customers assigned to the table. Dirichlet processes (DPs) are closely connected to CRPs: instead of choosing meals for customers through the described generative story, one can equivalently draw a distribution G over meals from the DP and then draw a meal for every customer from G. We refer the reader to Teh (2010) for details on CRPs and DPs. In our method, we use DPs to model the distributions of arguments for every role.

In order to clarify how similarities between customers can be integrated in the generative process, we start by reformulating the traditional CRP in an equivalent form, so that the distance-dependent CRP (dd-CRP) can be seen as its generalization. Instead of selecting a table for each customer as described above, one can equivalently assume that a customer i chooses one of the previous customers c_i as a partner with probability 1/(i − 1 + α) and sits at the same table, or occupies a new table with probability α/(i − 1 + α). The transitive closure of this seating-with relation determines the partition.

A generalization of this view leads to the definition of the distance-dependent CRP. In dd-CRPs, a customer i chooses a partner c_i = j with probability proportional to some non-negative score d_{i,j} (d_{i,j} = d_{j,i}) which encodes a similarity between the two customers.³ More formally,

    p(c_i = j | D, α) ∝ d_{i,j} if i ≠ j, and ∝ α if i = j,    (1)

where D is the entire similarity graph. This process lacks the exchangeability property of the traditional CRP, but efficient approximate inference with the dd-CRP is possible with Gibbs sampling. For more details on inference with dd-CRPs, we refer the reader to Blei and Frazier (2011).

³ It may be more standard to use a decay function f : R → R and choose a partner with probability proportional to f(−d_{i,j}). However, the two forms are equivalent, and using the scores d_{i,j} directly is more convenient for our induction purposes.

Though in previous work the dd-CRP was used either to encode prior knowledge (Blei and Frazier, 2011) or other external information (Socher et al., 2011), we treat D as a latent variable drawn from some prior distribution over weighted graphs. This view provides a powerful approach for coupling a family of distinct but similar clusterings: the family of clusterings can be drawn by first choosing a similarity graph D for the entire family and then re-using D to generate each of the clusterings independently of each other, as defined by equation (1). In Section 5, we explain how we use this formalism to encode the relatedness between argument key clusterings for different predicates.

4 Factored Model

In this section we describe the factored method, which models each predicate independently. In Section 2 we defined our task as clustering of argument keys, where each cluster corresponds to a semantic role. If an argument key k is assigned to a role r (k ∈ r), all of its occurrences are labeled r.

Our Bayesian model encodes two common assumptions about semantic roles. First, we enforce the selectional restriction assumption: we assume that the distribution over potential argument fillers is sparse for every role, implying that 'peaky' distributions of arguments for each role r are preferred to flat distributions. Second, each role normally appears at most once per predicate occurrence. Our inference will search for a clustering which meets the above requirements to the maximal extent.

Our model associates two distributions with each predicate: one governs the selection of argument fillers for each semantic role, and the other models (and penalizes) duplicate occurrences of roles. Each predicate occurrence is generated independently given these distributions. Let us describe the model by first defining how the set of model parameters and an argument key clustering are drawn, and then explaining the generation of individual predicate and argument instances. The generative story is formally presented in Figure 1.

We start by generating a partition of argument keys B_p, with each subset r ∈ B_p representing a single semantic role. The partitions are drawn from CRP(α) (see the Factored model section of Figure 1) independently for each predicate. The crucial part of the model is the set of selectional preference parameters θ_{p,r}, the distributions of arguments x for each role r of predicate p. We represent arguments by their syntactic heads,⁴ or more specifically, by either their lemmas or the word clusters assigned to the head by an external clustering algorithm, as we will discuss in more detail in Section 7.⁵ For the agent role A0 of the predicate open, for example, this distribution would assign most of the probability mass to arguments denoting sentient beings, whereas the distribution for the patient role A1 would concentrate on arguments representing "openable" things (doors, boxes, books, etc).

⁴ For prepositional phrases, we take as head the head noun of the object noun phrase, as it encodes crucial lexical information. However, the preposition is not ignored but rather encoded in the corresponding argument key, as explained in Section 2.
⁵ Alternatively, the clustering of arguments could be induced within the model, as done in (Titov and Klementiev, 2011).

In order to encode the assumption about the sparseness of the distributions θ_{p,r}, we draw them from the DP prior DP(β, H^(A)) with a small concentration parameter β; the base probability distribution H^(A) is just the normalized frequencies of arguments in the corpus. The geometric distribution ψ_{p,r} is used to model the number of times a role r appears with a given predicate occurrence. The decision whether to generate at least one role r is drawn from the uniform Bernoulli distribution. If 0 is drawn, then the semantic role is not realized for the given occurrence; otherwise, the number of additional roles r is drawn from the geometric distribution Geom(ψ_{p,r}). The Beta priors over ψ

Figure 1 (fragment): Clustering of argument keys. Factored model: for each predicate p = 1, 2, . . .

alternations for a predicate. E.g., passivization can be roughly represented with the clustering of the key
: ACT:LEFT:SBJ with PASS:RIGHT:LGS->by Bp ∼ CRP (α) [partition of arg keys] and ACT:RIGHT:OBJ with PASS:LEFT:SBJ. Coupled model: The set of permissible alternations is predicate- D ∼ N onInf orm [similarity graph] specific,6 but nevertheless they arguably repre- for each predicate p = 1, 2, . . . : sent a small subset of all clusterings of argu- Bp ∼ dd-CRP (α, D) [partition of arg keys] ment keys. Also, some alternations are more likely to be applicable to a verb than others: for Parameters: example, passivization and dativization alterna- for each predicate p = 1, 2, . . . : tions are both fairly frequent, whereas, locative- for each role r ∈ Bp : θp,r ∼ DP (β, H (A) ) [distrib of arg fillers] preposition-drop alternation (Mary climbed up the ψp,r ∼ Beta(η0 , η1 ) [geom distr for dup roles] mountain vs. Mary climbed the mountain) is less common and applicable only to several classes Data Generation: of predicates representing motion (Levin, 1993). for each predicate p = 1, 2, . . . : We represent this observation by quantifying how for each occurrence l of p: likely a pair of keys is to be clustered. These for every role r ∈ Bp : scores (di,j for every pair of argument keys i and if [n ∼ U nif (0, 1)] = 1: [role appears at least once] j) are induced automatically within the model, GenArgument(p, r) [draw one arg] while [n ∼ ψp,r ] = 1: [continue generation] and treated as latent variables shared across pred- GenArgument(p, r) [draw more args] icates. Intuitively, if data for several predicates strongly suggests that two argument keys should GenArgument(p, r): kp,r ∼ U nif (1, . . . , |r|) [draw arg key] be clustered (e.g., there is a large overlap be- xp,r ∼ θp,r [draw arg filler] tween argument fillers for the two keys) then the posterior will indicate that di,j is expected to be greater for the pair {i, j} than for some other pair Figure 1: Generative stories for the factored and cou- {i0 , j 0 } for which the evidence is less clear. Con- pled models. 
sequently, argument keys i and j will be clustered even for predicates without strong evidence for can indicate the preference towards generating at such a clustering, whereas i0 and j 0 will not. most one argument for each role. For example, One argument against coupling predicates may it would express the preference that a predicate stem from the fact that we are using unlabeled open typically appears with a single agent and a data and may be able to obtain sufficient amount single patient arguments. of learning material even for less frequent pred- Now, when parameters and argument key clus- icates. This may be a valid observation, but an- terings are chosen, we can summarize the re- other rationale for sharing this similarity structure mainder of the generative story as follows. We is the hypothesis that alternations may be easier begin by independently drawing occurrences for to detect for some predicates than for others. For each predicate. For each predicate role we in- example, argument key clustering of predicates dependently decide on the number of role occur- with very restrictive selectional restrictions on ar- rences. Then we generate each of the arguments gument fillers is presumably easier than clustering (see GenArgument) by generating an argument for predicates with less restrictive and overlap- key kp,r uniformly from the set of argument keys ping selectional restriction, as compactness of se- assigned to the cluster r, and finally choosing its lectional preferences is a central assumption driv- filler xp,r , where the filler is either a lemma or a ing unsupervised learning of semantic roles. E.g., word cluster corresponding to the syntactic head predicates change and defrost belong to the same of the argument. Levin class (change-of-state verbs) and therefore admit similar alternations. 
However, the set of po- 5 Coupled Model tential patients of defrost is sufficiently restricted, As we argued in Section 1, clusterings of argu- 6 Or, at least specific to a class of predicates (Levin, ment keys implicitly encode the pattern of alter- 1993). 16 whereas the selectional restrictions for the patient key implies some computations for all its occur- of change are far less specific and they overlap rences in the corpus. Instead of more complex with selectional restrictions for the agent role, fur- MAP search algorithms (see, e.g., (Daume III, ther complicating the clustering induction task. 2007)), we use a greedy procedure where we start This observation suggests that sharing clustering with each argument key assigned to an individual preferences across verbs is likely to help even if cluster, and then iteratively try to merge clusters. the unlabeled data is plentiful for every predicate. Each move involves (1) choosing an argument key More formally, we generate scores di,j , or and (2) deciding on a cluster to reassign it to. This equivalently, the full labeled graph D with ver- is done by considering all clusters (including cre- tices corresponding to argument keys and edges ating a new one) and choosing the most probable weighted with the similarity scores, from a prior. one. In our experiments we use a non-informative prior Instead of choosing argument keys randomly at which factorizes over pairs (i.e. edges of the the first stage, we order them by corpus frequency. graph D), though more powerful alternatives can This ordering is beneficial as getting clustering be considered. Then we use it, in a dd-CRP(α, right for frequent argument keys is more impor- D), to generate clusterings of argument keys for tant and the corresponding decisions should be every predicate. The rest of the generative story is made earlier.7 We used a single iteration in our the same as for the factored model. 
The part rele- experiments, as we have not noticed any benefit vant to this model is shown in the Coupled model from using multiple iterations. section of Figure 1. 6.2 Similarity Graph Induction Note that this approach does not assume that the frequencies of syntactic patterns correspond- In the coupled model, clusterings for different ing to alternations are similar, and a large value predicates are statistically dependent, as the simi- for di,j does not necessarily mean that the corre- larity structure D is latent and shared across pred- sponding syntactic frames i and j are very fre- icates. Consequently, a more complex inference quent in a corpus. What it indicates is that a large procedure is needed. For simplicity here and in number of different predicates undergo the corre- our experiments, we use the non-informative prior sponding alternation; the frequency of the alterna- distribution over D which assigns the same prior tion is a different matter. We believe that this is an probability to every possible weight di,j for every important point, as we do not make a restricting pair {i, j}. assumption that an alternation has the same dis- Recall that the dd-CRP prior is defined in terms tributional properties for all verbs which undergo of customers choosing other customers to sit with. this alternation. For the moment, let us assume that this relation among argument keys is known, that is, every ar- 6 Inference gument key k for predicate p has chosen an argu- ment key cp,k to ‘sit’ with. We can compute the An inference algorithm for an unsupervised MAP estimate for all di,j by maximizing the ob- model should be efficient enough to handle vast jective: amounts of unlabeled data, as it can easily be ob- X X dk,cp,k tained and is likely to improve results. We use arg max log P , di,j , i6=j p k0 ∈K p dk,k0 a simple approximate inference algorithm based k∈K p on greedy MAP search. 
We start by discussing where K p is the set of all argument keys for the MAP search for argument key clustering with the predicate p. We slightly abuse the notation by us- factored model and then discuss its extension ap- ing di,i to denote the concentration parameter α plicable to the coupled model. in the previous expression. Note that we also as- sume that similarities are symmetric, di,j = dj,i . 6.1 Role Induction If the set of argument keys K p would be the same For the factored model, semantic roles for every for every predicate, then the optimal di,j would predicate are induced independently. Neverthe- 7 This idea has been explored before for shallow semantic less, search for a MAP clustering can be expen- representations (Lang and Lapata, 2011a; Titov and Klemen- sive, as even a move involving a single argument tiev, 2011). 17 be proportional to the number of times either i se- rior with Naive Bayes tends to be overconfident lects j as a partner, or j chooses i as a partner.8 due to violated conditional independence assump- This no longer holds if the sets are different, but tions (Rennie, 2001). The same behavior is ob- the solution can be found efficiently using a nu- served here: the shared prior does not have suf- meric optimization strategy; we use the gradient ficient effect on frequent predicates.10 Though descent algorithm. different techniques have been developed to dis- We do not learn the concentration parameter count the over-confidence (Kolcz and Chowdhury, α, as it is used in our model to indicate the de- 2005), we use the most basic one: we raise the sired granularity of semantic roles, but instead likelihood term in power T1 , where the parameter only learn di,j (i 6= j). However, just learning T is chosen empirically. 
the concentration parameter would not be suffi- cient as the effective concentration can be reduced 7 Empirical Evaluation or increased arbitrarily by scaling all the similar- 7.1 Data and Evaluation ities di,j (i 6= j) at once, as follows from expres- We keep the general setup of (Lang and Lapata, sion (1). Instead, we enforce the normalization 2011a), to evaluate our models and compare them constraint on the similarities di,j . We ensure that to the current state of the art. We run all of our the prior probability of choosing itself as a part- experiments on the standard CoNLL 2008 shared ner, averaged over predicates, is the same as it task (Surdeanu et al., 2008) version of Penn Tree- would be with uniform di,j (di,j = 1 for every bank WSJ and PropBank. In addition to gold key pair {i, j}, i 6= j). This roughly says that dependency analyses and gold PropBank annota- we want to preserve the same granularity of clus- tions, it has dependency structures generated au- tering as it was with the uniform similarities. We tomatically by the MaltParser (Nivre et al., 2007). accomplish this normalization in a post-hoc fash- We vary our experimental setup as follows: ion P by P dividing the weightsPafter optimization by • We evaluate our models on gold and auto- p k,k0 ∈K p , k0 6=k dk,k0 / p |K p |(|K p | − 1). matically generated parses, and use either If D is fixed, partners for every predicate p and gold PropBank annotations or the heuristic every k can be found using virtually the same al- from Section 2 to identify arguments, result- gorithm as in Section 6.1: the only difference is ing in four experimental regimes. that, instead of a cluster, each argument key itera- tively chooses a partner. 
• In order to reduce the sparsity of predicate Though, in practice, both the choice of partners argument fillers we consider replacing lem- and the similarity graphs are latent, we can use an mas of their syntactic heads with word clus- iterative approach to obtain a joint MAP estimate ters induced by a clustering algorithm as a of ck (for every k) and the similarity graph D by preprocessing step. In particular, we use alternating the two steps.9 Brown (Br) clustering (Brown et al., 1992) Notice that the resulting algorithm is again induced over RCV1 corpus (Turian et al., highly parallelizable: the graph induction stage 2010). Although the clustering is hierarchi- is fast, and induction of the seat-with relation cal, we only use a cluster at the lowest level (i.e. clustering argument keys) is factorizable over of the hierarchy for each word. predicates. We use the purity (PU) and collocation (CO) met- One shortcoming of this approach is typical rics as well as their harmonic mean (F1) to mea- for generative models with multiple ‘features’: sure the quality of the resulting clusters. Purity when such a model predicts a latent variable, it measures the degree to which each cluster con- tends to ignore the prior class distribution and re- tains arguments sharing the same gold role: lies solely on features. This behavior is due to 1 X PU = max |Gj ∩ Ci | the over-simplifying independence assumptions. N j i It is well known, for instance, that the poste- where if Ci is the set of arguments in the i-th in- 8 Note that weights di,j are invariant under rescaling duced cluster, Gj is the set of arguments in the jth when the rescaling is also applied to the concentration pa- 10 rameter α. The coupled model without discounting still outper- 9 In practice, two iterations were sufficient. forms the factored counterpart in our experiments. 18 gold cluster, and N is the total number of argu- gold parses auto parses ments. 
Collocation evaluates the degree to which PU CO F1 PU CO F1 LLogistic 79.5 76.5 78.0 77.9 74.4 76.2 arguments with the same gold roles are assigned SplitMerge 88.7 73.0 80.1 86.5 69.8 77.3 to a single cluster. It is computed as follows: GraphPart 88.6 70.7 78.6 87.4 65.9 75.2 1 X Factored 88.1 77.1 82.2 85.1 71.8 77.9 CO = max |Gj ∩ Ci | N i Coupled 89.3 76.6 82.5 86.7 71.2 78.2 j Factored+Br 86.8 78.8 82.6 83.8 74.1 78.6 We compute the aggregate PU, CO, and F1 Coupled+Br 88.7 78.1 83.0 86.2 72.7 78.8 scores over all predicates in the same way as SyntF 81.6 77.5 79.5 77.1 70.9 73.9 (Lang and Lapata, 2011a) by weighting the scores Table 1: Argument clustering performance with gold of each predicate by the number of its argument argument identification. Bold-face is used to highlight occurrences. Note that since our goal is to evalu- the best F1 scores. ate the clustering algorithms, we do not include incorrectly identified arguments (i.e. mistakes tion stage, and minimize the noise due to auto- made by the heuristic defined in Section 2) when matic syntactic annotations. All four variants of computing these metrics. the models we propose substantially outperform We evaluate both factored and coupled models other models: the coupled model with Brown proposed in this work with and without Brown clustering of argument fillers (Coupled+Br) beats word clustering of argument fillers (Factored, the previous best model SplitMerge by 2.9% F1 Coupled, Factored+Br, Coupled+Br). Our mod- score. As mentioned in Section 2, our approach els are robust to parameter settings, they were specifically does not cluster some of the modifier tuned (to an order of magnitude) on the develop- arguments. In order to verify that this and argu- ment set and were the same for all model variants: ment filler clustering were not the only aspects α = 1.e-3, β = 1.e-3, η0 = 1.e-3, η1 = 1.e-10, of our approach contributing to performance im- T = 5. 
Although they can be induced within the provements, we also evaluated our coupled model model, we set them by hand to indicate granular- without Brown clustering and treating modifiers ity preferences. We compare our results with the as regular arguments. The model achieves 89.2% following alternative approaches. The syntactic purity, 74.0% collocation, and 80.9% F1 scores, function baseline (SyntF) simply clusters predi- still substantially outperforming all of the alter- cate arguments according to the dependency re- native approaches. Replacing gold parses with lation to their head. Following (Lang and Lapata, MaltParser analyses we see a similar trend, where 2010), we allocate a cluster for each of 20 most Coupled+Br outperforms the best alternative ap- frequent relations in the CoNLL dataset and one proach SplitMerge by 1.5%. cluster for all other relations. We also compare 7.2.2 Automatic Arguments our performance with the Latent Logistic classifi- Results are summarized in Table 2.11 The cation (Lang and Lapata, 2010), Split-Merge clus- precision and recall of our re-implementation of tering (Lang and Lapata, 2011a), and Graph Parti- the argument identification heuristic described in tioning (Lang and Lapata, 2011b) approaches (la- Section 2 on gold parses were 87.7% and 88.0%, beled LLogistic, SplitMerge, and GraphPart, re- respectively, and do not quite match 88.1% and spectively) which achieve the current best unsu- 87.9% reported in (Lang and Lapata, 2011a). pervised SRL results in this setting. Since we could not reproduce their argument 7.2 Results identification stage exactly, we are omitting their results for the two regimes, instead including the 7.2.1 Gold Arguments results for our two best models Factored+Br and Experimental results are summarized in Ta- Coupled+Br. We see a similar trend, where the ble 1. 
We begin by comparing our models to the coupled system consistently outperforms its fac- three existing clustering approaches on gold syn- tored counterpart, achieving 85.8% and 83.9% F1 tactic parses, and using gold PropBank annota- 11 Note, that the scores are computed on correctly iden- tions to identify predicate arguments. In this set of tified arguments only, and tend to be higher in these ex- experiments we measure the relative performance periments probably because the complex arguments get dis- of argument clustering, removing the identifica- carded by the heuristic. 19 gold parses auto parses balizations of relations (Lin and Pantel, 2001; PU CO F1 PU CO F1 Banko et al., 2007). Early unsupervised ap- Factored+Br 87.8 82.9 85.3 85.8 81.1 83.4 proaches to the SRL problem include the work Coupled+Br 89.2 82.6 85.8 87.4 80.7 83.9 by Swier and Stevenson (2004), where the Verb- SyntF 83.5 81.4 82.4 81.4 79.1 80.2 Net verb lexicon was used to guide unsupervised Table 2: Argument clustering performance with auto- learning, and a generative model of Grenager and matic argument identification. Manning (2006) which exploits linguistic priors on syntactic-semantic interface. for gold and MaltParser analyses, respectively. More recently, the role induction problem has We observe that consistently through the four been studied in Lang and Lapata (2010) where regimes, sharing of alternations between predi- it has been reformulated as a problem of detect- cates captured by the coupled model outperforms ing alterations and mapping non-standard link- the factored version, and that reducing the argu- ings to the canonical ones. Later, Lang and La- ment filler sparsity with clustering also has a sub- pata (2011a) proposed an algorithmic approach stantial positive effect. 
Due to the space con- to clustering argument signatures which achieves straints we are not able to present detailed anal- higher accuracy and outperforms the syntactic ysis of the induced similarity graph D, however, baseline. In Lang and Lapata (2011b), the role argument-key pairs with the highest induced sim- induction problem is formulated as a graph parti- ilarity encode, among other things, passivization, tioning problem: each vertex in the graph corre- benefactive alternations, near-interchangeability sponds to a predicate occurrence and edges repre- of some subordinating conjunctions and preposi- sent lexical and syntactic similarities between the tions (e.g., if and whether), as well as, restoring occurrences. Unsupervised induction of seman- some of the unnecessary splits introduced by the tics has also been studied in Poon and Domin- argument key definition (e.g., semantic roles for gos (2009) and Titov and Klementiev (2010) but adverbials do not normally depend on whether the the induced representations are not entirely com- construction is passive or active). patible with the PropBank-style annotations and they have been evaluated only on a question an- 8 Related Work swering task for the biomedical domain. Also, the Most of SRL research has focused on the super- related task of unsupervised argument identifica- vised setting (Carreras and M`arquez, 2005; Sur- tion was considered in Abend et al. (2009). deanu et al., 2008), however, lack of annotated re- sources for most languages and insufficient cover- 9 Conclusions age provided by the existing resources motivates In this work we introduced two Bayesian models the need for using unlabeled data or other forms for unsupervised role induction. They treat the of weak supervision. This work includes methods task as a family of related clustering problems, based on graph alignment between labeled and one for each predicate. 
The first factored model unlabeled data (F¨urstenau and Lapata, 2009), us- induces each clustering independently, whereas ing unlabeled data to improve lexical generaliza- the second model couples them by exploiting a tion (Deschacht and Moens, 2009), and projection novel technique for sharing clustering preferences of annotation across languages (Pado and Lapata, across a family of clusterings. Both methods 2009; van der Plas et al., 2011). Semi-supervised achieve state-of-the-art results with the coupled and weakly-supervised techniques have also been model outperforming the factored counterpart in explored for other types of semantic representa- all regimes. tions but these studies have mostly focused on re- stricted domains (Kate and Mooney, 2007; Liang Acknowledgements et al., 2009; Titov and Kozhevnikov, 2010; Gold- The authors acknowledge the support of the MMCI wasser et al., 2011; Liang et al., 2011). Cluster of Excellence, and thank Hagen F¨urstenau, Unsupervised learning has been one of the cen- Mikhail Kozhevnikov, Alexis Palmer, Manfred Pinkal, tral paradigms for the closely-related area of re- Caroline Sporleder and the anonymous reviewers for lation extraction, where several techniques have their suggestions, and Joel Lang for answering ques- been proposed to cluster semantically similar ver- tions about their methods and data. 20 References Michael Kaisser and Bonnie Webber. 2007. Question answering based on semantic roles. In ACL Work- Omri Abend, Roi Reichart, and Ari Rappoport. 2009. shop on Deep Linguistic Processing. Unsupervised argument identification for semantic Rohit J. Kate and Raymond J. Mooney. 2007. Learn- role labeling. In ACL-IJCNLP. ing language semantics from ambigous supervision. Michele Banko, Michael J Cafarella, Stephen Soder- In AAAI. land, Matt Broadhead, and Oren Etzioni. 2007. Aleksander Kolcz and Abdur Chowdhury. 2005. Dis- Open information extraction from the web. 
In IJ- counting over-confidence of naive bayes in high- CAI. recall text classification. In ECML. Roberto Basili, Diego De Cao, Danilo Croce, Joel Lang and Mirella Lapata. 2010. Unsupervised Bonaventura Coppola, and Alessandro Moschitti. induction of semantic roles. In ACL. 2009. Cross-language frame semantics transfer in bilingual corpora. In CICLING. Joel Lang and Mirella Lapata. 2011a. Unsupervised David M. Blei and Peter Frazier. 2011. Distance de- semantic role induction via split-merge clustering. pendent chinese restaurant processes. Journal of In ACL. Machine Learning Research, 12:2461–2488. Joel Lang and Mirella Lapata. 2011b. Unsupervised Peter F. Brown, Vincent Della Pietra, Peter V. deSouza, semantic role induction with graph partitioning. In Jenifer C. Lai, and Robert L. Mercer. 1992. Class- EMNLP. based n-gram models for natural language. Compu- Beth Levin. 1993. English Verb Classes and Alter- tational Linguistics, 18(4):467–479. nations: A Preliminary Investigation. University of Xavier Carreras and Llu´ıs M`arquez. 2005. Intro- Chicago Press. duction to the CoNLL-2005 Shared Task: Semantic Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Role Labeling. In CoNLL. Learning semantic correspondences with less super- Hal Daume III. 2007. Fast search for dirichlet process vision. In ACL-IJCNLP. mixture models. In AISTATS. Percy Liang, Michael Jordan, and Dan Klein. 2011. Koen Deschacht and Marie-Francine Moens. 2009. Learning dependency-based compositional seman- Semi-supervised semantic role labeling using the tics. In ACL: HLT. Latent Words Language Model. In EMNLP. Dekang Lin and Patrick Pantel. 2001. DIRT – discov- Jason Duan, Michele Guindani, and Alan Gelfand. ery of inference rules from text. In KDD. 2007. Generalized spatial dirichlet process models. Ding Liu and Daniel Gildea. 2010. Semantic role fea- Biometrika, 94:809–825. tures for machine translation. In Coling. Thomas S. Ferguson. 1973. A Bayesian analysis J. Nivre, J. Hall, S. 
K¨ubler, R. McDonald, J. Nilsson, of some nonparametric problems. The Annals of S. Riedel, and D. Yuret. 2007. The CoNLL 2007 Statistics, 1(2):209–230. shared task on dependency parsing. In EMNLP- Hagen F¨urstenau and Mirella Lapata. 2009. Graph CoNLL. alignment for semi-supervised semantic role label- Sebastian Pado and Mirella Lapata. 2009. Cross- ing. In EMNLP. lingual annotation projection for semantic roles. Qin Gao and Stephan Vogel. 2011. Corpus expansion Journal of Artificial Intelligence Research, 36:307– for statistical machine translation with semantic role 340. label substitution rules. In ACL:HLT. Alexis Palmer and Caroline Sporleder. 2010. Evalu- Daniel Gildea and Daniel Jurafsky. 2002. Automatic ating FrameNet-style semantic parsing: the role of labelling of semantic roles. Computational Linguis- coverage gaps in FrameNet. In COLING. tics, 28(3):245–288. M. Palmer, D. Gildea, and P. Kingsbury. 2005. The Dan Goldwasser, Roi Reichart, James Clarke, and Dan proposition bank: An annotated corpus of semantic Roth. 2011. Confidence driven unsupervised se- roles. Computational Linguistics, 31(1):71–106. mantic parsing. In ACL. Hoifung Poon and Pedro Domingos. 2009. Unsuper- Trond Grenager and Christoph Manning. 2006. Unsu- vised semantic parsing. In EMNLP. pervised discovery of a statistical verb lexicon. In Sameer Pradhan, Wayne Ward, and James H. Martin. EMNLP. 2008. Towards robust semantic role labeling. Com- Jan Hajiˇc, Massimiliano Ciaramita, Richard Johans- putational Linguistics, 34:289–310. son, Daisuke Kawahara, Maria Ant`onia Mart´ı, Llu´ıs Jason Rennie. 2001. Improving multi-class text M`arquez, Adam Meyers, Joakim Nivre, Sebastian classification with Naive bayes. Technical Report ˇ ep´anek, Pavel Straˇna´ k, Mihai Surdeanu, Pad´o, Jan Stˇ AITR-2001-004, MIT. Nianwen Xue, and Yi Zhang. 2009. The CoNLL- M. Sammons, V. Vydiswaran, T. Vieira, N. Johri, 2009 shared task: Syntactic and semantic depen- M. Chang, D. Goldwasser, V. Srikumar, G. 
Entailment above the word level in distributional semantics

Marco Baroni    Ngoc-Quynh Do    Chung-chieh Shan    Raffaella Bernardi
Free University of Bozen-Bolzano, Cornell University, University of Trento, University of Tsukuba

[email protected]
[email protected]  [email protected]
Abstract

We introduce two ways to detect entailment using distributional semantic representations of phrases. Our first experiment shows that the entailment relation between adjective-noun constructions and their head nouns (big cat |= cat), once represented as semantic vector pairs, generalizes to lexical entailment among nouns (dog |= animal). Our second experiment shows that a classifier fed semantic vector pairs can similarly generalize the entailment relation among quantifier phrases (many dogs |= some dogs) to entailment involving unseen quantifiers (all cats |= several cats). Moreover, nominal and quantifier phrase entailment appears to be cued by different distributional correlates, as predicted by the type-based view of entailment in formal semantics.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23-32, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Distributional semantics (DS) approximates linguistic meaning with vectors summarizing the contexts where expressions occur. The success of DS in lexical semantics has validated the hypothesis that semantically similar expressions occur in similar contexts (Landauer and Dumais, 1997; Lund and Burgess, 1996; Sahlgren, 2006; Schütze, 1997; Turney and Pantel, 2010). Formal semantics (FS) represents linguistic meanings as symbolic formulas and assembles them via composition rules. FS has successfully modeled quantification and captured inferential relations between phrases and between sentences (Montague, 1970; Thomason, 1974; Heim and Kratzer, 1998). The strengths of DS and FS have been complementary to date: On one hand, DS has induced large-scale semantic representations from corpora, but it has been largely limited to the lexical domain. On the other hand, FS has provided sophisticated models of sentence meaning, but it has been largely limited to hand-coded models that do not scale up to real-life challenges by learning from data.

Given these complementary strengths, we naturally ask whether DS and FS can address each other's limitations. Two recent strands of research are bringing DS closer to meeting core FS challenges. One strand attempts to model compositionality with DS methods, representing both primitive and composed linguistic expressions as distributional vectors (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Guevara, 2010; Mitchell and Lapata, 2010). The other strand attempts to reformulate FS's notion of logical inference in terms that DS can capture (Erk, 2009; Geffet and Dagan, 2005; Kotlerman et al., 2010; Zhitomirsky-Geffet and Dagan, 2010). In keeping with the lexical emphasis of DS, this strand has focused on inference at the word level, or lexical entailment, that is, discovering from the distributional vectors of hyponyms (dog) that they entail their hypernyms (animal).

This paper brings these two strands of research together by demonstrating two ways in which the distributional vectors of composite expressions bear on inference. Here we focus on phrasal vectors harvested directly from the corpus rather than obtained compositionally. In a first experiment, we exploit the entailment properties of a class of composite expressions, namely adjective-noun constructions (ANs), to harvest training data for an entailment recognizer. The recognizer is then successfully applied to detect lexical entailment. In short, since almost all ANs entail the noun they contain (red car entails car), the distributional vectors of AN-N pairs can train a classifier to detect noun pairs that stand in the same relation (dog entails animal). With almost no manual effort, we achieve performance nearly identical to that of the state-of-the-art balAPinc measure that Kotlerman et al. (2010) crafted, which detects feature inclusion between the two nouns' occurrence contexts.

Our second experiment goes beyond lexical inference. We look at phrases built from a quantifying determiner[1] and a noun (QNs) and use their distributional vectors to recognize entailment relations of the form many dogs |= some dogs, between two QNs sharing the same noun. It turns out that a classifier trained on a set of Q1N |= Q2N pairs can recognize entailment in pairs with a new quantifier configuration. For example, we can train on many dogs |= some dogs and then correctly predict all cats |= several cats. Interestingly, on the QN entailment task, neither our classifier trained on AN-N pairs nor the balAPinc method beats baseline methods. This suggests that our successful QN classifiers tap into vector properties beyond such relations as feature inclusion that those methods for nominal entailment rely upon.

Together, our experiments show that corpus-harvested DS representations of composite expressions such as ANs and QNs contain sufficient information to capture and generalize their inference patterns. This result brings DS closer to the central concerns of FS. In particular, the QN study is the first to our knowledge to show that DS vectors capture semantic properties not only of content words, but of an important class of function words (quantifying determiners) deeply studied in FS but of little interest until now in DS.

Besides these theoretical implications, our results are of practical import. First, our AN study presents a novel, practical method for detecting lexical entailment that reaches state-of-the-art performance with little or no manual intervention. Lexical entailment is in turn fundamental for constructing ontologies and other lexical resources (Buitelaar and Cimiano, 2008). Second, our QN study demonstrates that phrasal entailment can be automatically detected and thus paves the way to applying DS to advanced NLP tasks such as recognizing textual entailment (Dagan et al., 2009).

[1] In the sequel we will simply refer to a "quantifying determiner" as a "quantifier".

2 Background

2.1 Distributional semantics above the word level

DS models such as LSA (Landauer and Dumais, 1997) and HAL (Lund and Burgess, 1996) approximate the meaning of a word by a vector that summarizes its distribution in a corpus, for example by counting co-occurrences of the word with other words. Since semantically similar words tend to share similar contexts, DS has been very successful in tasks that require quantifying semantic similarity among words, such as synonym detection and concept clustering (Turney and Pantel, 2010).

Recently, there has been a flurry of interest in using DS to model meaning composition: How can we derive the DS representation of a composite phrase from that of its constituents? Although the general focus in the area is on performing algebraic operations on word semantic vectors (Mitchell and Lapata, 2010), some researchers have also directly examined the corpus contexts of phrases. For example, Baldwin et al. (2003) studied vector extraction for phrases because they were interested in the decomposability of multiword expressions. Baroni and Zamparelli (2010) and Guevara (2010) look at corpus-harvested phrase vectors to learn composition functions that should derive such composite vectors automatically. Baroni and Zamparelli, in particular, showed qualitatively that directly corpus-harvested vectors for AN constructions are meaningful; for example, the vector of young husband has nearest neighbors small son, small daughter and mistress. Following up on this approach, we show here quantitatively that corpus-harvested AN vectors are also useful for detecting entailment. We find, moreover, that distributional vectors are informative and useful not only for phrases made of content words (such as ANs) but also for phrases containing functional elements, namely quantifying determiners.

2.2 Entailment from formal to distributional semantics

Entailment in FS  To characterize the conditions under which a sentence is true, FS begins with the lexical meanings of the words in the sentence and builds up the meanings of larger and larger phrases until it arrives at the meaning of the whole sentence. The meanings throughout this compositional process inhabit a variety of semantic domains, depending on the syntactic category of the expressions: typically, a sentence denotes a truth value (true or false) or truth conditions, a noun such as cat denotes a set of entities, and a quantifier phrase (QP) such as all cats denotes a set of sets of entities.

The entailment relation (|=) is a core notion of logic: it holds between one or more sentences and a sentence such that it cannot be that the former (the antecedent) are true and the latter (the consequent) is false. FS extends this notion from formal-logic sentences to natural-language expressions. By assigning meanings to parts of a sentence, FS allows defining entailment not only among sentences but also among words and phrases. Each semantic domain A has its own entailment relation |=A. The entailment relation |=S among sentences is the logical notion just described, whereas the entailment relations |=N and |=QP among nouns and quantifier phrases are the inclusion relations among sets of entities and among sets of sets of entities, respectively. Our results in Section 5 show that DS needs to treat |=N and |=QP differently as well.

Empirical, corpus-based perspectives on entailment  Until recently, the corpus-based research tradition has studied entailment mostly at the word level, with applied goals such as classifying lexical relations and building taxonomic WordNet-like resources automatically. The most popular approach, first adopted by Hearst (1992), extracts lexical relations from patterns in large corpora. For instance, from the pattern "N1 such as N2" one learns that N2 |= N1 (from insects such as beetles, derive beetles |= insects). Several studies have refined and extended this approach (Pantel and Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Turney, 2008).

While empirically very successful, the pattern-based method is mostly limited to single content words (or frequent content-word phrases). We are interested in entailment between phrases, where it is not obvious how to use lexico-syntactic patterns and cope with data sparsity. For instance, it seems hard to find a pattern that frequently connects one QP to another it entails, as in "all beetles PATTERN many beetles". Hence, we aim to find a more general method and investigate whether DS vectors (whether corpus-harvested or compositionally derived) encode the information needed to account for phrasal entailment in a way that can be captured and generalized to unseen phrase pairs.

Rather recently, the study of sentential entailment has taken an empirical turn, thanks to the development of benchmarks for entailment systems. The FS definition of entailment has been modified by taking common sense into account. Instead of requiring that the truth of the antecedent guarantee the truth of the consequent in any circumstance, the applied view looks at entailment in terms of plausibility: φ |= ψ if a human who reads (and trusts) φ would most likely infer that ψ is also true. Entailment systems have been compared under this new perspective in various evaluation campaigns, the best known being the Recognizing Textual Entailment (RTE) initiative (Dagan et al., 2009). Most RTE systems are based on advanced NLP components, machine learning techniques, and/or syntactic transformations (Zanzotto et al., 2007; Kouleykov and Magnini, 2005). A few systems exploit deep FS analysis (Bos and Markert, 2006; Chambers et al., 2007). In particular, the FS results about QP properties that affect entailment have been exploited by Chambers et al., who complement a core broad-coverage system with a Natural Logic module to trade lower recall for higher precision. For instance, they exploit the monotonicity properties of no, which cause the following reversal in entailment direction: some beetles |= some insects but no insects ⊭ no beetles.

To investigate entailment step by step, we address here a much simpler and clearer type of entailment than the more complex notion taken up by the RTE community. While RTE is outside our present scope, we do focus on QP entailment, as Natural Logic does. However, our evaluation differs from Chambers et al.'s, since we rely on general-purpose DS vectors as our only resource, and we look at phrase pairs with different quantifiers but the same noun. For instance, we aim to predict that all beetles |= many beetles but few beetles ⊭ all beetles. QPs, of course, have many well-known semantic properties besides entailment; we leave their analysis to future study.

Entailment in DS  Erk (2009) suggests that it may not be possible to induce lexical entailment directly from a vector space representation, but it is possible to encode the relation in this space after it has been derived through other means. On the other hand, recent studies (Geffet and Dagan, 2005; Kotlerman et al., 2010; Weeds et al., 2004) have pursued the intuition that entailment is the asymmetric ability of one term to "substitute" for another. For example, baseball contexts are also sport contexts but not vice versa; hence baseball is "narrower" than sport and baseball |= sport. On this view, entailment between vectors corresponds to inclusion of contexts or features, and can be captured by asymmetric measures of distributional similarity. In particular, Kotlerman et al. (2010) carefully crafted the balAPinc measure (see Section 3.5 below). We adopt this measure because it has been shown to outperform others in several tasks that require lexical entailment information.

Like Kotlerman et al., we want to capture the entailment relation between vectors of features. However, we are interested in entailment not only between words but also between phrases, and we ask whether the DS view of entailment as feature inclusion, which captures entailment between nouns, also captures entailment between QPs. To this end, we complement balAPinc with a more flexible supervised classifier.
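The feature-inclusion view just described can be made concrete in a few lines. The sketch below is ours, not the authors' code: it follows the balAPinc definition given in Section 3.5, filling in the P(r) and rel′(f) details from our reading of Kotlerman et al. (2010) (P(r) is the fraction of the top-r features of F_u that also occur in F_v; rel′(f) rewards shared features that rank high in F_v), and the toy vectors are invented.

```python
import math

# Illustrative sketch of balAPinc (Kotlerman et al., 2010); see Section 3.5.
# F_u / F_v are the positive-weight features of each vector, ranked by weight.

def ranked_features(vec):
    """Positive-weight features of vec, sorted so rank 1 = highest weight."""
    return [f for f, w in sorted(vec.items(), key=lambda kv: -kv[1]) if w > 0]

def apinc(u, v):
    """Average-precision-style inclusion of u's features in v's (asymmetric)."""
    fu, fv = ranked_features(u), ranked_features(v)
    rank_v = {f: i + 1 for i, f in enumerate(fv)}
    included, score = 0, 0.0
    for r, f in enumerate(fu, start=1):
        if f in rank_v:                              # feature of u also found in v
            included += 1
            rel = 1.0 - rank_v[f] / (len(fv) + 1.0)  # rel'(f): high v-rank is better
            score += (included / r) * rel            # P(r) * rel'(f_r)
    return score / len(fu) if fu else 0.0

def lin(u, v):
    """Lin's (1998) symmetric overlap of positive feature mass."""
    shared = [f for f in u if f in v and u[f] > 0 and v[f] > 0]
    num = sum(u[f] + v[f] for f in shared)
    den = sum(w for w in u.values() if w > 0) + sum(w for w in v.values() if w > 0)
    return num / den if den else 0.0

def bal_apinc(u, v):
    """Geometric mean of APinc and LIN."""
    return math.sqrt(apinc(u, v) * lin(u, v))

# Toy positive-PMI vectors (invented): dog's contexts are included in animal's.
dog = {"bark": 3.0, "leash": 2.0}
animal = {"wild": 2.5, "zoo": 1.5, "bark": 1.0, "leash": 0.5}
```

Note the asymmetry: on these toy vectors, bal_apinc(dog, animal) exceeds bal_apinc(animal, dog), which is exactly the kind of directional cue that the threshold of Section 3.5 is applied to.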
3 Data and methods

3.1 Semantic space

We construct distributional semantic vectors from the 2.83-billion-token concatenation of the British National Corpus (http://www.natcorp.ox.ac.uk/), WackyPedia and ukWaC (http://wacky.sslmit.unibo.it/). We tokenize and POS-tag this corpus, then lemmatize it with TreeTagger (Schmid, 1995) to merge singular and plural instances of words and phrases (some dogs is mapped to some dog).

We process the corpus in two steps to compute semantic vectors representing our phrases of interest. We use "phrases of interest" as a general term for both multiword phrases and single words, and more precisely for: the AN and QN sequences that are in the data sets (see the next subsections); the adjectives, quantifiers and nouns contained in those sequences; and the most frequent (9.8K) nouns and (8.1K) adjectives in the corpus. The first step is to count the content words (more precisely, the most frequent 9.8K nouns, 8.1K adjectives, and 9.6K verbs in the corpus) that occur in the same sentence as each phrase of interest. In the second step, following standard practice, the co-occurrence counts are converted into pointwise mutual information (PMI) scores (Church and Hanks, 1990). The result of this step is a sparse matrix (with both positive and negative entries) with 48K rows (one per phrase of interest) and 27K columns (one per content word).

3.2 The AN |= N data set

To characterize entailment between nouns using their semantic vectors, we need data exemplifying which noun entails which. This section introduces one cheap way to collect such a training data set by exploiting semantic vectors for composed expressions, namely AN sequences. We rely on the linguistic fact that ANs share a syntactic category and semantic type with plain common nouns (big cat shares syntactic category and semantic type with cat). Furthermore, most adjectives are restrictive in the sense that, for every noun N, the AN sequence entails the N alone (every big cat is a cat). From a distributional point of view, the vector for an N should by construction include the information in the vector for an AN, given that the contexts where the AN occurs are a subset of the contexts where the N occurs (cat occurs in all the contexts where big cat occurs). This ideal inclusion suggests that the DS notion of lexical entailment as feature inclusion (see Section 2.2 above) should be reflected in the AN |= N pattern.

Because most ANs entail their head Ns, we can create positive examples of AN |= N without any manual inspection of the corpus: simply pair up the semantic vectors of ANs and Ns. Furthermore, because an AN usually does not entail another N, we can create negative examples (AN1 ⊭ N2) just by randomly permuting the Ns. Of course, such unsupervised data would be slightly noisy, especially because some of the most frequent adjectives are not restrictive.

To collect cleaner data and to be sure that we are really examining the phenomenon of entailment, we took a mere few moments of manual effort to select the 256 restrictive adjectives among the most frequent 300 adjectives in the corpus. We then took the Cartesian product of these 256 adjectives with the 200 concrete nouns in the BLESS data set (Baroni and Lenci, 2011). Those nouns were chosen to avoid highly polysemous words. From the Cartesian product, we obtain a total of 1246 AN sequences, such as big cat, that occur more than 100 times in the corpus. These AN sequences encompass 190 of the 256 adjectives and 128 of the 200 nouns.

The process results in 1246 positive instances of AN |= N entailment, which we use as training data. To create a comparable amount of negative data, we randomly permuted the nouns in the positive instances to obtain pairs AN1 ⊭ N2 (e.g., big cat ⊭ dog). We manually double-checked that all positive and negative examples are correctly classified (2 of 1246 negative instances were removed, leaving 1244 negative training examples).

3.3 The lexical entailment N1 |= N2 data set

For testing data, we first listed all WordNet nouns in our corpus, then extracted hyponym-hypernym chains linking the first synsets of these nouns. For example, pope is found to entail leader because WordNet contains the chain pope → spiritual leader → leader. Eliminating the 20 hypernyms with more than 180 hyponyms (mostly very abstract nouns such as entity, object, and quality) yields 9734 hyponym-hypernym pairs, encompassing 6402 nouns. Manually double-checking these pairs leaves us with 1385 positive instances of N1 |= N2 entailment.

We created the negative instances, again 1385 pairs, by inverting 33% of the positive instances (from pope |= leader to leader ⊭ pope) and by randomly shuffling the words across the positive instances. We also manually double-checked these pairs to make sure that they are not hyponym-hypernym pairs.

3.4 The Q1N |= Q2N data set

We study 12 quantifiers: all, both, each, either, every, few, many, most, much, no, several, some. We took the Cartesian product of these quantifiers with the 6402 WordNet nouns described in Section 3.3. From this Cartesian product, we obtain a total of 28926 QN sequences, such as every cat, that occur at least 100 times in the corpus. These are our QN phrases of interest, to which the procedure in Section 3.1 assigns a semantic vector.

Also, from the set of quantifier pairs (Q1, Q2) where Q1 ≠ Q2, we identified 13 clear cases where Q1 |= Q2 and 17 clear cases where Q1 ⊭ Q2. These 30 cases are listed in the first column of Table 1. For each of these 30 quantifier pairs (Q1, Q2), we enumerate those WordNet nouns N such that semantic vectors are available for both Q1N and Q2N (that is, both sequences occur at least 100 times). Each such noun then gives rise to an instance of entailment (Q1N |= Q2N if Q1 |= Q2; example: many dogs |= several dogs) or non-entailment (Q1N ⊭ Q2N if Q1 ⊭ Q2; example: many dogs ⊭ most dogs). The number of QN pairs that each quantifier pair gives rise to in this way is listed in the second column of Table 1. As shown there, we have a total of 7537 positive instances and 8455 negative instances of QN entailment.

Quantifier pair     Instances   Correct
all |= some         1054        1044 (99%)
all |= several       557         550 (99%)
each |= some         656         647 (99%)
all |= many          873         772 (88%)
much |= some         248         217 (88%)
every |= many        460         400 (87%)
many |= some         951         822 (86%)
all |= most          465         393 (85%)
several |= some      580         439 (76%)
both |= some         573         322 (56%)
many |= several      594         113 (19%)
most |= many         463          84 (18%)
both |= either        63           1 (2%)
Subtotal            7537        5804 (77%)

some ⊭ every         484         481 (99%)
several ⊭ all        557         553 (99%)
several ⊭ every      378         375 (99%)
some ⊭ all          1054        1043 (99%)
many ⊭ every         460         452 (98%)
some ⊭ each          656         640 (98%)
few ⊭ all            157         153 (97%)
many ⊭ all           873         843 (97%)
both ⊭ most          369         347 (94%)
several ⊭ few        143         134 (94%)
both ⊭ many          541         397 (73%)
many ⊭ most          463         300 (65%)
either ⊭ both         63          39 (62%)
many ⊭ no            714         369 (52%)
some ⊭ many          951         468 (49%)
few ⊭ many           161          33 (20%)
both ⊭ several       431          63 (15%)
Subtotal            8455        6690 (79%)

Total              15992       12494 (78%)

Table 1: Entailing and non-entailing quantifier pairs with number of instances per pair (Section 3.4) and SVM_pair-out performance breakdown (Section 5).
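The instance-generation step of Section 3.4 reduces to a filtered Cartesian product. The following is an illustrative reconstruction, not the authors' code: the quantifier judgments and the availability set below are invented toy stand-ins for the 30 curated pairs and the 100-occurrence frequency threshold.

```python
# Toy stand-ins (invented) for the curated quantifier-pair judgments.
ENTAILING = {("all", "some"), ("many", "some")}      # Q1 |= Q2
NON_ENTAILING = {("some", "all"), ("many", "most")}  # Q1 does not entail Q2

def qn_instances(nouns, available):
    """available: set of (quantifier, noun) phrases whose vectors exist
    (i.e., the QN occurs often enough in the corpus).
    Returns (Q1 N, Q2 N, label) triples, one per noun per quantifier pair."""
    labelled = [(p, True) for p in ENTAILING] + [(p, False) for p in NON_ENTAILING]
    out = []
    for (q1, q2), label in labelled:
        for n in nouns:
            # Keep the noun only if both QN phrases have a semantic vector.
            if (q1, n) in available and (q2, n) in available:
                out.append(((q1, n), (q2, n), label))
    return out

nouns = ["dog", "cat"]
available = {("all", "dog"), ("some", "dog"), ("many", "dog"), ("most", "dog"),
             ("all", "cat"), ("some", "cat")}
instances = qn_instances(nouns, available)
```

With these toy inputs, "many cats" lacks a vector, so the pairs involving many contribute only dog-based instances, mirroring how the counts in the second column of Table 1 vary by quantifier pair.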
resenting the dimensions with positive PMI val- We use the balAPinc measure as a refer- ues in the semantic vectors of the candidate pair ence point because, on the evidence provided by u |= v, the idea is that we want the features (that Kotlerman et al., it is the state of the art in various is, vector dimensions) that have larger values in tasks related to lexical entailment. We recognize Fu to also have large values in Fv (the opposite however that it is somewhat complex and specifi- does not matter because it is u that should be in- cally tuned to capturing the relation of feature in- cluded in v, not vice versa). The Fu features are clusion. Consequently, we also experiment with ranked according to their PMI value so that fr a more flexible classifier, which can detect other is the feature in Fu with rank r, i.e., r-th high- systematic properties of vectors in an entailment est PMI. Then the sum of the product of the two relation. We present this classifier next. terms P (r) and rel0 (fr ) across the features in Fu is computed. The first term is the precision at r, SVM Support vector machines are widely used which is higher when highly ranked u features are high-performance discriminative classifiers that present in Fv as well. The relevance term rel0 (fr ) find the hyperplane providing the best separation is higher when the feature fr in Fu also appears between negative and positive instances (Cristian- in Fv with a high rank. (See Kotlerman et al. for ini and Shawe-Taylor, 2000). Our SVM classifiers how P (r) and rel0 (fr ) are computed.) The result- are trained and tested using Weka 3 and LIBSVM ing score is normalized by dividing by the entail- 2.8 (Chang and Lin, 2011). We use the default ing vector size |Fu | (in accordance with the idea polynomial kernel ((u · v/600)3 ) with (tolerance that having more v features should not hurt be- of termination criterion) set to 1.6. 
This value was cause the u features should be included in the v tuned on the AN|=N data set, which we never use features, not vice versa). for testing. In the same initial tuning experiments To balance the potentially excessive asymmetry on the AN |= N data set, SVM outperformed deci- of APinc towards the features of the antecedent, sion trees, naive Bayes, and k-nearest neighbors. Kotlerman et al. average it with LIN, the widely We feed each potential entailment pair to SVM used symmetric measure of distributional similar- by concatenating the two vectors representing the ity proposed by Lin (1998): antecedent and consequent expressions.2 How- P ever, for efficiency and to mitigate data sparse- f ∈Fu ∩Fv [wu (f ) + wv (f )] LIN(u, v) = P P ness, we reduce the dimensionality of the seman- f ∈Fu wu (f ) + f ∈Fv wv (f ) tic vectors to 300 columns using Singular Value LIN essentially measures feature vector overlap. Decomposition (SVD) before feeding them to the The positive PMI values wu (f ) and wv (f ) of a classifier.3 Because the SVD-reduced semantic feature f in Fu and Fv are summed across those 2 We have tried also to represent a pair by subtracting and features that are positive in both vectors, normal- by dividing the two vectors. The concatenation operation izing by the cumulative positive PMI mass in both gave more successful results. 3 vectors. Finally, balAPinc is the geometric aver- To keep a manageable parameter space, we picked 300 age of APinc and LIN: columns without tuning. This is the best value reported in p many earlier studies, including classic LSA. Since SVD balAPinc(u|=v) = APinc(u |= v) · LIN(u, v) sometimes improves the semantic space (Landauer and Du- 28 vectors occupy a 300-dimensional space, the en- P R F Accuracy tailment pairs occupy a 600-dimensional space. (95% C.I.) 
An SVM with a polynomial kernel takes into SVMupper 88.6 88.6 88.5 88.6 (87.3–89.7) account not only individual input features but also balAPincAN |= N 65.2 87.5 74.7 70.4 (68.7–72.1) their interactions (Manning et al., 2008, chapter balAPincupper 64.4 90.0 75.1 70.1 (68.4–71.8) 15). Thus, our classifier can capture not just prop- SVMAN |= N 69.3 69.3 69.3 69.3 (67.6–71.0) erties of individual dimensions of the antecedent and consequent pairs, but also properties of their cos(N1 , N2 ) 57.7 57.6 57.5 57.6 (55.8–59.5) combinations (e.g., the product of the first dimen- fq(N1 ) < fq(N2 ) 52.1 52.1 51.8 53.3 (51.4–55.2) sions of the antecedent and the consequent). We conjecture that this property of SVMs is funda- Table 2: Detecting lexical entailment. Results ranked mental to their success at detecting entailment, by accuracy and expressed as percentages. 95% con- where relations between the antecedent and the fidence intervals around accuracy calculated by bino- mial exact tests. consequent should matter more than their inde- pendent characteristics. accuracy on the test set, which is balanced be- 4 Predicting lexical entailment from tween positive and negative instances. Interest- AN |= N evidence ingly, the balAPinc decision thresholds tuned on Since the contexts of AN must be a subset of the the AN |= N set and on the test data are very contexts of N, semantic vectors harvested from close (0.26 vs. 0.24), resulting in very similar per- AN phrases and their head Ns are by construc- formance for balAPincAN |= N and balAPincupper . tion in an inclusion relation. The first experiment This suggests that the relation captured by bal- shows that these vectors constitute excellent train- APinc on the phrasal entailment training data is ing data to discover entailment between nouns. indeed the same that the measure captures when This suggests that the vector pairs representing applied to lexical entailment data. 
entailment between nouns are also in an inclusion The success of this first experiment shows that relation, supporting the conjectures of Kotlerman the entailment relation present in the distribu- et al. (2010) and others. tional representation of AN phrases and their Table 2 reports the results we obtained with head Ns transfers to lexical entailment (entailment balAPincupper , balAPincAN |= N (Section 3.5) and among Ns). Most importantly, this result demon- SVMAN |= N (the SVM classifier trained on the strates that the semantic vectors of composite ex- AN |= N data). As an upper bound for meth- pressions (such as ANs) are useful for lexical en- ods that generalize from AN |= N, we also re- tailment. Moreover, the result is in accordance port the performance of SVM trained with 10-fold with the view of FS, that ANs and Ns have the cross-validation on the N1 |= N2 data themselves same semantic type, and thus they enter entail- (SVMupper ). Finally, we tried two baseline classi- ment relations of the same kind. Finally, the hy- fiers. The first baseline (fq(N1 ) < fq(N2 )) guesses pothesis that entailment among nouns is reflected entailment if the first word is less frequent than by distributional inclusion among their semantic the second. The second (cos(N1 , N2 )) applies a vectors (Kotlerman et al., 2010) is supported both threshold (determined on the test set) to the co- by the successful generalization of the SVM clas- sine similarity of the pair. The results of these sifier trained on AN |= N pairs and by the good baselines shown in Table 2 use SVD; those with- performance of the balAPinc measure. out SVD are similar. Both baselines outperformed 5 Generalizing QN entailment more trivial methods such as random guessing or fixed response, but they performed significantly The second study is somewhat more ambitious, worse than SVM and balAPinc. 
as it aims to capture and generalize the entailment relation between QPs (of shape QN) using only the corpus-harvested semantic vectors representing these phrases as evidence. We are thus first and foremost interested in testing whether these vectors encode information that can help a powerful classifier, such as an SVM, to detect entailment.

To abstract away from lexical or other effects linked to a specific quantifier, we consider two challenging training and testing regimes. In the first (SVM_pair-out), we hold out one quantifier pair as testing data and use the other 29 pairs in Table 1 as training data. Thus, for example, the classifier must discover all dogs |= some dogs without seeing any all N |= some N instance in the training data. In the second (SVM_quantifier-out), we hold out one of the 12 quantifiers as testing data (that is, we hold out every pair involving a certain quantifier) and use the rest as training data. For example, the classifier must guess all dogs |= some dogs without ever seeing all in the training data. We expect the second training regime to be more difficult, not just because there is less training data, but also because the trained classifier is tested on a quantifier that it has never encountered within any training QN sequence.[4]

Table 3 reports the results for SVM_pair-out and SVM_quantifier-out, as well as for the methods we tried in the lexical entailment experiments. (As in the first study, the frequency- and cosine-based baselines are only slightly better overall than more trivial baselines.) We consider moreover an alternative approach that ignores the noun altogether and uses vectors for the quantifiers only (e.g., the decision about all dogs |= some dogs considers the corpus-derived all and some vectors only). The models resulting from this Q-only strategy are marked with the superscript Q in the table.

                      P     R     F     Accuracy (95% C.I.)
SVM_pair-out         76.7  77.0  76.8  78.1 (77.5-78.8)
SVM_quantifier-out   70.1  65.3  68.0  71.0 (70.3-71.7)
SVM^Q_pair-out       67.9  69.8  68.9  70.2 (69.5-70.9)
SVM^Q_quantifier-out 53.3  52.9  53.1  56.0 (55.2-56.8)
cos(QN1, QN2)        52.9  52.3  52.3  53.1 (52.3-53.9)
balAPinc_AN|=N       46.7   5.6  10.0  52.5 (51.7-53.3)
SVM_AN|=N             2.8  42.9   5.2  52.4 (51.7-53.2)
fq(QN1) < fq(QN2)    51.0  47.4  49.1  50.2 (49.4-51.0)
balAPinc_upper       47.1  100   64.1  47.2 (46.4-47.9)

Table 3: Detecting quantifier entailment. Results ranked by accuracy and expressed as percentages. 95% confidence intervals around accuracy calculated by binomial exact tests.

The results confirm clearly that semantic vectors for QNs contain enough information to allow a classifier to detect entailment: SVM_quantifier-out performs as well as the lexical entailment classifiers of our first study, and SVM_pair-out does even better. This success is especially impressive given our challenging training and testing regimes.

In contrast to the first study, SVM_AN|=N, the classifier trained on the AN |= N data set, and balAPinc now perform no better than the baselines. (Here balAPinc_upper and balAPinc_AN|=N pick very different thresholds: the first settles on a very low t = 0.01, whereas the second picks t = 0.26.) As predicted by FS (see Section 2.2 above), noun-level entailment does not generalize to quantifier phrase entailment, since the two structures have different semantic types, corresponding to different kinds of entailment relations. Moreover, the failure of balAPinc suggests that, whatever evidence the SVMs rely upon, it is not simple feature inclusion.

Interestingly, even the Q vectors alone encode enough information to capture entailment above chance. Still, the huge drop in performance from SVM^Q_pair-out to SVM^Q_quantifier-out suggests that the Q-only method learned ad-hoc properties that do not generalize (e.g., "all entails every Q2").

Tables 1 and 4 break down the SVM results by (pairs of) quantifiers.

[4] In our initial experiments, we added negative entailment instances by blindly permuting the nouns, under the assumption that Q1N1 typically does not entail Q2N2 when Q1 ≠ Q2 and N1 ≠ N2. These additional instances turned out to be much easier to classify: adding an equal proportion of them to the training and testing data, such that the numbers of instances where N1 = N2 and where N1 ≠ N2 are equal, reduced every error rate roughly by half. The reported results do not involve these additional instances.
We highlight the remark- a quantifier that it has never encountered within able dichotomy in Table 4 between the good per- any training QN sequence.4 formance on the universal-like quantifiers (each, Table 3 reports the results for SVMpair-out and every, all, much) and the poor performance on the SVMquantifier-out , as well as for the methods we existential-like ones (some, no, both, either). tried in the lexical entailment experiments. (As In sum, the QN experiments show that seman- in the first study, the frequency- and cosine-based tic vectors contain enough information to detect 4 a logical relation such as entailment not only be- In our initial experiments, we added negative entail- ment instances by blindly permuting the nouns, under the tween words, but also between phrases contain- assumption that Q1 N1 typically does not entail Q2 N2 when ing quantifiers that determine their entailment re- Q1 6= Q2 and N1 6= N2 . These additional instances turned lation. While a flexible classifier such as SVM out to be much easier to classify: adding an equal proportion performs this task well, neither measuring fea- of them to the training data and testing data, such that the number of instances where N1 = N2 and where N1 6= N2 ture inclusion nor generalizing nominal entail- is equal, reduced every error rate roughly by half. The re- ment works. SVMs are evidently tapping into ported results do not involve these additional instances. other properties of the vectors. 30 Quantifier Instances Correct Very importantly, instead of extracting vectors |= 6|= |= 6|= representing phrases directly from the corpus, we each 656 656 649 637 (98%) intend to derive them by compositional operations every 460 1322 402 1293 (95%) proposed in the literature (see Section 2.1 above). 
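The two held-out regimes (pair-out and quantifier-out) can be sketched as follows. This is our own illustration, not the authors' code; the instance format ((Q1, Q2, noun), label) is an assumption made for the example.

```python
# An instance is ((Q1, Q2, noun), label): does "Q1 noun" entail "Q2 noun"?

def pair_out_splits(instances):
    """SVM_pair-out: hold out one quantifier pair at a time, train on the rest."""
    pairs = sorted({(q1, q2) for (q1, q2, _n), _y in instances})
    for held in pairs:
        train = [x for x in instances if (x[0][0], x[0][1]) != held]
        test = [x for x in instances if (x[0][0], x[0][1]) == held]
        yield held, train, test

def quantifier_out_splits(instances):
    """SVM_quantifier-out: hold out every pair involving a given quantifier."""
    quants = sorted({q for (q1, q2, _n), _y in instances for q in (q1, q2)})
    for held in quants:
        train = [x for x in instances if held not in x[0][:2]]
        test = [x for x in instances if held in x[0][:2]]
        yield held, train, test
```

The second regime is strictly harder: the held-out quantifier never appears on either side of any training pair.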
Quantifier  Instances |=   Instances ⊭   Correct |=   Correct ⊭
each         656            656           649           637   (98%)
every        460           1322           402          1293   (95%)
much         248              0           216             0   (87%)
all         2949           2641          2011          2494   (81%)
several     1731           1509          1302          1267   (79%)
many        3341           4163          2349          3443   (77%)
few            0            461             0           311   (67%)
most         928            832           549           511   (60%)
some        4062           3145          1780          2190   (55%)
no             0            714             0           380   (53%)
both         636           1404           589           303   (44%)
either        63             63             2            41   (34%)
Total      15074          16910          9849         12870   (71%)

Table 4: Breakdown of results with the leave-one-quantifier-out (SVM_quantifier-out) training regime.

Very importantly, instead of extracting vectors representing phrases directly from the corpus, we intend to derive them by compositional operations proposed in the literature (see Section 2.1 above).

6 Conclusion

Our main results are as follows.

1. Corpus-harvested semantic vectors representing adjective-noun constructions and their heads encode a relation of entailment that can be exploited to train a classifier to detect lexical entailment. In particular, a relation of feature inclusion between the narrower antecedent and broader consequent terms captures both AN |= N and N1 |= N2 entailment.

2. The semantic vectors of quantifier-noun constructions also encode information sufficient to learn an entailment relation that generalizes to QNs containing quantifiers that were not seen during training.

3. Neither the entailment information encoded in AN |= N vectors nor the balAPinc measure generalizes well to entailment detection in QNs. This result suggests that QN vectors encode a different kind of entailment, as also suggested by type distinctions in Formal Semantics.

In future work, we want first of all to conduct an analysis of the features in the Q1N |= Q2N vectors that are crucially exploited by our successful entailment recognizers, in order to understand which characteristics of entailment are encoded in these vectors. We will look for composition methods producing vector representations of composite expressions that are as good as (or better than) vectors directly extracted from the corpus at encoding entailment. Finally, we would like to evaluate our entailment detection strategies for larger phrases and sentences, possibly containing multiple quantifiers, and eventually embed them as core components of an RTE system.

Acknowledgments

We thank the Erasmus Mundus EMLCT Program for the student and visiting scholar grants to the third and fourth author, respectively. The first two authors are partially funded by the ERC 2011 Starting Independent Research Grant supporting the COMPOSES project (nr. 283554). We are grateful to Gemma Boleda, Louise McNally, and the anonymous reviewers for valuable comments, and to Ido Dagan for important insights into entailment from an empirical point of view.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, pages 1183–1193, Boston, MA.

Johan Bos and Katja Markert. 2006. When logical inference helps determining textual entailment (and when it doesn't). In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Paul Buitelaar and Philipp Cimiano. 2008. Bridging the Gap between Text and Knowledge. IOS, Amsterdam.

Nathanael Chambers, Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon, Bill MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and Christopher D. Manning. 2007. Learning alignments and leveraging natural logic. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
Kenneth Church and Peter Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Nello Cristianini and John Shawe-Taylor. 2000. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: rational, evaluation and approaches. Natural Language Engineering, 15:459–476.

Katrin Erk. 2009. Supporting inferences in semantic space: representing words as regions. In Proceedings of IWCS, pages 104–115, Tilburg, Netherlands.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of ACL, pages 107–114, Ann Arbor, MI.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP, pages 1395–1404, Edinburgh.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the ACL GEMS Workshop, pages 33–37, Uppsala, Sweden.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING, pages 539–545, Nantes, France.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell, Oxford.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Milen Kouleykov and Bernardo Magnini. 2005. Tree edit distance for textual entailment. In Proceedings of RANLP-2005, International Conference on Recent Advances in Natural Language Processing, pages 271–278.

Thomas Landauer and Susan Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, Madison, WI, USA.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203–208.

Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Richard Montague. 1970. Universal Grammar. Theoria, 36:373–398.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of HLT-NAACL 2004, pages 321–328.

Reinhard Rapp. 2003. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the 9th MT Summit, pages 315–322, New Orleans, LA.

Magnus Sahlgren. 2006. The Word-Space Model. Dissertation, Stockholm University.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL-SIGDAT Workshop, Dublin, Ireland.

Hinrich Schütze. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of NIPS 17.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogeneous evidence. In Proceedings of ACL 2006, pages 801–808.

Richmond H. Thomason, editor. 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, New York.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney. 2008. A uniform approach to analogies, synonyms, antonyms and associations. In Proceedings of COLING, pages 905–912, Manchester, UK.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference of Computational Linguistics, COLING-2004, pages 1015–1021.

Fabio M. Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2007. Shallow semantics in fast textual entailment rule learners. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Maayan Zhitomirsky-Geffet and Ido Dagan. 2010. Bootstrapping distributional feature vector quality. Computational Linguistics, 35(3):435–461.

Evaluating Distributional Models of Semantics for Syntactically Invariant Inference

Jackie CK Cheung and Gerald Penn
Department of Computer Science
University of Toronto
Toronto, ON, M5S 3G4, Canada
{jcheung,gpenn}@cs.toronto.edu

Abstract

A major focus of current work in distributional models of semantics is to construct phrase representations compositionally from word representations. However, the syntactic contexts which are modelled are usually severely limited, a fact which is reflected in the lexical-level WSD-like evaluation methods used. In this paper, we broaden the scope of these models to build sentence-level representations, and argue that phrase representations are best evaluated in terms of the inference decisions that they support, invariant to the particular syntactic constructions used to guide composition. We propose two evaluation methods in relation classification and QA which reflect these goals, and apply several recent compositional distributional models to the tasks. We find that the models outperform a simple lemma overlap baseline slightly, demonstrating that distributional approaches can already be useful for tasks requiring deeper inference.

1 Introduction

A number of unsupervised semantic models (Mitchell and Lapata, 2008, for example) have recently been proposed which are inspired at least in part by the distributional hypothesis (Harris, 1954)—that a word's meaning can be characterized by the contexts in which it appears. Such models represent word meaning as one or more high-dimensional vectors which capture the lexical and syntactic contexts of the word's occurrences in a training corpus.
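As a toy illustration of such corpus-harvested vectors (our own sketch, not the authors' setup), one can count the context words around each occurrence of a target word in a token stream and compare two targets with cosine similarity:

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count context words within +/-window of each occurrence of target."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

Real systems differ in weighting (e.g., PMI), context definition (syntactic vs. window-based), and dimensionality reduction, but this is the core idea.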
Much of the recent work in this area has, following Mitchell and Lapata (2008), focused on the notion of compositionality as the litmus test of a truly semantic model. Compositionality is a natural way to construct representations of linguistic units larger than a word, and it has a long history in Montagovian semantics for dealing with argument structure and assembling rich semantical expressions of the kind found in predicate logic.

While compositionality may thus provide a convenient recipe for producing representations of propositionally typed phrases, it is not a necessary condition for a semantic representation. Rather, that distinction still belongs to the crucial ability to support inference. It is not the intention of this paper to argue for or against compositionality in semantic representations. Rather, our interest is in evaluating semantic models in order to determine their suitability for inference tasks. In particular, we contend that it is desirable and arguably necessary for a compositional semantic representation to support inference invariantly, in the sense that the particular syntactic construction that guided the composition should not matter relative to the representations of syntactically different phrases with the same meanings. For example, we can assert that John threw the ball and The ball was thrown by John have the same meaning for the purposes of inference, even though they differ syntactically.

An analogy can be drawn to research in image processing, in which it is widely regarded as important for the representations of images to be invariant to rotation and scaling. What we should want is a representation of sentence meaning that is invariant to diathesis, other regular syntactic alternations in the assignment of argument structure, and, ideally, even invariant to other meaning-preserving or near-preserving paraphrases.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 33–43, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

Existing evaluations of distributional semantic models fall short of measuring this. One evaluation approach consists of lexical-level word substitution tasks which primarily evaluate a system's ability to disambiguate word senses within a controlled syntactic environment (McCarthy and Navigli, 2009, for example). Another approach is to evaluate parsing accuracy (Socher et al., 2010, for example), which is really a formalism-specific approximation to argument structure analysis. These evaluations may certainly be relevant to specific components of, for example, machine translation or natural language generation systems, but they tell us little about a semantic model's ability to support inference.

In this paper, we propose a general framework for evaluating distributional semantic models that build sentence representations, and suggest two evaluation methods that test the notion of structurally invariant inference directly. Both rely on determining whether sentences express the same semantic relation between entities, a crucial step in solving a wide variety of inference tasks like recognizing textual entailment, information retrieval, question answering, and summarization. The first evaluation is a relation classification task, where a semantic model is tested on its ability to recognize whether a pair of sentences both contain a particular semantic relation, such as Company X acquires Company Y. The second task is a question answering task, the goal of which is to locate the sentence in a document that contains the answer. Here, the semantic model must match the question, which expresses a proposition with a missing argument, to the answer-bearing sentence which contains the full proposition.

We apply these new evaluation protocols to several recent distributional models, extending several of them to build sentence representations. We find that the models outperform a simple lemma overlap model only slightly, but that combining these models with the lemma overlap model can improve performance. This result is likely due to weaknesses in current models' ability to deal with issues such as named entities, coreference, and negation, which are not emphasized by existing evaluation methods, but it does suggest that distributional models of semantics can play a more central role in systems that require deep, precise inference.

2 Compositionality and Distributional Semantics

The idea of compositionality has been central to understanding contemporary natural language semantics from an historiographic perspective. The idea is often credited to Frege, although in fact Frege had very little to say about compositionality that had not already been repeated since the time of Aristotle (Hodges, 2005). Our modern notion of compositionality took shape primarily with the work of Tarski (1956), who was actually arguing that a central difference between formal languages and natural languages is that natural language is not compositional. This in turn was "the contention that an important theoretical difference exists between formal and natural languages," that Richard Montague so famously rejected (Montague, 1974). Compositionality also features prominently in Fodor and Pylyshyn's (1988) rejection of early connectionist representations of natural language semantics, which seems to have influenced Mitchell and Lapata (2008) as well.

Logic-based forms of compositional semantics have long strived for syntactic invariance in meaning representations, which is known as the doctrine of the canonical form. The traditional justification for canonical forms is that they allow easy access to a knowledge base to retrieve some desired information, which amounts to a form of inference. Our work can be seen as an extension of this notion to distributional semantic models with a more general notion of representational similarity and inference.

There are many regular alternations that semantics models have tried to account for such as passive or dative alternations. There are also many lexical paraphrases which can take drastically different syntactic forms. Take the following example from Poon and Domingos (2009), in which the same semantic relation can be expressed by a transitive verb or an attributive prepositional phrase:

(1) Utah borders Idaho.
    Utah is next to Idaho.

In distributional semantics, the original sentence similarity test proposed by Kintsch (2001) served as the inspiration for the evaluation performed by Mitchell and Lapata (2008) and most later work in the area. Intransitive verbs are given in the context of their syntactic subject, and candidate synonyms are ranked for their appropriateness. This method targets the fact that a synonym is appropriate for only some of the verb's senses, and the intended verb sense depends on the surrounding context. For example, burn and beam are both synonyms of glow, but given a particular subject, one of the synonyms (called the High similarity landmark) may be a more appropriate substitution than the other (the Low similarity landmark). So, if the fire is the subject, glowed is the High similarity landmark, and beamed the Low similarity landmark.

Fundamentally, this method was designed as a demonstration that compositionality in computing phrasal semantic representations does not interfere with the ability of a representation to synthesize non-compositional collocation effects that contribute to the disambiguation of homographs. Here, word-sense disambiguation is implicitly viewed as a very restricted, highly lexicalized case of inference for selecting the appropriate disjunct in the representation of a word's meaning.

Kintsch (2001) was interested in sentence similarity, but he only conducted his evaluation on a few hand-selected examples. Mitchell and Lapata (2008) conducted theirs on a much larger scale, but chose to focus only on this single case of syntactic combination, intransitive verbs and their subjects, in order to "factor out inessential degrees of freedom" to compare their various alternative models more equitably. This was not necessary—using the same, sufficiently large, unbiased but syntactically heterogeneous sample of evaluation sentences would have served as an adequate control—and this decision furthermore prevents the evaluation from testing the desired invariance of the semantic representation.

Other lexical evaluations suffer from the same problem. One uses the WordSim-353 dataset (Finkelstein et al., 2002), which contains human word pair similarity judgments that semantic models should reproduce. However, the word pairs are given without context, and homography is unaddressed. Also, it is unclear how reliable the similarity scores are, as different annotators may interpret the integer scale of similarity scores differently. Recent work uses this dataset mostly for parameter tuning. Another is the lexical paraphrase task of McCarthy and Navigli (2009), in which words are given in the context of the surrounding sentence, and the task is to rank a given list of proposed substitutions for that word. The list of substitutions as well as the correct rankings are elicited from annotators. This task was originally conceived as an applied evaluation of WSD systems, not an evaluation of phrase representations.

Parsing accuracy has been used as a preliminary evaluation of semantic models that produce syntactic structure (Socher et al., 2010; Wu and Schuler, 2011). However, syntax does not always reflect semantic content, and we are specifically interested in supporting syntactic invariance when doing semantic inference. Also, this type of evaluation is tied to a particular grammar formalism.

The existing evaluations that are most similar in spirit to what we propose are paraphrase detection tasks that do not assume a restricted syntactic context. Washtell (2011) collected human judgments on the general meaning similarity of candidate phrase pairs. Unfortunately, no additional guidance on the definition of "most similar in meaning" was provided, and it appears likely that subjects conflated lexical, syntactic, and semantic relatedness. Dolan and Brockett (2005) define paraphrase detection as identifying sentences that are in a bidirectional entailment relation. While such sentences do support exactly the same inferences, we are also interested in the inferences that can be made from similar sentences that are not paraphrases according to this strict definition — a situation that is more often encountered in end applications. Thus, we adopt a less restricted notion of paraphrasis.

3 An Evaluation Framework

We now describe a simple, general framework for evaluating semantic models. Our framework consists of the following components: a semantic model to be evaluated, pairs of sentences that are considered to have high similarity, and pairs of sentences that are considered to have low similarity.

In particular, the semantic model is a binary function, s = M(x, x′), which returns a real-valued similarity score, s, given a pair of arbitrary linguistic units (that is, words, phrases, sentences, etc.), x and x′.
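The model interface just described can be sketched as follows (illustrative code with our own names; the toy overlap model merely stands in for a real semantic model):

```python
def score_sets(M, high_pairs, low_pairs):
    """Apply a similarity model M to high- and low-similarity pair sets,
    yielding the two score sets the model is judged on."""
    s_high = [M(a, b) for a, b in high_pairs]
    s_low = [M(a, b) for a, b in low_pairs]
    return s_high, s_low

def toy_model(x, y):
    """A trivially simple word-overlap model standing in for M."""
    wx, wy = set(x.split()), set(y.split())
    return len(wx & wy) / max(len(wx | wy), 1)
```

Any function with this signature, compositional or not, can be plugged in and evaluated by how well it separates the two score sets.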
Note that this formulation of the semantic model is agnostic to whether the models use compositionality to build a phrase representation from constituent representations, and even to the actual representation used. The model is tested by applying it to each element in the following two sets:

H = {(h, h′) | h and h′ are linguistic units with high similarity}   (2)
L = {(l, l′) | l and l′ are linguistic units with low similarity}   (3)

The resulting sets of similarity scores are:

S^H = {M(h, h′) | (h, h′) ∈ H}   (4)
S^L = {M(l, l′) | (l, l′) ∈ L}   (5)

The semantic model is evaluated according to its ability to separate S^H and S^L. We will define specific measures of separation for the tasks that we propose shortly. While the particular definitions of "high similarity" and "low similarity" depend on the task, at the crux of both our evaluations is that two sentences are similar if they express the same semantic relation between a given entity pair, and dissimilar otherwise. This threshold for similarity is closely tied to the argument structure of the sentence, and allows considerable flexibility in the other semantic content that may be contained in the sentence, unlike the bidirectional paraphrase detection task. Yet it ensures that a consistent and useful distinction for inference is being detected, unlike unconstrained similarity judgments.

Also, compared to word similarity assessments or paraphrase elicitation, determining whether a sentence expresses a semantic relation is a much easier task cognitively for human judges. This binary judgment does not involve interpreting a numerical scale or coming up with an open-ended set of alternative paraphrases. It is thus easier to […]

Below, we present two tasks that instantiate this evaluation framework and choice of similarity threshold. They differ in that the first is targeted towards recognizing declarative sentences or phrases, while the second is targeted towards a question answering scenario, where one argument in the semantic relation is queried.

3.1 Task 1: Relation Classification

The first task is a relation classification task. Relation extraction and recognition are central to a variety of other tasks, such as information retrieval, ontology construction, recognizing textual entailment and question answering.

In this task, the high and the low similarity sentence pairs are constructed in the following manner. First, a target semantic relation, such as Company X acquires Company Y is chosen, and entities are chosen for each slot in the relation, such as Company X = Pfizer and Company Y = Rinat Neuroscience. Then, sentences containing these entities are extracted and divided into two subsets. In one of them, E, the entities are in the target semantic relation, while in the other, NE, they are not. The evaluation sets H and L are then constructed as follows:

H = E × E \ {(e, e) | e ∈ E}   (6)
L = E × NE   (7)

In other words, the high similarity sentence pairs are all the pairs where both express the target semantic relation, except the pairs between a sentence and itself, while the low similarity pairs are all the pairs where exactly one of the two sentences expresses the target relation.

Several sentences expressing the relation Pfizer acquires Rinat Neuroscience are shown in Examples 8 to 10. These sentences illustrate the amount of syntactic and lexical variation that the semantic model must recognize as expressing the same semantic relation. In particular, besides recognizing synonymy or near-synonymy at the lexical level, models must also account for subcategorization differences, extra arguments or adjuncts, and part-of-speech differences due to nominalization.

(8) Pfizer buys Rinat Neuroscience to extend neuroscience research and in doing so acquires a product candidate for OA. (lexical difference)

(9) A month earlier, Pfizer paid an estimated several hundred million dollars for biotech firm Rinat Neuroscience. (extra argument, subcategorization)

(10) Pfizer to Expand Neuroscience Research With Acquisition of Biotech Company Rinat Neuroscience (nominalization)
Since our interest is to measure the models' ability to separate S^H and S^L in an unsupervised setting, standard supervised classification accuracy is not applicable. Instead, we employ the area under a ROC curve (AUC), which does not depend on choosing an arbitrary classification threshold. A ROC curve is a plot of the true positive versus false positive rate of a binary classifier as the classification threshold is varied. The area under a ROC curve can thus be seen as the performance of linear classifiers over the scores produced by the semantic model. The AUC can also be interpreted as the probability that a randomly chosen positive instance will have a higher similarity score than a randomly chosen negative instance. A random classifier is expected to have an AUC of 0.5.

3.2 Task 2: Restricted QA

The second task that we propose is a restricted form of question answering. In this task, the system is given a question q and a document D consisting of a list of sentences, in which one of the sentences contains the answer to the question. We define:

H = {(q, d) | d ∈ D and d answers q}   (11)
L = {(q, d) | d ∈ D and d does not answer q}   (12)

In other words, the sentences are divided into two subsets; those that contain the answer to q should be similar to q, while those that do not should be dissimilar. We also assume that only one sentence in each document contains the answer, so H contains only one sentence.
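The probabilistic interpretation of AUC given above for Task 1 suggests a direct way to compute it from the two score sets (a sketch, not the authors' implementation; ties are counted as one half):

```python
def auc(s_high, s_low):
    """Probability that a random high-similarity score beats a random
    low-similarity score, with ties worth 0.5."""
    wins = 0.0
    for p in s_high:
        for n in s_low:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(s_high) * len(s_low))
```

This O(|S^H| x |S^L|) form is the easiest to verify against the definition; rank-based computations are equivalent and faster for large sets.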
Unrestricted question answering is a difficult answer(i) problem that forces a semantic representation to norm rank = E 1 − (14) i length(i) − 1 deal sensibly with a number of other semantic is- sues such as coreference and information aggre- 4 Experiments gation which still seem to be out of reach for We drew a number of recent distributional seman- contemporary distributional models of meaning. tic models to compare in this paper. We first de- Since our focus in this work is on argument struc- scribe the models and our reimplementation of ture semantics, we restrict the question-answer them, before describing the tasks and the datasets pairs to those that only require dealing with para- used in detail and the results. phrases of this type. To do so, we semi-automatically restrict the 4.1 Distributional Semantic Models question-answer pairs by using the output of an We tested four recent distributional models and a unsupervised clustering semantic parser (Poon lemma overlap baseline, which we now describe. and Domingos, 2009). The semantic parser clus- We extended several of the models to compo- ters semantic sub-expressions derived from a de- sitionally construct phrase representations using pendency parse of the sentence, so that those sub- component-wise vector addition and multiplica- expressions that express the same semantic re- tion, as we note below. Since the focus of this pa- lations are clustered. The parser is used to an- per is on evaluation methods for such models, we swer questions, and the output of the parser is did not experiment with other compositionality 37 operators. 
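The two metrics above, AUC and mean normalized rank, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are ours, AUC is computed via the pairwise (Mann-Whitney) formulation implied by the probabilistic interpretation in the text, and ties in the rank metric take the median position as described.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    scores above a randomly chosen negative one; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def norm_rank(scores, answer_idx):
    """Normalized rank (Eq. 14) of the answer-bearing sentence:
    1 = ranked most similar, 0 = least; ties take the median position."""
    a = scores[answer_idx]
    higher = sum(1 for s in scores if s > a)
    ties = sum(1 for s in scores if s == a) - 1  # other sentences tied with a
    rank = higher + ties / 2.0                   # 0 means top-ranked
    return 1.0 - rank / (len(scores) - 1)
```

The mean normalized rank of a system is then just the average of `norm_rank` over all question-document pairs.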
We do note, however, that component-wise operators have been popular in the recent literature, and have been applied across unrestricted syntactic contexts (Mitchell and Lapata, 2009), so there is value in evaluating the performance of these operators in itself. The models were trained on the Gigaword corpus (2nd ed., ~2.3B words). All models use cosine similarity to measure the similarity between representations, except for the baseline model.

Lemma Overlap  This baseline simply represents a sentence as the counts of each lemma present in the sentence after removing stop words. Let a sentence x consist of lemma-tokens m_1, ..., m_|x|. The similarity between two sentences is then defined as

    M(x, x′) = #In(x, x′) + #In(x′, x)        (15)
    #In(x, x′) = Σ_{i=1}^{|x|} 1_{x′}(m_i)    (16)

where 1_{x′}(m_i) is an indicator function that returns 1 if m_i ∈ x′, and 0 otherwise. This definition accounts for multiple occurrences of a lemma.

M&L  Mitchell and Lapata (2008) propose a framework for compositional distributional semantics using a standard term-context vector space word representation. A phrase is represented as a vector of context-word counts (actually, pmi-scaled values), which is derived compositionally by a function over constituent vectors, such as component-wise addition or multiplication. This model ignores syntactic relations and is insensitive to word order.

E&P  Erk and Padó (2008) introduce a structured vector space model which uses syntactic dependencies to model the selectional preferences of words. The vector representation of a word in context depends on the inverse selectional preferences of its dependents, and the selectional preferences of its head. For example, suppose catch occurs with a dependent ball in a direct object relation. The vector for catch would then be influenced by the inverse direct object preferences of ball (e.g. throw, organize), and the vector for ball would be influenced by the selectional preferences of catch (e.g. cold, drift). More formally, given words a and b in a dependency relation r, and a distributional representation of a, v_a, the representation of a in context, a′, is given by

    a′ = v_a ⊙ R_b(r^{-1})                      (17)
    R_b(r) = Σ_{c : f(c,r,b) > θ} f(c, r, b) · v_c   (18)

where R_b(r) is the vector describing the selectional preference of word b in relation r, f(c, r, b) is the frequency of this dependency triple, θ is a frequency threshold to weed out uncommon dependency triples (10 in our experiments), and ⊙ is a vector combination operator, here component-wise multiplication. We extend the model to compute sentence representations from the contextualized word vectors using component-wise addition and multiplication.

TFP  Thater et al. (2010)'s model is also sensitive to selectional preferences, but to two degrees. For example, the vector for catch might contain a dimension labelled (OBJ, OBJ^{-1}, throw), which indicates the strength of connection between the two verbs through all of the co-occurring direct objects which they share. Unlike E&P, TFP's model encodes the selectional preferences in a single vector using frequency counts. We extend the model to the sentence level with component-wise addition and multiplication, and word vectors are contextualized by the dependency neighbours. We use a frequency threshold of 10 and a pmi threshold of 2 to prune infrequent words and dependencies.

D&L  Dinu and Lapata (2010) (D&L) assume a global set of latent senses for all words, and model each word as a mixture over these latent senses. The vector for a word t_i in the context of a word c_j is modelled by

    v(t_i, c_j) = ⟨P(z_1 | t_i, c_j), ..., P(z_K | t_i, c_j)⟩   (19)

where z_{1...K} are the latent senses. By making independence assumptions and decomposing probabilities, training becomes a matter of estimating the probability distributions P(z_k | t_i) and P(c_j | z_k) from data. While Dinu and Lapata (2010) describe two methods to do so, based on non-negative matrix factorization and latent Dirichlet allocation, the performances are similar, so we tested only the latent Dirichlet allocation method. Like the two previous models, we extend the model to build sentence representations from the contextualized representations. We set the number of latent senses to 1200, and train for 600 Gibbs sampling iterations.

4.2 Training and Parameter Settings

We reimplemented these four models, following the parameter settings described by previous work where possible, though we also aimed for consistency in parameter settings between models (for example, in the number of context words). For the non-baseline models, we followed previous work and model only the 30000 most frequent lemmata. Context vectors are constructed using a symmetric window of 5 words, and their dimensions represent the 3000 most frequent lemmatized context words excluding stop words. Due to resource limitations, we trained the syntactic models over the AFP subset of Gigaword (~338M words). We also trained the other two models on just the AFP portion for comparison. Note that the AFP portion of Gigaword is three times larger than the BNC corpus (~100M words), on which several previous syntactic models were trained. Because our main goal is to test the general performance of the models and to demonstrate the feasibility of our evaluation methods, we did not further tune the parameter settings to each of the tasks, as doing so would likely only yield minor improvements.

4.3 Task 1

We used the dataset by Bunescu and Mooney (2007), which we selected because it contains multiple realizations of an entity pair in a target semantic relation, unlike similar datasets such as the one by Roth and Yih (2002). Controlling for the target entity pair in this manner makes the task more difficult, because the semantic model cannot make use of distributional information about the entity pair in inference. The dataset is separated into subsets depending on the target binary relation (Company X acquires Company Y or Person X was born in Place Y) and the entity pair (e.g., Yahoo and Inktomi) (Table 2).

The dataset was constructed semi-automatically using a Google search for the two entities in order with up to seven content words in between. Then, the extracted sentences were hand-labelled with whether they express the target relation. Because the order of the entities has been fixed, passive alternations do not appear in this dataset.

    Entities {X, Y}                   +     N
    Relation: acquires
    {Pfizer, Rinat Neuroscience}     41    50
    {Yahoo, Inktomi}                115   433
    Relation: was born in
    {Luc Besson, Paris}               6   126
    {Marie Antoinette, Vienna}       39   105

Table 2: Task 1 dataset characteristics. N is the total number of sentences; + is the number of sentences that express the relation.

               Pfizer/Rinat N.  Yahoo/Inktomi  Besson/Paris  Antoinette/Vienna  Average
    Overlap         0.7393         0.6007        0.7395          0.8914         0.7427
    Models trained on the entire Gigaword
    M&L add         0.6196         0.5387        0.5259          0.7275         0.6029
    M&L mult        0.9036         0.6099        0.6443          0.8467         0.7511
    D&L add         0.9214         0.8168        0.6989          0.8932         0.8326
    D&L mult        0.7732         0.6734        0.6527          0.7659         0.7163
    Models trained on the AFP section
    E&P add         0.7536         0.4933        0.2780          0.6408         0.5414
    E&P mult        0.5268         0.5328        0.5252          0.8421         0.6067
    TFP add         0.4357         0.5325        0.8725          0.7183         0.6398
    TFP mult        0.5554         0.5524        0.7283          0.6917         0.6320
    M&L add         0.5643         0.5504        0.4594          0.7640         0.5845
    M&L mult        0.8679         0.6324        0.4356          0.8258         0.6904
    D&L add         0.8143         0.9062        0.6373          0.8664         0.8061
    D&L mult        0.8429         0.7461        0.645           0.5948         0.7072

Table 1: Task 1 results in AUC scores. The values in bold indicate the best performing model for a particular training corpus. The expected random baseline performance is 0.5.

The results for Task 1 indicate that the D&L addition model performs the best (Table 1), though the lemma overlap model presents a surprisingly strong baseline. The syntax-modulated E&P and TFP models perform poorly on this task, even when compared to the other models trained on the AFP subset. The M&L multiplication model outperforms the addition model, a result which corroborates previous findings on the lexical substitution task. The same does not hold in the D&L latent sense space. Overall, some of the datasets (Yahoo and Antoinette) appear to be easier for the models than others (Pfizer and Besson), but more entity pairs and relations would be needed to investigate the models' variance across datasets.

4.4 Task 2

We used the question-answer pairs extracted by the Poon and Domingos (2009) semantic parser from the GENIA biomedical corpus that have been manually checked to be correct (295 pairs). Because our models were trained on newspaper text, they required adaptation to this specialized domain. Thus, we also trained the M&L, E&P and TFP models on the GENIA corpus, backing off to word vectors from the GENIA corpus when a word vector could not be found in the Gigaword-trained model. We could not do this for the D&L model, since the global latent senses that are found by latent Dirichlet allocation training do not have any absolute meaning that holds across multiple runs. Instead, we found the 5 words in the Gigaword-trained D&L model that were closest to each novel word in the GENIA corpus according to cosine similarity over the co-occurrence vectors of the words in the GENIA corpus, and took their average latent sense distributions as the vector for that word.

Unlike in Task 1, there is no control for the named entities in a sentence, because one of the entities in the semantic relation is missing. Also, distributional models have problems in dealing with named entities, which are common in this corpus, such as the names of genes and proteins. To address these issues, we tested hybrid models where the similarity score from a semantic model is added to the similarity score from the lemma overlap model.

                 Pure models        Mixed models
                 All     Subset     All     Subset
    Overlap     0.8770   0.7291    0.8770   0.7291
    Models trained on the entire Gigaword
    M&L add     0.7467   0.6106    0.8782   0.7523
    M&L mult    0.5331   0.5690    0.8841   0.7678
    D&L add     0.6552   0.5716    0.8791   0.7539
    D&L mult    0.5488   0.5255    0.8841   0.7466
    Models trained on the AFP section
    E&P add     0.4589   0.4516    0.8748   0.7375
    E&P mult    0.5201   0.5584    0.8882   0.7719
    TFP add     0.6887   0.6443    0.8940   0.7871
    TFP mult    0.5210   0.5199    0.8785   0.7432
    M&L add     0.7588   0.6206    0.8710   0.7371
    M&L mult    0.5710   0.5540    0.8801   0.7540
    D&L add     0.6358   0.5402    0.8713   0.7305
    D&L mult    0.5647   0.5461    0.8856   0.7683

Table 3: Task 2 results, in normalized rank scores. Subset is the cases where lemma overlap does not achieve a perfect score. The two columns on the right indicate performance using the sum of the scores from the lemma overlap and the semantic model. The expected random baseline performance is 0.5.

The results are presented in Table 3. Lemma overlap again presents a strong baseline, but the hybridized models are able to outperform simple lemma overlap. Unlike in Task 1, the E&P and TFP models are comparable to the D&L model, and the mixed TFP addition model achieves the best result, likely due to the need to more precisely distinguish syntactic roles in this task. The D&L addition model, which achieved the best performance in Task 1, does not perform as well in this task. This could be due to the domain adaptation procedure for the D&L model, which could not be reasonably trained on such a small, specialized corpus.

5 Related Work

Turney and Pantel (2010) survey various types of vector space models and applications thereof in computational linguistics. We summarize below a number of other word- or phrase-level distributional models.

Several approaches are specialized to deal with homography.
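The lemma overlap baseline (Eqs. 15–16) and the component-wise composition used to extend the models to the sentence level can be sketched as follows. This is an illustrative sketch only, with helper names of our own choosing; stop-word removal and lemmatization are assumed to have happened upstream.

```python
from collections import Counter

def overlap(x, y):
    """Lemma overlap M(x, y) = #In(x, y) + #In(y, x) (Eqs. 15-16).
    x and y are lists of lemmata; repeated lemmata are counted."""
    cx, cy = Counter(x), Counter(y)
    in_xy = sum(c for m, c in cx.items() if m in cy)  # tokens of x found in y
    in_yx = sum(c for m, c in cy.items() if m in cx)  # tokens of y found in x
    return in_xy + in_yx

def compose(vectors, op="add"):
    """Component-wise addition or multiplication over a list of
    equal-length word vectors, yielding a sentence vector."""
    out = list(vectors[0])
    for v in vectors[1:]:
        for i, x in enumerate(v):
            out[i] = out[i] + x if op == "add" else out[i] * x
    return out
```

The sentence vectors produced by `compose` would then be compared with cosine similarity, as all non-baseline models in the paper are.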
The top-down multi-prototype approach determines a number of senses for each word, and then clusters the occurrences of the word (Reisinger and Mooney, 2010) into these senses. A prototype vector is created for each of these sense clusters. When a new occurrence of a word is encountered, it is represented as a combination of the prototype vectors, with the degree of influence from each prototype determined by the similarity of the new context to the existing sense contexts. In contrast, the bottom-up exemplar-based approach assumes that each occurrence of a word expresses a different sense of the word. The most similar senses of the word are activated when a new occurrence of it is encountered and combined, for example with a kNN algorithm (Erk and Padó, 2010).

The models we compared and the above work assume each dimension in the feature vector corresponds to a context word. In contrast, Washtell (2011) uses potential paraphrases directly as dimensions in his expectation vectors. Unfortunately, this approach does not outperform various context word-based approaches in two phrase similarity tasks.

In terms of the vector composition function, component-wise addition and multiplication are the most popular in recent work, but there exist a number of other operators such as the tensor product and convolution product, which are reviewed by Widdows (2008). Instead of vector space representations, one could also use a matrix space representation with its much more expressive matrix operators (Rudolph and Giesbrecht, 2010). So far, however, this has only been applied to specific syntactic contexts (Baroni and Zamparelli, 2010; Guevara, 2010; Grefenstette and Sadrzadeh, 2011), or tasks (Yessenalina and Cardie, 2011).

Neural networks have been used to learn both phrase structure and representations. In Socher et al. (2010), word representations learned by neural network models such as (Bengio et al., 2006; Collobert and Weston, 2008) are fed as input into a recursive neural network whose nodes represent syntactic constituents. Each node models both the probability of the input forming a constituent and the phrase representation resulting from composition.

6 Conclusions

We have proposed an evaluation framework for distributional models of semantics which build phrase- and sentence-level representations, and instantiated two evaluation tasks which test for the crucial ability to recognize whether sentences express the same semantic relation. Our results demonstrate that compositional distributional models of semantics already have some utility in the context of more empirically complex semantic tasks than WSD-like lexical substitution tasks, in which compositional invariance is a requisite property. Simply computing lemma overlap, however, is a very competitive baseline, due to issues in these protocols with named entities and domain adaptivity. The better performance of the mixture models in Task 2 shows that such weaknesses can be addressed by hybrid semantic models. Future work should investigate more refined versions of such hybridization, as well as extend this idea to other semantic phenomena like coreference, negation and modality.

We also observe that no single model or composition operator performs best for all tasks and datasets. The latent sense mixture model of Dinu and Lapata (2010) performs well in recognizing semantic relations in general web text. Because of the difficulty of adapting it to a specialized domain, however, it does less well in biomedical question answering, where the syntax-based model of Thater et al. (2010) performs the best. A more thorough investigation of the factors that can predict the performance and/or invariance of a given composition operator is warranted.

In the future, we would like to evaluate other models of compositional semantics that have been recently proposed. We would also like to collect more comprehensive test data, to increase the external validity of our evaluations.

Acknowledgments

We would like to thank Georgiana Dinu and Stefan Thater for help with reimplementing their models. Saif Mohammad, Peter Turney, and the anonymous reviewers provided valuable comments on drafts of this paper. This project was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193.

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137–186.

Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 576–583.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.

Georgiana Dinu and Mirella Lapata. 2010. Measuring distributional similarity in context. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1162–1172.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, pages 9–16.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 897–906.

Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pages 92–97.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1394–1404.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33–37.

Zellig S. Harris. 1954. Distributional structure. Word, 10(23):146–162.

Wilfred Hodges. 2005. The interplay of fact and theory in separating syntax from meaning. In Workshop on Empirical Challenges and Analytical Alternatives to Strict Compositionality.

Walter Kintsch. 2001. Predication. Cognitive Science, 25(2):173–202.

Diana McCarthy and Roberto Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation, 43(2):139–159.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244.

Jeff Mitchell and Mirella Lapata. 2009. Language models based on semantic composition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 430–439.

Richard Montague. 1974. English as a formal language. Formal Philosophy, pages 188–221.

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1–10.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Dan Roth and Wen-tau Yih. 2002. Probabilistic reasoning for entity & relation recognition. In Proceedings of the 19th International Conference on Computational Linguistics, pages 835–841.

Sebastian Rudolph and Eugenie Giesbrecht. 2010. Compositional matrix-space models of language. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 907–916.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop of NIPS 2010, pages 1–9.

Alfred Tarski. 1956. The concept of truth in formalized languages. Logic, Semantics, Metamathematics, pages 152–278.

Stefan Thater, Hagen Fürstenau, and Manfred Pinkal. 2010. Contextualizing semantic representations using syntactically enriched vector models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 948–957.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Justin Washtell. 2011. Compositional expectation: A purely distributional model of compositional semantics. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011), pages 285–294.

Dominic Widdows. 2008. Semantic vector products: Some initial investigations. In Second AAAI Symposium on Quantum Interaction.

Stephen Wu and William Schuler. 2011. Structured composition of semantic vectors. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011), pages 295–304.

Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 172–182.


Cross-Framework Evaluation for Statistical Parsing

Reut Tsarfaty   Joakim Nivre   Evelina Andersson
Uppsala University, Box 635, 75126 Uppsala, Sweden

[email protected]

,{joakim.nivre,evelina.andersson}@lingfil.uu.se Abstract a phrase-structure tree using hard-coded conver- sion procedures (de Marneffe et al., 2006). This A serious bottleneck of comparative parser diversity poses a challenge to cross-experimental evaluation is the fact that different parsers parser evaluation, namely: How can we evaluate subscribe to different formal frameworks the performance of these different parsers relative and theoretical assumptions. Converting to one another? outputs from one framework to another is less than optimal as it easily introduces Current evaluation practices assume a set of noise into the process. Here we present a correctly annotated test data (or gold standard) principled protocol for evaluating parsing for evaluation. Typically, every parser is eval- results across frameworks based on func- uated with respect to its own formal representa- tion trees, tree generalization and edit dis- tion type and the underlying theory which it was tance metrics. This extends a previously trained to recover. Therefore, numerical scores proposed framework for cross-theory eval- uation and allows us to compare a wider of parses across experiments are incomparable. class of parsers. We demonstrate the useful- When comparing parses that belong to different ness and language independence of our pro- formal frameworks, the notion of a single gold cedure by evaluating constituency and de- standard becomes problematic, and there are two pendency parsers on English and Swedish. different questions we have to answer. First, what is an appropriate gold standard for cross-parser evaluation? 
And secondly, how can we alle- 1 Introduction viate the differences between formal representa- The goal of statistical parsers is to recover a for- tion types and theoretical assumptions in order to mal representation of the grammatical relations make our comparison sound – that is, to make sure that constitute the argument structure of natural that we are not comparing apples and oranges? language sentences. The argument structure en- A popular way to address this has been to compasses grammatical relationships between el- pick one of the frameworks and convert all ements such as subject, predicate, object, etc., parser outputs to its formal type. When com- which are useful for further (e.g., semantic) pro- paring constituency-based and dependency-based cessing. The parses yielded by different parsing parsers, for instance, the output of constituency frameworks typically obey different formal and parsers has often been converted to dependency theoretical assumptions concerning how to rep- structures prior to evaluation (Cer et al., 2010; resent the grammatical relationships in the data Nivre et al., 2010). This solution has vari- (Rambow, 2010). For example, grammatical rela- ous drawbacks. First, it demands a conversion tions may be encoded on top of dependency arcs script that maps one representation type to another in a dependency tree (Mel’ˇcuk, 1988), they may when some theoretical assumptions in one frame- decorate nodes in a phrase-structure tree (Marcus work may be incompatible with the other one. et al., 1993; Maamouri et al., 2004; Sima’an et In the constituency-to-dependency case, some al., 2001), or they may be read off of positions in constituency-based structures (e.g., coordination 44 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 44–54, Avignon, France, April 23 - 27 2012. 
2012 c Association for Computational Linguistics and ellipsis) do not comply with the single head 2 Preliminaries: Relational Schemes for assumption of dependency treebanks. Secondly, Cross-Framework Parse Evaluation these scripts may be labor intensive to create, and are available mostly for English. So the evalua- Traditionally, different statistical parsers have tion protocol becomes language-dependent. been evaluated using specially designated evalu- In Tsarfaty et al. (2011) we proposed a gen- ation measures that are designed to fit their repre- eral protocol for handling annotation discrepan- sentation types. Dependency trees are evaluated cies when comparing parses across different de- using attachment scores (Buchholz and Marsi, pendency theories. The protocol consists of three 2006), phrase-structure trees are evaluated using phases: converting all structures into function ParsEval (Black et al., 1991), LFG-based parsers trees, for each sentence, generalizing the different postulate an evaluation procedure based on f- gold standard function trees to get their common structures (Cahill et al., 2008), and so on. From a denominator, and employing an evaluation mea- downstream application point of view, there is no sure based on tree edit distance (TED) which dis- significance as to which formalism was used for cards edit operations that recover theory-specific generating the representation and which learning structures. Although the protocol is potentially methods have been utilized. The bottom line is applicable to a wide class of syntactic represen- simply which parsing framework most accurately tation types, formal restrictions in the procedures recovers a useful representation that helps to un- effectively limit its applicability only to represen- ravel the human-perceived interpretation. tations that are isomorphic to dependency trees. 
The present paper breaks new ground in the ability to soundly compare the accuracy of different parsers relative to one another given that they employ different formal representation types and obey different theoretical assumptions. Our solution generally conforms to the protocol proposed in Tsarfaty et al. (2011) but is re-formalized to allow for arbitrary linearly ordered labeled trees, thus encompassing constituency-based as well as dependency-based representations. The framework in Tsarfaty et al. (2011) assumes structures that are isomorphic to dependency trees, bypassing the problem of arbitrary branching. Here we lift this restriction, and define a protocol which is based on generalization and TED measures to soundly compare the output of different parsers. We demonstrate the utility of this protocol by comparing the performance of different parsers for English and Swedish. For English, our parser evaluation across representation types allows us to analyze and precisely quantify previously encountered performance tendencies. For Swedish we show the first ever evaluation between dependency-based and constituency-based parsing models, all trained on the Swedish treebank data. All in all we show that our extended protocol, which can handle linearly-ordered labeled trees with arbitrary branching, can soundly compare parsing results across frameworks in a representation-independent and language-independent fashion.

Relational schemes, that is, schemes that encode the set of grammatical relations that constitute the predicate-argument structures of sentences, provide an interface to semantic interpretation. They are more intuitively understood than, say, phrase-structure trees, and thus they are also more useful for practical applications. For these reasons, relational schemes have been repeatedly singled out as an appropriate level of representation for the evaluation of statistical parsers (Lin, 1995; Carroll et al., 1998; Cer et al., 2010).

The annotated data which statistical parsers are trained on encode these grammatical relationships in different ways. Dependency treebanks provide a ready-made representation of grammatical relations on top of arcs connecting the words in the sentence (Kübler et al., 2009). The Penn Treebank and phrase-structure annotated resources encode partial information about grammatical relations as dash-features decorating phrase structure nodes (Marcus et al., 1993). Treebanks like Tiger for German (Brants et al., 2002) and Talbanken for Swedish (Nivre and Megyesi, 2007) explicitly map phrase structures onto grammatical relations using dedicated edge labels. The Relational-Realizational structures of Tsarfaty and Sima'an (2008) encode relational networks (sets of relations) projected and realized by syntactic categories on top of ordinary phrase-structure nodes.

Function trees, as defined in Tsarfaty et al. (2011), are linearly ordered labeled trees in which every node is labeled with the grammatical function of the dominated span. Function trees benefit from the same advantages as other relational schemes, namely that they are intuitive to understand, they provide the interface for semantic interpretation, and thus may be useful for downstream applications. Yet they do not suffer from formal restrictions inherent in dependency structures, for instance, the single head assumption.

For many formal representation types there exists a fully deterministic, heuristics-free, procedure mapping them to function trees. In Figure 1 we illustrate some such procedures for a simple transitive sentence. Now, while all the structures at the right hand side of Figure 1 are of the same formal type (function trees), they have different tree structures due to different theoretical assumptions underlying the original formal frameworks.

[Figure 1 (tree diagrams omitted): conversions of (a) a dependency tree, (b) a phrase-structure tree, and (c) a relational-realizational structure for "John loves Mary" into function trees.]
Figure 1: Deterministic conversion into function trees. The algorithm for extracting a function tree from a dependency tree as in (a) is provided in Tsarfaty et al. (2011). For a phrase-structure tree as in (b) we can replace each node label with its function (dash-feature). In a relational-realizational structure like (c) we can remove the projection nodes (sets) and realization nodes (phrase labels), which leaves the function nodes intact.

Once we have converted framework-specific representations into function trees, the problem of cross-framework evaluation can potentially be reduced to a cross-theory evaluation following Tsarfaty et al. (2011). The main idea is that once all structures have been converted into function trees, one can perform a formal operation called generalization in order to harmonize the differences between theories, and measure accurately the distance of parse hypotheses from the generalized gold. The generalization operation defined in Tsarfaty et al. (2011), however, cannot handle trees that may contain unary chains, and therefore cannot be used for arbitrary function trees.

[Figure 2 (tree diagrams omitted): (t1) f1 dominating f2 dominating w; (t2) f2 dominating f1 dominating w; (t3) the label set {f1,f2} dominating w.]
Figure 2: Unary chains in function trees.

Consider for instance (t1) and (t2) in Figure 2. According to the definition of subsumption in Tsarfaty et al. (2011), (t1) is subsumed by (t2) and vice versa, so the two trees should be identical – but they are not. The interpretation we wish to give to a function tree such as (t1) is that the word w has both the grammatical function f1 and the grammatical function f2. This can be graphically represented as a set of labels dominating w, as in (t3). We call structures such as (t3) multi-function trees. In the next section we formally define multi-function trees, and then use them to develop our protocol for cross-framework and cross-theory evaluation.

3 The Proposal: Cross-Framework Evaluation with Multi-Function Trees

Our proposal is a three-phase evaluation protocol in the spirit of Tsarfaty et al. (2011). First, we obtain a formal common ground for all frameworks in terms of multi-function trees. Then we obtain a theoretical common ground by means of tree-generalization on gold trees. Finally, we calculate TED-based scores that discard the cost of annotation-specific edits. In this section, we define multi-function trees and update the tree-generalization and TED-based metrics to handle multi-function trees that reflect different theories.

[Figure 3 (diagram omitted).]
Figure 3: The Evaluation Protocol. Different formal frameworks yield different parse and gold formal types. All types are transformed into multi-function trees. All gold trees enter generalization to yield a new gold for each sentence. The different δ arcs represent the different edit scripts used for calculating the TED-based scores.

3.1 Defining Multi-Function Trees

An ordinary function tree is a linearly ordered tree T = (V, A) with yield w1, ..., wn, where internal nodes are labeled with grammatical function labels drawn from some set L. We use span(v) and label(v) to denote the yield and label, respectively, of an internal node v. A multi-function tree is a linearly ordered tree T = (V, A) with yield w1, ..., wn, where internal nodes are labeled with sets of grammatical function labels drawn from L and where v ≠ v′ implies span(v) ≠ span(v′) for all internal nodes v, v′. We use labels(v) to denote the label set of an internal node v.
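To make the definition concrete, a multi-function tree can be stored as a map from spans to label sets. The sketch below is our own illustration, not the authors' released code; the nested-tuple tree encoding and the function name are assumptions. It also shows how a function tree with unary chains collapses into a multi-function tree, as in the (t1)/(t2)/(t3) example of Figure 2.

```python
# Sketch (ours): a multi-function tree as a span -> label-set map.
# Trees are assumed to be nested (label, children) tuples with words as leaves.

def to_multi_function_tree(tree, start=0):
    """Collapse a function tree into a span -> label-set map.
    Labels of a unary chain over the same span merge into one set,
    e.g. f1 over f2 over w becomes {(0, 1): {'f1', 'f2'}}."""
    spans = {}

    def walk(node, i):
        if isinstance(node, str):          # a word: occupies one position
            return i + 1
        label, children = node
        j = i
        for child in children:
            j = walk(child, j)
        spans.setdefault((i, j), set()).add(label)
        return j

    walk(tree, start)
    return spans

# (t1): f1 dominating f2 dominating w, vs (t2): f2 dominating f1 dominating w.
t1 = ('f1', [('f2', ['w'])])
t2 = ('f2', [('f1', ['w'])])
assert to_multi_function_tree(t1) == to_multi_function_tree(t2) == {(0, 1): {'f1', 'f2'}}
```

Because both unary orders collapse to the same span map, the subsumption anomaly of (t1) and (t2) disappears by construction.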
We interpret multi-function trees as encoding sets of functional constraints over spans in function trees. Each node v in a multi-function tree represents a constraint of the form: for each l ∈ labels(v), there should be a node v′ in the function tree such that span(v) = span(v′) and label(v′) = l. Whenever we have a conversion for function trees, we can efficiently collapse them into multi-function trees with no unary productions, and with label sets labeling their nodes. Thus, trees (t1) and (t2) in Figure 2 would both be mapped to tree (t3), which encodes the functional constraints encoded in either of them.

For dependency trees, we assume the conversion to function trees defined in Tsarfaty et al. (2011), where head daughters always get the label 'hd'. For PTB style phrase-structure trees, we replace the phrase-structure labels with functional dash-features. In relational-realization structures we remove projection and realization nodes. Deterministic conversions exist also for Tiger style treebanks and frameworks such as LFG, but we do not discuss them here.¹

¹ All the conversions we use are deterministic and are defined in graph-theoretic and language-independent terms. We make them available at http://stp.lingfil.uu.se/~tsarfaty/unipar/index.html.

3.2 Generalizing Multi-Function Trees

Once we obtain multi-function trees for all the different gold standard representations in the system, we feed them to a generalization operation as shown in Figure 3. The goal of this operation is to provide a consensus gold standard that captures the linguistic structure that the different gold theories agree on. The generalization structures are later used as the basis for the TED-based evaluation. Generalization is defined by means of subsumption. A multi-function tree subsumes another one if and only if all the constraints defined by the first tree are also defined by the second tree. So, instead of demanding equality of labels as in Tsarfaty et al. (2011), we demand set inclusion:

T-Subsumption, denoted ⊑t, is a relation between multi-function trees that indicates that a tree π1 is consistent with and more general than tree π2. Formally: π1 ⊑t π2 iff for every node n ∈ π1 there exists a node m ∈ π2 such that span(n) = span(m) and labels(n) ⊆ labels(m).

T-Unification, denoted ⊔t, is an operation that returns the most general tree structure that contains the information from both input trees, and fails if such a tree does not exist. Formally: π1 ⊔t π2 = π3 iff π1 ⊑t π3 and π2 ⊑t π3, and for all π4 such that π1 ⊑t π4 and π2 ⊑t π4 it holds that π3 ⊑t π4.

T-Generalization, denoted ⊓t, is an operation that returns the most specific tree that is more general than both trees. Formally, π1 ⊓t π2 = π3 iff π3 ⊑t π1 and π3 ⊑t π2, and for every π4 such that π4 ⊑t π1 and π4 ⊑t π2 it holds that π4 ⊑t π3.
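Under the span-map encoding of multi-function trees, these three operations reduce to per-span set operations. A minimal sketch under our own encoding assumptions (not the released TedEval implementation; it does not re-check that the unified span set still nests into a well-formed tree, which real unification must do):

```python
# Sketch (ours): subsumption, generalization and unification on multi-function
# trees stored as span -> label-set maps.

def subsumes(pi1, pi2):
    """pi1 subsumes-under pi2: every node of pi1 has a node in pi2 with the
    same span and a superset of its labels."""
    return all(span in pi2 and labels <= pi2[span] for span, labels in pi1.items())

def generalize(pi1, pi2):
    """Most specific tree more general than both: nodes existing in both trees,
    each labeled with the intersection of the label sets (possibly empty)."""
    return {span: pi1[span] & pi2[span] for span in pi1.keys() & pi2.keys()}

def unify(pi1, pi2):
    """Most general tree containing both: nodes existing in either tree,
    each labeled with the union of all label sets for that span."""
    return {span: pi1.get(span, set()) | pi2.get(span, set())
            for span in pi1.keys() | pi2.keys()}

gold_a = {(0, 3): {'root'}, (0, 1): {'sbj'}, (1, 2): {'hd'}, (2, 3): {'obj'}}
gold_b = {(0, 3): {'root'}, (0, 1): {'sbj', 'nn'}, (2, 3): {'dobj'}}
gen = generalize(gold_a, gold_b)
# The span the theories disagree on keeps a node with an empty label set.
assert gen == {(0, 3): {'root'}, (0, 1): {'sbj'}, (2, 3): set()}
assert subsumes(gen, gold_a) and subsumes(gen, gold_b)
```

Note the design choice that mirrors the text: disagreement over a span yields an empty label set rather than no node, so the generalized gold still records that both theories posit a constituent there.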
The generalization tree contains all nodes that exist in both trees, and for each node it is labeled by the intersection of the label sets dominating the same span in both trees. The unification tree contains nodes that exist in one tree or another, and for each span it is labeled by the union of all label sets for this span in either tree. If we generalize two trees and one tree has no specification for labels over a span, it does not share anything with the label set dominating the same span in the other tree, and the label set dominating this span in the generalized tree is empty. If the trees do not agree on any label for a particular span, the respective node is similarly labeled with an empty set. When we wish to unify theories, then an empty set over a span is unified with any other set dominating the same span in the other tree, without altering it.

Digression: Using Unification to Merge Information From Different Treebanks. In Tsarfaty et al. (2011), only the generalization operation was used, providing the common denominator of all the gold structures and serving as a common ground for evaluation. The unification operation is useful for other NLP tasks, for instance, combining information from two different annotation schemes or enriching one annotation scheme with information from a different one. In particular, we can take advantage of the new framework to enrich the node structure reflected in one theory with grammatical functions reflected in an annotation scheme that follows a different theory. To do so, we define the Tree-Labeling-Unification operation on multi-function trees.

TL-Unification, denoted ⊔tl, is an operation that returns a tree that retains the structure of the first tree and adds labels that exist over its spans in the second tree. Formally: π1 ⊔tl π2 = π3 iff for every node n ∈ π1 there exists a node m ∈ π3 such that span(m) = span(n) and labels(m) = labels(n) ∪ labels(π2, span(n)),

where labels(π2, span(n)) is the set of labels of the node with yield span(n) in π2 if such a node exists and ∅ otherwise. We further discuss the TL-Unification and its use for data preparation in §4.

3.3 TED Measures for Multi-Function Trees

The result of the generalization operation provides us with multi-function trees for each of the sentences in the test set representing sets of constraints on which the different gold theories agree. We would now like to use distance-based metrics in order to measure the gap between the gold and predicted theories. The idea behind distance-based evaluation in Tsarfaty et al. (2011) is that recording the edit operations between the native gold and the generalized gold allows one to discard their cost when computing the cost of a parse hypothesis turned into the generalized gold. This makes sure that different parsers do not get penalized, or favored, due to annotation specific decisions that are not shared by other frameworks.

The problem is now that TED is undefined with respect to multi-function trees because it cannot handle complex labels. To overcome this, we convert multi-function trees into sorted function trees, which are simply function trees in which any label set is represented as a unary chain of single-labeled nodes, and the nodes are sorted according to the canonical order of their labels.² In case of an empty set, a 0-length chain is created, that is, no node is created over this span. Sorted function trees prevent reordering nodes in a chain in one tree to fit the order in another tree, since it would violate the idea that the set of constraints over a span in a multi-function tree is unordered.

² The ordering can be alphabetic, thematic, etc.

The edit operations we assume are add-node(l, i, j) and delete-node(l, i, j), where l ∈ L is a grammatical function label and i < j define the span of a node in the tree. Insertion into a unary chain must conform to the canonical order of the labels. Every operation is assigned a cost. An edit script is a sequence of edit operations that turns a function tree π1 into π2, that is:

ES(π1, π2) = ⟨e1, ..., ek⟩

Since all operations are anchored in spans, the sequence can be determined to have a unique order of traversing the tree (say, DFS). Different edit scripts then only differ in their set of operations on spans. The edit distance problem is finding the minimal cost script, that is, one needs to solve:

ES*(π1, π2) = min_{ES(π1,π2)} Σ_{e ∈ ES(π1,π2)} cost(e)

In the current setting, when using only add and delete operations on spans, there is only one edit script that corresponds to the minimal edit cost. So, finding the minimal edit script entails finding a single set of operations turning π1 into π2.
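The reduction from label sets to sorted unary chains, and the resulting uniqueness of the minimal add/delete script, can be sketched as follows. This is our own illustration; the alphabetic canonical order and the identification of a chain node by its (label, span) pair are assumptions on top of the text.

```python
# Sketch (ours): sorted function trees and the unique minimal edit script.
# A multi-function tree is a span -> label-set map; each label over a span is
# one node of the sorted function tree, and its chain position is fixed by the
# canonical order, so a node can be identified by (label, span) alone.

def sorted_chain(labels):
    return sorted(labels)   # canonical order; alphabetic here, could be thematic

def nodes(tree):
    """Node set of the sorted function tree; an empty label set contributes
    nothing (a 0-length chain)."""
    return {(label, span) for span, labels in tree.items() for label in labels}

def minimal_edit_script(src, tgt):
    """With only span-anchored add-node / delete-node operations, the
    minimal-cost script is unique: delete the nodes found only in src,
    add the nodes found only in tgt."""
    dels = {('delete-node', l, i, j) for (l, (i, j)) in nodes(src) - nodes(tgt)}
    adds = {('add-node', l, i, j) for (l, (i, j)) in nodes(tgt) - nodes(src)}
    return dels | adds
```

For example, going from a span labeled {f2} to the same span labeled {f1, f2} is a single add-node operation; inserting f1 into the chain in canonical order is implicit in the encoding.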
All experiments dard goldi and to the generalized gold gen. This use Penn Treebank (PTB) data. For Swedish, is the edit cost minus the cost of the script turning we compare MaltParser and MSTParser with two parsei into gen intersected with the script turning variants of the Berkeley parser, one trained on goldi into gen. The underlying intuition is that phrase structure trees, and one trained on a vari- if an operation that was used to turn parsei into ant of the Relational-Realizational representation gen is used to discard theory-specific information of Tsarfaty and Sima’an (2008). All experiments from goldi , its cost should not be counted as error. use the Talbanken Swedish Treebank (STB) data. δ(parsei , goldi , gen) = cost(ES ∗ (parsei , gen)) 4.1 English Cross-Framework Evaluation We use sections 02–21 of the WSJ Penn Tree- −cost(ES ∗ (parsei , gen) ∩ ES ∗ (goldi , gen)) bank for training and section 00 for evaluation and In order to turn distance measures into parse- analysis. We use two different native gold stan- scores we now normalize the error relative to the dards subscribing to different theories of encoding size of the trees and subtract it from a unity. So grammatical relations in tree structures: the Sentence Score for parsing with framework i ◦ T HE DEPENDENCY- BASED THEORY is the is: theory encoded in the basic Stanford Depen- score(parsei , goldi , gen) = dencies (SD) scheme. We obtain the set of δ(parsei , goldi ,gen) basic stanford dependency trees using the 1− software of de Marneffe et al. (2006) and |parsei | + |gen| train the dependency parsers directly on it. 
Finally, Test-Set Average is defined by macro- avaraging over all sentences in the test-set: ◦ T HE CONSTITUENCY- BASED THEORY is the theory reflected in the phrase-structure P|testset| representation of the PTB (Marcus et al., j=1 δ(parseij , goldij , genj ) 1− P|testset| 1993) enriched with function labels compat- j=1 |parseij | + |genj | ible with the Stanford Dependencies (SD) scheme. We obtain trees that reflect this This last formula represents the T ED E VAL metric theory by TL-Unification of the PTB multi- that we use in our experiments. function trees with the SD multi-function A Note on System Complexity Conversion of trees (PTBttl SD) as illustrated in Figure 4. a dependency or a constituency tree into a func- The theory encoded in the multi-function trees tion tree is linear in the size of the tree. Our corresponding to SD is different from the one implementation of the generalization and unifica- obtained by our TL-Unification, as may be seen tion operation is an exact, greedy, chart-based al- from the difference between the flat SD multi- gorithm that runs in polynomial time (O(n2 ) in function tree and the result of the PTBttl SD in n the number of terminals). The TED software Figure 4. Another difference concerns coordina- that we utilize builds on the TED efficient algo- tion structures, encoded as binary branching trees rithm of Zhang and Shasha (1989) which runs in in SD and as flat productions in the PTBttl SD. O(|T1 ||T2 | min(d1 , n1 ) min(d2 , n2 )) time where Such differences are not only observable but also di is the tree degree (depth) and ni is the number quantifiable, and using our redefined TED metric of terminals in the respective tree (Bille, 2005). the cross-theory overlap is 0.8571. The two dependency parsers were trained using 4 Experiments the same settings as in Tsarfaty et al. (2011), using We validate our cross-framework evaluation pro- SVMTool (Gim´enez and M`arquez, 2004) to pre- cedure on two languages, English and Swedish. 
4 Experiments

We validate our cross-framework evaluation procedure on two languages, English and Swedish. For English, we compare the performance of two dependency parsers, MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2005), and two constituency-based parsers, the Berkeley parser (Petrov et al., 2006) and the Brown parser (Charniak and Johnson, 2005). All experiments use Penn Treebank (PTB) data. For Swedish, we compare MaltParser and MSTParser with two variants of the Berkeley parser, one trained on phrase structure trees, and one trained on a variant of the Relational-Realizational representation of Tsarfaty and Sima'an (2008). All experiments use the Talbanken Swedish Treebank (STB) data.

4.1 English Cross-Framework Evaluation

We use sections 02–21 of the WSJ Penn Treebank for training and section 00 for evaluation and analysis. We use two different native gold standards subscribing to different theories of encoding grammatical relations in tree structures:

◦ The dependency-based theory is the theory encoded in the basic Stanford Dependencies (SD) scheme. We obtain the set of basic Stanford Dependency trees using the software of de Marneffe et al. (2006) and train the dependency parsers directly on it.

◦ The constituency-based theory is the theory reflected in the phrase-structure representation of the PTB (Marcus et al., 1993) enriched with function labels compatible with the Stanford Dependencies (SD) scheme. We obtain trees that reflect this theory by TL-Unification of the PTB multi-function trees with the SD multi-function trees (PTB ⊔tl SD) as illustrated in Figure 4.

[Figure 4 (tree diagrams omitted): the PTB tree and the SD tree for "John loves Mary" converted to multi-function trees, and their TL-Unification.]
Figure 4: Conversion of PTB and SD tree to multi-function trees, followed by TL-Unification of the trees. Note that some PTB nodes remain without an SD label.

The theory encoded in the multi-function trees corresponding to SD is different from the one obtained by our TL-Unification, as may be seen from the difference between the flat SD multi-function tree and the result of the PTB ⊔tl SD in Figure 4. Another difference concerns coordination structures, encoded as binary branching trees in SD and as flat productions in the PTB ⊔tl SD. Such differences are not only observable but also quantifiable, and using our redefined TED metric the cross-theory overlap is 0.8571.

The two dependency parsers were trained using the same settings as in Tsarfaty et al. (2011), using SVMTool (Giménez and Màrquez, 2004) to predict part-of-speech tags at parsing time. The two constituency parsers were used with default settings and were allowed to predict their own part-of-speech tags. We report three different evaluation metrics for the different experiments:

◦ LAS/UAS (Buchholz and Marsi, 2006)
◦ ParsEval (Black et al., 1991)
◦ TedEval as defined in Section 3

We use LAS/UAS for dependency parsers that were trained on the same dependency theory. We use ParsEval to evaluate phrase-structure parsers that were trained on PTB trees in which dash-features and empty traces are removed. We use our implementation of TedEval to evaluate parsing results across all frameworks under two different scenarios:³ TedEval Single evaluates against the native gold multi-function trees. TedEval Multiple evaluates against the generalized (cross-theory) multi-function trees. Unlabeled TedEval scores are obtained by simply removing all labels from the multi-function nodes, and using unlabeled edit operations. We calculate pairwise statistical significance using a shuffling test with 10K iterations (Cohen, 1995).

³ Our TedEval software can be downloaded at http://stp.lingfil.uu.se/~tsarfaty/unipar/download.html.

Tables 1 and 2 present the results of our cross-framework evaluation for English parsing. In the left column of Table 1 we report ParsEval scores for constituency-based parsers. As expected, F-Scores for the Brown parser are higher than the F-Scores of the Berkeley parser. F-Scores are however not applicable across frameworks. In the rightmost column of Table 1 we report the LAS/UAS results for all parsers. If a parser yields a constituency tree, it is converted to and evaluated on SD. Here we see that MST outperforms Malt, though the differences for labeled dependencies are insignificant. We also observe here a familiar pattern from Cer et al. (2010) and others, where the constituency parsers significantly outperform the dependency parsers after conversion of their output into dependencies.

The conversion to SD allows one to compare results across formal frameworks, but not without a cost. The conversion introduces a set of annotation specific decisions which may introduce a bias into the evaluation. In the middle column of Table 1 we report the TedEval metrics measured against the generalized gold standard for all parsing frameworks. We can now confirm that the constituency-based parsers significantly outperform the dependency parsers, and that this is not due to specific theoretical decisions which are seen to affect LAS/UAS metrics (Schwartz et al., 2011). For the dependency parsers we now see that Malt outperforms MST on labeled dependencies slightly, but the difference is insignificant.

The fact that the discrepancy in theoretical assumptions between different frameworks indeed affects the conversion-based evaluation procedure is reflected in the results we report in Table 2. Here the leftmost and rightmost columns report TedEval scores against the own native gold (Single) and the middle column against the generalized gold (Multiple). Had the theories for SD and PTB ⊔tl SD been identical, TedEval Single and TedEval Multiple would have been equal in each line. Because of theoretical discrepancies, we see small gaps in parser performance between these cases. Our protocol ensures that such discrepancies do not bias the results.
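The pairwise significance computation used above can be sketched as a standard approximate-randomization (shuffling) test over per-sentence scores. This is our own minimal version; beyond citing Cohen (1995) and the 10K iterations, the exact procedure is an assumption.

```python
# Sketch (ours): a paired shuffling test for the difference between two systems'
# per-sentence scores, with 10K iterations as in the paper.

import random

def shuffling_test(scores_a, scores_b, iterations=10_000, seed=0):
    """Approximate p-value for the observed mean score difference.
    Per sentence, the two systems' scores are swapped with probability 1/2;
    the p-value is the fraction of shuffles whose absolute mean difference
    is at least the observed one (with add-one smoothing)."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(iterations):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        hits += abs(diff) / n >= observed
    return (hits + 1) / (iterations + 1)
```

With identical score lists the observed difference is zero and the returned p-value is 1.0; large, consistent per-sentence gaps drive it toward the smoothing floor of 1/(iterations+1).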
             PS Trees     MF Trees               Dep Trees
Theory       PTB ⊔tl SD   (PTB ⊔tl SD) ⊓t SD     SD
Metrics      ParsEval     TedEval                AttScores
Malt         N/A          U: 0.9266  L: 0.8225   U: 0.8298  L: 0.7782
MST          N/A          U: 0.9275  L: 0.8121   U: 0.8438  L: 0.7824
Berkeley     F: 0.9096    U: 0.9677  L: 0.9227   U: 0.9254  L: 0.9031
Brown        F: 0.9129    U: 0.9702  L: 0.9264   U: 0.9289  L: 0.9057

Table 1: English cross-framework evaluation: Three measures as applicable to the different schemes.

             PS Trees               MF Trees               Dep Trees
Theory       PTB ⊔tl SD             (PTB ⊔tl SD) ⊓t SD     SD
Metrics      TedEval Single         TedEval Multiple       TedEval Single
Malt         N/A                    U: 0.9266  L: 0.8225   U: 0.9264  L: 0.8372
MST          N/A                    U: 0.9275  L: 0.8121   U: 0.9272  L: 0.8275
Berkeley     U: 0.9645  L: 0.9271   U: 0.9677  L: 0.9227   U: 0.9649  L: 0.9324
Brown        U: 0.9667  L: 0.9301   U: 0.9702  L: 0.9264   U: 0.9679  L: 0.9362

Table 2: English cross-framework evaluation: TedEval scores against gold and generalized gold.

4.2 Cross-Framework Swedish Parsing

We use the standard training and test sets of the Swedish Treebank (Nivre and Megyesi, 2007) with two gold standards presupposing different theories:

• The dependency-based theory is the dependency version of the Swedish Treebank. All trees are projectivized (STB-Dep).

• The constituency-based theory is the standard Swedish Treebank with grammatical function labels on the edges of constituency structures (STB).

Because there are no parsers that can output the complete STB representation including edge labels, we experiment with two variants of this theory, one which is obtained by simply removing the edge labels and keeping only the phrase-structure labels (STB-PS) and one which is loosely based on the Relational-Realizational scheme of Tsarfaty and Sima'an (2008) but excludes the projection set nodes (STB-RR). RR trees only add function nodes to PS trees, and it holds that STB-PS ⊓t STB-RR = STB-PS. The overlap between the theories expressed in multi-function trees originating from STB-Dep and STB-RR is 0.7559. Our evaluation protocol takes into account such discrepancies while avoiding biases that may be caused due to these differences.

We evaluate MaltParser, MSTParser and two versions of the Berkeley parser, one trained on STB-PS and one trained on STB-RR. We use predicted part-of-speech tags for the dependency parsers, using the HunPoS tagger (Megyesi, 2009), but let the Berkeley parser predict its own tags. We use the same evaluation metrics and procedures as before. Prior to evaluating RR trees using ParsEval we strip off the added function nodes. Prior to evaluating them using TedEval we strip off the phrase-structure nodes.

             PS Trees    MF Trees               Dep Trees
Theory       STB         STB ⊓t Dep             Dep
Metrics      ParsEval    TedEval                AttScores
Malt         N/A         U: 0.9525  L: 0.9088   U: 0.8962  L: 0.8772
MST          N/A         U: 0.9549  L: 0.9049   U: 0.9059  L: 0.8795
Bkly/STB-RR  F: 0.7914   U: 0.9281  L: 0.7861   N/A
Bkly/STB-PS  F: 0.7855   N/A                    N/A

Table 3: Swedish cross-framework evaluation: Three measures as applicable to the different schemes.

             PS Trees               MF Trees               Dep Trees
Theory       STB                    STB ⊓t Dep             Dep
Metrics      TedEval Single         TedEval Multiple       TedEval Single
Malt         N/A                    U: 0.9525  L: 0.9088   U: 0.9524  L: 0.9186
MST          N/A                    U: 0.9549  L: 0.9049   U: 0.9548  L: 0.9149
Bkly-STB-RR  U: 0.9239  L: 0.7946   U: 0.9281  L: 0.7861   N/A

Table 4: Swedish cross-framework evaluation: TedEval scores against the native gold and the generalized gold.

Tables 3 and 4 summarize the parsing results for the different Swedish parsers. In the leftmost column of Table 3 we present the constituency-based evaluation measures. Interestingly, the Berkeley RR instantiation performs better than the Berkeley parser trained on PS trees. These constituency-based scores however have a limited applicability, and we cannot use them to compare constituency and dependency parsers. In the rightmost column of Table 3 we report the LAS/UAS results for the two dependency parsers. Here we see higher performance demonstrated by MST on both labeled and unlabeled dependencies, but the differences on labeled dependencies are insignificant. Since there is no automatic procedure for converting bare-bone phrase-structure Swedish trees to dependency trees, we cannot use LAS/UAS to compare across frameworks, and we use TedEval for cross-framework evaluation.

Training the Berkeley parser on RR trees, which encode a mapping of PS nodes to grammatical functions, allows us to compare parse results for trees belonging to the STB theory with trees obeying the STB-Dep theory. For unlabeled TedEval scores, the dependency parsers perform at the same level as the constituency parser, though the difference is insignificant. For labeled TedEval the dependency parsers significantly outperform the constituency parser. When considering only the dependency parsers, there is a small advantage for Malt on labeled dependencies, and an advantage for MST on unlabeled dependencies, but the latter is insignificant. This effect is replicated in Table 4 where we evaluate dependency parsers using TedEval against their own gold theories. Table 4 further confirms that there is a gap between the STB and the STB-Dep theories, reflected in the scores against the native and generalized gold.

5 Discussion

We presented a formal protocol for evaluating parsers across frameworks and used it to soundly compare parsing results for English and Swedish. Our approach follows the three-phase protocol of Tsarfaty et al. (2011), namely: (i) obtaining a formal common ground for the different representation types, (ii) computing the theoretical common ground for each test sentence, and (iii) counting only what counts, that is, measuring the distance between the common ground and the parse tree while discarding annotation-specific edits.

A pre-condition for applying our protocol is the availability of a relational interpretation of trees in the different frameworks. For dependency frameworks this is straightforward, as these relations are encoded on top of dependency arcs. For constituency trees with an inherent mapping of nodes onto grammatical relations (Merlo and Musillo, 2005; Gabbard et al., 2006; Tsarfaty and Sima'an, 2008), a procedure for reading relational schemes off of the trees is trivial to implement.

For parsers that are trained on and parse into bare-bones phrase-structure trees this is not so. Reading off the relational structure may be more costly and require interjection of additional theoretical assumptions via manually written scripts. Scripts that read off grammatical relations based on tree positions work well for configurational languages such as English (de Marneffe et al., 2006), but since grammatical relations are reflected differently in different languages (Postal and Perlmutter, 1977; Bresnan, 2000), a procedure to read off these relations in a language-independent fashion from phrase-structure trees does not, and should not, exist (Rambow, 2010).

The crucial point is that even when using external scripts for recovering a relational scheme for phrase-structure trees, our protocol has a clear advantage over simply scoring converted trees. Manually created conversion scripts alter the theoretical assumptions inherent in the trees and thus may bias the results. Our generalization operation and three-way TED make sure that theory-specific idiosyncrasies injected through such scripts do not lead to over-penalizing or over-crediting theory-specific structural variations.

Certain linguistic structures cannot yet be evaluated with our protocol because of the strict assumption that the labeled spans in a parse form a tree. In the future we plan to extend the protocol for evaluating structures that go beyond linearly-ordered trees in order to allow for non-projective trees and directed acyclic graphs. In addition, we plan to lift the restriction that the parse yield is known in advance, in order to allow for evaluation of joint parse-segmentation hypotheses.

6 Conclusion

We developed a protocol for comparing parsing results across different theories and representation types which is framework-independent in the sense that it can accommodate any formal syntactic framework that encodes grammatical relations, and it is language-independent in the sense that there is no language specific knowledge encoded in the procedure. As such, this protocol is adequate for parser evaluation in cross-framework and cross-language tasks and parsing competitions, and using it across the board is expected to open new horizons in our understanding of the strengths and weaknesses of different parsers in the face of different theories and different data.

Acknowledgments

We thank David McClosky, Marco Kuhlmann, Yoav Goldberg and three anonymous reviewers for useful comments. We further thank Jennifer Foster for the Brown parses and parameter files. This research is partly funded by the Swedish National Science Foundation.
References

Philip Bille. 2005. A survey on tree edit distance and related problems. Theoretical Computer Science, 337:217–239.

Ezra Black, Steven P. Abney, D. Flickenger, Claudia Gdaniec, Ralph Grishman, P. Harrison, Donald Hindle, Robert Ingria, Frederick Jelinek, Judith L. Klavans, Mark Liberman, Mitchell P. Marcus, Salim Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Workshop on Speech and Natural Language, pages 306–311.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The Tiger treebank. In Proceedings of TLT.

Joan Bresnan. 2000. Lexical-Functional Syntax. Blackwell.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, pages 149–164.

Aoife Cahill, Michael Burke, Ruth O'Donovan, Stefan Riezler, Josef van Genabith, and Andy Way. 2008. Wide-coverage deep statistical parsing using automatic dependency structure annotation. Computational Linguistics, 34(1):81–124.

John Carroll, Edward Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of LREC, pages 447–454.

Daniel Cer, Marie-Catherine de Marneffe, Daniel Jurafsky, and Christopher D. Manning. 2010. Parsing to Stanford Dependencies: Trade-offs between speed and accuracy. In Proceedings of LREC.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL.

Paul Cohen. 1995. Empirical Methods for Artificial Intelligence. The MIT Press.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pages 449–454.

Ryan Gabbard, Mitchell Marcus, and Seth Kulick. 2006. Fully parsing the Penn treebank. In Proceedings of HLT-NAACL, pages 184–191.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of LREC.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Number 2 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In Proceedings of IJCAI-95, pages 1420–1425.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, USA. Association for Computational Linguistics.

Beata Megyesi. 2009. The open source tagger HunPoS for Swedish. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA), pages 239–241.

Igor Mel'čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

Paola Merlo and Gabriele Musillo. 2005. Accurate function parsing. In Proceedings of EMNLP, pages 620–627.

Joakim Nivre and Beata Megyesi. 2007. Bootstrapping a Swedish Treebank using cross-corpus harmonization and annotation projection. In Proceedings of TLT.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, pages 2216–2219.

Joakim Nivre, Laura Rimell, Ryan McDonald, and Carlos Gómez-Rodríguez. 2010. Evaluation of dependency parsers on unbounded dependencies. In Proceedings of COLING, pages 813–821.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of ACL.

Paul M. Postal and David M. Perlmutter. 1977. Toward a universal characterization of passivization. In Proceedings of the 3rd Annual Meeting of the Berkeley Linguistics Society, pages 394–417.

Owen Rambow. 2010. The simple truth about dependency and phrase structure representations: An opinion piece. In Proceedings of HLT-ACL, pages 337–340.

Roy Schwartz, Omri Abend, Roi Reichart, and Ari Rappoport. 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In Proceedings of ACL, pages 663–672.

Khalil Sima'an, Alon Itai, Yoad Winter, Alon Altman, and Noa Nativ. 2001. Building a tree-bank for Modern Hebrew text. In Traitement Automatique des Langues.

Reut Tsarfaty and Khalil Sima'an. 2008. Relational-Realizational parsing. In Proceedings of CoLing.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2011. Evaluating dependency parsing: Robust and heuristics-free cross-framework evaluation. In Proceedings of EMNLP.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262.

Dependency Parsing of Hungarian: Baseline Results and Challenges

Richárd Farkas¹, Veronika Vincze², Helmut Schmid¹
¹ Institute for Natural Language Processing, University of Stuttgart
{farkas,schmid}@ims.uni-stuttgart.de
² Research Group on Artificial Intelligence, Hungarian Academy of Sciences

[email protected]

Abstract

Hungarian is a stereotype of morphologically rich and non-configurational languages. Here, we introduce results on dependency parsing of Hungarian that employ an 80K, multi-domain, fully manually annotated corpus, the Szeged Dependency Treebank. We show that the results achieved by state-of-the-art data-driven parsers on Hungarian and English (which is at the other end of the configurational-non-configurational spectrum) are quite similar to each other in terms of attachment scores. We reveal the reasons for this and present a systematic and comparative linguistically motivated error analysis on both languages. This analysis highlights that addressing the language-specific phenomena is required for a further remarkable error reduction.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 55–65, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

From the viewpoint of syntactic parsing, the languages of the world are usually categorized according to their level of configurationality. At one end, there is English, a strongly configurational language, while Hungarian is at the other end of the spectrum: it has very few fixed structures at the sentence level. Leaving aside the issue of the internal structure of NPs, most sentence-level syntactic information in Hungarian is conveyed by morphology, not by configuration (É. Kiss, 2002).

A large part of the methodology for syntactic parsing has been developed for English. However, parsing non-configurational and less configurational languages requires different techniques. In this study, we present results on Hungarian dependency parsing and we investigate this general issue in the case of English and Hungarian.

We employed three state-of-the-art data-driven parsers (Nivre et al., 2004; McDonald et al., 2005; Bohnet, 2010), which achieved (un)labeled attachment scores on Hungarian not so different from the corresponding English scores (and even higher on certain domains/subcorpora). Our investigations show that the feature representation used by the data-driven parsers is so rich that they can – without any modification – effectively learn a reasonable model for non-configurational languages as well.

We also conducted a systematic and comparative error analysis of the systems' outputs for Hungarian and English. This analysis highlights the challenges of parsing Hungarian and suggests that the further improvement of parsers requires special handling of language-specific phenomena. We believe that some of our findings can be relevant for intermediate languages on the configurational-non-configurational spectrum.

2 Chief Characteristics of the Hungarian Morphosyntax

Hungarian is an agglutinative language, which means that a word can have hundreds of word forms due to inflectional or derivational affixation. A lot of grammatical information is encoded in morphology and Hungarian is a stereotype of morphologically rich languages. Hungarian word order is free in the sense that the positions of the subject, the object and the verb are not fixed within the sentence, but word order is related to information structure: new (or emphatic) information (the focus) always precedes the verb and old information (the topic) precedes the focus position. Thus, the position relative to the verb has no predictive force as regards the syntactic function of the given argument: while in English, the noun phrase before the verb is most typically the subject, in Hungarian, it is the focus of the sentence, which itself can be the subject, object or any other argument (É. Kiss, 2002).

The grammatical function of words is determined by case suffixes as in gyerek "child" – gyereknek (child-DAT) "for (a/the) child". Hungarian nouns can have about 20 cases[1], which mark the relationship between the head and its arguments and adjuncts. Although there are postpositions in Hungarian, case suffixes can also express relations that are expressed by prepositions in English.

[1] Hungarian grammars and morphological coding systems do not agree on the exact number of cases; some rare suffixes are treated as derivational suffixes in one grammar and as case suffixes in others; see e.g. Farkas et al. (2010).

Verbs are inflected for person and number and the definiteness of the object. Since conjugational information is sufficient to deduce the pronominal subject or object, they are typically omitted from the sentence: Várlak (wait-1SG2OBJ) "I am waiting for you". This pro-drop feature of Hungarian leads to the fact that there are several clauses without an overt subject or object.

Another peculiarity of Hungarian is that the third person singular present tense indicative form of the copula is phonologically empty, i.e. there are apparently verbless sentences in Hungarian: A ház nagy (the house big) "The house is big". However, in other tenses or moods, the copula is present, as in A ház nagy lesz (the house big will.be) "The house will be big".

There are two possessive constructions in Hungarian. First, the possessive relation is only marked on the possessed noun (in contrast, it is marked only on the possessor in English): a fiú kutyája (the boy dog-POSS) "the boy's dog". Second, both the possessor and the possessed bear a possessive marker: a fiúnak a kutyája (the boy-DAT the dog-POSS) "the boy's dog". In the latter case, the possessor and the possessed may not be adjacent within the sentence, as in A fiúnak látta a kutyáját (the boy-DAT see-PAST3SGOBJ the dog-POSS-ACC) "He saw the boy's dog", which results in a non-projective syntactic tree. Note that in the first case, the form of the possessor coincides with that of a nominative noun while in the second case, it coincides with a dative noun.

According to these facts, a Hungarian parser must rely much more on morphological analysis than e.g. an English one, since in Hungarian it is morphemes that mostly encode morphosyntactic information. One of the consequences of this is that Hungarian sentences are shorter in terms of word numbers than English ones. Based on the word counts of the Hungarian–English parallel corpus Hunglish (Varga et al., 2005), an English sentence contains 20.5% more words than its Hungarian equivalent. These extra words in English are most frequently prepositions, pronominal subjects or objects, whose parent and dependency label are relatively easy to identify (compared to other word classes). This train of thought indicates that the cross-lingual comparison of final parser scores should be conducted very carefully.

3 Related work

We decided to focus on dependency parsing in this study as it is a superior framework for non-configurational languages. It has gained interest in natural language processing recently because the representation itself does not require the words inside of constituents to be consecutive and it naturally represents discontinuous constructions, which are frequent in languages where grammatical relations are often signaled by morphology instead of word order (McDonald and Nivre, 2011). The two main efficient approaches to dependency parsing are the graph-based and the transition-based parsers. The graph-based models look for the highest scoring directed spanning tree in the complete graph whose nodes are the words of the sentence in question. They solve the machine learning problem of finding the optimal scoring function of subgraphs (Eisner, 1996; McDonald et al., 2005). The transition-based approaches parse a sentence in a single left-to-right pass over the words. The next transition in these systems is predicted by a classifier that is based on history-related features (Kudo and Matsumoto, 2002; Nivre et al., 2004).

Although the available treebanks for Hungarian are relatively big (82K sentences) and fully manually annotated, the studies on parsing Hungarian are rather limited. The Szeged (Constituency) Treebank (Csendes et al., 2005) consists of six domains – namely, short business news, newspaper, law, literature, compositions and informatics – and it is manually annotated for the possible alternatives of words' morphological analyses, the disambiguated analysis and constituency trees. We are aware of only two articles on phrase-structure parsers which were trained and evaluated on this corpus (Barta et al., 2005; Iván et al., 2007) and there are a few studies on hand-crafted parsers reporting results on small own corpora (Babarczy et al., 2005; Prószéky et al., 2004).

The Szeged Dependency Treebank (Vincze et al., 2010) was constructed by first automatically converting the phrase-structure trees into dependency trees; then each of them was manually investigated and corrected. We note that the dependency treebank contains more information than the constituency one, as linguistic phenomena (like discontinuous structures) were not annotated in the former corpus, but were added to the dependency treebank. To the best of our knowledge, no parser results have been published on this corpus. Both corpora are available at www.inf.u-szeged.hu/rgai/SzegedTreebank.

The multilingual track of the CoNLL-2007 Shared Task (Nivre et al., 2007) also addressed the task of dependency parsing of Hungarian. The Hungarian corpus used for the shared task consists of automatically converted dependency trees from the Szeged Constituency Treebank. Several issues of the automatic conversion tool were reconsidered before the manual annotation of the Szeged Dependency Treebank was launched and the annotation guidelines contained instructions related to linguistic phenomena which could not be converted from the constituency representation – for a detailed discussion, see Vincze et al. (2010). Hence the annotation schemata of the CoNLL-2007 Hungarian corpus and the Szeged Dependency Treebank are rather different and the final scores reported for the former are not directly comparable with our reported scores here (see Section 5).

4 The Szeged Dependency Treebank

We utilize the Szeged Dependency Treebank (Vincze et al., 2010) as the basis of our experiments for Hungarian dependency parsing. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks from six domains. The annotation employs 16 coarse-grained POS tags, 95 morphological feature values and 29 dependency labels. 19.6% of the sentences in the corpus contain non-projective edges and 1.8% of the edges are non-projective[2], which is almost 5 times more frequent than in English and is the same as the Czech non-projectivity level (Buchholz and Marsi, 2006). Here we discuss two annotation principles, along with our modifications in the dataset for this study, which strongly influence the parsers' accuracies.

[2] Using the transitive closure definition of Nivre and Nilsson (2005).

Named Entities (NEs) were treated as one token in the Szeged Dependency Treebank. Assuming a perfect phrase recogniser for them on whitespace-tokenised input is quite unrealistic, so we decided to split them into tokens for this study. The new tokens automatically got a proper noun morphological analysis with default morphological features, except for the last token – the head of the phrase – which inherited the morphological analysis of the original multiword unit (which can contain various grammatical information). This resulted in an N N N N POS sequence for Kovács és társa kft. "Smith and Co. Ltd.", which would be annotated as N C N N in the Penn Treebank. Moreover, we did not annotate any internal structure of Named Entities. We consider the last word of multiword named entities as the head for morphological reasons (the last word of multiword units gets inflected in Hungarian) and all the previous elements are attached to the succeeding word, i.e. the penultimate word is attached to the last word, the antepenultimate word to the penultimate one, etc. The reasons for these considerations are that we believe that there are no downstream applications which can exploit the information of the internal structures of Named Entities and we imagine a pipeline where a Named Entity Recogniser precedes the parsing step.

Empty copula: In the verbless clauses (predicative nouns or adjectives) the Szeged Dependency Treebank introduces virtual nodes (16,000 items in the corpus). This solution means that a similar tree structure is ascribed to the same sentence in the present third person singular and all the other tenses / persons. A further argument for the use of a virtual node is that the virtual node is always present at the syntactic level
57 corpus Malt MST Mate ULA LAS ULA LAS ULA LAS dev 88.3 (89.9) 85.7 (87.9) 86.9 (88.5) 80.9 (82.9) 89.7 (91.1) 86.8 (89.0) Hungarian test 88.7 (90.2) 86.1 (88.2) 87.5 (89.0) 81.6 (83.5) 90.1 (91.5) 87.2 (89.4) dev 87.8 (89.1) 84.5 (86.1) 89.4 (91.2) 86.1 (87.7) 91.6 (92.7) 88.5 (90.0) English test 88.8 (89.9) 86.2 (87.6) 90.7 (91.8) 87.7 (89.2) 92.6 (93.4) 90.3 (91.5) Table 1: Results achieved by the three parsers on the (full) Hungarian (Szeged Dependency Treebank) and English (CoNLL-2009) datasets. The scores in brackets are achieved with gold-standard POS tagging. since it is overt in all the other forms, tenses and Tools: We employed a finite state automata- moods of the verb. Still, the state-of-the-art de- based morphological analyser constructed from pendency parsers cannot handle virtual nodes. For the morphdb.hu lexical resource (Tr´on et al., this study, we followed the solution of the Prague 2006) and we used the MSD-style morphological Dependency Treebank (Hajiˇc et al., 2000) and vir- code system of the Szeged TreeBank (Alexin et tual nodes were removed from the gold standard al., 2003). The output of the morphological anal- annotation and all of their dependents were at- yser is a set of possible lemma–morphological tached to the head of the original virtual node and analysis pairs. This set of possible morphologi- they were given a dedicated edge label (Exd). cal analyses for a word form is then used as pos- sible alternatives – instead of open and closed tag Dataset splits: We formed training, develop- sets – in a standard sequential POS tagger. Here, ment and test sets from the corpus where each we applied the Conditional Random Fields-based set consists of texts from each of the domains. 
Stanford POS tagger (Toutanova et al., 2003) and We paid attention to the issue that a document carried out 5-fold-cross POS training/tagging in- should not be separated into different datasets be- side the subcorpora.4 For the English experiments cause it could result in a situation where a part of we used the predicted POS tags provided for the the test document was seen in the training dataset CoNLL-2009 shared task (Hajiˇc et al., 2009). (which is unrealistic because of unknown words, As the dependency parser we employed three style and frequently used grammatical structures). state-of-the-art data-driven parsers, a transition- As the fiction subcorpus consists of three books based parser (Malt) and two graph-based parsers and the law subcorpus consists of two rules, we (MST and Mate parsers). The Malt parser (Nivre took half of one of the documents for the test et al., 2004) is a transition-based system, which and development sets and used the other part(s) uses an arc-eager system along with support vec- for training there. This principle was followed at tor machines to learn the scoring function for tran- our cross-fold-validation experiments as well ex- sitions and which uses greedy, deterministic one- cept for the law subcorpus. We applied 3 folds for best search at parsing time. As one of the graph- cross-validation for the fiction subcorpus, other- based parsers, we employed the MST parser (Mc- wise we used 10 folds (splitting at documentary Donald et al., 2005) with a second-order feature boundaries would yield a training fold consisting decoder. It uses an approximate exhaustive search of just 3000 sentences).3 for unlabeled parsing, then a separate arc label 5 Experiments classifier is applied to label each arc. 
The Mate parser (Bohnet, 2010) is an efficient second or- We carried out experiments using three state-of- der dependency parser that models the interaction the-art parsers on the Szeged Dependency Tree- between siblings as well as grandchildren (Car- bank (Vincze et al., 2010) and on the English reras, 2007). Its decoder works on labeled edges, datasets of the CoNLL-2009 Shared Task (Hajiˇc i.e. it uses a single-step approach for obtaining et al., 2009). labeled dependency trees. Mate uses a rich and 3 Both the training/development/test and the cross- 4 The JAVA implementation of the morphological anal- validation splits are available at www.inf.u-szeged. yser and the slightly modified POS tagger along with trained hu/rgai/SzegedTreebank. models are available at http://www.inf.u-szeged. hu/rgai/magyarlanc. 58 corpus #sent. length CPOS DPOS ULA all ULA LAS all LAS newspaper 9189 21.6 97.2 96.5 88.0 (90.0) +0.8 84.7 (87.5) +1.0 short business 8616 23.6 98.0 97.7 93.8 (94.8) +0.3 91.9 (93.4) +0.4 fiction 9279 12.6 96.9 95.8 87.7 (89.4) -0.5 83.7 (86.2) -0.3 law 8347 27.3 98.3 98.1 90.6 (90.7) +0.2 88.9 (89.0) +0.2 computer 8653 21.9 96.4 95.8 91.3 (92.8) -1.2 88.9 (91.2) -1.6 composition 22248 13.7 96.7 95.6 92.7 (93.9) +0.3 88.9 (91.0) +0.3 Table 2: Domain results achieved by the Mate parser in cross-validation settings. The scores in brackets are achieved with gold-standard POS tagging. The ‘all’ columns contain the added value of extending the training sets with each of the five out-domain subcorpora. well-engineered feature set and it is enhanced by ment was to gain an insight into the performance a Hash Kernel, which leads to higher accuracy. of the parsers which can only access configura- tional information. 
These parsers achieved worse Evaluation metrics: We apply the Labeled At- results than the full parsers by 6.8 ULA, 20.3 LAS tachment Score (LAS) and Unlabeled Attachment and 2.9 ULA, 6.4 LAS on the development sets Score (ULA), taking into account punctuation as of Hungarian and English, respectively. As ex- well for evaluating dependency parsers and the pected, Hungarian suffers much more when the accuracy on the main POS tags (CPOS) and a parser has to learn from configurational informa- fine-grained morphological accuracy (DPOS) for tion only, especially when grammatical functions evaluating the POS tagger. In the latter, the analy- have to be predicted (LAS). Despite this, the re- sis is regarded as correct if the main POS tag and sults of Table 1 show that the parsers can practi- each of the morphological features of the token in cally eliminate this gap by learning from morpho- question are correct. logical features (and lexicalization). This means Results: Table 1 shows the results got by the that the data-driven parsers employing a very rich parsers on the whole Hungarian corpora and on feature set can learn a model which effectively the English datasets. The most important point captures the dependency structures using feature is that scores are not different from the English weights which are radically different from the scores (although they are not directly compara- ones used for English. ble). To understand the reasons for this, we man- Another cause of the relatively high scores is ually investigated the set of firing features with that the CPOS accuracy scores on Hungarian the highest weights in the Mate parser. Although and English are almost equal: 97.2 and 97.3, re- the assessment of individual feature contributions spectively. This also explains the small differ- to a particular decoder decision is not straightfor- ence between the results got by gold-standard and ward, we observed that features encoding config- predicted POS tags. 
Moreover, the parser can urational information (i.e. the direction or length also exploit the morphological features as input of an edge, the words or POS tag sequences/sets in Hungarian. between the governor and the dependent) were The Mate parser outperformed the other two frequently among the highest weighted features parsers on each of the four datasets. Comparing in English but were extremely rare in Hungarian. the two graph-based parsers Mate and MST, the For instance, one of the top weighted features for gap between them was twice as big in LAS than in a subject dependency in English was the ‘there is ULA in Hungarian, which demonstrates that the no word between the head and the dependent’ fea- one-step approach looking for the maximum ture while this never occurred among the top fea- labeled spanning tree is more suitable for Hun- tures in Hungarian. garian than the two-step arc labeling approach of As a control experiment, we trained the Mate MST. This probably holds for other morpholog- parser only having access to the gold-standard ically rich languages too as the decoder can ex- POS tag sequences of the sentences, i.e. we ploit information from the labels of decoded arcs. switched off the lexicalization and detailed mor- Based on these results, we decided to use only phological information. The goal of this experi- Mate for our further experiments. 59 Table 2 provides an insight into the effect of low, they form important features for the parser, domain differences on POS tagging and pars- thus we will focus on the more accurate handling ing scores. There is a noticeable difference be- of these cases in future work. tween the “newspaper” and the “short business Comparison to CoNLL-2007 results: The news” corpora. Although these domains seem to best performing participant of the CoNLL-2007 be close to each other at the first glance (both are Shared Task (Nivre et al., 2007) achieved an ULA news), they have different characteristics. 
On the of 83.6 and LAS of 80.3 (Hall et al., 2007) on one hand, short business news is a very narrow the Hungarian corpus. The difference between the domain consisting of 2-3 sentence long financial top performing English and Hungarian systems short reports. It frequently uses the same gram- were 8.14 ULA and 9.3 LAS. The results reported matical structures (like “Stock indexes rose X per- in 2007 were significantly lower and the gap be- cent at the Y Stock on Wednesday”) and the lexi- tween English and Hungarian is higher than our con is also limited. On the other hand, the news- current values. To locate the sources of difference paper subcorpus consists of full journal articles we carried out other experiments with Mate on covering various domains and it has a fancy jour- the CoNLL-2007 dataset using the gold-standard nalist style. POS tags (the shared task used gold-standard POS The effect of extending the training dataset with tags for evaluation). out-of-domain parses is not convincing. In spite First we trained and evaluated Mate on the of the ten times bigger training datasets, there original CoNLL-2007 datasets, where it achieved are two subcorpora where they just harmed the ULA 84.3 and LAS 80.0. Then we used the sen- parser, and the improvement on other subcorpora tences of the CoNLL-2007 datasets but with the is less than 1 percent. This demonstrates well the new, manual annotation. Here, Mate achieved domain-dependence of parsing. ULA 88.6 and LAS 85.5, which means that the The parser and the POS tagger react to do- modified annotation schema and the less erro- main difficulties in a similar way, according to neous/noisy annotation caused an improvement of the first four rows of Table 2. This observation ULA 4.3 and LAS 5.5. 
The annotation schema holds for the scores of the parsers working with changed a lot: coordination had to be corrected gold-standard POS tags, which suggests that do- manually since it is treated differently after con- main difficulties harm POS tagging and parsing as version, moreover, the internal structure of ad- well. Regarding the two last subcorpora, the com- jectival/participial phrases was not marked in the positions consist of very short and usually simple original constituency treebank, so it was also sentences and the training corpora are twice as big added manually (Vincze et al., 2010). The im- compared with other subcorpora. Both factors are provement in the labeled attachment score is prob- probably the reasons for the good parsing perfor- ably due to the reduction of the label set (from 49 mance. In the computer corpus, there are many to 29 labels), which step was justified by the fact English terms which are manually tagged with an that some morphosyntactic information was dou- “unknown” tag. They could not be accurately pre- bly coded in the case of nouns (e.g. h´azzal (house- dicted by the POS tagger but the parser could pre- INS) “with the/a house”) in the original CoNLL- dict their syntactic role. 2007 dataset – first, by their morphological case Table 2 also tells us that the difference between (Cas=ins) and second, by their dependency label CPOS and DPOS is usually less than 1 percent. (INS). This experimentally supports that the ambigu- Lastly, as the CoNLL-2007 sentences came ity among alternative morphological analyses from the newspaper subcorpus, we can compare is mostly present at the POS-level and the mor- these scores with the ULA 90.0 and LAS 87.5 phological features are efficiently identified by of Table 2. The ULA 1.5 and LAS 2.0 differ- our morphological analyser. 
The most frequent ences are the result of the bigger training corpus morphological features which cannot be disam- (9189 sentences on average compared to 6390 in biguated at the word level are related to suffixes the CoNLL-2007 dataset). with multiple functions or the word itself cannot be unambiguously segmented into morphemes. Although the number of such ambiguous cases is 60 Hungarian English label attachment label attachment virtual nodes 31.5% 39.5% multiword NEs 15.2% 17.6% conjunctions and negation – 11.2% PP-attachment – 15.9% noun attachment – 9.6% non-canonical word order 6.4% 6.5% more than 1 premodifier – 5.1% misplaced clause – 9.7% coordination 13.5% 16.5% coordination 8.5% 12.5% mislabeled adverb 16.3% – mislabeled adverb 40.1% – annotation errors 10.7% 6.8% annotation errors 9.7% 8.5% other 28.0% 11.3% other 20.1% 29.3% TOTAL 100% 100% TOTAL 100% 100% Table 3: The most frequent corpus-specific and general attachment and labeling error categories (based on a manual investigation of 200–200 erroneous sentences). 6 A Systematic Error Analysis Hungarian, respectively). In order to discover specialties and challenges of Virtual nodes: In Hungarian, the most common Hungarian dependency parsing, we conducted an source of parsing errors was virtual nodes. As error analysis of parsed texts from the newspaper there are quite a lot of verbless clauses in Hungar- domain both in English and Hungarian. 200 ran- ian (see Section 2 on sentences without copula), it domly selected erroneous sentences from the out- might be difficult to figure out the proper depen- put of Mate were investigated in both languages dency relations within the sentence, since the verb and we categorized the errors on the basis of the plays the central role in the sentence, cf. Tesni`ere linguistic phenomenon responsible for the errors (1959). 
– for instance, when an error occurred because of the incorrect identification of a multiword Named Entity containing a conjunction, we treated it as a Named Entity error instead of a conjunction error –, i.e. our goal was to reveal the real linguistic sources of errors rather than deducing them from automatically countable attachment/labeling statistics.

We used the parses based on gold-standard POS tagging for this analysis, as our goal was to identify the challenges of parsing independently of the challenges of POS tagging. The error categories are summarized in Table 3 along with their relative contribution to attachment and labeling errors. This table contains the categories with over 5% relative frequency.5

5 The full tables are available at www.inf.u-szeged.hu/rgai/SzegedTreebank.

The 200 sentences contained 429/319 and 353/330 attachment/labeling errors in Hungarian and English, respectively. In Hungarian, attachment errors outnumber label errors to a great extent, whereas in English their distribution is basically the same. This might be attributed to the higher level of non-projectivity (see Section 4) and to the more fine-grained label set of the English dataset (36 against 29 labels in English and Hungarian, respectively).

Our parser was not efficient in identifying the structure of such sentences, probably due to the lack of information available to data-driven parsers (each edge is labeled as Exd although such edges have features similar to those of ordinary edges). We also note that the output of the current system with Exd labels does not carry much information for downstream applications of parsing. The appropriate handling of virtual nodes is an important direction for future work.

Noun attachment: In Hungarian, the nominal arguments of infinitives and participles were frequently attached, erroneously, to the main verb. Take the following sentence: A Horn-kabinet idején jól bevált módszerhez próbálnak meg visszatérni (the Horn-government time-3SGPOSS-SUP well tried method-ALL try-3PL PREVERB return-INF) “They are trying to return to the well-tried method of the Horn government”. In this sentence, a Horn-kabinet idején “during the Horn government” is a modifier of the past participle bevált “well-tried”; however, it is attached to the main verb próbálnak “they are trying” by the parser. Moreover, módszerhez “to the method” is an argument of the infinitive visszatérni “to return”, but the parser links it to the main verb. In free word order languages, the order of the arguments of the infinitive and the main verb may get mixed, a phenomenon called scrambling (Ross, 1986). This is not a common source of error in English, as English arguments cannot scramble.

Article attachment: In Hungarian, if there is an article before a prenominal modifier, it can belong either to the head noun or to the modifier. In a szoba ajtaja (the room door-3SGPOSS) “the door of the room” the article belongs to the modifier, but when the prenominal modifier cannot have an article (e.g. a februárban induló projekt (the February-INE starting project) “the project starting in February”), it is attached to the head noun (i.e. to projekt “project”). It was not always clear to the parser which parent to select for the article. In contrast, these cases are not problematic in English, since the modifier typically follows the head and thus each article precedes its head noun.

Conjunctions or negation words – most typically the words is “too”, csak “only/just” and nem/sem “not” – were much more frequently attached to the wrong node in Hungarian than in English. In Hungarian, these words are ambiguous between being adverbs and conjunctions, and it is mostly their conjunctive uses that are problematic from the viewpoint of parsing. On the other hand, they play an important role in marking the information structure of the sentence: they are usually attached to the element in focus position, and if there is no focus, they are attached to the verb. However, sentences with or without focus can have similar word order while differing in their stress pattern. Dependency parsers obviously cannot recognize stress patterns, hence conjunctions and negation words are sometimes erroneously attached to the verb in Hungarian.

English sentences with non-canonical word order (e.g. questions) were often incorrectly parsed: for instance, the noun following the main verb was analyzed as the object in sentences like Replied a salesman: ‘Exactly.’, where it is the subject that follows the verb for stylistic reasons. In Hungarian, morphological information helps in such sentences, as it is not the position relative to the verb but the case suffix that determines the grammatical role of the noun.

In English, high or low PP-attachment was responsible for many parsing ambiguities: most typically, the prepositional complement which follows the head was attached to the verb instead of the noun, or vice versa. In contrast, Hungarian is a head-after-dependent language, which means that dependents most often occur before the head. Furthermore, there are no prepositions in Hungarian, and grammatical relations encoded by prepositions in English are conveyed by suffixes or postpositions. Thus, if there is a modifier before the nominal head, it requires the presence of a participle, as in: Felvette a kirakatban levő ruhát (take.on-PAST3SGOBJ the shop.window-INE being dress-ACC) “She put on the dress in the shop window”. The English sentence is ambiguous (either the event happens in the shop window or the dress was originally in the shop window) while the Hungarian one has only the latter meaning.6

6 However, there exists a head-before-dependent version of the sentence (Felvette a ruhát a kirakatban), whose preferred reading is “She was in the shop window while dressing up”, that is, the modifier belongs to the verb.

General dependency parsing difficulties: There were certain structures that led to typical label and/or attachment errors in both languages, the most frequent among them being coordination. However, it should be mentioned that such syntactic ambiguities are often problematic even for humans to disambiguate without contextual or background semantic knowledge.

In the case of label errors, the relation between the given node and its parent was labeled incorrectly. In both English and Hungarian, one of the most common errors of this type was mislabeled adverbs and adverbial phrases, e.g. locative adverbs labeled as ADV/MODE. However, the frequency of this error type is much higher in English than in Hungarian, which may be related to the fact that the English corpus has a much more balanced distribution of adverbial labels than the Hungarian one (where the categories MODE and TLOCY account for 90% of the occurrences). Assigning the most frequent label of the training dataset to each adverb yields an accuracy of 82% in English and 93% in Hungarian, which suggests a higher level of ambiguity for English adverbial phrases. For instance, the preposition by may introduce an adverbial modifier of manner (MNR) in by creating a bill and the agent in a passive sentence (LGS). Thus, labeling adverbs seems to be a more difficult task in English.7

Clauses were also often mislabeled in both languages, most typically when there was no overt conjunction between clauses. Another source of error was when more than one modifier occurred before a noun (5.1% and 4.2% of attachment errors in Hungarian and in English, respectively): in these cases, the first modifier could belong to the noun (a brown Japanese car) or to the second modifier (a brown haired girl).
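The most-frequent-label baseline for adverbs quoted above is easy to reproduce. The sketch below uses invented toy data purely for illustration; the 82%/93% figures come from the treebanks themselves, not from this code:

```python
from collections import Counter, defaultdict

def most_frequent_label_baseline(train, test):
    """Assign every test adverb the label it most often bears in training.

    train/test: lists of (adverb_form, gold_label) pairs -- toy stand-ins
    for treebank data. Returns labeling accuracy on the test pairs."""
    by_form = defaultdict(Counter)
    for form, label in train:
        by_form[form][label] += 1
    # Back-off for unseen forms: the globally most frequent label.
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    correct = 0
    for form, gold in test:
        pred = by_form[form].most_common(1)[0][0] if form in by_form else fallback
        correct += pred == gold
    return correct / len(test)

# Hypothetical examples: "by" is ambiguous between MNR and LGS, as in the text.
train = [("well", "MODE"), ("here", "TLOCY"), ("by", "MNR"), ("by", "MNR"), ("by", "LGS")]
test = [("well", "MODE"), ("here", "TLOCY"), ("by", "MNR"), ("by", "LGS")]
print(most_frequent_label_baseline(train, test))  # 0.75
```

The ambiguous uses of by are exactly what such a baseline cannot resolve, which is why its accuracy drops for English.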
Al- error was when more than one modifier occurred though the results with this comparison should be before a noun (5.1% and 4.2% of attachment er- taken with a pinch of salt – as sentence lengths rors in Hungarian and in English): in these cases, (and information encoded in single words) differ, the first modifier could belong to the noun (a domain differences and annotation schema diver- brown Japanese car) or to the second modifier (a gences are uncatchable – we conclude that parsing brown haired girl). Hungarian is just as hard a task as parsing English. Multiword Named Entities: As we mentioned We argued that this is due to the relatively good in Section 4, members of multiword Named Enti- POS tagging accuracy (which is a consequence ties had a proper noun POS-tag and an NE label of the low ambiguity of alternative morphological in our dataset. Hence when parsing is based on analyses of a sentence and the good coverage of gold standard POS-tags, their recognition is al- the morphological analyser) and the fact that data- most perfect while it is a frequent source or er- driven dependency parsers employ a rich feature rors in the CoNLL-2009 corpus. We investigated representation which enables them to learn differ- the parse of our 200 sentences with predicted POS ent kinds of feature weight profiles. tags at NEs and found that this introduces several We also discussed the domain differences errors (about 5% of both attachment and labeling among the subcorpora of the Szeged Dependency errors) in Hungarian. On the other hand, the re- Treebank and their effect on parsing results. Our sults are only slightly worse in English, i.e. 
iden- results support that there can be higher differences tifying the inner structure of NEs does not depend in parsing scores among domains in one language on whether the parser builds on gold standard or than among corpora from a similar domain but predicted POS-tags since function words like con- different languages (which again marks pitfalls of junctions or prepositions – which mark grammat- inter-language comparison of parsing scores). ical relations – are tagged in the same way in both Our systematic error analysis showed that han- cases. The relative frequency of this error type is dling the virtual nodes (mostly empty copula) is much higher in English even when the Hungar- a frequent source of errors. We identified several ian parser does not have access to the gold proper phenomena which are not typically listed as Hun- noun POS tags. The reason for this is simple: in garian syntax-specific features but are challeng- the Penn Treebank the correct internal structure of ing for the current data-driven parsers, however, the NEs has to be identified beyond the “phrase they are not problematic in English (like the at- boundaries” while in Hungarian their members tachment of conjunctions and negation words and just form a chain. the attachment problem of nouns and articles). We concluded – based on our quantitative analy- Annotation errors: We note that our analysis sis – that a further notable error reduction is only took into account only sentences which contained achievable if distinctive attention is paid to these at least one parsing error and we crawled only language-specific phenomena. the dependencies where the gold standard anno- We intend to investigate the problem of vir- tation and the output of the parser did not match. 
tual nodes in dependency parsing in more depth Hence, the frequency of annotation errors is prob- and to implement new feature templates for the ably higher than we found (about 1% of the en- Hungarian-specific challenges as future work. tire set of dependencies) during our investigation as there could be annotation errors in the “error- Acknowledgments free” sentences and also in the investigated sen- tences where the parser agrees with that error. This work was supported in part by the Deutsche 7 Forschungsgemeinschaft grant SFB 732 and the We would nevertheless like to point out that adverbial NIH grant (project codename MASZEKER) of labels have a highly semantic nature, i.e. it could be argued that it is not the syntactic parser that should identify them but the Hungarian government. a semantic processor. 63 References Nianwen Xue, and Yi Zhang. 2009. The CoNLL- 2009 Shared Task: Syntactic and Semantic Depen- Zolt´an Alexin, J´anos Csirik, Tibor Gyim´othy, K´aroly dencies in Multiple Languages. In Proceedings of Bibok, Csaba Hatvani, G´abor Pr´osz´eky, and L´aszl´o the Thirteenth Conference on Computational Nat- Tihanyi. 2003. Annotated Hungarian National Cor- ural Language Learning (CoNLL 2009): Shared pus. In Proceedings of the EACL, pages 53–56. Task, pages 1–18. Anna Babarczy, B´alint G´abor, G´abor Hamp, and Johan Hall, Jens Nilsson, Joakim Nivre, G¨ulsen Andr´as Rung. 2005. Hunpars: a rule-based sen- Eryigit, Be´ata Megyesi, Mattias Nilsson, and tence parser for Hungarian. In Proceedings of the Markus Saers. 2007. Single Malt or Blended? 6th International Symposium on Computational In- A Study in Multilingual Parser Optimization. In telligence. Proceedings of the CoNLL Shared Task Session of Csongor Barta, D´ora Csendes, J´anos Csirik, Andr´as EMNLP-CoNLL 2007, pages 933–939. H´ocza, Andr´as Kocsor, and Korn´el Kov´acs. 2005. Szil´ard Iv´an, R´obert Orm´andi, and Andr´as Kocsor. Learning syntactic tree patterns from a balanced 2007. 
Magyar mondatok SVM alap´u szintaxis Hungarian natural language database, the Szeged elemz´ese [SVM-based syntactic parsing of Hun- Treebank. In Proceedings of 2005 IEEE Interna- garian sentences]. In V. Magyar Sz´am´ıt´og´epes tional Conference on Natural Language Processing Nyelv´eszeti Konferencia, pages 281–283. and Knowledge Engineering, pages 225 – 231. Taku Kudo and Yuji Matsumoto. 2002. Japanese Bernd Bohnet. 2010. Top accuracy and fast depen- dependency analysis using cascaded chunking. In dency parsing is not a contradiction. In Proceedings Proceedings of the 6th Conference on Natural Lan- of the 23rd International Conference on Computa- guage Learning - Volume 20, COLING-02, pages tional Linguistics (Coling 2010), pages 89–97. 1–7. Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Ryan McDonald and Joakim Nivre. 2011. Analyzing Shared Task on Multilingual Dependency Parsing. and integrating dependency parsers. Computational In Proceedings of the Tenth Conference on Com- Linguistics, 37:197–230. putational Natural Language Learning (CoNLL-X), Ryan McDonald, Fernando Pereira, Kiril Ribarov, and pages 149–164. Jan Hajic. 2005. Non-Projective Dependency Pars- Xavier Carreras. 2007. Experiments with a higher- ing using Spanning Tree Algorithms. In Proceed- order projective dependency parser. In Proceed- ings of Human Language Technology Conference ings of the CoNLL Shared Task Session of EMNLP- and Conference on Empirical Methods in Natural CoNLL 2007, pages 957–961. Language Processing, pages 523–530. D´ora Csendes, J´anos Csirik, Tibor Gyim´othy, and Joakim Nivre and Jens Nilsson. 2005. Pseudo- Andr´as Kocsor. 2005. The Szeged Treebank. In Projective Dependency Parsing. In Proceedings TSD, pages 123–131. of the 43rd Annual Meeting of the Association Katalin E.´ Kiss. 2002. The Syntax of Hungarian. for Computational Linguistics (ACL’05), pages 99– Cambridge University Press, Cambridge. 106. Jason M. Eisner. 1996. 
Three new probabilistic mod- Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. els for dependency parsing: an exploration. In Pro- Memory-Based Dependency Parsing. In HLT- ceedings of the 16th conference on Computational NAACL 2004 Workshop: Eighth Conference linguistics - Volume 1, COLING ’96, pages 340– on Computational Natural Language Learning 345. (CoNLL-2004), pages 49–56. Rich´ard Farkas, D´aniel Szeredi, D´aniel Varga, and Joakim Nivre, Johan Hall, Sandra K¨ubler, Ryan Mc- Veronika Vincze. 2010. MSD-KR harmoniz´aci´o a Donald, Jens Nilsson, Sebastian Riedel, and Deniz Szeged Treebank 2.5-ben [Harmonizing MSD and Yuret. 2007. The CoNLL 2007 Shared Task KR codes in the Szeged Treebank 2.5]. In VII. Ma- on Dependency Parsing. In Proceedings of the gyar Sz´am´ıt´og´epes Nyelv´eszeti Konferencia, pages CoNLL Shared Task Session of EMNLP-CoNLL 349–353. 2007, pages 915–932. Jan Hajiˇc, Alena B¨ohmov´a, Eva Hajiˇcov´a, and Barbora G´abor Pr´osz´eky, L´aszl´o Tihanyi, and G´abor L. Ugray. Vidov´a-Hladk´a. 2000. The Prague Dependency 2004. Moose: A Robust High-Performance Parser Treebank: A Three-Level Annotation Scenario. In and Generator. In Proceedings of the 9th Workshop Anne Abeill´e, editor, Treebanks: Building and of the European Association for Machine Transla- Using Parsed Corpora, pages 103–127. Amster- tion. dam:Kluwer. John R. Ross. 1986. Infinite syntax! ABLEX, Nor- Jan Hajiˇc, Massimiliano Ciaramita, Richard Johans- wood, NJ. son, Daisuke Kawahara, Maria Ant`onia Mart´ı, Llu´ıs ´ ements de syntaxe struc- Lucien Tesni`ere. 1959. El´ M`arquez, Adam Meyers, Joakim Nivre, Sebastian ˇ ep´anek, Pavel Straˇna´ k, Mihai Surdeanu, turale. Klincksieck, Paris. Pad´o, Jan Stˇ 64 Kristina Toutanova, Dan Klein, Christopher D. Man- ning, and Yoram Singer. 2003. Feature-rich part- of-speech tagging with a cyclic dependency net- work. 
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Viktor Tr´on, P´eter Hal´acsy, P´eter Rebrus, Andr´as Rung, Eszter Simon, and P´eter Vajda. 2006. Mor- phdb.hu: Hungarian lexical database and morpho- logical grammar. In Proceedings of 5th Inter- national Conference on Language Resources and Evaluation (LREC ’06). D´aniel Varga, P´eter Hal´acsy, Andr´as Kornai, Viktor Nagy, L´aszl´o N´emeth, and Viktor Tr´on. 2005. Par- allel corpora for medium density languages. In Pro- ceedings of the RANLP, pages 590–596. Veronika Vincze, D´ora Szauter, Attila Alm´asi, Gy¨orgy M´ora, Zolt´an Alexin, and J´anos Csirik. 2010. Hun- garian Dependency Treebank. In Proceedings of the Seventh Conference on International Language Re- sources and Evaluation (LREC’10). 65 Dependency Parsing with Undirected Graphs Carlos G´omez-Rodr´ıguez Daniel Fern´andez-Gonz´alez Departamento de Computaci´on Departamento de Inform´atica Universidade da Coru˜na Universidade de Vigo Campus de Elvi˜na, 15071 Campus As Lagoas, 32004 A Coru˜na, Spain Ourense, Spain

[email protected] [email protected]

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 66–76, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

Abstract

We introduce a new approach to transition-based dependency parsing in which the parser does not directly construct a dependency structure, but rather an undirected graph, which is then converted into a directed dependency tree in a post-processing step. This alleviates error propagation, since undirected parsers do not need to observe the single-head constraint. Undirected parsers can be obtained by simplifying existing transition-based parsers satisfying certain conditions. We apply this approach to obtain undirected variants of the planar and 2-planar parsers and of Covington's non-projective parser. We perform experiments on several datasets from the CoNLL-X shared task, showing that these variants outperform the original directed algorithms in most of the cases.

[Figure 1: An example dependency structure over nodes 0 1 2 3 where transition-based parsers enforcing the single-head constraint will incur error propagation if they mistakenly build a dependency link 1 → 2 instead of 2 → 1 (dependency links are represented as arrows going from head to dependent).]

1 Introduction

Dependency parsing has proven to be very useful for natural language processing tasks. Data-driven dependency parsers such as those by Nivre et al. (2004), McDonald et al. (2005), Titov and Henderson (2007), Martins et al. (2009) or Huang and Sagae (2010) are accurate and efficient; they can be trained from annotated data without the need for a grammar, and they provide a simple representation of syntax that maps to predicate-argument structure in a straightforward way.

In particular, transition-based dependency parsers (Nivre, 2008) are a type of dependency parsing algorithms which use a model that scores transitions between parser states. Greedy deterministic search can be used to select the transition to be taken at each state, thus achieving linear or quadratic time complexity.

It has been shown by McDonald and Nivre (2007) that such parsers suffer from error propagation: an early erroneous choice can place the parser in an incorrect state that will in turn lead to more errors. For instance, suppose that a sentence whose correct analysis is the dependency graph in Figure 1 is analyzed by any bottom-up or left-to-right transition-based parser that outputs dependency trees, therefore obeying the single-head constraint (only one incoming arc is allowed per node). If the parser chooses an erroneous transition that leads it to build a dependency link from 1 to 2 instead of the correct link from 2 to 1, this will lead it to a state where the single-head constraint makes it illegal to create the link from 3 to 2. Therefore, a single erroneous choice will cause two attachment errors in the output tree.

With the goal of minimizing these sources of errors, we obtain novel undirected variants of several parsers, namely, of the planar and 2-planar parsers by Gómez-Rodríguez and Nivre (2010) and the non-projective list-based parser described by Nivre (2008), which is based on Covington's algorithm (Covington, 2001). These variants work by collapsing the LEFT-ARC and RIGHT-ARC transitions in the original parsers, which create right-to-left and left-to-right dependency links, into a single ARC transition creating an undirected link. This has the advantage that the single-head constraint need not be observed during the parsing process, since the directed notions of head and dependent are lost in undirected graphs. This gives the parser more freedom and can prevent situations where enforcing the constraint leads to error propagation, as in Figure 1.

On the other hand, these new algorithms have the disadvantage that their output is an undirected graph, which has to be post-processed to recover the direction of the dependency links and generate a valid dependency tree. Thus, some complexity is moved from the parsing process to this post-processing step, and each undirected parser will outperform the directed version only if the simplification of the parsing phase is able to avoid more errors than are generated by the post-processing. As will be seen in later sections, experimental results indicate that this is in fact the case.

The rest of this paper is organized as follows: Section 2 introduces some notation and concepts that we will use throughout the paper. In Section 3, we present the undirected versions of the parsers by Gómez-Rodríguez and Nivre (2010) and Nivre (2008), as well as some considerations about the feature models suitable to train them. In Section 4, we discuss post-processing techniques that can be used to recover dependency trees from undirected graphs. Section 5 presents an empirical study of the performance obtained by these parsers, and Section 6 contains a final discussion.

2 Preliminaries

2.1 Dependency Graphs

Let w = w1 . . . wn be an input string. A dependency graph for w is a directed graph G = (Vw, E), where Vw = {0, . . . , n} is the set of nodes and E ⊆ Vw × Vw is the set of directed arcs. Each node in Vw encodes the position of a token in w, and each arc in E encodes a dependency relation between two tokens. We write i → j to denote a directed arc (i, j), which will also be called a dependency link from i to j.1 We say that i is the head of j and, conversely, that j is a syntactic dependent of i.

1 In practice, dependency links are usually labeled, but to simplify the presentation we will ignore labels throughout most of the paper. However, all the results and algorithms presented can be applied to labeled dependency graphs and will be so applied in the experimental evaluation.

Given a dependency graph G = (Vw, E), we write i →* j ∈ E if there is a (possibly empty) directed path from i to j, and i ↔* j ∈ E if there is a (possibly empty) path between i and j in the undirected graph underlying G (omitting the references to E when clear from the context).

Most dependency-based representations of syntax do not allow arbitrary dependency graphs; instead, they are restricted to acyclic graphs that have at most one head per node. Dependency graphs satisfying these constraints are called dependency forests.

Definition 1 A dependency graph G is said to be a forest iff it satisfies:

1. Acyclicity constraint: if i →* j, then not j → i.

2. Single-head constraint: if j → i, then there is no k ≠ j such that k → i.

A node that has no head in a dependency forest is called a root. Some dependency frameworks add the additional constraint that dependency forests have only one root (or, equivalently, that they are connected). Such a forest is called a dependency tree. A dependency tree can be obtained from any dependency forest by linking all of its root nodes as dependents of a dummy root node, conventionally located in position 0 of the input.

2.2 Transition Systems

In the framework of Nivre (2008), transition-based parsers are described by means of a non-deterministic state machine called a transition system.

Definition 2 A transition system for dependency parsing is a tuple S = (C, T, cs, Ct), where

1. C is a set of possible parser configurations,

2. T is a finite set of transitions, which are partial functions t : C → C,

3. cs is a total initialization function mapping each input string to a unique initial configuration, and

4. Ct ⊆ C is a set of terminal configurations.
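Definition 2 maps naturally onto code. The following sketch is ours, purely for illustration (the names, the stack/buffer configuration shape, and the trivial always-SHIFT control function are assumptions, not part of the paper): it encodes configurations, one partial transition, the initialization function, and the terminal test.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """A parser configuration c in C: a stack, a buffer, and the arcs built so far."""
    stack: tuple
    buffer: tuple
    arcs: frozenset

def initial(n):
    # c_s: maps an n-word input to its unique initial configuration.
    return Config(stack=(), buffer=tuple(range(1, n + 1)), arcs=frozenset())

def is_terminal(config):
    # C_t: here, any configuration whose buffer is empty.
    return not config.buffer

def shift(config):
    # One transition t : C -> C; partial, i.e. undefined on an empty buffer.
    assert config.buffer, "SHIFT is not defined here"
    return Config(config.stack + (config.buffer[0],), config.buffer[1:], config.arcs)

def parse(n, choose):
    """Greedy deterministic search: `choose` picks one transition per state."""
    c = initial(n)
    while not is_terminal(c):
        c = choose(c)(c)
    return c

final = parse(3, lambda c: shift)   # trivial control function that always shifts
print(final.stack)  # (1, 2, 3)
```

The `choose` argument plays the role of the oracle function discussed in the paper: a classifier trained on treebank data that maps each configuration to a single transition.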
To obtain a deterministic parser from a non-deterministic transition system, an oracle is used to deterministically select a single transition at each configuration. An oracle for a transition system S = (C, T, cs, Ct) is a function o : C → T. Suitable oracles can be obtained in practice by training classifiers on treebank data (Nivre et al., 2004).

2.3 The Planar, 2-Planar and Covington Transition Systems

Our undirected dependency parsers are based on the planar and 2-planar transition systems by Gómez-Rodríguez and Nivre (2010) and on the version of the Covington (2001) non-projective parser defined by Nivre (2008). We now outline these directed parsers briefly; a more detailed description can be found in the above references.

2.3.1 Planar

The planar transition system by Gómez-Rodríguez and Nivre (2010) is a linear-time transition-based parser for planar dependency forests, i.e., forests whose dependency arcs do not cross when drawn above the words. The set of planar dependency structures is a very mild extension of that of projective structures (Kuhlmann and Nivre, 2006).

Configurations in this system are of the form c = ⟨Σ, B, A⟩, where Σ and B are disjoint lists of nodes from Vw (for some input w) and A is a set of dependency links over Vw. The list B, called the buffer, holds the input words that are still to be read. The list Σ, called the stack, is initially empty and is used to hold words that have dependency links pending to be created. The system is shown at the top in Figure 2, where the notation Σ|i is used for a stack with top i and tail Σ, and we invert the notation for the buffer for clarity (i.e., i|B is a buffer with top i and tail B).

The system reads the input sentence and creates links in a left-to-right order by executing its four transitions until it reaches a terminal configuration. A SHIFT transition moves the first (leftmost) node in the buffer to the top of the stack. The LEFT-ARC and RIGHT-ARC transitions create a leftward or rightward link, respectively, involving the first node in the buffer and the topmost node in the stack. Finally, the REDUCE transition is used to pop the top word from the stack when we have finished building arcs to or from it.

2.3.2 2-Planar

The 2-planar transition system by Gómez-Rodríguez and Nivre (2010) is an extension of the planar system that uses two stacks, allowing it to recognize 2-planar structures, a larger set of dependency structures that has been shown to cover the vast majority of non-projective structures in a number of treebanks (Gómez-Rodríguez and Nivre, 2010).

This transition system, shown in Figure 2, has configurations of the form c = ⟨Σ0, Σ1, B, A⟩, where we call Σ0 the active stack and Σ1 the inactive stack. Its SHIFT, LEFT-ARC, RIGHT-ARC and REDUCE transitions work similarly to those in the planar parser, except that SHIFT pushes the first word in the buffer onto both stacks, while the other three transitions only work with the top of the active stack, ignoring the inactive one. Finally, a SWITCH transition is added that makes the active stack inactive and vice versa.

2.3.3 Covington Non-Projective

Covington (2001) proposes several incremental parsing strategies for dependency representations, one of which can recover non-projective dependency graphs. Nivre (2008) implements a variant of this strategy as a transition system with configurations of the form c = ⟨λ1, λ2, B, A⟩, where λ1 and λ2 are lists containing partially processed words and B is the buffer list of unprocessed words.

The Covington non-projective transition system is shown at the bottom in Figure 2. At each configuration c = ⟨λ1, λ2, B, A⟩, the parser has to consider whether any dependency arc should be created involving the top of the buffer and the words in λ1. A LEFT-ARC transition adds a link from the first node j in the buffer to the node at the head of the list λ1, which is moved to the list λ2 to signify that we have finished considering it as a possible head or dependent of j. The RIGHT-ARC transition does the same manipulation, but creates the symmetric link. A NO-ARC transition removes the head of the list λ1 and inserts it at the head of the list λ2 without creating any arcs: this transition is to be used when there is no dependency relation between the top node in the buffer and the head of λ1, but we still may want to create an arc involving the top of the buffer and other nodes in λ1. Finally, if we do not want to create any such arcs at all, we can execute a SHIFT transition, which advances the parsing process by removing the first node in the buffer B and inserting it at the head of a list obtained by concatenating λ1 and λ2. This list becomes the new λ1, whereas λ2 is empty in the resulting configuration.

Note that the Covington parser has quadratic complexity with respect to input length, while the planar and 2-planar parsers run in linear time.
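As a concrete illustration of the directed systems just outlined, the four planar transitions and their single-head and acyclicity preconditions can be simulated directly. This is our own sketch under simplifying assumptions (unlabeled arcs, no oracle, and only the single-stack planar variant):

```python
def planar_step(name, stack, buffer, arcs):
    """Apply one planar transition; arcs are directed (head, dependent) pairs.
    Returns the new (stack, buffer, arcs), or None if a precondition fails."""
    def has_head(n):
        return any(d == n for _, d in arcs)
    def connected(i, j):
        # i <->* j in the undirected graph underlying the arc set
        adj, seen, todo = {}, {i}, [i]
        for h, d in arcs:
            adj.setdefault(h, []).append(d)
            adj.setdefault(d, []).append(h)
        while todo:
            for m in adj.get(todo.pop(), []):
                if m not in seen:
                    seen.add(m)
                    todo.append(m)
        return j in seen
    if name == "SHIFT" and buffer:
        return stack + [buffer[0]], buffer[1:], arcs
    if name == "REDUCE" and stack:
        return stack[:-1], buffer, arcs
    if name in ("LEFT-ARC", "RIGHT-ARC") and stack and buffer:
        i, j = stack[-1], buffer[0]
        head, dep = (j, i) if name == "LEFT-ARC" else (i, j)
        if has_head(dep) or connected(i, j):   # single-head, acyclicity
            return None
        return stack, buffer, arcs | {(head, dep)}
    return None

# Derive the structure 1 <- 2 -> 3 for a three-word sentence:
state = ([], [1, 2, 3], set())
for t in ["SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]:
    state = planar_step(t, *state)
print(sorted(state[2]))  # [(2, 1), (2, 3)]
```

Attempting a further LEFT-ARC or RIGHT-ARC between 2 and 3 would now be rejected, exactly the situation that produces the error propagation of Figure 1 when an early arc choice is wrong.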
undirected graphs, this kind of features cannot be We can take advantage of this property to define used to train undirected parsers. Instead, we use undirected versions of these transition systems, by features based on undirected relations between transforming them as follows: nodes. We found that the following kinds of fea- tures worked well in practice as a replacement for • Configurations are changed so that the arc set features depending on arc direction: A is a set of undirected arcs, instead of di- rected arcs. • Information about the ith node linked to a given node (topmost stack node, topmost • The L EFT-A RC and R IGHT-A RC transitions buffer node, etc.) on the left or on the right, in each parser are collapsed into a single A RC and about the associated undirected arc, typi- transition that creates an undirected arc. cally for i = 1, 2, 3, • The preconditions of transitions that guaran- • Information about whether two nodes are tee the single-head constraint are removed, linked or not in the undirected graph, and since the notions of head and dependent are about the label of the arc between them, lost in undirected graphs. • Information about the first left and right By performing these transformations and leaving “undirected siblings” of a given node, i.e., the the systems otherwise unchanged, we obtain the first node q located to the left of the given node undirected variants of the planar, 2-planar and p such that p and q are linked to some common Covington algorithms that are shown in Figure 3. node r located to the right of both, and vice Note that the transformation can be applied versa. Note that this notion of undirected sib- to any transition system having L EFT-A RC and lings does not correspond exclusively to sib- R IGHT-A RC transitions that are equal except for lings in the directed graph, since it can also the direction of the created link, and thus col- capture other second-order interactions, such lapsable into one. 
... as grandparents.

The above three transition systems fulfill this property, but not every transition system does. For example, the well-known arc-eager parser of Nivre (2003) pops a node from the stack when creating left arcs, and pushes a node to the stack when creating right arcs, so the transformation cannot be applied to it.[2]

[2] One might think that the arc-eager algorithm could still be transformed by converting each of its arc transitions into an undirected transition, without collapsing them into one. However, this would result in a parser that violates the acyclicity constraint, since the algorithm is designed in such a way that acyclicity is only guaranteed if the single-head constraint is kept. It is easy to see that this problem cannot happen in parsers where LEFT-ARC and RIGHT-ARC transitions have the same effect: in these, if a directed graph is not parsable in the original algorithm, its underlying undirected graph cannot be parsable in the undirected variant.

[3] These example features are taken from the default model for the planar parser in version 1.5 of MaltParser (Nivre et al., 2006).

Planar — initial/terminal configurations: cs(w1...wn) = ⟨[], [1...n], ∅⟩; Cf = {⟨Σ, [], A⟩ ∈ C}
Transitions:
  SHIFT:     ⟨Σ, i|B, A⟩ ⇒ ⟨Σ|i, B, A⟩
  REDUCE:    ⟨Σ|i, B, A⟩ ⇒ ⟨Σ, B, A⟩
  LEFT-ARC:  ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i, j|B, A ∪ {(j, i)}⟩
             only if ∄k: (k, i) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)
  RIGHT-ARC: ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i, j|B, A ∪ {(i, j)}⟩
             only if ∄k: (k, j) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)

2-Planar — initial/terminal configurations: cs(w1...wn) = ⟨[], [], [1...n], ∅⟩; Cf = {⟨Σ0, Σ1, [], A⟩ ∈ C}
Transitions:
  SHIFT:     ⟨Σ0, Σ1, i|B, A⟩ ⇒ ⟨Σ0|i, Σ1|i, B, A⟩
  REDUCE:    ⟨Σ0|i, Σ1, B, A⟩ ⇒ ⟨Σ0, Σ1, B, A⟩
  LEFT-ARC:  ⟨Σ0|i, Σ1, j|B, A⟩ ⇒ ⟨Σ0|i, Σ1, j|B, A ∪ {(j, i)}⟩
             only if ∄k: (k, i) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)
  RIGHT-ARC: ⟨Σ0|i, Σ1, j|B, A⟩ ⇒ ⟨Σ0|i, Σ1, j|B, A ∪ {(i, j)}⟩
             only if ∄k: (k, j) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)
  SWITCH:    ⟨Σ0, Σ1, B, A⟩ ⇒ ⟨Σ1, Σ0, B, A⟩

Covington — initial/terminal configurations: cs(w1...wn) = ⟨[], [], [1...n], ∅⟩; Cf = {⟨λ1, λ2, [], A⟩ ∈ C}
Transitions:
  SHIFT:     ⟨λ1, λ2, i|B, A⟩ ⇒ ⟨λ1·λ2|i, [], B, A⟩
  NO-ARC:    ⟨λ1|i, λ2, B, A⟩ ⇒ ⟨λ1, i|λ2, B, A⟩
  LEFT-ARC:  ⟨λ1|i, λ2, j|B, A⟩ ⇒ ⟨λ1, i|λ2, j|B, A ∪ {(j, i)}⟩
             only if ∄k: (k, i) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)
  RIGHT-ARC: ⟨λ1|i, λ2, j|B, A⟩ ⇒ ⟨λ1, i|λ2, j|B, A ∪ {(i, j)}⟩
             only if ∄k: (k, j) ∈ A (single-head) and i ↔* j ∉ A (acyclicity)

Figure 2: Transition systems for planar, 2-planar and Covington non-projective dependency parsing.

Undirected Planar — initial/terminal configurations: cs(w1...wn) = ⟨[], [1...n], ∅⟩; Cf = {⟨Σ, [], A⟩ ∈ C}
Transitions:
  SHIFT:  ⟨Σ, i|B, A⟩ ⇒ ⟨Σ|i, B, A⟩
  REDUCE: ⟨Σ|i, B, A⟩ ⇒ ⟨Σ, B, A⟩
  ARC:    ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i, j|B, A ∪ {{i, j}}⟩ only if i ↔* j ∉ A (acyclicity)

Undirected 2-Planar — initial/terminal configurations: cs(w1...wn) = ⟨[], [], [1...n], ∅⟩; Cf = {⟨Σ0, Σ1, [], A⟩ ∈ C}
Transitions:
  SHIFT:  ⟨Σ0, Σ1, i|B, A⟩ ⇒ ⟨Σ0|i, Σ1|i, B, A⟩
  REDUCE: ⟨Σ0|i, Σ1, B, A⟩ ⇒ ⟨Σ0, Σ1, B, A⟩
  ARC:    ⟨Σ0|i, Σ1, j|B, A⟩ ⇒ ⟨Σ0|i, Σ1, j|B, A ∪ {{i, j}}⟩ only if i ↔* j ∉ A (acyclicity)
  SWITCH: ⟨Σ0, Σ1, B, A⟩ ⇒ ⟨Σ1, Σ0, B, A⟩

Undirected Covington — initial/terminal configurations: cs(w1...wn) = ⟨[], [], [1...n], ∅⟩; Cf = {⟨λ1, λ2, [], A⟩ ∈ C}
Transitions:
  SHIFT:  ⟨λ1, λ2, i|B, A⟩ ⇒ ⟨λ1·λ2|i, [], B, A⟩
  NO-ARC: ⟨λ1|i, λ2, B, A⟩ ⇒ ⟨λ1, i|λ2, B, A⟩
  ARC:    ⟨λ1|i, λ2, j|B, A⟩ ⇒ ⟨λ1, i|λ2, j|B, A ∪ {{i, j}}⟩ only if i ↔* j ∉ A (acyclicity)

Figure 3: Transition systems for undirected planar, 2-planar and Covington non-projective dependency parsing.

4 Reconstructing the dependency forest

The modified transition systems presented in the previous section generate undirected graphs. To obtain complete dependency parsers that are able to produce directed dependency forests, we need a reconstruction step that assigns a direction to the arcs in such a way that the single-head constraint is obeyed. This reconstruction step can be implemented by building a directed graph with weighted arcs corresponding to both possible directions of each undirected edge, and then finding an optimum branching to reduce it to a directed tree. Different criteria for assigning weights to arcs provide different variants of the reconstruction technique.

To describe these variants, we first introduce some preliminary definitions. Let U = (Vw, E) be an undirected graph produced by an undirected parser for some string w. We define the following sets of arcs:

  A1(U) = {(i, j) | j ≠ 0 ∧ {i, j} ∈ E},
  A2(U) = {(0, i) | i ∈ Vw}.

Note that A1(U) represents the set of arcs obtained by assigning an orientation to an edge in U, except arcs whose dependent is the dummy root, which are disallowed. On the other hand, A2(U) contains all the possible arcs originating from the dummy root node, regardless of whether their underlying undirected edges are in U or not; this is so that reconstructions are allowed to link unattached tokens to the dummy root.
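The undirected transition systems of Figure 3 can be sketched in code. The following is an illustrative reimplementation, not the authors' parser: it models the undirected planar system's SHIFT, REDUCE and single ARC transition, enforcing the acyclicity condition (i ↔* j ∉ A) with a union-find structure over the edge set.

```python
# Illustrative sketch (not the authors' implementation) of the undirected
# planar transition system of Figure 3: SHIFT, REDUCE and a single ARC
# transition that adds an undirected edge {i, j} between the stack top and
# the buffer front, permitted only if i and j are not already connected.

class UndirectedPlanarState:
    def __init__(self, n):
        self.stack = []                       # Sigma
        self.buffer = list(range(1, n + 1))   # B = [1 ... n]
        self.edges = set()                    # A, a set of frozensets {i, j}
        self.parent = {i: i for i in range(1, n + 1)}  # union-find forest

    def _find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def connected(self, i, j):
        return self._find(i) == self._find(j)

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def reduce(self):
        self.stack.pop()

    def arc(self):
        i, j = self.stack[-1], self.buffer[0]
        assert not self.connected(i, j), "would violate acyclicity"
        self.edges.add(frozenset((i, j)))
        self.parent[self._find(i)] = self._find(j)
```

For a three-word sentence, the sequence SHIFT, ARC, SHIFT, ARC builds the undirected edges {1, 2} and {2, 3}; a second ARC between 1 and 3 would then be blocked by the acyclicity check.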
The reconstruction process consists of finding a minimum branching (i.e., a directed minimum spanning tree) for a weighted directed graph obtained by assigning a cost c(i, j) to each arc (i, j) of the following directed graph:

  D(U) = (Vw, A(U)), where A(U) = A1(U) ∪ A2(U).

That is, we will find a dependency tree T = (Vw, AT ⊆ A(U)) such that the sum of the costs of the arcs in AT is minimal. In general, such a minimum branching can be calculated with the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). Since the graph D(U) has O(n) nodes and O(n) arcs for a string of length n, this can be done in O(n log n) if implemented as described by Tarjan (1977).

However, applying these generic techniques is not necessary in this case: since our graph U is acyclic, the problem of reconstructing the forest can be reduced to choosing a root word for each connected component in the graph, linking it as a dependent of the dummy root, and directing the other arcs in the component in the (unique) way that makes them point away from the root.

It remains to specify how to assign the costs c(i, j) to the arcs of D(U): different criteria for assigning scores will lead to different reconstructions.

4.1 Naive reconstruction

A first, very simple reconstruction technique can be obtained by assigning costs to the arcs in A(U) as follows:

  c(i, j) = 1 if (i, j) ∈ A1(U),
  c(i, j) = 2 if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U).

This approach gives the same cost to all arcs obtained from the undirected graph U, while also allowing (at a higher cost) any node to be attached to the dummy root. To obtain satisfactory results with this technique, we must train our parser to explicitly build undirected arcs from the dummy root node to the root word(s) of each sentence using arc transitions (note that this implies that we need to represent forests as trees, in the manner described at the end of Section 2.1). Under this assumption, it is easy to see that we can obtain the correct directed tree T for a sentence if the parser provides its underlying undirected tree U: the tree is obtained in O(n) as the unique orientation of U that makes each of its edges point away from the dummy root.

This approach to reconstruction has the advantage of being very simple and of not adding any complications to the parsing process, while guaranteeing that the correct directed tree will be recovered if the undirected tree for a sentence is generated correctly. However, it is not very robust, since the direction of all the arcs in the output depends on which node is chosen as sentence head and linked to the dummy root. Therefore, a parsing error affecting the undirected edge involving the dummy root may result in many dependency links being erroneous.
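The O(n) orientation step described above can be sketched as a breadth-first traversal from the dummy root; this is an assumed interface, not the authors' code, and it presumes that every component has been linked to node 0 as the text requires:

```python
# Sketch of the O(n) orientation of an undirected tree/forest away from the
# dummy root 0: every edge on the path from the root is directed head -> dep.
from collections import defaultdict, deque

def orient_forest(n, edges):
    """edges: set of frozensets {i, j} over nodes 0..n (0 = dummy root).
    Returns the set of directed arcs (head, dependent)."""
    adj = defaultdict(list)
    for e in edges:
        i, j = tuple(e)
        adj[i].append(j)
        adj[j].append(i)
    arcs, visited, queue = set(), {0}, deque([0])
    while queue:
        h = queue.popleft()
        for d in adj[h]:
            if d not in visited:      # first time we reach d: h is its head
                visited.add(d)
                arcs.add((h, d))
                queue.append(d)
    return arcs
```

For the undirected edges {0,2}, {1,2}, {2,3}, the unique orientation away from the root is (0,2), (2,1), (2,3).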
4.2 Label-based reconstruction

To achieve a more robust reconstruction, we use labels to encode a preferred direction for dependency arcs. To do so, for each pre-existing label X in the training set, we create two labels Xl and Xr. The parser is then trained on a modified version of the training set where leftward links originally labelled X are labelled Xl, and rightward links originally labelled X are labelled Xr. Thus, the output of the parser on a new sentence will be an undirected graph where each edge carries a label annotation indicating whether the reconstruction process should prefer to link the pair of nodes with a leftward or a rightward arc. We can then assign costs in our minimum branching algorithm so that it returns a tree agreeing with as many of these annotations as possible.

To do this, we call A1+(U) ⊆ A1(U) the set of arcs in A1(U) that agree with the annotations, i.e., arcs (i, j) ∈ A1(U) where either i < j and {i, j} is labelled Xr in U, or i > j and {i, j} is labelled Xl in U. We call A1-(U) the set of arcs in A1(U) that disagree with the annotations, i.e., A1-(U) = A1(U) \ A1+(U). And we assign costs as follows:

  c(i, j) = 1  if (i, j) ∈ A1+(U),
  c(i, j) = 2  if (i, j) ∈ A1-(U),
  c(i, j) = 2n if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U),

where n is the length of the string. With these costs, the minimum branching algorithm will find a tree which agrees with as many annotations as possible.

Figure 4: a) An undirected graph obtained by the parser with the label-based transformation; b) and c) the dependency graph obtained by each of the variants of the label-based reconstruction (note how the second variant moves an arc from the root).
Additional arcs from the root not corresponding to any edge in the output of the parser (i.e., arcs in A2(U) but not in A1(U)) will be used only if strictly necessary to guarantee connectedness; this is implemented by the high cost assigned to these arcs.

While this may be the simplest cost assignment to implement label-based reconstruction, we have found that better empirical results are obtained if we give the algorithm more freedom to create new arcs from the root, as follows:

  c(i, j) = 1  if (i, j) ∈ A1+(U) ∧ (i, j) ∉ A2(U),
  c(i, j) = 2  if (i, j) ∈ A1-(U) ∧ (i, j) ∉ A2(U),
  c(i, j) = 2n if (i, j) ∈ A2(U).

While the cost of arcs from the dummy root is still 2n, this now holds even for arcs that are in the output of the undirected parser, which previously had cost 1. Informally, this means that with this configuration the postprocessor does not "trust" the links from the dummy root created by the parser, and may choose to change them if doing so yields a better agreement with the label annotations (see Figure 4 for an example of the difference between both cost assignments). We believe that the better accuracy obtained with this criterion stems from the fact that it is biased towards changing links from the root, which tend to be more problematic for transition-based parsers, while respecting the parser output for links located deeper in the dependency structure, for which transition-based parsers tend to be more accurate (McDonald and Nivre, 2007).

Note that both variants of label-based reconstruction have the property that, if the undirected parser produces the correct edges and labels for a given sentence, then the obtained directed tree is guaranteed to be correct (as it will simply be the tree obtained by decoding the label annotations).
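The second, more robust cost assignment above can be written directly as a function; the set-based representation of A1+, A1- and A2 is an assumption made for illustration:

```python
# Sketch of the second label-based cost assignment: arcs agreeing with the
# direction annotation cost 1, disagreeing arcs cost 2, and every arc from
# the dummy root costs 2n, whether the parser proposed it or not.

def label_based_cost(i, j, n, a1_plus, a1_minus, a2):
    """a1_plus / a1_minus / a2: sets of directed arcs (head, dependent)."""
    if (i, j) in a2:            # arcs from the dummy root: never "trusted"
        return 2 * n
    if (i, j) in a1_plus:       # agrees with the label annotation
        return 1
    if (i, j) in a1_minus:      # disagrees with the annotation
        return 2
    raise ValueError("arc not in A(U)")
```

With n = 5, an annotated rightward edge yields cost 1 for the agreeing arc, 2 for the disagreeing one, and 10 for any root attachment.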
5 Experiments

In this section, we evaluate the performance of the undirected planar, 2-planar and Covington parsers on eight datasets from the CoNLL-X shared task (Buchholz and Marsi, 2006). Tables 1, 2 and 3 compare the accuracy of the undirected versions with naive and label-based reconstruction to that of the directed versions of the planar, 2-planar and Covington parsers, respectively. In addition, we provide a comparison to well-known state-of-the-art projective and non-projective parsers: the planar parsers are compared to the arc-eager projective parser of Nivre (2003), which is also restricted to planar structures; and the 2-planar parsers are compared with the arc-eager parser with pseudo-projective transformation of Nivre and Nilsson (2005), which is capable of handling non-planar dependencies.

We use SVM classifiers from the LIBSVM package (Chang and Lin, 2001) for all the languages except Chinese, Czech and German. For these, we use the LIBLINEAR package (Fan et al., 2008) for classification, which reduces training time on these larger datasets, together with feature models adapted to this system which, in the case of German, yield higher accuracy than published results using LIBSVM.
The LIBSVM feature models for the arc-eager projective and pseudo-projective parsers are the same as those used by these parsers in the CoNLL-X shared task, where the pseudo-projective version of MaltParser was one of the two top performing systems (Buchholz and Marsi, 2006). For the 2-planar parser, we took the feature models from Gómez-Rodríguez and Nivre (2010) for the languages included in that paper. For all the algorithms and datasets, the feature models used for the undirected parsers were adapted from those of the directed parsers as described in Section 3.1.[4]

The results show that the use of undirected parsing with label-based reconstruction clearly improves performance in the vast majority of the datasets for the planar and Covington algorithms, where in many cases it also improves upon the corresponding projective and non-projective state-of-the-art parsers provided for comparison.
In the case of the 2-planar parser the results are less conclusive, with improvements over the directed versions in five out of the eight languages. The improvements in LAS obtained with label-based reconstruction over directed parsing are statistically significant at the .05 level[5] for Danish, German and Portuguese in the case of the planar parser, and for Czech, Danish and Turkish in the case of Covington's parser. No statistically significant decrease in accuracy was detected in any of the algorithm/dataset combinations.

As expected, the good results obtained by the undirected parsers with label-based reconstruction contrast with those obtained by the variants with root-based reconstruction, which performed worse in all the experiments.

6 Discussion

We have presented novel variants of the planar and 2-planar transition-based parsers of Gómez-Rodríguez and Nivre (2010) and of Covington's non-projective parser (Covington, 2001; Nivre, 2008) which ignore the direction of dependency links, together with reconstruction techniques that can be used to recover the direction of the arcs thus produced. The results obtained show that this idea of undirected parsing, together with the label-based reconstruction technique of Section 4.2, improves parsing accuracy on most of the tested dataset/algorithm combinations, and that it can outperform state-of-the-art transition-based parsers.

The accuracy improvements achieved by relaxing the single-head constraint to mitigate error propagation were able to overcome the errors generated in the reconstruction phase, which were few: we observed empirically that the differences between the undirected LAS obtained from the undirected graph before the reconstruction and the final directed LAS are typically below 0.20%. This is true both for the naive and label-based transformations, indicating that both techniques are able to recover arc directions accurately, and that the accuracy differences between them come mainly from differences in training (e.g., having tentative arc direction as part of the feature information in the label-based reconstruction but not in the naive one) rather than from the reconstruction methods themselves.

[4] All the experimental settings and feature models used are included in the supplementary material and are also available at http://www.grupolys.org/~cgomezr/exp/.
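The randomized paired comparator used for the significance testing above can be sketched as follows; this is a generic illustration of the shuffling technique, not the specific tool used by the authors:

```python
# Sketch of a randomized paired significance test: per-sentence scores of
# two parsers are randomly swapped within each pair; the p-value estimates
# how often a difference at least as large as the observed one arises by
# chance.
import random

def randomized_test(scores_a, scores_b, trials=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap the pair with probability 1/2
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # smoothed p-value estimate
```

Identical score lists give a p-value of 1.0, while a consistent per-sentence advantage drives the p-value toward 0.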
To our knowledge, the idea of Rodr´ıguez and Nivre (2010) and of Covington’s using an undirected graph as an intermediate step non-projective parser (Covington, 2001; Nivre, towards obtaining a dependency structure has not 2008) which ignore the direction of dependency been explored before. links, and reconstruction techniques that can be used to recover the direction of the arcs thus pro- Acknowledgments duced. The results obtained show that this idea This research has been partially funded by the Spanish of undirected parsing, together with the label- Ministry of Economy and Competitiveness and FEDER (projects TIN2010-18552-C03-01 and TIN2010-18552- 4 All the experimental settings and feature models used C03-02), Ministry of Education (FPU Grant Program) and are included in the supplementary material and also available Xunta de Galicia (Rede Galega de Recursos Ling¨u´ısticos at http://www.grupolys.org/˜cgomezr/exp/. para unha Soc. do Co˜nec.). The experiments were conducted 5 Statistical significance was assessed using Dan Bikel’s with the help of computing resources provided by the Su- randomized comparator: http://www.cis.upenn. percomputing Center of Galicia (CESGA). We thank Joakim edu/˜dbikel/software.html Nivre for helpful input in the early stages of this work. 73 Planar UPlanarN UPlanarL MaltP Lang. 
Lang.   | Planar LAS / UAS              | UPlanarN LAS / UAS             | UPlanarL LAS / UAS               | MaltP LAS / UAS
Arabic  | 66.93 (67.34) / 77.56 (77.22) | 65.91 (66.33) / 77.03 (76.75)  | 66.75 (67.19) / 77.45 (77.22)    | 66.43 (66.74) / 77.19 (76.83)
Chinese | 84.23 (84.20) / 88.37 (88.33) | 83.14 (83.10) / 87.00 (86.95)  | 84.51* (84.50*) / 88.37 (88.35*) | 86.42 (86.39) / 90.06 (90.02)
Czech   | 77.24 (77.70) / 83.46 (83.24) | 75.08 (75.60) / 81.14 (81.14)  | 77.60* (77.93*) / 83.56* (83.41*)| 77.24 (77.57) / 83.40 (83.19)
Danish  | 83.31 (82.60) / 88.02 (86.64) | 82.65 (82.45) / 87.58 (86.67*) | 83.87* (83.83*) / 88.94* (88.17*)| 83.31 (82.64) / 88.30 (86.91)
German  | 84.66 (83.60) / 87.02 (85.67) | 83.33 (82.77) / 85.78 (84.93)  | 86.32* (85.67*) / 88.62* (87.69*)| 86.12 (85.48) / 88.52 (87.58)
Portug. | 86.22 (83.82) / 89.80 (86.88) | 85.89 (83.82) / 89.68 (87.06*) | 86.52* (84.83*) / 90.28* (88.03*)| 86.60 (84.66) / 90.20 (87.73)
Swedish | 83.01 (82.44) / 88.53 (87.36) | 81.20 (81.10) / 86.50 (85.86)  | 82.95 (82.66*) / 88.29 (87.45*)  | 82.89 (82.44) / 88.61 (87.55)
Turkish | 62.70 (71.27) / 73.67 (78.57) | 59.83 (68.31) / 70.15 (75.17)  | 63.27* (71.63*) / 73.93* (78.72*)| 62.58 (70.96) / 73.09 (77.95)

Table 1: Parsing accuracy of the undirected planar parser with naive (UPlanarN) and label-based (UPlanarL) postprocessing in comparison to the directed planar (Planar) and the MaltParser arc-eager projective (MaltP) algorithms, on eight datasets from the CoNLL-X shared task (Buchholz and Marsi, 2006): Arabic (Hajič et al., 2004), Chinese (Chen et al., 2003), Czech (Hajič et al., 2006), Danish (Kromann, 2003), German (Brants et al., 2002), Portuguese (Afonso et al., 2002), Swedish (Nilsson et al., 2005) and Turkish (Oflazer et al., 2003; Atalay et al., 2003). We show labelled (LAS) and unlabelled (UAS) attachment scores excluding and including punctuation tokens in the scoring (the latter in brackets). Best results for each language are shown in boldface in the original, and results where the undirected parser outperforms the directed version are marked with an asterisk.
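The LAS and UAS metrics reported in these tables can be computed as in the following sketch; the official shared-task evaluator additionally handles details such as punctuation filtering, which are omitted here:

```python
# Illustrative computation of labelled (LAS) and unlabelled (UAS) attachment
# scores: the fraction of tokens whose predicted head (and, for LAS, also
# the dependency label) matches the gold standard.

def attachment_scores(gold, pred):
    """gold, pred: lists of (head, label) pairs, one per token."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return las, uas
```

A token with the right head but the wrong label counts towards UAS but not LAS, which is why LAS is never higher than UAS.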
Lang.   | 2Planar LAS / UAS             | U2PlanarN LAS / UAS            | U2PlanarL LAS / UAS              | MaltPP LAS / UAS
Arabic  | 66.73 (67.19) / 77.33 (77.11) | 66.37 (66.93) / 77.15 (77.09)  | 66.13 (66.52) / 76.97 (76.70)    | 65.93 (66.02) / 76.79 (76.14)
Chinese | 84.35 (84.32) / 88.31 (88.27) | 83.02 (82.98) / 86.86 (86.81)  | 84.45* (84.42*) / 88.29 (88.25)  | 86.42 (86.39) / 90.06 (90.02)
Czech   | 77.72 (77.91) / 83.76 (83.32) | 74.44 (75.19) / 80.68 (80.80)  | 78.00* (78.59*) / 84.22* (84.21*)| 78.86 (78.47) / 84.54 (83.89)
Danish  | 83.81 (83.61) / 88.50 (87.63) | 82.00 (81.63) / 86.87 (85.80)  | 83.75 (83.65*) / 88.62* (87.82*) | 83.67 (83.54) / 88.52 (87.70)
German  | 86.28 (85.76) / 88.68 (87.86) | 82.93 (82.53) / 85.52 (84.81)  | 86.52* (85.99*) / 88.72* (87.92*)| 86.94 (86.62) / 89.30 (88.69)
Portug. | 87.04 (84.92) / 90.82 (88.14) | 85.61 (83.45) / 89.36 (86.65)  | 86.70 (84.75) / 90.38 (87.88)    | 87.08 (84.90) / 90.66 (87.95)
Swedish | 83.13 (82.71) / 88.57 (87.59) | 81.00 (80.71) / 86.54 (85.68)  | 82.59 (82.25) / 88.19 (87.29)    | 83.39 (82.67) / 88.59 (87.38)
Turkish | 61.80 (70.09) / 72.75 (77.39) | 58.10 (67.44) / 68.03 (74.06)  | 61.92* (70.64*) / 72.18 (77.46*) | 62.80 (71.33) / 73.49 (78.44)

Table 2: Parsing accuracy of the undirected 2-planar parser with naive (U2PlanarN) and label-based (U2PlanarL) postprocessing in comparison to the directed 2-planar (2Planar) and MaltParser arc-eager pseudo-projective (MaltPP) algorithms. The meaning of the scores shown is as in Table 1.
Lang.   | Covington LAS / UAS           | UCovingtonN LAS / UAS          | UCovingtonL LAS / UAS
Arabic  | 65.17 (65.49) / 75.99 (75.69) | 63.49 (63.93) / 74.41 (74.20)  | 65.61* (65.81*) / 76.11* (75.66)
Chinese | 85.61 (85.61) / 89.64 (89.62) | 84.12 (84.02) / 87.85 (87.73)  | 86.28* (86.17*) / 90.16* (90.04*)
Czech   | 78.26 (77.43) / 84.04 (83.15) | 74.02 (74.78) / 79.80 (79.92)  | 78.42* (78.69*) / 84.50* (84.16*)
Danish  | 83.63 (82.89) / 88.50 (87.06) | 82.00 (81.61) / 86.55 (85.51)  | 84.27* (83.85*) / 88.82* (87.75*)
German  | 86.70 (85.69) / 89.08 (87.78) | 84.03 (83.51) / 86.16 (85.39)  | 86.50 (85.90*) / 88.84 (87.95*)
Portug. | 84.73 (82.56) / 89.10 (86.30) | 83.83 (81.71) / 87.88 (85.17)  | 84.95* (82.70*) / 89.18* (86.31*)
Swedish | 83.53 (82.76) / 88.91 (87.61) | 81.78 (81.47) / 86.78 (85.96)  | 83.09 (82.73) / 88.11 (87.23)
Turkish | 64.25 (72.70) / 74.85 (79.75) | 63.51 (72.08) / 74.07 (79.10)  | 64.91* (73.38*) / 75.46* (80.40*)

Table 3: Parsing accuracy of the undirected Covington non-projective parser with naive (UCovingtonN) and label-based (UCovingtonL) postprocessing in comparison to the directed algorithm (Covington). The meaning of the scores shown is as in Table 1.

References

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. "Floresta sintá(c)tica": a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1698–1703, Paris, France. ELRA.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-03), pages 243–246, Morristown, NJ, USA. Association for Computational Linguistics.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, September 20–21, Sozopol, Bulgaria.

Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, and Marie Mikulová. 2006. Prague Dependency Treebank 2.0. CD-ROM CAT: LDC2006T01, ISBN 1-58563-370-4. Linguistic Data Consortium.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1077–1086, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthias T. Kromann. 2003. The Danish dependency treebank and the underlying linguistic theory. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 217–220, Växjö, Sweden. Växjö University Press.
Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

K. Chen, C. Luo, M. Chang, F. Chen, C. Chen, C. Huang, and Z. Gao. 2003. Sinica treebank: Design criteria, representational issues and implementation. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 13, pages 231–248. Kluwer.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Marco Kuhlmann and Joakim Nivre. 2006. Mildly non-projective dependency structures. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 507–514.

Andre Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 342–350.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.
Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Carlos Gómez-Rodríguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1492–1501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in data and tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from Antiquity. In Peter Juel Henrichsen, editor, Proceedings of the NODALIDA Special Session on Treebanks.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004), pages 49–56, Morristown, NJ, USA. Association for Computational Linguistics.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.
Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür. 2003. Building a Turkish treebank. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, pages 261–277. Kluwer.

Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University, Computer Science.

R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25–35.

Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT), pages 144–155.

The Best of Both Worlds – A Graph-based Completion Model for Transition-based Parsers

Bernd Bohnet and Jonas Kuhn
University of Stuttgart, Institute for Natural Language Processing
{bohnet,jonas}@ims.uni-stuttgart.de

Abstract

Transition-based dependency parsers are often forced to make attachment decisions at a point when only partial information about the relevant graph configuration is available. In this paper, we describe a model that takes into account complete structures as they become available to rescore the elements of a beam, combining the advantages of transition-based and graph-based approaches.
We also propose an efficient implementation that allows for the use of sophisticated features in incremental processing and show that the completion model leads to a substantial increase in accuracy. We apply the new transition-based parser to typologically different languages such as English, Chinese, Czech, and German and report competitive labeled and unlabeled attachment scores.

1 Introduction

Background. A considerable amount of recent research has gone into data-driven dependency parsing, and interestingly, throughout the continuous process of improvements, two classes of parsing algorithms have stayed at the centre of attention: the transition-based approach (Nivre, 2003) and the graph-based approach (Eisner, 1996; McDonald et al., 2005).[1] The two approaches apply fundamentally different strategies to solve the task of finding the optimal labeled dependency tree over the words of an input sentence (where supervised machine learning is used to estimate the scoring parameters on a treebank).

The transition-based approach is based on the conceptually (and cognitively) compelling idea that machine learning, i.e., a model of linguistic experience, is used in exactly those situations where there is an attachment choice in an otherwise deterministic incremental left-to-right parsing process. As a new word is processed, the parser has to decide on one out of a small number of possible transitions (adding a dependency arc pointing to the left or right and/or pushing or popping a word on/from a stack representation). Obviously, the learning can be based on the feature information available at a particular snapshot in incremental processing, i.e., only surface information for the unparsed material to the right, but full structural information for the parts of the string already processed. For the completely processed parts, there are no principled limitations as regards the types of structural configurations that can be checked in feature functions.

The graph-based approach, in contrast, emphasizes the objective of exhaustive search over all possible trees spanning the input words. Commonly, dynamic programming techniques are used to decide on the optimal tree for each particular word span, considering all candidate splits into subspans and successively building longer spans in a bottom-up fashion (similar to chart-based constituent parsing). Machine learning drives the process of deciding among alternative candidate splits, i.e., feature information can draw on full structural information for the entire material in the span under consideration.
However, due to the dynamic programming approach, the features cannot use arbitrarily complex structural configurations: otherwise the dynamic programming chart would have to be split into exponentially many special states. The typical feature models are based on combinations of edges (so-called second-order factors) that closely follow the bottom-up combination of subspans in the parsing algorithm, i.e., the feature functions depend on the presence of two specific dependency edges. Configurations not directly supported by the bottom-up building of larger spans are more cumbersome to integrate into the model (since the combination algorithm has to be adjusted), in particular third-order factors or higher.

Empirically, i.e., when applied in supervised machine learning experiments based on existing treebanks for various languages, both strategies (and further refinements of them not mentioned here) turn out roughly equal in their capability of picking up most of the relevant patterns;

[1] More references will be provided in Section 2.
Access to in- plementary, such that stacking of two parsers rep- formation that is yet unavailable would help the parser to decide on the correct transition. resenting both strategies yields the best results (Nivre and McDonald, 2008): in training and ap- Here, the parser has to decide whether to create an plication, one of the parsers is run on each sen- edge between house and with or between bought tence prior to the other, providing additional fea- and with (which is technically achieved by first ture information for the other parser. Another suc- popping house from the stack and then adding the cessful technique to combine parsers is voting as edge). At this time, no information about the ob- carried out by Sagae and Lavie (2006). ject of with is available; with fails to provide what The present paper addresses the question if we call a complete factor for the calculation of the and how a more integrated combination of the scores of the alternative transitions under consid- strengths of the two strategies can be achieved eration. In other words, the model cannot make and implemented efficiently to warrant competi- use of any evidence to distinguish between the tive results. two examples in Figure 1, and it is bound to get one of the two cases wrong. The main issue and solution strategy. In or- Figure 2 illustrates the same case from the per- der to preserve the conceptual (and complexity) spective of a graph-based parser. advantages of the transition-based strategy, the integrated algorithm we are looking for has to be transition-based at the top level. The advan- tages of the graph-based approach – a more glob- ally informed basis for the decision among dif- ferent attachment options – have to be included as part of the scoring procedure. As a prerequi- Figure 2: A second order model as used in graph-based site, our algorithm will require a memory for stor- parsers has access to the crucial information to build ing alternative analyses among which to choose. 
Here, the combination of subspans is performed at a point when their internal structure has been finalized, i.e., the attachment of with (to bought or house) is not decided until it is clear that friend is the object of with; hence, the semantically important lexicalization of with's object informs the higher-level attachment decision through a so-called second-order factor in the feature model. Given a suitable amount of training data, the model can thus learn to make the correct decision. The dynamic-programming-based graph-based parser is designed in such a way that any score calculation is based on complete factors for the subspans that are combined at this point.
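The effect of such a second-order factor can be shown with a toy scoring function; the feature tuples and weights below are invented for illustration and do not come from the paper's model:

```python
# Toy illustration of a second-order factor: the score of attaching the
# preposition "with" depends not only on the candidate head but also on the
# object of "with" - information a purely incremental parser lacks at
# decision time. Weights are invented for the example.
SECOND_ORDER_WEIGHTS = {
    ("bought", "with", "friend"): 2.0,   # comitative reading: bought ... with a friend
    ("house", "with", "garden"): 2.0,    # noun-modifier reading: house with a garden
    ("house", "with", "friend"): 0.1,
    ("bought", "with", "garden"): 0.1,
}

def second_order_score(head, prep, prep_object):
    """Score of the factor head -> prep -> prep_object."""
    return SECOND_ORDER_WEIGHTS.get((head, prep, prep_object), 0.0)

def best_attachment(candidates, prep, prep_object):
    return max(candidates, key=lambda h: second_order_score(h, prep, prep_object))
```

Once the object of "with" is known, the factor disambiguates the attachment: "friend" pulls the preposition towards the verb, "garden" towards the noun.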
higher score for attachment to house, but with bought following narrowly behind), there would 2 Related Work be no point in the further processing of the sen- Kudo and Matsumoto (2002) and Yamada and tence at which the choice could be corrected: the Matsumoto (2003) carried over the idea for de- transition-based parser still needs to make the de- terministic parsing by chunks from Abney (1991) cision that friend is attached to with, but this will to dependency parsing. Nivre (2003) describes not lead the parser to reconsider the decision made in a more strict sense the first incremental parser earlier on. that tries to find the most appropriate dependency The strategy we describe in this paper applies tree by a sequence of local transitions. In order in this very type of situation: whenever infor- to optimize the results towards a more globally mation is added in the transition-based parsing optimal solution, Johansson and Nugues (2006) process, the scores of all the histories stored in first applied beam search, which leads to a sub- the beam are recalculated based on a scoring stantial improvment of the results (cf. also (Titov model inspired by the graph-based parsing ap- and Henderson, 2007)). Zhang and Clark (2008) proach, i.e., taking complete factors into account augment the beam-search algorithm, adapting the as they become incrementally available. As a con- early update strategy of Collins and Roark (2004) sequence the beam is reordered, and hence, the to dependency parsing. In this approach, the incorrect preference of an attachment of with to parser stops and updates the model when the or- house (based on incomplete factors) can later be acle transition sequence drops out of the beam. 
corrected as friend is processed and the complete In contrast to most other approaches, the training second-order factor becomes available.2 procedure of Zhang and Clark (2008) takes the The integrated transition-based parsing strategy complete transition sequence into account as it is has a number of advantages: calculating the update. Zhang and Clark compare (1) We can integrate and investigate a number of aspects of transition-based and graph-based pars- third order factors, without the need to implement ing, and end up using a transition-based parser a more complex parsing model each time anew to with a combined transition-based/second-order explore the properties of such distinct model. graph-based scoring model (Zhang and Clark, (2) The parser with completion model main- 2008, 567), which is similar to the approach we tains the favorable complexity of transition-based describe in this paper. However, their approach parsers. does not involve beam rescoring as the partial (3) The completion model compensates for the structures built by the transition-based parser are lower accuracy of cases when only incomplete in- subsequently augmented; hence, there are cases in formation is available. which our approach is able to differentiate based (4) The parser combines the two leading pars- on higher-order factors that go unnoticed by the ing paradigms in a single efficient parser with- combined model of (Zhang and Clark, 2008, 567). out stacking the two approaches. Therefore the One step beyond the use of a beam is a dynamic programming approach to carry out a full search 2 Since search is not exhaustive, there is of course a slight in the state space, cf. (Huang and Sagae, 2010; danger that the correct history drops out of the beam before complete information becomes available. But as our experi- Kuhlmann et al., 2011). However, in this case ments show, this does not seem to be a serious issue empiri- one has to restrict the employed features to a set cally. 
This is a trade-off between an exhaustive search and an unrestricted (rich) feature set, and the question which provides a higher accuracy is still an open research question, cf. (Kuhlmann et al., 2011).

Parsing of non-projective dependency trees is an important feature for many languages. At first, most algorithms were restricted to projective dependency trees and used pseudo-projective parsing (Kahane et al., 1998; Nivre and Nilsson, 2005). Later, additional transitions were introduced to handle non-projectivity (Attardi, 2006; Nivre, 2009). The most common strategy uses the swap transition (Nivre, 2009; Nivre et al., 2009); an alternative solution uses two planes and a switch transition to switch between the two planes (Gómez-Rodríguez and Nivre, 2010).

Since we use the scoring model of a graph-based parser, we briefly review related work on graph-based parsing. The most well-known graph-based parser is the MST (maximum spanning tree) parser, cf. (McDonald et al., 2005; McDonald and Pereira, 2006). The idea of the MST parser is to find the highest scoring tree in a graph that contains all possible edges. Eisner (1996) introduced a dynamic programming algorithm to solve this problem efficiently. Carreras (2007) introduced the left-most and right-most grandchild as factors. We use the factor model of Carreras (2007) as the starting point for our experiments, cf. Section 4. We extend the graph-based model of Carreras (2007) with factors involving three edges, similar to those of Koo and Collins (2010).

3 Transition-based Parser with a Beam

This section specifies the transition-based beam-search parser underlying the combined approach more formally. Sec. 4 will discuss the graph-based scoring model that we are adding.

The input to the parser is a word string x; the goal is to find the optimal set y of labeled edges xi →l xj forming a dependency tree over x ∪ {root}. We characterize the state of a transition-based parser as πi = ⟨σi, βi, yi, hi⟩, πi ∈ Π, the set of possible states. σi is a stack of words from x that are still under consideration; βi is the input buffer, the suffix of x yet to be processed; yi is the set of labeled edges already assigned (a partial labeled dependency tree); hi is a sequence recording the history of transitions (from the set of operations Ω = {shift, left-arcl, right-arcl, reduce, swap}) taken up to this point.

(1) The initial state π0 has an empty stack, the input buffer is the full input string x, and the edge set is empty. (2) The (partial) transition function τ(πi, t): Π × Ω → Π maps a state and an operation t to a new state πi+1. (3) Final states πf are characterized by an empty input buffer and stack; no further transitions can be taken.

The transition function is informally defined as follows: The shift transition removes the first element of the input buffer and pushes it to the stack. The left-arcl transition adds an edge with label l from the first word in the buffer to the word on top of the stack, removes the top element from the stack, and pushes the first element of the input buffer to the stack. The right-arcl transition adds an edge from the word on top of the stack to the first word in the input buffer, removes the top element of the input buffer, and pushes that element onto the stack. The reduce transition pops the top word from the stack. The swap transition changes the order of the two top elements on the stack (possibly generating non-projective trees).

When more than one operation is applicable, a scoring function assigns a numerical value (based on a feature vector and a weight vector trained by supervised machine learning) to each possible continuation. When using a beam search approach with beam size k, the highest-scoring k alternative states with the same length n of transition history h are kept in a set "beam_n". In the beam-based parsing algorithm (cf. the pseudo code in Algorithm 1), all candidate states for the next set "beam_n+1" are determined using the transition function τ, but based on the scoring function, only the best k are preserved. (Final) states to which no more transitions apply are copied to the next state set. This means that once all transition paths have reached a final state, the overall best-scoring states can be read off the final "beam_n". The y of the top-scoring state is the predicted parse.

Algorithm 1: Transition-based parser
// x is the input sentence, k is the beam size
σ0 ← ∅, β0 ← x, y0 ← ∅, h0 ← ∅
π0 ← ⟨σ0, β0, y0, h0⟩ // initial state
beam_0 ← {π0}
n ← 0 // iteration
repeat
  n ← n + 1
  for all πj ∈ beam_{n−1} do
    transitions ← possible-applicable-transitions(πj)
    if transitions = ∅ then
      // no transition is applicable: keep state πj
      beam_n ← beam_n ∪ {πj}
    else
      for all ti ∈ transitions do
        π ← τ(πj, ti) // apply transition ti to state πj
        beam_n ← beam_n ∪ {π}
      end for
  end for
  sort beam_n according to score(πj)
  beam_n ← sublist(beam_n, 0, k)
until beam_{n−1} = beam_n // beam unchanged?

Under the plain transition-based scoring regime scoreT, the score for a state π is the sum of the "local" scores for the transitions ti in the state's history sequence:

scoreT(π) = Σ_{i=0}^{|h|} w · f(πi, ti)

w is the weight vector. Note that the features f(πi, ti) can take into account all structural and labeling information available prior to taking transition ti, i.e., the graph built so far, the words (and their part of speech etc.) on the stack and in the input buffer, etc. But if a larger graph configuration involving the next word evolves only later, as in Figure 1, this information is not taken into account in scoring. For instance, if the feature extraction uses the subcategorization frame of a word under consideration to compute a score, it is quite possible that some dependents are still missing and will only be attached in a future transition.

4 Completion Model

We define an augmented scoring function which can be used in the same beam-search algorithm in order to ensure that, in the scoring of alternative transition paths, larger configurations can be exploited as they are completed in the incremental process. The feature configurations can be largely taken from graph-based approaches. Here, spans from the string are assembled in a bottom-up fashion, and the scoring for an edge can be based on structurally completed subspans ("factors").

Our completion model for scoring a state πn incorporates factors for all configurations (matching the extraction scheme that is applied) that are present in the partial dependency graph yn built up to this point, which is continuously augmented. This means that if, at a given point n in the transition path, complete information for a particular configuration (e.g., a third-order factor involving a head, its dependent and its grand-child dependent) is unavailable, scoring will ignore this factor at time n, but the configuration will inform the scoring later on, maybe at point n+4, when the complete information for this factor has entered the partial graph y_{n+4}.

We present results for a number of different second-order and third-order feature models.

Second Order Factors. We start with the model introduced by Carreras (2007). Figure 3 illustrates the factors used.

Figure 3: Model 2a. Second order factors of Carreras (2007). We omit the right-headed cases, which are mirror images. The model comprises a factoring into one first order part and three second order factors (2-4): 1) the head (h) and the dependent (c); 2) the head, the dependent, and the left-most (or right-most) grandchild in between (cmi); 3) the head, the dependent, and the right-most (or left-most) grandchild away from the head (cmo); 4) the head, the dependent, and, between those words, the right-most (or left-most) sibling (ci).

Figure 4: 2b. The left-most dependent of the head, or the right-most dependent in the right-headed case.

Figure 4 illustrates a new type of factor we use, which includes the left-most dependent in the left-headed case and, symmetrically, the right-most sibling in the right-headed case.
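To make the factor extraction concrete, the following is a hedged sketch (our own simplified illustration, not the parser's actual code; labels and direction handling are omitted) of how the second-order factors of model 2a can be enumerated from a partial dependency tree represented as a dependent-to-head map:

```python
# Hedged illustration (our own simplification, not the parser's actual code):
# enumerate the second-order factors of model 2a from a partial dependency
# tree, given as a map from dependent index to head index.

def children(tree, head):
    """All dependents of `head` in the partial tree, sorted by position."""
    return sorted(d for d, h in tree.items() if h == head)

def second_order_factors(tree):
    """Yield (kind, nodes) pairs for every edge (h, c) of the partial tree:
    the first-order edge (h, c); the sibling closest to c between h and c (ci);
    a grandchild between h and c (cmi); a grandchild beyond c (cmo)."""
    for c, h in tree.items():
        yield ("first", (h, c))
        inside = range(min(h, c) + 1, max(h, c))  # positions strictly between
        sibs = [s for s in children(tree, h) if s != c and s in inside]
        if sibs:
            yield ("sib", (h, c, min(sibs, key=lambda s: abs(s - c))))
        g_in = [g for g in children(tree, c) if g in inside]
        g_out = [g for g in children(tree, c) if g not in inside]
        if g_in:
            yield ("gra_in", (h, c, max(g_in) if c > h else min(g_in)))
        if g_out:
            yield ("gra_out", (h, c, min(g_out) if c > h else max(g_out)))

# toy partial tree over word positions 1..7
# (attachments illustrative only, loosely "He bought a house with his friend")
tree = {1: 2, 3: 4, 4: 2, 5: 2, 6: 7, 7: 5}
factors = list(second_order_factors(tree))
```

Here `gra_in`/`gra_out` loosely correspond to the cmi/cmo grandchild factors; a real implementation would additionally carry edge labels and attachment direction in each factor.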
Third Order Factors. In addition to the second order factors, we investigate combinations of third order factors. Figures 5 and 6 illustrate the third order factors, which are similar to the factors of Koo and Collins (2010). They restrict the factor to the innermost sibling pair for the tri-siblings and the outermost pair for the grand-siblings. We use the first two siblings of the dependent from the left side of the head for the tri-siblings and the first two dependents of the child for the grand-siblings. With these factors, we aim to capture non-projective edges and subcategorization information. Figure 7 illustrates a factor of a sequence of four nodes. All the right-headed variants are symmetrical and left out for brevity.

Figure 5: 3a. The first two children of the head, which do not include the edge between the head and the dependent.

Figure 6: 3b. The first two children of the dependent.

Figure 7: 3c. The right-most dependent of the right-most dependent.

Integrated approach. To obtain an integrated system for the various feature models, the scoring function of the transition-based parser from Section 3 is augmented by a family of scoring functions scoreGm for the completion model, where m is from 2a, 2b, 3a etc., x is the input string, and y is the (partial) dependency tree built so far:

scoreTm(π) = scoreT(π) + scoreGm(x, y)

The scoring function of the completion model depends on the selected factor model Gm. The model G2a comprises the edge factoring of Figure 3. With this model, we obtain the following scoring function:

(2a) scoreG2a(x, y) = Σ_{(h,c)∈y} w · f_first(x, h, c)
                    + Σ_{(h,c,ci)∈y} w · f_sib(x, h, c, ci)
                    + Σ_{(h,c,cmo)∈y} w · f_gra(x, h, c, cmo)
                    + Σ_{(h,c,cmi)∈y} w · f_gra(x, h, c, cmi)

The function f maps the input sentence x and a subtree y defined by the indexes to a feature vector. Again, w is the corresponding weight vector. In order to add the factor of Figure 4 to our model, we have to add to the scoring function (2a) the sum:

(2b) scoreG2b(x, y) = scoreG2a(x, y) + Σ_{(h,c,cmi)∈y} w · f_gra(x, h, c, cmi)

In order to build a scoring function for a combination of the factors shown in Figures 5 to 7, we have to add to equation (2b) one or more of the following sums:

(3a) Σ_{(h,c,ch1,ch2)∈y} w · f_gra(x, h, c, ch1, ch2)
(3b) Σ_{(h,c,cm1,cm2)∈y} w · f_gra(x, h, c, cm1, cm2)
(3c) Σ_{(h,c,cmo,tmo)∈y} w · f_gra(x, h, c, cmo, tmo)

Feature Set. The feature set of the transition model is similar to that of Zhang and Nivre (2011). In addition, we use the cross product of morphological features between the head and the dependent, since we also apply the parser to morphologically rich languages.

The feature sets of the completion model described above are mostly based on previous work (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010). The models denoted with + use all combinations of words before and after the head, dependent, sibling, grandchildren, etc. These are, respectively, three- and four-grams for the first order and second order. The algorithm includes these features only if the words left and right do not overlap with the factor (e.g. the head, dependent, etc.). We use a feature extraction procedure for second order and third order factors. Each feature extracted in this procedure includes information about the position of the nodes relative to the other nodes of the part and a factor identifier.

The complexity of the transition-based parser is quadratic in the worst case due to the swap operation, which is rare, and O(n) in the best case, cf. (Nivre, 2009). The beam size B is constant. Hence, the complexity is in the worst case O(n²). The parsing time is to a large degree determined by the feature extraction, the score calculation, and the implementation, cf. also (Goldberg and Elhadad, 2010). The transition-based parser is able to parse 30 sentences per second. The parser with completion model processes about 5 sentences per second with a beam size of 80.³ Note that we use a rich feature set, a completion model with third order factors, negative features, and a large beam.

³ 6-core, 3.33 GHz Intel Nehalem

Training. For the training of our parser, we use a variant of the perceptron algorithm that uses the Passive-Aggressive update function, cf. (Freund and Schapire, 1998; Collins, 2002; Crammer et al., 2006). The Passive-Aggressive perceptron uses an aggressive update strategy, modifying the weight vector by as much as needed to classify the current example correctly, cf. (Crammer et al., 2006). We apply a random function (hash function) to retrieve the weights from the weight vector instead of a table. Bohnet (2010) showed that the Hash Kernel improves parsing speed and accuracy, since the parser additionally uses negative features. Ganchev and Dredze (2008) used this technique for structured prediction in NLP to reduce the needed space, cf. (Shi et al., 2009). We use a weight vector size of 800 million. After training, we counted 65 million non-zero weights for English (penn2malt), 83 million for Czech, and 87 million for German. The feature vectors are the union of the features originating from the transition sequence of a sentence and the features of the factors over all edges of a dependency tree (e.g. G2a, etc.). To prevent over-fitting, we use averaging, cf. (Freund and Schapire, 1998; Collins, 2002). We calculate the error e as the sum of all attachment errors and label errors, both weighted by 0.5. We use the following equations to compute the update:

loss:      l_t = e − (scoreT(x_t^g, y_t^g) − scoreT(x_t, y_t))
PA-update: τ_t = l_t / ||f_g − f_p||²
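As an illustration only, one such Passive-Aggressive update over a hashed weight vector can be sketched as follows. All names, the feature strings, and the tiny vector size are our own choices (the paper's vector has 800 million slots), and a real Hash Kernel implementation would use a different hash function:

```python
import zlib

# Hedged sketch of a Passive-Aggressive update over a hashed weight vector
# (Hash Kernel). All names and the small vector size are illustrative only.
SIZE = 2 ** 20
w = [0.0] * SIZE

def slot(feature):
    # map a string feature to a slot by hashing instead of a symbol table
    return zlib.crc32(feature.encode("utf-8")) % SIZE

def score(features):
    return sum(w[slot(f)] for f in features)

def pa_update(gold_feats, pred_feats, error):
    """One Passive-Aggressive step, following the equations above:
    l_t = e - (score(gold) - score(predicted));
    tau = l_t / ||f_g - f_p||^2;  w += tau * (f_g - f_p)."""
    loss = error - (score(gold_feats) - score(pred_feats))
    if loss <= 0:
        return  # prediction already separated well enough: stay passive
    diff = {}  # sparse difference vector f_g - f_p over feature strings
    for f in gold_feats:
        diff[f] = diff.get(f, 0.0) + 1.0
    for f in pred_feats:
        diff[f] = diff.get(f, 0.0) - 1.0
    norm = sum(v * v for v in diff.values())
    if norm > 0:
        tau = loss / norm
        for f, v in diff.items():
            w[slot(f)] += tau * v

# toy update: one wrong attachment, so error e = 1.0
pa_update(["edge:h=bought,d=with"], ["edge:h=house,d=with"], error=1.0)
```

After this single step, the gold edge feature scores higher than the predicted one; note that hashing can in principle collide two features into one slot, which the Hash Kernel accepts as a space/accuracy trade-off.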
We train the model to select the transitions and the completion model together, and therefore we use one parameter space. In order to compute the weight vector, we employ standard online learning with 25 training iterations and carry out early updates, cf. Collins and Roark (2004) and Zhang and Clark (2008).

Efficient Implementation. Keeping the scoring with the completion model tractable with millions of feature weights and for second- and third-order factors requires careful bookkeeping and a number of specialized techniques from recent work on dependency parsing.

We use two variables to store the scores: (a) for complete factors and (b) for incomplete factors. The complete factors (first-order factors and higher-order factors for which further augmentation is structurally excluded) need to be calculated only once and can then be stored with the tree factors. The incomplete factors (higher-order factors whose node elements may still receive additional descendants) need to be dynamically recomputed while the tree is built.

The parsing algorithm only has to compute the scores of the factored model when the transition-based parser selects a left-arc or right-arc transition and the beam has to be sorted. The parser sorts the beam when it exceeds the maximal beam size, in order to discard superfluous parses, or when the parsing algorithm terminates, in order to select the best parse tree.

We implemented the following optimizations:
(1) We use parallel feature extraction for the beam elements. Each process extracts the features, scores the possible transitions, and computes the score of the completion model. After the extension step, the beam is sorted and the best elements are selected according to the beam size.
(2) The calculation of each score is optimized (beyond the distinction of a static and a dynamic component): for each location determined by the last element sl ∈ σi and the first element b0 ∈ βi, we calculate a numeric feature representation. This is kept fixed, and we add only the numeric value for each of the edge labels plus a value for the transition left-arc or right-arc. In this way, we create the features incrementally. This has some similarity to Goldberg and Elhadad (2010).
(3) We apply edge filtering as it is used in graph-based dependency parsing, cf. (Johansson and Nugues, 2008), i.e., we calculate the edge weights only for the labels that were found for the part-of-speech combination of the head and dependent in the training data.

5 Parsing Experiments and Discussion

The results of different parsing systems are often hard to compare due to differences in phrase structure to dependency conversions, corpus versions, and experimental settings. For better comparison, we provide results on English for two commonly used data sets, based on two different conversions of the Penn Treebank. The first uses the Penn2Malt conversion based on the head-finding rules of Yamada and Matsumoto (2003).

Section          | Sentences | PoS Acc.
Training (2-21)  | 39,832    | 97.08
Dev (24)         | 1,394     | 97.18
Test (23)        | 2,416     | 97.30

Table 1: Overview of the training, development and test data split converted to dependency graphs with head-finding rules of (Yamada and Matsumoto, 2003). The last column shows the accuracy of Part-of-Speech tags.
Table 1 gives an overview of the properties of the corpus. The annotation of the corpus does not contain non-projective links. The training data was 10-fold jackknifed with our own tagger.⁴ Table 1 shows the tagging accuracy.

Table 2 lists the accuracy of our transition-based parser with completion model together with results from related work. All results use predicted PoS tags. As a baseline, we present in addition results without the completion model and a graph-based parser with second order features (G2a). For the graph-based parser, we used 10 training iterations. The rows denoted with T, T2a, T2ab, T2ab3a, T2ab3b, T2ab3c, and T2ab3abc present the results for the parser with completion model. The subscript letters denote the used factors of the completion model as shown in Figures 3 to 7. The parsers with a subscripted plus (e.g. G2a+) in addition use feature templates that contain one word left or right of the head, dependent, siblings, and grandchildren. We left those features out in our previous models as they may interfere with the second and third order factors. As in previous work, we exclude punctuation marks for the English data converted with Penn2Malt in the evaluation, cf. (McDonald et al., 2005; Koo and Collins, 2010; Zhang and Nivre, 2011).⁵ We optimized the feature model of our parser on section 24 and used section 23 for evaluation. We use a beam size of 80 for our transition-based parser and 25 training iterations.

Parser                        | UAS   | LAS
(McDonald et al., 2005)       | 90.9  |
(McDonald and Pereira, 2006)  | 91.5  |
(Huang and Sagae, 2010)       | 92.1  |
(Zhang and Nivre, 2011)       | 92.9  |
(Koo and Collins, 2010)       | 93.04 |
(Martins et al., 2010)        | 93.26 |
T (baseline)                  | 92.7  |
G2a (baseline)                | 92.89 |
T2a                           | 92.94 | 91.87
T2ab                          | 93.16 | 92.08
T2ab3a                        | 93.20 | 92.10
T2ab3b                        | 93.23 | 92.15
T2ab3c                        | 93.17 | 92.10
T2ab3abc+                     | 93.39 | 92.38
G2a+                          | 93.1  |
(Koo et al., 2008) †          | 93.16 |
(Carreras et al., 2008) †     | 93.5  |
(Suzuki et al., 2009) †       | 93.79 |

Table 2: English Attachment Scores for the Penn2Malt conversion of the Penn Treebank test set. Punctuation is excluded from the evaluation. The results marked with † are not directly comparable to our work as they depend on additional sources of information (Brown Clusters).

The second English data set was obtained by using the LTH conversion schema as used in the CoNLL Shared Task 2009, cf. (Hajič et al., 2009). This corpus preserves the non-projectivity of the phrase structure annotation, has a rich edge label set, and provides automatically assigned PoS tags. From the same data set, we selected the corpora for Czech and German. In all cases, we used the provided training, development, and test data split, cf. (Hajič et al., 2009). In contrast to the evaluation of the Penn2Malt conversion, we include punctuation marks for these corpora and follow in that the evaluation schema of the CoNLL Shared Task 2009. Table 3 presents the results as obtained for these data sets.

The transition-based parser obtains higher accuracy scores for Czech but still lower scores for English and German. For Czech, the result of T is 1.59 percentage points higher than the top labeled score in the CoNLL shared task 2009. The reason is that T already includes third order features that are needed to determine some edge labels. The transition-based parser with completion model T2a has an even 2.62 percentage points higher accuracy, and it could improve the results of the parser T by an additional 1.03 percentage points. The results of the parser T are lower for English and German compared to the results of the graph-based parser G2a. The completion model T2a can reach a similar accuracy level for these two languages. The third order features let the transition-based parser reach higher scores than the graph-based parser, although they contribute only a relatively small improvement of the score for each language. Small and statistically significant improvements are provided by the additional second order factor (2b).⁶ We tried to determine the best third order factor or set of factors, but we cannot name such a factor that is the best for all languages. For German, we obtained a significant improvement with the factor (3b). We believe that this is due to the flat annotation of PPs in the German corpus. If we combine all third order factors, we obtain for the Penn2Malt conversion a small improvement of 0.2 percentage points over the results of (2ab). We think that a deeper feature selection for third order factors may help to improve the accuracy further.

Parser                   | Eng.        | Czech       | German
(Gesmundo et al., 2009)† | 88.79/–     | 80.38       | 87.29
(Bohnet, 2009)           | 89.88/–     | 80.11       | 87.48
T (Baseline)             | 89.52/92.10 | 81.97/87.26 | 87.53/89.86
G2a (Baseline)           | 90.14/92.36 | 81.13/87.65 | 87.79/90.12
T2a                      | 90.20/92.55 | 83.01/88.12 | 88.22/90.36
T2ab                     | 90.26/92.56 | 83.22/88.34 | 88.31/90.24
T2ab3a                   | 90.20/90.51 | 83.21/88.30 | 88.14/90.23
T2ab3b                   | 90.26/92.57 | 83.22/88.35 | 88.50/90.59
T2ab3abc                 | 90.31/92.58 | 83.31/88.30 | 88.33/90.45
G2a+                     | 90.39/92.8  | 81.43/88.0  | 88.26/90.50
T2ab3ab+                 | 90.36/92.66 | 83.48/88.47 | 88.51/90.62

Table 3: Labeled Attachment Scores of parsers that use the data sets of the CoNLL shared task 2009. In line with previous work, punctuation is included. The parsers marked with † used a joint model for syntactic parsing and semantic role labelling. We provide more parsing results for the languages of the CoNLL-X Shared Task at http://code.google.com/p/mate-tools/.

In Table 4, we present results on the Chinese Treebank. To our knowledge, we obtain the best published results so far.

Parser                  | UAS  | LAS
(Zhang and Clark, 2008) | 84.3 |
(Huang and Sagae, 2010) | 85.2 |
(Zhang and Nivre, 2011) | 86.0 | 84.4
T2ab3abc+               | 87.5 | 85.9

Table 4: Chinese Attachment Scores for the conversion of CTB 5 with head rules of Zhang and Clark (2008). We take the standard split of CTB 5 and, in line with previous work, use gold segmentation and POS tags and exclude punctuation marks from the evaluation.

⁴ http://code.google.com/p/mate-tools/
⁵ We follow Koo and Collins (2010) and ignore any token whose POS tag is one of the following: `` '' : , .
⁶ The results of the baseline T compared to T2ab3abc are statistically significant (p < 0.01).

6 Conclusion and Future Work

The parser introduced in this paper combines advantageous properties from the two major paradigms in data-driven dependency parsing, in particular the worst-case quadratic complexity of transition-based parsing with a swap operation and the consideration of complete second and third order factors in the scoring of alternatives. While previous work using third order factors, cf. Koo and Collins (2010), was restricted to unlabeled and projective trees, our parser can produce labeled and non-projective dependency trees.

In contrast to parser stacking, which involves running two parsers in training and application, we use only the feature model of a graph-based parser but not the graph-based parsing algorithm. This is not only conceptually superior, but makes training much simpler, since no jackknifing has to be carried out. Zhang and Clark (2008) proposed a similar combination, without the rescoring procedure. Our implementation allows for the use of rich feature sets in the combined scoring functions, and our experimental results show that the "graph-based" completion model leads to an increase of between 0.4 (for English) and about 1 percentage point (for Czech). The scores go beyond the current state-of-the-art results for typologically different languages such as Chinese, Czech, English, and German. For Czech, English (Penn2Malt), and German, these are to our knowledge the highest reported scores of a dependency parser that does not use additional sources of information (such as extra unlabeled training data for clustering). Note that the efficient techniques and implementation, such as the Hash Kernel, the incremental calculation of the scores of the completion model, and the parallel feature extraction, as well as the parallelized transition-based parsing strategy, play an important role in carrying out this idea in practice.

References

S. Abney. 1991. Parsing by chunks. In Principle-Based Parsing, pages 257–278. Kluwer Academic Publishers.

G. Attardi. 2006. Experiments with a Multilanguage Non-Projective Dependency Parser. In Tenth Conference on Computational Natural Language Learning (CoNLL-X).

B. Bohnet. 2009. Efficient Parsing of Syntactic and Semantic Dependency Structures. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009).

B. Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 89–97, Beijing, China, August. Coling 2010 Organizing Committee.

X. Carreras, M. Collins, and T. Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL '08, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

X. Carreras. 2007. Experiments with a Higher-order Projective Dependency Parser. In EMNLP/CoNLL.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL, pages 111–118.

M. Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP.

K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 7:551–585.

J. Eisner. 1996. Three New Probabilistic Models for Dependency Parsing: An Exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pages 340–345, Copenhagen.

Y. Freund and R. E. Schapire. 1998. Large margin classification using the perceptron algorithm. In 11th Annual Conference on Computational Learning Theory, pages 209–217, New York, NY. ACM Press.

K. Ganchev and M. Dredze. 2008. Small statistical models by random feature mixing. In Proceedings of the ACL-2008 Workshop on Mobile Language Processing. Association for Computational Linguistics.

A. Gesmundo, J. Henderson, P. Merlo, and I. Titov. 2009. A Latent Variable Model of Synchronous Syntactic-Semantic Parsing for Multiple Languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado, USA, June 4-5.

Y. Goldberg and M. Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In HLT-NAACL, pages 742–750.

C. Gómez-Rodríguez and J. Nivre. 2010. A Transition-Based Parser for 2-Planar Dependency Structures. In ACL, pages 1492–1501.

J. Hajič, M. Ciaramita, R. Johansson, D. Kawahara, M. Antònia Martí, L. Màrquez, A. Meyers, J. Nivre, S. Padó, J. Štěpánek, P. Straňák, M. Surdeanu, N. Xue, and Y. Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, United States, June.

L. Huang and K. Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086, Uppsala, Sweden, July. Association for Computational Linguistics.

R. Johansson and P. Nugues. 2006. Investigating multilingual dependency parsing. In Proceedings of the Shared Task Session of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 206–210, New York City, United States, June 8-9.

R. Johansson and P. Nugues. 2008. Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank. In Proceedings of the Shared Task Session of CoNLL-2008, Manchester, UK.

S. Kahane, A. Nasr, and O. Rambow. 1998. Pseudo-projectivity: A polynomially parsable non-projective dependency grammar. In COLING-ACL, pages 646–652.

T. Koo and M. Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.

T. Koo, X. Carreras, and M. Collins. 2008. Simple semi-supervised dependency parsing. pages 595–603.

T. Kudo and Y. Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

M. Kuhlmann, C. Gómez-Rodríguez, and G. Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In ACL, pages 673–682.

A. Martins, N. Smith, E. Xing, P. Aguiar, and M. Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. pages 34–44.

R. McDonald and F. Pereira. 2006. Online Learning of Approximate Dependency Parsing Algorithms. In Proc. of EACL, pages 81–88.

R. McDonald, K. Crammer, and F. Pereira. 2005. Online Large-margin Training of Dependency Parsers. In Proc. ACL, pages 91–98.

J. Nivre and R. McDonald. 2008. Integrating Graph-Based and Transition-Based Dependency Parsers. In ACL-08, pages 950–958, Columbus, Ohio.

J. Nivre and J. Nilsson. 2005. Pseudo-projective dependency parsing. In ACL.

J. Nivre, M. Kuhlmann, and J. Hall. 2009. An improved oracle for dependency parsing with online reordering. In Proceedings of the 11th International Conference on Parsing Technologies, IWPT '09, pages 73–76, Stroudsburg, PA, USA. Association for Computational Linguistics.

J. Nivre. 2003. An Efficient Algorithm for Projective Dependency Parsing. In 8th International Workshop on Parsing Technologies, pages 149–160, Nancy, France.

J. Nivre. 2009. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 351–359, Suntec, Singapore.

K. Sagae and A. Lavie. 2006. Parser combination by reparsing. In NAACL '06: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 129–132, Morristown, NJ, USA. Association for Computational Linguistics.

Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S.V.N. Vishwanathan. 2009. Hash Kernels for Structured Data. In Journal of Machine Learning.

J. Suzuki, H. Isozaki, X. Carreras, and M. Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In EMNLP, pages 551–560.

I. Titov and J. Henderson. 2007. A Latent Variable Model for Generative Dependency Parsing. In Proceedings of IWPT, pages 144–155.

H. Yamada and Y. Matsumoto. 2003. Statistical Dependency Analysis with Support Vector Machines. In Proceedings of IWPT, pages 195–206.

Y. Zhang and S. Clark. 2008. A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing using beam-search. In Proceedings of EMNLP, Hawaii, USA.

Y. Zhang and J. Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.

Answer Sentence Retrieval by Matching Dependency Paths Acquired from Question/Answer Sentence Pairs
Michael Kaisser
AGT Group (R&D) GmbH
Jägerstr. 41, 10117 Berlin, Germany

[email protected]

Abstract

In Information Retrieval (IR) in general and Question Answering (QA) in particular, queries and relevant textual content often differ significantly in their properties and are therefore difficult to relate with traditional IR methods, e.g. keyword matching. In this paper we describe an algorithm that addresses this problem, but rather than looking at it on a term matching/term reformulation level, we focus on the syntactic differences between questions and relevant text passages. To this end we propose a novel algorithm that analyzes the dependency structures of queries and known relevant text passages and acquires transformational patterns that can be used to retrieve relevant textual content. We evaluate our algorithm in a QA setting and show that it outperforms a baseline that uses only dependency information contained in the questions by 300%, and that it also significantly improves the performance of a state-of-the-art QA system.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 88-98, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

It is a well-known problem in Information Retrieval (IR) and Question Answering (QA) that queries and relevant textual content often differ significantly in their properties, and are therefore difficult to match with traditional IR methods. A common example is a user entering words to describe their information need that do not match the words used in the most relevant indexed documents. This work addresses this problem, but shifts focus from words to the syntactic structures of questions and relevant pieces of text. To this end, we present a novel algorithm that analyses the dependency structures of known valid answer sentences and from these acquires patterns that can be used to more precisely retrieve relevant text passages from the underlying document collection. To achieve this, the position of key phrases in the answer sentence relative to the answer itself is analyzed and linked to a certain syntactic question type. Unlike most previous work that uses dependency paths for QA (see Section 2), our approach does not require a candidate sentence to be similar to the question in any respect. We learn valid dependency structures from the known answer sentences alone, and are therefore able to link a much wider spectrum of answer sentences to the question.

The work in this paper is presented and evaluated in a classical factoid Question Answering (QA) setting. The main reason for this is that in QA suitable training and test data is available in the public domain, e.g. via the Text REtrieval Conference (TREC), see for example (Voorhees, 1999). The methods described in this paper, however, can also be applied to other IR scenarios, e.g. web search. The necessary condition for our approach to work is that the user query is somewhat grammatically well formed; such queries are commonly referred to as Natural Language Queries or NLQs.

Table 1 provides evidence that users indeed search the web with NLQs. The data is based on two query sets sampled from three months of user logs from a popular search engine, using two different sampling techniques. The "head" set samples queries taking query frequency into account, so that more common queries have a proportionally higher chance of being selected. The "tail" query set samples only queries that have been issued fewer than 500 times during the three-month period, and it disregards query frequency. As a result, rare and frequent queries have the same chance of being selected. Doubles are excluded from both sets. Table 1 lists the percentage of queries in the query sets that start with the specified word. In most contexts this indicates that the query is a question, which in turn means that we are dealing with an NLQ. Of course there are many NLQs that start with words other than the ones listed, so we can expect their real percentage to be even higher.

Set       Head     Tail
Query #   15,665   12,500
how       1.33%    2.42%
what      0.77%    1.89%
define    0.34%    0.18%
is/are    0.25%    0.42%
where     0.18%    0.45%
do/does   0.14%    0.30%
can       0.14%    0.25%
why       0.13%    0.30%
who       0.12%    0.38%
when      0.09%    0.21%
which     0.03%    0.08%
Total     3.55%    6.86%

Table 1: Percentages of Natural Language Queries in head and tail search engine query logs. See text for details.

2 Related Work

In IR, the problem that queries and relevant textual content often do not exhibit the same terms is commonly encountered. Latent Semantic Indexing (Deerwester et al., 1990) was an early, highly influential approach to solving this problem. More recently, a significant amount of research has been dedicated to query alteration approaches. (Cui et al., 2002), for example, assume that if queries containing one term often result in the selection of documents containing another term, then a strong relationship between the two terms exists. In their approach, query terms and document terms are linked via sessions in which users click on documents that are presented as results for the query. (Riezler and Liu, 2010) apply a Statistical Machine Translation model to parallel data consisting of user queries and snippets from clicked web documents, and in this way extract contextual expansion terms from the query rewrites.

We see our work as addressing the same fundamental problem, but shifting focus from query term/document term mismatch to mismatches observed between the grammatical structure of Natural Language Queries and relevant text pieces. In order to achieve this, we analyze the queries' and the relevant contents' syntactic structure using dependency paths.

Especially in QA there is a strong tradition of using dependency structures: (Lin and Pantel, 2001) present an unsupervised algorithm to automatically discover inference rules (essentially paraphrases) from text. These inference rules are based on dependency paths, each of which connects two nouns. Their paths have the following form:

N:subj:V←find→V:obj:N→solution→N:to:N

This path represents the relation "X finds a solution to Y" and can be mapped to another path representing e.g. "X solves Y." As such, the approach is suitable for detecting paraphrases that describe the relation between two entities in documents. However, the paper does not describe how the mined paraphrases can be linked to questions, or which paraphrase is suitable to answer which question type.

(Attardi et al., 2001) describe a QA system that, after a set of candidate answer sentences has been identified, matches their dependency relations against the question. Questions and answer sentences are parsed with MiniPar (Lin, 1998) and the dependency output is analyzed in order to determine whether relations present in a question also appear in a candidate sentence. For the question "Who killed John F. Kennedy?", for example, an answer sentence is expected to contain the answer as the subject of the verb "kill", to which "John F. Kennedy" should be in object relation.

(Cui et al., 2005) describe a fuzzy dependency relation matching approach to passage retrieval in QA. Here, the authors present a statistical technique to measure the degree of overlap between dependency relations in candidate sentences and their corresponding relations in the question. Question/answer passage pairs from the TREC-8 and TREC-9 evaluations are used as training data. As in some of the papers mentioned earlier, a statistical translation model is used, but this time to learn relatedness between paths. (Cui et al., 2004) apply the same idea to answer extraction.
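The dependency paths that this line of work, and the present paper, operate on can be made concrete with a short sketch. The following Python fragment is our own illustration, not code from any of the cited systems: it computes the labelled path between two tokens of a dependency parse, using a toy parse of the answer sentence later analyzed in Figure 1, and an ASCII rendering ("^" for edges walked towards the root, "v" for edges walked away from it) of the arrow notation used in this paper.

```python
# Illustrative sketch (ours): the labelled dependency path between two
# tokens, with "^" for edges walked towards the root and "v" for edges
# walked away from it (the paper prints these as up/down arrows).

# Toy parse of "The acquisition of Alaska happened in 1867."
# token id -> (form, head id or None for the root, dependency label)
PARSE = {
    1: ("The", 2, "det"),
    2: ("acquisition", 5, "nsubj"),
    3: ("of", 2, "prep"),
    4: ("Alaska", 3, "pobj"),
    5: ("happened", None, "ROOT"),
    6: ("in", 5, "prep"),
    7: ("1867", 6, "pobj"),
}

def ancestors(i):
    """Token ids from i up to the root, inclusive."""
    chain = [i]
    while PARSE[chain[-1]][1] is not None:
        chain.append(PARSE[chain[-1]][1])
    return chain

def dep_path(a, b):
    """Labelled path a => b via the lowest common ancestor."""
    up, down = ancestors(a), ancestors(b)
    lca = next(n for n in up if n in down)
    rising = up[:up.index(lca)]               # edges climbed, in order
    falling = down[:down.index(lca)][::-1]    # edges descended, in order
    return ("".join("^" + PARSE[n][2] for n in rising) +
            "".join("v" + PARSE[n][2] for n in falling))

print(dep_path(4, 7))  # Alaska => 1867: ^pobj^prep^nsubjvprepvpobj
print(dep_path(2, 7))  # acquisition => 1867: ^nsubjvprepvpobj
```

Such a path records which dependency edge is crossed at each step; whether two paths must be identical, merely similar, or only terminate at the same node is precisely where the systems discussed in this section differ.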
In (Cui et al., 2004), all named entities of the expected answer types in each sentence returned by the IR module are treated as answer candidates. For questions with an unknown answer type, all NPs in the candidate sentence are considered. Then those paths in the answer sentence that are connected to an answer candidate are compared against the corresponding paths in the question, in a similar fashion as in (Cui et al., 2005). The candidate whose paths show the highest matching score is selected. (Shen and Klakow, 2006) also describe a method that is primarily based on similarity scores between dependency relation pairs. However, their algorithm computes the similarity of paths between key phrases, not between words. Furthermore, it takes the relations in a path not as independent from each other, but acknowledges that they form a sequence, by comparing two paths with the help of an adaptation of the Dynamic Time Warping algorithm (Rabiner et al., 1991). (Molla, 2006) presents an approach for the acquisition of question answering rules by applying graph manipulation methods. Questions are represented as dependency graphs, which are extended with information from answer sentences. These combined graphs can then be used to identify answers. Finally, in (Wang et al., 2007), a quasi-synchronous grammar (Smith and Eisner, 2006) is used to model the relations between questions and answer sentences.

In this paper we describe an algorithm that learns possible syntactic answer sentence formulations for syntactic question classes from a set of example question/answer sentence pairs. Unlike the related work described above, it acknowledges that a) a valid answer sentence's syntax might be very different from the question's syntax, and b) several valid answer sentence structures, which might be completely independent of each other, can exist for one and the same question.

To illustrate this, consider the question "When was Alaska purchased?" The following four sentences all answer the given question, but only the first sentence is a straightforward reformulation of the question:

1. The United States purchased Alaska in 1867 from Russia.

2. Alaska was bought from Russia in 1867.

3. In 1867, the Russian Empire sold the Alaska territory to the USA.

4. The acquisition of Alaska by the United States of America from Russia in 1867 is known as "Seward's Folly".

The remaining three sentences introduce various forms of syntactic and semantic transformations. In order to capture a wide range of possible ways in which answer sentences can be formulated, in our model a candidate sentence is not evaluated according to its similarity with the question. Instead, its similarity to known answer sentences (which were presented to the system during training) is evaluated. This allows us to capture a much wider range of syntactic and semantic transformations.

3 Overview of the Algorithm

Our algorithm uses input data containing pairs of the following:

NLQs/Questions NLQs that describe the users' information need. For the experiments carried out in this paper we use questions from the TREC QA track 2002-2006.

Relevant textual content This is a piece of text that is relevant to the user query in that it contains the information the user is searching for. In this paper, we use sentences extracted from the AQUAINT corpus (Graff, 2002) that contain the answer to the given TREC question.

In total, the data available to us for our experiments consists of 8,830 question/answer sentence pairs. This data is publicly available, see (Kaisser and Lowe, 2008). The algorithm described in this paper has three main steps:

Phrase alignment Key phrases from the question are paired with phrases from the answer sentences.

Pattern creation The dependency structures of queries and answer sentences are analyzed and patterns are extracted.

Pattern evaluation The patterns discovered in the last step are evaluated and a confidence score is assigned to each.

The acquired patterns can then be used during retrieval, where a question is matched against the antecedents describing the syntax of the question. Note that one question can potentially match several patterns. The consequents contain descriptions of grammatical structures of potential answer sentences that can be used to identify and evaluate candidate sentences.

Input:
(a) Query: "When was Alaska purchased?"
(b) Answer sentence: "The acquisition of Alaska happened in 1867."

Step 1: Question is segmented into key phrases and stop words:
When[1]+was[2]+NP[3]+VERB[4]

Step 2: Key question phrases are aligned with key answer sentence phrases:
[3]Alaska → Alaska
[4]purchased → acquisition
ANSWER → 1867

Step 3: A pre-computed parse tree of the answer sentence is loaded:
1: The (the, DT, 2) [det]
2: acquisition (acquisition, NN, 5) [nsubj]
3: of (of, IN, 2) [prep]
4: Alaska (Alaska, NNP, 3) [pobj]
5: happened (happen, VBD, null) [ROOT]
6: in (in, IN, 5) [prep]
7: 1867 (1867, CD, 6) [pobj]

Step 4: Dependency paths from key question phrases to the answer are computed:
Alaska⇒1867: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
acquisition⇒1867: ⇑nsubj⇓prep⇓pobj

Step 5: The resulting pattern is stored:
Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
Path 4: ⇑nsubj⇓prep⇓pobj

Figure 1: The pattern creation algorithm exemplified in five key steps for the query "When was Alaska purchased?" and the answer sentence "The acquisition of Alaska happened in 1867."

4 Phrase Alignment

The goal of this processing step is to align phrases from the question with corresponding phrases from the answer sentences in the training data. Consider the following example:

Query: "When was the Alaska territory purchased?"
Answer sentence: "The acquisition of what would become the territory of Alaska took place in 1867."

The mapping that has to be achieved is:

Query phrase         Answer sentence phrase
"Alaska territory"   "territory of Alaska"
"purchased"          "acquisition"
ANSWER               "1867"

In our approach, this is a two-step process. First we align on a word level, then the output of the word alignment process is used to identify and align phrases. Word alignment is important in many fields of NLP, e.g. Machine Translation (MT), where words in parallel, bilingual corpora need to be aligned; see (Och and Ney, 2003) for a comparison of various statistical alignment models. In our case, however, we are dealing with a monolingual alignment problem, which enables us to exploit clues not available for bilingual alignment: First of all, we can expect many query words to be present in the answer sentence, either with the exact same surface appearance or in some morphological variant. Secondly, there are tools available that tell us how semantically related two words are, most notably WordNet (Miller et al., 1993). For these reasons we implemented a bespoke alignment strategy, tailored towards our problem description.

This method is described in detail in (Kaisser, 2009). The processing steps described in the next sections build on its output. For reasons of brevity, we skip a detailed explanation in this paper and focus only on its key part: the alignment of words with very different surface structures. For more details we would like to point the reader to the aforementioned work.

In the above example, the alignment of "purchased" and "acquisition" is the most problematic, because the surface structures of the two words clearly are very different. For such cases we experimented with a number of alignment strategies based on WordNet. These approaches are similar in that each picks one word that has to be aligned from the question at a time and compares it to all of the non-stop words in the answer sentence. Each of the answer sentence words is assigned a value between zero and one expressing its relatedness to the question word. The highest-scoring word, if above a certain threshold, is selected as the closest semantic match. Most of these approaches make use of WordNet::Similarity, a Perl software package that measures semantic similarity (or relatedness) between a pair of word senses by returning a numeric value that represents the degree to which they are similar or related (Pedersen et al., 2004). Additionally, we developed a custom-built method that assumes that two words are semantically related if any kind of pointer exists between any occurrence of the words' root forms in WordNet. For details of these experiments, please refer to (Kaisser, 2009). In our experiments the custom-built method performed best, and it was therefore used for the experiments described in this paper. The main reasons for this are:

1. Many of the measures in the WordNet::Similarity package take only hyponym/hypernym relations into account. This makes aligning words of different parts of speech difficult or even impossible. However, such alignments are important for our needs.

2. Many of the measures return results even if only a weak semantic relationship exists. For our purposes, however, it is beneficial to take only strong semantic relations into account.

5 Pattern Creation

Figure 1 details our algorithm in its five key steps. In steps 1 and 2, key phrases from the question are aligned to the corresponding phrases in the answer sentence, see Section 4 of this paper. Step 3 is concerned with retrieving the parse tree for the answer sentence. In our implementation all answer sentences in the training set have, for performance reasons, been parsed beforehand with the Stanford Parser (Klein and Manning, 2003b; Klein and Manning, 2003a), so at this point they are simply loaded from file. Step 4 is the key step in our algorithm. From the previous steps, we know where the key constituents from the question as well as the answer are located in the answer sentence. This enables us to compute the dependency paths in the answer sentences' parse trees that connect the answer with the key constituents. In our example the answer is "1867" and the key constituents are "acquisition" and "Alaska." Knowing the syntactic relationships (captured by their dependency paths) between the answer and the key phrases enables us to capture one syntactic possibility of how answer sentences to queries of the form When+was+NP+VERB can be formulated.

As can be seen in Step 5, a flat syntactic question representation is stored, together with numbers assigned to each constituent. The numbers of those constituents for which alignments in the answer sentence were sought and found are listed together with the resulting dependency paths. Path 3, for example, denotes the path from constituent 3 (the NP "Alaska") to the answer. If no alignment could be found for a constituent, null is stored instead of a path. Should two or more alternative constituents be identified for one question constituent, additional patterns are created, so that each contains one of the possibilities. The described procedure is repeated for all question/answer sentence pairs in the training set, and for each, one or more patterns are created.

It is worth noting that many TREC questions are fairly short and grammatically simple. In our training data we find, for example, 102 questions matching the pattern When[1]+was[2]+NP[3]+VERB[4], which together list 382 answer sentences, and thus 382 potentially different answer sentence structures from which patterns can be gained. As a result, the amount of training examples we have available is sufficient to achieve the performance described in Section 7. The algorithm described in this paper can of course also be used for more complicated NLQs, although in such a scenario a significantly larger amount of training data would have to be used.

6 Pattern Evaluation

For each created pattern, at least one matching example must exist: the sentence that was used to create it in the first place. However, we do not know how precise each pattern is. To this end, an additional processing step between pattern creation and application is needed: pattern evaluation. Similar approaches to ours have been described in the relevant literature, many of them concerned with bootstrapping, starting with (Ravichandran and Hovy, 2002). The general purpose of this step is to use the available data about questions and their correct answers to evaluate how often each created pattern returns a correct or an incorrect result. This data is stored with each pattern, and the result of the equation, often called pattern precision, can be used during the retrieval stage. Pattern precision in our case is defined as:

p = (#correct + 1) / (#correct + #incorrect + 2)    (1)

We use Lucene to retrieve the top 100 paragraphs from the AQUAINT corpus by issuing a query that consists of the query's key words and all non-stop words in the answer. Then, all patterns are loaded whose antecedent matches the query that is currently being processed. After that, constituents from all sentences in the retrieved 100 paragraphs are aligned to the query's constituents in the same way as for the sentences during pattern creation, see Section 5. Now, the paths specified in these patterns are searched for in the paragraphs' parse trees. If they are all found, it is checked whether they all point to the same node and whether this node's surface structure is in some morphological form present in the answer strings associated with the question in our training data. If this is the case, a variable in the pattern named correct is increased by 1; otherwise the variable incorrect is increased by 1. After the evaluation process is finished, the final version of the pattern given as an example in Figure 1 is:

Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: ⇑pobj⇑prep⇑nsubj⇓prep⇓pobj
Path 4: ⇑nsubj⇓prep⇓pobj
Correct: 15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of the scores of all matching patterns p:

score(ac) = sum_{i=1..n} score(p_i)    (2)

where

score(p_i) = (correct_i + 1) / (correct_i + incorrect_i + 2)   if the pattern matches
           = 0                                                 otherwise    (3)

The highest-scoring candidate is selected.

We would like to explicitly call out one property of our algorithm: it effectively returns two entities: a) a sentence that constitutes a valid response to the query, and b) the head node of a phrase in that sentence that constitutes the answer. Therefore the algorithm can be used for sentence retrieval or for answer retrieval. It depends on the application which of the two behaviors is desired. In the next section, we evaluate its answer retrieval performance.

7 Experiments & Results

This section provides an evaluation of the algorithm described in this paper. The key questions we seek to answer are the following:

1. How does our method perform when compared to a baseline that extracts dependency paths from the question?

2. How much does the described algorithm improve the performance of a state-of-the-art QA system?

3. What is the effect of training data size on performance? Can we expect that more training data would further improve the algorithm's performance?

7.1 Evaluation Setup

We use all factoid questions in TREC's QA test sets from 2002 to 2006 for evaluation for which a known answer exists in the AQUAINT corpus. Additionally, the data in (Lin and Katz, 2005) is used. In this paper the authors attempt to identify a much more complete set of relevant documents for a subset of TREC 2002 questions than TREC itself. We adopt a cross-validation approach for our evaluation. Table 4 shows how the data is split into five folds.

Fold  Training sets used           #      Test set  #
1     T03, T04, T05, T06           4565   T02       1159
2     T02, T04, T05, T06, Lin02    6174   T03       1352
3     T02, T03, T05, T06, Lin02    6700   T04       826
4     T02, T03, T04, T06, Lin02    6298   T05       1228
5     T02, T03, T04, T05, Lin02    6367   T06       1159

Table 4: Splits into training and test sets of the data used for evaluation. T02 stands for TREC 2002 data etc. Lin02 is based on (Lin and Katz, 2005). The # columns show how many question/answer sentence pairs are used for training and for testing.

In order to evaluate the algorithm's patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture, see e.g. (Prager, 2006; Voorhees, 2003), the document or passage retrieval step performs this function. This step is crucial to a QA system's performance, because it is impossible to locate answers in the subsequent answer extraction step if the passages returned during passage retrieval do not contain the answer in the first place. This also holds true in our case: the patterns cannot be expected to identify a correct answer if none of the sentences used as input contains the correct answer. We therefore use two different evaluation sets to evaluate our algorithm:

1. The first set contains for each question all sentences in the top 100 paragraphs returned by Lucene when using simple queries made up from the question's key words. It cannot be guaranteed that answers to every question are present in this test set.

2. For the second set, the query additionally lists all known correct answers to the question as parts of one OR operator. This increases the chance that the evaluation set actually contains valid answer sentences significantly.

In order to provide a quantitative characterization of the two evaluation sets, we estimated the number of correct answer sentences they contain. For each paragraph it was determined whether it contained one of the known answer strings and at least one of the question key words. Tables 2 and 3 show for each evaluation set how many answers on average it contains per question. The column "= 0" for example shows the fraction of questions for which no valid answer sentence is contained in the evaluation set, while column ">= 90" gives the fraction of questions with 90 or more valid answer sentences. The last two columns show mean and median values.

Test            Number of Correct Answer Sentences                              Mean   Med
set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100
2002  0.203  0.396  0.580  0.671  0.809  0.935  0.984  0.0    0.0    0.0      6.86   2.0
2003  0.249  0.429  0.627  0.732  0.828  0.955  0.997  0.003  0.003  0.0      5.67   2.0
2004  0.221  0.368  0.539  0.637  0.799  0.936  0.985  0.0    0.0    0.0      6.51   3.0
2005  0.245  0.404  0.574  0.665  0.777  0.912  0.987  0.0    0.0    0.0      7.56   2.0
2006  0.241  0.389  0.568  0.665  0.807  0.920  0.966  0.006  0.0    0.0      8.04   3.0

Table 2: Fraction of sentences that contain correct answers in Evaluation Set 1 (approximation).

Test            Number of Correct Answer Sentences                              Mean   Med
set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100
2002  0.0    0.074  0.158  0.235  0.342  0.561  0.748  0.172  0.116  0.060    33.46  21.0
2003  0.0    0.099  0.203  0.254  0.356  0.573  0.720  0.161  0.090  0.031    32.88  19.0
2004  0.0    0.073  0.137  0.211  0.328  0.598  0.779  0.142  0.069  0.034    30.82  20.0
2005  0.0    0.163  0.238  0.279  0.410  0.589  0.759  0.141  0.097  0.069    30.87  17.0
2006  0.0    0.125  0.207  0.281  0.415  0.596  0.727  0.173  0.122  0.088    32.93  17.5

Table 3: Fraction of sentences that contain correct answers in Evaluation Set 2 (approximation).

7.2 Comparison with Baseline

As pointed out in Section 2, there is a strong tradition of using dependency paths in QA. Many relevant papers describe algorithms that analyze a question's grammatical structure and expect to find a similar structure in valid answer sentences, e.g. (Attardi et al., 2001), (Cui et al., 2005) or (Bouma et al., 2005), to name just a few. As already pointed out, a major contribution of our work is that we do not assume this similarity. In our approach valid answer sentences are allowed to have grammatical structures that are very different from the question and also very different from each other. Thus it is natural to compare our approach against a baseline that compares candidate sentences not against patterns that were gained from question/answer sentence pairs, but from questions alone. In order to create these patterns, we use a small trick: during the pattern creation step, see Section 5 and Figure 1, we replace the answer sentences in the input file with the questions, and assume that the question word indicates the position where the answer should be located.

Tables 5 and 6 show how our algorithm performs on evaluation sets 1 and 2, respectively. Tables 7 and 8 show how the baseline performs on evaluation sets 1 and 2, respectively. The tables' columns list the year of the TREC test set used, the number of questions in the set (we only use questions for which we know that there is an answer in the corpus), the number of questions for which one or more patterns exist, how often at least one pattern returned the correct answer, how often we get an overall correct result by taking all patterns and their confidence values into account, accuracy@1 of the overall system, and accuracy@1 computed only for those questions for which we have at least one pattern available (for all other questions the system returns no result). As can be seen, on evaluation set 1 our method outperforms the baseline by 300%, on evaluation set 2 by 311%, taking accuracy if a pattern exists as a basis.

Test  Q       Qs with   >= 1     Overall  Accuracy  Acc. if
set   number  patterns  correct  correct  overall   pattern
2002  429     321       147      50       0.117     0.156
2003  354     237       76       22       0.062     0.093
2004  204     142       74       26       0.127     0.183
2005  319     214       97       46       0.144     0.215
2006  352     208       85       31       0.088     0.149
Sum   1658    1122      452      176      0.106     0.156

Table 5: Performance based on evaluation set 1.

Test  Q       Qs with   >= 1     Overall  Accuracy  Acc. if
set   number  patterns  correct  correct  overall   pattern
2002  429     321       239      133      0.310     0.414
2003  354     237       149      88       0.248     0.371
2004  204     142       119      65       0.319     0.458
2005  319     214       161      92       0.288     0.429
2006  352     208       139      84       0.238     0.403
Sum   1658    1122      807      462      0.278     0.411

Table 6: Performance based on evaluation set 2.

Test  Q       Qs with   >= 1     Overall  Accuracy  Acc. if
set   number  patterns  correct  correct  overall   pattern
2002  429     321       43       14       0.033     0.044
2003  354     237       28       10       0.028     0.042
2004  204     142       19       6        0.029     0.042
2005  319     214       21       7        0.022     0.033
2006  352     208       20       7        0.020     0.034
Sum   1658    1122      131      44       0.027     0.039

Table 7: Baseline performance based on evaluation set 1.

Test  Q       Qs with   >= 1     Overall  Accuracy  Acc. if
set   number  patterns  correct  correct  overall   pattern
2002  429     321       77       37       0.086     0.115
2003  354     237       39       26       0.073     0.120
2004  204     142       25       15       0.074     0.073
2005  319     214       38       18       0.056     0.084
2006  352     208       34       16       0.045     0.077
Sum   1658    1122      213      112      0.068     0.100

Table 8: Baseline performance based on evaluation set 2.

Many of the papers cited earlier that use an approach similar to our baseline of course report much better results than Tables 7 and 8. This, however, is not too surprising, as the approach described in this paper and the baseline approach do not make use of many techniques commonly used to increase the performance of a QA system, e.g. TF-IDF fallback strategies, fuzzy matching, manual reformulation patterns etc. It was a deliberate decision from our side not to use any of these approaches. After all, this would result in an experimental setup where the performance of our answer extraction strategy could not have been observed in isolation. The QA system used as a baseline in the next section makes use of many of these techniques, and we will see that our method, as described here, is suitable to increase its performance significantly.

7.3 Impact on an existing QA System

Tables 9 and 10 show how our algorithm increases the performance of our QuALiM system, see e.g. (Kaisser et al., 2006). Section 6 in this paper describes via formulas 2 and 3 how answer candidates are ranked. This ranking is combined with the existing QA system's candidate ranking by simply using it as an additional feature that boosts candidates proportionally to their confidence score. The difference between both tables is that the first uses all 1658 questions in our test sets for the evaluation, whereas the second considers only those 1122 questions for which our system was able to learn a pattern. Thus for Table 10, questions which the system had no chance of answering due to limited training data are omitted. As can be seen, accuracy@1 increases by 4.9% on the complete test set and by 11.5% on the partial set.

Note that the QA system used as a baseline is at an advantage in at least two respects: a) It has important web-based components and as such has access to a much larger body of textual information. b) The algorithm described in this paper is an answer extraction approach only. For paragraph retrieval we use the same approach as for evaluation set 1, see Section 7.1. However, in more than 20% of the cases, this method returns not a single paragraph that contains both the answer and at least one question keyword. In such cases, the simple paragraph retrieval makes it close to impossible for our algorithm to return the correct answer.
Test Set QuALiM QASP combined increase 2002 0.503 0.117 0.524 4.2% 2003 0.367 0.062 0.390 6.2% 2004 0.426 0.127 0.451 5.7% 2005 0.373 0.144 0.389 4.2% 2006 0.341 0.088 0.358 5.0% 02-06 0.405 0.106 0.425 4.9% Table 9: Top-1 accuracy of the QuALiM system on its own and when combined with the algorithm described Figure 2: Effect of the amount of training data on sys- in this paper. All increases are statistically significant tem performance using a sign test (p < 0.05). 8 Conclusions Test Set QuALiM QASP combined increase 2002 0.530 0.156 0.595 12.3% 2003 2004 0.380 0.465 0.093 0.183 0.430 0.514 13.3% 10.6% In this paper we present an algorithm that acquires 2005 0.388 0.214 0.421 8.4% syntactic information about how relevant textual 2006 0.385 0.149 0.428 11.3% 02-06 0.436 0.157 0.486 11.5% content to a question can be formulated from a collection of paired questions and answer sen- Table 10: Top-1 accuracy of the QuALiM system on its own and when combined with the algorithm de- tences. Other than previous work employing de- scribed in this paper, when only considering questions pendency paths for QA, our approach does not as- for which a pattern could be acquired from the training sume that a valid answer sentence is similar to the data. All increases are statistically significant using a question and it allows many potentially very dif- sign test (p < 0.05). ferent syntactic answer sentence structures. The algorithm is evaluated using TREC data, and it is shown that it outperforms an algorithm that merely uses the syntactic information contained 7.4 Effect of Training Data Size in the question itself by 300%. It is also shown that the algorithm improves the performance of a We now assess the effect of training data size on state-of-the-art QA system significantly. performance. Tables 5 and 6 presented earlier show that an average of 32.2% of the questions As always, there are many ways how we could have no matching patterns. 
This is because the imagine our algorithm to be improved. Combin- data used for training contained no examples for a ing it with fuzzy matching techniques as in (Cui et significant subset of question classes. It can be ex- al., 2004) or (Cui et al., 2005) is an obvious direc- pected that, if more training data would be avail- tion for future work. We are also aware that in or- able, this percentage would decrease and perfor- der to apply our algorithm on a larger scale and in mance would increase. In order to test this as- a real world setting with real users, we would need sumption, we repeated the evaluation procedure a much larger set of training data. These could detailed in this section several times, initially us- be acquired semi-manually, for example by using ing data from only one TREC test set for train- crowd-sourcing techniques. We are also thinking ing and then gradually adding more sets until all about fully automated approaches, or about us- available training data had been used. The results ing indirect human evidence, e.g. user clicks in for evaluation set 2 are presented in Figure 2. As search engine logs. Typically users only see the can be seen, every time more data is added, per- title and a short abstract of the document when formance increases. This strongly suggests that clicking on a result, so it is possible to imagine a the point of diminishing returns, when adding ad- scenario where a subset of these abstracts, paired ditional training data no longer improves perfor- with user queries, could serve as training data. mance is not yet reached. 96 References Dekang Lin and Patrick Pantel. 2001. Discovery of Inference Rules for Question-Answering. Natural Giuseppe Attardi, Antonio Cisternino, Francesco Language Engineering, 7(4):343–360. Formica, Maria Simi, and Alessandro Tommasi. 2001. PIQASso: Pisa Question Answering System. Dekang Lin. 1998. Dependency-based Evaluation of In Proceedings of the 2001 Edition of the Text RE- MINIPAR. 
In Workshop on the Evaluation of Pars- trieval Conference (TREC-01). ing Systems. Gosse Bouma, Jori Mur, and Gertjan van Noord. 2005. George A. Miller, Richard Beckwith, Christiane Fell- Reasoning over Dependency Relations for QA. In baum, Derek Gross, and Katherine Miller. 1993. Proceedings of the IJCAI workshop on Knowledge Introduction to WordNet: An On-Line Lexical and Reasoning for Answering Questions (KRAQ- Database. Journal of Lexicography, 3(4):235–244. 05). Diego Molla. 2006. Learning of Graph-based Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Question Answering Rules. In Proceedings of Ma. 2002. Probabilistic query expansion using HLT/NAACL 2006 Workshop on Graph Algorithms query logs. In 11th International World Wide Web for Natural Language Processing. Conference (WWW-02). Franz Josef Och and Hermann Ney. 2003. A System- Hang Cui, Keya Li, Renxu Sun, Tat-Seng Chua, and atic Comparison of Various Statistical Alignment Min-Yen Kan. 2004. National University of Sin- Models. Computational Linguistics, 29(1):19–52. gapore at the TREC-13 Question Answering Main Ted Pedersen, Siddharth Patwardhan, and Jason Task. In Proceedings of the 2004 Edition of the Text Michelizzi. 2004. WordNet::Similarity - Measur- REtrieval Conference (TREC-04). ing the Relatedness of Concepts. In Proceedings Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and of the Nineteenth National Conference on Artificial Tat-Seng Chua. 2005. Question Answering Pas- Intelligence (AAAI-04). sage Retrieval Using Dependency Relations. In John Prager. 2006. Open-Domain Question- Proceedings of the 28th ACM-SIGIR International Answering. Foundations and Trends in Information Conference on Research and Development in Infor- Retrieval, 1(2). mation Retrieval (SIGIR-05). L. R. Rabiner, A. E. Rosenberg, and S. E. Levin- Scott Deerwester, Susan Dumais, George Furnas, son. 1991. Considerations in Dynamic Time Warp- Thomas Landauer, and Richard Harshman. 1900. ing Algorithms for Discrete Word Recognition. 
In Indexing by Latent Semantic Analysis. Journal of Proceedings of IEEE Transactions on Acoustics, the American society for information science, 41(6). Speech and Signal Processing. David Graff. 2002. The AQUAINT Corpus of English Deepak Ravichandran and Eduard Hovy. 2002. News Text. Learning Surface Text Patterns for a Question An- Michael Kaisser and John Lowe. 2008. Creating a swering System. In Proceedings of the 40th Annual Research Collection of Question Answer Sentence Meeting of the Association for Computational Lin- Pairs with Amazon’s Mechanical Turk. In Proceed- guistics (ACL-02). ings of the Sixth International Conference on Lan- Stefan Riezler and Yi Liu. 2010. Query Rewriting guage Resources and Evaluation (LREC-08). using Monolingual Statistical Machine Translation. Michael Kaisser, Silke Scheible, and Bonnie Webber. Computational Linguistics, 36(3). 2006. Experiments at the University of Edinburgh for the TREC 2006 QA track. In Proceedings of Dan Shen and Dietrich Klakow. 2006. Exploring Cor- the 2006 Edition of the Text REtrieval Conference relation of Dependency Relation Paths for Answer (TREC-06). Extraction. In Proceedings of the 21st International Michael Kaisser. 2009. Acquiring Syntactic and Conference on Computational Linguistics and 44th Semantic Transformations in Question Answering. Annual Meeting of the ACL (COLING/ACL-06). Ph.D. thesis, University of Edinburgh. David A. Smith and Jason Eisner. 2006. Quasisyn- Dan Klein and Christopher D. Manning. 2003a. Ac- chronous grammars: Alignment by Soft Projec- curate Unlexicalized Parsing. In Proceedings of the tion of Syntactic Dependencies. In Proceedings of 41st Meeting of the Association for Computational the HLTNAACL Workshop on Statistical Machine Linguistics (ACL-03). Translation. Dan Klein and Christopher D. Manning. 2003b. Fast Ellen M. Voorhees. 1999. Overview of the Eighth Exact Inference with a Factored Model for Natural Text REtrieval Conference (TREC-8). In Pro- Language Parsing. 
In Advances in Neural Informa- ceedings of the Eighth Text REtrieval Conference tion Processing Systems 15. (TREC-8). Jimmy Lin and Boris Katz. 2005. Building a Reusable Ellen M. Voorhees. 2003. Overview of the TREC Test Collection for Question Answering. Journal of 2003 Question Answering Track. In Proceedings of the American Society for Information Science and the 2003 Edition of the Text REtrieval Conference Technology (JASIST). (TREC-03). 97 Mengqiu Wang, Noah A. Smith, and Teruko Mita- mura. 2007. What is the Jeopardy model? A Qua- sisynchronous Grammar for QA. In Proceedings of EMNLP-CoNLL 2007. 98 Can Click Patterns across User’s Query Logs Predict Answers to Definition Questions? Alejandro Figueroa Yahoo! Research Latin America Blanco Encalada 2120, Santiago, Chile

[email protected]

Abstract

In this paper, we examine click patterns produced by users of the Yahoo! search engine when prompting definition questions. Regularities across these click patterns are then utilized for constructing a large and heterogeneous training corpus for answer ranking. In a nutshell, answers are extracted from clicked web-snippets originating from any class of web-site, including Knowledge Bases (KBs), while non-answers are acquired from redundant pieces of text across web-snippets.

The effectiveness of this corpus was assessed by training two state-of-the-art models, with which answers to unseen queries were distinguished. These testing queries were also submitted by search engine users, and their answer candidates were taken from their respective returned web-snippets. This corpus helped both techniques to finish with an accuracy higher than 70%, and to predict over 85% of the answers clicked by users. In particular, our results underline the importance of non-KB training data.

1 Introduction

It is a well-known fact that definition queries are very popular among users of commercial search engines (Rose and Levinson, 2004). The essential characteristic of definition questions is their aim of discovering as much descriptive information as possible about the concept being defined (a.k.a. definiendum, pl. definienda). Examples of this kind of query include "Who is Benjamin Millepied?" and "Tell me about Bank of America".

It is standard practice for definition question answering (QA) systems to mine KBs (e.g., online encyclopedias and dictionaries) for reliable descriptive information on the definiendum (Sacaleanu et al., 2008). Normally, these pieces of information (i.e., nuggets) explain different facets of the definiendum (e.g., "ballet choreographer" and "born in Bordeaux"), and the main idea consists in projecting the acquired nuggets onto the set of answer candidates afterwards. However, the performance of this category of method falls into sharp decline whenever little or no coverage is found across KBs (Zhang et al., 2005; Han et al., 2006). Put differently, this technique usually succeeds in discovering the most relevant facts about the most prominent sense of the definiendum, but it often misses many pertinent nuggets, especially those that can be paraphrased in several ways and/or those regarding ancillary senses of the definiendum, which are hardly found in KBs.

As a means of dealing with this, current strategies try to construct general definition models inferred from a collection of definitions coming from the Internet or KBs (Androutsopoulos and Galanis, 2005; Xu et al., 2005; Han et al., 2006). To a great extent, models exploiting non-KB sources demand considerable annotation effort, or, when the data is obtained automatically, they rely on empirical thresholds that ensure a certain degree of similarity to an array of KB articles. These thresholds attempt to trade off the cleanness of the training material against its coverage. Moreover, gathering negative samples is also hard, as it is not easy to find wide-coverage authoritative sources of non-descriptive information about a particular definiendum.

Our approach has several innovative aspects compared to other research in the area of definition extraction. It lies at the crossroads of query log analysis and QA systems. We study the click behavior of search engine users with regard to definition questions. Based on this study, we propose a novel way of acquiring large-scale and heterogeneous training material for this task, which consists of:

• automatically obtaining positive samples in accordance with the click patterns of search engine users. This aids in harvesting a host of descriptions from non-KB sources in conjunction with descriptive information from KBs.

• automatically acquiring negative data in consonance with redundancy patterns across snippets displayed within search engine results when processing definition queries.

In brief, our experiments reveal that these patterns can be effectively exploited for devising efficient models.

Given the huge amount of amassed data, we additionally contrast the performance of systems built on top of samples originating solely from KBs, solely from non-KBs, and from both combined. Our comparison corroborates that KBs yield massive amounts of trustworthy descriptive knowledge, but that they do not bear enough diversity to discriminate all answering nuggets within any kind of text. Essentially, our experiments unveil that non-KB data is richer and therefore useful for discovering more descriptive nuggets than KB material, but its usage relies on its cleanness and on a negative set. Many people have had these intuitions before, but to the best of our knowledge, we provide the first empirical confirmation and quantification.

The road-map of this paper is as follows: section 2 touches on related work; section 3 digs deeper into click patterns for definition questions; subsequently, section 4 explains our corpus construction strategy; section 5 describes our experiments; and section 6 draws final conclusions.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 99–108, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Related Work

In recent years, definition QA systems have shown a trend towards the utilization of several discriminant and statistical learning techniques (Androutsopoulos and Galanis, 2005; Chen et al., 2006; Han et al., 2006; Fahmi and Bouma, 2006; Katz et al., 2007; Westerhout, 2009; Navigli and Velardi, 2010). Due to training, there is a pressing need for large-scale authoritative sources of descriptive and non-descriptive nuggets. In the same manner, there is a growing importance of strategies capable of extracting trustworthy negative/positive samples from any type of text. Conventionally, these methods interpret descriptions as positive examples, and contexts providing non-descriptive information as negative elements. Four representative techniques are:

• The centroid vector (Xu et al., 2003; Cui et al., 2004) collects an array of articles about the definiendum from a battery of pre-determined KBs. These articles are then used to learn a vector of word frequencies, with which answer candidates are rated afterwards. Sometimes web-snippets together with a query reformulation method are exploited instead of pre-defined KBs (Chen et al., 2006).

• (Androutsopoulos and Galanis, 2005) gathered articles from KBs to score 250-character windows carrying the definiendum. These windows were taken from the Internet, and accordingly, highly similar windows were interpreted as positive examples, while highly dissimilar ones were taken as negative samples. For this purpose, two thresholds are used, which ensure the trustworthiness of both sets. However, they also cause the sets to be less diverse, as not all definienda are widely covered across KBs. Indeed, many facets outlined within the 250-character windows will not be detected.

• (Xu et al., 2005) manually labeled samples taken from an intranet. Manual annotation is constrained to a small number of examples, because it requires substantial human effort to tag a large corpus, and disagreements between annotators are not uncommon.

• (Figueroa and Atkinson, 2009) capitalized on abstracts supplied by Wikipedia for building language models (LMs), so there was no need for a negative set.

Our contribution is a novel technique for obtaining heterogeneous training material for definitional QA, that is to say, massive examples harvested from KBs and non-KBs. Fundamentally, positive examples are extracted from web snippets grounded on the click patterns of users of a search engine, whereas the negative collection is acquired via redundancy patterns across web-snippets displayed to the user by the search engine. This data is capitalized on by two state-of-the-art definition extractors, which are different in nature. In addition, our paper discusses the effect on performance of different sorts (KB and non-KB) and amounts of training data.

As for user clicks, they provide valuable relevance feedback for a variety of tasks, cf. (Radlinski et al., 2010). For instance, (Ji et al., 2009) extracted relevance information from clicked and non-clicked documents within aggregated search sessions. They modelled sequences of clicks as a means of learning to globally rank the relative relevance of all documents with respect to a given query. (Xu et al., 2010) improved the quality of training material for learning-to-rank approaches by predicting labels using clickthrough data. In our work, we combine click patterns across Yahoo! search query logs with QA techniques to build one-sided and two-sided classifiers for recognizing answers to definition questions.

3 User Click Analysis for Definition QA

In this section, we examine a collection of queries submitted to the Yahoo! search engine during the period from December 2010 to March 2011. More specifically, for this analysis, we considered a log encompassing a random sample of 69,845,262 (23,360,089 distinct) queries. Basically, this log comprises the query sent by the user in conjunction with the displayed URLs and information about the sequence of their clicks. In the first place, we associate each query with a category in the taxonomy proposed by (Rose and Levinson, 2004), and in this way definition queries are selected. Secondly, we investigate user click patterns observed across these filtered definition questions.

3.1 Finding Definition Queries

According to (Broder, 2002; Lee et al., 2005; Dupret and Piwowarski, 2008), the intention of the user falls into at least two categories: navigational (e.g., "google") and informational (e.g., "maximum entropy models"). The former entails the desire of going to a specific site that the user has in mind, and the latter regards the goal of learning something by reading or viewing some content (Rose and Levinson, 2004). Navigational queries are hence of less relevance to definition questions, and for this reason, they were removed in accordance with the following three criteria:

• (Lee et al., 2005) pointed out that, when prompting navigational queries, users will only visit the web site they bear in mind. Thus, these queries are characterized by clicking the same URL almost all the time (Lee et al., 2005). More precisely, we discarded queries that: a) appear more than four times in the query log; and, at the same time, b) whose most clicked URL represents more than 98% of all their clicks. Following the same idea, we additionally eliminated prompted URLs and queries where the clicked URL is of the form "www.search-query-without-spaces".

• By the same token, queries containing keywords such as "homepage", "on-line", and "sign in" were also removed.

• After the previous steps, many navigational queries (e.g., "facebook") still remained in the query log. We noticed that a substantial portion was signaled by several frequently and indistinctly clicked URLs. Take for instance "facebook": "www.facebook.com" and "www.facebook.com/login.php". With this in mind, we discarded entries contained in a manually compiled black list. This list contains the 600 most frequent cases.

A third category in (Rose and Levinson, 2004) regards resource queries, which we distinguished via keywords like "image", "lyrics" and "maps". Altogether, 24,916,610 (35.67%; 3,576,817 distinct) queries were seen as navigational or resource queries. Note that in (Rose and Levinson, 2004) both classes together encompassed between 37% and 38% of their query set.

Subsequently, we made use of the remaining 44,928,652 (informational) entries for detecting queries where the intention of the user is finding descriptive information about a topic (i.e., a definiendum). In the taxonomy delineated by (Rose and Levinson, 2004), informational queries are sub-categorized into five groups including list, locate, and definitional (directed and undirected). In practice, we filtered definition questions as follows:

1. We exploited an array of expressions that are commonly utilized in query analysis for classifying definition questions (Figueroa, 2010), e.g., "Who is/was...", "What is/was a/an...", "define...", and "describe...". Overall, these rules assisted in selecting 332,227 entries.

2. As stated in (Dupret and Piwowarski, 2008), informational queries are typified by the user clicking several documents. In light of that, we say that some definitional queries are characterized by multiple clicks, where at least one belongs to a KB. This aids in capturing the intention of the user when looking for descriptive knowledge while only entering noun phrases like "thoracic outlet syndrome". To this end, we manually compiled a list of 36 frequently clicked KB hosts (e.g., Wikipedia and the Britannica encyclopedia). This filter produced 567,986 queries.

www.medicinenet.com
en.wikipedia.org
health.yahoo.net
www.livestrong.com
health.yahoo.net
en.wikipedia.org
www.medicinenet.com
www.mayoclinic.com
en.wikipedia.org
www.nismat.org
en.wikipedia.org

Table 1: Four distinct sequences of hosts clicked by users given the search query: "thoracic outlet syndrome".

Unfortunately, since query logs stored by search engines are not publicly available due to privacy and legal concerns, there is no accessible training material for building models on top of annotated data. Thus, we exploited the aforementioned hand-crafted rules to connect queries to their respective category in this taxonomy.

3.2 User Click Patterns

In substance, the first filter recognizes the intention of the user by means of the formulation given by the user (e.g., "What is a/the/an..."). With regard to this filter, some interesting observations are as follows:

• In 40.27% of the entries, users did not visit any of the displayed web-sites. Consequently, we conclude that the information conveyed within the multiple snippets was often enough to answer the respective definition question. In other words, a significant fraction of the users were satisfied with a small set of brief, but quickly generated descriptions.

• In 2.18% of these cases, the search engine returned no results, and a few times users tried another paraphrase or query, due to useless results or misspellings.

• We also noticed that definition questions matched by these expressions are seldom related to more than one click, although informational queries generally produce several clicks. In 46.44% of the cases, the user clicked a sole document, and, more surprisingly, we observed that users are likely to click sources different from KBs, in contrast to the widespread belief in definition QA research. Users pick hits originating from small but domain-specific web-sites as a result of at least two effects: a) they are looking for minor or ancillary senses of the definiendum (e.g., "ETA" in "www.travel-industry-dictionary.com"); and, more pertinently, b) the user does not trust the information yielded by KBs and chooses more authoritative resources, for instance, when looking for reliable medical information (e.g., "What is hypothyroidism?" and "What is mrsa infection?").

While the first filter infers the intention of the user from the query itself, the second deduces it from the origin of the clicked documents. With regard to this second filter, clicking patterns are more disperse. Here, the first two clicks normally correspond to the top two or three ranked hits returned by the search engine, see also (Ji et al., 2009). Also, sequences of clicks signal that users normally visit only one site belonging to a KB, and at least one coming from a non-KB (see Table 1).

All in all, the insight gained in this analysis allows the construction of a heterogeneous corpus for definition question answering. Put differently, these user click patterns offer a way to obtain huge amounts of heterogeneous training material. In this way, the heavy dependence of open-domain description identifiers on KB data can be alleviated.

4 Click-Based Corpus Acquisition

Since the queries obtained by the previous two filters are not associated with the actual snippets seen by the users (due to storage limitations), snippets were recovered by re-submitting the queries to the Yahoo! search engine.

After retrieval, we used OpenNLP (http://opennlp.sourceforge.net) for detecting sentence boundaries, tokenization and part-of-speech (POS) tagging. Here, we additionally interpreted truncations ("...") as sentence delimiters. POS tags were used to recognize and replace numbers with a placeholder (#CD#) as a means of creating sentence templates. We only modified numbers, as their value is just as often confusing as useful (Baeza-Yates and Ribeiro-Neto, 1999). Along with numbers, sequences of full and partial matches of the definiendum were also substituted with placeholders, "#Q#" and "#QT#", respectively. To exemplify, consider this pre-processed snippet regarding "Benjamin Millepied" from "www.mashceleb.com":

#Q# / News & Biography - MashCeleb
Latest news coverage of #Q#
#Q# ( born #CD# ) is a principal dancer at New York City Ballet and a ballet choreographer...

We use these templates to build both a positive and a negative training set.

4.1 Negative Set

The negative set comprised templates appearing across all (clicked and unclicked) web-snippets which, at the same time, are related to more than five distinct queries. We hypothesize that these prominent elements correspond to non-informative, and thus non-descriptive, content, as they appear within snippets across several questions. In other words: "If it seems to answer every question, it will probably answer no question". Take for instance:

Information about #Q# in the Columbia Encyclopedia , Computer Desktop Encyclopedia , computing dictionary

Conversely, templates that are more plausible answers are strongly related to their specific definition questions; consequently, they are low in frequency and unlikely to be in the result set of a large number of queries. This negative set was expanded with templates coming from the titles of snippets which, at the same time, have a frequency higher than four across all snippets (independently of which queries they appear in). This process yielded 1,021,571 different negative examples. In order to measure the precision of this process, we randomly selected and checked 1,000 elements, and we found an error rate of 1.3%.

4.2 Positive Set

As for the positive set, this was constructed only from the summary sections of web-snippets clicked by the users. We constrained these snippets to bear a title template associated with at least two web-snippets clicked for two distinct queries. Some good examples are:

What is #Q# ? Choices and Consequences.
Biology question : What is an #Q# ?

Since clicks are linked with entire snippets, it is uncertain which sentences are genuine descriptions (see the previous example). Therefore, we removed those templates already contained in the negative set, along with those samples that matched an array of well-known hand-crafted rules. This set included:

a. sentences containing words such as "ask", "report", "say", and "unless" (Kil et al., 2005; Schlaefer et al., 2007);

b. sentences bearing several named entities (Schlaefer et al., 2006; Schlaefer et al., 2007), which were recognized by the number of tokens starting with a capital letter versus those starting with a lowercase letter;

c. statements of persons (Schlaefer et al., 2007); and

d. about five hundred common expressions across web snippets, including "Picture of", "Jump to : navigation , search", and "Recent posts".

This process assisted in acquiring 881,726 different examples, of which 673,548 came from KBs. Here, we also randomly selected 1,000 instances and manually checked whether they were actual descriptions. The error rate of this set was 12.2%.

To put things into perspective, in contrast to other corpus acquisition approaches, the present method generated more than 1,800,000 positive and negative training samples combined, while the open-domain strategy of (Miliaraki and Androutsopoulos, 2004; Androutsopoulos and Galanis, 2005) generated ca. 20,000 examples, the closed-domain technique of (Xu et al., 2005) about 3,000, and (Fahmi and Bouma, 2006) ca. 2,000.

5 Answering New Definition Queries

In our experiments, we checked the effectiveness of our user-click-based corpus acquisition technique by studying its impact on two state-of-the-art systems. The first one is based on the bi-term LMs proposed by (Chen et al., 2006). This system requires only positive samples as training material. Conversely, our second system capitalizes on both positive and negative examples, and is based on the Maximum Entropy (ME) models (http://maxent.sourceforge.net/about.html) presented by (Fahmi and Bouma, 2006). These ME models amalgamated bigrams and unigrams as well as two additional syntactic features, which were not applicable to our task (i.e., sentence position). We added sentence length to this model as a feature in order to homologate the attributes used by both systems, therefore offering a good framework for assessing the impact of our negative set. Note that (Fahmi and Bouma, 2006), unlike us, applied their models only to sentences observing some specific syntactic patterns.

With regard to the test set, this was constructed by manually annotating 113,184 sentence templates corresponding to 3,162 unseen definienda. In total, this array of unseen testing instances encompassed 11,566 different positive samples. In order to build a balanced testing collection, the same number of negative examples was randomly selected. Overall, our testing set contains 23,132 elements, and some illustrative annotations are shown in Table 2. It is worth highlighting that these examples signal that our models consider pattern-free descriptions; that is to say, unlike other systems (Xu et al., 2003; Katz et al., 2004; Fernandes, 2004; Feng et al., 2006; Figueroa and Atkinson, 2009; Westerhout, 2009), which consider definitions aligning with an array of well-known patterns (e.g., "is a" and "also known as"), our models disregard any class of syntactic constraint.

As a baseline system, we used the centroid vector (Xu et al., 2003; Cui et al., 2004). Our implementation followed the blueprint in (Chen et al., 2006), and it was built for each definiendum from a maximum of 330 web snippets fetched by means of Bing Search. This baseline achieved a modest performance, correctly classifying 43.75% of the testing examples. In detail, 47.75% out of the 56.25% of misclassified elements were a result of data sparseness. This baseline has been widely used as a starting point for comparison purposes; however, it is hard for this technique to discover diverse descriptive nuggets. This problem stems from the narrow coverage of the centroid vector learned for the respective definienda (Zhang et al., 2005). In short, these figures support the necessity for more robust methods based on massive training material.

Experiments. We trained both models by systematically increasing the size of the training material in steps of 1%. For this, we randomly split the training data into 100 equally sized packs, and systematically added one to the previously selected sets (i.e., 1%, 2%, 3%, ..., 99%, 100%). We also experimented with: 1) positive examples originating solely from KBs; 2) positive samples harvested only from non-KBs; and, eventually, 3) all positive examples combined.

Figure 1 juxtaposes the outcomes achieved by both techniques under the different configurations. These figures, compared with the results obtained by the baseline, indicate the important contribution of our corpus to tackling data sparseness. This contrast substantiates our claim that click patterns can be utilized as indicators of answers to definition questions. Since our models ignore definition patterns, they have the potential to detect a wide diversity of descriptive information.

Further, the improvement of about 9%-10% by

Label  Example/Template
+      Propylene #Q# is a type of alcohol made from fermented yeast and carbohydrates and is commonly used in a wide variety of products .
+ #Q# is aggressive behavior intended to achieve a goal . + In Hispanic culture , when a girl turns #CD# , a celebration is held called the #Q#, symbolizing the girl ’s passage to womanhood . + Kirschwasser , German for ” cherry water ” and often shortened to #Q# in English-speaking countries , is a colorless brandy made from black ... + From the Gaelic ’dubhglas ’ meaning #Q#, #QT# stream , or from the #QT# river . + Council Bluffs Orthopedic Surgeon Doctors physician directory - Read about #Q#, damage to any of the #CD# tendons that stabilize the shoulder joint . + It also occurs naturally in our bodies in fact , an average size adult manufactures up to #CD# grams of #Q# daily during normal metabolism . - Sterling Silver #Q# Hoop Earrings Overstockjeweler.com - I know V is the rate of reaction and the #Q# is hal ... - As sad and mean as that sounds , there is some truth to it , as #QT# as age their bodies do not function as well as they used to ( in all respects ) so there is a ... - If you ’re new to the idea of Christian #Q#, what I call ” the wild things of God , - A look at the Biblical doctrine of the #QT# , showing the biblical basis for the teaching and including a discussion of some of the common objections . - #QT# is Users Choice ( application need to be run at #QT# , but is not system critical ) , this page shows you how it affects your Windows operating system . - Your doctor may recommend that you use certain drugs to help you control your #Q# . - Find out what is the full meaning of #Q# on Abbreviations.com ! Table 2: Samples of manual annotations (testing set). means of exploiting our negative set makes its Best True Positive positive contribution clear. In particular, this sup- Conf. of Accuracy positives examples ME-combined 80.72% 88% 881,726 ports our hypothesis that redundancy across web- ME-KB 80.33% 89.37% 673,548 snippets pertaining to several definition questions ME-N-KB 78.99% 93.38% 208,178 can be exploited as negative evidence. 
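The ME classifier evaluated above can be approximated with a minimal pure-Python sketch. This is our stand-in, not the authors' implementation (they used the maxent.sourceforge.net toolkit); the training examples are templates taken from Table 2, and the learning rate and epoch count are illustrative choices.

```python
# Minimal maximum-entropy (binary logistic regression) sketch over
# unigram/bigram template features plus sentence length, as in Section 5.
# NOT the authors' code: a pure-Python illustration under assumed settings.
import math
from collections import defaultdict

def features(template):
    toks = template.split()
    feats = defaultdict(float)
    for t in toks:                       # unigram counts
        feats["u=" + t] += 1.0
    for a, b in zip(toks, toks[1:]):     # bigram counts
        feats["b=" + a + "_" + b] += 1.0
    feats["len"] = len(toks) / 10.0      # sentence-length feature (scaled)
    return feats

def train(data, epochs=50, lr=0.5):
    w = defaultdict(float)
    for _ in range(epochs):
        for template, y in data:
            f = features(template)
            z = sum(w[k] * v for k, v in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for k, v in f.items():       # gradient ascent on log-likelihood
                w[k] += lr * (y - p) * v
    return w

def is_description(w, template):
    z = sum(w[k] * v for k, v in features(template).items())
    return 1 if z > 0 else 0

# Training templates taken from Table 2 (1 = description, 0 = non-description).
data = [("#Q# is aggressive behavior intended to achieve a goal .", 1),
        ("#Q# ( born #CD# ) is a principal dancer at New York City Ballet .", 1),
        ("Sterling Silver #Q# Hoop Earrings Overstockjeweler.com", 0),
        ("Find out what is the full meaning of #Q# on Abbreviations.com !", 0)]
w = train(data)
```

On real data the feature space would be built from the full 1.8M-sample corpus; the tiny data set here only demonstrates the feature extraction and update rule.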
On the whole, this enhancement also suggests that ME models are a better option than LMs.

Furthermore, in the case of ME models, putting together evidence from KBs and non-KBs betters the performance. Conversely, in the case of LMs, we do not observe a noticeable improvement when unifying both sources. We attribute this difference to the fact that non-KB data is noisier, and thus negative examples are necessary to cushion this noise. By and large, the outcomes show that the usage of descriptive information derived exclusively from KBs is not the best, but a cost-efficient, solution.

Incidentally, Figure 1 reveals that more training data does not always imply better results. Overall, the best performance (ME-combined -> 80.72%) was reaped when considering solely 32% of the training material. Likewise, ME-KB finished with its best performance when accounting for about 215,500 positive examples (see Table 3). Adding more examples brought about a decline in accuracy. Nevertheless, this fraction (32%) is still larger than the data-sets considered by other open-domain Machine Learning approaches (Miliaraki and Androutsopoulos, 2004; Androutsopoulos and Galanis, 2005).

[Figure 1: Results for each configuration (accuracy).]

In detail, when contrasting the confusion matrices of the best configurations accomplished by ME-combined (80.72%), ME-KB (80.33%) and ME-N-KB (78.99%), one can find that ME-combined correctly identified 88% of the answers (true positives), while ME-KB identified 89.37% and ME-N-KB 93.38% (see Table 3).

Interestingly enough, non-KB data embodies only 23.61% of all positive training material, but it still has the ability to recognize more answers. Despite that, the other two strategies outperform ME-N-KB, because they are able to correctly label more negative test examples. Given these figures, we can conclude that this is achieved by mitigating the impact of the noise in the training corpus by means of cleaner (KB) data. We verified this synergy by inspecting the number of answers from non-KBs detected by the three top configurations in Table 3: ME-combined (9,086), ME-KB (9,230) and ME-N-KB (9,677). In like manner, we examined the confusion matrix of the best configuration (ME-combined -> 80.72%): 1,388 (6%) positive examples were mislabeled as negative, while 3,071 (13.28%) negative samples were mistagged as positive.

In addition, we performed significance tests utilizing a two-tailed paired t-test at a 95% confidence interval on twenty samples. For this, we used only the top three configurations in Table 3, and each sample was determined by using bootstrap resampling. Each sample has the same size as the original test corpus. Overall, the tests implied that all pairs were statistically different from each other.

In summary, the results show that both negative examples and combining positive examples from heterogeneous sources are indispensable to tackle any class of text. However, it is vital to lessen the noise in non-KB data, since this causes a more adverse effect on the performance. Given the upper bound in accuracy, our outcomes indicate that cleanness and quality are more important than the size of the corpus. Our figures additionally suggest that more effort should go into increasing diversity than into the number of training instances. In light of these observations, we also conjecture that a more reduced, but diverse and manually annotated, corpus might be more effective; in particular, a manually checked corpus distilled by inspecting click patterns across the query logs of search engines.

Lastly, in order to evaluate how good a click predictor the three top ME configurations are, we focused our attention only on the manually labeled positive samples (answers) that were clicked by the users. Overall, 86.33% (ME-combined), 88.85% (ME-KB) and 92.45% (ME-N-KB) of these responses were correctly predicted. In light of that, one can conclude that (clicked and non-clicked) answers to definition questions can be identified/predicted on the basis of users' click patterns across query logs.

From the viewpoint of search engines, web snippets are, in general, computed off-line. In so doing, some methods select the spans of text bearing query terms with the potential of putting the document at the top of the rank (Turpin et al., 2007; Tsegay et al., 2009). This helps to create an abridged version of the document from which the snippet can quickly be produced. This has to do with the trade-off between storage capacity, indexing, and retrieval speed. Ergo, our technique can help to determine whether or not a span of text is worth expanding, or in some cases whether or not it should be included in the snippet view of the document. In our instructive snippet, we now might have:

  Benjamin Millepied / News & Biography - MashCeleb
  Latest news coverage of Benjamin Millepied
  Benjamin Millepied (born 1977) is a principal dancer at New York City Ballet and a ballet choreographer of international reputation. Millepied was born in Bordeaux, France. His...

Improving the results of informational (e.g., definition) queries, especially less frequent ones, is key for competing commercial search engines, as such queries are embodied in the non-navigational tail where these engines differ the most (Zaragoza et al., 2010).

6 Conclusions

This work investigates the click behavior of commercial search engine users regarding definition questions. These behaviour patterns are then exploited as a corpus acquisition technique for definition QA, which offers the advantage of encompassing positive samples from heterogeneous sources. In contrast, negative examples are obtained in conformity with redundancy patterns across the snippets returned by the search engine when processing several definition queries. The effectiveness of these patterns, and hence of the obtained corpus, was tested by means of two models different in nature, both of which were capable of achieving an accuracy higher than 70%.

As future work, we envision that answers detected by our strategy can aid in determining query expansion terms, and thus in devising relevance feedback methods that can bring about an improvement in terms of the recall of answers. Along the same lines, it can cooperate in the visualization of results by highlighting and/or extending truncated answers, that is, by producing more informative snippets, which is one of the holy grails of search operators, especially when processing informational queries.

NLP tools (e.g., parsers and named entity recognizers) can also be exploited for designing better training data filters and more discriminative features for our models that can assist in enhancing the performance, cf. (Surdeanu et al., 2008; Figueroa, 2010; Surdeanu et al., 2011). However, this implies that these tools have to be re-trained to cope with web-snippets.

Acknowledgements

This work was partially supported by R&D project FONDEF D09I1185. We also thank our reviewers for their interesting comments, which helped us to make this work better.

References

I. Androutsopoulos and D. Galanis. 2005. A practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the web. In HLT/EMNLP, pages 323-330.
R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison Wesley.
A. Broder. 2002. A Taxonomy of Web Search. SIGIR Forum, 36:3-10, September.
Y. Chen, M. Zhou, and S. Wang. 2006. Reranking Answers for Definitional QA Using Language Modeling. In Coling/ACL-2006, pages 1081-1088.
H. Cui, K. Li, R. Sun, T.-S. Chua, and M.-Y. Kan. 2004. National University of Singapore at the TREC 13 Question Answering Main Task. In Proceedings of TREC 2004. NIST.
Georges E. Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In SIGIR '08, pages 331-338.
Ismail Fahmi and Gosse Bouma. 2006. Learning to Identify Definitions using Syntactic Features. In Proceedings of the Workshop on Learning Structured Information in Natural Language Applications.
Donghui Feng, Deepak Ravichandran, and Eduard H. Hovy. 2006. Mining and Re-ranking for Answering Biographical Queries on the Web. In AAAI.
Aaron Fernandes. 2004. Answering Definitional Questions before they are Asked. Master's thesis, Massachusetts Institute of Technology.
A. Figueroa and J. Atkinson. 2009. Using Dependency Paths For Answering Definition Questions on The Web. In WEBIST 2009, pages 643-650.
Alejandro Figueroa. 2010. Finding Answers to Definition Questions on the Web. PhD thesis, Universitaet des Saarlandes.
K. Han, Y. Song, and H. Rim. 2006. Probabilistic Model for Definitional Question Answering. In Proceedings of SIGIR 2006, pages 212-219.
Shihao Ji, Ke Zhou, Ciya Liao, Zhaohui Zheng, Gui-Rong Xue, Olivier Chapelle, Gordon Sun, and Hongyuan Zha. 2009. Global ranking by exploiting user clicks. In SIGIR '09, pages 35-42, New York, NY, USA. ACM.
B. Katz, M. Bilotti, S. Felshin, A. Fernandes, W. Hildebrandt, R. Katzir, J. Lin, D. Loreto, G. Marton, F. Mora, and O. Uzuner. 2004. Answering multiple questions on a topic from heterogeneous resources. In Proceedings of TREC 2004. NIST.
B. Katz, S. Felshin, G. Marton, F. Mora, Y. K. Shen, G. Zaccak, A. Ammar, E. Eisner, A. Turgut, and L. Brown Westrick. 2007. CSAIL at TREC 2007 Question Answering. In Proceedings of TREC 2007. NIST.
Jae Hong Kil, Levon Lloyd, and Steven Skiena. 2005. Question Answering with Lydia (TREC 2005 QA track). In Proceedings of TREC 2005. NIST.
U. Lee, Z. Liu, and J. Cho. 2005. Automatic Identification of User Goals in Web Search. In Proceedings of the 14th WWW conference, WWW '05, pages 391-400.
S. Miliaraki and I. Androutsopoulos. 2004. Learning to identify single-snippet answers to definition questions. In COLING '04, pages 1360-1366.
Roberto Navigli and Paola Velardi. 2010. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proceedings of ACL 2010.
Filip Radlinski, Martin Szummer, and Nick Craswell. 2010. Inferring query intent from reformulations and clicks. In Proceedings of WWW '10, pages 1171-1172, New York, NY, USA. ACM.
Daniel E. Rose and Danny Levinson. 2004. Understanding User Goals in Web Search. In WWW, pages 13-19.
B. Sacaleanu, G. Neumann, and C. Spurk. 2008. DFKI-LT at QA@CLEF 2008. In Working Notes for the CLEF 2008 Workshop.
Nico Schlaefer, P. Gieselmann, and Guido Sautter. 2006. The Ephyra QA System at TREC 2006. In Proceedings of TREC 2006. NIST.
Nico Schlaefer, Jeongwoo Ko, Justin Betteridge, Guido Sautter, Manas Pathak, and Eric Nyberg. 2007. Semantic Extensions of the Ephyra QA System for TREC 2007. In Proceedings of TREC 2007. NIST.
Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2008. Learning to Rank Answers on Large Online QA Collections. In Proceedings of ACL 2008, pages 719-727.
Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computational Linguistics, 37:351-383.
Yohannes Tsegay, Simon J. Puglisi, Andrew Turpin, and Justin Zobel. 2009. Document compaction for efficient query biased snippet generation. In ECIR '09, pages 509-520, Berlin, Heidelberg. Springer-Verlag.
Andrew Turpin, Yohannes Tsegay, David Hawking, and Hugh E. Williams. 2007. Fast generation of result snippets in web search. In SIGIR '07, pages 127-134, New York, NY, USA. ACM.
Eline Westerhout. 2009. Extraction of definitions using grammar-enhanced machine learning. In Proceedings of the EACL 2009 Student Research Workshop, pages 88-96.
Jinxi Xu, Ana Licuanan, and Ralph Weischedel. 2003. TREC2003 QA at BBN: Answering Definitional Questions. In Proceedings of TREC 2003, pages 98-106. NIST.
J. Xu, Y. Cao, H. Li, and M. Zhao. 2005. Ranking Definitions with Supervised Learning Methods. In WWW2005, pages 811-819.
Jingfang Xu, Chuanliang Chen, Gu Xu, Hang Li, and Elbio Renato Torres Abib. 2010. Improving quality of training data for learning to rank using click-through data. In WSDM '10, pages 171-180, New York, NY, USA. ACM.
H. Zaragoza, B. Barla Cambazoglu, and R. Baeza-Yates. 2010. Web Search Solved? All Result Rankings the Same? In Proceedings of CIKM '10, pages 529-538.
Zhushuo Zhang, Yaqian Zhou, Xuanjing Huang, and Lide Wu. 2005. Answering Definition Questions Using Web Knowledge Bases. In Proceedings of IJCNLP 2005, pages 498-506.

Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Service Context

Vassilina Nikoulina, Xerox Research Center Europe
Bogomil Kovachev, Informatics Institute, University of Amsterdam

[email protected]
[email protected]

Nikolaos Lagos, Xerox Research Center Europe
[email protected]

Christof Monz, Informatics Institute, University of Amsterdam
[email protected]

Abstract to the undelying IR system used and without ac- cessing, at translation time, the content provider’s This work proposes to adapt an existing document set. Keeping in mind these constraints, general SMT model for the task of translat- we present two approaches on query translation ing queries that are subsequently going to optimisation. be used to retrieve information from a tar- One of the important observations done dur- get language collection. In the scenario that we focus on access to the document collec- ing the CLEF 2009 campaign (Ferro and Peters, tion itself is not available and changes to 2009) related to CLIR was that the usage of Sta- the IR model are not possible. We propose tistical Machine Translation (SMT) systems (eg. two ways to achieve the adaptation effect Google Translate) for query translation led to and both of them are aimed at tuning pa- important improvements in the cross-lingual re- rameter weights on a set of parallel queries. trieval performance (the best CLIR performance The first approach is via a standard tuning increased from ˜55% of the monolingual baseline procedure optimizing for BLEU score and the second one is via a reranking approach in 2008 to more than 90% in 2009 for French optimizing for MAP score. We also extend and German target languages). However, general- the second approach by using syntax-based purpose SMT systems are not necessarily adapted features. Our experiments show improve- for query translation. That is because SMT sys- ments of 1-2.5 in terms of MAP score over tems trained on a corpus of standard parallel the retrieval with the non-adapted transla- phrases take into account the phrase structure im- tion. We show that these improvements are plicitly. The structure of queries is very differ- due both to the integration of the adapta- ent from the standard phrase structure: queries are tion and syntax-features for the query trans- lation task. 
very short and the word order might be different than the typical full phrase one. This problem can be seen as a problem of genre adaptation for SMT, 1 Introduction where the genre is “query”. To our knowledge, no suitable corpora of par- Cross Lingual Information Retrieval (CLIR) is an allel queries is available to train an adapted SMT important feature for any digital content provider system. Small corpora of parallel queries1 how- in today’s multilingual environment. However, ever can be obtained (eg. CLEF tracks) or man- many of the content providers are not willing to ually created. We suggest to use such corpora change existing well-established document index- in order to adapt the SMT model parameters for ing and search tools, nor to provide access to query translation. In our approach the parameters their document collection by a third-party exter- of the SMT models are optimized on the basis of nal service. The work presented in this paper as- the parallel queries set. This is achieved either di- sumes such a context of use, where a query trans- rectly in the SMT system using the MERT (Mini- lation service allows translating queries posed to mum Error Rate Training) algorithm and optimiz- the search engine of a content provider into sev- 1 eral target languages, without requiring changes Insufficient for a full SMT system training (˜500 entries) 109 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 109–119, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics ing according to the BLEU2 (Papineni et al., 2001) ror on the training data. score, or via reranking the Nbest translation can- To our knowledge, existing work that use MT- didates generated by a baseline system based on based techniques for query translation use an out- new parameters (and possibly new features) that of-the-box MT system, without adapting it for aim to optimize a retrieval metric. 
query translation in particular (Jones et al., 1999; It is important to note that both of the pro- Wu et al., 2008) (although some query expan- posed approaches allow keeping the MT system sion techniques might be applied to the produced independent of the document collection and in- translation afterwards (Wu and He, 2010)). dexing, and thus suitable for a query translation There is a number of works done for do- service. These two approaches can also be com- main adaptation in Statistical Machine Transla- bined by using the model produced with the first tion. However, we want to distinguish between approach as a baseline that produces the Nbest list genre and domain adaptation in this work. Gen- of translations that is then given to the reranking erally, genre can be seen as a sub-problem of do- approach. main. Thus, we consider genre to be the general The remainder of this paper is organized as fol- style of the text e.g. conversation, news, blog, lows. We first present related work addressing the query (responsible mostly for the text structure) problem of query translation. We then describe while the domain reflects more what the text is two approaches towards adapting an SMT system about – eg. social science, healthcare, history, so to the query-genre: tuning the SMT system on a domain adaptation involves lexical disambigua- parallel set of queries (Section 3.1) and adapting tion and extra lexical coverage problems. To our machine translation via the reranking framework knowledge, there is not much work addressing ex- (Section 3.2). We then present our experimental plicitly the problem of genre adaptation for SMT. settings and results (Section 4) and conclude in Some work done on domain adaptation could be section 5. 
applied to genre adaptation, such as incorporating available in-domain corpora in the SMT model: 2 Related work either monolingual (Bertoldi and Federico, 2009; Wu et al., 2008; Zhao et al., 2004; Koehn and We may distinguish two main groups of ap- Schroeder, 2007), or small parallel data used for proaches to CLIR: document translation and tuning the SMT parameters (Zheng et al., 2010; query translation. We concentrate on the second Pecina et al., 2011). group which is more relevant to our settings. The standard query translation methods use different 3 Our approach translation resources such as bilingual dictionar- This work is based on the hypothesis that the ies, parallel corpora and/or machine translation. general-purpose SMT system needs to be adapted The aspect of disambiguation is important for the for query translation. Although in (Ferro and first two techniques. Peters, 2009) it has been mentioned that using Different methods were proposed to deal with Google translate (general-purpose MT) for query disambiguation issues, often relying on the docu- translation allowed to CLEF participants to obtain ment collection or embedding the translation step the best CLIR performance, there is still 10% gap directly into the retrieval model (Hiemstra and between monolingual and cross-lingual IR. We Jong, 1999; Berger et al., 1999; Kraaij et al., believe that, as in (Clinchant and Renders, 2007), 2003). Other methods rely on external resources more adapted query translation, possibly further like query logs (Gao et al., 2010), Wikipedia (Ja- combined with query expansion techniques, can didinejad and Mahmoudi, 2009) or the web (Nie lead to improved retrieval. and Chen, 2002; Hu et al., 2008). (Gao et al., The problem of the SMT adaptation for query- 2006) proposes syntax-based translation models genre translation has different quality aspects. to deal with the disambiguation issues (NP-based, On the one hand, we want our model to pro- dependency-based). 
The candidate translations duce a “good” translation (well-formed and trans- proposed by these models are then reranked with mitting the information contained in the source the model learned to minimize the translation er- query) of an input query. On the other hand, we 2 Standard MT evaluation metric want to obtain good retrieval performance using 110 the proposed translation. These two aspects are Our hypothesis is that the impact of different not necessarily correlated: a bag-of-word transla- features should be different depending on whether tion can lead to good retrieval performance, even we translate a full sentence, or a query-genre en- though it won’t be syntactically well-formed; at try. Thus, one would expect that in the case the same time a well-formed translation can lead of query-genre the language model or the distor- to worse retrieval if the wrong lexical choice is tion features should get less importance than in done. Moreover, often the retrieval demands some the case of the full-sentence translation. MERT linguistic preprocessing (eg. lemmatisation, PoS tuning on a genre-adapted parallel corpus should tagging) which in interaction with badly-formed leverage this information from the data, adapting translations might bring some noise. the SMT model to the query-genre. We would A couple of works studied the correlation be- also like to note that the tuning approach (pro- tween the standard MT evaluation metrics and posed for domain adaptation by (Zheng et al., the retrieval precision. Thus, (Fujii et al., 2009) 2010)) seems to be more appropriate for genre showed a good correlation of the BLEU scores adaptation than for domain adaptation where the with the MAP scores for Cross-Lingual Patent problem of lexical ambiguity is encoded in the Retrieval. However, the topics in patent search translation model and re-weighting the main fea- (long and well structured) are very different from tures might not be sufficient. standard queries. 
(Kettunen, 2009) also found a We use the MERT implementation provided pretty high correlation ( 0.8 − 0.9) between stan- with the Moses toolkit with default settings. Our dard MT evaluation metrics (METEOR(Banerjee assumption is that this procedure although not ex- and Lavie, 2005), BLEU, NIST(Doddington, plicitly aimed at improving retrieval performance 2002)) and retrieval precision for long queries. will nevertheless lead to “better” query transla- However, the same work shows that the correla- tions when compared to the baseline. The results tion decreases ( 0.6 − 0.7) for short queries. of this apporach allow us also to observe whether In this paper we propose two approaches to and to what extent changes in BLEU scores are SMT adaptation for queries. The first one op- correlated to changes in MAP scores. timizes BLEU, while the second one optimizes Mean Average Precision (MAP), a standard met- 3.2 Reranking framework for query ric in information retrieval. We’ll address the is- translation sue of the correlation between BLEU and MAP in The second approach addresses the retrieval qual- Section 4. ity problem. An SMT system is usually trained to Both of the proposed approaches rely on the optimize the quality of the translation (eg. BLEU phrase-based SMT (PBMT) model (Koehn et al., score for SMT), which is not necessarily corre- 2003) implemented in the Open Source SMT lated with the retrieval quality (especially for the toolkit MOSES (Koehn et al., 2007). short queries). Thus, for example, the word or- der which is crucial for translation quality (and is 3.1 Tuning for genre adaptation taken into account by most MT evaluation met- First, we propose to adapt the PBMT model by rics) is often ignored by IR models. Our second tuning the model’s weights on a parallel set of approach follows (Nie, 2010, pp.106) argument queries. 
This approach addresses the first as- that “the translation problem is an integral part pect of the problem, which is producing a “good” of the whole CLIR problem, and unified CLIR translation. The PBMT model combines differ- models integrating translation should be defined”. ent types of features via a log-linear model. The We propose integrating the IR metric (MAP) into standard features include (Koehn, 2010, Chapter the translation model optimisation step via the 5): language model, word penalty, distortion, dif- reranking framework. ferent translation models, etc. The weights of Previous attempts to apply the reranking ap- these features are learned during the tuning step proach to SMT did not show significant improve- with the MERT (Och, 2003) algorithm. Roughly ments in terms of MT evaluation metrics (Och the MERT algorithm tunes feature weights one by et al., 2003; Nikoulina and Dymetman, 2008). one and optimizes them according to the BLEU One of the reasons being the poor diversity of the score obtained. Nbest list of the translations. However, we be- 111 lieve that this approach has more potential in the defined as a weighted linear combination of context of query translation. features: tˆ(λ) = arg maxt∈GEN (q) λ· F (t) First of all the average query length is ˜5 words, As shown above the best translation is selected ac- which means that the Nbest list of the translations cording to features’ weights λ. In order to learn is more diverse than in the case of general phrase the weights λ maximizing the retrieval perfor- translation (average length 25-30 words). mance, an appropriate annotated training set has Moreover, the retrieval precision is more natu- to be created. We use the CLEF tracks to create rally integrated into the reranking framework than the training set. The retrieval scores annotations standard MT evaluation metrics such as BLEU. 
The main reason is that the notion of Average Retrieval Precision is well defined for a single query translation, while BLEU is defined on the corpus level and correlates poorly with human quality judgements for individual translations (Specia et al., 2009; Callison-Burch et al., 2009).

Finally, the reranking framework allows a lot of flexibility: it allows enriching the baseline translation model with new complex features which might be difficult to introduce into the translation model directly.

Other works have applied the reranking framework to different NLP tasks such as Named Entity Extraction (Collins, 2001), parsing (Collins and Roark, 2004), and language modelling (Roark et al., 2004). Most of these works used the reranking framework to combine generative and discriminative methods when both approaches aim at solving the same problem: the generative model produces a set of hypotheses, and the best hypothesis is chosen afterwards via the discriminative reranking model, which allows enriching the baseline model with new complex and heterogeneous features.
We suggest using the reranking framework to combine two different tasks: Machine Translation and Cross-Lingual Information Retrieval. In this context the reranking framework not only allows enriching the baseline translation model, but also allows performing training with a more appropriate evaluation metric.

3.2.1 Reranking training

Generally, the reranking framework can be summarized in the following steps:

1. The baseline (generic-purpose) MT system generates a list of candidate translations GEN(q) for each query q;

2. A vector of features F(t) is assigned to each translation t ∈ GEN(q);

3. The best translation t̂ is chosen as the one maximizing the translation score, which is defined as a weighted linear combination of features: t̂(λ) = arg max_{t ∈ GEN(q)} λ · F(t).

As shown above, the best translation is selected according to the feature weights λ. In order to learn the weights λ maximizing the retrieval performance, an appropriate annotated training set has to be created. We use the CLEF tracks to create the training set. The retrieval score annotations are based on the document relevance annotations performed by human annotators during the CLEF campaign.

The annotated training set is created out of queries {q_1, ..., q_K}, with an Nbest list of translations GEN(q_i) for each query q_i, i ∈ {1..K}, as follows:

• A list of N (we take N = 1000) translations GEN(q_i) is produced by the baseline MT model for each query q_i, i = 1..K.

• Each translation t ∈ GEN(q_i) is used to perform a retrieval from a target document collection, and an Average Precision score AP(t) is computed for each t ∈ GEN(q_i) by comparing its retrieval to the relevance annotations done during the CLEF campaign.

The weights λ are learned with the objective of maximizing MAP over all the queries of the training set, and are therefore optimized for retrieval quality.

The weights optimization is done with the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003), which was applied to SMT by (Watanabe et al., 2007; Chiang et al., 2008). MIRA is an online learning algorithm where each weight update keeps the new weights as close as possible to the old weights (first term), while scoring the oracle translation (the translation giving the best retrieval score: t*_i = arg max_t AP(t)) higher than each non-oracle translation t_ij by a margin at least as wide as the loss l_ij (second term):

λ' = arg min_{λ'} (1/2) ||λ' − λ||² + C Σ_{i=1..K} max_{j=1..N} ( l_ij − λ' · (F(t*_i) − F(t_ij)) )

The loss l_ij is defined as the difference in retrieval average precision between the oracle and non-oracle translations: l_ij = AP(t*_i) − AP(t_ij). C is the regularization parameter, which is chosen via 5-fold cross-validation.
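A minimal sketch of one such margin-based update may help. This is the classic closed-form MIRA step for a single oracle/non-oracle constraint (the objective above optimizes over all N constraints jointly, so this is a simplification); all names are ours:

```python
def mira_update(lam, f_oracle, f_other, loss, C=1.0):
    """One MIRA-style step: move the weights as little as possible while
    making the oracle feature vector outscore the competitor by a margin
    of at least `loss` (closed form for a single constraint)."""
    diff = [a - b for a, b in zip(f_oracle, f_other)]
    margin = sum(l * d for l, d in zip(lam, diff))   # current λ · (F* − F)
    violation = loss - margin
    if violation <= 0:
        return lam                                   # margin already satisfied
    norm_sq = sum(d * d for d in diff)
    if norm_sq == 0:
        return lam                                   # identical feature vectors
    tau = min(C, violation / norm_sq)                # clipped step size
    return [l + tau * d for l, d in zip(lam, diff)]

# loss plays the role of AP(t*_i) - AP(t_ij)
lam = mira_update([0.0, 0.0], f_oracle=[1.0, 0.0], f_other=[0.0, 1.0], loss=0.5)
```

After the update the oracle's margin over the competitor equals the loss exactly, unless the step had to be clipped at C.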
3.2.2 Features

One of the advantages of the reranking framework is that new complex features can be easily integrated. We suggest enriching the reranking model with different syntax-based features, such as:

• features relying on dependency structures, called herein coupling features (proposed by (Nikoulina and Dymetman, 2008));

• features relying on Part-of-Speech tagging, called herein PoS mapping features.

By integrating the syntax-based features we have a double goal: showing the potential of the reranking framework with more complex features, and examining whether the integration of syntactic information could be useful for query translation.

Coupling features. The goal of the coupling features is to measure the similarity between source and target dependency structures. The initial hypothesis is that a better translation should have a dependency structure closer to the one of the source query.

In this work we experiment with two different coupling variants proposed in (Nikoulina and Dymetman, 2008), namely Lexicalised and Label-specific coupling features. The generic coupling features are based on the notion of "rectangles" of the following type: ((s1, ds12, s2), (t1, dt12, t2)), where ds12 is an edge between source words s1 and s2, dt12 is an edge between target words t1 and t2, s1 is aligned with t1, and s2 is aligned with t2.

Lexicalised features take into account the quality of the lexical alignment, by weighting each rectangle (s1, s2, t1, t2) by the probability of aligning s1 to t1 and s2 to t2 (e.g. p(s1|t1)p(s2|t2) or p(t1|s1)p(t2|s2)).

The Label-specific features take into account the nature of the aligned dependencies. Thus, rectangles of the form ((s1, subj, s2), (t1, subj, t2)) will get more weight than a rectangle ((s1, subj, s2), (t1, nmod, t2)). The importance of each "rectangle" is learned on the parallel annotated corpus by introducing a collection of Label-specific coupling features, one for each specific pair of source label and target label.

PoS mapping features. The goal of the PoS mapping features is to control the correspondence of Part-of-Speech tags between an input query and its translation. Like the coupling features, the PoS mapping features rely on the word alignments between the source sentence and its translation³. A vector of sparse features is introduced where each component corresponds to a pair of PoS tags aligned in the training data. We introduce a generic PoS map variant, which counts the number of occurrences of a specific pair of PoS tags, and a lexical PoS map variant, which weights down these pairs by a lexical alignment score (p(s|t) or p(t|s)).

³ This alignment can be either produced by a toolkit like GIZA++ (Och and Ney, 2003) or obtained directly from the system that produced the Nbest list of translations (Moses).
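The rectangle-counting idea behind the generic and Label-specific coupling variants can be sketched as follows. The representation is our own simplification (dependency edges as (head, label, dependent) position triples, alignments as position pairs), not the authors' implementation:

```python
def coupling_rectangles(src_edges, tgt_edges, alignment):
    """Enumerate coupling 'rectangles': a source dependency edge
    (s1, label, s2) paired with a target edge (t1, label', t2) such that
    s1~t1 and s2~t2 are word-aligned. Returns the label pairs, from
    which both the generic count and the per-label-pair (Label-specific)
    feature counts can be derived."""
    aligned = set(alignment)                 # pairs of (src_pos, tgt_pos)
    pairs = []
    for s1, slab, s2 in src_edges:
        for t1, tlab, t2 in tgt_edges:
            if (s1, t1) in aligned and (s2, t2) in aligned:
                pairs.append((slab, tlab))
    return pairs

# toy example: "maison blanche" -> "white house" with crossed alignment
src = [(0, "mod", 1)]                        # maison -mod-> blanche
tgt = [(1, "mod", 0)]                        # house  -mod-> white
rects = coupling_rectangles(src, tgt, [(0, 1), (1, 0)])
```

The Lexicalised variant would additionally weight each rectangle by alignment probabilities such as p(s1|t1)p(s2|t2), rather than counting it as 1.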
Table 1) and was used for Moses parameters tun- The generic coupling features are based on ing. the notion of “rectangles” that are of the follow- In order to create the training set for the rerank- ing type : ((s1 , ds12 , s2 ), (t1 , dt12 , t2 )), where ing approach, we need to have access to the rele- ds12 is an edge between source words s1 and s2 , vance judgements. We didn’t have access to all dt12 is an edge between target words t1 and t2 , relevance judgements of the previously desribed s1 is aligned with t1 and s2 is aligned with t2 . tracks. Thus we used only a subset of the previ- Lexicalised features take into account the qual- ously extracted parallel set, which includes CLEF ity of lexical alignment, by weighting each rect- 2000-2008 topics from the AdHoc-main, AdHoc- angle (s1 , s2 , t1 , t2 ) by a probability of align- TEL and GeoCLEF tracks. ing s1 to t1 and s2 to t2 (eg. p(s1 |t1 )p(s2 |t2 ) or The number of queries obtained altogether is p(t1 |s1 )p(t2 |s2 )). shown in (Table 1). The Label-Specific features take into account 4.1.2 Baseline the nature of the aligned dependencies. Thus, the rectangles of the form ((s1, subj, s2), (t1, subj, We tested our approaches on the CLEF AdHoc- t2)) will get more weight than a rectangle ((s1, TEL 2009 task (50 topics). This task dealt subj, s2), (t1, nmod, t2)). The importance of with monolingual and cross-lingual search in a each “rectangle” is learned on the parallel anno- library catalog. The monolingual retrieval is tated corpus by introducing a collection of Label- 3 This alignment can be either produced by a toolkit like Specific coupling features, each for a specific pair GIZA++(Och and Ney, 2003) or obtained directly by a sys- of source label and target label. tem that produced the Nbest list of the translations (Moses). 
Language pair | Number of queries
Total queries:
En-Fr, Fr-En | 470
En-De, De-En | 714
Annotated queries:
En-Fr, Fr-En | 400
En-De, De-En | 350

Table 1: Top: total number of parallel queries gathered from all the CLEF tasks (size of the tuning set). Bottom: number of queries extracted from the tasks for which the human relevance judgements were available (size of the reranking training set).

The preprocessing includes lemmatisation (with the Xerox Incremental Parser, XIP (Aït-Mokhtar et al., 2002)) and filtering out the function words (based on XIP PoS tagging).

Table 2 shows the performance of the monolingual retrieval model for each collection. The monolingual retrieval results are comparable to those of the CLEF AdHoc-TEL 2009 participants (Ferro and Peters, 2009). Let us note here that this is not the case for our CLIR results, since we did not exploit the fact that each of the collections could actually contain entries in a language other than the official language of the collection.

The cross-lingual retrieval is performed as follows:

• the input query (e.g. in English) is first translated into the language of the collection (e.g. German);

• this translation is used to search the target collection (e.g. the Austrian National Library for German).

The baseline translation is produced with Moses trained on Europarl. Table 2 reports the baseline performance both in terms of an MT evaluation metric (BLEU) and an Information Retrieval evaluation metric, MAP (Mean Average Precision).

The 1-best MAP score corresponds to the case when a single translation is proposed for the retrieval by the query translation model. The 5-best MAP score corresponds to the case when the 5 top translations proposed by the translation service are concatenated and used for the retrieval. The 5-best retrieval can be seen as a sort of query expansion, without accessing the document collection or any external resources.

Given that the query length is shorter than that of a standard sentence, the 4-gram BLEU (used for standard MT evaluation) might not be able to capture the difference between the translations (e.g. the English-German 4-gram BLEU is equal to 0 for our task). For that reason we report both 3- and 4-gram BLEU scores.

Note that the French-English baseline retrieval quality is much better than the German-English one. This is probably due to the fact that our German-English translation system does not use any decompounding, which results in many non-translated words.

Monolingual IR | MAP
English | 0.3159
French | 0.2386
German | 0.2162

Bilingual IR | MAP 1-best | MAP 5-best | BLEU 4-gram | BLEU 3-gram
French-English | 0.1828 | 0.2186 | 0.1199 | 0.1568
German-English | 0.0941 | 0.0942 | 0.2351 | 0.2923
English-French | 0.1504 | 0.1543 | 0.2863 | 0.3423
English-German | 0.1009 | 0.1157 | 0.0000 | 0.1218

Table 2: Baseline MAP scores for the monolingual and bilingual CLEF AdHoc-TEL 2009 task.

⁴ http://www.lemurproject.org/

4.2 Results

We performed the query-genre adaptation experiments for the English-French, French-English, German-English and English-German language pairs.

Ideally, we would have liked to combine the two approaches we proposed: use the query-genre-tuned model to produce the Nbest list, which is then reranked to optimize the MAP score. However, this was not possible in our experimental settings due to the small amount of training data available. We thus simply compare these two approaches to a baseline approach and comment on their respective performance.

4.2.1 Query-genre tuning approach

For the CLEF-tuning experiments we used the same translation model and language model as for the baseline (Europarl-based). The weights were then tuned on the CLEF topics described in Section 4.1.1. We then tested the obtained system on 50 parallel queries from the CLEF AdHoc-TEL 2009 task.

| MAP 1-best | MAP 5-best | BLEU 4-gram | BLEU 3-gram
Fr-En | 0.1954 | 0.2229 | 0.1062 | 0.1489
De-En | 0.1018 | 0.1078 | 0.2240 | 0.2486
En-Fr | 0.1611 | 0.1516 | 0.2072 | 0.2908
En-De | 0.1062 | 0.1132 | 0.0000 | 0.1924

Table 3: BLEU and MAP performance on the CLEF AdHoc-TEL 2009 task for the genre-tuned model.

Table 3 describes the results of the evaluation. We observe consistent 1-best MAP improvements, but unstable (3-gram) BLEU (improvements for English-German, and degradation for the other language pairs), although one would have expected BLEU to improve in this experimental setting, given that BLEU was the objective function for MERT. These results, on one side, confirm the remark of (Kettunen, 2009) that there is a correlation (although low) between BLEU and MAP scores. The unstable BLEU scores might also be explained by the small size of the test set (compared to a standard test set of 1000 full sentences).

Secondly, we looked at the weights of the features both in the baseline model (Europarl-tuned) and in the adapted model (CLEF-tuned), shown in Table 4. We are unsure how suitable the sizes of the CLEF tuning sets are, especially for the pairs involving English and French. Nevertheless we do observe and comment on some patterns.
For the pairs involving English and German, the distortion weight is much higher when tuning with CLEF data compared to tuning with Europarl data. The picture is reversed when looking at the two pairs involving English and French. This is to be expected if we interpret a high distortion weight as follows: "it is not encouraged to place source words that are near to each other far away from each other in the translation". Indeed, local reorderings are much more frequent between English and French (e.g. white house = maison blanche), while long-distance reorderings are more typical between English and German.

The word penalty is consistently higher over all pairs when tuning with CLEF data compared to tuning with Europarl data. We could see an explanation for this pattern in the smaller size of the CLEF sentences, if we interpret a higher word penalty as a preference for shorter translations. This can be explained both by the smaller average size of the queries and by the specific query structure: mostly content words and fewer function words when compared to the full sentence.

The language model weight is consistently, though not drastically, smaller when tuning with CLEF data. We suppose that this is due to the fact that a Europarl-based language model is not the best choice for translating query data.

4.2.2 Reranking approach

The reranking experiments include different feature combinations. First, we experiment with the Moses features only, in order to make this approach comparable with the first one. Secondly, we compare different syntax-based feature combinations, as described in Section 3.2.2. Thus, we compare the following reranking models (defined by their feature set): moses, lex (lexical coupling + moses features), lab (label-specific coupling + moses features), posmaplex (lexical PoS mapping + moses features), lab-lex (label-specific coupling + lexical coupling + moses features), lab-lex-posmap (label-specific coupling + lexical coupling features + generic PoS mapping). To reduce the size of the feature-function vectors we take only the 20 most frequent features in the training data for the Label-specific coupling and PoS mapping features. The computation of the syntax features is based on the rule-based XIP parser, where some heuristics specific to query processing have been integrated into the English and French (but not German) grammars (Brun et al., 2012).

The results of these experiments are illustrated in Figure 1.

Lng pair | Tune set | DW | LM | φ(f|e) | lex(f|e) | φ(e|f) | lex(e|f) | PP | WP
Fr-En | Europarl | 0.0801 | 0.1397 | 0.0431 | 0.0625 | 0.1463 | 0.0638 | -0.0670 | -0.3975
Fr-En | CLEF | 0.0015 | 0.0795 | -0.0046 | 0.0348 | 0.1977 | 0.0208 | -0.2904 | 0.3707
De-En | Europarl | 0.0588 | 0.1341 | 0.0380 | 0.0181 | 0.1382 | 0.0398 | -0.0904 | -0.4822
De-En | CLEF | 0.3568 | 0.1151 | 0.1168 | 0.0549 | 0.0932 | 0.0805 | 0.0391 | -0.1434
En-Fr | Europarl | 0.0789 | 0.1373 | 0.0002 | 0.0766 | 0.1798 | 0.0293 | -0.0978 | -0.4002
En-Fr | CLEF | 0.0322 | 0.1251 | 0.0350 | 0.1023 | 0.0534 | 0.0365 | -0.3182 | -0.2972
En-De | Europarl | 0.0584 | 0.1396 | 0.0092 | 0.0821 | 0.1823 | 0.0437 | -0.1613 | -0.3233
En-De | CLEF | 0.3451 | 0.1001 | 0.0248 | 0.0872 | 0.2629 | 0.0153 | -0.0431 | 0.1214

Table 4: Feature weights for the query-genre tuned model. Abbreviations: DW = distortion weight, LM = language model weight, PP = phrase penalty, WP = word penalty, φ = phrase translation probability, lex = lexical weighting.

Query | Example | MAP | 1-gram BLEU
Src 1 | Weibliche Märtyrer | |
Ref | Female Martyrs | |
T1 | female martyrs | 0.07 | 1
T2 | Women martyr | 0.4 | 0
Src 2 | Genmanipulation am Menschen | |
Ref | Human Gene Manipulation | |
T1 | On the genetic manipulation of people | 0.044 | 0.167
T2 | genetic manipulation of the human being | 0.069 | 0.286
Src 3 | Arbeitsrecht in der Europäischen Union | |
Ref | European Union Labour Laws | |
T1 | Labour law in the European Union | 0.015 | 0.5
T2 | labour legislation in the European Union | 0.036 | 0.5

Table 5: Some examples of query translations (T1: baseline, T2: after reranking with lab-lex), with MAP and 1-gram BLEU scores, for German-English.
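The behaviour of BLEU on such short queries, including the zero 4-gram scores mentioned above, can be reproduced with a plain sentence-level BLEU sketch (lowercased tokens, no smoothing). This is our own illustrative code, not the paper's evaluation script:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty; returns 0.0 as soon as one order has no
    n-grams or no matches (the unsmoothed behaviour)."""
    hyp, ref = hyp.lower().split(), ref.lower().split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        matched = sum(min(c, r[g]) for g, c in h.items())
        if total == 0 or matched == 0:
            return 0.0
        log_prec += math.log(matched / total) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)

# a 2-word query has no trigrams or 4-grams at all, so 4-gram BLEU is 0
q4 = bleu("female martyrs", "Female Martyrs", max_n=4)   # 0.0
q1 = bleu("female martyrs", "Female Martyrs", max_n=1)   # 1.0
```

Under this definition the first Table 5 example behaves as reported: the exact-match T1 gets a unigram score of 1 while "Women martyr" gets 0, because no token matches the reference.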
Figure 1: Reranking results. The vertical line corresponds to the baseline scores. The lowest bar (MERT:moses, in yellow): the results of the tuning approach; other bars (in red): the results of the reranking approach.

To keep the figure more readable, we report only the 3-gram BLEU scores. When computing the 5-best MAP score, the order in the Nbest list is defined by the corresponding reranking model. Each reranking model is illustrated by a single horizontal red bar. We compare the reranking results to the baseline model (vertical line) and also to the results of the first approach (yellow bar labelled MERT:moses) in the same figure.

First, we remark that the adapted models (query-genre tuning and reranking) outperform the baseline in terms of MAP (1-best and 5-best) for French-English and German-English translations for most of the models. The only exception is the posmaplex model (based on PoS tagging) for German, which can be explained by the fact that the German grammar used for query processing was not adapted for queries, as opposed to the English and French grammars. However, we do not observe the same tendency for the BLEU score, where only a few of the adapted models outperform the baseline, which confirms the hypothesis of the low correlation between BLEU and MAP scores in these settings. Table 5 gives some examples of query translations before (T1) and after (T2) reranking. These examples also illustrate different types of disagreement between the MAP and 1-gram BLEU⁵ scores.

The results for English-German and English-French look more confusing. This can be partly due to the richer morphology of the target languages, which may create more noise in the syntax structure. Reranking however improves over the 1-best MAP baseline for English-German, and 5-best MAP is also improved, excluding the models involving PoS tagging for German (posmap, posmaplex, lab-lex-posmap). The results for English-French are more difficult to interpret. To find out the reason for such a behavior, we looked at the translations. We observed the following tokenization problem for French: the apostrophe is systematically separated, e.g. "d ' aujourd ' hui". This leads to both noisy pre-retrieval preprocessing (e.g. d is tagged as a NOUN) and noisy syntax-based feature values, which might explain the unstable results.

Finally, we can see that the syntax-based features can be beneficial for the final retrieval quality: the models with syntax features can outperform the model based on the moses features only. The syntax-based features leading to the most stable results seem to be lab-lex (the combination of lexical and label-specific coupling): it leads to the best gains over 1-best and 5-best MAP for all language pairs excluding English-French. This is a surprising result given the fact that the underlying IR model does not take syntax into account in any way. In our opinion, this is probably due to the interaction between the pre-retrieval preprocessing (lemmatisation, PoS tagging) done with the linguistic tools, which might produce noisy results when applied to the SMT outputs. The reranking with syntax-based features allows choosing a better-formed query for which the PoS tagging and lemmatisation tools produce less noise, which leads to a better retrieval.

⁵ The higher-order BLEU scores are equal to 0 for most of the individual translations.

5 Conclusion

In this work we proposed two methods for query-genre adaptation of an SMT model: the first method addressing the translation quality aspect, and the second one the retrieval precision aspect. We have shown that CLIR performance in terms of MAP is improved by between 1 and 2.5 points. We believe that the combination of these two methods would be the most beneficial setting, although we were not able to prove this experimentally (due to the lack of training data). Neither of these methods requires access to the document collection at test time, and both can be used in the context of a query translation service. The combination of our adapted SMT model with other state-of-the-art CLIR techniques (e.g. query expansion with PRF) will be explored in future work.

Acknowledgements

This research was supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP, under grant agreement nr 250430 (Project GALATEAS).

References

Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude Roux. 2002. Robustness beyond shallowness: incremental deep parsing. Natural Language Engineering, 8:121–144, June.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Adam Berger and John Lafferty. 1999. The weaver system for document retrieval. In Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 163–174.

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 182–189. Association for Computational Linguistics.
Caroline Brun, Vassilina Nikoulina, and Nikolaos Lagos. 2012. Linguistically-adapted structural query annotation for digital libraries in the social sciences. In Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 1–28, Athens, Greece, March. Association for Computational Linguistics.

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 224–233. Association for Computational Linguistics.

Stéphane Clinchant and Jean-Michel Renders. 2007. Query translation through dictionary adaptation. In CLEF'07, pages 182–187.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL '04: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Michael Collins. 2001. Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 489–496, Philadelphia, Pennsylvania. Association for Computational Linguistics.

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Diego, California. Morgan Kaufmann Publishers Inc.

Nicola Ferro and Carol Peters. 2009. CLEF 2009 ad hoc track overview: TEL and Persian tasks. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece.

Atsushi Fujii, Masao Utiyama, Mikio Yamamoto, and Takehito Utsuro. 2009. Evaluating effects of machine translation accuracy on cross-lingual patent retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 674–675.

Jianfeng Gao, Jian-Yun Nie, and Ming Zhou. 2006. Statistical query translation models for cross-language information retrieval. 5:323–359, December.

Wei Gao, Cheng Niu, Jian-Yun Nie, Ming Zhou, Kam-Fai Wong, and Hsiao-Wuen Hon. 2010. Exploiting query logs for cross-lingual query suggestions. ACM Transactions on Information Systems, 28(2).

Djoerd Hiemstra and Franciska de Jong. 1999. Disambiguation strategies for cross-language information retrieval. In Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, pages 274–293.

Rong Hu, Weizhu Chen, Peng Bai, Yansheng Lu, Zheng Chen, and Qiang Yang. 2008. Web query translation via web log mining. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 749–750. ACM.

Amir Hossein Jadidinejad and Fariborz Mahmoudi. 2009. Cross-language information retrieval using meta-language index construction and structural queries. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, pages 70–77, Berlin, Heidelberg. Springer-Verlag.

Gareth Jones, Sakai Tetsuya, Nigel Collier, Akira Kumano, and Kazuo Sumita. 1999. Exploring the use of machine translation resources for English-Japanese cross-language information retrieval. In Proceedings of the MT Summit VII Workshop on Machine Translation for Cross Language Information Retrieval, pages 181–188.

Kimmo Kettunen. 2009. Choosing the best MT programs for CLIR purposes — can MT metrics be helpful? In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 706–712, Berlin, Heidelberg. Springer-Verlag.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 224–227. Association for Computational Linguistics.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

Wessel Kraaij, Jian-Yun Nie, and Michel Simard. 2003. Embedding web-based statistical translation models in cross-language information retrieval. Computational Linguistics, 29:381–419, September.

Jian-Yun Nie and Jiang Chen. 2002. Exploiting the web as parallel corpora for cross-language information retrieval. Web Intelligence, pages 218–239.

Jian-Yun Nie. 2010. Cross-Language Information Retrieval. Morgan & Claypool Publishers.

Vassilina Nikoulina and Marc Dymetman. 2008. Experiments in discriminating phrase-based translations on the basis of syntactic coupling features. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 55–60. Association for Computational Linguistics, June.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2003. Syntax for Statistical Machine Translation: Final report of the Johns Hopkins 2003 Summer Workshop. Technical report, Johns Hopkins University.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Morristown, NJ, USA. Association for Computational Linguistics.

Paul Ogilvie and James P. Callan. 2001. Experiments using the lemur toolkit. In TREC.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001. BLEU: a method for automatic evaluation of machine translation.

Pavel Pecina, Antonio Toral, Andy Way, Vassilis Papavassiliou, Prokopis Prokopidis, and Maria Giagkou. 2011. Towards using web-crawled data for domain adaptation in statistical machine translation. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, pages 297–304, Leuven, Belgium. European Association for Machine Translation.

Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. 2004. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL'04), July.

Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. 2009. Estimating the sentence-level quality of machine translation systems. In Proceedings of the 13th Annual Conference of the EAMT, pages 28–35, Barcelona, Spain.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 764–773, Prague, Czech Republic. Association for Computational Linguistics.

Dan Wu and Daqing He. 2010. A study of query translation using the Google machine translation system. In Computational Intelligence and Software Engineering (CiSE).

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 993–1000.

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04. Association for Computational Linguistics.

Zhongguang Zheng, Zhongjun He, Yao Meng, and Hao Yu. 2010. Domain adaptation for statistical machine translation in development corpus selection. In Universal Communication Symposium (IUCS), 2010 4th International, pages 2–7. IEEE.

Computing Lattice BLEU Oracle Scores for Machine Translation

Artem Sokolov, Guillaume Wisniewski, François Yvon
LIMSI-CNRS & Univ. Paris Sud
BP-133, 91 403 Orsay, France
{firstname.lastname}@limsi.fr

Abstract

The search space of Phrase-Based Statistical Machine Translation (PBSMT) systems can be represented in the form of a directed acyclic graph (lattice). The quality of this search space can thus be evaluated by computing the best achievable hypothesis in the lattice, the so-called oracle hypothesis. For common SMT metrics, this problem is however NP-hard and can only be solved using heuristics.
In this work, ability to faithfully reproduce reference transla- we present two new methods for efficiently tions can have many causes, such as scantiness computing BLEU oracles on lattices: the of the translation table, insufficient expressiveness first one is based on a linear approximation of the corpus BLEU score and is solved us- of reordering models, inadequate scoring func- ing the FST formalism; the second one re- tion, non-literal references, over-pruned lattices, lies on integer linear programming formu- etc. Oracle decoding has several other applica- lation and is solved directly and using the tions: for instance, in (Liang et al., 2006; Chi- Lagrangian relaxation framework. These ang et al., 2008) it is used as a work-around to new decoders are positively evaluated and the problem of non-reachability of the reference compared with several alternatives from the in discriminative training of MT systems. Lattice literature for three language pairs, using lat- tices produced by two PBSMT systems. reranking (Li and Khudanpur, 2009), a promising way to improve MT systems, also relies on oracle decoding to build the training data for a reranking 1 Introduction algorithm. The search space of Phrase-Based Statistical Ma- For sentence level metrics, finding oracle hy- chine Translation (PBSMT) systems has the form potheses in n-best lists is a simple issue; how- of a very large directed acyclic graph. In several ever, solving this problem on lattices proves much softwares, an approximation of this search space more challenging, due to the number of embed- can be outputted, either as a n-best list contain- ded hypotheses, which prevents the use of brute- ing the n top hypotheses found by the decoder, or force approaches. When using BLEU, or rather as a phrase or word graph (lattice) which com- sentence-level approximations thereof, the prob- pactly encodes those hypotheses that have sur- lem is in fact known to be NP-hard (Leusch et vived search space pruning. 
Lattices usually con- al., 2008). This complexity stems from the fact tain much more hypotheses than n-best lists and that the contribution of a given edge to the total better approximate the search space. modified n-gram precision can not be computed Exploring the PBSMT search space is one of without looking at all other edges on the path. the few means to perform diagnostic analysis and Similar (or worse) complexity result are expected 120 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 120–129, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics for other metrics such as METEOR (Banerjee and that it contains a unique initial state q0 and a Lavie, 2005) or TER (Snover et al., 2006). The unique final state qF . Let Πf denote the set of all exact computation of oracles under corpus level paths from q0 to qF in Lf . Each path π ∈ Πf cor- metrics, such as BLEU, poses supplementary com- responds to a possible translation eπ . The job of binatorial problems that will not be addressed in a (conventional) decoder is to find the best path(s) this work. in Lf using scores that combine the edges’ fea- In this paper, we present two original methods ture vectors with the parameters λ ¯ learned during for finding approximate oracle hypotheses on lat- tuning. tices. The first one is based on a linear approxima- In oracle decoding, the decoder’s job is quite tion of the corpus BLEU, that was originally de- different, as we assume that at least a reference signed for efficient Minimum Bayesian Risk de- rf is provided to evaluate the quality of each indi- coding on lattices (Tromble et al., 2008). The sec- vidual hypothesis. The decoder therefore aims at ond one, based on Integer Linear Programming, is finding the path π ∗ that generates the hypothesis an extension to lattices of a recent work on failure that best matches rf . 
For this task, only the output analysis for phrase-based decoders (Wisniewski labels ei will matter, the other informations can be et al., 2010). In this framework, we study two left aside.4 decoding strategies: one based on a generic ILP Oracle decoding assumes the definition of a solver, and one, based on Lagrangian relaxation. measure of the similarity between a reference Our contribution is also experimental as we and a hypothesis. In this paper we will con- compare the quality of the BLEU approxima- sider sentence-level approximations of the popu- tions and the time performance of these new ap- lar BLEU score (Papineni et al., 2002). BLEU is proaches with several existing methods, for differ- formally defined for two parallel corpora, E = ent language pairs and using the lattice generation {ej }Jj=1 and R = {rj }Jj=1 , each containing J capacities of two publicly-available state-of-the- sentences as: art phrase-based decoders: Moses1 and N-code2 . Y n 1/n The rest of this paper is organized as follows. n-BLEU(E, R) = BP · pm , (1) In Section 2, we formally define the oracle decod- m=1 ing task and recall the formalism of finite state automata on semirings. We then describe (Sec- where BP = min(1, e1−c1 (R)/c1 (E) ) is the tion 3) two existing approaches for solving this brevity penalty and pm = cm (E, R)/cm (E) are task, before detailing our new proposals in sec- clipped or modified m-gram precisions: cm (E) is tions 4 and 5. We then report evaluations of the the total number of word m-grams in E; cm (E, R) existing and new oracles on machine translation accumulates over sentences the number of m- tasks. grams in ej that also belong to rj . These counts are clipped, meaning that a m-gram that appears 2 Preliminaries k times in E and l times in R, with k > l, is only counted l times. 
As it is well known, BLEU per- 2.1 Oracle Decoding Task forms a compromise between precision, which is We assume that a phrase-based decoder is able directly appears in Equation (1), and recall, which to produce, for each source sentence f , a lattice is indirectly taken into account via the brevity Lf = hQ, Ξi, with # {Q} vertices (states) and penalty. In most cases, Equation (1) is computed # {Ξ} edges. Each edge carries a source phrase with n = 4 and we use BLEU as a synonym for fi , an associated output phrase ei as well as a fea- 4- BLEU . ture vector h¯ i , the components of which encode BLEU is defined for a pair of corpora, but, as an various compatibility measures between fi and ei . oracle decoder is working at the sentence-level, it We further assume that Lf is a word lattice, should rely on an approximation of BLEU that can meaning that each ei carries a single word3 and linear chain of arcs. 4 The algorithms described below can be straightfor- 1 http://www.statmt.org/moses/ wardly generalized to compute oracle hypotheses under 2 http://ncode.limsi.fr/ combined metrics mixing model scores and quality measures 3 Converting a phrase lattice to a word lattice is a simple (Chiang et al., 2008), by weighting each edge with its model matter of redistributing a compound input or output over a score and by using these weights down the pipe. 121 evaluate the similarity between a single hypoth- 2.3 Finite State Acceptors esis and its reference. This approximation intro- The implementations of the oracles described in duces a discrepancy as gathering sentences with the first part of this work (sections 3 and 4) use the the highest (local) approximation may not result common formalism of finite state acceptors (FSA) in the highest possible (corpus-level) BLEU score. over different semirings and are implemented us- Let BLEU0 be such a sentence-level approximation ing the generic OpenFST toolbox (Allauzen et al., of BLEU. 
Then lattice oracle decoding is the task 2007). of finding an optimal path π ∗ (f ) among all paths A (⊕, ⊗)-semiring K over a set K is a system Πf for a given f , and amounts to the following hK, ⊕, ⊗, ¯0, ¯1i, where hK, ⊕, ¯0i is a commutative optimization problem: monoid with identity element ¯0, and hK, ⊗, ¯1i is a monoid with identity element ¯1. ⊗ distributes π ∗ (f ) = arg max BLEU0 (eπ , rf ). (2) over ⊕, so that a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c) π∈Πf and (b ⊕ c) ⊗ a = (b ⊗ a) ⊕ (c ⊗ a) and element ¯0 annihilates K (a ⊗ ¯0 = ¯0 ⊗ a = ¯0). 2.2 Compromises of Oracle Decoding Let A = (Σ, Q, I, F, E) be a weighted finite- state acceptor with labels in Σ and weights in K, As proved by Leusch et al. (2008), even with meaning that the transitions (q, σ, q 0 ) in A carry a brevity penalty dropped, the problem of deciding weight w ∈ K. Formally, E is a mapping from whether a confusion network contains a hypoth- (Q × Σ × Q) into K; likewise, initial I and fi- esis with clipped uni- and bigram precisions all nal weight F functions are mappings from Q into equal to 1.0 is NP-complete (and so is the asso- K. We borrow the notations of Mohri (2009): ciated optimization problem of oracle decoding if ξ = (q, a, q 0 ) is a transition in domain(E), for 2-BLEU). The case of more general word and p(ξ) = q (resp. n(ξ) = q 0 ) denotes its origin phrase lattices and 4-BLEU score is consequently (resp. destination) state, w(ξ) = σ its label and also NP-complete. This complexity stems from E(ξ) its weight. These notations extend to paths: chaining up of local unigram decisions that, due if π is a path in A, p(π) (resp. n(π)) is its initial to the clipping constraints, have non-local effect (resp. ending) state and w(π) is the label along on the bigram precision scores. It is consequently the path. 
A finite state transducer (FST) is an FSA necessary to keep a possibly exponential num- with output alphabet, so that each transition car- ber of non-recombinable hypotheses (character- ries a pair of input/output symbols. ized by counts for each n-gram in the reference) As discussed in Sections 3 and 4, several oracle until very late states in the lattice. decoding algorithms can be expressed as shortest- These complexity results imply that any oracle path problems, provided a suitable definition of decoder has to waive either the form of the objec- the underlying acceptor and associated semiring. tive function, replacing BLEU with better-behaved In particular, quantities such as: scoring functions, or the exactness of the solu- M tion, relying on approximate heuristic search al- E(π), (3) gorithms. π∈Π(A) In Table 1, we summarize different compro- where the total weight of a successful path π = mises that the existing (section 3), as well as ξ1 . . . ξl in A is computed as: our novel (sections 4 and 5) oracle decoders, l have to make. The “target” and “target level” O columns specify the targeted score. None of E(π) =I(p(ξ1 )) ⊗ E(ξi ) ⊗ F (n(ξl )) i=1 the decoders optimizes it directly: their objec- tive function is rather the approximation of BLEU can be efficiently found by generic shortest dis- given in the “target replacement” column. Col- tance algorithms over acyclic graphs (Mohri, umn “search” details the accuracy of the target re- 2002). For FSA-based implementations over placement optimization. Finally, columns “clip- semirings where ⊕ = max, the optimization ping” and “brevity” indicate whether the corre- problem (2) is thus reduced to Equation (3), while sponding properties of BLEU score are considered the oracle-specific details can be incorporated into in the target substitute and in the search algorithm. in the definition of ⊗. 
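To make Equations (1) and (2) concrete, the following toy Python sketch (ours, not the authors' implementation) scores a single tokenized hypothesis against a single reference with clipped m-gram precisions and the brevity penalty. Whitespace tokenization and a small additive offset of 0.1 on the precisions (to avoid null counts on short sentences) are assumptions of the sketch, not part of Equation (1).

```python
import math
from collections import Counter

def ngrams(tokens, m):
    """All contiguous m-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1)]

def sentence_bleu(hyp, ref, n=4):
    """Sentence-level n-BLEU in the spirit of Equation (1): clipped
    m-gram precisions combined by a geometric mean, times the brevity
    penalty BP = min(1, e^(1 - |r|/|e|))."""
    e, r = hyp.split(), ref.split()
    prod = 1.0
    for m in range(1, n + 1):
        ce = Counter(ngrams(e, m))                           # c_m(E)
        cr = Counter(ngrams(r, m))                           # reference counts
        clipped = sum(min(c, cr[g]) for g, c in ce.items())  # clipped c_m(E, R)
        prod *= (clipped + 0.1) / (sum(ce.values()) + 0.1)   # offset: assumption
    bp = min(1.0, math.exp(1.0 - len(r) / max(len(e), 1)))   # brevity penalty
    return bp * prod ** (1.0 / n)
```

Note how clipping caps the credit of a repeated word: "the the the the" against the reference "the cat" earns only one unigram match, which is exactly the non-local effect that makes lattice oracle decoding hard.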
Table 1: Recapitulative overview of oracle decoders.

              oracle     target    target level  target replacement         search  clipping  brevity
  existing    LM-2g/4g   2/4-BLEU  sentence      P2(e;r) or P4(e;r)         exact   no        no
              PB         4-BLEU    sentence      partial log BLEU (4)       appr.   no        no
              PB`        4-BLEU    sentence      partial log BLEU (4)       appr.   no        yes
  this paper  LB-2g/4g   2/4-BLEU  corpus        linear appr. lin BLEU (5)  exact   no        yes
              SP         1-BLEU    sentence      unigram count              exact   no        yes
              ILP        2-BLEU    sentence      uni/bi-gram counts (7)     appr.   yes       yes
              RLX        2-BLEU    sentence      uni/bi-gram counts (8)     exact   yes       yes

3 Existing Algorithms

In this section, we describe our reimplementation of two approximate search algorithms that have been proposed in the literature to solve the oracle decoding problem for BLEU. In addition to their approximate nature, neither of them accounts for the fact that the count of each matching word has to be clipped.

3.1 Language Model Oracle (LM)

The simplest approach we consider is introduced in (Li and Khudanpur, 2009), where oracle decoding is reduced to the problem of finding the most likely hypothesis under an n-gram language model trained with the sole reference translation.

Let us suppose we have an n-gram language model that gives a probability P(e_n | e_1 … e_{n−1}) of word e_n given the n − 1 previous words. The probability of a hypothesis e is then P_n(e|r) = ∏_i P(e_{i+n} | e_i … e_{i+n−1}). The language model can conveniently be represented as an FSA A_LM, with each arc carrying a negative log-probability weight and with additional ρ-type failure transitions to accommodate back-off arcs. If we train, for each source sentence f, a separate language model A_LM(r_f) using only the reference r_f, oracle decoding amounts to finding the shortest (most probable) path in the weighted FSA resulting from the composition L ◦ A_LM(r_f) over the (min, +)-semiring:

    π*_LM(f) = ShortestPath(L ◦ A_LM(r_f)).

This approach replaces the optimization of n-BLEU with a search for the most probable path under a simplistic n-gram language model. One may expect the most probable path to select frequent n-grams from the reference, thus augmenting n-BLEU.

3.2 Partial BLEU Oracle (PB)

Another approach is put forward in (Dreyer et al., 2007) and used in (Li and Khudanpur, 2009): oracle translations are shortest paths in a lattice L, where the weight of each path π is the sentence-level log BLEU(π) score of the corresponding complete or partial hypothesis:

    log BLEU(π) = (1/4) ∑_{m=1..4} log p_m.   (4)

Here, the brevity penalty is ignored and n-gram precisions are offset to avoid null counts: p_m = (c_m(e_π, r) + 0.1)/(c_m(e_π) + 0.1).

This approach has been reimplemented using the FST formalism by defining a suitable semiring. Let each weight of the semiring keep a set of tuples accumulated up to the current state of the lattice. Each tuple contains three words of recent history, a partial hypothesis, as well as the current values of the length of the partial hypothesis, the n-gram counts (4 numbers), and the sentence-level log BLEU score defined by Equation (4). In the beginning, each arc is initialized with a singleton set containing one tuple with a single word as the partial hypothesis. For the semiring operations we define one common ⊗-operation and two versions of the ⊕-operation:

• L1 ⊗_PB L2 appends the word on the edge of L2 to L1's hypotheses, shifts their recent histories and updates n-gram counts, lengths, and the current score;
• L1 ⊕_PB L2 merges all sets from L1 and L2 and recombines those having the same recent history;
• L1 ⊕_PB` L2 merges all sets from L1 and L2 and recombines those having the same recent history and the same hypothesis length.

If several hypotheses have the same recent history (and length in the case of ⊕_PB`), recombination removes all of them but the one with the largest current BLEU score. The optimal path is then found by launching the generic ShortestDistance(L) algorithm over one of the semirings above.

The (⊕_PB`, ⊗_PB)-semiring, in which the equal-length requirement also implies equal brevity penalties, is more conservative in recombining hypotheses and should achieve a final BLEU at least as good as that obtained with the (⊕_PB, ⊗_PB)-semiring.(5)

[Figure 1: Examples of the ∆n automata for Σ = {0, 1} and n = 1…3. Initial and final states are marked, respectively, with bold and with double borders. Note that arcs between final states are weighted with 0, while in reality they will have this weight only if the corresponding n-gram does not appear in the reference.]

4 Linear BLEU Oracle (LB)

In this section, we propose a new oracle based on the linear approximation of the corpus BLEU introduced in (Tromble et al., 2008). While this approximation was earlier used for Minimum Bayes Risk decoding in lattices (Tromble et al., 2008; Blackwood et al., 2010), we show here how it can also be used to approximately compute an oracle translation.

Given five real parameters θ_0…4 and a word vocabulary Σ, Tromble et al. (2008) showed that one can approximate the corpus BLEU with its first-order (linear) Taylor expansion:

    lin BLEU(π) = θ_0 |e_π| + ∑_{n=1..4} θ_n ∑_{u ∈ Σ^n} c_u(e_π) δ_u(r),   (5)

where c_u(e) is the number of times the n-gram u appears in e, and δ_u(r) is an indicator variable testing the presence of u in r.

To exploit this approximation for oracle decoding, we construct four weighted FSTs ∆_n containing a (final) state for each possible (n−1)-gram, and all weighted transitions of the kind (σ_1^{n−1}, σ_n : σ_1^n / θ_n × δ_{σ_1^n}(r), σ_2^n), where the σ's are in Σ, and the input word sequence σ_1^{n−1} and the output sequence σ_2^n are, respectively, the maximal prefix and suffix of the n-gram σ_1^n.

In supplement, we add auxiliary states corresponding to m-grams (m < n − 1), whose functional purpose is to help reach one of the main (n−1)-gram states. There are (|Σ|^{n−1} − 1)/(|Σ| − 1), n > 1, such supplementary states, and their transitions are (σ_1^k, σ_{k+1} : σ_1^{k+1} / 0, σ_1^{k+1}), k = 1 … n−2. Apart from these auxiliary states, the rest of the graph (i.e., all final states) reproduces the structure of the well-known de Bruijn graph B(Σ, n) (see Figure 1).

To actually compute the best hypothesis, we first weight all arcs in the input FSA L with θ_0 to obtain ∆_0. This makes each word's weight equal in a hypothesis path, and the total weight of a path in ∆_0 is proportional to the number of words in it. Then, by sequentially composing ∆_0 with the other ∆_n's, we discount arcs whose output n-gram corresponds to a matching n-gram. The amount of discount is regulated by the ratio between the θ_n's for n > 0. With all operations performed over the (min, +)-semiring, the oracle translation is then given by:

    π*_LB = ShortestPath(∆_0 ◦ ∆_1 ◦ ∆_2 ◦ ∆_3 ◦ ∆_4).

We set the parameters θ_n as in (Tromble et al., 2008): θ_0 = 1, roughly corresponding to the brevity penalty (each word in a hypothesis adds up equally to the final path length), and θ_n = −(4p · r^{n−1})^{−1}, which are increasing discounts for matching n-grams.

(5) See, however, the experiments in Section 6.
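As an illustration of Equation (5) and the (min, +) convention above, here is a toy scorer (ours, not the paper's FST implementation) for a single tokenized hypothesis: θ_0 = 1 charges every word, playing the brevity-penalty role, and the negative θ_n discount n-grams present in the reference, so lower totals correspond to better hypotheses. The composition of the ∆_n transducers over lattices is not reproduced.

```python
from collections import Counter

def lin_bleu(hyp, ref, p=0.3, r=0.2, theta0=1.0, n_max=4):
    """Linear corpus-BLEU surrogate of Equation (5) for one hypothesis
    string. theta_n = -(4 p r^(n-1))^-1 are the increasing discounts
    granted to matching n-grams; clipping is deliberately ignored, as
    in the LB oracle."""
    e, ref_toks = hyp.split(), ref.split()
    score = theta0 * len(e)                                  # theta_0 * |e_pi|
    for n in range(1, n_max + 1):
        theta_n = -1.0 / (4.0 * p * r ** (n - 1))
        in_ref = {tuple(ref_toks[i:i + n])                   # delta_u(r) = 1
                  for i in range(len(ref_toks) - n + 1)}
        c_u = Counter(tuple(e[i:i + n]) for i in range(len(e) - n + 1))
        # sum over u of c_u(e_pi) * delta_u(r)
        score += theta_n * sum(c for u, c in c_u.items() if u in in_ref)
    return score
```

Under this convention a hypothesis identical to the reference accumulates the largest discounts and thus the smallest (most negative) total, mirroring the shortest-path search over the composed ∆ machines.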
The values of p and r were found by grid search with a 0.05 step value. A typical result of the grid evaluation of the LB oracle for the German-to-English WMT'11 task is displayed in Figure 2. The optimal values for the other pairs of languages were roughly in the same ballpark, with p ≈ 0.3 and r ≈ 0.2.

[Figure 2: Performance of the LB-4g oracle for different combinations of p and r on the WMT'11 de2en task.]

5 Oracles with n-gram Clipping

In this section, we describe two new oracle decoders that take n-gram clipping into account. These oracles leverage the well-known fact that the shortest path problem, at the heart of all the oracles described so far, can be reduced straightforwardly to an Integer Linear Programming (ILP) problem (Wolsey, 1998). Once oracle decoding is formulated as an ILP problem, it is relatively easy to introduce additional constraints, for instance to enforce n-gram clipping. We will first describe the optimization problem of oracle decoding and then present several ways to efficiently solve it.

5.1 Problem Description

Throughout this section, abusing the notations, we will also think of an edge ξ_i as a binary variable describing whether the edge is "selected" or not. The set {0, 1}^{#{Ξ}} of all possible edge assignments will be denoted by P. Note that Π, the set of all paths in the lattice, is a subset of P: by enforcing some constraints on an assignment ξ in P, it can be guaranteed that it will represent a path in the lattice. For the sake of presentation, we assume that each edge ξ_i generates a single word w(ξ_i), and we focus first on finding the optimal hypothesis with respect to the sentence approximation of the 1-BLEU score.

As 1-BLEU is decomposable, it is possible to define, for every edge ξ_i, an associated reward θ_i that describes the edge's local contribution to the hypothesis score. For instance, for the sentence approximation of the 1-BLEU score, the rewards are defined as:

    θ_i = Θ_1 if w(ξ_i) is in the reference, −Θ_2 otherwise,

where Θ_1 and Θ_2 are two positive constants chosen to maximize the corpus BLEU score.(6) The constant Θ_1 (resp. Θ_2) is a reward (resp. a penalty) for generating a word that is in the reference (resp. not in the reference). The score of an assignment ξ ∈ P is then defined as: score(ξ) = ∑_{i=1..#{Ξ}} ξ_i · θ_i. This score can be seen as a compromise between the number of common words in the hypothesis and the reference (accounting for recall) and the number of words of the hypothesis that do not appear in the reference (accounting for precision).

As explained in Section 2.3, finding the oracle hypothesis amounts to solving the shortest distance (or path) problem (3), which can be reformulated as a constrained optimization problem (Wolsey, 1998):

    arg max_{ξ ∈ P} ∑_{i=1..#{Ξ}} ξ_i · θ_i   (6)
    s.t. ∑_{ξ ∈ Ξ+(q_0)} ξ = 1,   ∑_{ξ ∈ Ξ−(q_F)} ξ = 1,
         ∑_{ξ ∈ Ξ+(q)} ξ − ∑_{ξ ∈ Ξ−(q)} ξ = 0,   q ∈ Q \ {q_0, q_F},

where q_0 (resp. q_F) is the initial (resp. final) state of the lattice and Ξ−(q) (resp. Ξ+(q)) denotes the set of incoming (resp. outgoing) edges of state q. These path constraints ensure that the solution of the problem is a valid path in the lattice.

The optimization problem in Equation (6) can be further extended to take clipping into account. Let us introduce, for each word w, a variable γ_w that denotes the number of times w appears in the hypothesis, clipped to the number of times it appears in the reference. Formally, γ_w is defined by:

    γ_w = min( ∑_{ξ ∈ Ω(w)} ξ, c_w(r) ),

where Ω(w) is the subset of edges generating w, ∑_{ξ ∈ Ω(w)} ξ is the number of occurrences of w in the solution, and c_w(r) is the number of occurrences of w in the reference r. Using the γ_w variables, we define a "clipped" approximation of 1-BLEU:

    Θ_1 · ∑_w γ_w − Θ_2 · ( ∑_{i=1..#{Ξ}} ξ_i − ∑_w γ_w ).

Indeed, the clipped number of words in the hypothesis that appear in the reference is given by ∑_w γ_w, and ∑_{i=1..#{Ξ}} ξ_i − ∑_w γ_w corresponds to the number of words in the hypothesis that do not appear in the reference or that are surplus to the clipped count.

Finally, the clipped lattice oracle is defined by the following optimization problem:

    arg max_{ξ ∈ P, γ} (Θ_1 + Θ_2) · ∑_w γ_w − Θ_2 · ∑_{i=1..#{Ξ}} ξ_i   (7)
    s.t. γ_w ≥ 0,   γ_w ≤ c_w(r),   γ_w ≤ ∑_{ξ ∈ Ω(w)} ξ,
         ∑_{ξ ∈ Ξ+(q_0)} ξ = 1,   ∑_{ξ ∈ Ξ−(q_F)} ξ = 1,
         ∑_{ξ ∈ Ξ+(q)} ξ − ∑_{ξ ∈ Ξ−(q)} ξ = 0,   q ∈ Q \ {q_0, q_F},

where the first three sets of constraints are the linearization of the definition of γ_w, made possible by the positivity of Θ_1 and Θ_2, and the last three sets of constraints are the path constraints.

In our implementation, we generalized this optimization problem to bigram lattices, in which each edge is labeled by the bigram it generates. Such bigram FSAs can be produced by composing the word lattice with ∆_2 from Section 4. In this case, the reward of an edge is defined as a combination of the (clipped) numbers of unigram and bigram matches, and solving the optimization problem yields a 2-BLEU optimal hypothesis. The approach can be further generalized to higher-order BLEU or other metrics, as long as the reward of an edge can be computed locally.

The constrained optimization problem (7) can be solved efficiently using off-the-shelf ILP solvers.(7)

5.2 Shortest Path Oracle (SP)

As a trivial special class of the above formulation, we also define a Shortest Path Oracle (SP) that solves the optimization problem in (6). As no clipping constraints apply, it can be solved efficiently using the standard Bellman algorithm.

5.3 Oracle Decoding through Lagrangian Relaxation (RLX)

In this section, we introduce another method to solve problem (7) without relying on an external ILP solver. Following (Rush et al., 2010; Chang and Collins, 2011), we propose an original method for oracle decoding based on Lagrangian relaxation. This method relies on the idea of relaxing the clipping constraints: starting from an unconstrained problem, count clipping is enforced by incrementally strengthening the weight of paths satisfying the constraints.

The oracle decoding problem with clipping constraints amounts to solving:

    arg min_{ξ ∈ Π} − ∑_{i=1..#{Ξ}} ξ_i · θ_i   (8)
    s.t. ∑_{ξ ∈ Ω(w)} ξ ≤ c_w(r),   w ∈ r,

where, abusing the notations, r also denotes the set of words in the reference. For the sake of clarity, the path constraints are incorporated into the domain (the arg min runs over Π and not over P). To solve this optimization problem, we consider its dual form and use Lagrangian relaxation to deal with the clipping constraints.

Let λ = {λ_w}, w ∈ r, be positive Lagrange multipliers, one for each different word of the reference; then the Lagrangian of problem (8) is:

    L(λ, ξ) = − ∑_{i=1..#{Ξ}} ξ_i θ_i + ∑_{w ∈ r} λ_w ( ∑_{ξ ∈ Ω(w)} ξ − c_w(r) ).

The dual objective is L(λ) = min_ξ L(λ, ξ) and the dual problem is: max_{λ ≥ 0} L(λ). To solve the latter, we first need to work out the dual objective:

    ξ* = arg min_{ξ ∈ Π} L(λ, ξ)
       = arg min_{ξ ∈ Π} ∑_{i=1..#{Ξ}} ξ_i ( λ_{w(ξ_i)} − θ_i ),

where we assume that λ_{w(ξ_i)} is 0 when the word w(ξ_i) is not in the reference. In the same way as in Section 5.2, the solution of this problem can be efficiently retrieved with a shortest path algorithm.

It is possible to optimize L(λ) by noticing that it is a concave function. It can be shown (Chang and Collins, 2011) that, at convergence, the clipping constraints will be enforced in the optimal solution. In this work, we chose to use a simple gradient descent to solve the dual problem.

[Table 2: Test BLEU scores and oracle scores on 100-best lists for the evaluated systems.

            decoder   fr2en   de2en   en2de
  test      N-code    27.88   22.05   15.83
            Moses     27.68   21.85   15.89
  oracle    N-code    36.36   29.22   21.18
            Moses     35.25   29.13   22.03]

Footnotes:
(6) We tried several combinations of Θ_1 and Θ_2 and kept the one that had the highest corpus 4-BLEU score.
(7) In our experiments we used Gurobi (Optimization, 2010), a commercial ILP solver that offers a free academic license.
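Both the SP oracle of Section 5.2 and the inner arg min of the relaxation above reduce to a single Bellman pass over the acyclic lattice (with rewards θ_i replaced by θ_i − λ_{w(ξ_i)} in the latter case). A minimal sketch of that pass, assuming a lattice given as (src, dst, word) triples whose state indices are already topologically ordered; all names here are our own, hypothetical ones:

```python
def sp_oracle(n_states, edges, reference, theta1=1.0, theta2=1.0):
    """SP oracle of Section 5.2: maximize problem (6), no clipping.
    State 0 is q0 and n_states - 1 is qF; every edge satisfies
    src < dst, so sorting by src yields a valid processing order.
    Rewards follow the 1-BLEU approximation: +theta1 for a word in
    the reference, -theta2 otherwise."""
    ref = set(reference.split())
    edges = sorted(edges)                 # topological order by source state
    NEG = float("-inf")
    best = [NEG] * n_states               # best score of a path q0 -> q
    back = [None] * n_states              # index of the best incoming edge
    best[0] = 0.0
    for i, (src, dst, word) in enumerate(edges):
        if best[src] == NEG:              # src unreachable from q0
            continue
        score = best[src] + (theta1 if word in ref else -theta2)
        if score > best[dst]:
            best[dst], back[dst] = score, i
    # follow back-pointers from qF to recover the oracle hypothesis
    words, state = [], n_states - 1       # assumes qF is reachable
    while state != 0:
        src, _, word = edges[back[state]]
        words.append(word)
        state = src
    return " ".join(reversed(words))
```

For the RLX loop, the same pass would simply be rerun at every iteration with the edge rewards shifted by the current multipliers λ_w.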
A subgradient of the dual objective is:

    ∂L(λ)/∂λ_w = ∑_{ξ ∈ Ω(w) ∩ ξ*} ξ − c_w(r).

Each component of the gradient corresponds to the difference between the number of times the word w appears in the hypothesis and the number of times it appears in the reference. The algorithm below sums up the optimization of task (8). In the algorithm, α(t) corresponds to the step size at the t-th iteration; in our experiments we used a constant step size of 0.1. Compared to the usual gradient descent algorithm, there is an additional projection step of λ onto the positive orthant, which enforces the constraint λ ≥ 0.

    ∀w, λ_w^(0) ← 0
    for t = 1 → T do
        ξ*(t) ← arg min_ξ ∑_i ξ_i · (λ_{w(ξ_i)} − θ_i)
        if all clipping constraints are enforced then
            optimal solution found
        else
            for w ∈ r do
                n_w ← number of occurrences of w in ξ*(t)
                λ_w^(t) ← λ_w^(t−1) + α(t) · (n_w − c_w(r))
                λ_w^(t) ← max(0, λ_w^(t))

6 Experiments

For the proposed new oracles and the existing approaches, we compare the quality of oracle translations and the average time per sentence needed to compute them(8) on several datasets for 3 language pairs, using lattices generated by two open-source decoders: N-code and Moses(9) (Figures 3 and 4). Systems were trained on the data provided for the WMT'11 Evaluation task(10), tuned on the WMT'09 test data and evaluated on the WMT'10 test set(11) to produce lattices. The BLEU test scores and oracle scores on 100-best lists with the approximation (4) for N-code and Moses are given in Table 2. It is not until considering 10,000-best lists that n-best oracles achieve performance comparable to the (mediocre) SP oracle.

To make a fair comparison with the ILP and RLX oracles, which optimize 2-BLEU, we included 2-BLEU versions of the LB and LM oracles, identified below with the "-2g" suffix. The two versions of the PB oracle are respectively denoted as PB and PB`, by the type of the ⊕-operation they consider (Section 3.2). Parameters p and r for the LB-4g oracle for N-code were found with grid search and reused for Moses: p = 0.25, r = 0.15 (fr2en); p = 0.175, r = 0.575 (en2de); and p = 0.35, r = 0.425 (de2en). Correspondingly, for the LB-2g oracle: p = 0.3, r = 0.15; p = 0.3, r = 0.175; and p = 0.575, r = 0.1.

The proposed LB, ILP and RLX oracles were the best-performing oracles, with the ILP and RLX oracles being considerably faster, suffering only a negligible decrease in BLEU compared to the 4-BLEU-optimized LB oracle. We stopped the RLX oracle after 20 iterations, as letting it converge had a small negative effect (~1 point of the corpus BLEU), because of the sentence/corpus discrepancy ushered in by the BLEU score approximation.

Experiments showed consistently inferior performance of the LM oracle, resulting from the optimization of the sentence probability rather than BLEU. The PB oracle often performed comparably to our new oracles, however with sporadic resource-consumption bursts that are difficult to avoid without more cursory hypothesis recombination strategies and the induced effect on translation quality. The length-aware PB` oracle has unexpectedly poorer scores compared to its length-agnostic PB counterpart, while it should, at least, stay even, as it takes the brevity penalty into account. We attribute this fact to the complex effect of clipping coupled with the lack of control over the process of selecting one hypothesis among several having the same BLEU score, length and recent history. Anyhow, the BLEU scores of both PB oracles are only marginally different, so PB`'s conservative pruning policy and, consequently, much heavier memory consumption make it an unwanted choice.

[Figure 3: Oracles performance for N-code lattices. Three panels (fr2en, de2en, en2de) plot BLEU and average time per sentence for the RLX, ILP, LB-4g, LB-2g, PB, PB`, SP, LM-4g and LM-2g oracles.]

[Figure 4: Oracles performance for Moses lattices pruned with parameter -b 0.5. Three panels (fr2en, de2en, en2de) plot BLEU and average time per sentence for the same oracles as in Figure 3.]

Using 2-BLEU and 4-BLEU oracles yields comparable performance, which confirms the intuition that hypotheses sharing many 2-grams would likely have many common 3- and 4-grams as well. Taking into consideration the exceptional speed of the LB-2g oracle, in practice one can safely optimize for 2-BLEU instead of 4-BLEU, saving large amounts of time for oracle decoding on long sentences.

Overall, these experiments accentuate the acuteness of the scoring problems that plague modern decoders: very good hypotheses exist for most input sentences, but are poorly evaluated by a linear combination of standard feature functions. Even though the tuning procedure can be held responsible for part of the problem, the comparison between lattice and n-best oracles shows that the beam search leaves good hypotheses out of the n-best list until very high values of n that are never used in practice.

7 Conclusion

We proposed two methods for finding oracle translations in lattices, based, respectively, on a linear approximation to the corpus-level BLEU and on integer linear programming techniques. We also proposed a variant of the latter approach based on Lagrangian relaxation that does not rely on a third-party ILP solver. All these oracles have superior performance to existing approaches, in terms of the quality of the found translations, resource consumption and, for the LB-2g oracle, speed. It is thus possible to use better approximations of BLEU than was previously done, taking the corpus-based nature of BLEU or the clipping constraints into account, delivering better oracles without compromising speed.

Acknowledgments

This work has been partially funded by OSEO under the Quaero program.

References

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: a general and efficient weighted finite-state transducer library. In Proc. of the Int. Conf. on Implementation and Application of Automata, pages 11-23.
Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224-232, Athens, Greece.
Mehryar Mohri. 2002. Semiring frameworks and algorithms for shortest-distance problems. J. Autom. Lang. Comb., 7:321-350.
Mehryar Mohri. 2009. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 6, pages 213-254.
Gurobi Optimization. 2010. Gurobi optimizer, April. Version 3.0.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311-318.

Footnotes:
(8) Experiments were run in parallel on a server with 64G of RAM and 2 Xeon CPUs with 4 cores at 2.3 GHz.
(9) As the ILP (and RLX) oracles were implemented in Python, we pruned Moses lattices to accelerate task preparation for them.
(10) http://www.statmt.org/wmt2011
(11) All BLEU scores are reported using the multi-bleu.pl script.
matic evaluation of machine translation. In Proc. of Satanjeev Banerjee and Alon Lavie. 2005. ME- the Annual Meeting of the ACL, pages 311–318. TEOR: An automatic metric for MT evaluation with Alexander M. Rush, David Sontag, Michael Collins, improved correlation with human judgments. In and Tommi Jaakkola. 2010. On dual decomposi- Proc. of the ACL Workshop on Intrinsic and Extrin- tion and linear programming relaxations for natural sic Evaluation Measures for Machine Translation, language processing. In Proc. of the 2010 Conf. on pages 65–72, Ann Arbor, MI, USA. EMNLP, pages 1–11, Stroudsburg, PA, USA. Graeme Blackwood, Adri`a de Gispert, and William Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin- Byrne. 2010. Efficient path counting transducers nea Micciulla, and John Makhoul. 2006. A study for minimum bayes-risk decoding of statistical ma- of translation edit rate with targeted human anno- chine translation lattices. In Proc. of the ACL 2010 tation. In Proc. of the Conf. of the Association for Conference Short Papers, pages 27–32, Strouds- Machine Translation in the America (AMTA), pages burg, PA, USA. 223–231. Yin-Wen Chang and Michael Collins. 2011. Exact de- Roy W. Tromble, Shankar Kumar, Franz Och, and coding of phrase-based translation models through Wolfgang Macherey. 2008. Lattice minimum lagrangian relaxation. In Proc. of the 2011 Conf. on bayes-risk decoding for statistical machine transla- EMNLP, pages 26–37, Edinburgh, UK. tion. In Proc. of the Conf. on EMNLP, pages 620– 629, Stroudsburg, PA, USA. David Chiang, Yuval Marton, and Philip Resnik. Marco Turchi, Tijl De Bie, and Nello Cristianini. 2008. Online large-margin training of syntactic 2008. Learning performance of a machine trans- and structural translation features. In Proc. of the lation system: a statistical and computational anal- 2008 Conf. on EMNLP, pages 224–233, Honolulu, ysis. In Proc. of WMT, pages 35–43, Columbus, Hawaii. Ohio. Markus Dreyer, Keith B. Hall, and Sanjeev P. 
Khudanpur. 2007. Comparing reordering constraints for SMT using efficient BLEU oracle computation. In Proc. of the Workshop on Syntax and Structure in Statistical Translation, pages 103–110, Morristown, NJ, USA.

Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of the 2008 Conf. on EMNLP, pages 839–847, Honolulu, Hawaii.

Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of Human Language Technologies: The 2009 Annual Conf. of the North American Chapter of the ACL, Companion Volume: Short Papers, pages 9–12, Morristown, NJ, USA.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of the 21st Int. Conf. on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 761–768, Morristown, NJ, USA.

Guillaume Wisniewski, Alexandre Allauzen, and François Yvon. 2010. Assessing phrase-based translation models with oracle decoding. In Proc. of the 2010 Conf. on EMNLP, pages 933–943, Stroudsburg, PA, USA.

L. Wolsey. 1998. Integer Programming. John Wiley & Sons, Inc.

Toward Statistical Machine Translation without Parallel Corpora

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, David Yarowsky
Center for Language and Speech Processing, Johns Hopkins University

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130–140, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed and show that 80%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.

1 Introduction

The parameters of statistical models of translation are typically estimated from large bilingual parallel corpora (Brown et al., 1993). However, these resources are not available for most language pairs, and they are expensive to produce in quantities sufficient for building a good translation system (Germann, 2001). We attempt an entirely different approach; we use cheap and plentiful monolingual resources to induce an end-to-end statistical machine translation system. In particular, we extend the long line of work on inducing translation lexicons (beginning with Rapp (1995)) and propose to use multiple independent cues present in monolingual texts to estimate lexical and phrasal translation probabilities for large, MT-scale phrase-tables. We then introduce a novel algorithm to estimate reordering features from monolingual data alone, and we report the performance of a phrase-based statistical model (Koehn et al., 2003) estimated using these monolingual features.

Most of the prior work on lexicon induction is motivated by the idea that it could be applied to machine translation, but stops short of actually doing so. Lexicon induction holds the potential to create machine translation systems for languages which do not have extensive parallel corpora. Training would only require two large monolingual corpora and a small bilingual dictionary, if one is available. The idea is that intrinsic properties of monolingual data (possibly along with a handful of bilingual pairs to act as example mappings) can provide independent but informative cues to learn translations, because words (and phrases) behave similarly across languages. This work is the first attempt to extend and apply these ideas to an end-to-end machine translation pipeline. While we make an explicit assumption that a table of phrasal translations is given a priori, we induce every other parameter of a full phrase-based translation system from monolingual data alone.

The contributions of this work are:

• In Section 2.2 we analyze the challenges of using bilingual lexicon induction for statistical MT (performance on low frequency items, and moving from words to phrases).

• In Sections 3.1 and 3.2 we use multiple cues present in monolingual data to estimate lexical and phrasal translation scores.

• In Section 3.3 we propose a novel algorithm for estimating phrase reordering features from monolingual texts.

• Finally, in Section 5 we systematically drop feature functions from a phrase table and then replace them with monolingually estimated equivalents, reporting end-to-end translation quality.

2 Background

We begin with a brief overview of the standard phrase-based statistical machine translation model. Here, we define the parameters which we later replace with monolingual alternatives.
We continue with a discussion of bilingual lexicon induction; we extend these methods to estimate the monolingual parameters in Section 3. This approach allows us to replace expensive/rare bilingual parallel training data with two large monolingual corpora, a small bilingual dictionary, and an ≈2,000-sentence bilingual development set, which are comparatively plentiful/inexpensive.

[Figure 1: The reordering probabilities from the phrase-based models are estimated from bilingual data by calculating how often in the parallel corpus a phrase pair (f, e) is oriented with the preceding phrase pair in the 3 types of orientations (monotone, swapped, and discontinuous). The example aligns the German sentence "Wieviel sollte man aufgrund seines Profils in Facebook verdienen" with "How much should you charge for your Facebook profile", labeling each English phrase with its orientation (m, s, or d).]

2.1 Parameters of phrase-based SMT

Statistical machine translation (SMT) was first formulated as a series of probabilistic models that learn word-to-word correspondences from sentence-aligned bilingual parallel corpora (Brown et al., 1993). Current methods, including phrase-based (Och, 2002; Koehn et al., 2003) and hierarchical models (Chiang, 2005), typically start by word-aligning a bilingual parallel corpus (Och and Ney, 2003). They extract multi-word phrases that are consistent with the Viterbi word alignments and use these phrases to build new translations. A variety of parameters are estimated using the bitexts. Here we review the parameters of the standard phrase-based translation model (Koehn et al., 2007). Later we will show how to estimate them using monolingual texts instead. These parameters are:

• Phrase pairs. Phrase extraction heuristics (Venugopal et al., 2003; Tillmann, 2003; Och and Ney, 2004) produce a set of phrase pairs (e, f) that are consistent with the word alignments. In this paper we assume that the phrase pairs are given (without any scores), and we induce every other parameter of the phrase-based model from monolingual data.

• Phrase translation probabilities. Each phrase pair has a list of associated feature functions (FFs). These include phrase translation probabilities, φ(e|f) and φ(f|e), which are typically calculated via maximum likelihood estimation.

• Lexical weighting. Since MLE overestimates φ for phrase pairs with sparse counts, lexical weighting FFs are used to smooth. Average word translation probabilities, w(ei|fj), are calculated via phrase-pair-internal word alignments.

• Reordering model. Each phrase pair (e, f) also has associated reordering parameters, po(orientation|f, e), which indicate the distribution of its orientation with respect to the previously translated phrase. Orientations are monotone, swap, and discontinuous (Tillmann, 2004; Kumar and Byrne, 2004); see Figure 1.

• Other features. Other typical features are n-gram language model scores and a phrase penalty, which governs whether to use fewer longer phrases or more shorter phrases. These are not bilingually estimated, so we can re-use them directly without modification.

The features are combined in a log linear model, and their weights are set through minimum error rate training (Och, 2003). We use the same log linear formulation and MERT, but propose alternatives derived directly from monolingual data for all parameters except for the phrase pairs themselves. Our pipeline still requires a small bitext of approximately 2,000 sentences to use as a development set for MERT parameter tuning.
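The log linear combination described above can be sketched as follows. This is our own illustrative sketch; the feature names, values, and uniform weights are hypothetical, not taken from the paper (MERT would tune the weights):

```python
import math

def loglinear_score(feature_values, weights):
    """Score one translation hypothesis as a weighted sum of log feature
    values, i.e. the log of a log linear model's unnormalized score."""
    return sum(weights[name] * math.log(value)
               for name, value in feature_values.items())

# Hypothetical feature values for a single candidate translation:
features = {
    "phrase_trans_e_given_f": 0.4,   # phi(e|f)
    "phrase_trans_f_given_e": 0.3,   # phi(f|e)
    "lexical_weighting":      0.5,   # averaged w(e_i|f_j)
    "language_model":         0.01,  # n-gram LM probability
}
weights = {name: 1.0 for name in features}  # MERT would tune these
score = loglinear_score(features, weights)
```

The decoder compares such scores across candidate translations; swapping a bilingually estimated feature for a monolingual one leaves this combination unchanged.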
2.2 Bilingual lexicon induction for SMT

Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the translations of words. Using large, unrelated English and German corpora (with 163m and 135m words) and a small German-English bilingual dictionary (with 22k entries), Rapp (1999) demonstrated that reasonably accurate translations could be learned for 100 German nouns that were not contained in the seed bilingual dictionary. His algorithm worked by (1) building a context vector representing an unknown German word by counting its co-occurrence with all the other words in the German monolingual corpus, (2) projecting this German vector onto the vector space of English using the seed bilingual dictionary, (3) calculating the similarity of this sparse projected vector to vectors for English words that were constructed using the English monolingual corpus, and (4) outputting the English words with the highest similarity as the most likely translations.

A variety of subsequent work has extended the original idea, either by exploring different measures of vector similarity (Fung and Yee, 1998) or by proposing other ways of measuring similarity beyond co-occurrence within a context window. For instance, Schafer and Yarowsky (2002) demonstrated that word translations tend to co-occur in time across languages. Koehn and Knight (2002) used similarity in spelling as another kind of cue that a pair of words may be translations of one another. Garera et al. (2009) defined context vectors using dependency relations rather than adjacent words. Bergsma and Van Durme (2011) used the visual similarity of labeled web images to learn translations of nouns. Additional related work on learning translations from monolingual corpora is discussed in Section 6.

In this paper, we apply bilingual lexicon induction methods to statistical machine translation. Given the obvious benefits of not having to rely on scarce bilingual parallel training data, it is surprising that bilingual lexicon induction has not been used for SMT before now. There are several open questions that make its applicability to SMT uncertain. Previous research on bilingual lexicon induction learned translations only for a small number of high frequency words (e.g. 100 nouns in Rapp (1995), the 1,000 most frequent words in Koehn and Knight (2002), or the 2,000 most frequent nouns in Haghighi et al. (2008)). Although previous work reported high translation accuracy, it may be misleading to extrapolate the results to SMT, where it is necessary to translate a much larger set of words and phrases, including many low frequency items.

In a preliminary study, we plotted the accuracy of translations against the frequency of the source words in the monolingual corpus. Figure 2 shows the result for translations induced using contextual similarity (defined in Section 3.1). Unsurprisingly, frequent terms have a substantially better chance of being paired with a correct translation, with words that only occur once having a low chance of being translated accurately.[1] This problem is exacerbated when we move to multi-token phrases. As with phrase translation features estimated from parallel data, longer phrases are more sparse, making similarity scores less reliable than for single words.

[Figure 2: Accuracy of single-word translations induced using contextual similarity as a function of the source word corpus frequency. Accuracy is the proportion of the source words with at least one correct (bilingual dictionary) translation in the top 1 and top 10 candidate lists.]

[1] For a description of the experimental setup used to produce these translations, see Experiment 8 in Section 5.2.

Another impediment (not addressed in this paper) to using lexicon induction for SMT is the number of translations that must be learned. Learning translations for all words in the source language requires n² vector comparisons, since each word in the source language vocabulary must be compared against the vectors for all words in the target language vocabulary. The number of comparisons increases hugely if we compare vectors for multi-word phrases instead of just words. In this work, we avoid this problem by assuming that a limited set of phrase pairs is given a priori (but without scores). By limiting ourselves to phrases in a phrase table, we vastly limit the search space of possible translations. This is an idealization, because high quality translations are guaranteed to be present. However, as our lesion experiments in Section 5.1 show, a phrase table without accurate translation probability estimates is insufficient to produce high quality translations. We show that lexicon induction methods can be used to replace bilingual estimation of phrase- and lexical-translation probabilities, making a significant step towards SMT without parallel corpora.

3 Monolingual Parameter Estimation

We use bilingual lexicon induction methods to estimate the parameters of a phrase-based translation model from monolingual data. Instead of scores estimated from bilingual parallel data, we make use of cues present in monolingual data to provide multiple orthogonal estimates of similarity between a pair of phrases.

3.1 Phrasal similarity features

Contextual similarity. We extend the vector space approach of Rapp (1999) to compute similarity between phrases in the source and target languages. More formally, assume that (s1, s2, ..., sN) and (t1, t2, ..., tM) are (arbitrarily indexed) source and target vocabularies, respectively. A source phrase f is represented with an N- and a target phrase e with an M-dimensional vector (see Figure 3). The component values of the vector representing a phrase correspond to how often each of the words in that vocabulary appears within a two-word window on either side of the phrase. These counts are collected using monolingual corpora. After the values have been computed, a contextual vector f is projected onto the English vector space using translations in a seed bilingual dictionary to map the component values into their appropriate English vector positions. This sparse projected vector is compared to the vectors representing all English phrases e. Each phrase pair in the phrase table is assigned a contextual similarity score c(f, e) based on the similarity between e and the projection of f.

[Figure 3: Scoring contextual similarity of phrases: first, contextual vectors are projected using a small seed dictionary and then compared with the target language candidates.]

[Figure 4: Temporal histograms of the English phrase terrorist, its Spanish translation terrorista, and riqueza (wealth) collected from monolingual texts spanning a 13 year period. While the correct translation has a good temporal match, the non-translation riqueza has a distinctly different signature.]

Various means of computing the component values and vector similarity measures have been proposed in the literature (e.g. Rapp (1999), Fung and Yee (1998)). Following Fung and Yee (1998), we compute the value of the k-th component of f's contextual vector as follows:

    wk = nf,k × (log(n/nk) + 1)

where nf,k and nk are the number of times sk appears in the context of f and in the entire corpus, respectively, and n is the maximum number of occurrences of any word in the data. Intuitively, the more frequently sk appears with f and the less common it is in the corpus in general, the higher its component value. Similarity between two vectors is measured as the cosine of the angle between them.
Temporal similarity. In addition to contextual similarity, phrases in two languages may be scored in terms of their temporal similarity (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Alfonseca et al., 2009). The intuition is that news stories in different languages will tend to discuss the same world events on the same day. The frequencies of translated phrases over time give them particular signatures that will tend to spike on the same dates. For instance, if the phrase asian tsunami is used frequently during a particular time span, the Spanish translation maremoto asiático is likely to also be used frequently during that time. Figure 4 illustrates how the temporal distribution of terrorist is more similar to Spanish terrorista than to other Spanish phrases. We calculate the temporal similarity between a pair of phrases, t(f, e), using the method defined by Klementiev and Roth (2006). We generate a temporal signature for each phrase by sorting the set of (time-stamped) documents in the monolingual corpus into a sequence of equally sized temporal bins and then counting the number of phrase occurrences in each bin. In our experiments, we set the window size to 1 day, so the size of the temporal signatures is equal to the number of days spanned by our corpus. We use cosine distance to compare the normalized temporal signatures for a pair of phrases (f, e).

Topic similarity. Phrases and their translations are likely to appear in articles written about the same topic in two languages. Thus, topic or category information associated with monolingual data can also be used to indicate similarity between a phrase and its candidate translation. In order to score a pair of phrases, we collect their topic signatures by counting their occurrences in each topic and then comparing the resulting vectors. We again use the cosine similarity measure on the normalized topic signatures. In our experiments, we use interlingual links between Wikipedia articles to estimate topic similarity. We treat each linked article pair as a topic and collect counts for each phrase across all articles in its corresponding language. Thus, the size of a phrase topic signature is the number of article pairs with interlingual links in Wikipedia, and each component contains the number of times the phrase appears in (the appropriate side of) the corresponding pair. Our Wikipedia-based topic similarity feature, w(f, e), is similar in spirit to polylingual topic models (Mimno et al., 2009), but it is scalable to full bilingual lexicon induction.

3.2 Lexical similarity features

In addition to the three phrase similarity features used in our model – c(f, e), t(f, e) and w(f, e) – we include four additional lexical similarity features for each phrase pair. The first three lexical features, clex(f, e), tlex(f, e) and wlex(f, e), are the lexical equivalents of the phrase-level contextual, temporal and Wikipedia topic similarity scores. They score the similarity of individual words within the phrases. To compute these lexical similarity features, we average similarity scores over all possible word alignments across the two phrases. Because individual words are more frequent than multiword phrases, the accuracy of clex, tlex, and wlex tends to be higher than that of their phrasal equivalents (this is similar to the effect observed in Figure 2).

Orthographic / phonetic similarity. The final lexical similarity feature that we incorporate is o(f, e), which measures the orthographic similarity between words in a phrase pair. Etymologically related words often retain similar spelling across languages with the same writing system, and low string edit distance sometimes signals translation equivalency. Berg-Kirkpatrick and Klein (2011) present methods for learning correspondences between the alphabets of two languages. We can also extend this idea to language pairs that do not share the same writing system, since many cognates, borrowed words, and names remain phonetically similar. Transliterations can be generated for tokens in a source phrase (Knight and Graehl, 1997), with o(f, e) calculating phonetic similarity rather than orthographic.

The three phrasal and four lexical similarity scores are incorporated into the log linear translation model as feature functions, replacing the bilingually estimated phrase translation probabilities φ and lexical weighting probabilities w. Our seven similarity scores are not the only ones that could be incorporated into the translation model. Various other similarity scores can be computed depending on the available monolingual data and its associated metadata (see, e.g., Schafer and Yarowsky (2002)).
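The contextual and temporal cues of Section 3.1 can be sketched in runnable form as follows. This is our own simplification (single-token phrases, hypothetical helper names), not the authors' implementation:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(val * v.get(key, 0.0) for key, val in u.items())
    nu = math.sqrt(sum(val * val for val in u.values()))
    nv = math.sqrt(sum(val * val for val in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def context_vector(phrase, tokens, window=2):
    """Contextual similarity: count words within a two-word window of each
    occurrence of `phrase`, then weight each component with
    w_k = n_{f,k} * (log(n / n_k) + 1), following Fung and Yee (1998)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == phrase:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    totals = Counter(tokens)
    n = max(totals.values())
    return {w: c * (math.log(n / totals[w]) + 1) for w, c in counts.items()}

def project(vec, seed_dictionary):
    """Map source-side components onto the target vocabulary
    using the seed bilingual dictionary."""
    return {seed_dictionary[w]: v for w, v in vec.items() if w in seed_dictionary}

def temporal_signature(occurrence_days, num_days):
    """Temporal similarity: bin phrase occurrences by day and normalize."""
    sig = [0.0] * num_days
    for day in occurrence_days:
        sig[day] += 1.0
    total = sum(sig)
    return [x / total for x in sig] if total else sig

def cosine_dense(u, v):
    """Cosine similarity between two dense signatures (lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

c(f, e) would then be cosine(project(context_vector(f, src_tokens), seed_dict), context_vector(e, tgt_tokens)), and t(f, e) the cosine over the two phrases' normalized daily signatures.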
3.3 Reordering

The remaining component of the phrase-based SMT model is the reordering model. We introduce a novel algorithm for estimating po(orientation|f, e) from two monolingual corpora instead of a bitext.

Figure 1 illustrates how the phrase pair orientation statistics are estimated in the standard phrase-based SMT pipeline. For a phrase pair like (f = "Profils", e = "profile"), we count its orientation with respect to the previously translated phrase pair (f′ = "in Facebook", e′ = "Facebook") across all translated sentence pairs in the bitext.

In our pipeline we do not have translated sentence pairs. Instead, we look for monolingual sentences in the source corpus which contain the source phrase that we are interested in, like f = "Profils", and at least one other phrase that we have a translation for, like f′ = "in Facebook". We then look for all target language sentences in the target monolingual corpus that contain the translation of f (here e = "profile") and any translation of f′. By looking for foreign sentences containing pairs of adjacent foreign phrases (f, f′) and English sentences containing their corresponding translations (e, e′), we are able to increment orientation counts for (f, e) by looking at whether e and e′ are adjacent, swapped, or discontinuous. The orientations correspond directly to those shown in Figure 1. Figure 6 illustrates that it is possible to find evidence for po(swapped|Profils, profile), even from non-parallel, non-translated sentences drawn from two independent monolingual corpora.

[Figure 6: Collecting phrase orientation statistics for the English-German phrase pair ("profile", "Profils") from non-parallel sentences: the English sentence "What does your Facebook profile reveal" and the German sentence "Das Anlegen eines Profils in Facebook ist einfach", which translates as "Creating a Facebook profile is easy".]

One subtlety of our method is that shorter and more frequent phrases (e.g. punctuation) are more likely to appear in multiple orientations with a given phrase, and therefore provide poor evidence of reordering. Therefore, we (a) collect the longest contextual phrases (which also appear in the phrase table) for reordering feature estimation, and (b) prune the set of sentences so that we only keep a small set of the least frequent contextual phrases (this has the effect of dropping many function words and punctuation marks and relying more heavily on multi-word content phrases to estimate the reordering).[2]

Our algorithm for learning the reordering parameters is given in Figure 5. The algorithm estimates a probability distribution over monotone, swap, and discontinuous orientations (pm, ps, pd) for a phrase pair (f, e) from two monolingual corpora Cf and Ce. It begins by calling CollectOccurs to collect the longest matching phrase table phrases that precede f in the source monolingual data (Bf), as well as those that precede (Be), follow (Ae), and are discontinuous (De) with e in the target language data. For each unique phrase f′ preceding f, we look up its translations in the phrase table T. Next, we count[3] how many translations e′ of f′ appeared before, after, or discontinuous with e in the target language data. Finally, the counts are normalized and returned. These normalized counts are the values we use as estimates of po(orientation|f, e).

[Figure 5: Algorithm for estimating reordering probabilities from monolingual data.

  Input:  source and target phrases f and e,
          source and target monolingual corpora Cf and Ce,
          phrase table pairs T = {(f(i), e(i))}, i = 1..N
  Output: orientation features (pm, ps, pd)

  Sf ← sentences containing f in Cf
  Se ← sentences containing e in Ce
  (Bf, −, −) ← CollectOccurs(f, ∪i f(i), Sf)
  (Be, Ae, De) ← CollectOccurs(e, ∪i e(i), Se)
  cm = cs = cd = 0
  foreach unique f′ in Bf do
      foreach translation e′ of f′ in T do
          cm = cm + #Be(e′)
          cs = cs + #Ae(e′)
          cd = cd + #De(e′)
  c ← cm + cs + cd
  return (cm/c, cs/c, cd/c)

  CollectOccurs(r, R, S):
      B ← (); A ← (); D ← ()
      foreach sentence s ∈ S do
          foreach occurrence of phrase r in s do
              B ← B + (longest preceding r and in R)
              A ← A + (longest following r and in R)
              D ← D + (longest discontinuous with r and in R)
      return (B, A, D)]

[2] The pruning step has the additional benefit of minimizing the memory needed for orientation feature estimation.
[3] #L(x) returns the count of object x in list L.

[Table 1: Statistics about the monolingual training data and the phrase table that was used in all of the experiments.

  Monolingual training corpora:
                        Europarl     Gigaword       Wikipedia
    date range          4/96-10/09   5/94-12/08     n/a
    uniq shared dates   829          5,249          n/a
    Spanish articles    n/a          3,727,954      59,463
    English articles    n/a          4,862,876      59,463
    Spanish lines       1,307,339    22,862,835     2,598,269
    English lines       1,307,339    67,341,030     3,630,041
    Spanish words       28,248,930   774,813,847    39,738,084
    English words       27,335,006   1,827,065,374  61,656,646

  Spanish-English phrase table:
    Phrase pairs        3,093,228
    Spanish phrases     89,386
    English phrases     926,138
    Spanish unigrams    13,216  (avg # translations 98.7)
    Spanish bigrams     41,426  (avg # translations 31.9)
    Spanish trigrams    34,744  (avg # translations 13.5)]
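The procedure of Figure 5 can be rendered in runnable form roughly as follows. This is our own simplified sketch (single-token phrases, no longest-match collection or pruning heuristics), not the authors' implementation:

```python
def collect_occurs(r, R, sentences):
    """For each occurrence of token r, record the neighboring tokens that
    appear in the set R: preceding (B), following (A), discontinuous (D)."""
    B, A, D = [], [], []
    for s in sentences:
        for i, tok in enumerate(s):
            if tok != r:
                continue
            if i > 0 and s[i - 1] in R:
                B.append(s[i - 1])
            if i + 1 < len(s) and s[i + 1] in R:
                A.append(s[i + 1])
            for j, other in enumerate(s):
                if abs(j - i) > 1 and other in R:
                    D.append(other)
    return B, A, D

def reordering_features(f, e, Cf, Ce, T):
    """Estimate (p_monotone, p_swap, p_discontinuous) for the pair (f, e)
    from two monolingual corpora and an unscored phrase table T."""
    Sf = [s for s in Cf if f in s]
    Se = [s for s in Ce if e in s]
    src_phrases = {fp for fp, _ in T}
    tgt_phrases = {ep for _, ep in T}
    Bf, _, _ = collect_occurs(f, src_phrases, Sf)
    Be, Ae, De = collect_occurs(e, tgt_phrases, Se)
    cm = cs = cd = 0
    for fp in set(Bf):                 # unique f' preceding f
        for fp2, ep in T:              # each translation e' of f'
            if fp2 != fp:
                continue
            cm += Be.count(ep)         # e' before e  -> monotone
            cs += Ae.count(ep)         # e' after e   -> swapped
            cd += De.count(ep)         # e' elsewhere -> discontinuous
    c = cm + cs + cd
    return (cm / c, cs / c, cd / c) if c else (0.0, 0.0, 0.0)
```

With a phrase table containing ("facebook", "facebook") and ("profils", "profile"), a source sentence where "facebook" precedes "profils" and a target sentence where "facebook" follows "profile" yields swap evidence, mirroring the Figure 6 example.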
4 Experimental Setup

We use the Spanish-English language pair to test our method for estimating the parameters of an SMT system from monolingual corpora. This allows us to compare our method against the normal bilingual training procedure. We expect bilingual training to result in higher translation quality, because it is a more direct method for learning translation probabilities. We systematically remove different parameters from the standard phrase-based model and then replace them with our monolingual equivalents. Our goal is to recover as much of the loss as possible for each of the deleted bilingual components.

The standard phrase-based model that we use as our top-line is the Moses system (Koehn et al., 2007) trained over the full Europarl v5 parallel corpus (Koehn, 2005). With the exception of the maximum phrase length (set to 3 in our experiments), we used default values for all of the parameters. All experiments use a trigram language model trained on the English side of the Europarl corpus using SRILM with Kneser-Ney smoothing. To tune feature weights in minimum error rate training, we use a development bitext of 2,553 sentence pairs, and we evaluate performance on a test set of 2,525 single-reference translated newswire articles. These development and test datasets were distributed in the WMT shared task (Callison-Burch et al., 2010).[4] MERT was re-run for every experiment.

We estimate the parameters of our model from two sets of monolingual data, detailed in Table 1:

• First, we treat the two sides of the Europarl parallel corpus as independent, monolingual corpora. Haghighi et al. (2008) also used this method to show how well translations could be learned from monolingual corpora under ideal conditions, where the contextual and temporal distributions of words in the two monolingual corpora are nearly identical.

• Next, we estimate the features from truly monolingual corpora. To estimate the contextual and temporal similarity features, we use the Spanish and English Gigaword corpora.[5] These corpora are substantially larger than the Europarl corpora, providing 27x as much Spanish and 67x as much English for contextual similarity, and 6x as many paired dates for temporal similarity. Topical similarity is estimated using Spanish and English Wikipedia articles that are paired with interlanguage links.

To project context vectors from Spanish to English, we use a bilingual dictionary containing entries for 49,795 Spanish words. Note that end-to-end translation quality is robust to substantially reducing the dictionary size, but we omit these experiments due to space constraints. The context vectors for words and phrases incorporate co-occurrence counts using a two-word window on either side.

[4] Specifically, news-test2008 plus news-syscomb2009 for dev and newstest2009 for test.
[5] We use the afp, apw and xin sections of the corpora.

The title of our paper uses the word towards because we assume that an inventory of phrase pairs is given. Future work will explore inducing the
The labels indicate how the different types 14.02 14.07 14.02 14.07 15 15 4.00 5, 12 -/M of parameters are estimated, thenone first/ mono part is for phrase-table features, 13.13 the second is for reordering probabilities. 5 BLEU BLEU 13.13 BM/B M/M B/B -/M M/- 6, 13 t/- temporal mono / distortion B/- -/B o/- c/- t/- -/- 7,14 o/- orthographic mono / distortion 10.15 10.15 10 10 0 1 2 3 4 5 6 8, 15 16 7 c/- 8 contextual w/- 9 mono10/ distortion 11 Wikipedia topical mono / distorion 5 Experimental Results 25 25 Phrase scores / orientation scores 9, 17 M/- all mono / distortion 23.36 23.36 5 BM/B bilingual / bilingual (Moses) M/M Estimated Using Monolingual 10, 18 M/M Corpora -/M M/- Figures 7 and 8 give experimental results. Figure w/- o/- c/- all mono / mono t/- bilingual / distortion 20 20 11, 19 BM/B bilingual 18.79 18.79 none / bilingual 17.92+ all mono / bilingual 17.92 7 shows the performance of the standard phrase- 00 none / distortion 17.00 17.00 12 13 14 15 16 17 18 19 14.02 14.02 14.07 14.07 based model when each of the bilingually esti- 15 15 none / mono BLEU 13.13 BLEU 13.13 temporal mono / distortion orthographic mono / distortion 10.15 10.15 mated features are removed. It shows how much 10 10 contextual mono / distortion Wikipedia topical mono / distorion of the performance loss can be recovered using all mono / distortion 5 our monolingual features when they are estimated BM/B M/M -/M M/- w/- o/- c/- all mono / mono t/- bilingual + all mono / bilingual from the Europarl training corpus but treating 00 12 13 14 15 16 17 18 19 each side as an independent, monolingual cor- Figure 8: Performance of monolingual features de- pus. Figure 8 shows the recovery when using truly rived from truly monolingual corpora. Over 82% of monolingual corpora to estimate the parameters. the BLEU score loss can be recovered. 5.1 Lesion experiments phrase table itself from monolingual texts. 
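The "percentage of loss recovered" notion used in Figures 7 and 8 is a simple ratio over BLEU scores. A minimal sketch — the mapping of bar values to conditions (B/B = 21.87, -/- = 4.00, M/M = 17.50 for Europarl and 18.79 for truly monolingual data) is our reading of the figures, not a table stated in the paper:

```python
def recovered_pct(full, lesioned, restored):
    """Share of the BLEU drop (full -> lesioned) that is won back
    when monolingually estimated features are put in place."""
    return 100.0 * (restored - lesioned) / (full - lesioned)

# Bar values as read from Figures 7 and 8 (our assumption):
# B/B = 21.87, -/- = 4.00, M/M = 17.50 (Europarl) / 18.79 (truly mono)
print(round(recovered_pct(21.87, 4.00, 17.50), 1))  # 75.5 -> "over 75%"
print(round(recovered_pct(21.87, 4.00, 18.79), 1))  # 82.8 -> "over 82%"
```

With this reading, the drop from B/B to -/- is 17.87 BLEU points ("over 17"), and the two M/M conditions recover 13.5 and 14.8 points of it, matching the percentages reported in the text.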
Across all of our experiments, we use the phrase table that the bilingual model learned from the Europarl parallel corpus. We keep its phrase pairs, but we drop all of its scores. Table 1 gives details of the phrase pairs. In our experiments, we estimated similarity and reordering scores for more than 3 million phrase pairs. For each source phrase, the set of possible translations was constrained and likely to contain good translations. However, the average number of possible translations was high (ranging from nearly 100 translations for each unigram to 14 for each trigram). These contain a lot of noise and result in low end-to-end translation quality without good estimates of translation quality, as the experiments in Section 5.1 show.

5.1 Lesion experiments

Experiments 1-4 remove bilingually estimated parameters from the standard model. For Spanish-English, the relative contribution of the phrase-table features (which include the phrase translation probabilities φ and the lexical weights w) is greater than the reordering probabilities. When the reordering probability po(orientation|f, e) is eliminated and replaced with a simple distance-based distortion feature that does not require a bitext to estimate, the score dips only marginally, since word order in English and Spanish is similar. However, when both the reordering and the phrase table features are dropped, leaving only the LM feature and the phrase penalty, the resulting translation quality is abysmal, with the score dropping a total of over 17 BLEU points.

5.2 Adding equivalent monolingual features estimated using Europarl

Experiments 5-10 show how much our monolingual equivalents could recover when the monolingual corpora are drawn from the two sides of the bitext. For instance, our algorithm for estimating reordering probabilities from monolingual data (-/M) adds 5 BLEU points, which is 73% of the potential recovery going from the model (-/-) to the model with bilingual reordering features (-/B). Of the temporal, orthographic, and contextual monolingual features, the temporal feature performs the best. Together (M/-), they recover more than each individually. Combining monolingually estimated reordering and phrase table features (M/M) yields a total gain of 13.5 BLEU points, or over 75% of the BLEU score loss that occurred when we dropped all features from the phrase table. However, these results use "monolingual" corpora which have practically identical phrasal and temporal distributions.

Software. Because many details of our estimation procedures must be omitted for space, we distribute our full set of code along with scripts for running our experiments and output translations. These may be downloaded from http://www.cs.jhu.edu/~anni/papers/lowresmt/

5.3 Estimating features using truly monolingual corpora

Experiments 12-18 estimate all of the features from truly monolingual corpora. Our novel algorithm for estimating reordering holds up well and recovers 69% of the loss, only 0.4 BLEU points less than when estimated from the Europarl monolingual texts. The temporal similarity feature does not perform as well as when it was estimated using Europarl data, but the contextual feature does. The topic similarity using Wikipedia performs the strongest of the individual features.

Combining the monolingually estimated reordering features with the monolingually estimated similarity features (M/M) yields a total gain of 14.8 BLEU points, or over 82% of the BLEU point loss that occurred when we dropped all features from the phrase table. This is equivalent to training the standard system on a bitext with roughly 60,000 lines or nearly 2 million words (learning curve omitted for space).

Finally, we supplement the standard bilingually estimated model parameters with our monolingual features (BM/B), and we see a 1.5 BLEU point increase over the standard model. Therefore, our monolingually estimated scores capture some novel information not contained in the standard feature set.

6 Additional Related Work

Carbonell et al. (2006) described a data-driven MT system that used no parallel text. It produced translation lattices using a bilingual dictionary and scored them using an n-gram language model. Their method has no notion of translation similarity aside from a bilingual dictionary. Similarly, Sánchez-Cartagena et al. (2011) supplement an SMT phrase table with translation pairs extracted from a bilingual dictionary and give each a frequency of one for computing translation scores.

Ravi and Knight (2011) treat MT without parallel training data as a decipherment task and learn a translation model from monolingual text. They translate corpora of Spanish time expressions and subtitles, which both have a limited vocabulary, into English. Their method has not been applied to broader domains of text.

Most work on learning translations from monolingual texts only examines small numbers of frequent words. Huang et al. (2005) and Daumé and Jagarlamudi (2011) are exceptions that improve MT by mining translations for OOV items.

A variety of past research has focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010; Uszkoreit et al., 2010). Others use an existing SMT system to discover parallel sentences within independent monolingual texts, and use them to re-train and enhance the system (Schwenk, 2008; Chen et al., 2008; Schwenk and Senellart, 2009; Rauf and Schwenk, 2009; Lambert et al., 2011). These are complementary but orthogonal to our research goals.

7 Conclusion

This paper has demonstrated a novel set of techniques for successfully estimating phrase-based SMT parameters from monolingual corpora, potentially circumventing the need for large bitexts, which are expensive to obtain for new languages and domains. We evaluated the performance of our algorithms in a full end-to-end translation system. Assuming that a bilingual-corpus-derived phrase table is available, we were able to utilize our monolingually-estimated features to recover over 82% of the BLEU loss that resulted from removing the bilingual-corpus-derived phrase-table probabilities. We also showed that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated features. Thus our techniques have stand-alone efficacy when large bilingual corpora are not available, and also make a significant contribution to combined ensemble performance when they are.

References

Enrique Alfonseca, Massimiliano Ciaramita, and Keith Hall. 2009. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In Proceedings of EMNLP.

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of EMNLP-2011, Edinburgh, Scotland, UK.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL/HLT.

Fei Huang, Ying Zhang, and Stephan Vogel. 2005. Mining key phrase translations from web corpora. In Proceedings of EMNLP.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the ACL/Coling.

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images.
In Proceedings of the International Joint Conference on Artificial Intelligence.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Roossin. 1988. A statistical approach to language translation. In 12th International Conference on Computational Linguistics (CoLing-1988).

Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-based machine translation. In Proceedings of AMTA.

Boxing Chen, Min Zhang, Aiti Aw, and Haizhou Li. 2008. Exploiting n-best hypotheses for SMT self-enhancement. In Proceedings of ACL/HLT, pages 157–160.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.

Hal Daumé and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of ACL/HLT.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL/CoLing.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Thirteenth Conference On Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado.

Ulrich Germann. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of ACL.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Shankar Kumar and William Byrne. 2004. Local phrase reordering models for statistical machine translation. In Proceedings of HLT/NAACL.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Workshop on Statistical Machine Translation, pages 284–293, Edinburgh, Scotland, UK.

David Mimno, Hanna Wallach, Jason Naradowsky, David Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of EMNLP.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the ACL/Coling.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen.

Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of ACL.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL.

Sadaf Abdul Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of EACL.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL/HLT.

Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In Proceedings of the XIII Machine Translation Summit.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of CoNLL.

Holger Schwenk and Jean Senellart. 2009. Translation model adaptation for an Arabic/French news translation system by lightly-supervised training. In MT Summit.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of IWSLT.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of HLT/NAACL.

Christoph Tillmann. 2003. A projection extension algorithm for statistical machine translation. In Proceedings of EMNLP.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT/NAACL.

Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of CoLing.

Ashish Venugopal, Stephan Vogel, and Alex Waibel. 2003. Effective phrase translation extraction from alignment models. In Proceedings of ACL.
Character-Based Pivot Translation for Under-Resourced Languages and Domains

Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Uppsala/Sweden

[email protected]

Abstract

In this paper we investigate the use of character-level translation models to support the translation from and to under-resourced languages and textual domains via closely related pivot languages. Our experiments show that these low-level models can be successful even with tiny amounts of training data. We test the approach on movie subtitles for three language pairs and legal texts for another language pair in a domain adaptation task. Our pivot translations outperform the baselines by a large margin.

1 Introduction

Data-driven approaches have been extremely successful in most areas of natural language processing (NLP) and can be considered the main paradigm in application-oriented research and development. Research in machine translation is a typical example, with the dominance of statistical models over the last decade. This is further reinforced by the availability of toolboxes such as Moses (Koehn et al., 2007) which make it possible to build translation engines within days or even hours for any language pair provided that appropriate training data is available. However, this reliance on training data is also the most severe limitation of statistical approaches. Resources in large quantities are only available for a few languages and domains. In the case of SMT, the dilemma is even more apparent as parallel corpora are rare and usually quite sparse. Some languages can be considered lucky, for example, because of political situations that lead to the production of freely available translated material on a large scale. A lot of research and development would not have been possible without the European Union and its language policies, to give an example.

One of the main challenges of current NLP research is to port data-driven techniques to under-resourced languages, which refers to the majority of the world's languages. One obvious approach is to create appropriate data resources even for those languages in order to enable the use of similar techniques designed for high-density languages. However, this is usually too expensive and often impossible with the quantities needed. Another idea is to develop new models that can work with (much) less data but still make use of resources and techniques developed for other well-resourced languages.

In this paper, we explore pivot translation techniques for the translation from and to resource-poor languages with the help of intermediate resource-rich languages. We exploit the fact that many poorly resourced languages are closely related to well equipped languages, which enables low-level techniques such as character-based translation. We can show that these techniques can boost the performance enormously, tested for several language pairs. Furthermore, we show that pivoting can also be used to overcome data sparseness in specific domains. Even high density languages are under-resourced in most textual domains, and pivoting via in-domain data of another language can help to adapt statistical models. In our experiments, we observe that related languages have the largest impact in such a setup.

The remaining parts of the paper are organized as follows: First we describe the pivot translation approach used in this study. Thereafter, we discuss character-based translation models, followed by a detailed presentation of our experimental results. Finally, we briefly summarize related work and conclude the paper with discussions and prospects for future work.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 141–151, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Pivot Models

Information from pivot languages can be incorporated in SMT models in various ways. The main principle refers to the combination of source-to-pivot and pivot-to-target translation models. In our setup, one of these models includes a resource-poor language (source or target) and the other one refers to a standard model with appropriate data resources. A condition is that we have at least some training data for the translation between pivot and the resource-poor language. However, for the original task (source-to-target translation) we do not require any data resources except for purposes of comparison.

We will explore various models for the translation between the resource-poor language and the pivot language, and most of them are not compatible with standard phrase-based translation models. Hence, triangulation methods (Cohn and Lapata, 2007) for combining phrase tables are not applicable in our case. Instead, we explore a cascaded approach (also called "transfer method" (Wu and Wang, 2009)) in which we translate the input text in two steps using a linear interpolation for rescoring N-best lists. Following the method described in (Utiyama and Isahara, 2007) and (Wu and Wang, 2009), we use the best n hypotheses from the translation of source sentences s to pivot sentences p and combine them with the top m hypotheses for translating these pivot sentences to target sentences t:

    t̂ ≈ argmax_t Σ_{k=1}^{L} [ α λ_k^{sp} h_k^{sp}(s, p) + (1 − α) λ_k^{pt} h_k^{pt}(p, t) ]

where h_k^{xy} are feature functions for model xy with appropriate weights λ_k^{xy}.¹ Basically, this means that we simply add the scores and, similar to related work, we assume that the feature weights can be set independently for each model using minimum error rate training (MERT) (Och, 2003). In our setup we added the parameter α that can be used to weight the importance of one model over the other. This can be useful as we do not consider the entire hypothesis space but only a small subset of N-best lists. In the simplest case, this weight is set to 0.5, making both models equally important. An alternative to fitting the interpolation weight would be to perform a global optimization procedure. However, a straightforward implementation of pivot-based MERT would be prohibitively slow due to the expensive two-step translation procedure over n-best lists.

A general condition for the pivot approach is to assume independent training sets for both translation models, as already pointed out by (Bertoldi et al., 2008). In contrast to research presented in related work (see, for example, (Koehn et al., 2009)) this condition is met in our setup, in which all data sets represent different samples over the languages considered (see section 4).²

¹ Note that we do not require the same feature functions in both models even though the formula above implies this for simplicity of representation.
² Note that different samples may still include common sentences.

3 Character-Based SMT

The basic idea behind character-based translation models is to take advantage of the strong lexical and syntactic similarities between closely related languages. Consider, for example, Figure 1. Related languages like Catalan and Spanish or Danish and Norwegian have common roots and, therefore, use similar concepts and express them in similar grammatical structures. Spelling conventions can still be quite different, but those differences are often very consistent. The Bosnian-Macedonian example also shows that we do not have to require any alphabetic overlap in order to obtain character-level similarities.

Figure 1: Some examples of movie subtitle translations between closely related languages (either sharing parts of the same alphabet or not).

Regularities between such closely related languages can be captured below the word level. We can also assume a more or less monotonic relation between the two languages, which motivates the idea of translation models over character N-grams, treating translation as a transliteration task (Vilar et al., 2007). Conceptually it is straightforward to think of phrase-based models on the character level. Sequences of characters can be used instead of word N-grams for both translation and language models. Training can proceed with the same tools and approaches. The basic task is to prepare the data to comply with the training procedures (see Figure 2).

Figure 2: Data pre-processing for training models on the character level. Spaces are represented by '_' and each sentence is treated as one sequence of characters.

3.1 Character Alignment

One crucial difference is the alignment of characters, which is required instead of an alignment of words. Clearly, the traditional IBM word alignment models are not designed for this task, especially with respect to distortion. However, the same generative story can still be applied in general. Vilar et al. (2007) explore a two-step procedure where words are aligned first (with the traditional IBM models) to divide sentence pairs into aligned segments of reasonable size, and the characters are then aligned with the same algorithm.

An alternative is to use models designed for transliteration or related character-level transformation tasks. Many approaches are based on transducer models that resemble string edit operations such as insertions, deletions and substitutions (Ristad and Yianilos, 1998). Weighted finite state transducers (WFSTs) can be trained on unaligned pairs of character sequences and have been shown to be very effective for transliteration tasks or letter-to-phoneme conversions (Jiampojamarn et al., 2007). The training procedure usually employs an expectation maximization (EM) procedure, and the resulting transducer can be used to find the Viterbi alignment between characters according to the best sequence of edit operations applied to transform one string into the other. Extensions to this model are possible, for example the use of many-to-many alignments, which have been shown to be very effective in letter-to-phoneme alignment tasks (Jiampojamarn et al., 2007).

One advantage of the edit-distance-based transducer models is that the alignments they predict are strictly monotonic and cannot easily be confused by spurious relations between characters over longer distances. Long distance alignments are only possible in connection with a series of insertions and deletions that usually increase the alignment costs in such a way that they are avoided if possible. On the other hand, IBM word alignment models also prefer monotonic alignments over non-monotonic ones if there is no good reason to do otherwise (i.e., there is frequent evidence of distorted alignments). However, the size of the vocabulary in a character-level model is very small (several orders of magnitude smaller than on the word level) and this may cause serious confusion of the word alignment model, which very much relies on context-independent lexical translation probabilities. Hence, for character alignment, the lexical evidence is much less reliable without their context.

It is certainly possible to find a compromise between word-level and character-level models in order to generalize below word boundaries while avoiding alignment problems as discussed above. Morpheme-based translation models have been explored in several studies with similar motivations as in our approach, a better generalization from sparse training data (Fishel and Kirik, 2010; Luong et al., 2010). However, these approaches have the drawback that they require proper morphological analyses. Data-driven techniques exist even for morphology, but their use in SMT still needs to be shown (Fishel, 2009). The situation is comparable to the problems of integrating linguistically motivated phrases into phrase-based SMT (Koehn et al., 2003). Instead we opt for a more general approach to extend context to facilitate, especially, the alignment step. Figure 3 shows how we can transform texts into sequences of bigrams that can be aligned with standard approaches without making any assumptions about linguistically motivated segmentations.
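The transformation into character bigrams described above — and its obvious inverse, restoring normal text — can be sketched in a few lines. This is an illustrative reconstruction from the description and Figures 2-3, not the tooling actually used in the experiments:

```python
def to_char_bigrams(sentence):
    # Spaces become '_' so each sentence is one sequence of characters
    # (Figure 2); then overlapping character bigrams are emitted, with
    # a final '_' marking the end of the sentence (Figure 3).
    s = sentence.replace(' ', '_')
    return [s[i:i + 2] for i in range(len(s) - 1)] + [s[-1] + '_']

def from_char_tokens(tokens):
    # Inverse step: the first character of each token recovers the
    # character at that position; '_' place-holders become spaces again.
    return ''.join(t[0] for t in tokens).replace('_', ' ')

bigrams = to_char_bigrams('curso confirmado .')
print(' '.join(bigrams))
# cu ur rs so o_ _c co on nf fi ir rm ma ad do o_ _. ._
print(from_char_tokens(bigrams))
# curso confirmado .
```

Because each bigram token starts with the character at its position, a bigram-level alignment maps directly back to a character-level one, which is the property the text exploits.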
nite state transducers (WFSTs) can be trained on unaligned pairs of character sequences and have been shown to be very effective for transliteration tasks or letter-to-phoneme conversions (Jiampojamarn et al., 2007). The training procedure usually employs an expectation maximization (EM) procedure. Instead, we opt for a more general approach that extends context in order to facilitate, especially, the alignment step. Figure 3 shows how we can transform texts into sequences of bigrams that can be aligned with standard approaches without making any assumptions about linguistically motivated segmentations.

curso confirmado .  ->  cu ur rs so o_ _c co on nf fi ir rm ma ad do o_ _. ._
¿qué es eso?        ->  ¿q qu ué é_ _e es s_ _e es so o? ?_

Figure 3: Two Spanish sentences as sequences of character bigrams with a final '_' marking the end of a sentence.

In this way we can construct a parallel corpus with slightly richer contextual information as input to the alignment program. The vocabulary remains small (for example, 1,267 bigrams in the case of Spanish compared to 84 individual characters in our experiments) but lexical translation probabilities now become much more differentiated.

With this, it is now possible to use the alignment between bigrams to train a character-level translation system, as we have the same number of bigrams as we have characters (and the first character in each bigram corresponds to the character at that position). Certainly, it is also possible to train a bigram translation model (and language model). This has the (one and only) advantage that one character of context across phrase boundaries (i.e. character N-grams) is used in the selection of translation alternatives from the phrase table.³

³ Using larger units (trigrams, for example) led to lower scores in our experiments (probably due to data sparseness); these results are therefore not reported here.

3.2 Tuning Character-Level Models

A final remark on training character-based SMT models is concerned with feature weight tuning. It certainly makes little sense to compute character-level BLEU scores for tuning feature weights, especially with the standard settings of matching relatively short N-grams. Instead, we would still like to measure performance in terms of word-level BLEU scores (or any other MT evaluation metric used in minimum error rate training). Therefore, it is important to post-process character-translated development sets before adjusting weights. This is simply done by merging characters accordingly and replacing the place-holders with spaces again. Thereafter, MERT can run as usual.

3.3 Evaluation

Character-level translations can be evaluated in the same way as other translation hypotheses, for example using automatic measures such as BLEU, NIST, METEOR etc. The same simple post-processing as mentioned in the previous section can be applied to turn the character translations into "normal" text. However, it can be useful to look at some other measures as well that consider near matches on the character level instead of matching words and word N-grams only. Character-level models have the ability to produce strings that may be close to the reference and still do not match any of the words contained. They may generate non-words that include mistakes which look like spelling errors or minor grammatical mistakes. Those words are usually close enough to the correct target words to be recognized by the user, which is often more acceptable than leaving foreign words untranslated. This is especially true as many unknown words represent important content words that bear a lot of information. The problem of unknown words is even more severe for morphologically rich languages, as many word forms are simply not part of (sparse) training data sets. Untranslated words are especially annoying when translating languages that use different writing systems. Consider, for example, the following subtitles in Macedonian (using Cyrillic letters) that have been translated from Bosnian (written in Latin characters):

reference:  И чаша вино, како и секогаш.
word-based: И čašu vina, како секогаш.
char-based: И чаша вино, како секогаш.

reference:  Во старото светилиште.
word-based: Во starom svetilištu.
char-based: Во стар светилиштето.

The underlined parts mark examples of character-level differences with respect to the reference translation.

For the pivot translation approach, it is important that the translations generated in the first step can be handled by the second one. This means that words generated by a character-based model should at least be valid input words for the second step, even though they might refer to erroneous inflections in that context. Therefore, we add another measure to our experimental results presented below – the number of unknown words with respect to the input language of the second step. This applies only to models that are used as the first step in pivot-based translations. For other models, we include a string similarity measure based on the longest common subsequence ratio (LCSR) (Stephen, 1992) in order to give an impression of the "closeness" of the system output to the reference translations.
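The character- and bigram-level representations above, and the post-processing used before tuning and evaluation, are simple string transformations. The following is a minimal Python sketch of our own (not the paper's actual pipeline), assuming '_' is the space place-holder and end-of-sentence marker shown in Figure 3:

```python
def to_chars(sentence):
    """Unigram view: spaces become '_' and a final '_' marks
    the end of the sentence (cf. Figure 3)."""
    return list(sentence.replace(" ", "_") + "_")

def to_bigrams(sentence):
    """Bigram view: one overlapping bigram per character, so the
    first character of bigram i is the character at position i."""
    s = sentence.replace(" ", "_") + "_"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def from_chars(tokens):
    """Invert the transformation: merge tokens, drop the final
    end-of-sentence marker and restore spaces."""
    return "".join(tokens)[:-1].replace("_", " ")

def from_bigrams(bigrams):
    """Read off the first character of every bigram."""
    return "".join(b[0] for b in bigrams).replace("_", " ")
```

For "curso confirmado ." this yields exactly the bigram sequence of Figure 3; merging system output back with from_chars or from_bigrams is the post-processing step that makes word-level BLEU scoring and MERT possible (Section 3.2).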
4 Experiments

We conducted a series of experiments to test the ideas of (character-level) pivot translation for resource-poor languages. We chose to use data from a collection of translated subtitles compiled in the freely available OPUS corpus (Tiedemann, 2009b). This collection includes a large variety of languages and contains mainly short sentences and sentence fragments, which suits character-level alignment very well. The selected settings represent translation tasks between languages (and domains) for which only very limited training data is available or none at all.

Below we present results from two general tasks:⁴ (i) translating between English and a resource-poor language (in both directions) via a pivot language that is closely related to the resource-poor language; (ii) translating between two languages in a domain for which no in-domain training data is available via a pivot language with in-domain data. We will start with the presentation of the first task and the character-based translation between closely related languages.

⁴ In all experiments we use standard tools like Moses, GIZA++, SRILM, mteval etc. Details about basic settings are omitted here due to space constraints but can be found in the supplementary material. The data sets are available from here: http://stp.lingfil.uu.se/~joerg/index.php?resources

4.1 Task 1: Pivoting via Related Languages

We decided to look at resource-poor languages from two language families: Macedonian, representing a Slavic language from the Balkan region, and Catalan and Galician, representing two Romance languages spoken mainly in Spain. There is only little or no data available for translating from or to English for these languages. However, there are related languages with medium or large amounts of training data. For Macedonian, we use Bulgarian (which also uses a Cyrillic alphabet) and Bosnian (another related language that mainly uses Latin characters) as the pivot languages. For Catalan and Galician, the obvious choice was Spanish (however, Portuguese would, for example, have been another reasonable option for Galician). Table 1 lists the data available for training the various models. Furthermore, we reserved 2000 sentences for tuning parameters and another 2000 sentences for testing. For Galician, we only used 1000 sentences for each set due to the lack of additional data. We were especially careful when preparing the data to exclude all sentences from tuning and test sets that could be found in any pivot or direct translation model. Hence, all test sentences are unseen strings for all models presented in this paper (but they are not comparable with each other as they are sampled individually from independent data sets).

language pair           #sent's  #words
Galician – English      –        –
Galician – Spanish      2k       15k
Catalan – English       50k      400k
Catalan – Spanish       64k      500k
Spanish – English       30M      180M
Macedonian – English    220k     1.2M
Macedonian – Bosnian    12k      60k
Macedonian – Bulgarian  155k     800k
Bosnian – English       2.1M     11M
Bulgarian – English     14M      80M

Table 1: Training data for the translation task between closely related languages in the domain of movie subtitles. Number of sentences (#sent's) and number of words (#words) in thousands (k) and millions (M) (averages of source and target language).

The data sets represent several interesting test cases: Galician is the least supported language, with extremely little training data for building our pivot model. There is no data for the direct model and, therefore, no explicit baseline for this task. There is 30 times more data available for Catalan–English, but still too little for a decent standard SMT model. Interesting here is that we have more or less the same amount of data available for the baseline and for the pivot translation between the related languages. The data set for Macedonian–English is by far the largest among the baseline models and also bigger than the sets available for the related pivot languages. Especially Macedonian–Bosnian is not well supported. The interesting question is whether tiny amounts of pivot data can still be competitive. In all three cases, there is much more data available for the translation models between English and the pivot language. In the following section we will look at the translation between related languages with various models and training setups before we consider the actual translation task via the bridge languages.

Model             bs-mk            bg-mk            es-gl            es-ca
                  BLEU%   ↑LCSR    BLEU%   ↑LCSR    BLEU%   ↑LCSR    BLEU%   ↑LCSR
word-based        15.43   0.5067   14.66   0.6225   41.11   0.7966   62.73   0.8526
char – WFST1:1    21.37++ 0.6903   13.33−− 0.6159   36.94   0.7832   73.17++ 0.8728
char – WFST2:2    19.17++ 0.6737   12.67−− 0.6190   43.39++ 0.8083   70.64++ 0.8684
char – IBMchar    23.17++ 0.6968   14.57   0.6347   45.21++ 0.8171   73.12++ 0.8767
char – IBMbigram  24.84++ 0.7046   15.01++ 0.6374   44.06++ 0.8144   74.21++ 0.8803

Table 2: Translating from a related pivot language to the target language. Bosnian (bs) / Bulgarian (bg) – Macedonian (mk); Galician (gl) / Catalan (ca) – Spanish (es). Word-based refers to standard phrase-based SMT models. All other models use phrases over character sequences. The WFSTx:y models use weighted finite state transducers for character alignment with units that are at most x and y characters long, respectively. Other models use Viterbi alignments created by IBM model 4 using GIZA++ (Och and Ney, 2003) between characters (IBMchar) or bigrams (IBMbigram). LCSR refers to the averaged longest common subsequence ratio between system translations and references. Results are significantly better (++: p < 0.01, +: p < 0.05) or worse (−−: p < 0.01, −: p < 0.05) than the word-based baseline.

Model             mk-bs            mk-bg            gl-es            ca-es
                  BLEU%   ↓UNK     BLEU%   ↓UNK     BLEU%   ↓UNK     BLEU%   ↓UNK
word-based        14.22   17.83%   14.77   5.29%    43.22   10.18%   59.34   3.80%
char – WFST1:1    21.74++ 1.50%    16.04++ 0.77%    50.24++ 1.17%    62.87++ 0.45%
char – WFST2:2    19.19++ 2.05%    15.32   0.96%    50.59++ 1.28%    59.84   0.47%
char – IBMchar    24.15++ 1.30%    17.12++ 0.80%    51.18++ 1.38%    64.35++ 0.59%
char – IBMbigram  24.82++ 1.00%    17.28++ 0.77%    50.70++ 1.36%    65.14++ 0.48%

Table 3: Translating from the source language to a related pivot language. UNK gives the proportion of unknown words with respect to the translation model from the pivot language to English.
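The LCSR and UNK figures in Tables 2 and 3 are easy to reproduce. The sketch below is our own code, not the paper's evaluation scripts: it computes LCSR as the longest-common-subsequence length divided by the length of the longer string (one common normalization; the paper does not spell out which one it uses), and the UNK rate as the fraction of output tokens missing from the downstream model's vocabulary:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming
    over one row of the standard LCS table at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(hyp, ref):
    """Longest common subsequence ratio between two strings."""
    if not hyp and not ref:
        return 1.0
    return lcs_len(hyp, ref) / max(len(hyp), len(ref))

def unk_rate(tokens, vocab):
    """Proportion of tokens unknown to the downstream (pivot) model."""
    return sum(t not in vocab for t in tokens) / len(tokens)
```

Both measures are computed per sentence and then averaged over the test set to obtain numbers comparable to the tables above.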
4.1.1 Translating Related Languages

The main challenge for the translation models between related languages is the restriction to very limited parallel training data. Character-level models make it possible to generalize to very basic translation units, leading to robust models in the sense of models without unknown events. The basic question is whether they provide reasonable translations with respect to given accepted references. Tables 2 and 3 give a comprehensive summary of various models for the languages selected in our experiments.

We can see that at least one character-based translation model outperforms the standard word-based model in all cases. This is true (and not very surprising) for the language pairs with very little training data, but it is also the case for language pairs with slightly more reasonable data sets like Bulgarian–Macedonian. The automatic measures indicate decent translation performances at this stage, which encourages their use in the pivot translation that we will discuss in the next section.

Furthermore, we can also see the influence of different character alignment algorithms. Somewhat surprisingly, the best results are achieved with IBM alignment models that are not designed for this purpose. Transducer-based alignments produce consistently worse translation models (at least in terms of BLEU scores). The reason for this might be that the IBM models can handle noise in the training data more robustly. However, in terms of unknown words, WFST-based alignment is very competitive and often the best choice (but not much different from the best IBM-based models). The use of character bigrams leads to further BLEU improvements for all data sets except Galician–Spanish. However, this data set is extremely small, which may cause unpredictable results. In any case, the differences between character-based alignments and bigram-based ones are rather small and our experiments do not lead to conclusive results.

4.1.2 Pivot Translation

In this section we now look at cascaded translations via the related pivot language. Tables 4 and 5 summarize the results for various settings. As we can see, the pivot translations for Catalan and Galician outperform the baselines by a large margin. Here, the baselines are, of course, very weak due to the minimal amount of training data. Furthermore, the Catalan–English test set appears to be very easy considering the relatively high BLEU scores achieved even with tiny amounts of training data for the baseline. Still, no test sentence appears in any training or development set for either direct translation or pivot models. From the results, we can also see that Catalan and Galician are quite different from Spanish and require language-specific treatment. Using a large Spanish–English model (with over 30% BLEU in both directions) to translate from or to Catalan or Galician is not an option. The experiments show that character-based pivot models lead to better translations than word-based pivot models (in terms of BLEU scores). This reflects the performance gains presented in Table 2. Rescoring of N-best lists, on the other hand, does not have a big impact on our results. However, we did not spend time optimizing the parameters of N-best size and interpolation weight.

Model (BLEU in %)                     1x1       10x10
English – Catalan (baseline)          26.70
English – (Spanish = Catalan)         8.38
English – Spanish -word- Catalan      38.91++   39.59++
English – Spanish -char- Catalan      44.46++   46.82++
Catalan – English (baseline)          27.86
(Catalan = Spanish) – English         9.52
Catalan -word- Spanish – English      38.41++   38.65++
Catalan -char- Spanish – English      40.43++   40.73++
English – Galician (baseline)         —
English – (Spanish = Galician)        7.46
English – Spanish -word- Galician     20.55     20.76
English – Spanish -char- Galician     21.12     21.09
Galician – English (baseline)         —
(Galician = Spanish) – English        5.76
Galician -word- Spanish – English     13.16     13.20
Galician -char- Spanish – English     16.04     16.02

Table 4: Translating between Galician/Catalan and English via Spanish using a standard phrase-based SMT baseline, Spanish–English SMT models to translate from/to Catalan/Galician, and pivot-based approaches using word-level models or character-level models (based on IBMbigram alignments) with either one-best (1x1) or N-best lists (10x10 with α = 0.85).

Model (BLEU in %)                     1x1       10x10
English – Maced. (baseline)           11.04
English – Bosn. -word- Maced.         7.33−−    7.64
English – Bosn. -char- Maced.         9.99      10.34
English – Bulg. -word- Maced.         12.49++   12.62++
English – Bulg. -char- Maced.         11.57++   11.59+
Maced. – English (baseline)           20.24
Maced. -word- Bosn. – English         12.36−−   12.48−−
Maced. -char- Bosn. – English         18.73−    18.64−−
Maced. -word- Bulg. – English         19.62     19.74
Maced. -char- Bulg. – English         21.05     21.10

Table 5: Translating between Macedonian (Maced) and English via Bosnian (Bosn) / Bulgarian (Bulg).
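The 1x1 and 10x10 settings in Tables 4 and 5 correspond to cascading either single-best outputs or 10-best lists through the pivot language. The following is a schematic sketch of the n-best cascade; the score interpolation with weight alpha is our simplification of the combination (the paper only states that 10x10 lists with α = 0.85 were used), and the step functions are assumed to return (hypothesis, log-score) pairs:

```python
def pivot_translate(src, step1, step2, n=10, alpha=0.85):
    """Two-step pivot translation: source -> pivot -> target.

    step1(src, n) and step2(pivot_hyp, n) each return up to n
    (hypothesis, log_score) pairs.  All n*n combinations are scored
    with an interpolation of the two steps' scores, and the best
    target hypothesis is returned.  With n=1 this is the 1x1 setting.
    """
    best_hyp, best_score = None, float("-inf")
    for pivot_hyp, s1 in step1(src, n):
        for tgt_hyp, s2 in step2(pivot_hyp, n):
            score = alpha * s1 + (1.0 - alpha) * s2
            if score > best_score:
                best_hyp, best_score = tgt_hyp, score
    return best_hyp
```

A larger first-step weight (alpha close to 1) keeps the cascade close to the 1x1 behaviour while still allowing a strong second-step hypothesis to rescue a slightly worse pivot translation.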
The results from the Macedonian task are not as clear. This is especially due to the different setup, in which the baseline uses more training data than any of the related-language pivot models. However, we can still see that the pivot translation via Bulgarian clearly outperforms the baseline. For the case of translating to Macedonian via Bulgarian, the word-based model seems to be more robust than the character-level model. This may be due to a larger number of non-words generated by the character-based pivot model. In general, the BLEU scores are much lower for all models involved (even for the high-density languages), which indicates larger problems with the generation of correct output and intermediate translations.

Interestingly, we can achieve almost the same performance as the baseline when translating via Bosnian, even though we had much less training data at our disposal for the translation between Macedonian and Bosnian. In this setup, we can see that a character-based model was necessary in order to obtain the desired abstraction from the tiny amount of training data.

4.2 Task 2: Pivoting for Domain Adaptation

Sparse resources are not only a problem for specific languages but also for specific domains. SMT models are very sensitive to domain shifts, and domain-specific data is often rare. In the following, we investigate a test case of translating between two languages (English and Norwegian) with reasonable amounts of data resources but in the wrong domain (movie subtitles instead of legal texts). Here again, we facilitate the translation process by a pivot language, this time with domain-specific data.

The task is to translate legal texts from Norwegian (Bokmål) to English and vice versa. The test set is taken from the English–Norwegian Parallel Corpus (ENPC) (Johansson et al., 1996) and contains 1493 parallel sentences (a selection of European treaties, directives and agreements). Otherwise, there is no training data available in this domain for English and Norwegian. Table 6 lists the other data resources we used in our study.

Language pair        Domain     #sent's  #words
English–Norwegian    subtitles  2.4M     18M
Norwegian–Danish     subtitles  1.5M     10M
Danish–English       DGT-TM     430k     9M

Table 6: Training data available for the domain adaptation task. DGT-TM refers to the translation memories provided by the JRC (Steinberger et al., 2006).

As we can see, there is a decent amount of training data for English–Norwegian, but the domain is strikingly different. On the other hand, there is in-domain data for other languages like Danish that may act as an intermediate pivot. Furthermore, we have out-of-domain data for the translation between the pivot and Norwegian. The sizes of the training data sets for the pivot models are comparable (in terms of words). The in-domain pivot data is controlled and very consistent and, therefore, high-quality translations can be expected. The subtitle data is noisy and includes various movie genres. It is important to mention that the pivot data still does not contain any sentence included in the English–Norwegian test set.

Table 7 summarizes the results of our experiments when using Danish and in-domain data as a pivot in translations from and to Norwegian.
Model (task: English – Norwegian)            BLEU
(step 1) English –dgt– Danish                52.76
(step 2) Danish –subs_wo– Norwegian          29.87
(step 2) Danish –subs_ch– Norwegian          29.65
(step 2) Danish –subs_bi– Norwegian          25.65
English –subs– Norwegian (baseline)          7.20
English –dgt– (Danish = Norwegian)           9.44++
English –dgt– Danish –subs_wo– Norwegian     17.49++
English –dgt– Danish –subs_ch– Norwegian     17.61++
English –dgt– Danish –subs_bi– Norwegian     14.07++

Model (task: Norwegian – English)            BLEU
(step 1) Norwegian –subs_wo– Danish          30.15
(step 1) Norwegian –subs_ch– Danish          27.81
(step 1) Norwegian –subs_bi– Danish          28.52
(step 2) Danish –dgt– English                57.23
Norwegian –subs– English (baseline)          11.41
(Norwegian = Danish) –dgt– English           13.21++
Norwegian –subs+dgtLM– English               13.33++
Norwegian –subs_wo– Danish –dgt– English     25.75++
Norwegian –subs_ch– Danish –dgt– English     23.77++
Norwegian –subs_bi– Danish –dgt– English     26.29++

Table 7: Translating out-of-domain data via Danish. Models using in-domain data are marked with dgt and out-of-domain models are marked with subs. subs+dgtLM refers to a model with an out-of-domain translation model and an added in-domain language model. The subscripts wo, ch and bi refer to word, character and bigram models, respectively.

The influence of in-domain data in the translation process is enormous. As expected, the out-of-domain baseline does not perform well even though it uses the largest amount of training data in our setup. It is even outperformed by the in-domain pivot model when pretending that Norwegian is in fact Danish. For the translation into English, the in-domain language model helps a little bit (similar resources are not available for the other direction). However, having the strong in-domain model for translating to (and from) the pivot language improves the scores dramatically. The out-of-domain model in the other part of the cascaded translation does not destroy this advantage completely, and the overall score is much higher than any other baseline.

In our setup, we again used a closely related language as a pivot. However, this time we had more data available for training the pivot translation model. Naturally, the advantages of the character-level approach diminish and the word-level model becomes a better alternative. However, there can still be a good reason for the use of a character-based model, as we can see in the success of the bigram model (–subs_bi–) in the translation from Norwegian to English (via Danish). A character-based model may generalize beyond domain-specific terminology, which leads to a reduction of unknown words when applied to a new domain. Note that using a character-based model in step two could possibly cause more harm than using it in step one of the pivot-based procedure. Using N-best lists for a subsequent word-based translation in step two may fix errors caused by character-based translation simply by ignoring hypotheses containing them, which makes such a model more robust to noisy input.
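The subs+dgtLM configuration in Table 7 pairs an out-of-domain translation model with an added in-domain language model. In a Moses-style log-linear system this is simply an additional weighted feature; the sketch below shows the related, simpler idea of linearly interpolating an in-domain and an out-of-domain LM. It is only an illustration: the dict representation and the weight lam are our assumptions, not details from the paper.

```python
def interpolate_lms(p_in, p_out, lam=0.5):
    """Return a language model mixing an in-domain model p_in with an
    out-of-domain model p_out.  Both are represented as plain dicts
    mapping (context, word) to a probability; unseen events get 0.0
    here (a real LM would back off and smooth instead)."""
    def p(word, context):
        key = (context, word)
        return lam * p_in.get(key, 0.0) + (1.0 - lam) * p_out.get(key, 0.0)
    return p
```

A higher lam shifts probability mass toward in-domain phrasing (legal terminology) while the out-of-domain model still covers general language seen only in the subtitle data.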
Finally, as an alternative, we can also look at other pivot languages. The domain adaptation task is not at all restricted to closely related pivot languages, especially considering the success of word-based models in the experiments above. Table 8 lists results for three other pivot languages. Surprisingly, the results are much worse than for the Danish test case. Apparently, these models are strongly influenced by the out-of-domain translation between Norwegian and the pivot language. The only success can be seen with another closely related language, Swedish. Lexical and syntactic similarity seems to be important to create models that are robust enough for domain shifts in the cascaded translation setup.

Pivot=xx   en–xx   xx–no   en–xx–no
German     53.09   23.60   3.15−−
French     66.47   17.84   5.03−−
Swedish    52.62   24.79   10.07++

Pivot=xx   no–xx   xx–en   no–xx–en
German     15.02   53.02   5.52−−
French     17.69   65.85   8.78−−
Swedish    19.72   59.55   16.35++

Table 8: Alternative word-based pivot translations between Norwegian (no) and English (en).

5 Related Work

There is a wide range of pivot language approaches to machine translation, and a number of strategies have been proposed. One of them is often called triangulation and usually refers to the combination of phrase tables (Cohn and Lapata, 2007).
Phrase translation probabilities are merged and lexical weights are estimated by bridging word alignment models (Wu and Wang, 2007; Bertoldi et al., 2008). Cascaded translation via pivot languages is discussed by Utiyama and Isahara (2007) and is frequently used by various researchers (de Gispert and Mariño, 2006; Koehn et al., 2009; Wu and Wang, 2009) and commercial systems such as Google Translate. A third strategy is to generate or augment data sets with the help of pivot models. This is, for example, explored by de Gispert and Mariño (2006) and Wu and Wang (2009) (who call it the synthetic method). Pivoting has also been used for paraphrasing and lexical adaptation (Bannard and Callison-Burch, 2005; Crego et al., 2010). Nakov and Ng (2009) investigate pivot languages for resource-poor languages (but only when translating from the resource-poor language). They also use transliteration for adapting models to a new (related) language. Character-level SMT has been used for transliteration (Matthews, 2007; Tiedemann and Nabende, 2009) and also for the translation between closely related languages (Vilar et al., 2007; Tiedemann, 2009a).

6 Conclusions and Discussion

In this paper, we have discussed possibilities to translate via pivot languages on the character level. These models are useful to support under-resourced languages and to explore strong lexical and syntactic similarities between closely related languages. Such an approach makes it possible to train reasonable translation models even with extremely sparse data sets. Moreover, character-level models introduce an abstraction that reduces the number of unknown words dramatically. In most cases, these unknown words represent information-rich units that bear large portions of the meaning to be translated. The following illustrates this effect on example translations with and without pivot model:

Example: Catalan – English (via Spanish)
Reference:  I have to grade these papers.
Baseline:   Tincque qualificar these examens.
Pivot_word: Tincque qualificar these tests.
Pivot_char: I have to grade these papers.

Example: Macedonian – English (via Bulgarian)
Reference:  It's a simple matter of self-preservation.
Baseline:   It's simply a question of себесочувување.
Pivot_word: That's a matter of себесочувување.
Pivot_char: It's just a question of yourself.

Leaving unseen words untranslated is not only annoying (especially if the input language uses a different writing system) but often makes translations completely incomprehensible. Pivot translations will still not be perfect (see example two above), but can at least be more intelligible. Character-based models can even take care of tokenization errors like the one shown above ("Tincque" should be two words, "Tinc que"). Fortunately, the generation of non-word sequences (observed as unknown words) does not seem to be a big problem, and no special treatment is required to avoid such output. We would still like to address this issue in future work by adding a word-level LM to character-based SMT. However, Vilar et al. (2007) already showed that this did not have any positive effect in their character-based system.

In a second study, we also showed that pivot models can be useful for adapting to a new domain. The use of in-domain pivot data leads to systems that outperform out-of-domain translation models by a large margin. Our findings point to many prospects for future work. For example, we would like to investigate combinations of character-based and word-based models. Character-based models may also be used for treating unknown words only. Multiple-source approaches via several pivots are another possibility to be explored. Finally, we also need to further investigate the robustness of the approach with respect to other language pairs, data sets and learning parameters.

References

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008. Phrase-based statistical machine translation with pivot languages. In Proceedings of the International Workshop on Spoken Language Translation, pages 143–149, Hawaii, USA.
Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728–735, Prague, Czech Republic, June. Association for Computational Linguistics.

Josep Maria Crego, Aurélien Max, and François Yvon. 2010. Local lexical adaptation in machine translation through triangulation: SMT helping SMT. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 232–240, Beijing, China, August. Coling 2010 Organizing Committee.

A. de Gispert and J. B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proceedings of the 5th Workshop on Strategies for Developing Machine Translation for Minority Languages (SALTMIL'06) at LREC, pages 65–68, Genova, Italy.

Mark Fishel and Harri Kirik. 2010. Linguistically motivated unsupervised segmentation for machine translation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 1741–1745, Valletta, Malta.

Mark Fishel. 2009. Deeper than words: Morph-based alignment for statistical machine translation. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (PacLing 2009), Sapporo, Japan.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, April. Association for Computational Linguistics.

Stig Johansson, Jarle Ebeling, and Knut Hofland. 1996. Coding and aligning the English-Norwegian Parallel Corpus. In K. Aijmer, B. Altenberg, and M. Johansson, editors, Languages in Contrast, pages 87–112. Lund University Press.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 machine translation systems for Europe. In Proceedings of MT Summit XII, pages 65–72, Ottawa, Canada.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A hybrid morpheme-word representation for machine translation of morphologically rich languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148–157, Cambridge, MA, October. Association for Computational Linguistics.

David Matthews. 2007. Machine transliteration of proper names. Master's thesis, School of Informatics, University of Edinburgh.

Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1358–1367, Singapore, August. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July. Association for Computational Linguistics.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532, May.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, and Dan Tufiş. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2142–2147.

Graham A. Stephen. 1992. String search. Technical report, School of Electronic Engineering Science, University College of North Wales, Gwynedd.

Jörg Tiedemann and Peter Nabende. 2009. Translating transliterations. International Journal of Computing and ICT Research, 3(1):33–41.

Jörg Tiedemann. 2009a. Character-based PSMT for closely related languages. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT'09), pages 12–19, Barcelona, Spain.

Jörg Tiedemann. 2009b. News from OPUS – a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484–491, Rochester, New York, April. Association for Computational Linguistics.

David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33–39, Prague, Czech Republic, June. Association for Computational Linguistics.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 856–863, Prague, Czech Republic, June. Association for Computational Linguistics.
Does more data always yield better translations?

Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer and Francisco Casacuberta
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n, 46022 València, Spain
{ggasco,mrocha,gsanchis,jandres,fcn}@dsic.upv.es

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 152–161, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with all the available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.

1 Introduction

Globalisation and the popularisation of the Internet have led to a rapid increase in the amount of bilingual corpora available. Entities such as the European Union, the United Nations and other multinational organisations need to translate all the documentation they generate. Such translations happen every day and provide very large multilingual corpora, which are oftentimes difficult to process and significantly increase the computational requirements needed to train statistical machine translation (SMT) systems. For instance, the corpora made available for recent machine translation evaluations are in the order of 1 billion running words (Callison-Burch et al., 2010).

However, two main problems arise when attempting to use this huge pool of sentences for training SMT systems: firstly, a large portion of this data is obtained from domains that differ from that in which the SMT system is to be used or assessed; secondly, the use of all this data for training the system increases the computational training requirements. Despite the previous remarks, the de facto standard consists in training SMT systems with all the available data. This is due to the widespread misconception that the more data a system is trained with, the better its performance should be. Although the previous statement is theoretically true if all the data belongs to the same domain, this is not the case in the problems tackled by most SMT systems. For instance, enterprises often need to build on-demand systems (Yuste et al., 2010). In this case, since we are interested in translating some specific text, it is not clear whether training a system with all the data yields better performance than training it with a wisely selected subset of bilingual sentences.

The bilingual sentence selection (BSS) task is stated as the problem of selecting the best subset of bilingual sentences from an available pool of sentences with which to train an SMT system. This paper is concerned with BSS, and mainly two ideas are developed.

On the one hand, two BSS strategies that attempt to build better translation systems are analysed. Such strategies are able to improve state-of-the-art translation quality without the very high computational resources that are required when using the complete pool of sentences. Both techniques span two orthogonal criteria when selecting bilingual sentences from the available pool: avoiding the introduction of a bias in the original data distribution, and increasing the informativeness of the corpus.

On the other hand, we prove that among all possible subsets of the sentence pool, there is at least a small one that yields large improvements (up to 10 BLEU points) with respect to a system trained with all the data. In order to retrieve such a subset, we had to use an oracle that employs information extracted from the reference translations only for the purpose of selecting bilingual sentences. However, references are not used at any stage within the translation system for obtaining the hypotheses. Note that although we are not able to achieve such an improvement without an oracle, this result restates the BSS problem as an interesting approach not only for reducing computational effort but also for significantly boosting performance. To our knowledge, no previous work has quantified the room for improvement available to BSS techniques.

In order to assess the performance of the different BSS techniques, translation results are obtained by using a standard state-of-the-art SMT system (Koehn et al., 2007). The most recent literature defines the SMT problem (Papineni et al., 1998; Och and Ney, 2002) as follows: given an input sentence f from a certain source language, the purpose is to find an output sentence ê in a certain target language such that

  ê = argmax_e Σ_{k=1}^{K} λ_k h_k(f, e)   (1)

where h_k(f, e) is a score function representing an important feature for the translation of f into e, as for example the language model of the target language, a reordering model or several translation models. The λ_k are the log-linear combination weights.

The main contributions of this paper are:

• A BSS technique is analysed which improves the results obtained with a random bilingual sentence selection strategy when the specific domain to be translated significantly differs from that of the pool of sentences.

• Another BSS technique is analysed that, using less than 0.5% of the sentences available, significantly improves over random selection, beating a system trained with the whole pool of sentences.

• We prove, by means of an oracle, that a wise BSS technique can yield large improvements when compared with systems trained with all the data available.

The remainder of the paper is structured as follows. Section 2 summarises the related work. Sections 3 and 4 present two BSS techniques, namely, probabilistic sampling and recovery of infrequent n-grams. In Section 5 experimental results are reported. Finally, the main results of the work and several future work directions are discussed in Section 6.

2 Related Work

Training data selection has been receiving an increasing amount of attention within the SMT community. For instance, in (Li et al., 2010; Gascó et al., 2010) several BSS techniques, similar to those analysed in this paper, have been applied for training MT systems when there are large training corpora available. However, such techniques have neither been formalised, nor has their performance been thoroughly analysed. A similar approach that gives weights to different subcorpora was proposed in (Matsoukas et al., 2009).

In (Lu et al., 2007), information retrieval methods are used in order to produce different submodels which are then weighted according to the sentence to be translated. In such work, the authors define the baseline as the result obtained training only with the corpus that shares the same domain as the test. Afterwards they claim that they are able to improve baseline translation quality by adding new sentences retrieved with their method. However, they neither compare their technique with random sentence selection, nor with a model trained with all the corpora.

Although the techniques that are applied for BSS are often very similar to those applied for active learning (AL), both problems are essentially different. Since the AL strategies assume that the pool of sentences is not translated, they are usually interested in finding the best monolingual subset of sentences to be translated by a human annotator. In contrast, in BSS, it is assumed that a fairly large amount of bilingual corpora is readily available, and the main goal consists in selecting only those sentences which will maximise system performance.

Some works have applied sentence selection in small-scale AL frameworks. These works extend the training corpora with at most 5000 sentences. In (Ananthakrishnan et al., 2010), sentences are selected by means of discriminative techniques. In (Haffari et al., 2009) a technique is proposed for increasing the counts of phrases that are considered infrequent. Both works significantly differ from the current work not only in the framework, but also in the scale of the experiments, the proposed techniques and the obtained improvements. Similar ideas applied to adaptation problems have been proposed in (Moore and Lewis, 2010; Axelrod et al., 2011).

3 Probabilistic Sampling

As discussed in Section 2, BSS has inherently attached many meaningful links with AL techniques. Selecting samples for learning our models incurs a well-known difficulty in AL, the so-called sample bias problem (Dasgupta, 2009). This problem, which carries over to the BSS case, is summarised as the distortion introduced by the active strategy into the probability distribution underlying the training corpus. This bias forces the training algorithm to learn a distorted probability model which can significantly differ from the actual one.
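As a toy illustration of this distortion (a hypothetical sketch with made-up numbers, not an experiment from the paper): estimating a categorical distribution by maximum likelihood from a biased subsample shifts probability mass towards whatever the selection strategy favours, while a uniform subsample stays close to the true distribution.

```python
import random
from collections import Counter

random.seed(0)

# True distribution over combined sentence lengths (hypothetical numbers).
true_dist = {5: 0.5, 10: 0.3, 20: 0.2}
pool = random.choices(list(true_dist), weights=list(true_dist.values()), k=10000)

def mle(sample):
    """Maximum likelihood estimate of the length distribution from a sample."""
    counts = Counter(sample)
    total = len(sample)
    return {length: counts[length] / total for length in sorted(counts)}

# Unbiased subsample: the MLE stays close to the true distribution.
uniform_sample = random.sample(pool, 1000)

# "Active" strategy biased towards long sentences: the MLE is distorted.
biased_sample = sorted(pool, reverse=True)[:1000]

print(mle(uniform_sample))  # roughly {5: 0.5, 10: 0.3, 20: 0.2}
print(mle(biased_sample))   # all mass concentrated on length 20
```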
In order to further analyse the sampling bias problem, consider the maximum likelihood estimation (MLE) of a probability model pθ(e, f) for a given corpus of N data points {(en, fn)} sampled from the actual probability distribution Pr(e, f). Recall that e denotes a target sentence whereas f stands for its source counterpart. MLE techniques aim at minimising the Kullback-Leibler divergence between the actual, unknown probability distribution and the probability model (Bishop, 2006), defined as

  KL(Pr ∥ pθ) = Σ_{e,f} Pr(e, f) log( Pr(e, f) / pθ(e, f) )   (2)

When minimising, Eq. (2) is simplified to

  θ̂ = argmax_θ Σ_{e,f} Pr(e, f) log pθ(e, f)   (3)

which is approximated by a sufficiently large dataset under the commonly held assumption that it is independently and identically distributed according to Pr(e, f), as

  θ̂ = argmax_θ Σ_n log pθ(en, fn)   (4)

Therefore, by perturbing the sample {(en, fn)} with an active strategy we are, in fact, modifying the approximation in Eq. (3) and learning a different underlying probability distribution.

In this section a statistical framework is proposed to build systems with BSS while avoiding the sample bias. The proposed approach relies on conserving the probability distribution of the task domain by wisely selecting the bilingual pairs to be used from the whole pool of sentences. Hence, it is mandatory to exclude sentences from the pool that distort the actual probability. In order to approximate the probability distribution, we assume that a small but representative corpus is available for the task domain. This corpus, referred to henceforth as the in-domain corpus, provides a way to build an initial model which approximates the actual probability of the system. The pool of sentences will, by opposition, be denoted as the out-of-domain corpus.

The actual probability of the task domain, the so-called in-domain probability, is approximated with the following model:

  p(e, f, |e|, |f|) = p(e, f | |e|, |f|) · p(|e|, |f|)   (5)

where p(|e|, |f|) denotes the in-domain length probability and p(e, f | |e|, |f|) the in-domain bilingual probability.

The length probability is estimated by MLE as

  p(|e|, |f|) = N(|e| + |f|) / N   (6)

where N(|e| + |f|) is the number of bilingual pairs in the in-domain corpus whose lengths sum up to |e| + |f|, and N denotes the total number of sentences. Note that no distinction is made between source and target lengths, since the model is intended for sampling.

The complexity of the in-domain bilingual probability distribution p(e, f | |e|, |f|) requires a more sophisticated approximation:

  p(e, f | |e|, |f|) = exp(Σ_k γ_k f_k(e, f)) / Z   (7)

Z being a normalisation constant, and f_k(·) and γ_k being the features of the model and their respective parametric weights. Specifically, four logarithmic features were considered for this sampling technique: a direct and an inverse IBM model 4 (Brown et al., 1994), and both source and target 5-gram language models. All feature models are estimated on the in-domain corpus with standard techniques (Brown et al., 1994; Stolcke, 2002). As a first approach, the parameters of the log-linear model in Eq. (7), γ_k, were uniformly fixed to 1.

Once we have an appropriate model for the in-domain probability distribution, the proposed method randomly samples a given number of bilingual pairs from the out-of-domain corpora (the pool of sentences). The process of extending the in-domain corpus with additional bilingual pairs from the out-of-domain corpus is summarised as follows:

• Decide, according to the in-domain length probability in Eq. (6), how many samples should be drawn for each length, i.e. divide the number of sentences to add into length-dependent buckets.

• Randomly draw the number of samples specified in each bucket according to the in-domain bilingual probability in Eq. (7), among all the bilingual sentences that share the current bucket length.
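The two-step procedure above can be sketched as follows. This is a minimal sketch under simplifying assumptions: the in-domain bilingual probability of Eq. (7) is replaced by a hypothetical `score` function standing in for the IBM-model and language-model features, and buckets are filled by repeated weighted draws.

```python
import random
from collections import Counter, defaultdict

def select_by_length_buckets(in_domain_lengths, pool, n_add, score):
    """Draw n_add sentence pairs from `pool`, matching the in-domain
    combined-length distribution (Eq. 6) and, within each length bucket,
    preferring pairs with a high model score (standing in for Eq. 7)."""
    length_counts = Counter(in_domain_lengths)
    n_in = len(in_domain_lengths)
    # Step 1: length-dependent bucket sizes, proportional to p(|e|, |f|).
    quota = {L: round(n_add * c / n_in) for L, c in length_counts.items()}

    # Group the pool by combined (source + target) length.
    buckets = defaultdict(list)
    for e, f in pool:
        buckets[len(e) + len(f)].append((e, f))

    # Step 2: weighted sampling, without replacement, inside each bucket.
    selected = []
    for L, k in quota.items():
        bucket = buckets.get(L, [])
        for _ in range(min(k, len(bucket))):
            weights = [score(e, f) for e, f in bucket]
            pick = random.choices(range(len(bucket)), weights=weights)[0]
            selected.append(bucket.pop(pick))   # pop = no replacement
    return selected

random.seed(0)
in_domain_lengths = [4] * 8 + [6] * 2           # toy in-domain length sample
pool = [(["s"] * 2, ["t"] * 2) for _ in range(10)] \
     + [(["s"] * 3, ["t"] * 3) for _ in range(5)]
selected = select_by_length_buckets(in_domain_lengths, pool, 5,
                                    score=lambda e, f: 1.0)
# 4 of the 5 drawn pairs have combined length 4, and 1 has length 6,
# mirroring the 8:2 in-domain length distribution.
```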
Although the pool of sentences is typically large, it is not large enough to gather a significant amount of probability mass. Consequently, a small set of sentences accumulates most of the probability mass and tends to be selected multiple times. To avoid this awkward and undesired behaviour, the sampling is performed without replacement.

4 Infrequent n-gram Recovery

Another criterion when confronting the BSS task is to increase the informativeness of the training set. Thus, it seems important to choose sentences that provide information not seen in the training corpus. Note that this criterion is sometimes opposed to the one presented in Section 3.

The performance of phrase-based machine translation systems strongly relies on the quality of the phrases extracted from the training samples. In most of the cases, the inference of such phrases or rules is based on word alignments, which cannot be computed accurately for words appearing rarely in the training corpus. The extreme case are the out-of-vocabulary words: words that do not appear in the training set cannot be translated. Moreover, this problem can be extended to sequences of words (n-grams). Consider a 2-gram f_i f_j appearing few or no times in the training set. Although f_i and f_j may appear separately in the training set, the system might not be able to infer the translation of the 2-gram f_i f_j, which may be different from the concatenation of the translations of both words separately.

When selecting sentences from the pool it is important to choose sentences that contain n-grams that have never been seen (or have been seen just a few times) in the training set. Such n-grams will be henceforth referred to as infrequent n-grams. An n-gram is considered infrequent when it appears fewer times than an infrequency threshold t. If the source language sentences to be translated are known beforehand, the set of infrequent n-grams can be reduced to those present in such sentences. Then, the technique consists in selecting from the pool those sentences which contain infrequent n-grams present in the source sentences to be translated.

Sentences in the pool are sorted by their infrequency score in order to select the most informative first. Let X be the set of n-grams that appear in the sentences to be translated and w one of them; C(w) the counts of w in the source language training set; and N(w) the counts of w in the source sentence f to be scored. The infrequency score of f is:

  i(f) = Σ_{w∈X} min(1, N(w)) · max(0, t − C(w))   (8)

In order to avoid giving a high score to noisy sentences with many occurrences of the same infrequent n-gram, only one occurrence of each n-gram is taken into account to compute the score. In addition, the score gives more importance to the n-grams with the lowest counts in the training set. Although it would be possible to simply select the highest-scored sentences, we updated the scores each time a sentence was selected. This decision was taken to avoid the selection of too many sentences with the same infrequent n-gram.
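A minimal sketch of the score in Eq. (8), together with the count update applied after each selection. Function and variable names here are mine, not the authors'; n-grams are restricted to orders 1-3, as in the paper's experiments.

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    """All n-grams of order 1..max_n in a tokenised sentence."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def infrequency_score(sentence, test_ngrams, train_counts, t):
    """Eq. (8): i(f) = sum over test n-grams w of min(1, N(w)) * max(0, t - C(w))."""
    sent_counts = Counter(ngrams(sentence))
    return sum(min(1, sent_counts[w]) * max(0, t - train_counts[w])
               for w in test_ngrams)

def greedy_select(pool, test_ngrams, train_counts, t, n_select):
    """Iteratively pick the highest-scoring sentence and update the counts,
    so the same infrequent n-gram is not chased repeatedly."""
    counts = Counter(train_counts)
    pool = list(pool)
    selected = []
    for _ in range(n_select):
        scored = [(infrequency_score(s, test_ngrams, counts, t), i)
                  for i, s in enumerate(pool)]
        best_score, best_i = max(scored)
        if best_score <= 0:          # every infrequent n-gram covered t times
            break
        chosen = pool.pop(best_i)
        selected.append(chosen)
        counts.update(ngrams(chosen))
    return selected

# Toy example: one pool sentence covers the test n-grams, then scores drop to 0.
test_ngrams = set(ngrams("the budget was criticised".split()))
pool = ["the budget was criticised by klaus".split(),
        "the budget".split(),
        "unrelated words entirely".split()]
selected = greedy_select(pool, test_ngrams, Counter(), t=1, n_select=3)
```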
First, sentences in the pool are scored using Equation (8). Then, in each iteration, the sentence f* with the highest score is selected, added to the training set and removed from the pool. In addition, the counts of the n-grams present in f* are updated and, hence, so are the scores of the rest of the sentences in the pool. Since rescoring the whole pool would incur a very high computational cost, a suboptimal search strategy was followed, in which the search was constrained to a given set of highest-scoring sentences. Here it was set to one million.

Table 1 shows the percentage of source language infrequent n-grams for the test set of a relatively small corpus such as the TED corpus (for details see Section 5) when considering just the in-domain training set (≈ 40K sentences), and the same percentage when adding the larger out-of-domain corpora. The percentages in the table have been computed separately for different values of the threshold t and for n-grams of order 1 to 4. Note that the reduction in the number of infrequent n-grams is very high for the 1-grams but decreases progressively when considering n-grams of higher order. This indicates that the infrequent n-gram recovery technique should be very effective for lower-order n-grams, but might have less effect for higher-order n-grams. Therefore, and in order to lower the computational cost involved, the experiments carried out for this paper were performed considering only infrequent 1-grams, 2-grams and 3-grams.

           t=1            t=10           t=25
        tr     all     tr     all     tr     all
  1-gr  11.6   1.3     40.5   3.5     59.9   5.1
  2-gr  38     9.8     73.2   21.3    84.9   27.9
  3-gr  66.8   33.5    91.1   55.7    96.4   64.9
  4-gr  87.1   65.8    98.2   85.5    99.4   90.7

Table 1: Percentage of infrequent n-grams in the TED test set when considering only the TED training set (tr), and when adding the out-of-domain pool (all), for different infrequency thresholds t.

  Subset  Language  |S|     |W|     |V|
  train   English   47.5K   747K    24.6K
          French            793K    31.7K
  dev     English   571     9.2K    1.9K
          French            10.3K   2.2K
  test    English   641     12.6K   2.4K
          French            12.8K   2.7K

Table 2: TED corpus main figures. K denotes thousands of elements. |S| stands for number of sentences, |W| for number of running words, and |V| for vocabulary size.

  Subset   Language  |S|    |W|     |V|
  train    English   77.2K  1.71M   29.9K
           French           1.99M   48K
  dev 08   English   2.1K   49.8K   8.7K
           French           55.4K   7.7K
  test 09  English   2.5K   65.6K   8.9K
           French           72.5K   10.6K
  test 10  English   2.5K   62K     8.9K
           French           70.5K   10.3K

Table 3: News Commentary corpus main figures.

5 Experiments

In the present section, we first describe the experimental framework employed to assess the performance of the BSS techniques described. Then, results for the probabilistic sentence selection strategy are shown, followed by results obtained with the infrequent n-grams technique. Some example translations are shown and, finally, we also report experiments using the infrequent n-grams technique in oracle mode, in order to establish the potential improvement of such a technique and of BSS in general.

5.1 Experimental Setup

All experiments were carried out using the open-source SMT toolkit Moses (Koehn et al., 2007), in its standard non-monotonic configuration. The phrase tables were generated by means of symmetrised word alignments obtained with GIZA++ (Och and Ney, 2003). The language model used was a 5-gram with modified Kneser-Ney smoothing (Kneser and Ney, 1995), built with the SRILM toolkit (Stolcke, 2002). The log-linear combination weights in Eq. (1) were optimised using Minimum Error Rate Training (Och and Ney, 2002) on the corresponding development sets.

Experiments were carried out on two corpora: TED (Paul et al., 2010) and News Commentary (NC) (Callison-Burch et al., 2010). TED is an English-French corpus composed of subtitles for a collection of public speeches on a variety of topics. The same partitions as in the IWSLT2010 evaluation task (Paul et al., 2010) have been used. Subtitles have been concatenated into complete sentences. NC is a slightly larger English-French corpus in the news domain. Main figures of both corpora are shown in Tables 2 and 3. As for the pool of sentences, three large corpora have been used: Europarl (Euro), United Nations (UN) and Gigaword (Giga), in the partition established for the 2010 workshop on SMT of the ACL (Callison-Burch et al., 2010). Sentences of length greater than 50 have been pruned. Table 4 shows the main figures of the tokenised and lowercased corpora.
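The preprocessing summarised above (tokenised, lowercased, sentence pairs longer than 50 words pruned) and the figures reported in Tables 2-4 can be sketched as follows. This is a simplified sketch: a whitespace tokeniser stands in for a real tokeniser, and whether the length cut-off applies to one or both sides is my assumption.

```python
def preprocess(bitext, max_len=50):
    """Lowercase, whitespace-tokenise and prune pairs where either side
    exceeds max_len words (assumed interpretation of the pruning step)."""
    kept = []
    for src, tgt in bitext:
        src_tok = src.lower().split()
        tgt_tok = tgt.lower().split()
        if len(src_tok) <= max_len and len(tgt_tok) <= max_len:
            kept.append((src_tok, tgt_tok))
    return kept

def corpus_figures(sentences):
    """|S|, |W| and |V| as reported in Tables 2-4."""
    words = [w for sent in sentences for w in sent]
    return {"S": len(sentences), "W": len(words), "V": len(set(words))}

bitext = [("Hello World", "Bonjour Monde"), ("word " * 60, "mot")]
kept = preprocess(bitext)                       # the 60-word pair is pruned
figures = corpus_figures([src for src, tgt in kept])
```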
variable n-grams are usually infrequent as well, 5.2 Results for Probabilistic Sampling which implies that the infrequent n-grams tech- In addition to the probabilistic sampling tech- nique would select sentences containing such n- nique proposed in Section 3, we also analysed the grams, even though they do not provide further effect of sampling only according to the combined information. As a first approach, we exclude n- source-reference length, with the purpose of es- grams without any letter. tablishing whether potential improvements were Baseline experiments have been carried out for only due to the length component, or rather to the TED and NC corpora using the corresponding complete sampling model. Results for the 2009 training set. For comparison purposes, we also test set are shown in Figure 1. Several things included results for a purely random sentence se- should be noted: lection without replacement. In the plots, each point corresponding to random selection represent • Performing sentence selection only according the average of 10 repetitions. Experiments using to sentence lengths does not achieve better all data are also reported, although a 64GB ma- performance than random selection. chine was necessary, even with binarized phrase • Selecting sentences according to probabilis- and distortion tables. tic sampling is able to improve random se- Experiments were conducted by selecting a lection in the case of the TED corpus, but fixed amount of sentences according to each one is not able to do so in the case of the NC of the techniques described above. Then, these corpus. Significance tests for the 500K case sentences were included into the training data and reported that the differences were significant subsequent SMT systems were built for translat- in the case of the TED corpus, but not in the ing the test set. case of the NC corpus. 
Results are shown in terms of BLEU (Papineni • In the case of the TED corpus, the perfor- et al., 2001), which is an accuracy metric that mance achieved with the system built by measures n-gram precision, with a penalty for sampling 500K sentences is only 0.5 BLEU sentences that are too short. Although it could points below the performance achieved by be argued that improvements obtained might be the system built with all the data available. due to a side effect of the brevity penalty, this The explanation to the fact that probabilistic was not found to be true: the BSS techniques (in- sampling is able to improve over random sam- cluding random) and considering all data yielded pling only in the case of the TED corpus, but not very similar brevity penalties (±0.005), within in the case of NC, relies in the nature of the cor- each corpus. In addition, TER scores (Snover et pora. Although both of them belong to a very al., 2006) were also computed, but are omitted generic domain, their characteristics are very dif- for clarity purposes and since they were found to ferent. In fact, the NC data is very similar to the be coherent with BLEU. TER is an error metric sentences in the pool, but, in contrast, the sen- that computes the minimum number of edits re- tences present in the TED corpus have a much quired to modify the system hypotheses so that more different structure. This difference is illus- they match the references translations. trated in Figure 2, where the relative frequency of 157 TED corpus NC corpus 24 in domain length in domain length all sampling 22 all sampling random random 23 BLEU BLEU 21 22 20 19 21 0 100K 200K 300K 400K 500K 0 100K 200K 300K 400K 500K Number of sentences added Number of sentences added Figure 1: Effect of adding sentences over the BLEU score using the probabilistic sampling, length sampling and random selection techniques for the two corpora, TED and News Commentary. 
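The combined-length histograms compared in Figure 2 are straightforward to compute. A sketch on toy data (the pairs below are hypothetical, not from the corpora used in the paper):

```python
from collections import Counter

def combined_length_histogram(bitext):
    """Relative frequency of the combined (source + target) sentence
    length, as plotted in Figure 2."""
    lengths = [len(src) + len(tgt) for src, tgt in bitext]
    counts = Counter(lengths)
    total = len(lengths)
    return {length: counts[length] / total for length in sorted(counts)}

# Toy bitext with tokenised sentence pairs.
bitext = [(["a", "b"], ["c", "d"]),       # combined length 4
          (["a", "b"], ["c", "d", "e"]),  # combined length 5
          (["a"], ["b", "c", "d"])]       # combined length 4
hist = combined_length_histogram(bitext)
```

Comparing such histograms for the in-domain corpus and the pool gives a quick indication of how much room a length-aware selection strategy has to differ from random selection.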
In this plot, it stands out clearly that the TED corpus has a very different length distribution from the other four corpora considered, whereas the NC corpus presents a very similar distribution. This implies that, when considering TED, an intelligent data selection strategy will have better chances of improving on random selection than in the case of NC.

5.3 Results for Infrequent n-gram Recovery

[Figure 3 appears here: BLEU plotted against the number of sentences added (0-200K), one panel per corpus (TED, NC), with curves for t=10, t=25 and random selection.]
Figure 3: Effect of adding sentences over the BLEU score using the infrequent n-grams (with different thresholds) and random selection techniques for the two corpora, TED and News Commentary. Horizontal lines represent the scores when using just the in-domain training set and all the data available.

Figure 3 shows the effect of adding sentences using the infrequent n-grams and the random selection techniques on the 2009 test set. Once all the infrequent n-grams have been covered t times, the infrequency score for all the sentences remaining in the pool is 0, and none of them can be selected. Hence, the number of sentences that can be selected for each t is limited. Although for clarity we only show results for t = {10, 25}, experiments have also been carried out for t = {1, 5, 10, 25}. Such results presented similar curves, although fewer sentences can be selected and hence the improvements obtained are slightly lower. Several conclusions can be drawn:

• The translation quality provided by the infrequent n-grams technique is significantly better than the results achieved with random selection, comparing similar amounts of sentences. Specifically, the improvements obtained are in the range of 3 BLEU points.

• Results for the TED corpus are more irregular. The best performance is achieved for t = 25 and 50K sentences added. In NC, the best result is for t = 10 and 112K.

• Selecting sentences with the infrequent n-grams technique provides better results than including all the available data. While using less than 0.5% of the data, improvements between 0.5 and 1 BLEU points are achieved.
t = 10 23.6 59.2 16.5M Figure 4: Examples of two translations for each of the SMT systems built: Src (source sentence), Bsl (base- Table 5: Effect of the infrequent n-gram recovery tech- line), Rdm (random selection), PS (probabilistic sam- nique for an unseen test set, when setting t = 10 and pling), All (all the data available), Infr (Infrequent n- number of phrases (parameters) of the models. grams) and Ref (reference). to translate criticised, which is considered out-of- 5.4 Oracle Results vocabulary. Even though random selection is able In order to analyse the potential of BSS tech- to solve this problem (luckily), it does not achieve niques, the infrequent n-grams recovery tech- to translate it correctly, introducing a concordance nique in Section 4 was implemented in oracle error. A similar thing happens when using prob- mode. In this way, sentences from the pool abilistic sampling, where a grammatical error is were selected according to the infrequent n-grams also present, and only Infr and All are able present in the reference translations of the test set. to present a correct translation. This is not only Note that test references were not included into casual, since, by ensuring that a given n-gram ap- the training data as such, but were rather used pears at least a certain number of times t, the odds to establish which bilingual sentences within the of including all possible translations of criticised pool were best suitable for training the SMT sys- are incremented significantly. Note that, even if tem. In this way, we were able to establish the po- the Infr translation is different from the refer- tential for improvement of a BSS technique. In- ence, it is equally correct. In the second example, terestingly, the SMT system trained in this way the baseline translation is pretty much correct, but achieved 31 BLEU points on the News Commen- has a different meaning (something like “and one tary 2009 test set, i.e. an 8 BLEU points improve- has music”). 
Similarly, when including all data ment over the system trained with all the data the translation obtained by the system means “and available. This result would have beaten all the some music”. In this case, both random and prob- systems that took part in the 2009 Workshop on abilistic selection present grammatically incorrect Machine translation (Callison-Burch et al., 2009). sentences, and only Infr is able to provide a cor- This result is really important: although we are rect translation, although pretty literal and differ- aware that the sentences were selected in a non- ent from the reference. realistic manner, it proves that an appropriate BSS technique would be able to boost SMT perfor- 6 Discussion mance in a very significant manner. Similar re- sults were obtained with the TED and NC 2010 Bilingual sentence selection (BSS) might be un- test sets, with 10 and 7 points improvement, re- derstood to be closely related to adaptation, even spectively. though both paradigms tackle problems which are, in essence, different. The goal of an adap- 5.5 Example Translations tation technique is to adapt model parameters, Example translations are shown in Figure 4. In which have been estimated on a large out-of- the first example, the baseline system is not able domain (or generic) data set, so that they are 159 best suitable for dealing with a domain-specific larger corpora or even more complex techniques, test set. This adaptation process is ought to be such as synchronous grammars or hierarchical achieved by means of a (potentially small) adapta- models. For instance, the infrequent n-grams tion set, which belongs to the same domain as the technique has beaten all the other systems using test data. 
In contrast, BSS tackles with the prob- just a small fraction of the corpus, only 0.5%, and lem of how to select samples from a large pool is yet able to outperform a system trained with all of training data, regardless of whether such pool the data by 0.9 BLEU points and the random base- of data is in-domain or out-of-domain. Hence, in line by 3 points. This baseline has been proved to one case we can assume to have a fairly well es- be difficult to beat by other works. timated translation model, which is to be adapted, Preliminary experiments were performed in or- whereas in BSS we still have full control over the der to analyse the perplexity of the references, the estimation of such model and need not to aim at a number of out of vocabulary words (OoVs) and specific domain, although it might often be so. the ratio of target-source phrases. These exper- BSS is related with instance weighting (Jiang iments revealed that the improvements obtained and Zhai, 2007; Foster et al., 2010). Adapta- are largely correlated with a decrease in perplex- tion and BSS can be considered to be orthogo- ity and in the number of OoVs. On the one hand, nal (yet complementary) problems under the in- reducing the amount of OoVs was mirrored by stance weighting paradigm. In such case, instance an important improvement in BLEU when the weighting can be considered to span a complete amount of additional data was small, and also paradigmatic space between both. At one end, entailed a decrease in perplexity. However, a there is sample selection (BSS for SMT), while at reduction in perplexity by itself did not always the other end there is adaptation. For instance, it imply significant improvements. Moreover, no is quite common to confront the adaptation prob- real conclusion could be drawn from the analy- lem by extracting different phrase-tables from dif- sis of target-source phrase ratio. Hence, we un- ferent corpora, and then interpolate such tables. 
This technique could also be applied to improve the performance of a system built by means of BSS; however, this is left as future work.

We have thoroughly analysed two BSS approaches that obtain competitive results while using a small fraction of the training data, although there is still much to be gained. For instance, oracle results have also been reported in this work, yielding improvements of up to 10 BLEU points. Even though the use of an oracle typically implies that the results obtained are not realistic, recall that the proposed oracle is special, in the sense that it only uses the reference sentences for the specific purpose of selecting training samples; the references are not included in the training data as such. This is useful for assessing the potential behind BSS: ideally, if we were able to design a BSS strategy that, without using the references, selected exactly those training samples, we would boost system performance by 10 BLEU points. This restates BSS as a compelling technique that has not yet received the attention it deserves. BSS is not aimed at optimising computational requirements, but does reduce them as a byproduct.
This may seem a minor point, but it allows running more experiments with the same resources, using larger corpora or even more complex techniques, such as synchronous grammars or hierarchical models. For instance, the infrequent n-grams technique has beaten all the other systems using just a small fraction of the corpus (only 0.5%), and is still able to outperform a system trained with all the data by 0.9 BLEU points, and the random baseline by 3 points; this baseline has proved difficult to beat in other work.

Preliminary experiments were performed in order to analyse the perplexity of the references, the number of out-of-vocabulary words (OoVs) and the ratio of target-source phrases. These experiments revealed that the improvements obtained are largely correlated with a decrease in perplexity and in the number of OoVs. On the one hand, reducing the number of OoVs was mirrored by an important improvement in BLEU when the amount of additional data was small, and also entailed a decrease in perplexity. However, a reduction in perplexity by itself did not always imply significant improvements. Moreover, no real conclusion could be drawn from the analysis of the target-source phrase ratio. Hence, we understand that the improvements obtained come mainly from a more specialised estimation of the model parameters, although further experiments should be conducted to verify this conclusion.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project, and by Instituto Tecnológico de León, DGEST-PROMEP y CONACYT, México.

References

Sankaranarayanan Ananthakrishnan, Rohit Prasad, David Stallard, and Prem Natarajan. 2010. Discriminative sample selection for statistical machine translation. In Proc. of EMNLP, pages 626–635, Cambridge, MA, October.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proc. of EMNLP, pages 355–362.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece, March.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proc. of WMT/MetricsMATR, pages 17–53, Uppsala, Sweden, July.

Sanjoy Dasgupta. 2009. The two faces of active learning. In Proc. of the Twentieth Conference on Algorithmic Learning Theory, page 1, Porto, Portugal, October.

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proc. of EMNLP, pages 451–459, Cambridge, MA, October.

Guillem Gascó, Vicent Alabau, Jesús Andrés-Ferrer, Jesús González-Rubio, Martha-Alicia Rocha, Germán Sanchis-Trilles, Francisco Casacuberta, Jorge González, and Joan-Andreu Sánchez. 2010. ITI-UPV system description for IWSLT 2010. In Proc. of IWSLT 2010, Paris, France, December.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of HLT/NAACL'09, pages 415–423, Morristown, NJ, USA.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of ACL'07, pages 264–271.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proc. of ICASSP, volume II, pages 181–184, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, pages 177–180.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Ann Irvine, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Ziyuan Wang, Jonathan Weese, and Omar Zaidan. 2010. Joshua 2.0: A toolkit for parsing-based machine translation with syntax, semirings, discriminative training and other goodies. In Proc. of WMT/MetricsMATR, pages 139–143, Uppsala, Sweden, July.

Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proc. of EMNLP-CoNLL, pages 343–350, Prague, Czech Republic, June.

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proc. of EMNLP, pages 708–717, Singapore, August.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proc. of ACL (Short Papers), pages 220–224.

Franz J. Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of ACL, pages 295–302.

Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, and Todd Ward. 1998. Maximum likelihood and discriminative training of direct translation models. In Proc. of ICASSP'98, pages 189–192.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research.

Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the IWSLT 2010 evaluation campaign. In Proc. of IWSLT 2010, Paris, France, December.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of AMTA'06.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proc. of ICSLP.

Elia Yuste, Manuel Herranz, Antonio Lagarda, Lionel Tarazón, Isaías Sánchez-Cortina, and Francisco Casacuberta. 2010. PangeaMT – putting open standards to work... well. In Proc. of AMTA 2010, Denver, CO, USA, November.

Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit∗ Nathan Schneider† Rishav Bhowmick∗ Kemal Oflazer∗ Noah A. Smith†
School of Computer Science, Carnegie Mellon University
∗P.O. Box 24866, Doha, Qatar  †Pittsburgh, PA 15213, USA
{behrang@,nschneid@cs.,rishavb@qatar.,ko@cs.,nasmith@cs.}cmu.edu

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner—a loss function encouraging it to "arrogantly" favor recall over precision—substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.1

1 The annotated dataset and a supplementary document with additional details of this work can be found at: http://www.ark.cs.cmu.edu/AQMAR
1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly used news domain. These data challenge past approaches in two ways:

First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER.

Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly delineated. One hallmark of this divergence between Wikipedia and the news domain is a difference in the distributions of named entities. Indeed, the classic named entity types (person, organization, location) may not be the most apt for articles in other domains (e.g., scientific or social topics). On the other hand, Wikipedia is a large dataset, inviting semisupervised approaches.

In this paper, we describe advances on the problem of NER in Arabic Wikipedia. The techniques are general and make use of well-understood building blocks. Our contributions are:

• A small corpus of articles annotated in a new scheme that provides more freedom for annotators to adapt NE analysis to new domains;
• An "arrogant" learning approach designed to boost recall in supervised training as well as self-training; and
• An empirical evaluation of this technique as applied to a well-established discriminative NER model and feature set.

Experiments show consistent gains on the challenging problem of identifying named entities in Arabic Wikipedia text.

2 Arabic Wikipedia NE Annotation

Most of the effort in NER has been focused on a small set of domains and general-purpose entity classes relevant to those domains—especially the categories PER(SON), ORG(ANIZATION), and LOC(ATION) (POL), which are highly prominent in news text. Arabic is no exception: the publicly available NER corpora—ACE (Walker et al., 2006), ANER (Benajiba et al., 2008), and OntoNotes (Hovy et al., 2006)—are all in the news domain.2

2 OntoNotes contains news-related text. ACE includes some text from blogs. In addition to the POL classes, both corpora include additional NE classes such as facility, event, product, vehicle, etc. These entities are infrequent and may not be comprehensive enough to cover the larger set of possible NEs (Sekine et al., 2002). Nezda et al. (2006) annotated and evaluated an Arabic NE corpus with an extended set of 18 classes (including temporal and numeric entities); this corpus has not been released publicly.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 162–173, Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics

Table 1: Translated titles of Arabic Wikipedia articles in our development and test sets, and some NEs with standard and article-specific classes. Additionally, Prussia and Amman were reserved for training annotators, and Gulf War for estimating inter-annotator agreement.

        History              Science           Sports                  Technology
dev:    Damascus             Atom              Raúl González           Linux
        Imam Hussein Shrine  Nuclear power     Real Madrid             Solaris
test:   Crusades             Enrico Fermi      2004 Summer Olympics    Computer
        Islamic Golden Age   Light             Christiano Ronaldo      Computer Software
        Islamic History      Periodic Table    Football                Internet
        Ibn Tolun Mosque     Physics           Portugal football team  Richard Stallman
        Ummaya Mosque        Muhammad al-Razi  FIFA World Cup          X Window System

Sample NEs: Claudio Filippone (PER); Linux (SOFTWARE); Spanish League (CHAMPIONSHIPS); proton (PARTICLE); nuclear radiation (GENERIC-MISC); Real Zaragoza (ORG).
However, appropriate entity classes will vary widely by domain; occurrence rates for entity classes are quite different in news text vs. Wikipedia, for instance (Balasuriya et al., 2009). This is abundantly clear in technical and scientific discourse, where much of the terminology is domain-specific, but it holds elsewhere. Non-POL entities in the history domain, for instance, include important events (wars, famines) and cultural movements (romanticism). Ignoring such domain-critical entities likely limits the usefulness of the NE analysis.

Recognizing this limitation, some work on NER has sought to codify more robust inventories of general-purpose entity types (Sekine et al., 2002; Weischedel and Brunstein, 2005; Grouin et al., 2011) or to enumerate domain-specific types (Settles, 2004; Yao et al., 2003). Coarse, general-purpose categories have also been used for semantic tagging of nouns and verbs (Ciaramita and Johnson, 2003). Yet as the number of classes or domains grows, rigorously documenting and organizing the classes—even for a single language—requires intensive effort. Ideally, an NER system would refine the traditional classes (Hovy et al., 2011) or identify new entity classes when they arise in new domains, adapting to new data. For this reason, we believe it is valuable to consider NER systems that identify (but do not necessarily label) entity mentions, and also to consider annotation schemes that allow annotators more freedom in defining entity classes.

Our aim in creating an annotated dataset is to provide a testbed for evaluation of new NER models. We will use these data as development and testing examples, but not as training data. In §4 we will discuss our semisupervised approach to learning, which leverages ACE and ANER data as an annotated training corpus.

2.1 Annotation Strategy

We conducted a small annotation project on Arabic Wikipedia articles. Two college-educated native Arabic speakers annotated about 3,000 sentences from 31 articles. We identified four topical areas of interest—history, technology, science, and sports—and browsed these topics until we had found 31 articles that we deemed satisfactory on the basis of length (at least 1,000 words), cross-lingual linkages (associated articles in English, German, and Chinese3), and subjective judgments of quality. The list of these articles along with sample NEs is presented in table 1. These articles were then preprocessed to extract the main article text (eliminating tables, lists, info-boxes, captions, etc.) for annotation.

3 These three languages have the most articles on Wikipedia. Associated articles here are those that have been manually hyperlinked from the Arabic page as cross-lingual correspondences. They are not translations, but if the associations are accurate, these articles should be topically similar to the Arabic page that links to them.

Our approach follows ACE guidelines (LDC, 2005) in identifying NE boundaries and choosing POL tags. In addition to this traditional form of annotation, annotators were encouraged to articulate one to three salient, article-specific entity categories per article. For example, names of particles (e.g., proton) are highly salient in the Atom article. Annotators were asked to read the entire article first, and then to decide which non-traditional classes of entities would be important in the context of the article. In some cases, annotators reported using heuristics (such as being proper nouns or having an English translation which is conventionally capitalized) to help guide their determination of non-canonical entities and entity classes. Annotators produced written descriptions of their classes, including example instances.
Table 2: Inter-annotator agreement measurements.

Token position agreement rate   92.6%   Cohen's κ: 0.86
Token agreement rate            88.3%   Cohen's κ: 0.86
Token F1 between annotators     91.0%
Entity boundary match F1        94.0%
Entity category match F1        87.4%

Table 3: Custom NE categories suggested by one or both annotators for 10 articles. Article titles are translated from Arabic. • indicates that both annotators volunteered a category for an article; ◦ indicates that only one annotator suggested the category. Annotators were not given a predetermined set of possible categories; rather, category matches between annotators were determined by post hoc analysis. NAME ROMAN indicates an NE rendered in Roman characters.

History (Gulf War, Prussia, Damascus, Crusades): WAR CONFLICT • • •
Science (Atom, Periodic table): THEORY •, CHEMICAL • •, NAME ROMAN •, PARTICLE • •
Sports (Football, Raúl González): SPORT ◦, CHAMPIONSHIP •, AWARD ◦, NAME ROMAN •
Technology (Computer, Richard Stallman): COMPUTER VARIETY ◦, SOFTWARE •, COMPONENT •

This scheme was chosen for its flexibility: in contrast to a scenario with a fixed ontology, annotators required minimal training beyond the POL conventions, and did not have to worry about delineating custom categories precisely enough that they would extend straightforwardly to other topics or domains. Of course, we expect inter-annotator variability to be greater for these open-ended classification criteria.

2.2 Annotation Quality Evaluation

During annotation, two articles (Prussia and Amman) were reserved for training annotators on the task. Once they were accustomed to annotation, both independently annotated a third article. We used this 4,750-word article (Gulf War) to measure inter-annotator agreement. Table 2 provides scores for token-level agreement measures and entity-level F1 between the two annotated versions of the article.4

These measures indicate strong agreement for locating and categorizing NEs at both the token and chunk levels. Closer examination of agreement scores shows that the PER and MIS classes have the lowest rates of agreement. That the miscellaneous class, used for infrequent or article-specific NEs, receives poor agreement is unsurprising. The low agreement on the PER class seems to be due to the use of titles and descriptive terms in personal names. Despite explicit guidelines to exclude titles, annotators disagreed on the inclusion of descriptors that disambiguate the NE (e.g., the father in the Arabic rendering of George Bush, the father).

4 The position and boundary measures ignore the distinctions between the POLM classes. To avoid artificial inflation of the token and token position agreement rates, we exclude the 81% of tokens tagged by both annotators as not belonging to an entity.

2.3 Validating Category Intuitions

To investigate the variability between annotators with respect to custom category intuitions, we asked our two annotators to independently read 10 of the articles in the data (scattered across our four focus domains) and suggest up to 3 custom categories for each. We assigned short names to these suggestions, seen in table 3. In 13 cases, both annotators suggested a category for an article that was essentially the same (•); three such categories spanned multiple articles. In three cases a category was suggested by only one annotator (◦).5 Thus, we see that our annotators were generally, but not entirely, consistent with each other in their creation of custom categories. Further, almost all of our article-specific categories correspond to classes in the extended NE taxonomy of Sekine et al. (2002), which speaks to the reasonableness of both sets of categories—and by extension, our open-ended annotation process.

5 When it came to tagging NEs, one of the two annotators was assigned to each article. Custom categories suggested only by the other annotator were ignored.

Our annotation of named entities outside of the traditional POL classes creates a useful resource for entity detection and recognition in new domains. Even the ability to detect non-canonical types of NEs should help applications such as QA and MT (Toral et al., 2005; Babych and Hartley, 2003). Possible avenues for future work include annotating and projecting non-canonical NEs from English articles to their Arabic counterparts (Hassan et al., 2007), automatically clustering non-canonical types of entities into article-specific or cross-article classes (cf. Freitag, 2004), or using non-canonical classes to improve the (author-specified) article categories in Wikipedia.
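The token-level agreement figures in table 2 are chance-corrected with Cohen's κ, which can be computed as follows. This is the standard textbook formula, not the evaluation script actually used for this corpus.

```python
# Cohen's kappa for two annotators' token-level tags: observed agreement
# corrected for the agreement expected by chance given each annotator's
# label distribution. A generic sketch for illustration.

from collections import Counter

def cohens_kappa(tags_a, tags_b):
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    ca, cb = Counter(tags_a), Counter(tags_b)
    expected = sum(ca[t] * cb[t] for t in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1.0 - expected)
```

Perfect agreement yields κ = 1, while agreement at chance level yields κ = 0; the 0.86 reported above indicates strong agreement.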
Hereafter, we merge all article-specific categories with the generic MIS category. The proportion of entity mentions that are tagged as MIS, while varying to a large extent by document, is a major indication of the gulf between the news data (<10%) and the Wikipedia data (53% for the development set, 37% for the test set).

Below, we aim to develop entity detection models that generalize beyond the traditional POL entities. We do not address here the challenges of automatically classifying entities or inferring non-canonical groupings.

3 Data

Table 4 summarizes the various corpora used in this work.6 Our NE-annotated Wikipedia subcorpus, described above, consists of several Arabic Wikipedia articles from four focus domains.7 We do not use these as supervised training data; they serve only as development and test data. A larger set of Arabic Wikipedia articles, selected on the basis of quality heuristics, serves as unlabeled data for semisupervised learning.

Our out-of-domain labeled NE data is drawn from the ANER (Benajiba et al., 2007) and ACE-2005 (Walker et al., 2006) newswire corpora. Entity types in these data are the POL categories (PER, ORG, LOC) and MIS. Portions of the ACE corpus were held out as development and test data; the remainder is used in training.

Table 4: Number of words (entity mentions) in data sets.

                                     words      NEs
Training
  ACE+ANER                           212,839    15,796
  Wikipedia (unlabeled, 397 docs)    1,110,546  —
Development
  ACE                                7,776      638
  Wikipedia (4 domains, 8 docs)      21,203     2,073
Test
  ACE                                7,789      621
  Wikipedia (4 domains, 20 docs)     52,650     3,781

6 Additional details appear in the supplement.
7 We downloaded a snapshot of Arabic Wikipedia (http://ar.wikipedia.org) on 8/29/2009 and preprocessed the articles to extract main body text and metadata using the mwlib package for Python (PediaPress, 2010).

4 Models

Our starting point for statistical NER is a feature-based linear model over sequences, trained using the structured perceptron (Collins, 2002).8 In addition to lexical and morphological9 features known to work well for Arabic NER (Benajiba et al., 2008; Abdul-Hamid and Darwish, 2010), we incorporate some additional features enabled by Wikipedia. We do not employ a gazetteer, as the construction of a broad-domain gazetteer is a significant undertaking orthogonal to the challenges of a new text domain like Wikipedia.10 A descriptive list of our features is available in the supplementary document.

We use a first-order structured perceptron; none of our features consider more than a pair of consecutive BIO labels at a time. The model enforces the constraint that NE sequences must begin with B (so the bigram ⟨O, I⟩ is disallowed).

Training this model on ACE and ANER data achieves performance comparable to the state of the art (F1-measure11 above 69%), but it fares much worse on our Wikipedia test set (F1-measure around 47%); details are given in §5.

8 A more leisurely discussion of the structured perceptron and its connection to empirical risk minimization can be found in the supplementary document.
9 We obtain morphological analyses from the MADA tool (Habash and Rambow, 2005; Roth et al., 2008).
10 A gazetteer ought to yield further improvements in line with previous findings in NER (Ratinov and Roth, 2009).
11 Though optimizing NER systems for F1 has been called into question (Manning, 2006), no alternative metric has achieved widespread acceptance in the community.
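The label-bigram constraint in §4 can be checked directly: an entity must begin with B, so I may not follow O (and may not start a sequence). The helper below is a hypothetical illustration, not the authors' implementation.

```python
# Validity check for the BIO constraint described above: the bigram
# (O, I) is disallowed, as is I at the start of a sequence. Works for
# bare B/I/O tags and for typed variants such as "B-PER"/"I-PER".

def is_valid_bio(labels):
    prev = "O"  # a sequence implicitly starts outside any entity
    for lab in labels:
        if lab.startswith("I") and prev == "O":
            return False
        prev = lab
    return True
```

In a first-order Viterbi decoder the same constraint is typically imposed by assigning the forbidden transitions a score of minus infinity.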
4.1 Recall-Oriented Perceptron

By augmenting the perceptron's online update with a cost function term, we can incorporate a task-dependent notion of error into the objective, as with structured SVMs (Taskar et al., 2004; Tsochantaridis et al., 2005). Let c(y, y′) denote a measure of error when y is the correct label sequence but y′ is predicted. For an observed sequence x and feature weights (model parameters) w, the structured hinge loss is

  ℓ_hinge(x, y, w) = max_{y′} [ w⊤g(x, y′) + c(y, y′) ] − w⊤g(x, y)    (1)

The maximization problem inside the brackets is known as cost-augmented decoding. If c factors similarly to the feature function g(x, y), then we can increase penalties for outputs that make more local mistakes. This raises the learner's awareness of how it will be evaluated.
Incorporat- Output: w ing cost-augmented decoding into the perceptron w ← L(hhx(n) , y (n) iiN n=1 ) leads to this decoding step: for t = 1 to T 0 do for j = 1 to J do yˆ(j) ← arg maxy w> g(x ¯ (j) , y) yˆ ← arg max w> g(x, y 0 ) + c(y, y 0 ) , (2) y0 w ← L(hhx(n) , y (n) iiN n=1 ∪ hhx¯ (j) , yˆ(j) iiJj=1 ) Algorithm 1: Self-training. which amounts to performing stochastic subgradi- ent ascent on an objective function with the Eq. 1 there is no available labeled training data. Yet loss (Ratliff et al., 2006). the available unlabeled data is vast, so we turn to In this framework, cost functions can be for- semisupervised learning. mulated to distinguish between different types of Here we adapt self-training, a simple tech- errors made during training. For a tag sequence nique that leverages a supervised learner (like the y = hy1 , y2 , . . . , yM i, Gimpel and Smith (2010b) perceptron) to perform semisupervised learning define word-local cost functions that differently (Clark et al., 2003; Mihalcea, 2004; McClosky penalize precision errors (i.e., yi = O ∧ yˆi 6= O et al., 2006). In our version, a model is trained for the ith word), recall errors (yi 6= O ∧ yˆi = O), on the labeled data, then used to label the un- and entity class/position errors (other cases where labeled target data. We iterate between training yi 6= yˆi ). As will be shown below, a key problem on the hypothetically-labeled target data plus the in cross-domain NER is poor recall, so we will original labeled set, and relabeling the target data; penalize recall errors more severely: see Algorithm 1. Before self-training, we remove sentences hypothesized not to contain any named M  0 if yi = yi0  X entity mentions, which we found avoids further c(y, y 0 ) = β if yi 6= O ∧ yi0 = O (3) encouragement of the model toward low recall. 1 otherwise  i=1 5 Experiments for a penalty parameter β > 1. We call our learner We investigate two questions in the context of the “recall-oriented” perceptron (ROP). 
We note that Minkov et al. (2006) similarly explored the recall vs. precision tradeoff in NER. Their technique was to directly tune the weight of a single feature—the feature marking O (non-entity tokens); a lower weight for this feature incurs a greater penalty for predicting O. Below we demonstrate that our method, which is less coarse, is more successful in our setting.12

In our experiments we will show that injecting "arrogance" into the learner via the recall-oriented loss function substantially improves recall, especially for non-POL entities (§5.3).

12 The distinction between the techniques is that our cost function adjusts the whole model in order to perform better at recall on the training data.

4.2 Self-Training and Semisupervised Learning

As we will show experimentally, the differences between news text and Wikipedia text call for domain adaptation. In the case of Arabic Wikipedia, there is no labeled training data available. Yet the available unlabeled data is vast, so we turn to semisupervised learning.

Here we adapt self-training, a simple technique that leverages a supervised learner (like the perceptron) to perform semisupervised learning (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006). In our version, a model is trained on the labeled data, then used to label the unlabeled target data. We iterate between training on the hypothetically labeled target data plus the original labeled set, and relabeling the target data; see Algorithm 1. Before self-training, we remove sentences hypothesized not to contain any named entity mentions, which we found avoids further encouraging the model toward low recall.

Algorithm 1: Self-training.
  Input: labeled data ⟨⟨x(n), y(n)⟩⟩ for n = 1..N; unlabeled data ⟨x̄(j)⟩ for j = 1..J;
         supervised learner L; number of iterations T′
  Output: w
  w ← L(⟨⟨x(n), y(n)⟩⟩)
  for t = 1 to T′ do
      for j = 1 to J do
          ŷ(j) ← argmax_y w⊤g(x̄(j), y)
      w ← L(⟨⟨x(n), y(n)⟩⟩ ∪ ⟨⟨x̄(j), ŷ(j)⟩⟩)
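Algorithm 1 can be sketched as the loop below. The `learner` and `decode` arguments are hypothetical stand-ins for the structured perceptron's training and decoding routines; the toy majority-label learner in the usage example is purely illustrative.

```python
# A sketch of Algorithm 1: train on labeled data, label the unlabeled
# pool with the current model, retrain on the union, and repeat.

from collections import Counter

def self_train(labeled, unlabeled, learner, decode, rounds=1):
    model = learner(labeled)
    for _ in range(rounds):
        pseudo = [(x, decode(model, x)) for x in unlabeled]
        model = learner(labeled + pseudo)  # retrain on labeled + pseudo-labeled
    return model

# Toy stand-ins: the "model" is just the most frequent label seen in training.
def majority_learner(data):
    counts = Counter(lab for _, labs in data for lab in labs)
    return counts.most_common(1)[0][0]

def constant_decode(model, sentence):
    return [model] * len(sentence)

labeled = [(["a"], ["O"]), (["b"], ["O"]), (["c"], ["B"])]
model = self_train(labeled, [["d", "e"]], majority_learner, constant_decode)
```

With the real components, `decode` would be (optionally cost-augmented) Viterbi decoding and `learner` a full perceptron training run.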
5 Experiments

We investigate two questions in the context of NER for Arabic Wikipedia:

• Loss function: Does integrating a cost function into our learning algorithm, as we have done in the recall-oriented perceptron (§4.1), improve recall and overall performance on Wikipedia data?
• Semisupervised learning for domain adaptation: Can our models benefit from large amounts of unlabeled Wikipedia data, in addition to the (out-of-domain) labeled data? We experiment with a self-training phase following the fully supervised learning phase.

We report experiments for the possible combinations of the above ideas, summarized in table 5. Note that the recall-oriented perceptron can be used for the supervised learning phase, for the self-training phase, or both. This leaves us with the following combinations:

• reg/none (baseline): regular supervised learner.
• ROP/none: recall-oriented supervised learner.
• reg/reg: standard self-training setup.
• ROP/reg: recall-oriented supervised learner, followed by standard self-training.
• reg/ROP: regular supervised model as the initial labeler for recall-oriented self-training.
• ROP/ROP (the "double ROP" condition): recall-oriented supervised model as the initial labeler for recall-oriented self-training. Note that the two ROPs can use different cost parameters.

Figure 1: Tuning the recall-oriented cost parameter for different learning settings. We optimized for development set F1, choosing penalty β = 200 for recall-oriented supervised learning (in the plot, ROP/*—this is regardless of whether a stage of self-training will follow); β = 100 for recall-oriented self-training following recall-oriented supervised learning (ROP/ROP); and β = 3200 for recall-oriented self-training following regular supervised learning (reg/ROP).

For evaluating our models we consider the named entity detection task, i.e., recognizing which spans of words constitute entities. This is measured by per-entity precision, recall, and F1.13 To measure the statistical significance of differences between models we use Gimpel and Smith's (2010) implementation of the paired bootstrap resampler of Koehn (2004), taking 10,000 samples for each comparison.

13 Only entity spans that exactly match the gold spans are counted as correct. We calculated these scores with the conlleval.pl script from the CoNLL 2003 shared task.
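The exact-match detection metric just described can be sketched as follows; the `(start, end)` span representation is an assumption of this sketch, and it is a simplified stand-in for conlleval-style scoring rather than the script itself.

```python
# Entity detection scored by exact span match: a predicted span counts
# as correct only if it matches a gold span exactly.

def detection_prf(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Partial overlaps earn no credit under this metric, which is one reason boundary disagreements are costly in the cross-domain setting.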
neous class recall, in particular, suffers badly (un- der 10%)—which partially accounts for the poor 5.1 Baseline recall in science and technology articles (they Our baseline is the perceptron, trained on the have by far the highest proportion of MIS entities). POL entity boundaries in the ACE+ANER cor- pus (reg/none).14 Development data was used to 5.2 Self-Training select the number of iterations (10). We per- Following Clark et al. (2003), we applied self- formed 3-fold cross-validation on the ACE data training as described in Algorithm 1, with the and found wide variance in the in-domain entity perceptron as the supervised learner. Our unla- detection performance of this model: beled data consists of 397 Arabic Wikipedia ar- P R F1 ticles (1 million words) selected at random from fold 1 70.43 63.08 66.55 all articles exceeding a simple length threshold fold 2 87.48 81.13 84.18 (1,000 words); see table 4. We used only one iter- fold 3 65.09 51.13 57.27 ation (T 0 = 1), as experiments on development average 74.33 65.11 69.33 data showed no benefit from additional rounds. (Fold 1 corresponds to the ACE test set described Several rounds of self-training hurt performance, in table 4.) We also trained the model to perform 15 POL detection and classification, achieving nearly Abdul-Hamid and Darwish report as their best result a identical results in the 3-way cross-validation of macroaveraged F1 -score of 76. As they do not specify which data they used for their held-out test set, we cannot perform ACE data. From these data we conclude that our a direct comparison. However, our feature set is nearly a 13 Only entity spans that exactly match the gold spans are superset of their best feature set, and their result lies well counted as correct. We calculated these scores with the within the range of results seen in our cross-validation folds. 16 conlleval.pl script from the CoNLL 2003 shared task. 
This degradation is an effect attested in earlier research (Curran et al., 2007) and sometimes known as "semantic drift."

Results are shown in table 5. We find that standard self-training (the middle column) has very little impact on performance.[17] Why is this the case? We venture that poor baseline recall and the domain variability within Wikipedia are to blame.

[14] In keeping with prior work, we ignore non-POL categories for the ACE evaluation.
[16] Our Wikipedia evaluations use models trained on POLM entity boundaries in ACE. Per-domain and overall scores are microaverages across articles.
[17] In neither case does regular self-training produce a significantly different F1 score than no self-training.

                        SELF-TRAINING
SUPERVISED   none               reg                ROP
             P    R    F1       P    R    F1       P    R    F1
reg          66.3 35.9 46.59    66.7 35.6 46.41    59.2 40.3 47.97
ROP          60.9 44.7 51.59    59.8 46.2 52.11    58.0 47.4 52.16

Table 5: Entity detection precision, recall, and F1 for each learning setting, microaveraged across the 24 articles in our Wikipedia test set. Rows differ in the supervised learning condition on the ACE+ANER data (regular vs. recall-oriented perceptron). Columns indicate whether this supervised learning phase was followed by self-training on unlabeled Wikipedia data, and if so which version of the perceptron was used for self-training.

          entities  words  baseline recall
PER       1081      1743   49.95
ORG       286       637    23.92
LOC       1019      1413   61.43
MIS       1395      2176   9.30
overall   3781      5969   35.91

Figure 2: Recall improvement over baseline in the test set by gold NER category, counts for those categories in the data, and recall scores for our baseline model. Markers in the plot indicate different experimental settings corresponding to cells in table 5.

5.3 Recall-Oriented Learning

The recall-oriented bias can be introduced in either or both of the stages of our semisupervised learning framework: in the supervised learning phase, modifying the objective of our baseline (§5.1); and within the self-training algorithm (§5.2).[18] As noted in §4.1, the aim of this approach is to discourage recall errors (false negatives), which are the chief difficulty for the news-text-trained model in the new domain. We selected the value of the false negative penalty for cost-augmented decoding, β, using the development data (figure 1).

[18] Standard Viterbi decoding was used to label the data within the self-training algorithm; note that cost-augmented decoding only makes sense in learning, not as a prediction technique, since it deliberately introduces errors relative to a correct output that must be provided.

The results in table 5 demonstrate improvements due to the recall-oriented bias in both stages of learning.[19] When used in the supervised phase (bottom left cell), the recall gains are substantial: nearly 9% over the baseline. Integrating this bias within self-training (last column of the table) produces a more modest improvement (less than 3%) relative to the baseline. In both cases, the improvements to recall more than compensate for the amount of degradation to precision. This trend is robust: wherever the recall-oriented perceptron is added, we observe improvements in both recall and F1. Perhaps surprisingly, these gains are somewhat additive: using the ROP in both learning phases gives a small (though not always significant) gain over the alternatives (standard supervised perceptron, no self-training, or self-training with a standard perceptron). In fact, when the standard supervised learner is used, recall-oriented self-training succeeds despite the ineffectiveness of standard self-training.

[19] In terms of F1, the worst of the 3 models with the ROP supervised learner significantly outperforms the best model with the regular supervised learner (p < 0.005). The improvements due to self-training are marginal, however: ROP self-training produces a significant gain only following regular supervised learning (p < 0.05).

Performance breakdowns by (gold) class (figure 2) and domain (figure 3) further attest to the robustness of the overall results. The most dramatic gains are in miscellaneous class recall: each form of the recall bias produces an improvement, and using this bias in both the supervised and self-training phases is clearly most successful for miscellaneous entities. Correspondingly, the technology and science domains (in which this class dominates: 83% and 61% of mentions, versus 6% and 12% for history and sports, respectively) receive the biggest boost. Still, the gaps between domains are not entirely removed.

Figure 3: Supervised learner precision vs. recall as evaluated on Wikipedia test data in different topical domains. The regular perceptron (baseline model) is contrasted with ROP. No self-training is applied.

Most improvements relate to the reduction of false negatives, which fall into three groups: (a) entities occurring infrequently or partially in the labeled training data (e.g. uranium); (b) domain-specific entities sharing lexical or contextual features with the POL entities (e.g. Linux, titanium); and (c) words with Latin characters, common in the science and technology domains. (a) and (b) are mostly transliterations into Arabic.

An alternative, and simpler, approach to controlling the precision-recall tradeoff is the Minkov et al. (2006) strategy of tuning a single feature weight subsequent to learning (see §4.1 above). We performed an oracle experiment to determine how this compares to recall-oriented learning in our setting. An oracle trained with the method of Minkov et al. outperforms the three models in table 5 that use the regular perceptron for the supervised phase of learning, but underperforms the supervised ROP conditions.[20]

[20] Tuning the O feature weight to optimize for F1 on our test set, we found that oracle precision would be 66.2, recall would be 39.0, and F1 would be 49.1. The F1 score of our best model is nearly 3 points higher than the Minkov et al.-style oracle, and over 4 points higher than the non-oracle version where the development set is used for tuning.

Overall, we find that incorporating the recall-oriented bias in learning is fruitful for adapting to Wikipedia because the gains in recall outpace the damage to precision.

6 Discussion

To our knowledge, this work is the first suggestion that substantively modifying the supervised learning criterion in a resource-rich domain can reap benefits in subsequent semisupervised application in a new domain. Past work has looked at regularization (Chelba and Acero, 2006) and feature design (Daumé III, 2007); we alter the loss function. Not surprisingly, the double-ROP approach harms performance on the original domain (on ACE data, we achieve 55.41% F1, far below the standard perceptron). Yet we observe that models can be prepared for adaptation even before a learner is exposed to a new domain, sacrificing performance in the original domain.

The recall-oriented bias is not merely encouraging the learner to identify entities already seen in training. As recall increases, so does the number of new entity types recovered by the model: of the 2,070 NE types in the test data that were never seen in training, only 450 were ever found by the baseline, versus 588 in the reg/ROP condition, 632 in the ROP/none condition, and 717 in the double-ROP condition.

We note finally that our method is a simple extension to the standard structured perceptron; cost-augmented inference is often no more expensive than traditional inference, and the algorithmic change is equivalent to adding one additional feature. Our recall-oriented cost function is parameterized by a single value, β; recall is highly sensitive to the choice of this value (figure 1 shows how we tuned it on development data), and thus we anticipate that, in general, such tuning will be essential to leveraging the benefits of arrogance.

7 Related Work

Our approach draws on insights from work in the areas of NER, domain adaptation, NLP with Wikipedia, and semisupervised learning. As all are broad areas of research, we highlight only the most relevant contributions here.

Research in Arabic NER has focused on compiling and optimizing the gazetteers and feature sets for standard sequential modeling algorithms (Benajiba et al., 2008; Farber et al., 2008; Shaalan and Raza, 2008; Abdul-Hamid and Darwish, 2010). We make use of features identified in this prior work to construct a strong baseline system. We are unaware of any Arabic NER work that has addressed diverse text domains like Wikipedia. Both the English and Arabic versions of Wikipedia have been used, however, as resources in service of traditional NER (Kazama and Torisawa, 2007; Benajiba et al., 2008). Attia et al. (2010) heuristically induce a mapping between Arabic Wikipedia and Arabic WordNet to construct Arabic NE gazetteers.

Balasuriya et al. (2009) highlight the substantial divergence between entities appearing in English Wikipedia versus traditional corpora, and the effects of this divergence on NER performance. There is evidence that models trained on Wikipedia data generalize and perform well on corpora with narrower domains. Nothman et al. (2009) and Balasuriya et al. (2009) show that NER models trained on both automatically and manually annotated Wikipedia corpora perform reasonably well on news corpora. The reverse scenario does not hold for models trained on news text, a result we also observe in Arabic NER. Other work has gone beyond the entity detection problem: Florian et al. (2004) additionally predict within-document entity coreference for Arabic, Chinese, and English ACE text, while Cucerzan (2007) aims to resolve every mention detected in English Wikipedia pages to a canonical article devoted to the entity in question.

The domain and topic diversity of NEs has been studied in the framework of domain adaptation research. A group of these methods use self-training and select the most informative features and training instances to adapt a source domain learner to the new target domain. Wu et al. (2009) bootstrap the NER learner with a subset of unlabeled instances that bridge the source and target domains. Jiang and Zhai (2006) and Daumé III (2007) make use of some labeled target-domain data to tune or augment the features of the source model towards the target domain. Here, in contrast, we use labeled target-domain data only for tuning and evaluation. Another important distinction is that domain variation in this prior work is restricted to topically-related corpora (e.g. newswire vs. broadcast news), whereas in our work, major topical differences distinguish the training and test corpora, and consequently, their salient NE classes. In these respects our NER setting is closer to that of Florian et al. (2010), who recognize English entities in noisy text; Surdeanu et al. (2011), which concerns information extraction in a topically distinct target domain; and Dalton et al. (2011), which addresses English NER in noisy and topically divergent text.

Self-training (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006) is widely used in NLP and has inspired related techniques that learn from automatically labeled data (Liang et al., 2008; Petrov et al., 2010). Our self-training procedure differs from some others in that we use all of the automatically labeled examples, rather than filtering them based on a confidence score.

Cost functions have been used in non-structured classification settings to penalize certain types of errors more than others (Chan and Stolfo, 1998; Domingos, 1999; Kiddon and Brun, 2011). The goal of optimizing our structured NER model for recall is quite similar to the scenario explored by Minkov et al. (2006), as noted above.

8 Conclusion

We explored the problem of learning an NER model suited to domains for which no labeled training data are available. A loss function to encourage recall over precision during supervised discriminative learning substantially improves recall and overall entity detection performance, especially when combined with a semisupervised learning regimen incorporating the same bias. We have also developed a small corpus of Arabic Wikipedia articles via a flexible entity annotation scheme spanning four topical domains (publicly available at http://www.ark.cs.cmu.edu/AQMAR).

Acknowledgments

We thank Mariem Fekih Zguir and Reham Al Tamime for assistance with annotation, Michael Heilman for his tagger implementation, and Nizar Habash and colleagues for the MADA toolkit. We thank members of the ARK group at CMU, Hal Daumé, and anonymous reviewers for their valuable suggestions. This publication was made possible by grant NPRP-08-485-1-083 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

References

Stephen Clark, James Curran, and Miles Osborne. 2003. Bootstrapping POS-taggers using unlabelled data. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 49–55.

Ahmed Abdul-Hamid and Kareem Darwish. 2010. Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop, pages 110–115, Uppsala, Sweden, July.
Association for Computational Linguistics.

Mohammed Attia, Antonio Toral, Lamia Tounsi, Monica Monachini, and Josef van Genabith. 2010. An automatically built named entity lexicon for Arabic. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Bogdan Babych and Anthony Hartley. 2003. Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, EAMT '03.

Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. 2009. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 10–18, Suntec, Singapore, August. Association for Computational Linguistics.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. 2007. ANERsys: an Arabic named entity recognition system based on maximum entropy. In Alexander Gelbukh, editor, Proceedings of CICLing, pages 143–153, Mexico City, Mexico. Springer.

Yassine Benajiba, Mona Diab, and Paolo Rosso. 2008. Arabic named entity recognition using optimized feature sets. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 284–293, Honolulu, Hawaii, October. Association for Computational Linguistics.

Philip K. Chan and Salvatore J. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 164–168, New York City, New York, USA, August. AAAI Press.

Ciprian Chelba and Alex Acero. 2006. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech and Language, 20(4):382–399.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic, June.

James R. Curran, Tara Murphy, and Bernhard Scholz. 2007. Minimising semantic drift with Mutual Exclusion Bootstrapping. In Proceedings of PACLING 2007.

Jeffrey Dalton, James Allan, and David A. Smith. 2011. Passage retrieval for incorporating global evidence in sequence labeling. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pages 355–364, Glasgow, Scotland, UK, October. ACM.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic, June. Association for Computational Linguistics.

Pedro Domingos. 1999. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164.

Benjamin Farber, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), pages 2509–2514, Marrakech, Morocco, May. European Language Resources Association (ELRA).

Radu Florian, Hany Hassan, Abraham Ittycheriah, Hongyan Jing, Nanda Kambhatla, Xiaoqiang Luo, Nicolas Nicolov, and Salim Roukos. 2004. A statistical model for multilingual entity detection and tracking. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, page 18, Boston, Massachusetts, USA, May. Association for Computational Linguistics.

Radu Florian, John Pitrelli, Salim Roukos, and Imed Zitouni. 2010. Improving mention detection robustness to noisy input. In Proceedings of EMNLP 2010, pages 335–345, Cambridge, MA, October. Association for Computational Linguistics.

Dayne Freitag. 2004. Trained named entity recognition using distributional clusters. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 262–269, Barcelona, Spain, July. Association for Computational Linguistics.

Kevin Gimpel and Noah A. Smith. 2010a. Softmax-margin CRFs: Training log-linear models with cost functions. In Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pages 733–736, Los Angeles, California, USA, June.

Kevin Gimpel and Noah A. Smith. 2010b. Softmax-margin training for structured log-linear models. Technical Report CMU-LTI-10-008, Carnegie Mellon University. http://www.lti.cs.cmu.edu/research/reports/2010/cmulti10008.pdf.

Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karën Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In Proceedings of the 5th Linguistic Annotation Workshop, pages 92–100, Portland, Oregon, USA, June. Association for Computational Linguistics.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan and Claypool Publishers.

Ahmed Hassan, Haytham Fahmy, and Hany Hassan. 2007. Improving named entity translation by exploiting comparable and parallel corpora. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP '07), Borovets, Bulgaria.

Dirk Hovy, Chunliang Zhang, Eduard Hovy, and Anselmo Peñas. 2011. Unsupervised discovery of domain-specific knowledge from text. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1466–1475, Portland, Oregon, USA, June. Association for Computational Linguistics.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL), pages 57–60, New York City, USA, June. Association for Computational Linguistics.

Jing Jiang and ChengXiang Zhai. 2006. Exploiting domain structure for named entity recognition. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL), pages 74–81, New York City, USA, June. Association for Computational Linguistics.

Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 698–707, Prague, Czech Republic, June. Association for Computational Linguistics.

Chloe Kiddon and Yuriy Brun. 2011. That's what she said: double entendre identification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 89–94, Portland, Oregon, USA, June. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July. Association for Computational Linguistics.

LDC. 2005. ACE (Automatic Content Extraction) Arabic annotation guidelines for entities, version 5.3.3. Linguistic Data Consortium, Philadelphia.

Percy Liang, Hal Daumé III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 592–599, Helsinki, Finland.

Chris Manning. 2006. Doing named entity recognition? Don't optimize for F1. http://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA, June. Association for Computational Linguistics.

Rada Mihalcea. 2004. Co-training and self-training for word sense disambiguation. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), Boston, Massachusetts, USA.

Einat Minkov, Richard Wang, Anthony Tomasic, and William Cohen. 2006. NER systems that suit user's preferences: adjusting the recall-precision trade-off for entity extraction. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 93–96, New York City, USA, June. Association for Computational Linguistics.

Luke Nezda, Andrew Hickl, John Lehmann, and Sarmad Fayyaz. 2006. What in the world is a Shahab? Wide coverage named entity recognition for Arabic. In Proceedings of LREC, pages 41–46.

Joel Nothman, Tara Murphy, and James R. Curran. 2009. Analysing Wikipedia and gold-standard corpora for NER training. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 612–620, Athens, Greece, March. Association for Computational Linguistics.

PediaPress. 2010. mwlib. http://code.pediapress.com/wiki/wiki/mwlib.

Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 705–713, Cambridge, MA, October. Association for Computational Linguistics.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado, June. Association for Computational Linguistics.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2006. Subgradient methods for maximum margin structured learning. In ICML Workshop on Learning in Structured Output Spaces, Pittsburgh, Pennsylvania, USA.

Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of ACL-08: HLT, pages 117–120, Columbus, Ohio, June. Association for Computational Linguistics.

Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. 2002. Extended named entity hierarchy. In Proceedings of LREC.

Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Nigel Collier, Patrick Ruch, and Adeline Nazarenko, editors, COLING 2004 International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP) 2004, pages 107–110, Geneva, Switzerland, August. COLING.

Khaled Shaalan and Hafsa Raza. 2008. Arabic named entity recognition from diverse text types. In Advances in Natural Language Processing, pages 440–451. Springer.

Mihai Surdeanu, David McClosky, Mason R. Smith, Andrey Gusev, and Christopher D. Manning. 2011. Customizing an information extraction system to a new domain. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, Portland, Oregon, USA, June. Association for Computational Linguistics.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press.

Antonio Toral, Elisa Noguera, Fernando Llopis, and Rafael Muñoz. 2005. Improving question answering using named entity recognition. Natural Language Processing and Information Systems, 3513/2005:181–191.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, September.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. LDC2006T06, Linguistic Data Consortium, Philadelphia.

Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. LDC2005T33, Linguistic Data Consortium, Philadelphia.

Dan Wu, Wee Sun Lee, Nan Ye, and Hai Leong Chieu. 2009. Domain adaptive bootstrapping for named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1523–1532, Singapore, August. Association for Computational Linguistics.

Tianfang Yao, Wei Ding, and Gregor Erbach. 2003. CHINERS: a Chinese named entity recognition system for the sports domain. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 55–62, Sapporo, Japan, July. Association for Computational Linguistics.

Tree Representations in Probabilistic Models for Extended Named Entities Detection

Marco Dinarelli and Sophie Rosset
LIMSI-CNRS, Orsay, France

[email protected] [email protected]

Abstract labelling approach. Additionally, the use of noisy data like transcriptions of French broadcast data, In this paper we deal with Named En- makes the task very challenging for traditional tity Recognition (NER) on transcriptions of NLP solutions. To deal with such problems, we French broadcast data. Two aspects make the task more difficult with respect to previ- adopt a two-steps approach, the first being real- ous NER tasks: i) named entities annotated ized with Conditional Random Fields (CRF) (Laf- used in this work have a tree structure, thus ferty et al., 2001), the second with a Probabilistic the task cannot be tackled as a sequence la- Context-Free Grammar (PCFG) (Johnson, 1998). belling task; ii) the data used are more noisy The motivations behind that are: than data used for previous NER tasks. We approach the task in two steps, involving • Since the named entities have a tree struc- Conditional Random Fields and Probabilis- ture, it is reasonable to use a solution com- tic Context-Free Grammars, integrated in a single parsing algorithm. We analyse the ing from syntactic parsing. However pre- effect of using several tree representations. liminary experiments using such approaches Our system outperforms the best system of gave poor results. the evaluation campaign by a significant margin. • Despite the tree-structure of the entities, trees are not as complex as syntactic trees, 1 Introduction thus, before designing an ad-hoc solution for the task, which require a remarkable effort Named Entity Recognition is a traditinal task of and yet it doesn’t guarantee better perfor- the Natural Language Processing domain. The mances, we designed a solution providing task aims at mapping words in a text into seman- good results and which required a limited de- tic classes, such like persons, organizations or lo- velopment effort. calizations. 
While at first the NER task was quite simple, involving a limited number of classes (Gr- • Conditional Random Fields are models ro- ishman and Sundheim, 1996), along the years bust to noisy data, like automatic transcrip- the task complexity increased as more complex tions of ASR systems (Hahn et al., 2010), class taxonomies were defined (Sekine and No- thus it is the best choice to deal with tran- bata, 2004). The interest in the task is related to scriptions of broadcast data. Once words its use in complex frameworks for (semantic) con- have been annotated with basic entity con- tent extraction, such like Relation Extraction ap- stituents, the tree structure of named entities plications (Doddington et al., 2004). is simple enough to be reconstructed with This work presents research on a Named Entity relatively simple model like PCFG (Johnson, Recognition task defined with a new set of named 1998). entities. The characteristic of such set is in that named entities have a tree structure. As conce- The two models are integrated in a single pars- quence the task cannot be tackled as a sequence ing algorithm. We analyze the effect of the use of 174 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 174–184, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics pers.ind org.adm S amount name.firstname.last kind demonym object Zahra Abouch Conseil de Gouvernement irakien func.coll Figure 1: Examples of structured named entities annotated on the amount time.date.rel org.adm data used in this work val object loc.adm.town name time-modifier val kind name several tree representations, which result in differ- Figure 2: An example of named entity tree corresponding to en- tities of a whole sentence. Tree leaves, corresponding to sentence ent parsing models with different performances. words have been removed to keep readability We provide a detailed evaluation of our mod- els. 
Results can be compared with those obtained in the evaluation campaign where the same data were used. Our system outperforms the best system of the evaluation campaign by a significant margin.

The rest of the paper is structured as follows: in the next section we introduce the extended named entities used in this work; in section 3 we describe our two-step algorithm for parsing entity trees; in section 4 we detail the second step of our approach, based on syntactic parsing, and in particular the different tree representations used in this work to encode entity trees in parsing models; in section 6 we describe and comment on experiments; finally, in section 7, we draw some conclusions.

2 Extended Named Entities

The most important aspect of the NER task we investigated is the tree structure of named entities. Examples of such entities are given in figures 1 and 2, where words have been removed for readability; they are ("90 persons are still present at Atambua. It's there that 3 employees of the High Commissariat of the United Nations for refugees have been killed yesterday morning"):

90 personnes toujours présentes à Atambua c'est là qu'hier matin ont été tués 3 employés du haut commissariat des Nations unies aux réfugiés, le HCR

Words realizing entities in figure 2 are in bold, and they correspond to the tree leaves in the picture. As we can see in the figures, entities can have complex structures. Beyond the use of subtypes, like individual in person (to give pers.ind) or administrative in organization (to give org.adm), entities with more specific content can be constituents of more general entities to form tree structures, like name.first and name.last for pers.ind, or val (for value) and object for amount.

This set of named entities has been defined in order to provide finer-grained semantic information for entities found in the data, e.g. a person is better specified by first and last name; it is fully described in (Grouin et al., 2011). In order to avoid confusion, entities that can be associated directly to words, like name.first, name.last, val and object, are called entity constituents, components or entity pre-terminals (as they are pre-terminal nodes in the trees). The other entities, like pers.ind or amount, are called entities or non-terminal entities, depending on the context.

These named entities have been annotated on transcriptions of French broadcast news coming from several radio channels. The transcriptions constitute a corpus that has been split into training, development and evaluation sets. The evaluation set, in particular, is composed of two sets of data, Broadcast News (BN in the table) and Broadcast Conversations (BC in the table). The evaluation of the models presented in this work is performed on the merge of the two data types. Some statistics of the corpus are reported in tables 1 and 2.

Table 1: Statistics on the training and development sets of the Quaero corpus

                          training               development
                      words     entities      words    entities
# sentences               43,251                   112
# tokens             1,251,432    245,880      2,659        570
# vocabulary            39,631        134        891         30
# components                 –    133,662          –        971
# components dict.           –         28          –         18
# OOV rate [%]               –          –      17.15          0

Table 2: Statistics on the test set of the Quaero corpus, divided in Broadcast News (BN) and Broadcast Conversations (BC)

                          test BN                 test BC
                      words     entities      words    entities
# sentences                1,704                  3,933
# tokens                32,945      2,762     69,414      2,769
# vocabulary                           28                    28
# components                 –      4,128          –      4,017
# components dict.           –         21          –         20
# OOV rate [%]            3.63          0       3.84          0

3 Models Cascade for Extended Named Entities

Since the task of Named Entity Recognition presented here cannot be modeled as sequence labelling and, as mentioned previously, an approach coming from syntactic parsing that performs named entity annotation in "one shot" is not robust on the data used in this work, we adopt a two-step approach. The first step is designed to be robust to noisy data and is used to annotate entity components, while the second is used to parse complete entity trees and is based on a relatively simple model. Since we are dealing with noisy data, the hardest part of the task is indeed annotating components on words. On the other hand, since entity trees are relatively simple, at least much simpler than syntactic trees, once entity components have been annotated in a first step, a complex model is not required for the second step; a complex model would also make the processing slower.

[Figure 3: Processing schema of the two-step approach proposed in this work: CRF plus PCFG.]

3.1 Conditional Random Fields

CRFs are particularly suitable for sequence labelling tasks (Lafferty et al., 2001). Beyond the possibility of including a huge number of features using the same framework as Maximum Entropy models (Berger et al., 1996), CRF models encode global conditional probabilities normalized at sentence level. Given a sequence of $N$ words $W_1^N = w_1, \ldots, w_N$ and its corresponding component sequence $E_1^N = e_1, \ldots, e_N$, a CRF trains the conditional probability

$$P(E_1^N | W_1^N) = \frac{1}{Z} \prod_{n=1}^{N} \exp\Big( \sum_{m=1}^{M} \lambda_m \, h_m(e_{n-1}, e_n, w_{n-2}^{n+2}) \Big) \quad (1)$$

where the $\lambda_m$ are the training parameters and the $h_m(e_{n-1}, e_n, w_{n-2}^{n+2})$ are the feature functions capturing dependencies between entities and words. $Z$ is the partition function

$$Z = \sum_{\tilde{E}_1^N} \prod_{n=1}^{N} H(\tilde{e}_{n-1}, \tilde{e}_n, w_{n-2}^{n+2}) \quad (2)$$

which ensures that probabilities sum up to one. $\tilde{e}_{n-1}$ and $\tilde{e}_n$ are components for the previous and current words, and $H(\tilde{e}_{n-1}, \tilde{e}_n, w_{n-2}^{n+2})$ abbreviates $\exp\big(\sum_{m=1}^{M} \lambda_m \, h_m(\tilde{e}_{n-1}, \tilde{e}_n, w_{n-2}^{n+2})\big)$, i.e. the contribution of the active feature functions at the current position in the sequence.
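As an illustration of equations (1) and (2), the conditional probability of one label sequence under a toy linear-chain CRF can be computed with the forward algorithm. This is a sketch, not the implementation used in the paper: the feature sums $\sum_m \lambda_m h_m$ are assumed to be precomputed into the dense arrays `emit` and `trans`, which are invented names for this example.

```python
import numpy as np

def logsumexp(x, axis=None):
    # Numerically stable log-sum-exp, used to sum probabilities in log space.
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else float(out)

def crf_log_prob(emit, trans, labels):
    """Log conditional probability log P(E|W) of one label sequence under a
    toy linear-chain CRF.  emit[n, e] and trans[e', e] hold precomputed
    feature-weight sums (the sum_m lambda_m * h_m terms of equation (1))."""
    n_pos, _ = emit.shape
    # Unnormalized log-score of the given label sequence (numerator of eq. 1).
    score = emit[0, labels[0]]
    for n in range(1, n_pos):
        score += trans[labels[n - 1], labels[n]] + emit[n, labels[n]]
    # log Z by the forward algorithm: a sum over ALL label sequences (eq. 2),
    # computed in O(N * L^2) instead of enumerating the L^N sequences.
    alpha = emit[0].copy()
    for n in range(1, n_pos):
        alpha = emit[n] + logsumexp(alpha[:, None] + trans, axis=0)
    return score - logsumexp(alpha)
```

Exponentiating this quantity and summing over all possible label sequences gives exactly 1, which is the normalization property that the partition function $Z$ guarantees.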
Taking all these issues into account, the two steps of our system for tree-structured named entity recognition are performed as follows:

1. A CRF model (Lafferty et al., 2001) is used to annotate components on words.

2. A PCFG model (Johnson, 1998) is used to parse complete entity trees on top of the components, i.e. using the components annotated by the CRF as a starting point.

This processing schema is depicted in figure 3. Conditional Random Fields are described shortly in subsection 3.1. The PCFG models, which constitute the main part of this work together with the analysis over tree representations, are described in more detail in the following sections.

In the last few years different CRF implementations have been realized. The implementation we refer to in this work is the one described in (Lavergne et al., 2010), which optimizes the following objective function:

$$-\log(P(E_1^N | W_1^N)) + \rho_1 \|\lambda\|_1 + \frac{\rho_2}{2} \|\lambda\|_2^2 \quad (3)$$

$\|\lambda\|_1$ and $\|\lambda\|_2^2$ are the $l_1$ and $l_2$ regularizers (Riezler and Vasserman, 2004), which together in a linear combination implement the elastic net regularizer (Zou and Hastie, 2005). As mentioned in (Lavergne et al., 2010), this kind of regularizer is very effective for feature selection at training time, which is a very good property when dealing with noisy data and a big set of features.

4 Models for Parsing Trees

The models used in this work for parsing entity trees refer to the models described in (Johnson, 1998), (Charniak, 1997; Caraballo and Charniak, 1997) and (Charniak et al., 1998), which constitute the basis of the maximum entropy model for parsing described in (Charniak, 2000). A similar lexicalized model has been proposed also by Collins (Collins, 1997). All these models are based on a PCFG trained from data and used in a chart parsing algorithm to find the best parse for the given input.
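Such a chart parser over a grammar in Chomsky Normal Form can be sketched as a minimal probabilistic CKY recognizer. This is an illustration only, not the implementation of (Johnson, 1998) that is actually used later in the paper, and a real parser would also store backpointers to recover the best tree.

```python
import math
from collections import defaultdict

def cky(words, unary, binary):
    """Probabilistic CKY chart parsing for a PCFG in Chomsky Normal Form.
    unary:  {(X, w): p}   preterminal rules X => w with probability p
    binary: {(X, Y, Z): p} binary rules X => Y Z with probability p
    Returns, for each label spanning the whole sentence, the log-probability
    of the best parse rooted in that label."""
    n = len(words)
    chart = defaultdict(dict)  # (i, j) -> {label: best log-prob over span}
    # Initialize one-word spans with the preterminal rules.
    for i, w in enumerate(words):
        for (x, ww), p in unary.items():
            if ww == w:
                chart[i, i + 1][x] = math.log(p)
    # Combine adjacent spans bottom-up with the binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (x, y, z), p in binary.items():
                    if y in chart[i, k] and z in chart[k, j]:
                        cand = math.log(p) + chart[i, k][y] + chart[k, j][z]
                        if cand > chart[i, j].get(x, -math.inf):
                            chart[i, j][x] = cand
    return chart[0, n]
```

For instance, with the rule pers.ind => name.first name.last (p = 0.5) and preterminal rules for Nicolas and Sarkozy, the parser assigns log 0.5 to pers.ind over the whole span.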
The PCFG model of (Johnson, 1998) is made of rules of the form:

• $X_i \Rightarrow X_j X_k$
• $X_i \Rightarrow w$

where the $X$ are non-terminal entities and $w$ are terminal symbols (words in our case).¹ The probabilities associated to these rules are:

$$p_{i \rightarrow j,k} = \frac{P(X_i \Rightarrow X_j X_k)}{P(X_i)} \quad (4)$$

$$p_{i \rightarrow w} = \frac{P(X_i \Rightarrow w)}{P(X_i)} \quad (5)$$

The models described in (Charniak, 1997; Caraballo and Charniak, 1997) encode probabilities involving more information, such as head words. In order to have a PCFG model made of rules with their associated probabilities, we extract rules from the entity trees of our corpus. This processing is straightforward; for example, from the tree depicted in figure 2 the following rules are extracted:

S ⇒ amount loc.adm.town time.date.rel amount
amount ⇒ val object
time.date.rel ⇒ name time-modifier
object ⇒ func.coll
func.coll ⇒ kind org.adm
org.adm ⇒ name

Using counts of these rules we then compute maximum likelihood probabilities of the Right Hand Side (RHS) of a rule given its Left Hand Side (LHS). Also the binarization of rules, applied to have all rules in the form of 4 and 5, is straightforward and can be done with simple algorithms not discussed here.

¹These rules are actually in Chomsky Normal Form, i.e. unary or binary rules only. A PCFG, in general, can have any rule; however, the algorithm we are discussing converts the PCFG rules into Chomsky Normal Form, thus for simplicity we provide this formulation directly.

4.1 Tree Representations for Extended Named Entities

As discussed in (Johnson, 1998), an important point for a parsing algorithm is the representation of the trees being parsed. Changing the tree representation can change the performance of the parser significantly. Since there is a large difference between the entity trees used in this work and syntactic trees, from both a meaning and a structure point of view, it is worth performing an analysis with the aim of finding the most suitable representation for our task. In order to perform this analysis, we start from a named entity annotated on the words de notre président, M. Nicolas Sarkozy (of our president, Mr. Nicolas Sarkozy). The corresponding named entity is shown in figure 4. As decided in the annotation guidelines, fillers can be part of a named entity. This can happen for complex named entities involving several words.

[Figure 4: Baseline tree representation used in the PCFG parsing model.]

[Figure 5: Filler-parent tree representation used in the PCFG parsing model.]

The representation shown in figure 4 is the default representation and will be referred to as baseline. A problem created by this representation is the fact that fillers are present also outside entities. Fillers of named entities should, in principle, be distinguished from any other filler, since they may be informative for discriminating entities. Following this intuition, we designed two different representations where entity fillers are contextualized so as to be distinguished from the other fillers. In the first representation we give the filler the same label as the parent node, while in the second representation we use a concatenation of the filler and the label of the parent node.

[Figure 6: Parent-context tree representation used in the PCFG parsing model.]

[Figure 7: Parent-node tree representation used in the PCFG parsing model.]

[Figure 8: Parent-node-filler tree representation used in the PCFG parsing model.]

The representation of figure 8, referred to as parent-node-filler, is a good trade-off between contextual information and rigidity: it still represents entities as concatenations of labels, while using a common special label for entity fillers. This allows keeping the number of entities annotated on words, i.e. components, lower.

Using different tree representations affects both the structure and the performance of the parsing model. The structure is described in the next section, the performance in the evaluation section.

4.2 Structure of the Model
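The contextualized representations of section 4.1 amount to simple tree transformations. As a sketch, the parent-node variant rewrites each non-terminal label with its parent's label appended; trees are assumed here to be nested `(label, children)` tuples with plain strings as word leaves, an illustrative encoding rather than the authors' actual scripts.

```python
def annotate_parent(tree, parent=None):
    """Rewrite every non-terminal label as 'label^parent', mimicking the
    parent-node representation; word leaves are left unchanged."""
    label, children = tree
    new_label = f"{label}^{parent}" if parent is not None else label
    return (new_label,
            [c if isinstance(c, str) else annotate_parent(c, label)
             for c in children])
```

Using a common special label for fillers instead of the parent label, the same traversal would yield the parent-node-filler variant.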
These two representations are shown in figures 5 and 6, respectively. The first one will be referred to as filler-parent, while the second will be referred to as parent-context. A problem that may be introduced by the first representation is that some labels that originally were used only for non-terminal entities will also appear as components, i.e. entities annotated on words. This may introduce some ambiguity.

Another possible contextualization is to annotate each node with the label of its parent node. This representation is shown in figure 7 and will be referred to as parent-node. Intuitively, this representation is effective since entities annotated directly on words also provide the entity of the parent node. However, this representation drastically increases the number of entities, in particular the number of components, which in our case are the set of labels to be learned by the CRF model. For the same reason this representation produces more rigid models: since label sequences vary widely, it is not likely to match sequences not seen in the training data.

The lexicalized models for syntactic parsing described in (Charniak, 2000; Charniak et al., 1998) and (Collins, 1997) integrate more information than what is used in equations 4 and 5. Considering a particular node in the entity tree, not including terminals, the information used is:

• s: the head word of the node, i.e. the most important word of the chunk covered by the current node
• h: the head word of the parent node
• t: the entity tag of the current node
• l: the entity tag of the parent node

The head word of the parent node is defined by percolating head words from children nodes to parent nodes, giving priority to verbs. Head words can be found using automatic approaches based on word and entity tag co-occurrence or mutual information. Using this information, the model described in (Charniak et al., 1998) is P(s|h, t, l).
Being conditioned on several pieces of information, this model can be affected by data sparsity problems. Thus, the model is actually approximated as an interpolation of probabilities:

$$P(s|h,t,l) = \lambda_1 P(s|h,t,l) + \lambda_2 P(s|c_h,t,l) + \lambda_3 P(s|t,l) + \lambda_4 P(s|t) \quad (6)$$

where $\lambda_i$, $i = 1, \ldots, 4$, are parameters of the model to be tuned, and $c_h$ is the cluster of head words for a given entity tag $t$. With such a model, when not all pieces of information are available to reliably estimate the probability with more conditioning, the model can still provide a probability through the terms conditioned on less information.

The use of head words and their percolation over the tree is called lexicalization. The goal of tree lexicalization is to add lexical information all over the tree. This way the probability of all rules can also be conditioned on lexical information, allowing the definition of the probabilities $P(s|h,t,l)$ and $P(s|c_h,t,l)$. Tree lexicalization reflects the characteristics of syntactic parsing, for which the models described in (Charniak, 2000; Charniak et al., 1998) and (Collins, 1997) were defined. Head words are very informative since they constitute keywords instantiating labels, regardless of whether these are syntactic constituents or named entities. However, for named entity recognition it doesn't make sense to give priority to verbs when percolating head words over the tree, all the more so because the head words of named entities are most of the time nouns. Moreover, it doesn't make sense to give priority to the head word of a particular entity with respect to the others: all entities in a sentence have the same importance. Intuitively, the lexicalization of entity trees is not as straightforward as the lexicalization of syntactic trees. At the same time, using non-lexicalized trees doesn't make sense with models like 6, since all the terms involve lexical information.

Although non-lexicalized models like 7 and 8 have proven less effective for syntactic parsing than their lexicalized counterparts, there is evidence showing that they can be effective in our task. With reference to figure 4, considering the entity pers.ind instantiated by Nicolas Sarkozy, our algorithm first detects name.first for Nicolas and name.last for Sarkozy using the CRF model. As mentioned earlier, once the CRF model has detected the components, since entity trees do not have a complex structure compared to syntactic trees, even a simple model like the one in equation 7 or 8 is effective for entity tree parsing. For example, once name.first and name.last have been detected by the CRF, pers.ind is the only entity having name.first and name.last as children. Ambiguities, like those for kind or qualifier, which can appear in many entities, can affect model 7, but they are overcome by model 8, which takes the entity tag of the parent node into account.

Moreover, the use of a CRF allows including in the model many more features than the lexicalized model in equation 6. Using features like word prefixes (P), suffixes (S), capitalization (C), morpho-syntactic features (MS) and other features indicated as F,² the CRF model encodes the conditional probability

$$P(t|w, P, S, C, MS, F) \quad (9)$$

where $w$ is an input word and $t$ is the corresponding component.

The probability of the CRF model, used in the first step to tag input words with components, is combined with the probability of the PCFG model, used to parse entity trees starting from the components. For the PCFG model we use the model of (Johnson, 1998), which defines the probability of a tree $\tau$ as:
$$P(\tau) = \prod_{X \rightarrow \alpha} P(X \rightarrow \alpha)^{C_\tau(X \rightarrow \alpha)} \quad (7)$$

Here the RHS of the rules has been generalized with $\alpha$, representing the RHS of both the unary and binary rules 4 and 5, and $C_\tau(X \rightarrow \alpha)$ is the number of times the rule $X \rightarrow \alpha$ appears in the tree $\tau$. Model 7 is instantiated when using the tree representations shown in figures 4, 5 and 6. When using the representations given in figures 7 and 8, the model is:

$$P(\tau|l) \quad (8)$$

where $l$ is the entity label of the parent node.

Thus the structure of our model is:

$$P(t|w, P, S, C, MS, F) \cdot P(\tau) \quad (10)$$

or

$$P(t|w, P, S, C, MS, F) \cdot P(\tau|l) \quad (11)$$

depending on whether we are using the tree representations given in figures 4, 5 and 6, or those in figures 7 and 8, respectively. A scale factor could be used to combine the two scores, but this is optional as CRFs can provide normalized posterior probabilities.

²The set of features used in the CRF model will be described in more detail in the evaluation section.

5 Related Work

While the models used for named entity detection and the sets of named entities defined over the years have been discussed in the introduction and in section 2, CRFs and models for parsing constitute the main issue in our work, so we discuss some important models here.

Beyond the models for parsing discussed in section 4, together with the motivations for using them or not in our work, another important model for syntactic parsing has been proposed in (Ratnaparkhi, 1999). This model is made of four Maximum Entropy models used in cascade for parsing at different stages. This model also makes use of head words, like those described in section 4, so the same considerations hold; moreover, it seems quite complex for real applications, as it involves the use of four different models together. The models described in (Johnson, 1998), (Charniak, 1997; Caraballo and Charniak, 1997), (Charniak et al., 1998), (Charniak, 2000), (Collins, 1997) and (Ratnaparkhi, 1999) constitute the main individual models proposed for constituent-based syntactic parsing. Later, approaches based on model combination have been proposed, like e.g. the reranking approach described in (Collins and Koo, 2005), among many others, as well as evolutions or improvements of these models.

More recently, approaches based on log-linear models, also called "Tree CRFs", have been proposed for parsing (Clark and Curran, 2007; Finkel et al., 2008), possibly with different training criteria (Auli and Lopez, 2011). Using such models in our work poses basically two problems: one is related to scaling issues, since our data present a large number of labels, which makes CRF training problematic, even more so when using "Tree CRFs"; the other is related to the difference between syntactic parsing and named entity detection tasks, as mentioned in subsection 4.2. Adapting "Tree CRFs" to our task is thus quite complex, constituting an entire work by itself, and we leave it as future work.

Concerning linear-chain CRF models, the one we use is a state-of-the-art implementation (Lavergne et al., 2010), as it implements the most effective optimization algorithms as well as state-of-the-art regularizers (see subsection 3.1). Some improvements of linear-chain CRFs have been proposed, trying to integrate higher-order target-side features (Tang et al., 2006). An integration of the same kind of features has also been tried in the model used in this work, without giving significant improvements, but making model training much harder. Thus, this direction has not been investigated further.

6 Evaluation

In this section we describe the experiments performed to evaluate our models. We first describe the settings used for the two models involved in entity tree parsing, and then describe and comment on the results obtained on the test corpus.

6.1 Settings

The CRF implementation used in this work is described in (Lavergne et al., 2010) and named wapiti.³ We did not optimize the parameters ρ1 and ρ2 of the elastic net (see section 3.1); although this improves the performances significantly and leads to more compact models, the default values lead in most cases to very accurate models. We used a wide set of features in the CRF models, in a window of [−2, +2] around the target word:

• A set of standard features like word prefixes and suffixes of length from 1 to 6, plus some Yes/No features like Does the word start with a capital letter?, etc.

• Morpho-syntactic features extracted from the output of the tool tagger (Allauzen and Bonneau-Maynard, 2008).

• Features extracted from the output of the semantic analyzer (Rosset et al., 2009) provided by the tool WMatch (Galibert, 2009). This analyzer provides morpho-syntactic information as well as semantic information at the same level as named entities.

Using two different sets of morpho-syntactic features results in more effective models, as they create a kind of agreement for a given word in case of a match. Concerning the PCFG model, the grammars, tree binarization and the different tree representations are created with our own scripts, while entity tree parsing is performed with the chart parsing algorithm described in (Johnson, 1998).⁴
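The window features listed above can be sketched with a simple extractor. This is illustrative only: the feature-string format and the function name are invented for the example and are not those of wapiti or of the authors' system.

```python
def word_features(words, i):
    """Toy feature extractor in the spirit of the feature set above:
    word identity, a capitalization flag, and prefixes/suffixes of
    length 1-6, for every word in a [-2, +2] window around position i."""
    feats = []
    for off in range(-2, 3):
        j = i + off
        if not 0 <= j < len(words):
            continue  # window falls outside the sentence
        w = words[j]
        feats.append(f"w[{off}]={w.lower()}")
        feats.append(f"cap[{off}]={w[:1].isupper()}")
        for k in range(1, min(6, len(w)) + 1):
            feats.append(f"p{k}[{off}]={w[:k].lower()}")   # prefix of length k
            feats.append(f"s{k}[{off}]={w[-k:].lower()}")  # suffix of length k
    return feats
```

Each feature string would then be paired with a weight λ_m as in equation (1).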
³Available at http://wapiti.limsi.fr
⁴Available at http://web.science.mq.edu.au/˜mjohnson/Software.htm

Table 3: Statistics showing the characteristics of the different models used in this work

                           CRF                    PCFG
Model                # features   # labels       # rules
baseline              3,041,797         55        29,611
filler-parent         3,637,990        112        29,611
parent-context        3,605,019        120        29,611
parent-node           3,718,089        441        31,110
parent-node-filler    3,723,964        378        31,110

For example, the rule pers.ind ⇒ name.first name.last can appear as it is or contextualized with func.ind, as in figure 8, so the corresponding grammars have more rules. In contrast, the other tree representations modify only fillers, so the number of rules is not affected.

Table 4: Results computed from oracle predictions obtained with the different models presented in this work

                           DEV                TEST
Model                  SER      F1        SER      F1
baseline              20.0%   73.4%      14.2%   79.4%
filler-parent         16.2%   77.8%      12.5%   81.2%
parent-context        15.2%   78.6%      11.9%   81.4%
parent-node            6.6%   96.7%       5.9%   96.7%
parent-node-filler     6.8%   95.9%       5.7%   96.8%

Table 5: Results obtained with our combined algorithm based on CRF and PCFG

                           DEV                TEST
Model                  SER      F1        SER      F1
baseline              33.5%   72.5%      33.4%   72.8%
filler-parent         31.3%   74.4%      33.4%   72.7%
parent-context        30.9%   74.6%      33.3%   72.8%
parent-node           31.2%   77.8%      31.4%   79.5%
parent-node-filler    28.7%   78.9%      30.2%   80.3%

6.2 Evaluation Metrics

All results are expressed in terms of Slot Error Rate (SER) (Makhoul et al., 1999), which has a definition similar to the word error rate used for ASR systems, with the difference that substitution errors are split into three types: i) correct entity type with wrong segmentation; ii) wrong entity type with correct segmentation; iii) wrong entity type with wrong segmentation. Here, i) and ii) are given half points, while iii), as well as insertion and deletion errors, are given full points. Moreover, results are also given using the well-known F1 measure, defined as a function of precision and recall.
6.3 Results

In this section we provide evaluations of the models described in this work, based on the combination of CRF and PCFG and using the different tree representations of named entity trees.

6.3.1 Model Statistics

As a first evaluation, we describe some statistics computed from the CRF and PCFG models using the different tree representations. Such statistics provide interesting clues about how difficult the task is to learn and what performance we can expect from the models. Statistics for this evaluation are presented in table 3. Rows correspond to the different tree representations described in this work, while the columns show the number of features and labels for the CRF models (# features and # labels), and the number of rules for the PCFG models (# rules).

As we can see from the table, the number of rules is the same for the tree representations baseline, filler-parent and parent-context, and likewise for the representations parent-node and parent-node-filler. This is a consequence of the contextualization applied by the latter representations: parent-node and parent-node-filler create several different labels depending on the context, and the corresponding grammars thus have more rules.

Concerning the CRF models, as shown in table 3, the use of the different tree representations results in an increasing number of labels to be learned by the CRF. This aspect is quite critical in CRF learning, as training time is exponential in the number of labels. Indeed, the most complex models, obtained with the parent-node and parent-node-filler tree representations, took roughly 8 days for training. Additionally, increasing the number of labels can create data sparseness problems; however, this problem doesn't seem to arise in our case since, apart from the baseline model, which has considerably fewer features, all the other models have approximately the same number of features, meaning that there are actually enough data to learn the models, regardless of the number of labels.

6.3.2 Evaluations of Tree Representations

In this section we evaluate the models in terms of the evaluation metrics described in the previous section, Slot Error Rate (SER) and F1 measure.

In order to evaluate the PCFG models alone, we performed entity tree parsing using as input reference transcriptions, i.e. manual transcriptions and reference component annotations taken from the development and test sets. This can be considered a kind of oracle evaluation and provides us with an upper bound on the performance of the PCFG models. Results for this evaluation are reported in table 4. As can be intuitively expected, adding more contextualization to the trees results in more accurate models: the simplest model, baseline, has the worst oracle performance, while the filler-parent and parent-context models, which add similar contextualization information, have very similar oracle performances. The same line of reasoning applies to the models parent-node and parent-node-filler, which also add similar contextualization and have very similar oracle predictions. These last two models also have the best absolute oracle performances. However, adding more contextualization to the trees also results in more rigid models: the fact that the models are robust on reference transcriptions and reference component annotations does not imply a proportional robustness on the component sequences generated by the CRF models.

This intuition is confirmed by the results reported in table 5, where a real evaluation of our models is reported, this time using CRF output components as input to the PCFG models to parse entity trees. The results reported in table 5 show in particular that the models using the baseline, filler-parent and parent-context tree representations have similar performances, especially on the test set. The models characterized by the parent-node and parent-node-filler tree representations have indeed the best performances, although the gain with respect to the other models is not as large as could be expected given the difference in the oracle performances discussed above. In particular, the best absolute performance is obtained with the model parent-node-filler. As we mentioned in subsection 4.1, this model represents the best trade-off between rigidity and accuracy, using the same label for all entity fillers but still distinguishing between fillers found in entity structures and other fillers found on words not instantiating any entity.

6.3.3 Comparison with Official Results

As a final evaluation of our models, we provide a comparison with the official results obtained at the 2011 evaluation campaign on extended named entity recognition (Galibert et al., 2011). Results are reported in table 6, where the other two participants in the campaign are indicated as P1 and P2. These two participants used a system based on CRF and a system based on rules for deep syntactic analysis, respectively. In particular, P2 obtained superior performances in previous evaluation campaigns on named entity recognition. The system we proposed at the evaluation campaign used the parent-context tree representation. The results obtained at the evaluation campaign are in the first three lines of table 6. We compare these results with those obtained with the parent-node and parent-node-filler tree representations, reported in the last two rows of the same table. As we can see, the new tree representations described in this work achieve the best absolute performances.

Table 6: Results obtained with our combined algorithm based on CRF and PCFG

Participant            SER
P1                    48.9
P2                    41.0
parent-context        33.3
parent-node           31.4
parent-node-filler    30.2

7 Conclusions

In this paper we have presented a Named Entity Recognition system dealing with extended named entities having a tree structure. Given such a representation of named entities, the task cannot be modeled as sequence labelling. We thus proposed a two-step system based on CRF and PCFG: the CRF annotates entity components directly on words, while the PCFG applies parsing techniques to predict the whole entity tree. We motivated our choice by showing that it is not effective to apply techniques widely used for syntactic parsing, like for example tree lexicalization. We presented an analysis of different tree representations for the PCFG, which affect parsing performance significantly.

We provided and discussed a detailed evaluation of all the models obtained by combining CRF and PCFG with the different tree representations proposed. Our combined models result in better performances with respect to the other models proposed at the official evaluation campaign, as well as our previous model used at the evaluation campaign.

Acknowledgments

This work has been funded by the project Quaero, under the program Oseo, French State agency for innovation.

References

Alexandre Allauzen and Hélène Bonneau-Maynard. 2008. Training and evaluation of POS taggers on the French MultiTag corpus. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May.

Michael Auli and Adam Lopez. 2011. Training a Log-Linear Parser with Loss Functions via Softmax-Margin. In Proceedings of Empirical Methods for Natural Language Processing, pages 333–343, Edinburgh, U.K.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71.

Sharon A. Caraballo and Eugene Charniak. 1997. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275–298.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, AAAI'97/IAAI'97, pages 598–603. AAAI Press.

Eugene Charniak, Sharon Goldwater, and Mark Johnson. 1998. Edge-based best-first chart parsing. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 127–133. Morgan Kaufmann.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23, Stroudsburg, PA, USA. Association for Computational Linguistics.

Michael Collins and Terry Koo. 2005. Discriminative Re-ranking for Natural Language Parsing. Computational Linguistics, 31(1):25–70.

Marco Dinarelli and Sophie Rosset. 2011. Models Cascade for Tree-Structured Named Entity Detection. In Proceedings of IJCNLP 2011.

G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. 2004. The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation. In Proceedings of LREC 2004, pages 837–840.

Jenny R. Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, Feature-based, Conditional Random Field Parsing. In Proceedings of the Association for Computational Linguistics, pages 959–967, Columbus, Ohio.

Olivier Galibert. 2009. Approches et méthodologies pour la réponse automatique à des questions adaptées à un cadre interactif en domaine ouvert. Ph.D. thesis, Université Paris Sud, Orsay.

Olivier Galibert, Sophie Rosset, Cyril Grouin, Pierre Zweigenbaum, and Ludovic Quintard. 2011. Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions. In Proceedings of IJCNLP 2011.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: a brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 466–471, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karën Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In Proceedings of the Linguistic Annotation Workshop (LAW).

Stefan Hahn, Marco Dinarelli, Christian Raymond, Fabrice Lefèvre, Patrick Lehnen, Renato De Mori, Alessandro Moschitti, Hermann Ney, and Giuseppe Riccardi. 2010. Comparing stochastic approaches to spoken language understanding in multiple languages. IEEE Transactions on Audio, Speech and Language Processing (TASLP), 99.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613–632.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, Williamstown, MA, USA, June.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504–513. Association for Computational Linguistics, July.

John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop, pages 249–252.

Adwait Ratnaparkhi. 1999. Learning to Parse Natural Language with Maximum Entropy Models. Machine Learning, 34(1-3):151–175.

Stefan Riezler and Alexander Vasserman. 2004. Incremental feature selection and l1 regularization for relaxed maximum-entropy modeling. In Proceedings of the International Conference on Empirical Methods for Natural Language Processing (EMNLP).

Sophie Rosset, Olivier Galibert, Guillaume Bernard, Eric Bilinski, and Gilles Adda. 2009. The LIMSI multilingual, multitask QAst system. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 480–487, Berlin, Heidelberg. Springer-Verlag.

Satoshi Sekine and Chikashi Nobata. 2004. Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy. In Proceedings of LREC.

Jie Tang, MingCai Hong, Juan-Zi Li, and Bangyong Liang. 2006. Tree-Structured Conditional Random Fields for Semantic Annotation. In Proceedings of the International Semantic Web Conference, pages 640–653. Springer.

Azeddine Zidouni, Sophie Rosset, and Hervé Glotin. 2010. Efficient combined approach for named entity recognition in spoken language. In Proceedings of the International Conference of the Speech Communication Association (Interspeech), Makuhari, Japan.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B, 67:301–320.

When Did that Happen? — Linking Events and Relations to Timestamps

Dirk Hovy*, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Chris Welty
IBM T. J. Watson Research Center
19 Skyline Drive
Hawthorne, NY 10532

[email protected], {fanj,gliozzo,siddharth,welty}@us.ibm.com

* The first author conducted this research during an internship at IBM Research.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 185–193, Avignon, France, April 23–27 2012. ©2012 Association for Computational Linguistics

Abstract

We present work on linking events and fluents (i.e., relations that hold for certain periods of time) to temporal information in text, which is an important enabler for many applications such as timelines and reasoning. Previous research has mainly focused on temporal links for events, and we extend that work to include fluents as well, presenting a common methodology for linking both events and relations to timestamps within the same sentence. Our approach combines tree kernels with classical feature-based learning to exploit context and achieves competitive F1-scores on event-time linking, and comparable F1-scores for fluents. Our best systems achieve F1-scores of 0.76 on events and 0.72 on fluents.

1 Introduction

It is a long-standing goal of NLP to process natural language content in such a way that machines can effectively reason over the entities, relations, and events discussed within that content. The applications of such technology are numerous, including intelligence gathering, business analytics, healthcare, and education. Indeed, the promise of machine reading is actively driving research in this area (Etzioni et al., 2007; Barker et al., 2007; Clark and Harrison, 2010; Strassel et al., 2010). Temporal information is a crucial aspect of this task. For a machine to successfully understand natural language text, it must be able to associate time points and temporal durations with the relations and events it discovers in text.

In this paper we present methods to establish links between events (e.g. "bombing" or "election") or fluents (e.g. "spouseOf" or "employedBy") and temporal expressions (e.g. "last Tuesday" and "November 2008"). While previous research has mainly focused on temporal links for events only, we deal with both events and fluents with the same method. For example, consider the sentence below:

Before his death in October, Steve Jobs led Apple for 15 years.

For a machine reading system processing this sentence, we would expect it to link the fluent CEO of (Steve Jobs, Apple) to the time duration "15 years". Similarly, we expect it to link the event "death" to the time expression "October". We do not take a strong "ontological" position on what events and fluents are; for our task, these distinctions are made a priori. In other words, events and fluents are input to our temporal linking framework. In the remainder of this paper, we also do not make a strong distinction between relations in general and fluents in particular, and use the terms interchangeably, since our focus is only on the specific types of relations that represent fluents. While we only use binary relations in this work, there is nothing in the framework that would prevent the use of n-ary relations. Our work focuses on accurately identifying temporal links for eventual use in a machine reading context.

In this paper, we describe a single approach that applies to both fluents and events, using feature engineering as well as tree kernels. We show that we can achieve good results for both events and fluents using the same feature space, and advocate the versatility of our approach by achieving competitive results on yet another similar task with a different data set.
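The expected behavior on the example sentence above can be written down as data. A minimal sketch, where the data structures are ours and not the authors' annotation format:

```python
# Illustrative data structures for the linking task; only the example
# sentence, mentions, and links come from the paper.
sentence = "Before his death in October, Steve Jobs led Apple for 15 years."

events = ["death"]                                 # event mentions, given a priori
fluents = [("CEO_of", "Steve Jobs", "Apple")]      # fluent (relation) mentions
times = ["October", "15 years"]                    # temporal expressions

# The task: decide which (event/fluent, time) pairs hold a tlink.
expected_tlinks = [
    ("death", "October"),                              # event -> time point
    (("CEO_of", "Steve Jobs", "Apple"), "15 years"),   # fluent -> duration
]
```

Both kinds of mentions are treated uniformly as the first element of a candidate pair, which is what allows a single method to cover events and fluents.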
Our approach requires us to capture contextual properties of the text surrounding events, fluents and time expressions that enable an automatic system to detect temporal links within our framework. A common strategy for this is to follow standard feature engineering methodology and manually develop features for a machine learning model from the lexical, syntactic and semantic analysis of the text. A key contribution of our work in this paper is to demonstrate a shallow tree-like representation of the text that enables us to employ tree kernel models and more accurately detect temporal links. The feature space represented by such tree kernels is far larger than a manually engineered feature space, and is capable of capturing the contextual information required for temporal linking.

The remainder of this paper goes into the details of our approach for temporal linking, and presents empirical evidence for its effectiveness. The contributions of this paper can be summarized as follows:

1. We define a common methodology to link events and fluents to timestamps.

2. We use tree kernels in combination with classical feature-based approaches to obtain significant gains by exploiting context.

3. Empirical evidence illustrates that our framework for temporal linking is very effective for the task, achieving an F1-score of 0.76 on events and 0.72 on fluents/relations, as well as 0.65 for TempEval-2, approaching the state of the art.

2 Related Work

Most of the previous work on relation extraction focuses on entity-entity relations, such as in the ACE (Doddington et al., 2004) tasks. Temporal relations are part of this, but to a lesser extent. The primary research effort in event temporality has gone into ordering events with respect to one another (e.g., Chambers and Jurafsky (2008)) and detecting their typical durations (e.g., Pan et al. (2006)).

Recently, the TempEval workshops have focused on temporal issues in NLP. Some of the TempEval tasks overlap with ours in many ways. Our task is similar to tasks A and C of TempEval-1 (Verhagen et al., 2007) in the sense that we attempt to identify temporal relations between events and time expressions or document dates. However, we do not use a restricted set of events, but focus primarily on a single temporal relation, tlink, instead of named relations like BEFORE, AFTER or OVERLAP (although we show that we can incorporate these as well). Part of our task is similar to task C of TempEval-2 (Verhagen et al., 2010): determining the temporal relation between an event and a time expression in the same sentence. In this paper, we apply our system to TempEval-2 data and compare our performance to the participating systems.

Our work is similar to that of Boguraev and Ando (2005), whose research deals only with temporal links between events and time expressions (and does not consider relations at all). They employ a sequence tagging model with manual feature engineering for the task and achieved state-of-the-art results on TimeBank (Pustejovsky et al., 2003) data. Our task is slightly different because we include relations in the temporal linking, and our use of tree kernels enables us to explore a wider feature space very quickly.

Filatova and Hovy (2001) also explore temporal linking with events, but do not assume that events and time stamps have been provided by an external process. They used a heuristics-based approach to assign temporal expressions to events (also relying on proximity as a base case). They report the accuracy of the assignment for the correctly classified events, the best being 82.29%. Our best event system achieves an accuracy of 84.83%. These numbers are difficult to compare, however, since accuracy does not adequately capture the performance of a system on a task with so many negative examples.

Mirroshandel et al. (2011) describe the use of syntactic tree kernels for event-time links. Their results on TempEval are comparable to ours. In contrast to them, however, we found that syntactic tree kernels alone do not perform as well as using several flat tree representations.

3 Problem Definition

The task of linking events and relations to timestamps can be defined as follows: given a set of expressions denoting events or relation mentions in a document, and a set of time expressions in the same document, find all instances of the tlink relation between elements of the two input sets. The existence of a tlink(e, t) means that e, which is an event or a relation mention, occurs within the temporal context specified by the time expression t.

Thus, our task can be cast as a binary relation classification task: for each possible pair of (event/relation, time) in a document, decide whether there exists a link between the two, and if so, express it in the data.

In addition, we make these assumptions about the data:

1. There does not exist a timestamp for every event/relation in a document. Although events and relations typically have temporal context, it may not be explicitly stated in a document.

2. Every event/relation has at most one time expression associated with it. This is a simplifying assumption, which in the case of relations we explore as future work.

3. Each temporal expression can be linked to one or more events or relations. Since multiple events or relations may happen at a given time, it is safe to assume that each temporal expression can be linked to more than one event/relation.

In general, events/relations and their associated timestamps may occur within the same sentence or across different sentences. In this paper, we focus our effort and our evaluation on the same-sentence linking task. In order to solve the problem of temporal linking completely, however, it will be important to also address the links that hold between entities across sentences. We estimate, based on our data set, that across-sentence links account for 41% of all correct event-time pairs in a document. For fluents, the ratio is much higher: more than 80% of the correct fluent-time links are across sentences. One of the main obstacles for our approach in the cross-sentence case is the very low ratio of positive to negative instances (3:100) in the set of all pairs in a document. Most pairs are not linked to one another.

4 Temporal Linking Framework

As previously mentioned, we approach the temporal linking problem as a classification task. In the framework of classification, we refer to each pair of (event/relation, temporal expression) occurring within a sentence as an instance. The goal is to devise a classifier that separates positive (i.e., linked) instances from negative ones, i.e., pairs where there is no link between the event/relation and the temporal expression in question. The latter case is far more frequent, so we have an inherent bias toward negative examples in our data.¹

Note that the basis of the positive and negative links is the context around the target terms. It is impossible even for humans to determine the existence of a link based only on the two terms without their context. For instance, given just two words (e.g., "said" and "yesterday") there is no way to tell whether they form a positive or a negative example. We need the context to decide.

Therefore, we base our classification models on contextual features drawn from lexical and syntactic analyses of the text surrounding the target terms. For this, we first define a feature-based approach, and then improve it by using tree kernels. These two subsections, plus the treatment of fluent relations, are the main contributions of this paper. In all of this work, we employ SVM classifiers (Vapnik, 1995) for machine learning.

¹ Initially, we employed an instance filtering method to address this, which proved to be ineffective and was subsequently left out.

4.1 Feature Engineering

A manual analysis of development data provided several intuitions about the kinds of features that would be useful in this task. Based on this analysis, and with inspiration from previous work (cf. Boguraev and Ando (2005)), we established three categories of features, described below.

Features describing events or relations. We check whether the event or relation is phrasal, a verb, or a noun; whether it is present tense, past tense, or progressive; the type assigned to the event/relation by the UIMA type system used for processing; and whether it includes certain trigger words, such as reporting verbs ("said", "reported", etc.).

Features describing temporal expressions. We check for the presence of certain trigger words (last, next, old, numbers, etc.) and the type of the expression (DURATION, TIME, or DATE) as specified by the UIMA type system.
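The first two feature categories can be sketched as simple feature maps over upstream analysis results. All key names, trigger lists, and the input dictionary layout below are ours, since the paper does not publish its exact feature templates:

```python
# Illustrative trigger lists; the paper's exact lists are not given.
REPORTING_VERBS = {"said", "reported"}
TIME_TRIGGERS = {"last", "next", "old"}

def event_features(mention):
    """First feature category: properties of the event/relation mention.
    `mention` is a dict of upstream analysis results (POS, tense, UIMA
    type); the key names are assumptions, not the authors' schema."""
    return {
        "pos": mention["pos"],                      # verb / noun / phrasal
        "tense": mention.get("tense", "none"),      # present / past / progressive
        "uima_type": mention["type"],               # type from the UIMA type system
        "is_reporting": mention["text"].lower() in REPORTING_VERBS,
    }

def time_features(expr):
    """Second feature category: properties of the temporal expression."""
    toks = expr["text"].lower().split()
    return {
        "kind": expr["kind"],                       # DURATION / TIME / DATE
        "has_trigger": any(t in TIME_TRIGGERS or t.isdigit() for t in toks),
    }

# One instance = one (event/relation, time) pair; merge both feature sets.
fv = {**event_features({"text": "said", "pos": "verb",
                        "tense": "past", "type": "Event"}),
      **time_features({"text": "last Tuesday", "kind": "DATE"})}
```

The third category (syntactic/structural context features) would be added to the same feature vector in the same way.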
Features describing context. We also include syntactic/structural features, such as testing whether the relation/event dominates the temporal expression, which one comes first in the sentence order, and whether either of them is dominated by a separate verb, a preposition, "that" (which often indicates a subordinate clause), or counterfactual nouns or verbs (which would negate the temporal link).

It is not surprising that some of the most informative features (event comes before temporal expression, time is syntactic child of event) are strongly correlated with the baselines. Less salient features include tests for certain words indicating whether the event is a noun or a verb, and if a verb, which tense it has and whether it is a reporting verb.

4.2 Tree Kernel Engineering

We expect that there exist certain patterns between the entities of a temporal link, which manifest on several levels: some on the lexical level, others expressed by certain sequences of POS tags, NE labels, or other representations. Kernels provide a principled way of expanding the number of dimensions in which we search for a decision boundary, and allow us to easily model local sequences and patterns in a natural way (Giuliano et al., 2009). While it is possible to define a space in which we find a decision boundary that separates positive and negative instances with manually engineered features, such features can hardly capture the notion of context as well as those explored by a tree kernel.

Tree kernels are a family of kernel functions developed to compute the similarity between tree structures by counting the number of subtrees they have in common. This generates a high-dimensional feature space that can be handled efficiently using dynamic programming techniques (Shawe-Taylor and Cristianini, 2004). For our purposes we used an implementation of the Subtree and Subset Tree (SST) kernel (Moschitti, 2006). The advantages of using tree kernels are two-fold. First, thanks to an existing implementation (SVMlight with tree kernels, Moschitti (2004)), it is faster and easier to use than traditional feature engineering. Second, the tree structure allows us to use different levels of representation (POS, lemma, etc.) and combine their contributions, while at the same time taking into account the ordering of labels. We use POS, lemma, semantic type, and a representation that replaces each word with a concatenation of its features (capitalization, countable, abstract/concrete noun, etc.).

We developed a shallow tree representation that captures the context of the target terms without encoding too much structure (which may prevent generalization). In essence, our tree structure induces behavior somewhat similar to a string kernel. In addition, we can model the tasks by providing specific markup on the generated tree. For example, in our experiments we used the labels EVENT (or equivalently RELATION) and TIMESTAMP to mark our target terms. In order to reduce the complexity of this comparison, we focus on the substring between the event/relation and the time stamp; the rest of the tree structure is truncated.

Figure 1 illustrates an example of the structure described so far for both lemmas and POS tags (note that the lowest level of the tree contains tokenized items, so their number can differ from the actual words, as in "attorney general"). Similar trees are produced for each level of representation used, and for each instance (i.e., pair of time expression and event/relation). If a sentence contains more than one event/relation, we create separate trees for each of them, which differ in the position of the EVENT/RELATION marks (at level 1 of the tree).

The tree kernel implicitly expands this structure into a number of substructures, allowing us to capture sequential patterns in the data. As we will see, this step provides significant boosts to task performance.

Curiously, using a full-parse syntactic tree as the input representation did not help performance. This is in line with our finding that syntactic relations are less important than sequential patterns (see also Section 5.2). Therefore we adopted the "string kernel like" representation illustrated in Figure 1.
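A shallow tree of this kind can be serialized in the bracketed notation accepted by tree-kernel SVM packages such as SVM-light-TK. The sketch below is simplified (one token per node, whereas Figure 1 lets a TERM or TIME node span several tokens), and the function name is ours:

```python
def shallow_tree(labels, event_idx, time_idx, root="BOW"):
    """Serialize a flat two-level tree in the style of Figure 1.
    Only the span between the event and the time expression is kept
    (the rest of the sentence is truncated); the target terms are
    marked EVENT and TIME, all other tokens TERM.  Simplified sketch:
    one leaf per level-1 node."""
    lo, hi = sorted((event_idx, time_idx))
    nodes = []
    for i in range(lo, hi + 1):
        role = "EVENT" if i == event_idx else "TIME" if i == time_idx else "TERM"
        nodes.append("(%s (TOK %s))" % (role, labels[i]))
    return "(%s %s)" % (root, " ".join(nodes))

# Lemma-level tree for the span between "demonstrated" and "Saturday"
# in the Figure 1 sentence.
lemmas = ["demonstrate", "outside", "the", "attorney", "general",
          "office", "in", "cairo", "last", "saturday"]
bow = shallow_tree(lemmas, event_idx=0, time_idx=9)
```

Running the same function over POS tags (with `root="BOP"`) or semantic types yields the parallel representations whose kernel scores are combined.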
[Figure 1: Input sentence and tree kernel representations for bag of words (BOW) and POS tags (BOP). Example sentence: "Scores of supporters of detained Egyptian opposition leader Nur demonstrated outside the attorney general's office in Cairo last Saturday, demanding he be freed immediately." Level 1 marks the span with EVENT, TERM, and TIME nodes; level 2 holds the tokens: lemmas (demonstrate, outside, attorney general, office, in, cairo, last, saturday) for BOW, and POS tags (VBD, ADV, NNP, NN, IN, NNP, JJ, NNP) for BOP.]

5 Evaluation

We now apply our models to real-world data and empirically demonstrate their effectiveness at the task of temporal linking. In this section, we describe the data sets that were used for evaluation, the baselines for comparison, parameter settings, and the results of the experiments.

5.1 Benchmark

We evaluated our approach on three different tasks:

1. Linking timestamps and events in the IC domain

2. Linking timestamps and relations in the IC domain

3. Linking events to temporal expressions (TempEval-2, task C)

The first two data sets contained annotations in the intelligence community (IC) domain, i.e., mainly news reports about terrorism, and comprised 169 documents. This dataset was developed in the context of the machine reading program (MRP) (Strassel et al., 2010). In both cases our goal is to develop a binary classifier to judge whether the event (or relation) overlaps with the time interval denoted by the timestamp. Success of this classification can be measured by precision and recall on annotated data.

We originally considered using accuracy as a measure of performance, but this does not correctly reflect the true performance of the system: given the skewed nature of the data (a much smaller number of positive examples), we could achieve a high accuracy simply by classifying all instances as negative, i.e., not assigning a time stamp at all. We thus decided to report precision, recall and F1. Unless stated otherwise, results were obtained via 10-fold cross-validation (10-CV).

The number of instances (i.e., pairs of event/relation and temporal expression) for each of the cases listed above was (positive/negative counts in brackets):

• events: 2046 (505 positive, 1541 negative)

• relations: 6526 (1847 positive, 4679 negative)

The size of the relation data set after filtering is 5511 (1847 positive, 3395 negative).

In order to increase the originally lower number of event instances, we made use of the annotated event coreference as a sort of closure to add more instances: if events A and B corefer, and there is a link between A and time expression t, then there is also a link between B and t. This was not explicitly expressed in the data.

For the task at hand, we used gold-standard annotations for timestamps, events and relations. The task was thus not the identification of these objects (a necessary precursor and a difficult task in itself), but the decision as to which events and time expressions could and should be linked.
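The argument against accuracy can be made concrete with the event counts above. A degenerate classifier that never assigns a timestamp looks deceptively good under accuracy, while precision, recall, and F1 all collapse:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision/recall/F1 from true-positive, false-positive,
    and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Event data: 505 positive vs. 1541 negative instances (Section 5.1).
pos, neg = 505, 1541

# Labeling every pair "no link" still scores about 75% accuracy ...
accuracy_all_negative = neg / (pos + neg)

# ... but P, R, and F1 are all zero, exposing the useless classifier.
p, r, f = precision_recall_f1(tp=0, fp=0, fn=pos)
```

This is why the paper reports precision, recall, and F1 rather than accuracy.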
There- BL-parent BL-closest features +tree kernel fore we trained three different binary classifiers (using the same feature set) for the first three of those types (for which there was sufficient train- Figure 2: Performance on events ing data) and we used a one-versus-all strategy to System Accuracy distinguish positive from negative examples. The TRIOS 65% output of the system is the category with the high- this work 64.5% est SVM decision score. Since we only use three JU-CSE, NCSU-indi all 63% labels, we incur an error every time the gold la- TRIPS, USFD2 bel is something else. Note that this is stricter than the evaluation in the actual task, which left Table 1: Comparison to Best Systems in TempEval-2 contestants with the option of skipping examples their systems could not classify. 5.3 Events 5.2 Baselines Figure 2 shows the improvements of the feature- Intuitively, one would expect temporal expres- based approach over the two baseline, and the ad- sions to be close to the event they denote, or even ditional gain obtained by using the tree kernel. syntactically related. In order to test this, we ap- Both the features and tree kernels mainly improve plied two baselines. In the first, each temporal ex- precision, while the tree kernel adds a small boost pression was linked to the closest event (as mea- in recall. It is remarkable, though, that the closest- sured in token distance). In the second, we at- event baseline has a very high recall value. This Page 1 tached each temporal expression to its syntactic suggests that most of the links actually do occur head, if the head was an event. Results are re- between items that are close to one another. For a ported in Figure 2. possible explanation for the low precision value, While these results are encouraging for our see the error analysis (Section 5.5). 
task, it seems at first counter-intuitive that the Using a two-tailed t-test, we compute the sig- syntactic baseline does worse than the proximity- nificance in the difference between the F1-scores. based one. It does, however, reveal two facts: Both the feature-based and the tree kernel ap- events are not always synonymous with syntactic proach improvements are statistically significant units, and they are not always bound to tempo- at p < 0.001 over the baseline scores. ral expressions through direct syntactic links. The Table 1 compares the performances of our sys- latter makes even more sense given that the links tem to the state-of-the-art systems on TempEval-2 can even occur across sentence boundaries. Pars- Data, task C, showing that our approach is very ing quality could play a role, yet seems far fetched competitive. The best systems there used sequen- to account for the difference. tial models. We attribute the competitive nature More important than syntactic relations seem of our results to the use of tree kernels, which en- to be sequential patterns on different levels, a fact ables us to make use of contextual information. we exploit with the different tree representations used (POS tags, NE types, etc.). 5.4 Relations For relations, we only applied the closest- In general, performance for relations is not as high relation baseline. Since relations consist of two or as for events (see Figure 3). The reason here is more arguments that occur in different, often sep- two-fold: relations consist of two (or more) ele- arated syntactic constituents, a syntactic approach ments, which can be in various positions with re- seems futile, especially given our experience with spect to one another and the temporal expression, events. Results are reported in Figure 3. 
and each relation can be expressed in a number of 190 baseline comparison Evaluation Metric Relations 100 90 80.6 ples where time expression and event/relation are 80 70.8 74.0 70.4 72.2 70 63.1 immediately adjacent, but unrelated, as in “the 60 50 35.0 man arrested last Tuesday told the police ...”, % 40 29.0 30 24.0 where last Tuesday modifies arrested. It limits 20 10 0 the amount of context that is available to the tree Precision Recall F1 kernels, since we truncate the tree representations metric BL-closest features +tree kernel to the words between those two elements. This case closely resembles the problem we see in the learning curves closest-event/relation baseline, which, as we have Figure 3: Performance on relations/fluents seen, does not perform too well. In this case, the Learning Curves Relations 80 incorrect event (“told”) is as close to the time ex- 75 pression as the correct one (“arrested”), resulting 70 65 in a false positive that affects precision. Features F1 score 60 capturing the order of the elements do not seem 55 50 help here, since the elements can be arranged in 45 any order (i.e., temporal expression before or af- 40 0 10 20 30 40 50 60 70 80 90 100 ter the event/relation). The only way to solve this % of data problem would be to include additional informa- features w/ tree tion about whether a time expression is already kernel attached to another event/relation. Figure 4: Learning curves for relation models 5.6 Ablations To quantify the utility of each tree representation, different ways. we also performed all-but-one ablation tests, i.e., Again, we perform significance tests on the dif- left out each of the tree representations in turn, ran ference in F1 scores and find that our improve- 10-fold cross-validation on the data and observed ments over the baseline are statistically significant the effect on F1. The larger the loss in F1, the Page 1 at p < 0.001. 
The improvement of the tree kernel more informative the left-out-representation. We over the feature-based approach, however, are not performed ablations for both events and relations, statistically significant at the same value. and found that the ranking of the representations The learning curve over parts of the training is the same for both. data (exemplary shown here for relations, Figure In events and relations alike, leaving out POS 4)2 indicates that there is another advantage to us- trees has the greatest effect on F1, followed by ing tree kernels: the approach can benefit from the feature-bundle representation. Lemma and se- more data. This is conceivably because it allows mantic type representation have less of an impact. the kernel to find more common subtrees in the We hypothesize that the former two capture un- various representations the more examples it gets, derlying regularities better by representing differ- while the feature space rather finds more instances ent words with the same label. Lemmas in turn that invalidate the expressiveness of features (i.e., are too numerous to form many recurring pat- it encounters positive and negative instances that terns, and semantic type, while having a smaller have very similar feature vectors). The curve sug- label alphabet, does not assign a label to every gests that tree kernels could yield even better re- word, thus creating a very sparse representation sults with more data, while there is little to no ex- that picks up more noise than signal. pected gain using only features. Page 1 In preliminary tests, we also used annotated 5.5 Error Analysis dependency trees as input to the tree kernel, but found that performance improved when they were Examining the misclassified examples in our data, left out. 
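The all-but-one ablation procedure can be sketched as a loop over representations. The function names are ours, and the toy scorer below merely mirrors the reported ranking with invented gain values; the paper re-runs 10-fold cross-validation where `score` is called:

```python
def all_but_one_ablation(representations, score):
    """All-but-one ablation: remove each representation in turn,
    re-score the reduced system (a stand-in for re-running 10-fold CV),
    and rank representations by the F1 lost when they are removed."""
    full = score(representations)
    losses = {r: full - score([s for s in representations if s != r])
              for r in representations}
    # larger loss in F1 => more informative representation
    return sorted(losses.items(), key=lambda kv: kv[1], reverse=True)

# Toy scorer reflecting the reported ranking (POS > feature bundle >
# lemma, semantic type); the numeric gains are purely illustrative.
gains = {"POS": 0.05, "feature-bundle": 0.03,
         "lemma": 0.01, "semantic-type": 0.01}
ranking = all_but_one_ablation(list(gains),
                               lambda reps: 0.60 + sum(gains[r] for r in reps))
```

The returned list orders representations from most to least informative, matching the qualitative ranking reported above.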
This is at odds with work that clearly we find that both feature-based and tree-kernel showed the value of syntactic tree kernels (Mir- approaches struggle to correctly classify exam- roshandel et al., 2011). We identify two poten- 2 The learning curve for events looks similar and is omit- tial causes—either our setup was not capable of ted due to space constraints. correctly capturing and exploiting the information 191 from the dependency trees, or our formulation of the 22nd National Conference for Artificial Intelli- the task was not amenable to it. We did not inves- gence, Vancouver, Canada, July. tigate this further, but leave it to future work. Branimir Boguraev and Rie Kubota Ando. 2005. Timeml-compliant text analysis for temporal rea- 6 Conclusion and Future Work soning. In Proceedings of IJCAI, volume 5, pages 997–1003. IJCAI. We cast the problem of linking events and rela- Nathanael Chambers and Dan Jurafsky. 2008. Unsu- tions to temporal expressions as a classification pervised learning of narrative event chains. pages task using a combination of features and tree ker- 789–797. Association for Computational Linguis- nels, with probabilistic type filtering. Our main tics. contributions are: Peter Clark and Phil Harrison. 2010. Machine read- ing as a process of partial question-answering. In • We showed that within-sentence temporal Proceedings of the NAACL HLT Workshop on For- links for both events and relations can be ap- malisms and Methodology for Learning by Reading, Los Angeles, CA, June. proached with a common strategy. George Doddington, Alexis Mitchell, Mark Przybocki, • We developed flat tree representations and Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extrac- showed that these produce considerable tion program – tasks, data and evaluation. In Pro- gains, with significant improvements over ceedings of the LREC Conference, Canary Islands, different baselines. Spain, July. 
Oren Etzioni, Michele Banko, and Michael Cafarella. • We applied our technique without great ad- 2007. Machine reading. In Proceedings of the justments to an existing data set and achieved AAAI Spring Symposium Series, Stanford, CA, competitive results. March. Elena Filatova and Eduard Hovy. 2001. Assigning • Our best systems achieve F1 score of 0.76 time-stamps to event-clauses. In Proceedings of on events and 0.72 on relations, and are ef- the workshop on Temporal and spatial information fective at the task of temporal linking. processing, volume 13, pages 1–8. Association for Computational Linguistics. We developed the models as part of a machine Claudio Giuliano, Alfio Massimiliano Gliozzo, and reading system and are currently evaluating it in Carlo Strapparava. 2009. Kernel methods for min- an end-to-end task. imally supervised wsd. Computational Linguistics, Following tasks proposed in TempEval-2, we 35(4). plan to use our approach for across-sentence clas- Seyed A. Mirroshandel, Mahdy Khayyamian, and sification, as well as a similar model for linking Gholamreza Ghassem-Sani. 2011. Syntactic tree kernels for event-time temporal relation learning. entities to the document creation date. Human Language Technology. Challenges for Com- puter Science and Linguistics, pages 213–223. Acknowledgements Alessandro Moschitti. 2004. A study on convolution We would like to thank Alessandro Moschitti for kernels for shallow semantic parsing. In Proceed- his help with the tree kernel setup, and the review- ings of the 42nd Annual Meeting on Association for ers who supplied us with very constructive feed- Computational Linguistics, pages 335–es. Associa- tion for Computational Linguistics. back. Research supported in part by Air Force Alessandro Moschitti. 2006. Making tree kernels Contract FA8750-09-C-0172 under the DARPA practical for natural language learning. In Proceed- Machine Reading Program. ings of EACL, volume 6. Feng Pan, Rutu Mulkar, and Jerry R. Hobbs. 2006. 
Learning event durations from event descriptions. References In Proceedings of the 21st International Conference Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, on Computational Linguistics and the 44th annual James Fan, Noah Friedland, Michael Glass, Jerry meeting of the Association for Computational Lin- Hobbs, Eduard Hovy, David Israel, Doo Soon Kim, guistics, pages 393–400. Association for Computa- Rutu Mulkar-Mehta, Sourabh Patwardhan, Bruce tional Linguistics. Porter, Dan Tecuci, and Peter Yeh. 2007. Learn- James Pustejovsky, Patrick Hanks, Roser Saurı, An- ing by reading: A prototype system, performance drew See, Robert Gaizauskas, Andrea Setzer, baseline and lessons learned. In Proceedings of Dragomir Radev, Beth Sundheim, David Day, Lisa 192 Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, pages 647–656. John Shawe-Taylor and Nello Christianini. 2004. Ker- nel Methods for Pattern Analysis. Cambridge Uni- versity Press. Stephanie Strassel, Dan Adams, Henry Goldberg, Jonathan Herr, Ron Keesing, Daniel Oblinger, Heather Simpson, Robert Schrag, and Jonathan Wright. 2010. The DARPA Machine Read- ing Program-Encouraging Linguistic and Reason- ing Research with a Series of Reading Tasks. In Proceedings of LREC 2010. Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer, New York, NY. Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Puste- jovsky. 2007. Semeval-2007 task 15: Tempeval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evalu- ations, pages 75–80. Association for Computational Linguistics. Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. Semeval-2010 task 13: Tempeval-2. In Proceedings of the 5th Interna- tional Workshop on Semantic Evaluation, pages 57– 62. Association for Computational Linguistics. 
Compensating for Annotation Errors in Training a Relation Extractor

Bonan Min
New York University
715 Broadway, 7th floor
New York, NY 10003 USA
[email protected]

Ralph Grishman
New York University
715 Broadway, 7th floor
New York, NY 10003 USA
[email protected]

Abstract

The well-studied supervised Relation Extraction algorithms require training data that is accurate and has good coverage. To obtain such a gold standard, the common practice is to do independent double annotation followed by adjudication. This takes significantly more human effort than annotation done by a single annotator. We do a detailed analysis on a snapshot of the ACE 2005 annotation files to understand the differences between single-pass annotation and the more expensive, nearly three-pass process, and then propose an algorithm that learns from the much cheaper single-pass annotation and achieves performance on a par with an extractor trained on multi-pass annotated data. Furthermore, we show that given the same amount of human labor, the better way to do relation annotation is not to annotate with high-cost quality assurance, but to annotate more.

1. Introduction

Relation Extraction aims at detecting and categorizing semantic relations between pairs of entities in text. It is an important NLP task with many practical applications, such as answering factoid questions, building knowledge bases and improving web search.

Supervised methods for relation extraction have been studied extensively since rich annotated linguistic resources, e.g. the Automatic Content Extraction [1] (ACE) training corpus, were released. We will give a summary of related methods in section 2. Those methods rely on accurate and complete annotation. To obtain high-quality annotation, the common wisdom is to let two annotators independently annotate a corpus, and then ask a senior annotator to adjudicate the disagreements [2]. This annotation procedure roughly requires 3 passes [3] over the same corpus and is therefore very expensive. The ACE 2005 annotation of relations was conducted in this way.

In this paper, we analyze a snapshot of the ACE training data and find that each annotator missed a significant fraction of relation mentions and annotated some spurious ones. We find that it is possible to separate most of the missing examples from the vast majority of true-negative unlabeled examples, and that, in contrast, most of the relation mentions that are adjudicated as incorrect contain useful expressions for learning a relation extractor. Based on this observation, we propose an algorithm that purifies the negative examples and applies transductive inference to utilize the missing examples during training on the single-pass annotation. Results show that an extractor trained on single-pass annotation with the proposed algorithm performs close to an extractor trained on the 3-pass annotation. We further show that the proposed algorithm trained on a single-pass annotation of the complete set of documents outperforms an extractor trained on the 3-pass annotation of 90% of the documents in the same corpus, although a single-pass annotation over the entire set costs less than half as much as 3 passes over 90% of the documents. From the perspective of learning a high-performance relation extractor, this suggests that the better way to do relation annotation is not to annotate with high-cost quality assurance, but to annotate more.

[1] http://www.itl.nist.gov/iad/mig/tests/ace/
[2] The senior annotator also found some missing examples, as shown in figure 1.
[3] In this paper, we will assume that the adjudication pass has a cost similar to each of the two first passes. The adjudicator may not have to look at as many sentences as an annotator, but he is required to review all instances found by both annotators. Moreover, he has to be more skilled and may have to spend more time on each instance to be able to resolve disagreements.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 194–203, Avignon, France, April 23 - 27 2012.
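The cost claim in the introduction can be sanity-checked with simple arithmetic. The sketch below is ours, not the authors'; it only assumes, per footnote 3, that adjudication costs about as much as one first pass, so dual annotation plus adjudication is roughly 3 passes per document.

```python
# Back-of-the-envelope annotation-cost comparison (illustrative only).
# Assumption from footnote 3: adjudication ~ one first pass, so the
# double-annotate-and-adjudicate procedure costs ~3 passes per document.

def annotation_cost(fraction_of_corpus: float, passes: float) -> float:
    """Cost in 'document-passes': corpus fraction annotated times passes over it."""
    return fraction_of_corpus * passes

single_pass_all = annotation_cost(1.00, 1)  # one pass over 100% of the corpus
three_pass_90 = annotation_cost(0.90, 3)    # three passes over 90% of the corpus

ratio = single_pass_all / three_pass_90
print(f"single-pass/3-pass cost ratio: {ratio:.2f}")  # 0.37, i.e. less than half
```

Under this assumption, annotating the whole corpus once costs about 37% of annotating 90% of it three times, which matches the paper's "less than half" claim.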
© 2012 Association for Computational Linguistics

2. Background

2.1 Supervised Relation Extraction

One of the most studied relation extraction tasks is the ACE relation extraction evaluation sponsored by the U.S. government. ACE 2005 defined 7 major entity types, such as PER (Person), LOC (Location) and ORG (Organization). A relation in ACE is defined as an ordered pair of entities appearing in the same sentence which expresses one of the predefined relations. ACE 2005 defines 7 major relation types and more than 20 subtypes. Following previous work, we ignore subtypes in this paper and only evaluate on types when reporting relation classification performance. Types include General-affiliation (GEN-AFF), Part-whole (PART-WHOLE), Person-social (PER-SOC), etc. ACE provides a large corpus which is manually annotated with entities (with coreference chains between entity mentions annotated), relations, events and values. Each mention of a relation is tagged with a pair of entity mentions appearing in the same sentence as its arguments. More details about the ACE evaluation are on the ACE official website.

Given a sentence s and two entity mentions arg1 and arg2 contained in s, a candidate relation mention r with argument arg1 preceding arg2 is defined as r = (s, arg1, arg2). The goal of Relation Detection and Classification (RDC) is to determine whether r expresses one of the defined types and, if so, to classify it into one of the types. Supervised learning treats RDC as a classification problem and solves it with supervised machine learning algorithms such as MaxEnt and SVM. There are two commonly used learning strategies (Sun et al., 2011). Given an annotated corpus, one could apply a flat learning strategy, which trains a single multi-class classifier on training examples labeled as one of the relation types or not-a-relation, and applies it during testing to determine the type, or output not-a-relation, for each candidate relation mention. The examples of each type are the relation mentions that are tagged as instances of that type, and the not-a-relation examples are constructed from pairs of entities that appear in the same sentence but are not tagged as any of the types. Alternatively, one could apply a hierarchical learning strategy, which trains two classifiers: a binary classifier RD for relation detection and a multi-class classifier RC for relation classification. RD is trained by grouping tagged relation mentions of all types as positive instances and using all the not-a-relation cases (constructed as described above) as negative examples. RC is trained on the annotated examples with their tagged types. During testing, RD is applied first to identify whether an example expresses some relation; RC is then applied to determine the most likely type only if the example is detected as correct by RD.

State-of-the-art supervised methods for relation extraction also differ from each other in data representation. Given a relation mention, feature-based methods (Miller et al., 2000; Kambhatla, 2004; Boschee et al., 2005; Grishman et al., 2005; Zhou et al., 2005; Jiang and Zhai, 2007; Sun et al., 2011) extract a rich list of structural, lexical, syntactic and semantic features to represent it; in contrast, kernel-based methods (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhao and Grishman, 2005; Zhang et al., 2006a; Zhang et al., 2006b; Zhou et al., 2007; Qian et al., 2008) represent each instance with an object such as an augmented token sequence or a parse tree, and use a carefully designed kernel function, e.g. the subsequence kernel (Bunescu and Mooney, 2005b) or the convolution tree kernel (Collins and Duffy, 2001), to calculate their similarity. These objects are usually augmented with features such as semantic features.

In this paper, we use the hierarchical learning strategy since it simplifies the problem by letting us focus on relation detection only. The relation classification stage remains unchanged, and we will show that it benefits from improved detection. For experiments on both relation detection and relation classification, we use SVM [4] (Vapnik, 1998) as the learning algorithm since it can be extended to support transductive inference, as discussed in section 4.3. However, for the analysis in section 3.2 and the purification preprocessing steps in section 4.2, we use a MaxEnt [5] model since it outputs probabilities [6] for its predictions. For the choice of features, we use the full set of features from Zhou et al. (2005) since it is reported to have state-of-the-art performance (Sun et al., 2011).

[4] SVM-Light is used. http://svmlight.joachims.org/
[5] The OpenNLP MaxEnt package is used. http://maxent.sourceforge.net/about.html
[6] SVM also outputs a value associated with each prediction; however, this value cannot be interpreted as a probability.

2.2 ACE 2005 annotation

The ACE 2005 training data contains 599 articles from newswire, broadcast news, weblogs, usenet newsgroups/discussion forums, conversational telephone speech and broadcast conversations. The annotation process is conducted as follows: two annotators working independently annotate each article and complete all annotation tasks (entities, values, relations and events). After both annotators have finished annotating a file, all discrepancies are adjudicated by a senior annotator. This results in a high-quality annotation file. More details can be found in the documentation of ACE 2005 Multilingual Training Data V3.0.

Since the final release of the ACE training corpus only contains the final adjudicated annotations, in which all traces of the two first-pass annotations are removed, we use a snapshot of almost-finished annotation, ACE 2005 Multilingual Training Data V3.0, for our analysis. In the remainder of this paper, we will call the two independent first passes of annotation fp1 and fp2. The higher-quality data produced by merging fp1 and fp2 and then having disagreements adjudicated by the senior annotator is called adj. From this corpus, we removed the files that have not been completed for all three passes. On the final corpus, consisting of 511 files, we can differentiate the annotations on which the three annotators agreed and disagreed.

A notable fact of ACE relation annotation is that it is done with arguments drawn from the list of annotated entity mentions. For example, in a relation mention tyco's ceo and president dennis kozlowski, which expresses an EMP-ORG relation, the two arguments tyco and dennis kozlowski must have been tagged as entity mentions previously by the annotator. Since fp1 and fp2 are done on all tasks independently, their disagreement on entity annotation will be propagated to relation annotation; thus we need to deal with these cases specifically.

3. Analysis of data annotation

3.1 General statistics

As discussed in section 2, relation mentions are annotated with entity mentions as arguments, and the lists of annotated entity mentions vary across fp1, fp2 and adj. To estimate the impact propagated from entity annotation, we first calculate the ratio of overlapping entity mentions between the entities annotated in fp1/fp2 and adj. We found that fp1 and fp2 each agree with adj on around 89% of the entity mentions. Following up, we checked the relation mentions [7] from fp1 and fp2 against the adjudicated list of entity mentions from adj and found that 682 and 665 relation mentions, respectively, have at least one argument which does not appear in the list of adjudicated entity mentions.

Given the list of relation mentions with both arguments appearing in the list of adjudicated entity mentions, figure 1 shows the inter-annotator agreement of the ACE 2005 relation annotation. In this figure, the three circles represent the lists of relation mentions in fp1, fp2 and adj, respectively.

[Figure 1. Inter-annotator agreement of ACE 2005 relation annotation: a Venn diagram over fp1, fp2 and adj. Region counts as recoverable from the layout: fp1 only 645, fp2 only 538, fp1 and fp2 but not adj 47, fp1 and adj only 1486, fp2 and adj only 1525, all three 3065, adj only 383. Numbers are the distinct relation mentions whose arguments are both in the list of adjudicated entity mentions.]

It shows that each annotator missed a significant number of relation mentions annotated by the other. Considering that we removed 682/665 relation mentions from fp1/fp2 because we generated this figure based on the list of adjudicated entity mentions, we estimate that fp1 and fp2 each missed around 18.3-28.5% [8] of the relation mentions. This clearly shows that both annotators missed a significant fraction of the relation mentions. They also annotated some spurious relation mentions (as adjudicated in adj), although the fraction is smaller (close to 10% of all relation mentions in adj).

[7] This is done by selecting the relation mentions whose arguments are both in the list of adjudicated entity mentions.
[8] We calculate the lower bound by assuming that the 682 relation mentions removed from fp1 are found in fp2, although with different argument boundaries and headwords tagged. The upper bound is calculated by assuming that they are all irrelevant and erroneous relation mentions.

The ACE 2005 relation annotation guidelines (ACE English Annotation Guidelines for Relations, version 5.8.3) define 7 syntactic classes and the other class. We plot the distribution of syntactic classes of the annotated relations in figure 2 (3 of the classes, accounting together for less than 10% of the cases, are omitted) and the other class.
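The candidate generation and the hierarchical detection-classification scheme of section 2.1 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dictionaries and the lambda "classifiers" are stand-ins for the paper's MaxEnt/SVM models over Zhou et al. (2005) features.

```python
from itertools import combinations

# Sketch of section 2.1: build candidate relation mentions r = (s, arg1, arg2)
# with arg1 preceding arg2, label them from the annotation (or not-a-relation),
# and run the hierarchical RD -> RC scheme. Mentions and "classifiers" are toys.

def candidates(sentence, mentions):
    """All ordered pairs r = (s, arg1, arg2) with arg1 preceding arg2."""
    ordered = sorted(mentions, key=lambda m: m["start"])
    return [(sentence, a, b) for a, b in combinations(ordered, 2)]

def label_candidates(cands, annotated):
    """Tag each candidate with its annotated type, or not-a-relation."""
    gold = {(a["id"], b["id"]): t for a, b, t in annotated}
    return [(r, gold.get((r[1]["id"], r[2]["id"]), "not-a-relation")) for r in cands]

def rdc(r, detect, classify):
    """Hierarchical scheme: the classifier RC is consulted only if RD fires."""
    return classify(r) if detect(r) else "not-a-relation"

sent = "tyco's ceo and president dennis kozlowski"
ments = [{"id": "m1", "start": 0}, {"id": "m2", "start": 25}]
anno = [(ments[0], ments[1], "EMP-ORG")]

labeled = label_candidates(candidates(sent, ments), anno)
print(labeled[0][1])  # EMP-ORG
```

With only one annotated pair, the single candidate is labeled EMP-ORG; every untagged ordered pair would become a not-a-relation training example, which is exactly the source of the noise studied in section 3.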
generally easier for the annotators to find and To illustrate how distinguishable the missing agree on relation mentions of the type examples (false negatives) are from the true Preposition/PreMod/Possessives but harder to negative ones, 1) we apply the MaxEnt model on find and agree on the ones belonging to Verbal both false negatives and true negatives, 2) put and Other. The definition and examples of these them together and rank them by the model- syntactic classes can be found in the annotation predicted probabilities of being positive, 3) guidelines. calculate their relative rank in this pool. We plot In the following sections, we will show the the Cumulative distribution of frequency (CDF) analysis on fp1 and adj since the result is similar of the ranks (as percentages in the mixed pools) for fp2. of false negatives in figure 3. We took similar steps for the spurious ones (false positives) and plot them in figure 3 as well (However, they are ranked by model-predicted probabilities of being negative). Figure 2. Percentage of examples of major syntactic classes. Figure 3: cumulative distribution of frequency (CDF) of the relative ranking of model-predicted probability of being 3.2 Why the differences? positive for false negatives in a pool mixed of false negatives and true negatives; and the CDF of the relative To understand what causes the missing ranking of model-predicted probability of being negative for annotations and the spurious ones, we need false positives in a pool mixed of false positives and true methods to find how similar/different the false positives. positives are to true positives and also how For false negatives, it shows a highly skewed similar/different the false negatives (missing distribution in which around 75% of the false annotations) are to true negatives. If we adopt a negatives are ranked within the top 10%. 
That good similarity metric, which captures the means the missing examples are lexically, structural, lexical and semantic similarity structurally or semantically similar to correct between relation mentions, this analysis will help examples, and are distinguishable from the true us to understand the similarity/difference from an negative examples. However, the distribution of extraction perspective. false positives (spurious examples) is close to We use a state-of-the-art feature space (Zhou uniform (flat curve), which means they are et al., 2005) to represent examples (including all generally indistinguishable from the correct correct examples, erroneous ones and untagged examples. examples) and use MaxEnt as the weight learning model since it shows competitive 3.3 Categorize annotation errors performance in relation extraction (Jiang and The automatic method shows that the errors Zhai, 2007) and outputs probabilities associated (spurious annotations) are very similar to the with each prediction. We train a MaxEnt model correct examples but provides little clue as to for relation detection on true positives and true why that is the case. To understand their causes, negatives, which respectively are the subset of we sampled 65 examples from fp1 (10% of the correct examples annotated by fp1 (and 645 errors), read the sentences containing these adjudicated as correct ones) and negative 197 Example Category Percentage Relation Notes (examples are similar Sampled text of spurious examples in fp1 Type ones in adj for comparison) Duplicate relation … his budding friendship … his budding friendship with US President mention for 49.2% ORG-AFF with US President George George W. Bush in the face of … coreferential W. 
Bush in the face of … entity mentions Hundreds of thousands of demonstrators took to PHYS the streets in Britain… (Symmetric relation) Correct 20% The dead included the quack doctor, 55-year-old The dead included the quack PER-SOC Nityalila Naotia, his teenaged son and… doctor, 55-year-old Nityalila Naotia, his teenaged son Putin had even secretly invited British Prime Argument not 15.4% PER-SOC Minister Tony Blair, Bush's staunchest backer in list in the war on Iraq… "The amazing thing is they are going to turn Violate San Francisco into ground zero for every criminal reasonable 6.2% PHYS who wants to profit at their chosen profession", reader rule Paredes said. PART- …a likely candidate to run Vivendi Universal's Arguments are tagged WHOLE entertainment unit in the United States… reversed Errors 6.1% PART- Khakamada argued that the United WHOLE States would also need Russia's help "to make the Relation type error new Iraqi government seem legitimate. illegal Up to 20,000 protesters promotion PHYS Up to 20,000 protesters thronged the plazas and thronged the plazas and through 3% streets of San Francisco, where… streets of San Francisco, “blocked” where… categories Table 1. Categories of spurious relation mentions in fp1 (on a sample of 10% of relation mentions), ranked by the percentage of the examples in each category. In the sample text, red text (also marked with dotted underlines) shows head words of the first arguments and the underlined text shows head words of the second arguments. erroneous relation mentions and compared them mistake. The third largest category is argument to the correct relation mentions in the same not in list, by which we mean that at least one of sentence; we categorized these examples and the arguments is not in the list of adjudicated show them in table 1. The most common type of entity mentions. error is duplicate relation mention for Based on Table 1, we can see that as many as coreferential entity mentions. 
The first row in 72%-88% of the examples which are adjudicated table 1 shows an example, in which there is a as incorrect are actually correct if viewed from a relation ORG-AFF tagged between US and relation learning perspective, since most of them George W. Bush in adj. Because President and contain informative expressions for tagging George W. Bush are coreferential, the example relations. The annotation guideline is designed <US, President > from fp1 is adjudicated as to ensure high quality while not imposing too incorrect. This shows that if a relation is much burden on human annotators. To reduce expressed repeatedly across relation mentions annotation effort, it defined rules such as illegal whose arguments are coreferential, the promotion through “blocked” categories. The adjudicator only tags one of the relation mentions annotators’ practice suggests that they are as correct, although the other is correct too. This following another rule not to annotate duplicate shared the same principle with another type of relation mention for coreferential entity error illegal promotion through “blocked” mentions. This follows the similar principle of categories 9 as defined in the annotation reducing annotation effort but is not explicitly guideline. The second largest category is correct, stated in the guideline: to avoid propagation of a by which we mean the example is a correct relation through a coreference chain. However, relation mention and the adjudicator made a these examples are useful for learning more ways to express a relation. Moreover, even for the 9 erroneous examples (as shown in table 1 as For example, in sentence Smith went to a hotel in Brazil, (Smith, hotel) is a taggable PHYS Relation but (Smith, violate reasonable reader rule and errors), most Brazil) is not, because to get the second relationship, one of them have some level of similar structures or would have to “promote” Brazil through hotel. For the semantics to the targeted relation. 
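The pooling-and-ranking analysis of section 3.2 can be sketched as below. This is our illustrative reconstruction: the scores are made-up stand-ins for the MaxEnt model's predicted probabilities, and the toy pool is sized so the headline statistic (share of false negatives in the top 10%) is easy to check by hand.

```python
# Sketch of section 3.2's analysis: pool false negatives with true negatives,
# rank by model-predicted P(positive), and measure what fraction of the false
# negatives lands in the top 10% of the mixed pool. Scores are made up.

def top_decile_fraction(false_neg_scores, true_neg_scores):
    pool = [(s, True) for s in false_neg_scores] + [(s, False) for s in true_neg_scores]
    pool.sort(key=lambda x: x[0], reverse=True)  # highest P(positive) first
    cutoff = max(1, len(pool) // 10)             # top 10% of the mixed pool
    hits = sum(is_fn for _, is_fn in pool[:cutoff])
    return hits / len(false_neg_scores)

# Toy pool: 4 missing examples, 36 true negatives with low scores.
fn = [0.95, 0.91, 0.88, 0.10]
tn = [i / 100 for i in range(36)]
print(top_decile_fraction(fn, tn))  # 0.75
```

A skewed result (here 3 of 4 false negatives in the top decile) corresponds to the paper's finding of "around 75% within the top 10%"; a flat curve, as observed for false positives, would put only about 10% of them there.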
3.4 Why are annotations missing, and how many examples are missing?

For the large number of missing annotations, there are a couple of possible reasons. One reason is that it is generally easier for a human annotator to annotate correctly given a well-defined guideline, but it is hard to ensure completeness, especially for a task like relation extraction. Furthermore, the ACE 2005 annotation guideline defines more than 20 relation subtypes. These many subtypes make it hard for an annotator to keep all of them in mind while annotating, and thus it is inevitable that some examples are missed.

Here we proceed to approximate the number of missing examples given our limited knowledge. Let each annotator annotate n examples, and assume that each pair of annotators agrees on a certain fraction p of the examples. Assuming the examples are equally likely to be found by an annotator, the total number of unique examples found by k annotators is Σ_{i=0..k} (1 − p)^i · n. If we had an infinite number of annotators (k → ∞), the total number of unique examples would be n/p, which is the upper bound on the total number of examples. In the case of the ACE 2005 relation mention annotation, since the two annotators each annotate around 4500 examples and they agree on 2/3 of them, the total number of all examples is around 6750. This is close to the number of relation mentions in the adjudicated list: 6459. Here we assume the adjudicator is doing a more complex task than an annotator, resolving the disagreements and completing the annotation (as shown in figure 1).

The assumptions of this calculation are a little crude, but reasonable given the limited number of annotation passes we have. Recent research (Ji et al., 2010) shows that, by adding annotators for IE tasks, the merged annotation tends to converge after 5 annotators. To understand the annotation behavior better, in particular whether annotation will converge after adding a few annotators, more passes of annotation need to be collected. We leave this as future work.

4. Relation extraction with low-cost annotation

4.1 Baseline algorithm

To see whether a single-pass annotation is useful for relation detection and classification, we did 5-fold cross validation (5-fold CV) with each of fp1, fp2 and adj as the training set, testing on adj. The experiments are done with the same 511 documents we used for the analysis. As shown in table 2, we did 5-fold CV on adj for experiment 3. For fairness, we use settings similar to 5-fold CV for experiments 1 and 2. Take experiment 1 as an example: we split both fp1 and adj into 5 folds, use 4 folds from fp1 as training data and 1 fold from adj as testing data, and perform one train-test cycle. We rotate the folds (both training and testing) and repeat 5 times. The final results are averaged over the 5 runs. Experiment 2 was conducted similarly. In the remainder of the paper, 5-fold CV experiments are all conducted in this way.

  Exp #  Training  Testing  Detection P/R/F1 (%)  Classification P/R/F1 (%)
  1      fp1       adj      83.4 / 60.4 / 70.0    75.7 / 54.8 / 63.6
  2      fp2       adj      83.5 / 60.5 / 70.2    76.0 / 55.1 / 63.9
  3      adj       adj      80.4 / 69.7 / 74.6    73.4 / 63.6 / 68.2

  Table 2. Performance of RDC trained on fp1/fp2/adj, and tested on adj.

Table 2 shows that a relation tagger trained on the single-pass annotated data fp1 performs worse than one trained on the merged and adjudicated data adj, with an F measure 4.6 points lower in relation detection and 4.6 points lower in relation classification. For detection, precision on fp1 is 3 points higher than on adj, but recall is much lower (close to 10 points). The recall difference shows that the missing annotations contain expressions that could help to find more correct examples during testing. The small precision difference indirectly shows that the spurious examples in fp1 (as adjudicated) do not hurt precision. Performance on classification shows a similar trend, because the relation classifier takes the examples predicted correct by the detector as its input; therefore, if there is an error, it gets propagated to this stage. Table 2 also shows similar performance differences between fp2 and adj.

In the remainder of this paper, we will discuss a few algorithms to improve a relation tagger trained on single-pass annotated data [10]. Since we already showed that most of the spurious annotations are not actually errors from an extraction perspective, and table 2 shows that they do not hurt precision, we will only focus on utilizing the missing examples, in other words, on training with an incomplete annotation.

[10] We only use fp1 and adj in the following experiments because we observed in the analysis that fp1 and fp2 are similar in general, though a fraction of the annotation in fp1 and fp2 is different. Moreover, algorithms trained on them show similar performance.

4.2 Purifying the set of negative examples

As discussed in section 2, traditional supervised methods find all pairs of entity mentions that appear within a sentence, and then use the pairs that are not annotated as relation mentions as the negative examples for training a relation detector. This relies on the assumption that the annotators annotated all relation mentions and missed no (or very few) examples. However, this is not true for training on a single-pass annotation, in which a significant portion of relation mentions are left unannotated. If this scheme is applied, all of the correct pairs which the annotators missed fall into this "negative" category. Therefore, we need a way to purify the "negative" set of examples obtained by this conventional approach.

Li and Liu (2003) focus on classifying documents with only positive examples. Their algorithm initially sets all unlabeled data to be negative and trains a Rocchio classifier, selects the negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples, and then retrains the model. Their algorithm performs well for text classification. It is based on the assumption that there are fewer unannotated positive examples than negative ones in the unlabeled set, so true negative examples still dominate the set of noisy "negative" examples in the purification step. Based on the same assumption, our purification process consists of the following steps:

1) Use annotated relation mentions as positive examples; construct all possible relation mentions that are not annotated, and initially set them to be negative. We call this noisy data set D.
2) Train a MaxEnt relation detection model Mdet on D.
3) Apply Mdet to all unannotated examples, and rank them by the model-predicted probabilities of being positive.
4) Remove the top N examples from D.

These preprocessing steps result in a purified data set D_pure. We can use D_pure for the normal training process of a supervised relation extraction algorithm.

The algorithm is similar to Li and Liu (2003). However, we drop a few noisy examples instead of choosing a small purified subset, since we have relatively few false negatives compared to the entire set of unannotated examples. Moreover, after step 3, most false negatives are clustered within the small region of top-ranked examples which have a high model-predicted probability of being positive. The intuition is similar to what we observed in figure 3 for false negatives, since we also observed a very similar distribution using the model trained with the noisy data. Therefore, we can purify the negatives by removing the examples in this noisy subset.

However, the false negatives are still mixed with true negatives. For example, slightly more than half of the top 2000 examples are still true negatives. Thus we cannot simply flip their labels and use them as positive examples. In the following section, we will use them in the form of unlabeled examples to help train a better model.

4.3 Transductive inference on unlabeled examples

Transductive SVM (Vapnik, 1998; Joachims, 1999) is a semi-supervised learning method which learns a model from a data set consisting of both labeled and unlabeled examples. Compared to its popular antecedent SVM, it also learns a maximum-margin classification hyperplane, but additionally forces it to separate a set of unlabeled data with a large margin. The optimization function of Transductive SVM (TSVM) is the following:

[Figure 4. TSVM optimization function for the non-separable case (Joachims, 1999); the formula does not survive in this copy.]

TSVM can leverage an unlabeled set of examples to improve supervised learning. As shown in section 3, a significant number of relation mentions are missing from the single-pass annotation. Although it is not possible to find all missing annotations without human effort, we can improve the model by further utilizing the fact that some unannotated examples should have been annotated.
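Since the formula in Figure 4 is lost in this copy, the standard TSVM formulation for the non-separable case is reproduced below for reference. It is quoted from the cited work (Joachims, 1999), not recovered from the figure, so the notation (labeled examples (x_i, y_i), unlabeled examples x*_j with inferred labels y*_j, and cost factors C, C*) follows Joachims rather than this paper.

```latex
% TSVM optimization problem, non-separable case (Joachims, 1999):
\begin{aligned}
\min_{y_1^*,\ldots,y_k^*,\;\mathbf{w},\,b,\,\xi,\,\xi^*} \quad
  & \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2}
    + C\sum_{i=1}^{n}\xi_i
    + C^{*}\sum_{j=1}^{k}\xi_j^{*} \\
\text{s.t.} \quad
  & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0,\; i = 1,\ldots,n \\
  & y_j^{*}\,(\mathbf{w}\cdot\mathbf{x}_j^{*} + b) \ge 1 - \xi_j^{*},
    \qquad \xi_j^{*} \ge 0,\; j = 1,\ldots,k
\end{aligned}
```

The second slack term is what forces the hyperplane to separate the unlabeled examples with a large margin while jointly choosing their labels y*_j.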
It is based on the assumption that learns a maximum margin classification there are fewer unannotated positive examples hyperplane, but additionally forces it to separate than negative ones in the unlabeled set, so true a set of unlabeled data with large margin. The negative examples still dominate the set of noisy optimization function of Transductive SVM “negative” examples in the purification step. (TSVM) is the following: Based on the same assumption, our purification process consists of the following steps: 1) Use annotated relation mentions as positive examples; construct all possible relation mentions that are not annotated, and initially set them to be negative. We call this noisy data set D. 2) Train a MaxEnt relation detection model Mdet on D. Figure 4. TSVM optimization function for non-separable 3) Apply Mdet on all unannotated case (Joachims, 1999) examples, and rank them by the model- TSVM can leverage an unlabeled set of predicted probabilities of being positive, examples to improve supervised learning. As 4) Remove the top N examples from D. shown in section 3, a significant number of These preprocessing steps result in a purified relation mentions are missing from the single- data set 𝐷𝑝𝑢𝑟𝑒 . We can use 𝐷𝑝𝑢𝑟𝑒 for the normal pass annotation data. Although it is not possible to find all missing annotations without human and fp2 is different. Moreover, algorithms trained on them effort, we can improve the model by further show similar performance. 200 utilizing the fact that some unannotated examples +tSVM: First, the same purification process of should have been annotated. +purify is applied. Then we follow the steps The purification process discussed in the described in section 4.3 to construct the set of previous section removes N examples which unlabeled examples, and set all the rest of have a high density of false negatives. We further purified negative examples to be negative. 
TSVM can leverage an unlabeled set of examples to improve supervised learning. As shown in Section 3, a significant number of relation mentions are missing from the single-pass annotation data. Although it is not possible to find all of the missing annotations without human effort, we can improve the model by exploiting the fact that some unannotated examples should have been annotated. The purification process discussed in the previous section removes the N examples which have a high density of false negatives. We further utilize these N examples as follows:

1) Construct a training corpus D_hybrid from D_pure by taking a random sample[11] of N*(1-p)/p negatively labeled examples in D_pure (p is the ratio of annotated examples to all examples; p = 0.05 in fp1) and setting them to be unlabeled. In addition, the N examples removed by the purification process are added back as unlabeled examples.
2) Train TSVM on D_hybrid.

The second step trains a model which replaces the detection model in the hierarchical detection-classification learning scheme we use. We will show in the next section that this improves the model.

5. Experiments

Experiments were conducted over the same set of documents on which we did our analysis: the 511 documents which have completed annotation in all of fp1, fp2 and adj from the ACE 2005 Multilingual Training Data V3.0. To reemphasize, we apply the hierarchical learning scheme and focus on improving relation detection while keeping relation classification unchanged (results show that its performance improves because of the improved detection). We use SVM as our learning algorithm, with the full feature set from Zhou et al. (2005).

Baseline algorithm: The relation detector is unchanged. We follow the common practice, which is to use the annotated examples as positive examples and all possible untagged relation mentions as negative examples. We sub-sampled the negative data by 1/2 since that shows better performance.

+purify: This algorithm adds the purification preprocessing step (Section 4.2) before the hierarchical learning RDC algorithm. After purification, the RDC algorithm is trained on the positive examples and the purified negative examples. We set N = 2000[12] in all experiments.

+tSVM: First, the same purification process as in +purify is applied. Then we follow the steps described in Section 4.3 to construct the set of unlabeled examples, and set all of the remaining purified negative examples to be negative. Finally, we train TSVM on both the labeled and unlabeled data and replace the relation detection component of the RDC algorithm. The relation classification is unchanged.
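The data construction behind +purify and +tSVM can be sketched as follows. This is a minimal illustration rather than the authors' code: the example objects, the `detector_prob` scoring function (standing in for the MaxEnt detector of Section 4.2), and all helper names are hypothetical.

```python
import random

def purify(noisy_negatives, detector_prob, n_remove):
    """Section 4.2: rank the noisy negatives by the detector's predicted
    probability of being a positive relation mention; the top n_remove,
    which are dense in false negatives, are removed from training."""
    ranked = sorted(noisy_negatives, key=detector_prob, reverse=True)
    return ranked[n_remove:], ranked[:n_remove]   # (purified, removed)

def build_hybrid(purified, removed, p, rng):
    """Section 4.3: the removed examples, plus a random sample of
    N*(1-p)/p purified negatives, become the unlabeled set for TSVM."""
    n_extra = int(len(removed) * (1 - p) / p)
    extra = set(rng.sample(purified, min(n_extra, len(purified))))
    unlabeled = removed + [e for e in purified if e in extra]
    labeled_neg = [e for e in purified if e not in extra]
    return labeled_neg, unlabeled
```

With N = 2000 and p = 0.05 (as in fp1), the sampled portion of the unlabeled set would contain 2000 * 0.95 / 0.05 = 38,000 examples, matching the balance argument in footnote 11.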
Table 3 shows the results. All experiments are done with 5-fold cross validation[13] using test data from adj. The first three rows show experiments trained on fp1, and the last row (ADJ) shows the unmodified RDC algorithm trained on adj for comparison. The purification of negative examples yields a significant performance gain: 3.7% F1 on relation detection and 3.4% on relation classification. Precision decreases but recall increases substantially, since the missing examples are no longer treated as negatives. Experiments show that the purification process removes more than 60% of the false negatives. Transductive SVM further improves performance by a relatively small margin. This shows that the latent positive examples can help refine the model. The results also show that transductive inference can find around 17% of the missing relation mentions. We notice that the performance of relation classification is also improved: by improving relation detection, some examples that do not express a relation are removed. The classification performance on the single-pass annotation is close to that of the model trained on adj, thanks to the better relation detector trained with our algorithm.

We also did 5-fold cross validation with a model trained on a fraction of the 4/5 (4 folds) of the adj data (each experiment shown in Table 4 uses 4 folds of adj documents for training, since one fold is left out for cross validation). The documents are sampled randomly. Table 4 shows the results for varying training data sizes. Compared to the results shown in the "+tSVM" row of Table 3, we can see that our best model trained on single-pass annotation outperforms SVM trained on 90% of the dual-pass, adjudicated data in both relation detection and classification, although it costs less than half of the 3-pass annotation. This suggests that, given the same amount of human effort for relation annotation, annotating more documents with a single pass offers advantages over annotating less data with high quality assurance (dual passes and adjudication).

                    Detection (%)              Classification (%)
Algorithm       Precision  Recall  F1      Precision  Recall  F1
Baseline          83.4      60.4   70.0      75.7      54.8   63.6
+purify           76.8      70.9   73.7      69.8      64.5   67.0
+tSVM             76.4      72.1   74.2      69.4      65.2   67.2
ADJ (on adj)      80.4      69.7   74.6      73.4      63.6   68.2

Table 3. 5-fold cross-validation results. All models are trained on fp1 (except the last row, which shows the unchanged algorithm trained on adj for comparison) and tested on adj. McNemar's test shows that the improvements from +purify to +tSVM and from +tSVM to ADJ are statistically significant (p < 0.05).

Percentage of       Detection (%)              Classification (%)
adj used        Precision  Recall  F1      Precision  Recall  F1
60% x 4/5         86.9      41.2   55.8      78.6      37.2   50.5
70% x 4/5         85.5      51.3   64.1      77.7      46.6   58.2
80% x 4/5         83.3      58.1   68.4      75.8      52.9   62.3
90% x 4/5         82.0      64.9   72.5      74.9      59.4   66.2

Table 4. Performance of SVM trained on a fraction of adj (5-fold cross-validation results).

[footnote, truncated] … and fp2 is different. Moreover, algorithms trained on them show similar performance.
[11] We included this large random sample so that the balance of positive to negative examples in the unlabeled set would be similar to that of the labeled data. The test data is not included in the unlabeled set.
[12] We choose 2000 because it is close to the number of relations missed in each single-pass annotation. In practice, it contains more than 70% of the false negatives, and it is less than 10% of the unannotated examples. To estimate how many examples are missing (Section 3.4), one should perform multiple passes of independent annotation on a small dataset and measure inter-annotator agreement.
[13] Details of the settings for 5-fold cross validation are in Section 4.1.
6. Related work

Dligach et al. (2010) studied WSD annotation from a cost-effectiveness viewpoint. They showed empirically that, for the same amount of annotation dollars spent, single annotation is better than dual annotation plus adjudication. The common practice for quality control of WSD annotation is similar to that for relation annotation. However, the task of WSD annotation is very different from relation annotation: WSD requires that every example be assigned some tag, whereas that is not required for relation tagging. Moreover, relation tagging requires identifying two arguments and correctly categorizing their types.

The purification approach applied in this paper is related to the general framework of learning from positive and unlabeled examples. Li and Liu (2003) initially set all unlabeled data to be negative and train a Rocchio classifier, then select the negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples. We share a similar assumption with Li and Liu (2003), but we use a different method to select negative examples, since the false negative examples show a very skewed distribution, as described in Section 5.2.

Transductive SVM was introduced by Vapnik (1998) and later refined by Joachims (1999). A few related methods were studied on the subtask of relation classification (the second stage of the hierarchical learning scheme) by Zhang (2005).

Chan and Roth (2011) observed the similar phenomenon that ACE annotators rarely duplicate a relation link for coreferential mentions. They use an evaluation scheme that avoids penalizing systems for the relation mentions which are not annotated because of this behavior.

7. Conclusion

We analyzed a snapshot of the ACE 2005 relation annotation and found that each single-pass annotation missed around 18-28% of the relation mentions and contained around 10% spurious mentions. A detailed analysis showed that it is possible to find some of the false negatives, and that most spurious cases are actually correct examples from a system builder's perspective. By automatically purifying the negative examples and applying transductive inference on the suspicious examples, we can train a relation classifier whose performance is comparable to a classifier trained on the dual-annotated and adjudicated data. Furthermore, we showed that single-pass annotation is more cost-effective than annotation with high quality assurance.

Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.

References

ACE. http://www.itl.nist.gov/iad/mig/tests/ace/

ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, version 5.8.3. 2005. http://projects.ldc.upenn.edu/ace/.

ACE 2005 Multilingual Training Data V3.0. 2005. LDC2005E18. LDC Catalog.

Elizabeth Boschee, Ralph Weischedel, and Alex Zamanian. 2005. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis.

Razvan C. Bunescu and Raymond J. Mooney. 2005a. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.

Razvan C. Bunescu and Raymond J. Mooney. 2005b. Subsequence kernels for relation extraction. In Proceedings of NIPS-2005.

Yee Seng Chan and Dan Roth. 2011. Exploiting Syntactico-Semantic Structures for Relation Extraction. In Proceedings of ACL-2011.

Michael Collins and Nigel Duffy. 2001. Convolution Kernels for Natural Language. In Proceedings of NIPS-2001.

Dmitriy Dligach, Rodney D. Nielsen, and Martha Palmer. 2010. To annotate more accurately or to annotate more. In Proceedings of the Fourth Linguistic Annotation Workshop at ACL 2010.

Ralph Grishman, David Westbrook, and Adam Meyers. 2005. NYU's English ACE 2005 System Description. In Proceedings of the ACE 2005 Evaluation Workshop.

Heng Ji, Ralph Grishman, Hoa Trang Dang, and Kira Griffitt. 2010. An Overview of the TAC2010 Knowledge Base Population Track. In Proceedings of TAC-2010.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In Proceedings of HLT-NAACL-2007.

Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of ICML-1999.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of ACL-2004.

Xiao-Li Li and Bing Liu. 2003. Learning to classify text using positive and unlabeled data. In Proceedings of IJCAI-2003.

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In Proceedings of NAACL-2000.

Longhua Qian, Guodong Zhou, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of COLING-2008.

Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised Relation Extraction with Large-scale Word Clustering. In Proceedings of ACL-2011.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. John Wiley.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research.

Min Zhang, Jie Zhang, and Jian Su. 2006a. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of HLT-NAACL-2006.

Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. 2006b. A composite kernel to extract relations between entities with both flat and structured features. In Proceedings of COLING-ACL-2006.

Zhu Zhang. 2005. Mining Inter-Entity Semantic Relations Using Improved Transductive Learning. In Proceedings of IJCNLP-2005.

Shubin Zhao and Ralph Grishman. 2005. Extracting Relations with Integrated Information Using Kernel Methods. In Proceedings of ACL-2005.

Guodong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL-2005.

Guodong Zhou, Min Zhang, DongHong Ji, and QiaoMing Zhu. 2007. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of EMNLP/CoNLL-2007.

Incorporating Lexical Priors into Topic Models

Jagadeesh Jagarlamudi        Hal Daumé III                Raghavendra Udupa
University of Maryland       University of Maryland       Microsoft Research
College Park, USA            College Park, USA            Bangalore, India

[email protected] [email protected] [email protected]

Abstract

Topic models have great potential for helping users understand document corpora. This potential is stymied by their purely unsupervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks (Chang et al., 2009). We propose a simple and effective way to guide topic models to learn topics of specific interest to a user. We achieve this by providing sets of seed words that a user believes are representative of the underlying topics in a corpus. Our model uses these seeds to improve both topic-word distributions (by biasing topics to produce appropriate seed words) and document-topic distributions (by biasing documents to select topics related to the seed words they contain). Extrinsic evaluation on a document clustering task reveals a significant improvement when using seed information, even over other models that use seed information naïvely.

1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have emerged as a powerful tool for analyzing document collections in an unsupervised fashion. When fit to a document collection, topic models implicitly use document-level co-occurrence information to group semantically related words into a single topic. Since the objective of these models is to maximize the probability of the observed data, they have a tendency to explain only the most obvious and superficial aspects of a corpus. They effectively sacrifice performance on rare topics to do a better job of modeling frequently occurring words. The user is then left with a skewed impression of the corpus, and perhaps one that does not perform well in extrinsic tasks.

To illustrate this problem, we ran LDA on the five most frequent categories of the Reuters-21578 (Lewis et al., 2004) text corpus. This document distribution is very skewed: more than half of the collection belongs to the most frequent category ("Earn"). The five topics identified by LDA are shown in Table 1. A brief inspection of the topics reveals that LDA has roughly allocated topics 1 & 2 to the most frequent class ("Earn") and one topic each to the next two most frequent classes ("Acquisition" and "Forex"), and has merged the two least frequent classes ("Crude" and "Grain") into a single topic. The red colored words in topic 5 correspond to the "Crude" class and the blue words are from the "Grain" class.

This leads to a situation where the topics identified by LDA are not in accordance with the underlying topical structure of the corpus. This is a problem not just with LDA: it is potentially a problem with any extension thereof that has focused on improving the semantic coherence of the words in each topic (Griffiths et al., 2005; Wallach, 2005; Griffiths et al., 2007), the document-topic distributions (Blei and McAuliffe, 2008; Lacoste-Julien et al., 2008) or other aspects (Blei and Lafferty, 2009).

We address this problem by providing some additional information to the model. Along with the document collection, a user may provide a higher-level view of the collection. For instance, as discussed in Section 4.4, when run on historical NIPS papers, LDA fails to find topics related to Brain Imaging, Cognitive Science or Hardware, even though we know from the call for papers that such topics should exist in the corpus. By allowing the user to provide some seed words related to these underrepresented topics, we encourage the model to find evidence of these topics in the data. Importantly, we only encourage the model to follow the seed sets; we do not force it. If the model has compelling evidence in the data to overcome the seed information, it still has the freedom to do so. Our seeding approach, in combination with interactive topic modeling (Hu et al., 2011), allows a user both to explore a corpus and to guide the exploration towards the distinctions that he or she finds more interesting.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

mln, dlrs, billion, year, pct, company, share, april, record, cts, quarter, march, earnings, stg, first, pay
mln, NUM, cts, loss, net, dlrs, shr, profit, revs, year, note, oper, avg, shrs, sales, includes
lt, company, shares, corp, dlrs, stock, offer, group, share, common, board, acquisition, shareholders
bank, market, dollar, pct, exchange, foreign, trade, rate, banks, japan, yen, government, rates, today
oil, tonnes, prices, mln, wheat, production, pct, gas, year, grain, crude, price, corn, dlrs, bpd, opec

Table 1: Topics identified by LDA on the frequent-5 categories of the Reuters corpus. The categories are Earn, Acquisition, Forex, Grain and Crude (in order of document frequency).

1  company, billion, quarter, shrs, earnings
2  acquisition, procurement, merge
3  exchange, currency, trading, rate, euro
4  grain, wheat, corn, oilseed, oil
5  natural, gas, oil, fuel, products, petrol

Table 2: An example of sets of seed words (seed topics) for the frequent-5 categories of the Reuters-21578 categorization corpus. We use them as a running example in the rest of the paper.

We build a model that uses the seed words in two ways: to improve both topic-word and document-topic probability distributions. For ease of exposition, we present these ideas separately and then in combination (Section 2.3). To improve topic-word distributions, we set up a model in which each topic prefers to generate words that are related to the words in a seed set (Section 2.1). To improve document-topic distributions, we encourage the model to select document-level topics based on the existence of input seed words in that document (Section 2.2).

2 Incorporating Seeds

Our approach to allowing a user to guide the topic discovery process is to let him provide seed information at the level of word types. Namely, the user provides sets of seed words that are representative of the corpus. Table 2 shows an example of seed sets one might use for the Reuters corpus. This kind of supervision is similar to seeding in the bootstrapping literature (Thelen and Riloff, 2002) and to prototype-based learning (Haghighi and Klein, 2006). Our reliance on seed sets is orthogonal to existing approaches that use external knowledge, which operate at the level of documents (Blei and McAuliffe, 2008), tokens (Andrzejewski and Zhu, 2009) or pair-wise constraints (Andrzejewski et al., 2009).

Before moving on to the details of our models, we briefly recall the generative story of the LDA model; the reader is encouraged to refer to (Blei et al., 2003) for further details.

1. For each topic k = 1 · · · T,
   • choose φk ∼ Dir(β).
2. For each document d, choose θd ∼ Dir(α).
   • For each token i = 1 · · · Nd:
     (a) Select a topic zi ∼ Mult(θd).
     (b) Select a word wi ∼ Mult(φzi).

where T is the number of topics, α and β are hyperparameters of the model, and φk and θd are the topic-word and document-topic multinomial probability distributions respectively.
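The LDA generative story above can be simulated directly. The following sketch is illustrative only (all function names are our own); the Dirichlet draw is built from normalized Gamma samples so that only the standard library is needed:

```python
import random

def dirichlet(alpha, k, rng):
    """Symmetric Dirichlet draw via normalized Gamma samples."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def generate_document(phi, n_tokens, alpha, rng):
    """One document of the LDA story: theta_d ~ Dir(alpha), then for
    each token z_i ~ Mult(theta_d) and w_i ~ Mult(phi[z_i])."""
    T = len(phi)
    theta = dirichlet(alpha, T, rng)
    tokens = []
    for _ in range(n_tokens):
        z = rng.choices(range(T), weights=theta)[0]
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        tokens.append((z, w))
    return tokens

rng = random.Random(0)
T, V = 3, 8                                        # topics, vocabulary size
phi = [dirichlet(0.01, V, rng) for _ in range(T)]  # phi_k ~ Dir(beta)
doc = generate_document(phi, 12, 1.0, rng)
```

A small β concentrates each φk on a few words, which is why fitted topics tend to be dominated by a handful of high-probability terms, as in Table 1.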
2.1 Word-Topic Distributions (Model 1)

In regular topic models, each topic k is defined by a multinomial distribution φk over words. We extend this notion and instead define a topic as a mixture of two multinomial distributions: a "seed topic" distribution and a "regular topic" distribution. The seed topic distribution is constrained to generate only words from the corresponding seed set. The regular topic distribution may generate any word (including seed words). For example, seed topic 4 (in Table 2) can only generate the five words in its set. The word "oil" can be generated by seed topics 4 and 5, as well as by any regular topic. We want to emphasize that, like any regular topic, each seed topic is a non-uniform probability distribution over the words in its set. The user only inputs the sets of seed words; the model infers their probability distributions.

For the sake of simplicity, we describe our model by assuming a one-to-one correspondence between seed and regular topics. This assumption can easily be relaxed by duplicating the seed topics when there are more regular topics. As shown in Fig. 1, each document is a mixture over T topics, where each of those topics is a mixture of a regular topic (φr) distribution and its associated seed topic (φs) distribution. The parameter πk controls the probability of drawing a word from the seed topic distribution versus the regular topic distribution. For our first model, we assume that the corpus is generated according to the following generative process (its graphical notation is shown in Fig. 2(a)):

1. For each topic k = 1 · · · T,
   (a) Choose regular topic φrk ∼ Dir(βr).
   (b) Choose seed topic φsk ∼ Dir(βs).
   (c) Choose πk ∼ Beta(1, 1).
2. For each document d, choose θd ∼ Dir(α).
   • For each token i = 1 · · · Nd:
     (a) Select a topic zi ∼ Mult(θd).
     (b) Select an indicator xi ∼ Bern(πzi).
     (c) If xi is 0, select a word wi ∼ Mult(φrzi). // choose from the regular topic
     (d) If xi is 1, select a word wi ∼ Mult(φszi). // choose from the seed topic

The first step generates multinomial distributions for both the seed topics and the regular topics. The seed topics are drawn in a way that constrains their distributions to generate only words in the corresponding seed set. Then, for each token in a document, we first generate a topic. After choosing the topic, we flip a (biased) coin to pick either the seed or the regular topic distribution. Once this distribution is selected, we generate a word from it. It is important to note that although there are 2×T topic-word distributions in total, each document is still a mixture of only T topics (as shown in Fig. 1). This is crucial in relating seed and regular topics and is similar to the way topics and aspects are tied in the TAM model (Paul and Girju, 2010).

[Figure 1 (not reproduced): Tree representation of a document in Model 1. Each topic z = 1 · · · T mixes its regular distribution φr_z and its seed distribution φs_z with weights 1 − π_z and π_z.]

To understand how this model gathers words related to the seed words, consider a seed topic (say the fourth row in Table 2) with seed words {grain, wheat, corn, etc.}. By assigning all of the related words, such as "tonnes", "agriculture" and "production", to the corresponding regular topic, the model can put high probability mass on topic z = 4 for agriculture-related documents. If instead it places these words in another regular topic, say z = 3, then the document's probability mass has to be distributed among topics 3 and 4, and as a result the model pays a steeper penalty. Thus the model uses a seed topic to gather related words into its associated regular topic, and as a consequence the document-topic distributions also become more focused.

We have experimented with two ways of choosing the binary variable xi (step 2b) of the generative story. In the first method, we fix this sampling probability to a constant value which is independent of the chosen topic (i.e. πi = π̂, ∀i = 1 · · · T). In the second method, we learn the probability as well (Sec. 4).

2.2 Document-Topic Distributions (Model 2)

In the previous model we used seed words to improve the topic-word probability distributions. Here we propose a model that explores the use of seed words to improve the document-topic probability distributions. Unlike the previous model, we present this model in the general case where the number of seed topics need not equal the number of regular topics. Hence, we associate each seed set (we refer to a seed set as a group for conciseness) with a multinomial distribution over the regular topics, which we call the group-topic distribution.

To give an overview of our model: first, we transfer the seed information from the words onto the documents that contain them. Then, the document-topic distribution is drawn in a two-step process: we sample a seed set (g, for group) and then use its group-topic distribution (ψg) as the prior to draw the document-topic distribution (θd). We use this two-step process to allow a flexible number of seed and regular topics, and to tie the topic distributions of all the documents within a group. We assume the following generative story (its graphical notation is shown in Fig. 2(b)):

1. For each k = 1 · · · T,
   (a) Choose φrk ∼ Dir(βr).
2. For each seed set s = 1 · · · S,
   (a) Choose the group-topic distribution ψs ∼ Dir(α). // the topic distribution for the s-th group (seed set) – a vector of length T
3. For each document d,
   (a) Choose a binary vector ~b of length S.
   (b) Choose a document-group distribution ζd ∼ Dir(τ~b).
   (c) Choose a group variable g ∼ Mult(ζd).
   (d) Choose θd ∼ Dir(ψg). // of length T
   (e) For each token i = 1 · · · Nd:
       i. Select a topic zi ∼ Mult(θd).
       ii. Select a word wi ∼ Mult(φrzi).

We first generate the T topic-word distributions (φk) and the S group-topic distributions (ψs). Then, for each document, we generate a list of the seed sets that are allowed for this document. This list is represented using the binary vector ~b. This binary vector can be populated from the document's words and hence is treated as an observed variable. For example, consider the (very short!) document "oil companies have merged". According to the seed sets from Table 2, we define a binary vector that denotes which seed topics contain words from this document. In this case, the vector is ~b = ⟨1, 1, 0, 1, 1⟩, indicating the presence of seeds from sets 1, 2, 4 and 5.[1] As discussed in (Williamson et al., 2010), generating a binary vector is crucial if we want a document to be able to talk about topics that are less prominent in the corpus.

The binary vector ~b, which indicates which seeds exist in the document, defines the mean of a Dirichlet distribution from which we sample the document-group distribution ζd (step 3b). We set the concentration of this Dirichlet to a hyperparameter τ, which we set by hand (Sec. 4); thus ζd ∼ Dir(τ~b). From the resulting multinomial, we draw a group variable g for the document. This group variable induces a clustering structure among the documents by grouping the documents that are likely to talk about the same seed set. Once the group variable g is drawn, we choose the document-topic distribution θd from a Dirichlet distribution with the group-topic distribution as its prior (step 3d). This step ensures that the topic distributions of the documents within each group are related. The remaining sampling process proceeds as in LDA: we sample a topic for each word and then generate the word from the corresponding topic-word distribution. Observe that if the binary vector is all ones and we set θd = ζd, then this model reduces to the LDA model with τ and βr as the hyperparameters.

[1] As a special case, if no seed word is found in the document, ~b is defined to be the all-ones vector.

[Figure 2 (not reproduced): The graphical notation of all three models: (a) Model 1, (b) Model 2, (c) SeededLDA. In Model 1 we use seed topics to improve the topic-word probability distributions. In Model 2, the seed-topic information is first transferred to the document level based on the document tokens and is then used to improve the document-topic distributions. In the final model, SeededLDA, we combine both models. In Model 1 and SeededLDA, we drop the dependency of φs on the hyperparameter βs since it is observed; for clarity, we also drop the dependency of x on π.]
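The construction of the observed vector ~b can be sketched as follows. Note that the paper's example implicitly matches seed words after morphological normalization ("companies" against "company", "merged" against "merge"), so the usage below feeds in lemmatized tokens; the function name is our own:

```python
def seed_indicator(doc_tokens, seed_sets):
    """b[s] = 1 iff the document contains a word from seed set s;
    by the special case in footnote 1, b is all ones when no seed
    word occurs in the document."""
    b = [int(any(tok in s for tok in doc_tokens)) for s in seed_sets]
    return b if any(b) else [1] * len(b)

# Seed sets from Table 2 (rows 1-5).
seed_sets = [
    {"company", "billion", "quarter", "shrs", "earnings"},
    {"acquisition", "procurement", "merge"},
    {"exchange", "currency", "trading", "rate", "euro"},
    {"grain", "wheat", "corn", "oilseed", "oil"},
    {"natural", "gas", "oil", "fuel", "products", "petrol"},
]

# "oil companies have merged", lemmatized:
print(seed_indicator(["oil", "company", "have", "merge"], seed_sets))
# -> [1, 1, 0, 1, 1]
```

Since ~b depends only on the document's word types, it can be precomputed once per document before inference begins.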
This step ensures that the topic distributions of documents within We first generate T topic-word distributions each group are related. The remaining sampling (φk ) and S group-topic distributions (ψs ). Then for each document, we generate a list of seed sets 1 As a special case, if no seed word is found in the docu- that are allowed for this document. This list is ment, ~b is defined as the all-ones vector. 207 process proceeds like LDA. We sample a topic 2.4 Automatic Seed Selection for each word and then generate a word from its In (Andrzejewski and Zhu, 2009; Andrzejewski corresponding topic-word distribution. Observe et al., 2009), the seed information is provided that, if the binary vector is all ones and if we manually. Here, we describe the use of feature se- set θd = ζ d then this model reduces to the LDA lection techniques, prevalent in the classification model with τ and βr as the hyperparameters. literature, to automatically derive the seed sets. If 2.3 SeededLDA we want the topicality structure identified by the LDA to align with the underlying class structure, Both of our models use seed words in different then the seed words need to be representative of ways to improve topic-word and document-topic the underlying topicality structure. To enable this, distributions respectively. We can combine both we first take class labeled data (doesn’t need to the above models easily. We refer to the combined be multi-class labeled data unlike (Ramage et al., model as SeededLDA and its generative story is 2009)) and identify the discriminating features for as follows (its graphical notation is shown in Fig. each class. Then we choose these discriminating 2(c)). The variables have same semantics as in the features as the initial sets of seed words. In prin- previous models. ciple, this is similar to the prototype driven unsu- 1. For each k=1· · · T, pervised learning (Haghighi and Klein, 2006). 
2.4 Automatic Seed Selection

In (Andrzejewski and Zhu, 2009; Andrzejewski et al., 2009), the seed information is provided manually. Here, we describe the use of feature selection techniques, prevalent in the classification literature, to automatically derive the seed sets. If we want the topicality structure identified by LDA to align with the underlying class structure, then the seed words need to be representative of that underlying topicality structure. To enable this, we first take class-labeled data (it need not be multi-class labeled data, unlike (Ramage et al., 2009)) and identify the discriminating features for each class. We then choose these discriminating features as the initial sets of seed words. In principle, this is similar to prototype-driven unsupervised learning (Haghighi and Klein, 2006).

We use Information Gain (Mitchell, 1997) to identify the required discriminating features. The Information Gain (IG) of a word w for a class c is given by

IG(c, w) = H(c) − H(c|w)

where H(c) is the entropy of the class and H(c|w) is the conditional entropy of the class given the word. In computing Information Gain, we binarize the document vectors and consider only whether a word occurs in any document of a given class. The ranked lists of words thus obtained for each class are filtered for ambiguous words and then used as the initial sets of seed words input to the model.

3 Related Work

Seed-based supervision is closely related to the idea of seeding in the bootstrapping literature for learning semantic lexicons (Thelen and Riloff, 2002). The goals are similar as well: growing a small set of seed examples into a much larger set. A key difference is the type of semantic information that the two approaches aim to capture: semantic lexicons are based on much more specific notions of semantics (e.g., all country names) than the generic "topic" semantics of topic models.
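Information Gain over binarized document vectors can be computed as in the following sketch (illustrative only; function names are our own, documents are represented as sets of word types, and the class variable is the full label distribution):

```python
import math

def entropy(items):
    """Shannon entropy (bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((items.count(v) / n) * math.log2(items.count(v) / n)
                for v in set(items))

def information_gain(docs, labels, word):
    """IG(c, w) = H(c) - H(c|w), with documents binarized: each document
    contributes only whether it contains `word` or not."""
    h_c = entropy(labels)
    h_c_given_w = 0.0
    for present in (True, False):
        group = [lab for doc, lab in zip(docs, labels)
                 if (word in doc) == present]
        if group:
            h_c_given_w += (len(group) / len(labels)) * entropy(group)
    return h_c - h_c_given_w
```

Ranking each class's vocabulary by this score and keeping the top unambiguous words yields candidate seed sets in the spirit of Section 2.4.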
The idea of seeding has also been used Subsequently, we choose a topic for each token in prototype-driven learning (Haghighi and Klein, and then flip a biased coin. We choose either the 2006) and shown similar efficacies for these semi- seed or the regular topic based on the result of the supervised learning approaches. coin toss and then generate a word from its distri- LDAWN (Boyd-Graber et al., 2007) models bution. sets of words for the word sense disambiguation 208 task. It assumes that a topic is a distribution (Sec. 2.2). However our model differs from La- over synsets and relies on the Wordnet to obtain beledLDA in the subsequent steps. Rather than the synsets. The most related prior work is that using the group distribution directly, we sam- of (Andrzejewski et al., 2009), who propose the ple a group variable and use it to constrain the use Dirichlet Forest priors to incorporate Must document-topic distributions of all the documents Link and Cannot Link constraints into the topic within this group. Moreover, in their model the models. This work is analogous to constrained binary vector is observed directly in the form of K-means clustering (Wagstaff et al., 2001; Basu document labels while, in our case, it is automat- et al., 2008). A must link between a pair word ically populated based on the document tokens. types represents that the model should encourage Interactive topic modeling brings the user into both the words to have either high or low prob- the loop, by allowing him/her to make suggestions ability in any particular topic. A cannot link be- on how to improve the quality of the topics at each tween a word pair indicates both the words should iteration (Hu et al., 2011). In their approach, the not have high probability in a single topic. In the authors use Dirichlet Forest method to incorpo- Dirichlet Forest approach, the constraints are first rate the user’s preferences. In our experiments converted into trees with words as the leaves and (Sec. 
4), we show that SeededLDA performs bet- edges having pre-defined weights. All the trees ter than Dirichlet Forest method, so SeededLDA are joined to a dummy node to form a forest. The when used with their framework can allow an user sampling for a word translates into a random walk to explore a document collection in a more mean- on the forest: starting from the root and selecting ingful manner. one of its children based on the edge weights until you reach a leaf node. 4 Experiments While the Dirichlet Forest method requires su- We evaluate different aspects of the model sep- pervision in terms of Must link and Cannot link arately. Our experimental setup proceeds as fol- information, the Topics In Sets (Andrzejewski and lows: a) Using an existing model, we evaluate the Zhu, 2009) model proposes a different approach. effectiveness of automatically derived constraints Here, the supervision is provided at the token indicating the potential benefits of adding seed level. The user chooses specific tokens and re- words into the topic models. b) We evaluate each strict them to occur only with in a specified list of of our proposed models in different settings and topics. While this needs minimal changes to the compare with multiple baseline systems. inference process of LDA, it requires information Since our aim is to overcome the domi- at the level of tokens. The word type level seed nance of majority topics by encouraging the information can be converted into token level in- topicality structure identified by the topic mod- formation (like we do in Sec. 4) but this prevents els to align with that of the document cor- their model from distinguishing the tokens based pus, we choose extrinsic evaluation as the on the word senses. primary evaluation method. We use docu- Several models have been proposed which use ment clustering task and use frequent-5 cate- supervision at the document level. 
Supervised gories of Reuters-21578 corpus (Lewis et al., LDA (Blei and McAuliffe, 2008) and DiscLDA 2004) and four classes from the 20 News- (Lacoste-Julien et al., 2008) try to predict the cat- groups data set (i.e.‘rec.autos’, ‘sci.electronics’, egory labels (e.g. sentiment classification) for ‘comp.hardware’ and ‘alt.atheism’). For both the input documents based on a document labeled the corpora we do the standard preprocessing data. Of these models, the most related one to of removing stopwords and infrequent words SeededLDA is the LabeledLDA model (Ramage (Williamson et al., 2010). et al., 2009). Their model operates on multi-class For all the models, we use a Collapsed Gibbs labeled corpus. Each document is assumed to be sampler (Griffiths and Steyvers, 2004) for the in- a mixture over a known subset of topics (classes) ference process. We use the standard hyperparam- with each topic being a distribution over words. eters values α = 1.0, β = 0.01 and τ = 1.0 and The process of generating document topic distri- run the sampler for 1000 iterations, but one can bution in LabeledLDA is similar to the process use techniques like slice sampling to estimate the of generating group distribution in our Model 2 hyperparameters (Johnson and Goldwater, 2009). 209 Reuters 20 Newsgroups F-measure VI F-measure VI LDA 0.64 (±.05) 1.26 (±.16) 0.77 (±.06) 0.9 (±.13) Dirichlet Forest 0.67∗ (±.02) 1.17 (±.11) 0.79(±.01) 0.83∗ (±.03) ∆ over LDA (+4.68%) (-7.1%) (+2.6%) (-7.8%) Table 3: The effect of adding constraints by Dirichlet Forest Encoding. For Variational Information (VI) a lower score indicates a better clustering. ∗ indicates statistical significance at p = 0.01 as measured by the t-test. All the four improvements are significant at p = 0.05. We run all the models with the same number of every pair of words belonging to different sets. topics as the number of clusters. 
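As a concrete illustration of the seed-selection criterion of Sec. 2.4, the following sketch (our own toy example; the documents, labels and function names are not from the paper) computes IG(c, w) = H(c) − H(c|w) over binarized occurrence vectors:

```python
import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, word):
    """IG(c, w) = H(c) - H(c|w), with word occurrence binarized per document."""
    n = len(docs)
    h_c = entropy([v / n for v in Counter(labels).values()])
    h_c_given_w = 0.0
    # Condition on whether the word occurs in the document or not
    for split in ([l for d, l in zip(docs, labels) if word in d],
                  [l for d, l in zip(docs, labels) if word not in d]):
        if split:
            h_c_given_w += (len(split) / n) * entropy(
                [v / len(split) for v in Counter(split).values()])
    return h_c - h_c_given_w

docs = [{"wheat", "corn"}, {"wheat", "export"}, {"oil", "opec"}, {"oil", "crude"}]
labels = ["grain", "grain", "crude", "crude"]
print(information_gain(docs, labels, "wheat"))  # 1.0: perfectly discriminative
```

Ranking the vocabulary by this score per class and dropping words that rank highly for more than one class yields candidate seed sets, mirroring the filtering step described above.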
Then, for each document, we find the topic that has maximum probability in the posterior document-topic distribution and assign the document to that cluster. The accuracy of the document clustering is measured in terms of F-measure and Variation of Information. The F-measure is calculated based on pairs of documents: if two documents belong to the same cluster in both the ground truth and the clustering proposed by the system, the pair is counted as correct; otherwise it is counted as wrong. The Variation of Information (VI) of two clusterings X and Y is given as (Meilă, 2007):

VI(X, Y) = H(X) + H(Y) − 2 I(X, Y)

where H(X) denotes the entropy of the clustering X and I(X, Y) denotes the mutual information between the two clusterings. For VI, a lower value indicates a better clustering. All the accuracies are averaged over 25 different random initializations, and all the significance results are measured using the t-test at p = 0.01.

4.1 Seed Extraction

The seeds were extracted automatically (Sec. 2.4) based on a small sample of labeled data disjoint from the test data. We first extract 25 seed words per class and then remove the seed words that appear in more than one class. After this filtering, we are left with, on average, 9 and 15 words per seed topic for the Reuters and 20 Newsgroups corpora respectively.

We use the existing Dirichlet Forest method to evaluate the effectiveness of the automatically extracted seed words. The must- and cannot-links required for the supervision (Andrzejewski et al., 2009) are obtained automatically by adding a must-link between every pair of words belonging to the same seed set and a split constraint between every pair of words belonging to different sets. The accuracies are averaged over 25 different random initializations and are shown in Table 3. We have also indicated the relative performance gains compared to LDA. The significant improvement over plain LDA demonstrates the effectiveness of the automatic extraction of seed words in topic models.

4.2 Document Clustering

In the next experiment, we compare our models with LDA and other baselines. The first baseline (maxCluster) simply counts the number of tokens in each document from each of the seed topics and assigns the document to the seed topic with the most tokens. This results in a clustering of documents based on the seed topic they are assigned to, and evaluates the effectiveness of the seed words with respect to the underlying clustering. Apart from the maxCluster baseline, we use LDA and z-labels (Andrzejewski and Zhu, 2009) as our baselines. For z-labels, we treat all the tokens of a seed word in the same way. Table 4 shows the comparison of our models with respect to the baseline systems.² Comparing the performance of maxCluster to that of LDA, we observe that the seed words by themselves do a poor job of clustering the documents.

We experimented with two variants of Model 1. In the first run (Model 1) we sample the πk value, i.e., the probability of choosing a seed topic for each topic, while in the 'Model 1 (π̂ = 0.7)' run we fix this probability to a constant value of 0.7 irrespective of the topic.³

² The code used for the LDA baseline in Tables 3 and 4 is different. For Table 3, we use the code available from http://pages.cs.wisc.edu/∼andrzeje/research/df lda.html; we use our own version for Table 4. We tried to produce a comparable baseline by running the former for more iterations and with different hyperparameters; in Table 3, we report their best results.
³ We chose this value based on intuition; it is not tuned.
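The two clustering metrics used above can be reproduced in a few lines; this is an illustrative sketch (the function names and the toy clusterings are ours), computing the pairwise F-measure and VI(X, Y) = H(X) + H(Y) − 2 I(X, Y):

```python
import math
from collections import Counter
from itertools import combinations

def pairwise_f1(truth, pred):
    """F-measure over document pairs co-clustered in truth vs. system output."""
    same_t = {(i, j) for i, j in combinations(range(len(truth)), 2) if truth[i] == truth[j]}
    same_p = {(i, j) for i, j in combinations(range(len(pred)), 2) if pred[i] == pred[j]}
    tp = len(same_t & same_p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(same_p), tp / len(same_t)
    return 2 * prec * rec / (prec + rec)

def variation_of_information(x, y):
    """VI(X, Y) = H(X) + H(Y) - 2 I(X, Y), in bits; lower is better."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    h = lambda c: -sum((v / n) * math.log2(v / n) for v in c.values())
    mi = sum((v / n) * math.log2((v / n) / ((cx[a] / n) * (cy[b] / n)))
             for (a, b), v in cxy.items())
    return h(cx) + h(cy) - 2 * mi

truth = [0, 0, 0, 1, 1, 1]
print(pairwise_f1(truth, [0, 0, 1, 1, 1, 1]))
print(variation_of_information(truth, truth))  # 0.0 for identical clusterings
```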
                     Reuters                        20 Newsgroups
                 F-measure      VI               F-measure     VI
maxCluster       0.53           1.75             0.58          1.44
LDA              0.66 (±.04)    1.2 (±.12)       0.76 (±.06)   0.9 (±.14)
z-labels         0.73 (±.01)    1.04 (±.01)      0.8 (±.00)    0.82 (±.01)
∆ over LDA       (+10.6%)       (-13.3%)         (+5.26%)      (-8.8%)
Model 1          0.69 (±.00)    1.13 (±.01)      0.8 (±.01)    0.81 (±.02)
Model 1 (π̂ = 0.7)  0.73 (±.00)  1.09 (±.01)     0.8 (±.01)    0.81 (±.02)
Model 2          0.66 (±.04)    1.22 (±.1)       0.77 (±.07)   0.85 (±.12)
SeededLDA        0.76* (±.01)   0.99* (±.03)     0.81* (±.01)  0.75* (±.02)
∆ over LDA       (+15.5%)       (-17.5%)         (+6.58%)      (-16.7%)

Table 4: Accuracies on the document clustering task with different models. * indicates a significant improvement compared to the z-labels approach, as measured by the t-test with p = 0.01. The relative performance gains are with respect to the LDA model and are provided for comparison with the Dirichlet Forest method (in Table 3).

Though both variants performed better than LDA, fixing the probability gave better results. When we attempt to learn this value, the model chooses to explain some of the seed words by the regular topics; when π is fixed, it explains almost all the seed words by the seed topics. The next row (Model 2) indicates the performance of our second model on the same data sets. The first model seems to perform better than the second, which is justifiable since the latter uses seed topics only indirectly. Though the variants of Model 1 and Model 2 performed better than LDA, they fell short of the z-labels approach.

Table 4 also shows the performance of our combined model (SeededLDA) on both corpora. When the models are combined, the performance improves over each of them individually and is also better than the baseline systems. As explained before, our individual models improve the topic-word and the document-topic distributions respectively, and it turns out that the knowledge learnt by the two individual models is complementary. As a result, the combined model performed better than the individual models and the other baseline systems. Comparing the last rows of Tables 4 and 3, we notice that the relative performance gains observed for SeededLDA are significantly higher than the gains obtained by incorporating the constraints using the Dirichlet Forest method. Moreover, as indicated in Table 4, SeededLDA achieves significant gains over the z-labels approach as well.

We have also provided the standard intervals for each of the approaches. A quick inspection of these intervals reveals the superior performance of SeededLDA compared to all the baselines. The standard deviation of the F-measures over different random initializations of our model is about 1% for both corpora, while it is 4% and 6% for LDA on the Reuters and 20 Newsgroups corpora respectively. The reduction in variance across all the approaches that use seed information shows the increased robustness of the inference process when using seed words. From the accuracies in both tables, it is clear that the SeededLDA model outperforms the other models that try to incorporate seed information into topic models.

4.3 Effect of Ambiguous Seeds

In the following experiment we study the effect of ambiguous seeds: we allow a seed word to occur in multiple seed sets. Table 6 shows the corresponding results. The performance drops when we add ambiguous seed words, but it is still higher than that of the LDA model. This suggests that the quality of the seed topics is determined by the discriminative power of the seed words rather than by the number of seed words in each seed topic. The topics identified by SeededLDA on the Reuters corpus are shown in Table 5. With the help of the seed sets, the model is able to split 'Grain' and 'Crude' into two separate topics, which were merged into a single topic by plain LDA.

4.4 Qualitative Evaluation on NIPS Papers
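The maxCluster baseline described above is essentially a token-counting rule; a minimal sketch (the naming and toy data are ours, not from the paper):

```python
def max_cluster(doc_tokens, seed_sets):
    """Assign each document to the seed topic with the most matching tokens."""
    assignments = []
    for tokens in doc_tokens:
        counts = [sum(1 for t in tokens if t in s) for s in seed_sets]
        assignments.append(max(range(len(seed_sets)), key=lambda k: counts[k]))
    return assignments

seeds = [{"wheat", "corn", "grain"}, {"oil", "crude", "opec"}]
docs = [["wheat", "export", "corn"], ["oil", "opec", "price", "crude"]]
print(max_cluster(docs, seeds))  # [0, 1]
```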
We ran the LDA and SeededLDA models on the NIPS papers from 2001 to 2010. For this corpus, the seed words are chosen from the call for papers. There are 10 major areas, with sub-areas under each of them. We ran both models with 10 topics. For SeededLDA, the words in each of the areas are selected as seed words, and we filter out the ambiguous seed words. Upon a qualitative inspection of the output topics, we found that LDA identified seven major topics and left out the "Brain Imaging", "Cognitive Science and Artificial Intelligence" and "Hardware Technologies" areas. Not surprisingly, but reassuringly, these areas are underrepresented among the NIPS papers. On the other hand, SeededLDA successfully identifies all of the major topics. The topics identified by LDA and SeededLDA are shown in the supplementary material.

group, offer, common, cash, agreement, shareholders, acquisition, stake, merger, board, sale
oil, price, prices, production, lt, gas, crude, 1987, 1985, bpd, opec, barrels, energy, first, petroleum
0, mln, cts, net, loss, 2, dlrs, shr, 3, profit, 4, 5, 6, revs, 7, 9, 8, year, note, 1986, 10, 0, sales
tonnes, wheat, mln, grain, week, corn, department, year, export, program, agriculture, 0, soviet, prices
bank, market, pct, dollar, exchange, billion, stg, today, foreign, rate, banks, japan, yen, rates, trade

Table 5: Topics identified by SeededLDA on the frequent-5 categories of the Reuters corpus.

                 Reuters          20 Newsgroups
                 F       VI       F       VI
LDA              0.66    1.2      0.76    0.9
SeededLDA        0.76    0.99     0.81    0.75
SeededLDA (amb)  0.71    1.08     0.79    0.78

Table 6: Effect of ambiguous seed words on SeededLDA.

5 Discussion

In traditional topic models, a symmetric Dirichlet distribution is used as the prior for the topic-word distributions. A first attempt at incorporating seed words into the model is to use an asymmetric Dirichlet distribution as the prior for the topic-word distributions (also called informed priors). For example, to encourage Topic 5 to align with a seed set, we can choose an asymmetric prior of the form β~5 = {β, ..., β + c, ..., β}, i.e., we increase the component values corresponding to the seed words by a positive constant. This favors the desired seed words being drawn with higher probability from this topic. However, it has been argued elsewhere that such distributions rarely pick words other than the seed words (Andrzejewski et al., 2009). Moreover, since in our method each seed topic is a distribution over the seed words, the convex combination of regular and seed topics can be seen as adding different weights (ci) to different components of the prior vector. Thus our Model 1 can be seen as an asymmetric generalization of informed priors.

For comparability purposes, in this paper we experimented with the same number of regular topics as seed topics. But, as explained in the modeling section, our model is general enough to handle situations with unequal numbers of seed and regular topics. In that case, we assume that the seed topics indicate a higher-level topicality structure of the corpus and associate each seed topic (or group) with a distribution over the regular topics. On the other hand, in many NLP applications we tend to have only partial information rather than high-level supervision. In such cases, one can create some empty seed sets and tweak Model 2 to output a 1 in the binary vector components corresponding to these seed sets. In this paper, we used Information Gain to select the discriminating seed words, but in real-world applications one could use publicly available ODP categorization data to obtain higher-level seed words and thus explore the corpora in a more meaningful way.

In this paper, we have explored two methods to incorporate lexical priors into topic models, combining them into a single model that we call SeededLDA. From our experimental analysis, we found that automatically derived seed words can improve clustering performance significantly. Moreover, we found that allowing a seed word to be shared across multiple seed sets degrades the performance.
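The informed-prior construction discussed above amounts to boosting the seed components of an otherwise symmetric Dirichlet parameter vector; a minimal sketch (the variable names and the toy vocabulary are ours):

```python
def informed_prior(vocab, seed_words, beta=0.01, c=1.0):
    """Asymmetric Dirichlet prior: boost the components of the seed words by c."""
    return [beta + (c if w in seed_words else 0.0) for w in vocab]

vocab = ["wheat", "corn", "oil", "price", "market"]
print(informed_prior(vocab, {"wheat", "corn"}))  # [1.01, 1.01, 0.01, 0.01, 0.01]
```

A topic drawn from Dir(β~5) then concentrates its mass on the boosted components, which is exactly the over-commitment to seed words that the discussion above attributes to this approach.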
6 Acknowledgments

We thank the anonymous reviewers for their helpful comments. This material is partially supported by the National Science Foundation under Grant No. IIS-1153487.

References

Andrzejewski, D. and Zhu, X. (2009). Latent dirichlet allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, SemiSupLearn '09, pages 43–48, Morristown, NJ, USA. Association for Computational Linguistics.

Andrzejewski, D., Zhu, X., and Craven, M. (2009). Incorporating domain knowledge into topic modeling via dirichlet forest priors. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32, New York, NY, USA. ACM.

Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC Press.

Blei, D. and McAuliffe, J. (2008). Supervised topic models. In Advances in Neural Information Processing Systems 20, pages 121–128, Cambridge, MA. MIT Press.

Blei, D. M. and Lafferty, J. (2009). Topic models. In Text Mining: Theory and Applications. Taylor and Francis.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Boyd-Graber, J., Blei, D. M., and Zhu, X. (2007). A topic model for word sense disambiguation. In Empirical Methods in Natural Language Processing.

Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.

Griffiths, T., Steyvers, M., and Tenenbaum, J. (2007). Topics in semantic representation. Psychological Review, 114(2):211–244.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences USA, 101 Suppl 1:5228–5235.

Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. (2005). Integrating topics and syntax. In Advances in Neural Information Processing Systems, volume 17, pages 537–544.

Haghighi, A. and Klein, D. (2006). Prototype-driven learning for sequence models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL '06, pages 320–327, Stroudsburg, PA, USA. Association for Computational Linguistics.

Hu, Y., Boyd-Graber, J., and Satinoff, B. (2011). Interactive topic modeling. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 248–257, Stroudsburg, PA, USA. Association for Computational Linguistics.

Johnson, M. and Goldwater, S. (2009). Improving nonparametric bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 317–325, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lacoste-Julien, S., Sha, F., and Jordan, M. (2008). DiscLDA: Discriminative learning for dimensionality reduction and classification. In Proceedings of NIPS '08.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Meilă, M. (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98:873–895.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Paul, M. and Girju, R. (2010). A two-dimensional topic-aspect model for discovering multi-faceted topics. In AAAI.

Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 248–256, Morristown, NJ, USA. Association for Computational Linguistics.

Thelen, M. and Riloff, E. (2002). A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. (2001). Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 577–584, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Wallach, H. M. (2005). Topic modeling: beyond bag-of-words. In NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing.

Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. (2010). The IBP compound dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158.

DualSum: a Topic-Model based approach for update summarization

Jean-Yves Delort              Enrique Alfonseca
Google Research               Google Research
Brandschenkestrasse 110       Brandschenkestrasse 110
8002 Zurich, Switzerland      8002 Zurich, Switzerland
[email protected]        [email protected]
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 214–223, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

Update summarization is a new challenge in multi-document summarization, focusing on summarizing a set of recent documents relative to another set of earlier documents. We present an unsupervised probabilistic approach to model novelty in a document collection and apply it to the generation of update summaries. The new model, called DualSum, places second or third in terms of the ROUGE metrics when tuned on previous TAC competitions and tested on TAC-2011, being statistically indistinguishable from the winning system. A manual evaluation of the generated summaries shows state-of-the-art results for DualSum with respect to focus, coherence and overall responsiveness.

1 Introduction

Update summarization is the problem of extracting and synthesizing novel information in a collection of documents with respect to a set of documents assumed to be known by the reader. This problem has received much attention in recent years, as can be observed in the number of participants in the special track on update summarization organized by DUC and TAC since 2007. The problem is usually formalized as follows: given two collections A and B, where the documents in A chronologically precede the documents in B, generate a summary of B under the assumption that the user of the summary has already read the documents in A.

Extractive techniques are the most common approaches in multi-document summarization. Summaries generated by such techniques consist of sentences extracted from the document collection. Extracts can have coherence and cohesion problems, but they generally offer a good trade-off between linguistic quality and informativeness.

While numerous extractive summarization techniques have been proposed for multi-document summarization (Erkan and Radev, 2004; Radev et al., 2004; Shen and Li, 2010; Li et al., 2011), few techniques have been specifically designed for update summarization. Most existing approaches handle it as a redundancy-removal problem, with the goal of producing a summary of collection B that is as dissimilar as possible from either collection A or from a summary of collection A. A problem with this approach is that it can easily classify as redundant sentences in which novel information is mixed with existing information (from collection A). Furthermore, while this approach can identify sentences that contain novel information, it cannot model explicitly what the novel information is.

Recently, Bayesian models have been applied successfully to multi-document summarization, showing state-of-the-art results in summarization competitions (Haghighi and Vanderwende, 2009; Jin et al., 2010). These approaches offer clear and rigorous probabilistic interpretations that many other techniques lack. Furthermore, they have the advantage of operating in unsupervised settings, which can be used in real-world scenarios, across domains and languages. To the best of our knowledge, previous work has not used this approach for update summarization.

In this article, we propose a novel nonparametric Bayesian approach for update summarization. Our approach, which is a variation of Latent Dirichlet Allocation (LDA) (Blei et al., 2003), aims to learn to distinguish between common information and novel information. We have evaluated this approach on the ROUGE scores and demonstrate that it produces results comparable to the top system in TAC-2011. Furthermore, our approach improves over that system when evaluated manually in terms of linguistic quality and overall responsiveness.

2 Related work

2.1 Bayesian approaches in Summarization

Most Bayesian approaches to summarization are based on topic models. These generative models represent documents as mixtures of latent topics, where a topic is a probability distribution over words. In TopicSum (Haghighi and Vanderwende, 2009), each word is generated by a single topic, which can be a corpus-wide background distribution over common words, a distribution of document-specific words, or a distribution of the core content of a given cluster. BayesSum (Daumé and Marcu, 2006) and the Special Words and Background model (Chemudugunta et al., 2006) are very similar to TopicSum.

A commonality of all these models is the use of collection- and document-specific distributions in order to distinguish between the general and specific topics in documents. In the context of summarization, this distinction helps to identify the important pieces of information in a collection.

Models that use more structure in the representation of documents have also been proposed for generating more coherent and less redundant summaries, such as HierSum (Haghighi and Vanderwende, 2009) and TTM (Celikyilmaz and Hakkani-Tur, 2011). For instance, HierSum models the intuitions that the first sentences in documents should contain more general information, and that adjacent sentences are likely to share specific content vocabulary. However, HierSum, which builds upon TopicSum, does not show a statistically significant improvement in ROUGE over TopicSum.

A number of techniques have been proposed to rank the sentences of a collection given a word distribution (Carbonell and Goldstein, 1998; Goldstein et al., 1999). The Kullback-Leibler (KL) divergence is a widely used measure in summarization. Given a target distribution T that we want a summary S to approximate, KL is commonly used as the scoring function to select the subset of sentences S* that minimizes the KL divergence with T:

S* = argmin_S KL(T, S) = argmin_S Σ_{w∈V} pT(w) log ( pT(w) / pS(w) )

where w is a word from the vocabulary V. This strategy is called KLSum. Usually, a smoothing factor τ is applied to the candidate distribution S in order to avoid the divergence being undefined.¹

This objective function selects the most representative sentences of the collection, and at the same time it diversifies the generated summary by penalizing redundancy. Since the problem of finding the subset of sentences from a collection that minimizes the KL divergence is NP-complete, a greedy algorithm is often used in practice.² Some variations of this objective function can be considered, such as penalizing sentences that contain document-specific topics (Mason and Charniak, 2011) or rewarding sentences appearing closer to the beginning of the document.

Wang et al. (2009) propose a Bayesian approach for summarization that does not use KL for reranking. In their model, Bayesian Sentence-based Topic Models, every sentence in a document is assumed to be associated with a unique latent topic. Once the model parameters have been calculated, a summary is generated by choosing the sentence with the highest probability for each topic.

While hierarchical topic modeling approaches have shown remarkable effectiveness in learning the latent topics of document collections, they are not designed to capture the novel information in one collection with respect to another, which is the primary focus of update summarization.

2.2 Update Summarization

The goal of update summarization is to generate an update summary of a collection B of recent documents, assuming that the users have already read earlier documents from a collection A.

¹ In our experiments we set τ = 0.01.
² In our experiments, we follow the same approach as in (Haghighi and Vanderwende, 2009) by greedily adding sentences to a summary so long as they decrease the KL divergence.
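The KLSum strategy with the greedy approximation of footnote 2 can be sketched as follows (an illustrative reimplementation under our own naming, applying the τ-smoothing to the candidate distribution as described above):

```python
import math
from collections import Counter

def kl_divergence(p, counts, vocab, tau=0.01):
    # Smooth the candidate distribution so the divergence stays defined
    z = sum(counts.values()) + tau * len(vocab)
    return sum(p[w] * math.log(p[w] / ((counts.get(w, 0) + tau) / z)) for w in vocab)

def klsum(sentences, max_words=100, tau=0.01):
    """Greedy KLSum: repeatedly add the sentence that most decreases KL(T, S)."""
    target = Counter(w for s in sentences for w in s)
    n = sum(target.values())
    vocab = set(target)
    p = {w: c / n for w, c in target.items()}
    summary, counts = [], Counter()
    while sum(counts.values()) < max_words:
        best = min((s for s in sentences if s not in summary),
                   key=lambda s: kl_divergence(p, counts + Counter(s), vocab, tau),
                   default=None)
        if best is None:
            break
        summary.append(best)
        counts += Counter(best)
    return summary

sents = [["oil", "price", "rise"], ["oil", "market"], ["football", "match"]]
print(klsum(sents, max_words=5))
```

On this toy input the greedy loop first picks the most representative sentence and then tends to diversify, since re-covering already-selected words no longer decreases the divergence.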
to collection A as the base collection and to collection B as the update collection.

Update summarization is related to novelty detection, which can be defined as the problem of determining whether a document contains new information given an existing collection (Soboroff and Harman, 2005). Thus, while the goal of novelty detection is to determine whether some information is new, the goal of update summarization is to extract and synthesize the novel information.

Update summarization is also related to contrastive summarization, i.e. the problem of jointly generating summaries for two entities in order to highlight their differences (Lerman and McDonald, 2009). The primary difference here is that update summarization aims to extract novel or updated information in the update collection with respect to the base collection.

The most common approach for update summarization is to apply a standard multi-document summarizer, with some added functionality to remove sentences that are redundant with respect to collection A. This can be achieved using simple filtering rules (Fisher and Roark, 2008), Maximal Marginal Relevance (Boudin et al., 2008), or more complex graph-based algorithms (Shen and Li, 2010; Wenjie et al., 2008). The goal here is to boost sentences in B that bring out completely novel information. One problem with this approach is that it is likely to discard as redundant any sentence in B whose novel information is mixed with known information from collection A.

Another approach is to introduce specific features intended to capture the novelty in collection B. For example, comparing collections A and B, FastSum derives features for collection B, such as the number of named entities in the sentence that already occurred in the old cluster, or the number of new content words in the sentence not already mentioned in the old cluster, which are subsequently used to train a Support Vector Machine classifier (Schilder et al., 2008). A limitation of this approach is that there are no large training sets available and, the more features it has, the more it is affected by the sparsity of the training data.

3 DualSum

3.1 Model Formulation

The input for DualSum is a set of pairs of collections of documents C = {(A_i, B_i)}_{i=1...m}, where A_i is a base document collection and B_i is an update document collection. We use c to refer to a collection pair (A_c, B_c).

In DualSum, documents are modeled as bags of words that are assumed to be sampled from a mixture of latent topics. Each word is associated with a latent variable that specifies which topic distribution was used to generate it. Words in a document are assumed to be conditionally independent given the hidden topic.

As in previous Bayesian work on summarization (Daumé and Marcu, 2006; Chemudugunta et al., 2006; Haghighi and Vanderwende, 2009), DualSum not only learns collection-specific distributions, but also a general background distribution over common words, φ_G, and a document-specific distribution φ_cd for each document d in collection pair c, which is useful to separate the specific aspects of each document from the general aspects of c. The main novelty is that DualSum introduces specific machinery for identifying novelty.

To capture the differences between the base and the update collection in each pair c, DualSum learns two topics for every collection pair. The joint topic φ_{A_c} captures the information common to the two collections in the pair, i.e. the main event that both collections discuss. The update topic φ_{B_c} focuses on the aspects that are specific to the documents inside the update collection.

In the generative model:

• For a document d in a collection A_c, words can originate from one of three different topics: φ_G, φ_cd and φ_{A_c}, the last of which captures the main topic described in the collection pair.

• For a document d in a collection B_c, words can originate from one of four different topics: φ_G, φ_cd, φ_{A_c} and φ_{B_c}. The last one will capture the most important updates to the main topic.

To make this representation simpler, we can also state that both collections are generated from the four topics, but we constrain the topic probability for φ_{B_c} to always be zero when generating a base document.

We denote by u_cd ∈ {A, B} the type of a document d in pair c. This is an observed, Boolean variable stating whether document d belongs to the base or to the update collection inside the pair c.

The generation process of documents in DualSum is described in Figure 1, and the plate diagram corresponding to this generative story is shown in Figure 2. DualSum is an LDA-like model, where topic distributions are multinomial distributions over words, sampled from Dirichlet distributions. We use λ = (λ_G, λ_D, λ_A, λ_B) as symmetric priors for the Dirichlet distributions generating the word distributions. In our experiments, we set λ_G = 0.1 and λ_D = λ_A = λ_B = 0.001. A greater value is assigned to λ_G in order to reflect the intuition that there should be more words in the background than in the other distributions, so its mass is expected to be shared among a larger number of words.

1. Sample φ_G ∼ Dir(λ_G)
2. For each collection pair c = (A_c, B_c):
   • Sample φ_{A_c} ∼ Dir(λ_A)
   • Sample φ_{B_c} ∼ Dir(λ_B)
   • For each document d of type u_cd ∈ {A, B}:
     - Sample φ_cd ∼ Dir(λ_D)
     - If u_cd = A, sample ψ_cd ∼ Dir(γ^A)
     - If u_cd = B, sample ψ_cd ∼ Dir(γ^B)
     - For each word w in document d:
       (a) Sample a topic z ∼ Mult(ψ_cd), z ∈ {G, cd, A_c, B_c}
       (b) Sample a word w ∼ Mult(φ_z)

Figure 1: Generative model in DualSum.

Figure 2: Graphical model representation of DualSum. [Plate diagram showing the observed variables u and w, the latent variables ψ and z, the word distributions φ_G, φ_D, φ_A and φ_B, and the hyper-parameters γ^A, γ^B, λ_G, λ_D, λ_A and λ_B.]

Unlike the word distributions, the mixing probabilities are drawn from Dirichlet distributions with asymmetric priors. The prior knowledge about the origin of words in the base and update collections is again encoded at the level of the hyper-parameters. For example, if we set γ^A = (5, 3, 2, 0), this would reflect the intuition that, on average, in the base collections, 50% of the words originate from the background distribution, 30% from the document-specific distribution, and 20% from the joint topic. Similarly, if we set γ^B = (5, 2, 2, 1), the prior reflects the assumption that, on average, in the update collections, 50% of the words originate from the background distribution, 20% from the document-specific distribution, 20% from the joint topic, and 10% from the novel, update topic.³ The priors we have actually used are reported in Section 4.

3.2 Learning and inference

In order to find the optimal model parameters, the following equation needs to be computed:

  p(z, ψ, φ | w, u) = p(z, ψ, φ, w, u) / p(w, u)

Omitting hyper-parameters for notational simplicity, the joint distribution over the observed variables is:

  p(w, u) = p(φ_G) × ∏_c p(φ_{A_c}) p(φ_{B_c}) × ∏_d p(u_cd) p(φ_cd) ∫_∆ p(ψ_cd | u_cd) ∏_n Σ_{z_cdn} p(w_cdn | z_cdn) p(z_cdn | ψ_cd) dψ_cd

where ∆ denotes the 4-dimensional simplex.⁴ Since this equation is intractable, we need to perform approximate inference in order to estimate the model parameters. A number of Bayesian statistical inference techniques can be used to address this problem.

³ To highlight the difference between asymmetric and symmetric priors, we put the indices in superscript and subscript respectively.
⁴ Remember that, for base documents, words cannot be generated by the update topic, so ∆ denotes the 3-dimensional simplex for base documents.
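As an illustration, the generative story in Figure 1 can be simulated for a single collection pair. This is a toy sketch, not the paper's implementation: the helper names are ours, the hyper-parameters are milder than the paper's sparse settings so the toy sample stays numerically well behaved, and the hard constraint that base documents draw no words from the update topic is approximated with a negligible pseudo-count.

```python
import random

random.seed(0)
V = 50                                           # toy vocabulary size
LAM = {"G": 0.5, "D": 0.1, "A": 0.1, "B": 0.1}   # toy priors; the paper uses 0.1 / 0.001
GAMMA = {"A": (5.0, 3.0, 2.0, 1e-9),             # base documents: ~zero mass on update topic
         "B": (5.0, 2.0, 2.0, 1.0)}              # update documents

def dirichlet(alphas):
    """Sample from a Dirichlet distribution via normalized Gamma variates."""
    xs = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(xs)
    return [x / total for x in xs]

def categorical(probs):
    """Draw an index according to a probability vector."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Word distributions shared by a single collection pair c
phi_G = dirichlet([LAM["G"]] * V)                # background distribution
phi_A = dirichlet([LAM["A"]] * V)                # joint topic phi_Ac
phi_B = dirichlet([LAM["B"]] * V)                # update topic phi_Bc

def generate_document(doc_type, n_words=20):
    """Generate one document of type 'A' (base) or 'B' (update) as in Figure 1."""
    phi_d = dirichlet([LAM["D"]] * V)            # document-specific distribution phi_cd
    psi = dirichlet(GAMMA[doc_type])             # mixing proportions over {G, cd, Ac, Bc}
    topics = [phi_G, phi_d, phi_A, phi_B]
    return [categorical(topics[categorical(psi)]) for _ in range(n_words)]
```

Generating a document of type "A" yields word indices whose topic indicators essentially never select the update topic, mirroring the constraint discussed above.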
217 (v) Variational approaches (Blei et al., 2003) and where k ∈ K, nk denotes the number of times collapsed Gibbs sampling (Griffiths and Steyvers, (cd) word v is assigned to topic k, and nk denotes 2004) are common techniques for approximate in- the number of words in document d of collection ference in Bayesian models. They offer different c that are assigned to topic k. advantages: the variational approach is arguably By the strong law of large numbers, the average faster computationally, but the Gibbs sampling of sample parameters should converge towards approach is in principal more accurate since it the true expected value of the model parameter. asymptotically approaches the correct distribution Therefore, good estimates of the model parame- (Porteous et al., 2008). In this section, we pro- ters can be obtained averaging over the sampled vide details on a collapsed Gibbs sampling strat- values. As suggested by Gamerman and Lopes egy to infer the model parameters of D UAL S UM (2006), we have set a lag (20 iterations) between for a given dataset. samples in order to reduce auto-correlation be- Collapsed Gibbs sampling is a particular case tween samples. Our sampler also discards the first of Markov Chain Monte Carlo (MCMC) that in- 100 iterations as burn-in period in order to avoid volves repeatedly sampling a topic assignment for averaging from samples that are still strongly in- each word in the corpus. A single iteration of the fluenced by the initial assignment. Gibbs sampler is completed after sampling a new topic for each word based on the previous assign- 4 Experiments in Update ment. In a collapsed Gibbs sampler, the model Summarization parameters are integrated out (or collapsed), al- The Bayesian graphical model described in the lowing to only sample z. Let us call wcdn the n-th previous section can be run over a set of news word in document d in collection c, and zcdn its collections to learn the background distribution, topic assignment. 
For Gibbs sampling, we need a joint distribution for each collection, an update to calculate p(zcdn |w, u, z−cdn ) where z−cdn de- distribution for each collection and the document- notes the random vector of topic assignments ex- specific distributions. Once this is done, one of cept the assignment zcdn . the learned collections can be used to generate the summary that best approximates this collection, p(zcdn = j|w, u, z−cdn , γ A , γ B , λ) ∝ using the greedy algorithm described by Haghighi (w ) (cd) cdn n−cdn,j + λj n−cdn,j + γjucd and Vanderwende (2009). Still, there are some pa- PV (v) n−cdn,j + V λj P (cd) + γkucd ) rameters that can be defined and which affects the v=1 k∈K (n−cdn,k results obtained: (v) where K = {G, cd, Ac , Bc }, n−cdn,j denotes the • D UAL S UM’s choice of hyper-parameters af- number of times word v is assigned to topic j fects how the topics are learned. excluding current assignment of word wcdn and (cd) • The documents can be represented with n- n−cdn,k denotes the number of words in document grams of different lengths. d of collection c that are assigned to topic j ex- cluding current assignment of word wcdn . • It is possible to generate a summary that ap- After each sampling iteration, the model pa- proximates the joint distribution, the update- rameters can be estimated using the following for- only distribution, or a combination of both. mulas5 . This section describes how these parameters have been tuned. (w) nk + λk φkw =P (v) 4.1 Parameter tuning V v=1 nk + V λk We use the TAC 2008 and 2009 update task datasets as training set for tuning the hyper- (cd) n + λk parameters for the model, namely the pseudo- ψkcd = P k(cd) n. + V λ k counts for the two Dirichlet priors that affects the 5 The interested reader is invited to consult (Wang, 2011) topic mix assignment for each document. 
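For illustration only, one sweep of the collapsed sampler over a single toy document can be sketched as follows (hypothetical variable names; a real implementation would maintain corpus-wide counts and iterate over all documents in all collection pairs). Note that the denominator of the document term is constant across topics, so it can be dropped before normalization.

```python
import random

random.seed(1)
V, K = 30, 4                                     # toy vocabulary; topics 0..3 = {G, cd, Ac, Bc}
LAM = [0.1, 0.001, 0.001, 0.001]                 # symmetric word priors per topic
GAMMA = {"A": [90.0, 190.0, 50.0, 0.0],          # base documents: no update topic
         "B": [90.0, 170.0, 45.0, 25.0]}

def gibbs_sweep(words, z, doc_type, n_kv, n_dk):
    """Resample the topic assignment of every word in one toy document.

    n_kv[k][v]: count of word v assigned to topic k (corpus-wide in general);
    n_dk[k]: words of this document assigned to topic k.
    """
    gamma = GAMMA[doc_type]
    for i, w in enumerate(words):
        k = z[i]
        n_kv[k][w] -= 1                          # remove the current assignment
        n_dk[k] -= 1
        weights = []
        for j in range(K):
            word_term = (n_kv[j][w] + LAM[j]) / (sum(n_kv[j]) + V * LAM[j])
            weights.append(word_term * (n_dk[j] + gamma[j]))
        total = sum(weights)
        r, acc, k = random.random() * total, 0.0, K - 1
        for j in range(K):
            acc += weights[j]
            if r < acc:
                k = j
                break
        z[i] = k
        n_kv[k][w] += 1                          # record the new assignment
        n_dk[k] += 1
```

Because γ^A assigns zero pseudo-count to the update topic, a base document whose counts start with no mass on that topic can never move any word onto it, which implements the constraint from Section 3.1.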
By performing a grid search over a large set of possible hyper-parameters, these have been fixed to γ^A = (90, 190, 50, 0) and γ^B = (90, 170, 45, 25), as the values that produced the best ROUGE-2 score on those two datasets.

Regarding the base collections, this can be interpreted as setting as prior knowledge that roughly 27% of the words in the original dataset originate from the background distribution, 58% from the document-specific distributions, and 15% from the topic of the original collection. We remind the reader that the last value in γ^A is set to zero because, by the problem definition, the original collection must have no words generated from the update topic, which reflects the most recent developments that are still not present in the base collections.

Regarding the update set, 27% of the words are assumed to originate again from the background distribution, 51% from the document-specific distributions, 14% from a topic in common with the original collection, and 8% from the update-specific topic. One interesting fact to note from these settings is that most of the words belong to topics that are specific to single documents (58% and 51%, respectively, for sets A and B) and to the background distribution, whereas the joint and update topics generate a much smaller, limited set of words. This helps these two distributions to be more focused.

The other settings mentioned at the beginning of this section have been tuned using the TAC-2010 dataset, which we reserved as our development set.

Figure 3: Variation in ROUGE-2 score on the TAC-2010 dataset as we change the mixture weight for the joint topic model between 0 and 1.

Figure 4: Effect of the mixture weight on ROUGE-2 scores (TAC-2010 dataset). Results are reported using bigrams (above, blue), unigrams (middle, red) and trigrams (below, yellow).
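As a quick sanity check, the percentages quoted above are simply the normalized pseudo-counts (the paper rounds them slightly differently):

```python
# Pseudo-counts tuned on the TAC 2008/2009 update datasets (Section 4.1)
gamma_A = (90, 190, 50, 0)     # base: background, document-specific, joint, update
gamma_B = (90, 170, 45, 25)    # update collections

def prior_percentages(gamma):
    """Expected topic proportions (in %) implied by Dirichlet pseudo-counts."""
    total = sum(gamma)
    return [round(100 * g / total, 1) for g in gamma]

print(prior_percentages(gamma_A))   # → [27.3, 57.6, 15.2, 0.0]
print(prior_percentages(gamma_B))   # → [27.3, 51.5, 13.6, 7.6]
```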
Once the different document-specific and collection-specific distributions have been obtained, we have to choose the target distribution T with which the possible summaries will be compared using the KL metric. Usually, human-generated update summaries not only include the terms that are very specific to the latest developments, but also a little background regarding the developing event. Therefore, for KLSum, we try a simple mixture between the joint topic (φ_A) and the update topic (φ_B).

Figure 3 shows the ROUGE-2 results obtained as we vary the mixture weight between the joint distribution φ_A and the update-specific distribution φ_B. As can be seen at the left of the curve, using only the update-specific model, which disregards the generic words about the topic described, produces much lower results. The results improve as the relative weight of the joint topic model increases, until they plateau at a maximum roughly in the interval [0.6, 0.8]; from that point on, performance slowly degrades, as at the right part of the curve the update model is given very little importance in generating the summary. Based on these results, from this point onwards the mixture weight has been set to 0.7. Note that using only the joint distribution (setting the mixture weight to 1.0) also produces reasonable results, hinting that it successfully incorporates the most important n-grams from both the base and the update collections at the same time.

A second parameter is the size of the n-grams used to represent the documents. The original implementations of SumBasic (Nenkova and Vanderwende, 2005) and TopicSum (Haghighi and Vanderwende, 2009) were defined over single words (unigrams). Still, Haghighi and Vanderwende (2009) report some improvements in the ROUGE-2 score when representing words as a bag of bigrams, and Darling (2010) mentions similar improvements when running SumBasic with bigrams. Figure 4 shows the effect on the ROUGE-2 curve when we switch to unigrams or trigrams. As stated in previous work, using bigrams gives better results than using unigrams. Using trigrams was worse than either of them. This is probably because trigrams are too specific and the document collections are small, so the models are more likely to suffer from data sparseness.

4.2 Baselines

DualSum is a modification of TopicSum designed specifically for the case of update summarization, obtained by modifying TopicSum's graphical model in a way that captures the dependency between the joint and the update collections. Still, it is important to discover whether the new graphical model actually improves over simpler applications of TopicSum to this task. The three baselines that we have considered are:

• Running TopicSum on the set of collections containing only the update documents. We call this run TopicSum_B.

• Running TopicSum on the set of collections containing both the base and the update documents. Contrary to the previous run, the topic model for each collection in this run will contain information relevant to the base events. We call this run TopicSum_{A∪B}.

• Running TopicSum twice, once on the set of collections containing the update documents, and a second time on the set of collections containing the base documents. Then, for each collection, the obtained base and update models are combined in a mixture model using a mixture weight between zero and one. The weight has been tuned using TAC-2010 as development set. We call this run TopicSum_A+TopicSum_B.

4.3 Automatic evaluation

DualSum and the three baselines⁶ have been automatically evaluated using the TAC-2011 dataset. Table 1 shows the ROUGE results obtained. Because of the non-deterministic nature of Gibbs sampling, the results reported here are the average of five runs for all the baselines and for DualSum. DualSum outperforms two of the baselines in all three ROUGE metrics, and it also outperforms TopicSum_B on two of the three metrics.

The top three systems in TAC-2011 have been included for comparison. The results between these three systems, and between them and DualSum, are all indistinguishable at 95% confidence. Note that the best baseline, TopicSum_B, is quite competitive, with results that are indistinguishable from those of the top participants in this year's evaluation. Note as well that, because we have five different runs for our algorithms, whereas we have only one output for each TAC participant, the confidence intervals in the second case were slightly bigger when checking for statistical significance, so it is slightly harder for these systems to assert that they outperform the baselines with 95% confidence. These results would have made DualSum the second best system for ROUGE-1 and ROUGE-SU4, and the third best system in terms of ROUGE-2.

The supplementary materials contain a detailed example of the topic model obtained for the background in the TAC-2011 dataset, and the base and update models for collection D1110. As expected, the top unigrams and bigrams are all closed-class words and auxiliary verbs. Because trigrams are longer, background trigrams actually include some content words (e.g. university or director). Regarding the models for φ_A and φ_B, the base distribution contains words related to the original event of an earthquake in Sichuan province (China), and the update distribution focuses more on the official (updated) death toll numbers. It can be noted here that the tokenizer we used is very simple (splitting tokens separated by white-spaces or punctuation), so that numbers such as 7.9 (the magnitude of the earthquake) and 12,000 or 14,000 are divided into two tokens. We thought this might be a reason for the bigram-based system to produce better results, but we ran the summarizers with a numbers-aware tokenizer and the statistical differences between versions still hold.

Method                           R-1         R-2         R-SU4
TopicSum_B                       0.3442      0.0868      0.1194
TopicSum_{A∪B}                   0.3385      0.0809      0.1159
TopicSum_A+TopicSum_B            0.3328      0.0770      0.1125
DualSum                          0.3575‡†∗   0.0924†∗    0.1285‡†∗
TAC-2011 best system (Peer 43)   0.3559†∗    0.0958†∗    0.1308‡†∗
TAC-2011 2nd system (Peer 25)    0.3582†∗    0.0926∗     0.1276†∗
TAC-2011 3rd system (Peer 17)    0.3558†∗    0.0886      0.1279†∗

Table 1: Results on the TAC-2011 dataset. ‡, † and ∗ indicate that a result is significantly better than TopicSum_B, TopicSum_{A∪B} and TopicSum_A+TopicSum_B, respectively (p < 0.05).

4.4 Manual evaluation

While the ROUGE metrics provide an arguable estimate of the informativeness of a generated summary, they do not account for other important aspects such as readability or overall responsiveness. To evaluate such aspects, a manual evaluation is required. A fairly standard approach for manual evaluation is pairwise comparison (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2011). In this approach, raters are presented with pairs of summaries generated by two systems and are asked to say which one is best with respect to some aspects. We followed a similar approach to compare DualSum with Peer 43, the best system with respect to ROUGE-2, on the TAC 2011 dataset. For each collection, raters were presented with three summaries: a reference summary randomly chosen from the model summaries, and the summaries generated by Peer 43 and DualSum. They were asked to read the summaries and say which one of the two generated summaries is best with respect to: 1) Overall responsiveness: which summary is best overall (both in terms of content and fluency); 2) Focus: which summary contains fewer irrelevant details; 3) Coherence: which summary is more coherent; and 4) Non-redundancy: which summary repeats the same information less. For each aspect, the rater could also reply that both summaries were of the same quality.

For each of the 44 collections in TAC-2011, 3 ratings were collected from raters.⁷ Results are reported in Table 2.

Aspect                   Peer 43   Same   DualSum
Overall Responsiveness   39        25     68
Focus                    41        22     69
Coherence                39        30     63
Non-redundancy           40        53     39

Table 2: Results of the side-by-side manual evaluation.

DualSum outperforms Peer 43 in three aspects, including Overall Responsiveness, which aggregates all the other scores and can be considered the most important one. Regarding Non-redundancy, DualSum and Peer 43 obtain similar results, but the majority of raters found no difference between the two systems.

Fleiss' κ has been used to measure inter-rater agreement. For each aspect, we observe κ ≈ 0.2, which corresponds to slight agreement; but if we focus on tasks where the three ratings reflect a preference for either of the two systems, then κ ≈ 0.5, which indicates moderate agreement.

4.5 Efficiency and applicability

The running time for summarizing the TAC collections with DualSum, averaged over a hundred runs, is 4.97 minutes, using one core (2.3 GHz). Memory consumption was 143 MB.

It is important to note as well that, while TopicSum incorporates an additional layer to model topic distributions at the sentence level, we noticed early in our experiments that this did not improve performance (as evaluated with ROUGE) and consequently relaxed that assumption in DualSum. This resulted in a simplification of the model and a reduction of the sampling time.

While five minutes is fast enough to experiment and tune parameters with the TAC collections, it would be quite slow for a real-time summarization system able to generate summaries on request. As can be seen from the plate diagram in Figure 2, all the collections are generated independently from each other. The only exception, for which it is necessary to have all the collections available at the same time during Gibbs sampling, is the background distribution, which is estimated from all the collections simultaneously and roughly represents 27% of the words, which should appear distributed across all documents.

The good news is that this background distribution will contain the closed-class words of the language, which are domain-independent (see the supplementary material for examples). Therefore, we can generate this distribution from one of the TAC datasets only once, and it can then be reused. Fixing the background distribution to a pre-computed value requires a very simple modification of the Gibbs sampling implementation, which only needs to adjust at each iteration the collection- and document-specific models and the topic assignment for the words. Using this modified implementation, it is now possible to summarize a single collection independently. The summarization of a single collection of the size of the TAC collections is reduced on average to only three seconds on the same hardware settings, allowing the use of this summarizer in an on-line application.

5 Conclusions

The main contribution of this paper is DualSum, a new topic model that is specifically designed to identify and extract novelty from pairs of collections. It is inspired by TopicSum (Haghighi and Vanderwende, 2009), with two main changes. Firstly, while TopicSum can only learn the main topic of a collection, DualSum focuses on the differences between two collections. Secondly, while TopicSum incorporates an additional layer to model topic distributions at the sentence level, we have found that relaxing this assumption and modeling the topic distribution at the document level does not decrease the ROUGE scores and reduces the sampling time.

The generated summaries, tested on the TAC-2011 collection, would have obtained the second and third positions in the last summarization competition according to the different ROUGE scores. This would make DualSum statistically indistinguishable from the top system with 0.95 confidence.

We also propose and evaluate the applicability of an alternative implementation of Gibbs sampling to on-line settings. By fixing the background distribution we are able to summarize a collection in only three seconds, which seems reasonable for some on-line applications.

As future work, we plan to explore the use of DualSum to generate more general contrastive summaries, by identifying differences between collections whose differences are not of a temporal nature.

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 257790. We would also like to thank Yasemin Altun and the anonymous reviewers for their useful comments on the draft of this paper.

⁶ Using the settings obtained in the previous section, having been optimized on the datasets from previous TAC competitions.
⁷ In total, 132 raters participated in the task via our own crowdsourcing platform, not mentioned for blind review.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.

Florian Boudin, Marc El-Bèze, and Juan-Manuel Torres-Moreno. 2008. A scalable MMR approach to sentence scoring for multi-document update summarization. In Coling 2008: Companion volume: Posters, pages 23–26, Manchester, UK, August. Coling 2008 Organizing Committee.

J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2011. Discovery of topically coherent sentences for extractive summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 491–499, Portland, Oregon, USA, June. Association for Computational Linguistics.

Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2006. Modeling general and specific aspects of documents with a probabilistic topic model. In NIPS, pages 241–248.

W. M. Darling. 2010. Multi-document summarization from first principles. In Proceedings of the Third Text Analysis Conference, TAC-2010. NIST.

Hal Daumé III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-2006, pages 305–312, Stroudsburg, PA, USA. Association for Computational Linguistics.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479, December.

S. Fisher and B. Roark. 2008. Query-focused supervised sentence ranking for update summaries. In Proceedings of the First Text Analysis Conference, TAC-2008.

Dani Gamerman and Hedibert F. Lopes. 2006. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall/CRC.

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: sentence selection and evaluation metrics. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 121–128, New York, NY, USA. ACM.

T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235, April.

A. Haghighi and L. Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370. Association for Computational Linguistics.

Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. The THU summarization systems at TAC 2010. In Proceedings of the Third Text Analysis Conference, TAC-2010.

Kevin Lerman and Ryan McDonald. 2009. Contrastive summarization: an experiment with consumer reviews. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 113–116, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xuan Li, Liang Du, and Yi-Dong Shen. 2011. Graph-based marginal ranking for update summarization. In Proceedings of the Eleventh SIAM International Conference on Data Mining. SIAM / Omnipress.

Rebecca Mason and Eugene Charniak. 2011. Extractive multi-document summaries should explicitly not contain document-specific content. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, WASDGML '11, pages 49–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

A. Nenkova and L. Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.

Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577, New York, NY, USA, August. ACM.

Dragomir R. Radev, Hongyan Jing, Malgorzata Styś, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing and Management, 40:919–938, November.

Frank Schilder, Ravikumar Kondadadi, Jochen L. Leidner, and Jack G. Conrad. 2008. Thomson Reuters at TAC 2008: Aggressive filtering with FastSum for update and opinion summarization. In Proceedings of the First Text Analysis Conference, TAC-2008.

Chao Shen and Tao Li. 2010. Multi-document summarization via the minimum dominating set. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 984–992, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ian Soboroff and Donna Harman. 2005. Novelty detection: the TREC experience. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 105–112, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09, pages 297–300, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yi Wang. 2011. Distributed Gibbs sampling of latent Dirichlet allocation: The gritty details.

Li Wenjie, Wei Furu, Lu Qin, and He Yanxiang. 2008. PNR2: ranking sentences with positive and negative reinforcement for query-oriented update summarization. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 489–496, Stroudsburg, PA, USA. Association for Computational Linguistics.

Large-Margin Learning of Submodular Summarization Models

Ruben Sipos, Pannaga Shivaswamy and Thorsten Joachims
Dept. of Computer Science, Cornell University, Ithaca, NY 14853 USA

[email protected] [email protected] [email protected]

Abstract

In this paper, we present a supervised learning approach to training submodular scoring functions for extractive multi-document summarization. By taking a structured prediction approach, we provide a large-margin method that directly optimizes a convex relaxation of the desired performance measure. The learning method applies to all submodular summarization methods, and we demonstrate its effectiveness for both pairwise and coverage-based scoring functions on multiple datasets. Compared to state-of-the-art functions that were tuned manually, our method significantly improves performance and enables high-fidelity models with numbers of parameters well beyond what could reasonably be tuned by hand.

1 Introduction

Automatic document summarization is the problem of constructing a short text describing the main points in a (set of) document(s). Example applications range from generating short summaries of news articles to presenting snippets for URLs in web search. In this paper we focus on extractive multi-document summarization, where the final summary is a subset of the sentences from multiple input documents. In this way, extractive summarization avoids the hard problem of generating well-formed natural-language sentences, since only existing sentences from the input documents are presented as part of the summary.

A current state-of-the-art method for document summarization was recently proposed by Lin and Bilmes (2010), using a submodular scoring function based on inter-sentence similarity. On the one hand, this scoring function rewards summaries that are similar to many sentences in the original documents (i.e. it promotes coverage). On the other hand, it penalizes summaries that contain sentences that are similar to each other (i.e. it discourages redundancy). While obtaining the exact summary that optimizes the objective is computationally hard, they show that a greedy algorithm is guaranteed to compute a good approximation. However, their work does not address how to select a good inter-sentence similarity measure, leaving this problem, as well as the selection of an appropriate trade-off between coverage and redundancy, to manual tuning.

To overcome this problem, we propose a supervised learning method that can learn both the similarity measure and the coverage/redundancy trade-off from training data. Furthermore, our learning algorithm is not limited to the model of Lin and Bilmes (2010), but applies to all monotone submodular summarization models. Due to the diminishing-returns property of monotone submodular set functions and their computational tractability, this class of functions provides a rich space for designing summarization methods. To illustrate the generality of our approach, we also provide experiments for a coverage-based model originally developed for diversified information retrieval (Swaminathan et al., 2009).

In general, our method learns a parameterized monotone submodular scoring function from supervised training data, and its implementation is available for download.¹ Given a set of documents and their summaries as training examples, we formulate the learning problem as a structured prediction problem and derive a maximum-margin algorithm in the structural support vector machine (SVM) framework. Note that, unlike other learning approaches, our method does not require a heuristic decomposition of the learning task into binary classification problems (Kupiec et al., 1995), but directly optimizes a structured prediction. This enables our algorithm to directly optimize the desired performance measure (e.g. ROUGE) during training. Furthermore, our method is not limited to linear-chain dependencies like (Conroy and O'Leary, 2001; Shen et al., 2007), but can learn any monotone submodular scoring function.

This ability to easily train summarization models makes it possible to efficiently tune models to various types of document collections. In particular, we find that our learning method can reliably tune models with hundreds of parameters based on a training set of about 30 examples. This increases the fidelity of models compared to their hand-tuned counterparts, showing significantly improved empirical performance. We provide a detailed investigation into the sources of these improvements, identifying further directions for research.

¹ http://www.cs.cornell.edu/~rs/sfour/

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224–233, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Related work

Work on extractive summarization spans a large range of approaches. Starting with unsupervised methods, one of the most widely known approaches is Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). It uses a greedy approach for selection and considers the trade-off between relevance and redundancy. It was later extended (Goldstein et al., 2000) to support multi-document settings by incorporating additional information available in this case. Good results can be achieved by reformulating this as a knapsack packing problem and solving it using dynamic programming (McDonald, 2007). Alternatively, we can use annotated phrases as textual units and select a subset that covers most concepts present in the input (Filatova and Hatzivassiloglou, 2004) (which can also be achieved by our coverage scor-

concept of eigenvector centrality in a graph of sentence similarities. Similarly, TextRank (Mihalcea and Tarau, 2004) is also a graph-based ranking system for the identification of important sentences in a document, using sentence similarity and PageRank (Brin and Page, 1998). Sentence extraction can also be implemented using other graph-based scoring approaches (Mihalcea, 2004), such as HITS (Kleinberg, 1999) and positional power functions. Graph-based methods can also be paired with clustering, as in CollabSum (Wan et al., 2007). This approach first uses clustering to obtain document clusters and then uses a graph-based algorithm for sentence selection that includes inter- and intra-document sentence similarities. Another clustering-based algorithm (Nomoto and Matsumoto, 2001) is a diversity-based extension of MMR that finds diversity by clustering and then proceeds to reduce redundancy by selecting a representative for each cluster.

The manually tuned sentence-pairwise model (Lin and Bilmes, 2010; Lin and Bilmes, 2011) from which we took inspiration is based on budgeted submodular optimization. A summary is produced by maximizing an objective function that includes coverage and redundancy terms. Coverage is defined as the sum of sentence similarities between the selected summary and the rest of the sentences, while redundancy is the sum of pairwise intra-summary sentence similarities. Another approach based on submodularity (Qazvinian et al., 2010) relies on extracting important keyphrases from citation sentences for a given paper and using them to build the summary.

In the supervised setting, several early methods (Kupiec et al., 1995) made independent binary decisions about whether to include a particular sentence in the summary. This ignores dependencies between sentences and can result in high redundancy. The same problem arises when using learning-to-rank approaches such as ranking support vector machines, support vector regression and gradient boosted decision trees to select the most relevant sentences for the summary (Metzler and Kanungo, 2008).
ing function if it is extended with appropriate fea- Introducing some dependencies can improve tures). the performance. One limited way of introduc- A popular stochastic graph-based summariza- ing dependencies between sentences is by using a tion method is LexRank (Erkan and Radev, 2004). linear-chain HMM. The HMM is assumed to pro- It computes sentence importance based on the duce the summary by having a chain transitioning 225 between summarization and non-summarization however it uses vine-growth model and employs states (Conroy and O’leary, 2001) while travers- search to to find the best policy which is then used ing the sentences in a document. A more expres- to generate a summary. sive approach is using a CRF for sequence label- A specific subclass of submodular (but not ing (Shen et al., 2007) which can utilize larger and monotone) functions are defined by Determinan- not necessarily independent feature spaces. The tal Point Processes (DPPs) (Kulesza and Taskar, disadvantage of using linear chain models, how- 2011). While they provide an elegant probabilis- ever, is that they represent the summary as a se- tic interpretation of the resulting summarization quence of sentences. Dependencies between sen- models, the lack of monotonicity means that no tences that are far away from each other cannot efficient approximation algorithms are known for be modeled efficiently. In contrast to such lin- computing the highest-scoring summary. ear chain models, our approach on submodular scoring functions can model long-range depen- 3 Submodular document summarization dencies. In this way our method can use proper- In this section, we illustrate how document sum- ties of the whole summary when deciding which marization can be addressed using submodular set sentences to include in it. functions. The set of documents to be summa- More closely related to our work is that of Li rized is split into a set of individual sentences et al. (2009). 
They use the diversified retrieval method proposed in Yue and Joachims (2008) for document summarization. Moreover, they assume that subtopic labels are available, so that additional constraints for diversity, coverage and balance can be added to the structural SVM learning problem. In contrast, our approach does not require knowledge of subtopics (thus allowing us to apply it to a wider range of tasks) and avoids adding additional constraints (simplifying the algorithm). Furthermore, it can use different submodular objective functions, for example the word coverage and sentence pairwise models described later in this paper.

Other closely related work also takes a max-margin discriminative learning approach, in the structural SVM framework (Berg-Kirkpatrick et al., 2011) or using MIRA (Martins and Smith, 2009), to learn the parameters for summarizing a set of documents. However, they do not consider submodular functions, but instead solve an Integer Linear Program (ILP) or an approximation thereof. The ILP encodes a compression model where arbitrary parts of the parse trees of sentences in the summary can be cut and removed. This allows them to select parts of sentences and yet preserve some grammatical structure. Their work focuses on learning a particular compression model based on ILP inference, while our work explores learning a general and large class of sentence selection models using submodular optimization. A third notable approach uses SEARN (Daumé, 2006) to learn parameters for a joint summarization and compression model; however, it uses a vine-growth model and employs search to find the best policy, which is then used to generate a summary.

3 Submodular document summarization

In this section, we illustrate how document summarization can be addressed using submodular set functions. The set of documents to be summarized is split into a set of individual sentences x = {s_1, ..., s_n}. The summarization method then selects a subset ŷ ⊆ x of sentences that maximizes a given scoring function F_x : 2^x → R subject to a budget constraint (e.g. less than B characters):

    ŷ = argmax_{y ⊆ x} F_x(y)   s.t.  |y| ≤ B    (1)

In the following we restrict the admissible scoring functions F to be submodular.

Definition 1. Given a set x, a function F : 2^x → R is submodular iff for all u ∈ x and all sets s and t such that s ⊆ t ⊆ x, we have

    F(s ∪ {u}) − F(s) ≥ F(t ∪ {u}) − F(t).

Intuitively, this definition says that adding u to a subset s of t increases F at least as much as adding it to t. Using two specific submodular functions as examples, the following sections illustrate how this diminishing-returns property naturally reflects the trade-off between maximizing coverage while minimizing redundancy.

3.1 Pairwise scoring function

The first submodular scoring function we consider was proposed by Lin and Bilmes (2010) and is based on a model of pairwise sentence similarities. It scores a summary y using the following function, which Lin and Bilmes (2010) show is submodular:

    F_x(y) = Σ_{i ∈ x\y, j ∈ y} σ(i, j) − λ Σ_{i, j ∈ y : i ≠ j} σ(i, j)    (2)

In the above equation, σ(i, j) ≥ 0 denotes a measure of similarity between the pair of sentences i and j. The first term in Eq. 2 is a measure of how similar the sentences included in summary y are to the other sentences in x. The second term penalizes y by how similar its sentences are to each other. λ > 0 is a scalar parameter that trades off between the two terms. Maximizing F_x(y) amounts to increasing the similarity of the summary to the excluded sentences while minimizing repetition in the summary. An example is illustrated in Figure 1. In the simplest case, σ(i, j) may be the TFIDF (Salton and Buckley, 1988) cosine similarity, but we will show later how to learn sophisticated similarity functions.

Figure 1: Illustration of the pairwise model. Not all edges are shown for clarity purposes. Edge thickness denotes the similarity score.

3.2 Coverage scoring function

A second scoring function we consider was first proposed for diversified document retrieval (Swaminathan et al., 2009; Yue and Joachims, 2008), but it naturally applies to document summarization as well (Li et al., 2009). It is based on a notion of word coverage, where each word v has some importance weight ω(v) ≥ 0. A summary y covers a word if at least one of its sentences contains the word. The score of a summary is then simply the sum of the word weights it covers (though we could also include a concave discount function that rewards covering a word multiple times (Raman et al., 2011)):

    F_x(y) = Σ_{v ∈ V(y)} ω(v)    (3)

In the above equation, V(y) denotes the union of all words in y. This function is analogous to a maximum coverage problem, which is known to be submodular (Khuller et al., 1999). An example of how a summary is scored is illustrated in Figure 2. Analogous to the definition of the similarity σ(i, j) in the pairwise model, the choice of the word importance function ω(v) is crucial in the coverage model. A simple heuristic is to highly weigh words that occur in many sentences of x, but in few other documents (Swaminathan et al., 2009). However, we will show in the following how to learn ω(v) from training data.

Figure 2: Illustration of the coverage model. Word border thickness represents importance.

3.3 Computing a Summary

Computing the summary that maximizes either of the two scoring functions from above (i.e. Eqns. (2) and (3)) is NP-hard (McDonald, 2007). However, it is known that the greedy Algorithm 1 can achieve a 1 − 1/e approximation to the optimum solution for any linear budget constraint (Lin and Bilmes, 2010; Khuller et al., 1999). Even further, this algorithm provides a 1 − 1/e approximation for any monotone submodular scoring function.

Algorithm 1 Greedy algorithm for finding the best summary ŷ given a scoring function F_x(y).
Parameter: r > 0.
    ŷ ← ∅
    A ← x
    while A ≠ ∅ do
        k ← argmax_{l ∈ A} [F_x(ŷ ∪ {l}) − F_x(ŷ)] / (c_l)^r
        if c_k + Σ_{i ∈ ŷ} c_i ≤ B and F_x(ŷ ∪ {k}) − F_x(ŷ) ≥ 0 then
            ŷ ← ŷ ∪ {k}
        end if
        A ← A \ {k}
    end while

The algorithm starts with an empty summary. In each step, a sentence is added to the summary that results in the maximum relative increase of the objective. The increase is relative to the amount of budget used by the added sentence. The algorithm terminates when the budget B is reached.
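To make the two objectives and the selection procedure concrete, the following Python sketch implements Eq. (2), Eq. (3) and Algorithm 1. It is an illustration only, not the released implementation, and it makes simplifying assumptions of ours: sentences are bags of words, every word weight ω(v) is 1, the cost c_l of a sentence is its word count, and σ(i, j) is raw word overlap rather than a TFIDF or learned similarity.

```python
# Illustrative sketch (not the authors' released code) of the coverage
# objective (Eq. 3), the pairwise objective (Eq. 2) and the budgeted greedy
# selection of Algorithm 1. Assumptions: sentences are word lists, w(v) = 1,
# cost c_l = word count, and sigma(i, j) = word overlap.

def coverage_score(summary):
    """Eq. (3): sum of the weights of distinct covered words, with w(v) = 1."""
    return len(set().union(*map(set, summary))) if summary else 0

def pairwise_score(sentences, summary, lam=0.5):
    """Eq. (2): similarity of the summary to the excluded sentences minus
    lam times the intra-summary similarity (i != j, ordered pairs)."""
    sim = lambda a, b: len(set(a) & set(b))
    excluded = [s for s in sentences if s not in summary]
    cover = sum(sim(i, j) for i in excluded for j in summary)
    redundancy = sum(sim(i, j) for i in summary for j in summary if i is not j)
    return cover - lam * redundancy

def greedy_summary(sentences, score, budget, r=0.3):
    """Algorithm 1: repeatedly pick the sentence with the largest score gain
    relative to its cost raised to the power r, subject to the budget B."""
    chosen, remaining, used = [], list(sentences), 0
    while remaining:
        gain = lambda s: (score(chosen + [s]) - score(chosen)) / len(s) ** r
        k = max(remaining, key=gain)
        if used + len(k) <= budget and score(chosen + [k]) - score(chosen) >= 0:
            chosen.append(k)
            used += len(k)
        remaining.remove(k)
    return chosen

docs = [["a", "b", "c"], ["a", "b"], ["d", "e"]]
print(greedy_summary(docs, coverage_score, budget=5))  # -> [['a', 'b', 'c'], ['d', 'e']]
```

Note that the same greedy routine accepts either objective (e.g. a partial application of `pairwise_score`), which is the sense in which the selection procedure is independent of the particular submodular model being scored.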
Note that the algorithm has a parameter r in the denominator of the selection rule, which Lin and Bilmes (2010) report to have some impact on performance. In the algorithm, c_i represents the cost of a sentence (i.e., its length). Thus, the algorithm actually selects sentences with large marginal utility relative to their length (with the trade-off controlled by the parameter r). Selecting r to be less than 1 gives more importance to "information density" (i.e. sentences that have a higher ratio of score increase per unit length). The 1 − 1/e greedy approximation guarantee holds despite this additional parameter (Lin and Bilmes, 2010). More details on our choice of r and its effects are provided in the experiments section.

4 Learning algorithm

In this section, we propose a supervised learning method for training a submodular scoring function to produce desirable summaries. In particular, for the pairwise and the coverage model, we show how to learn the similarity function σ(i, j) and the word importance weights ω(v) respectively. We parameterize σ(i, j) and ω(v) using a linear model, allowing each to depend on the full set of input sentences x:

    σ_x(i, j) = w^T φ^p_x(i, j)        ω_x(v) = w^T φ^c_x(v)    (4)

In the above equations, w is a weight vector that is learned, and φ^p_x(i, j) and φ^c_x(v) are feature vectors. In the pairwise model, φ^p_x(i, j) may include features like the TFIDF cosine between i and j, or the number of words from the document titles that i and j share. In the coverage model, φ^c_x(v) may include features like a binary indicator of whether v occurs in more than 10% of the sentences in x, or whether v occurs in the document title.

We propose to learn the weights following a large-margin framework using structural SVMs (Tsochantaridis et al., 2005). Structural SVMs learn a discriminant function

    h(x) = argmax_{y ∈ Y} w^T Ψ(x, y)    (5)

that predicts a structured output y given a (possibly also structured) input x. Ψ(x, y) ∈ R^N is called the joint feature map between input x and output y. Note that both submodular scoring functions in Eqns. (2) and (3) can be brought into the form w^T Ψ(x, y) under the linear parametrization in Eq. (4), via Eqns. (6) and (7):

    Ψ^p(x, y) = Σ_{i ∈ x\y, j ∈ y} φ^p_x(i, j) − λ Σ_{i, j ∈ y : i ≠ j} φ^p_x(i, j)    (6)

    Ψ^c(x, y) = Σ_{v ∈ V(y)} φ^c_x(v)    (7)

After this transformation, it is easy to see that computing the maximizing summary in Eq. (1) and the structural SVM prediction rule in Eq. (5) are equivalent.

To learn the weight vector w, structural SVMs require training examples (x_1, y_1), ..., (x_n, y_n) of input/output pairs. In document summarization, however, the "correct" extractive summary is typically not known. Instead, training documents x_i are typically annotated with multiple manual (non-extractive) summaries (denoted by Y_i). To determine a single extractive target summary y_i for training, we find the extractive summary that (approximately) optimizes the ROUGE score – or some other loss function ∆(Y_i, y) – with respect to Y_i:

    y_i = argmin_{y ∈ Y} ∆(Y_i, y)    (8)

We call the y_i determined in this way the "target" summary for x_i. Note that y_i is a greedily constructed approximate target summary based on its proximity to Y_i via ∆. Because of this, we will learn a model that can predict approximately good summaries y_i from x_i. However, we believe that most of the score difference between the manual summaries and y_i (as explored in the experiments section) is due to y_i being an extractive summary, and not due to the greedy construction.

Following the structural SVM approach, we can now formulate the problem of learning w as the following quadratic program (QP):

    min_{w, ξ ≥ 0}  (1/2) ||w||² + (C/n) Σ_{i=1}^{n} ξ_i    (9)

    s.t.  w^T Ψ(x_i, y_i) − w^T Ψ(x_i, ŷ_i) ≥ ∆(ŷ_i, Y_i) − ξ_i,   ∀ŷ_i ≠ y_i, ∀1 ≤ i ≤ n.

The above formulation ensures that the score of the target summary (i.e. w^T Ψ(x_i, y_i)) is larger than the score of any other summary ŷ_i (i.e. w^T Ψ(x_i, ŷ_i)). The objective learns a large-margin weight vector w while trading it off against an upper bound on the empirical loss. The two quantities are traded off with a parameter C > 0.

Even though the QP has exponentially many constraints in the number of sentences in the input documents, it can be solved approximately in polynomial time via a cutting-plane algorithm (Tsochantaridis et al., 2005). The steps of the cutting-plane algorithm are shown in Algorithm 2. In each iteration of the algorithm, for each training document x_i, the summary ŷ_i that most violates the constraint in (9) is found. This is done by finding

    ŷ ← argmax_{y ∈ Y}  w^T Ψ(x_i, y) + ∆(Y_i, y),

for which we use a variant of the greedy procedure in Algorithm 1. After a violating constraint for each training example is added, the resulting quadratic program is solved. These steps are repeated until all the constraints are satisfied to a required precision ε.

Algorithm 2 Cutting-plane algorithm for solving the learning optimization problem.
Parameter: desired tolerance ε > 0.
    ∀i : W_i ← ∅
    repeat
        for all i do
            ŷ ← argmax_y  w^T Ψ(x_i, y) + ∆(Y_i, y)
            if w^T Ψ(x_i, y_i) + ε ≤ w^T Ψ(x_i, ŷ) + ∆(Y_i, ŷ) − ξ_i then
                W_i ← W_i ∪ {ŷ}
                w ← solve QP (9) using constraints W_i
            end if
        end for
    until no W_i has changed during the iteration

Finally, special care has to be taken to appropriately define the loss function ∆, given the disparity of Y_i and y_i. Therefore, we first define an intermediate loss function

    ∆_R(Y, ŷ) = max(0, 1 − ROUGE1F(Y, ŷ)),

based on the ROUGE-1 F score. To ensure that the loss function is zero for the target label as defined in (8), we normalize the above loss as follows:

    ∆(Y_i, ŷ) = max(0, ∆_R(Y_i, ŷ) − ∆_R(Y_i, y_i)),  ∀i.

This loss ∆ was used in our experiments. Thus, training a structural SVM with this loss aims to maximize the ROUGE-1 F score with respect to the manual summaries provided in the training examples, while trading it off against the margin. Note that we could also use a different loss function (as the method is not tied to this particular choice) if we had a different target evaluation metric. Finally, once a w is obtained from structural SVM training, a predicted summary for a test document x can be obtained from (5).

5 Experiments

In this section, we empirically evaluate the approach proposed in this paper. Following Lin and Bilmes (2010), experiments were conducted on two different datasets (DUC '03 and '04). These datasets contain document sets with four manual summaries for each set. For each document set, we concatenated all the articles and split them into sentences using the tool provided with the '03 dataset. For the supervised setting we used 10 resamplings with a random 20/5/5 ('03) and 40/5/5 ('04) train/test/validation split. We determined the best C value in (9) using the performance on each validation set and then report the average performance over the corresponding test sets. Baseline performance (the approach of Lin and Bilmes (2010)) was computed using all 10 test sets as a single test set. For all experiments and datasets, we used r = 0.3 in the greedy algorithm, as recommended in Lin and Bilmes (2010) for the '03 dataset. We find that changing r has only a small influence on performance.²

² Setting r to 1, and thus eliminating the non-linearity, does lower the score (e.g. to 0.38466 for the pairwise model on DUC '03, compared with the results in Figure 3).

The construction of features for learning is organized by word groups. The most trivial group is simply all words (basic). Considering the properties of the words themselves, we constructed several features from properties such as capitalized words, non-stop words and words of a certain length (cap+stop+len). We obtained another set of features from the most frequently occurring words in all the articles (minmax). We also considered the position of the sentence (containing the word) in the article as another feature (location). All those word groups can then be further refined by selecting different thresholds, weighting schemes (e.g. TFIDF) and forming binned variants of these features.

For the pairwise model, we use cosine similarity between sentences computed using only the words in a given word group. For the word coverage model, we create separate features for covering words in the different groups. This gives us fairly comparable feature strength in both models. The only further addition is the use of different word coverage levels in the coverage model. First, we consider how well a sentence covers a word (e.g. a sentence with five instances of the same word might cover it better than another with only a single instance). Second, we look at how important it is to cover a word (e.g. if a word appears in a large fraction of sentences, we might want to be sure to cover it). Combining these two criteria using different thresholds, we get a set of features for each word. Our coverage features are motivated by the approach of Yue and Joachims (2008). In contrast, the hand-tuned pairwise baseline uses only TFIDF-weighted cosine similarity between sentences using all words, following the approach in Lin and Bilmes (2010).

The resulting summaries are evaluated using ROUGE version 1.5.5 (Lin and Hovy, 2003). We selected the ROUGE-1 F measure because it was used by Lin and Bilmes (2010) and because it is one of the commonly used performance scores in recent work. However, our learning method applies to other performance measures as well. Note that we use the ROUGE-1 F measure both for the loss function during learning and for the evaluation of the predicted summaries.

5.1 How does learning compare to manual tuning?

In our first experiment, we compare our supervised learning approach to the hand-tuned approach. The results from this experiment are summarized in Figure 3. First, supervised training of the pairwise model (Lin and Bilmes, 2010) resulted in a statistically significant (p ≤ 0.05 using a paired t-test) increase in performance on both datasets compared to our reimplementation of the manually tuned pairwise model. Note that our reimplementation of the approach of Lin and Bilmes (2010) resulted in slightly different performance numbers than those reported in Lin and Bilmes (2010) – better on DUC '03 and somewhat lower on DUC '04, if evaluated on the same selection of test examples as theirs. We conjecture that this is due to small differences in implementation and/or preprocessing of the dataset. Furthermore, as the authors of Lin and Bilmes (2010) note in their paper, the '03 and '04 datasets behave quite differently.

model       dataset   ROUGE-1 F (stderr)
pairwise    DUC '03   0.3929 (0.0074)
coverage              0.3784 (0.0059)
hand-tuned            0.3571 (0.0063)
pairwise    DUC '04   0.4066 (0.0061)
coverage              0.3992 (0.0054)
hand-tuned            0.3935 (0.0052)

Figure 3: Results obtained on the DUC '03 and '04 datasets using the supervised models. The increase in performance over the hand-tuned model is statistically significant (p ≤ 0.05) for the pairwise model on both datasets, but only on DUC '03 for the coverage model.

Figure 3 also reports the performance for the coverage model as trained by our algorithm. These results can be compared against those for the pairwise model. Since we are using features of comparable strength in both approaches, as well as the same greedy algorithm and structural SVM learning method, this comparison largely reflects the quality of the models themselves. On the '04 dataset both models achieve the same performance, while on '03 the pairwise model performs significantly (p ≤ 0.05) better than the coverage model. Overall, the pairwise model appears to perform slightly better than the coverage model with the datasets and features we used. Therefore, we focus on the pairwise model in the following.

5.2 How fast does the algorithm learn?

Hand-tuned approaches have limited flexibility. Whenever we move to a significantly different collection of documents, we have to reinvest time to retune the model. Learning can make this adaptation to a new collection more automatic and faster – especially since training data has to be collected even for manual tuning.

Figure 4 evaluates how effectively the learning algorithm can make use of a given amount of training data. In particular, the figure shows the learning curve for our approach. Even with very few training examples, the learning approach already outperforms the baseline. Furthermore, at the maximum number of training examples available to us, the curve is still increasing. We therefore conjecture that more data would further improve performance.

Figure 4: Learning curve for the pairwise model on the DUC '04 dataset, showing ROUGE-1 F scores for different numbers of training examples (logarithmic scale). The dashed line represents the performance of the hand-tuned model.

5.3 Where is room for improvement?

To get a rough estimate of what is actually achievable in terms of the final ROUGE-1 F score, we looked at different "upper bounds" under various scenarios (Figure 5). First, the ROUGE score is computed using four manual summaries from different assessors, so that we can estimate the inter-subject disagreement. If one computes the ROUGE score of a held-out summary against the remaining three summaries, the resulting performance is given in the row labeled human of Figure 5. It provides a reasonable estimate of human performance.

Second, in extractive summarization we restrict summaries to sentences from the documents themselves, which is likely to lead to a reduction in ROUGE. To estimate this drop, we use the greedy algorithm to select the extractive summary that maximizes ROUGE on the test documents. The resulting performance is given in the row extractive of Figure 5. On both datasets, the drop in performance for this (approximately³) optimal extractive summary is about 10 points of ROUGE.

³ We compared the greedy algorithm with exhaustive search for up to three selected sentences (more than that would take too long). In about half the cases we got the same solution; in the other cases the solution was on average about 1% below optimal, confirming that greedy selection works quite well.

Third, we expect some drop in performance because our model may not be able to fit the optimal extractive summaries due to a lack of expressiveness. This can be estimated by looking at training set performance, as reported in the row model fit of Figure 5. On both datasets, we see a drop of about 5 points of ROUGE performance. Adding more and better features might help the model fit the data better.

Finally, a last drop in performance may come from overfitting. The test set ROUGE scores are given in the row prediction of Figure 5. Note that the drop between training and test performance is rather small, so overfitting is not an issue and is well controlled in our algorithm. We therefore conclude that increasing model fidelity seems like a promising direction for further improvements.

bound       dataset   ROUGE-1 F
human       DUC '03   0.56235
extractive            0.45497
model fit             0.40873
prediction            0.39294
human       DUC '04   0.55221
extractive            0.45199
model fit             0.40963
prediction            0.40662

Figure 5: Upper bounds on ROUGE-1 F scores: agreement between manual summaries, greedily computed best extractive summaries, best model fit on the training set (using the best C value), and the test scores of the pairwise model.

5.4 Which features are most useful?

To understand which features affected the final performance of our approach, we assessed the strength of each set of our features. In particular, we looked at how the final test score changes when we removed certain feature groups (described in the beginning of Section 5), as shown in Figure 6.
that maximizes ROUGE on the test documents. The most important group of features are the The resulting performance is given in the row ex- basic features (pure cosine similarity between tractive of Figure 5. On both dataset, the drop sentences) since removing them results in the in performance for this (approximately3 ) optimal largest drop in performance. However, other fea- 3 tures play a significant role too (i.e. only the ba- We compared the greedy algorithm with exhaustive sic ones are not enough to achieve good perfor- search for up to three selected sentences (more than that would take too long). In about half the cases we got the same below optimal confirming that greedy selection works quite solution, in other cases the soultion was on average about 1% well. 231 mance). This confirms that performance can be was 0.4010, which is slightly but not significantly improved by adding richer fatures instead of us- lower than the 0.4066 obtained with four sum- ing only a single similarity score as in Lin and maries (as shown on Figure 3). Similarly, on DUC Bilmes (2010). Using learning for these complex ’03 the performance drop from 0.3929 to 0.3838 model is essential, since hand-tuning is likely to was not significant as well. be intractable. Based on those results, we conjecture that hav- The second most important group of features ing more documents sets with only a single man- considering the drop in performance (i.e. loca- ual summary is more useful for training than tion) looks at positions of sentences in the arti- fewer training examples with better labels (i.e. cles. This makes intuitive sense because the first multiple summaries). In both cases, we spend sentences in news articles are usually packed with approximately the same amount of effort (as the information. The other three groups do not have a summaries are the most expensive component of significant impact on their own. 
the training data), however having more training examples helps (according to the learning curve removed ROUGE-1 F presented before) while spending effort on multi- group ple summaries appears to have only minor benefit none 0.40662 for training. basic 0.38681 all except basic 0.39723 6 Conclusions location 0.39782 This paper presented a supervised learning ap- sent+doc 0.39901 proach to extractive document summarization cap+stop+len 0.40273 based on structual SVMs. The learning method minmax 0.40721 applies to all submodular scoring functions, rang- ing from pairwise-similarity models to coverage- Figure 6: Effects of removing different feature groups on the DUC ’04 dataset. Bold font marks significant based approaches. The learning problem is for- difference (p ≤ 0.05) when compared to the full pari- mulated into a convex quadratic program and was wise model. The most important are basic similar- then solved approximately using a cutting-plane ity features including all words (similar to (Lin and method. In an empirical evaluation, the structural Bilmes, 2010)). The last feature group actually low- SVM approach significantly outperforms conven- ered the score but is included in the model because we tional hand-tuned models on the DUC ’03 and only found this out later on DUC ’04 dataset. ’04 datasets. A key advantage of the learn- ing approach is its ability to handle large num- 5.5 How important is it to train with bers of features, providing substantial flexibility multiple summaries? for building high-fidelity summarization models. Furthermore, it shows good control of overfitting, While having four manual summaries may be im- making it possible to train models even with only portant for computing a reliable ROUGE score a few training examples. for evaluation, it is not clear whether such an ap- proach is the most efficient use of annotator re- Acknowledgments sources for training. 
In our final experiment, we trained our method using only a single manual summary for each set of documents. When using only a single manual summary, we arbitrarily took the first one out of the provided four reference summaries and used only it to compute the target label for training (instead of using average loss towards all four of them). Otherwise, the experimental setup was the same as in the previous subsections, using the pairwise model. For DUC '04, the ROUGE-1 F score obtained using only a single summary per document set

Acknowledgments

We thank Claire Cardie and the members of the Cornell NLP Seminar for their valuable feedback. This research was funded in part through NSF Awards IIS-0812091 and IIS-0905467.

References

T. Berg-Kirkpatrick, D. Gillick and D. Klein. Jointly Learning to Extract and Compress. In Proceedings of ACL, 2011.
S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of WWW, 1998.
J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR, 1998.
J. M. Conroy and D. P. O'Leary. Text summarization via hidden Markov models. In Proceedings of SIGIR, 2001.
H. Daumé III. Practical Structured Learning Techniques for Natural Language Processing. Ph.D. Thesis, 2006.
G. Erkan and D. R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In Journal of Artificial Intelligence Research, Vol. 22, 2004, pp. 457–479.
E. Filatova and V. Hatzivassiloglou. Event-Based Extractive Summarization. In Proceedings of ACL Workshop on Summarization, 2004.
T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of ICML, 2008.
D. Gillick and Y. Liu. A scalable global model for summarization. In Proceedings of ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.
J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, 2000.
S. Khuller, A. Moss and J. Naor. The budgeted maximum coverage problem. In Information Processing Letters, Vol. 70, Issue 1, 1999, pp. 39–45.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Journal of the ACM, Vol. 46, Issue 5, 1999, pp. 604–632.
A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Proceedings of UAI, 2011.
J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of SIGIR, 1995.
L. Li, K. Zhou, G. Xue, H. Zha, and Y. Yu. Enhancing Diversity, Coverage and Balance for Summarization through Structure Learning. In Proceedings of WWW, 2009.
H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL-HLT, 2010.
H. Lin and J. Bilmes. A Class of Submodular Functions for Document Summarization. In Proceedings of ACL-HLT, 2011.
C. Y. Lin and E. Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of NAACL, 2003.
F. T. Martins and N. A. Smith. Summarization with a joint model for sentence extraction and compression. In Proceedings of ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.
R. McDonald. A Study of Global Inference Algorithms in Multi-document Summarization. In Advances in Information Retrieval, Lecture Notes in Computer Science, 2007, pp. 557–564.
D. Metzler and T. Kanungo. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of SIGIR, 2008.
R. Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, 2004.
R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP, 2004.
T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of SIGIR, 2001.
V. Qazvinian, D. R. Radev, and A. Özgür. Citation Summarization Through Keyphrase Extraction. In Proceedings of COLING, 2010.
K. Raman, T. Joachims and P. Shivaswamy. Structured Learning of Two-Level Dynamic Rankings. In Proceedings of CIKM, 2011.
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, 1988, pp. 513–523.
D. Shen, J. T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, 2007.
A. Swaminathan, C. V. Mathew and D. Kirovski. Essential Pages. In Proceedings of WI-IAT, IEEE Computer Society, 2009.
I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. Large margin methods for structured and interdependent output variables. In Journal of Machine Learning Research, Vol. 6, 2005, pp. 1453–1484.
X. Wan, J. Yang, and J. Xiao. CollabSum: Exploiting multiple document clustering for collaborative single document summarizations. In Proceedings of SIGIR, 2007.
Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In Proceedings of ICML, 2008.

A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings

Tom Kwiatkowski∗†  Sharon Goldwater∗  Luke Zettlemoyer†  Mark Steedman∗

[email protected] [email protected] [email protected] [email protected]

∗ ILCC, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
† Computer Science & Engineering, University of Washington, Seattle, WA, 98195, USA

Abstract

This paper presents an incremental probabilistic learner that models the acquisition of syntax and semantics from a corpus of child-directed utterances paired with possible representations of their meanings. These meaning representations approximate the contextual input available to the child; they do not specify the meanings of individual words or syntactic derivations. The learner then has to infer the meanings and syntactic properties of the words in the input along with a parsing model. We use the CCG grammatical framework and train a non-parametric Bayesian model of parse structure with online variational Bayesian expectation maximization. When tested on utterances from the CHILDES corpus, our learner outperforms a state-of-the-art semantic parser. In addition, it models such aspects of child acquisition as "fast mapping," while also countering previous criticisms of statistical syntactic learners.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234–244, Avignon, France, April 23 - 27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Children learn language by mapping the utterances they hear onto what they believe those utterances mean. The precise nature of the child's prelinguistic representation of meaning is not known. We assume for present purposes that it can be approximated by compositional logical representations such as (1), where the meaning is a logical expression that describes a relationship have between the person you refers to and the object another(x, cookie(x)):

    Utterance : you have another cookie                    (1)
    Meaning   : have(you, another(x, cookie(x)))

Most situations will support a number of plausible meanings, so the child has to learn in the face of propositional uncertainty¹, from a set of contextually afforded meaning candidates, as here:

    Utterance : you have another cookie
    Candidate   { have(you, another(x, cookie(x))),
    Meanings      eat(you, your(x, cake(x))),
                  want(i, another(x, cookie(x))) }

The task is then to learn, from a sequence of such (utterance, meaning-candidates) pairs, the correct lexicon and parsing model. Here we present a probabilistic account of this task with an emphasis on cognitive plausibility.

Our criteria for plausibility are that the learner must not require any language-specific information prior to learning, and that the learning algorithm must be strictly incremental: it sees each training instance sequentially and exactly once. We define a Bayesian model of parse structure with Dirichlet process priors and train this on a set of (utterance, meaning-candidates) pairs derived from the CHILDES corpus (MacWhinney, 2000) using online variational Bayesian EM.

¹ Similar to referential uncertainty but relating to propositions rather than referents.

We evaluate the learnt grammar in three ways. First, we test the accuracy of the trained model in parsing unseen utterances onto gold standard annotations of their meaning. We show that it outperforms a state-of-the-art semantic parser (Kwiatkowski et al., 2010) when run with similar training conditions (i.e., neither system is given the corpus-based initialization originally used by Kwiatkowski et al.). We then examine the learning curves of some individual words, showing that the model can learn word meanings on the basis of a single exposure, similar to the fast mapping phenomenon observed in children (Carey and Bartlett, 1978). Finally, we show that our learner captures the step-like learning curves for word order regularities that Thornton and Tesan (2007) claim children show.
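Compositional logical representations like (1) can be mimicked with ordinary closures; a toy sketch in Python, where logical terms are nested tuples and λ-abstraction is a Python lambda (this encoding is purely illustrative, not the paper's implementation):

```python
# Toy encoding of the logical representation in (1):
# terms are nested tuples, lambda abstraction is a Python closure.

cookie = lambda x: ("cookie", x)                      # λx.cookie(x)
another = lambda f: ("another", "x", f("x"))          # λf.another(x, f(x))
have = lambda obj: lambda subj: ("have", subj, obj)   # λx.λy.have(y, x)

# Function application assembles the utterance meaning bottom-up:
meaning = have(another(cookie))("you")
print(meaning)  # -> ('have', 'you', ('another', 'x', ('cookie', 'x')))
```

The same application operation is what lets fragments of meaning compose with one another, which is the property the grammar induction below relies on.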
This result counters Thornton and Tesan's criticism of statistical grammar learners—that they tend to exhibit gradual learning curves rather than the abrupt changes in linguistic competence observed in children.

1.1 Related Work

Models of syntactic acquisition, whether they have addressed the task of learning both syntax and semantics (Siskind, 1992; Villavicencio, 2002; Buttery, 2006) or syntax alone (Gibson and Wexler, 1994; Sakas and Fodor, 2001; Yang, 2002), have aimed to learn a single, correct, deterministic grammar. With the exception of Buttery (2006) they also adopt the Principles and Parameters grammatical framework, which assumes detailed knowledge of linguistic regularities². Our approach contrasts with all previous models in assuming a very general kind of linguistic knowledge and a probabilistic grammar. Specifically, we use the probabilistic Combinatory Categorial Grammar (CCG) framework, and assume only that the learner has access to a small set of general combinatory schemata and a functional mapping from semantic type to syntactic category. Furthermore, this paper is the first to evaluate a model of child syntactic-semantic acquisition by parsing unseen data.

² This linguistic use of the term "parameter" is distinct from the statistical use found elsewhere in this paper.

Models of child word learning have focused on semantics only, learning word meanings from utterances paired with either sets of concept symbols (Yu and Ballard, 2007; Frank et al., 2008; Fazly et al., 2010) or a compositional meaning representation of the type used here (Siskind, 1996). The models of Alishahi and Stevenson (2008) and Maurits et al. (2009) learn, as well as word meanings, orderings for verb-argument structures, but not the full parsing model that we learn here.

Semantic parser induction as addressed by Zettlemoyer and Collins (2005, 2007, 2009), Kate and Mooney (2007), Wong and Mooney (2006, 2007), Lu et al. (2008), Chen et al. (2010), Kwiatkowski et al. (2010, 2011) and Börschinger et al. (2011) has the same task definition as the one addressed by this paper. However, the learning approaches presented in those previous papers are not designed to be cognitively plausible, using batch training algorithms, multiple passes over the data, and language-specific initialisations (lists of noun phrases and additional corpus statistics), all of which we dispense with here. In particular, our approach is closely related to that of Kwiatkowski et al. (2010) but, whereas that work required careful initialisation and multiple passes over the training data to learn a discriminative parsing model, here we learn a generative parsing model without either.

1.2 Overview of the approach

Our approach takes, as input, a corpus of (utterance, meaning-candidates) pairs {(si, {m}i) : i = 1, . . . , N}, and learns a CCG lexicon Λ and the probability of each production a → b that could be used in a parse. Together, these define a probabilistic parser that can be used to find the most probable meaning for any new sentence.

We learn both the lexicon and production probabilities from allowable parses of the training pairs. The set of allowable parses {t} for a single (utterance, meaning-candidates) pair consists of those parses that map the utterance onto one of the meanings. This set is generated with the functional mapping T:

    {t} = T(s, m),                                         (2)

which is defined, following Kwiatkowski et al. (2010), using only the CCG combinators and a mapping from semantic type to syntactic category (presented in Section 4).

The CCG lexicon Λ is learnt by reading off the lexical items used in all parses of all training pairs. Production probabilities are learnt in conjunction with Λ through the use of an incremental parameter estimation algorithm, online Variational Bayesian EM, as described in Section 5.

Before presenting the probabilistic model, the mapping T, and the parameter training algorithm, we first provide some background on the meaning representations we use and on CCG.

2 Background

2.1 Meaning Representations

We represent the meanings of utterances in first-order predicate logic using the lambda-calculus. An example logical expression (henceforth also referred to as a lambda expression) is:

    like(eve, mummy)                                       (3)

which expresses a logical relationship like between the entity eve and the entity mummy. In Section 6.1 we will see how logical expressions like this are created for a set of child-directed utterances (to use in training our model).

The lambda-calculus uses λ operators to define functions. These may be used to represent functional meanings of utterances, but they may also be used as a 'glue language', to compose elements of first-order logical expressions. For example, the function λxλy.like(y, x) can be combined with the object mummy to give the phrasal meaning λy.like(y, mummy) through the lambda-calculus operation of function application.

2.2 CCG

Combinatory Categorial Grammar (CCG; Steedman 2000) is a strongly lexicalised linguistic formalism that tightly couples syntax and semantics. Each CCG lexical item in the lexicon Λ is a triple, written as word ⊢ syntactic category : logical expression. Examples are:

    You  ⊢ NP : you
    read ⊢ S\NP/NP : λxλy.read(y, x)
    the  ⊢ NP/N : λf.the(x, f(x))
    book ⊢ N : λx.book(x)

A full CCG category X : h has syntactic category X and logical expression h. Syntactic categories may be atomic (e.g., S or NP) or complex (e.g., (S\NP)/NP). Slash operators in complex categories define functions from the range on the right of the slash to the result on the left, in much the same way as lambda operators do in the lambda-calculus. The direction of the slash defines the linear order of function and argument.

CCG uses a small set of combinatory rules to concurrently build syntactic parses and semantic representations. Two example combinatory rules are forward (>) and backward (<) application:

    X/Y : f    Y : g    ⇒  X : f(g)     (>)
    Y : g    X\Y : f    ⇒  X : f(g)     (<)

Given the lexicon above, the phrase "You read the book" can be parsed using these rules, as illustrated in Figure 1 (with additional notation discussed in the following section).

CCG also includes combinatory rules of forward (> B) and backward (< B) composition:

    X/Y : f    Y/Z : g   ⇒  X/Z : λx.f(g(x))    (> B)
    Y\Z : g    X\Y : f   ⇒  X\Z : λx.f(g(x))    (< B)

3 Modelling Derivations

The objective of our learning algorithm is to learn the correct parameterisation of a probabilistic model P(s, m, t) over (utterance, meaning, derivation) triples. This model assigns a probability to each of the grammar productions a → b used to build the derivation tree t. The probability of any given CCG derivation t with sentence s and semantics m is calculated as the product of all of its production probabilities:

    P(s, m, t) = ∏_{a→b ∈ t} P(b|a)                        (4)

For example, the derivation in Figure 1 contains 13 productions, and its probability is the product of the 13 production probabilities. Grammar productions may be either syntactic—used to build a syntactic derivation tree—or lexical—used to generate logical expressions and words at the leaves of this tree.

A syntactic production Ch → R expands a head node Ch into a result R that is either an ordered pair of syntactic parse nodes ⟨Cl, Cr⟩ (for a binary production) or a single parse node (for a unary production). Only two unary syntactic productions are allowed in the grammar: START → A, to generate A as the top syntactic node of a parse tree, and A → [A]lex, to indicate that A is a leaf node in the syntactic derivation and should be used to generate a logical expression and word. Syntactic derivations are built by recursively applying syntactic productions to non-leaf nodes in the derivation tree. Each syntactic production Ch → R has conditional probability P(R|Ch). There are 3 binary and 5 unary syntactic productions in Figure 1.

Lexical productions have two forms. Logical expressions are produced from leaf nodes in the syntactic derivation tree, Alex → m, with conditional probability P(m|Alex). Words are then produced from these logical expressions with conditional probability P(w|m). An example logical production from Figure 1 is [NP]lex → you. An example word production is you → You.

Every production a → b used in a parse tree t is chosen from the set of productions that could be used to expand a head node a. If there are a finite K productions that could expand a, then a K-dimensional Multinomial distribution parameterised by θa can be used to model the categorical
In this section we assume that s However, before training a model of language ac- is paired at each point with only a single meaning quisition the dimensionality and contents of both m. Later we will show how T is used multiple the syntactic grammar and lexicon are unknown. times to create the set of parses consistent with s In order to maintain a probability model with and a set of candidate meanings {m}. cover over the countably infinite number of pos- The splitting procedure takes as input a CCG sible productions, we define a Dirichlet Process category X : h, such as NP : a(x, cookie(x)), and (DP) prior for each possible production head a. returns a set of category splits. Each category split For the production head a, DP (αa , Ha ) assigns is a pair of CCG categories (Cl : ml , Cr : mr ) that some probability mass to all possible production can be recombined to give X : h using one of the targets {b} covered by the base distribution Ha . CCG combinators in Section 2.2. The CCG cat- It is possible to use the DP as an infinite prior egory splitting procedure has two parts: logical from which the parameter set of a finite dimen- splitting of the category semantics h; and syntac- sional Multinomial may be drawn provided that tic splitting of the syntactic category X. Each logi- we can choose a suitable partition of {b}. When cal split of h is a pair of lambda expressions (f, g) calculating the probability of an (s, m, t) triple, in the following set: the choice of this partition is easy. For any given production head a there is a finite set of usable {(f, g) | h = f (g) ∨ h = λx.f (g(x))}, (8) production targets {b1 , . . . , bk−1 } in t. We create a partition that includes one entry for each of these which means that f and g can be recombined us- along with a final entry {bk , . . . } that includes all ing either function application or function com- other ways in which a could be expanded in dif- position to give the original lambda expression ferent contexts. 
Then, by applying the distribution h. An example split of the lambda expression Ga drawn from the DP to this partition, we get a h = a(x, cookie(x)) is the pair parameter vector θa that is equivalent to a draw from a k dimensional Dirichlet distribution: (λy.a(x, y(x)), λx.cookie(x)), (9) Ga ∼ DP (αa , Ha ) (6) where λy.a(x, y(x)) applied to λx.cookie(x) re- θa = (Ga (b1 ), . . . , Ga (bk−1 ), Ga ({bk , . . . }) turns the original expression a(x, cookie(x)). ∼ Dir(αa H(b1 ), . . . , αa Ha (bk−1 ), (7) Syntactic splitting assigns linear order and syn- αa Ha ({bk , . . . })) tactic categories to the two lambda expressions f and g. The initial syntactic category X is split by Together, Equations 4-7 describe the joint distri- a reversal of the CCG application combinators in bution P (X, S, θ) over the observed training data Section 2.2 if f and g can be recombined to give 237 Syntactic Category Semantic Type Example Phrase Sdcl hev, ti I took it ` Sdcl :λe.took(i, it, e) St t I0 m angry ` St :angry(i) Swh he, hev, tii Who took it? ` Swh :λxλe.took(x, it, e) Sq hev, ti Did you take it? ` Sq :λe.Q(take(you, it, e)) N he, ti cookie ` N:λx.cookie(x) NP e John ` NP:john PP hev, ti on John ` PP:λe.on(john, e) Figure 2: Atomic Syntactic Categories. h with function application: T cycles over all cell entries in increasingly small spans and populates the chart with their splits. For {(X/Y : f Y : g), (10) any cell entry X : h spanning more than one word (Y : g : X\Y : f )|h = f (g)} T generates a set of pairs representing the splits of X:h. For each split (Cl :ml , Cr :mr ) and every bi- or by a reversal of the CCG composition combi- nary partition (wi:k , wk:j ) of the word-span T cre- nators if f and g can be recombined to give h with ates two new cell entries in the chart: (Cl : ml )i:k function composition: and (Cr :mr )k:j . {(X/Z : f Z/Y : g, (11) Input : Sentence [w1 , . . . 
, wn ], top node Cm :m (Z\Y : g : X\Z : f )|h = λx.f (g(x))} Output: Packed parse chart Ch containing {t} Ch = [ [{}1 , . . . , {}n ]1 , . . . , [{}1 , . . . , {}n ]n ] Unknown category names in the result of a Ch[1][n − 1] = Cm :m split (Y in (10) and Z in (11)) are labelled via a for i = n, . . . , 2; j = 1 . . . (n − i) + 1 do for X:h ∈ Ch[j][i] do functional mapping cat from semantic type T to for (Cl :ml , Cr :mr ) ∈ split(X:h) do syntactic category: for k = 1, . . . , i − 1 do   Ch[j][k] ← Cl :ml  Atomic(T ) if T ∈ Figure 2  Ch[j + k][i − k] ← Cr :mr cat(T ) = cat(T1 )/cat(T2 ) if T = hT1 , T2 i cat(T1 )\cat(T2 ) if T = hT1 , T2 i   Algorithm 1: Generating {t} with T . which uses the Atomic function illustrated in Figure 2 to map semantic-type to basic CCG Algorithm 1 shows how the learner uses T to syntactic category. As an example, the logical generate a packed chart representation of {t} in split in (9) supports two CCG category splits, one the chart Ch. The function T massively overgen- for each of the CCG application rules. erates parses for any given natural language. The probabilistic parsing model introduced in Sec- (NP/N:λy.a(x, y(x)), N:λx.cookie(x)) (12) tion 3 is used to choose the best parse from the (N:λx.cookie(x), NP\N:λy.a(x, y(x))) (13) overgenerated set. The parse generation algorithm T uses the func- 5 Training tion split to generate all CCG category pairs that 5.1 Parameter Estimation are an allowed split of an input category X:h: The probabilistic model of the grammar describes {(Cl :ml , Cr :mr )} = split(X:h), a distribution over the observed training data X, latent variables S, and parameters θ. The goal of and then packs a chart representation of {t} in a training is to estimate the posterior distribution: top-down fashion starting with a single cell entry p(S, X|θ)p(θ) Cm : m for the top node shared by all parses {t}. 
p(S, θ|X) = (14) p(X) For the utterance and meaning in (1) the top parse node, spanning the entire word-string, is which we do with online Variational Bayesian Ex- pectation Maximisation (oVBEM; Sato (2001), S:have(you, another(x, cookie(x))). Hoffman et al. (2010)). oVBEM is an online 238 Bayesian extension of the EM algorithm that Input : Corpus D = {(si , {m}i )|i = 1, . . . , N }, accumulates observation pseudocounts na→b for Function T , Semantics to syntactic cate- each of the productions a → b in the grammar. gory mapping cat, function lex to read These pseudocounts define the posterior over pro- lexical items off derivations. duction probabilities as follows: Output: Lexicon Λ, Pseudocounts {na→b }. (θa→b1 , . . . , θa→b{k,... } )) | X, S ∼ (15) Λ = {}, {t} = {} ∞ for i = 1, . . . , N do X {t}i = {} Dir(αH(b1 ) + na→b1 , . . . , αH(bj ) + na→bj ) for m0 ∈ {m}i do j=k Cm0 = cat(m0 ) These pseudocounts are computed in two steps: {t}0 = T (si , Cm0 :m0 ) {t}i = {t}i ∪ {t}0 , {t} = {t} ∪ {t}0 oVBE-step For the training pair (si , {m}i ) Λ = Λ ∪ lex ({t}0 ) which supports the set of parses {t}, the expec- for a → b ∈ {t} do i−1 tation E{t} [a → b] of each production a → b is nia→b = na→b + ηi (N × E{t}i [a → b] − i−1 calculated by creating a packed chart representa- na→b ) tion of {t} and running the inside-outside algo- Algorithm 2: Learning Λ and {na→b } rithm. This is similar to the E-step in standard EM apart from the fact that each production is the parameter update step cycles over all produc- scored with the current expectation of its parame- tions in {t} it is not neccessary to store {t}, just ter weight θˆa→b i−1 , where: the set of productions that it uses. i−1 eΨ(αa Ha (a→b)+na→b ) 6 Experimental Setup θˆa→b i−1 = P K i−1 (16) Ψ 0 {b0 } αa Ha (a→b )+na→b0 e 6.1 Data and Ψ is the digamma function (Beal, 2003). 
The Eve corpus, collected by Brown (1973), con- oVBM-step The expectations from the oVBE tains 14, 124 English utterances spoken to a sin- step are used to update the pseudocounts in Equa- gle child between the ages of 18 and 27 months. tion 15 as follows, These have been hand annotated by Sagae et al. (2004) with labelled syntactic dependency graphs. nia→b = ni−1 i−1 An example annotation is shown in Figure 3. a→b + ηi (N × E{t} [a → b] − na→b ) (17) While these annotations are designed to rep- where ηi is the learning rate and N is the size of resent syntactic information, the parent-child re- the dataset. lationships in the parse can also be viewed as a proxy for the predicate-argument structure of the 5.2 The Training Algorithm semantics. We developed a template based de- Now the training algorithm used to learn the lex- terministic procedure for mapping this predicate- icon Λ and pseudocounts {na→b } can be defined. argument structure onto logical expressions of the The algorithm, shown in Algorithm 2, passes over type discussed in Section 2.1. For example, the the training data only once and one training in- dependency graph in Figure 3 is automatically stance at a time. For each (si , {m}i ) it uses the transformed into the logical expression function T |{m}i | times to generate a set of con- sistent parses {t}0 . The lexicon is populated by λe.have(you,another(y, cookie(y)), e) (18) using the lex function to read all of the lexical ∧ on(the(z, table(z)), e), items off from the derivations in each {t}0 . In the parameter update step, the training algorithm where e is a Davidsonian event variable used to updates the pseudocounts associated with each of deal with adverbial and prepositional attachments. the productions a → b that have ever been seen The deterministic mapping to logical expressions during training according to Equation (17). 
uses 19 templates, three of which are used in this Only non-zero pseudocounts are stored in our example: one for the verb and its arguments, one model. The count vector is expanded with a new for the prepositional attachment and one (used entry every time a new production is used. While twice) for the quantifier-noun constructions. 239 SUBJ ROOT DET OBJ JCT DET POBJ pro|you v|have qn|another n|cookie prep|on det|the n|table You have another cookie on the table Figure 3: Syntactic dependency graph from Eve corpus. This mapping from graph to logical expression makes use of a predefined dictionary of allowed, 0.8 typed, logical constants. The mapping is success- 0.7 ful for 31% of the child-directed utterances in the Eve corpus3 . The remaining data is mostly ac- 0.6 counted for by one-word utterances that have no 0.5 Accuracy straightforward interpretation in our typed logi- 0.4 cal language (e.g. what; okay; alright; no; yeah; hmm; yes; uhhuh; mhm; thankyou), missing ver- 0.3 bal arguments that cannot be properly guessed 0.2 from the context (largely in imperative sentences Our Approach UBL1 0.1 such as drink the water), and complex noun con- Our Approach + Guess UBL10 structions that are hard to match with a small set 0.0 0.0 0.2 0.4 0.6 0.8 1.0 of templates (e.g. as top to a jar). We also re- Proportion of Data Seen move the small number of utterances containing Figure 4: Meaning Prediction: Train on files 1, . . . , n more than 10 words for reasons of computational test on file n + 1. efficiency (see discussion in Section 8). Following Alishahi and Stevenson (2010), we 7 Experiments generate a context set {m}i for each utterance si by pairing that utterance with its correct logical 7.1 Parsing Unseen Sentences expression along with the logical expressions of We test the parsing model that is learnt by training the preceding and following (|{m}i | − 1)/2 utter- on the first i files of the longitudinally ordered Eve ances. 
corpus and testing on file i + 1, for i = 1 . . . 19. For each utterance s0 in the test file we use the 6.2 Base Distributions and Learning Rate parsing model to predict a meaning m∗ and com- Each of the production heads a in the grammar pare this to the target meaning m0 . We report the requires a base distribution Ha and concentration proportion of utterances for which the prediction parameter αa . For word-productions the base dis- m∗ is returned correctly both with and without tribution is a geometric distribution over character word-meaning guessing. When a word has never strings and spaces. For syntactic-productions the been seen at training time our parser has the abil- base distribution is defined in terms of the new ity to ‘guess’ a typed logical meaning with place- category to be named by cat and the probability holders for constant and predicate names. of splitting the rule by reversing either the appli- For comparison we use the UBL semantic cation or composition combinators. parser of Kwiatkowski et al. (2010) trained in Semantic-productions’ base distributions are a similar setting—i.e., with no language specific defined by a probabilistic branching process con- initialisation4 . Figure 4 shows accuracy for our ditioned on the type of the syntactic category. approach with and without guessing, for UBL This distribution prefers less complex logical ex- 4 Kwiatkowski et al. (2010) initialise lexical weights in pressions. All concentration parameters are set to their learning algorithm using corpus-wide alignment statis- 1.0. The learning rate for parameter updates is tics across words and meaning elements. Instead we run ηi = (0.8 + i)−0.5 . UBL with small positive weight for all lexical items. When run with Giza++ parameter initialisations, U BL10 achieves 3 Data available at www.tomkwiat.com/resources.html 48.1% across folds compared to 49.2% for our approach. 
240 when run over the training data once (UBL1 ) and Figure 5 shows the posterior probability of the for UBL when run over the training data 10 times correct meanings for the quantifiers ‘a’, ‘another’ (UBL10 ) as in Kwiatkowski et al. (2010). Each and ‘any’ over the course of training with 1, 3, of the points represents accuracy on one of the 5 and 7 candidate meanings for each utterance5 . 19 test files. All of these results are from parsers These three words are all of the same class but trained on utterances paired with a single candi- have very different frequencies in the training date meaning. The lines of best fit show the up- subset shown (168, 10 and 2 respectively). In all ward trend in parser performance over time. training settings, the word ‘a’ is learnt gradually Despite only seeing each training instance from many observations but the rarer words ‘an- once, our approach, due to its broader lexi- other’ and ‘any’ are learnt (when they are learnt) cal search strategy, outperforms both versions of through large updates to the posterior on the ba- UBL which performs a greedy search in the space sis of few observations. These large updates re- of lexicons and requires initialisation with co- sult from a syntactic bootstrapping effect (Gleit- occurence statistics between words and logical man, 1990). When the model has great confidence constants to guide this search. These statistics are about the derivation in which an unseen lexical not justified in a model of language acquisition item occurs, the pseudocounts for that lexical item and so they are not used here. The low perfor- get a large update under Equation 17. This large mance of all systems is due largely to the sparsity update has a greater effect on rare words which of the data with 32.9% of all sentences containing are associated with small amounts of probability a previously unseen word. mass than it does on common ones that have al- ready accumulated large pseudocounts. 
The fast learning of rare words later in learning correlates with observations of word learning in children.

7.2 Word Learning

Due to the sparsity of the data, the training algorithm needs to be able to learn word-meanings on the basis of very few exposures. This is also a desirable feature from the perspective of modelling language acquisition, as Carey and Bartlett (1978) have shown that children have the ability to learn word meanings on the basis of one, or very few, exposures through the process of fast mapping.5

7.3 Word Order Learning

Figure 6 shows the posterior probability of the correct SVO word order learnt from increasing amounts of training data. This is calculated by summing over all lexical items containing transitive verb semantics and sampling in the space of parse trees that could have generated them. With no propositional uncertainty in the training data the correct word order is learnt very quickly and stabilises. As the amount of propositional uncertainty increases, the rate at which this rule is learnt decreases. However, even in the face of ambiguous training data, the model can learn the correct word-order rule. The distribution over word orders also exhibits initial uncertainty, followed by a sharp convergence to the correct analysis. This ability to learn syntactic regularities abruptly means that our system is not subject to the criticisms that Thornton and Tesan (2007) levelled at statistical models of language acquisition—that their learning rates are too gradual.

[Figure 5 (graphic): four panels (1, 3, 5 and 7 Meanings) plotting P(m|w) against Number of Utterances for f = 168 a → λf.a(x, f(x)); f = 10 another → λf.another(x, f(x)); f = 2 any → λf.any(x, f(x)).]

5 The term 'fast mapping' is generally used to refer to noun learning.
We chose to examine quantifier learning here as there is a greater variation in quantifier frequencies. Fast mapping of nouns is also achieved.

Figure 5: Learning quantifiers with frequency f.

[Figure 6 (graphic): four panels (1, 3, 5 and 7 Meanings) plotting P(word order) against Number of Utterances for the word orders svo, sov, vso, vos, ovs and osv.]

Figure 6: Learning SVO word order.

8 Discussion

We have presented an incremental model of language acquisition that learns a probabilistic CCG grammar from utterances paired with one or more potential meanings. The model assumes no language-specific knowledge, but does assume that the learner has access to language-universal correspondences between syntactic and semantic types, as well as a Bayesian prior encouraging grammars with heavy reuse of existing rules and lexical items. We have shown that this model not only outperforms a state-of-the-art semantic parser, but also exhibits learning curves similar to children's: lexical items can be acquired on a single exposure and word order is learnt suddenly rather than gradually.

Although we use a Bayesian model, our approach is different from many of the Bayesian models proposed in cognitive science and language acquisition (Xu and Tenenbaum, 2007; Goldwater et al., 2009; Frank et al., 2009; Griffiths and Tenenbaum, 2006; Griffiths, 2005; Perfors et al., 2011). These models are intended as ideal observer analyses, demonstrating what would be learned by a probabilistically optimal learner. Our learner uses a more cognitively plausible but approximate online learning algorithm. In this way, it is similar to other cognitively plausible approximate Bayesian learners (Pearl et al., 2010; Sanborn et al., 2010; Shi et al., 2010).

Of course, despite the incremental nature of our learning algorithm, there are still many aspects that could be criticized as cognitively implausible. In particular, it generates all parses consistent with each training instance, which can be both memory- and processor-intensive. It is unlikely that children do this once they have learnt at least some of the target language. In future, we plan to investigate more efficient parameter estimation methods. One possibility would be an approximate oVBEM algorithm in which the expectations in Equation 17 are calculated according to a high-probability subset of the parses {t}. Another option would be particle filtering, which has been investigated as a cognitively plausible method for approximate Bayesian inference (Shi et al., 2010; Levy et al., 2009; Sanborn et al., 2010).

As a crude approximation to the context in which an utterance is heard, the logical representations of meaning that we present to the learner are also open to criticism. However, Steedman (2002) argues that children do have access to structured meaning representations from a much older apparatus used for planning actions, and we wish to eventually ground these in sensory input.

Despite the limitations listed above, our approach makes several important contributions to the computational study of language acquisition. It is the first model to learn syntax and semantics concurrently; previous systems (Villavicencio, 2002; Buttery, 2006) learnt categorial grammars from sentences where all word meanings were known. Our model is also the first to be evaluated by parsing sentences onto their meanings, in contrast to the work mentioned above and that of Gibson and Wexler (1994), Siskind (1992), Sakas and Fodor (2001), and Yang (2002). These all evaluate their learners on the basis of a small number of predefined syntactic parameters.

Finally, our work addresses a misunderstanding about statistical learners—that their learning curves must be gradual (Thornton and Tesan, 2007). By demonstrating sudden learning of word order and fast mapping, our model shows that statistical learners can account for sudden changes in children's grammars. In future, we hope to extend these results by examining other learning behaviors and testing the model on other languages.

9 Acknowledgements

We thank Mark Johnson for suggesting an analysis of learning rates. This work was funded by the ERC Advanced Fellowship 24952 GramPlus and EU IP grant EC-FP7-270273 Xperience.

References

Alishahi, A. and Stevenson, S. (2008). A computational model for early argument structure acquisition. Cognitive Science, 32(5):789–834.

Alishahi, A. and Stevenson, S. (2010). Learning general properties of semantic roles from usage data: a computational model. Language and Cognitive Processes, 25(1).

Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Technical report, Gatsby Institute, UCL.

Börschinger, B., Jones, B. K., and Johnson, M. (2011). Reducing grounded learning tasks to grammatical inference. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1416–1425, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Brown, R. (1973). A First Language: The Early Stages. Harvard University Press, Cambridge, MA.

Buttery, P. J. (2006). Computational models for first language acquisition. Technical Report UCAM-CL-TR-675, University of Cambridge, Computer Laboratory.

Carey, S. and Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15.

Chen, D. L., Kim, J., and Mooney, R. J. (2010). Training a multilingual sportscaster: Using perceptual context to learn language. J. Artif. Intell. Res. (JAIR), 37:397–435.

Fazly, A., Alishahi, A., and Stevenson, S. (2010). A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063.

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. (2008). A Bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems 20.

Frank, M., Goodman, S., and Tenenbaum, J. (2009). Using speakers' referential intentions to model early cross-situational word learning. Psychological Science, 20(5):578–585.

Gibson, E. and Wexler, K. (1994). Triggers. Linguistic Inquiry, 25:355–407.

Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1:1–55.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Griffiths, T. L. and Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognitive Psychology, 51:354–384.

Griffiths, T. L. and Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science.

Hoffman, M., Blei, D. M., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In NIPS.

Kate, R. J. and Mooney, R. J. (2007). Learning language semantics from ambiguous supervision. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07).

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., and Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., and Steedman, M. (2011). Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Levy, R., Reali, F., and Griffiths, T. (2009). Modeling the effects of memory on human online sentence processing with particle filters. In Advances in Neural Information Processing Systems 21.

Lu, W., Ng, H. T., Lee, W. S., and Zettlemoyer, L. S. (2008). A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum, Mahwah, NJ.

Maurits, L., Perfors, A., and Navarro, D. (2009). Joint acquisition of word order and word reference. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.

Pearl, L., Goldwater, S., and Steyvers, M. (2010). How ideal are we? Incorporating human limitations into Bayesian models of word segmentation. Pages 315–326, Somerville, MA. Cascadilla Press.

Perfors, A., Tenenbaum, J. B., and Regier, T. (2011). The learnability of abstract syntactic principles. Cognition, 118(3):306–338.

Sagae, K., MacWhinney, B., and Lavie, A. (2004). Adding syntactic annotations to transcripts of parent-child dialogs. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon. LREC.

Sakas, W. and Fodor, J. D. (2001). The structural triggers learner. In Bertolo, S., editor, Language Acquisition and Learnability, pages 172–233. Cambridge University Press, Cambridge.

Sanborn, A. N., Griffiths, T. L., and Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review.

Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.

Shi, L., Griffiths, T. L., Feldman, N. H., and Sanborn, A. N. (2010). Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin & Review, 17(4):443–464.

Siskind, J. M. (1992). Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition. PhD thesis, Massachusetts Institute of Technology.

Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1-2):1–38.

Steedman, M. (2000). The Syntactic Process. MIT Press, Cambridge, MA.

Steedman, M. (2002). Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25.

Thornton, R. and Tesan, G. (2007). Categorical acquisition: Parameter setting in universal grammar. Biolinguistics, 1.

Villavicencio, A. (2002). The acquisition of a unification-based generalised categorial grammar. Technical Report UCAM-CL-TR-533, University of Cambridge, Computer Laboratory.

Wong, Y. W. and Mooney, R. (2006). Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL.

Wong, Y. W. and Mooney, R. (2007). Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the Association for Computational Linguistics.

Xu, F. and Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological Review, 114:245–272.

Yang, C. (2002). Knowledge and Learning in Natural Language. Oxford University Press, Oxford.

Yu, C. and Ballard, D. H. (2007). A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13-15):2149–2165.

Zettlemoyer, L. S. and Collins, M. (2005). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Zettlemoyer, L. S. and Collins, M. (2009). Learning context-dependent mappings from sentences to logical form. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.
Active learning for interactive machine translation

Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta
D. de Sistemas Informáticos y Computación
U. Politècnica de València
C. de Vera s/n, 46022 Valencia, Spain
{jegonzalez,dortiz,fcn}@dsic.upv.es

Abstract

Translation needs have greatly increased during the last years. In many situations, the text to be translated constitutes an unbounded stream of data that grows continually with time. An effective approach to translate text documents is to follow an interactive-predictive paradigm in which both the system is guided by the user and the user is assisted by the system to generate error-free translations. Unfortunately, when processing such unbounded data streams even this approach requires an overwhelming amount of manpower. It is in this scenario where the use of active learning techniques is compelling. In this work, we propose different active learning techniques for interactive machine translation. Results show that for a given translation quality the use of active learning allows us to greatly reduce the human effort required to translate the sentences in the stream.

1 Introduction

Translation needs have greatly increased during the last years due to phenomena such as globalization and technological development. For example, the European Parliament¹ translates its proceedings into 22 languages on a regular basis, and Project Syndicate² translates editorials into different languages. In these and many other examples, the data can be viewed as an incoming unbounded stream, since it grows continually with time (Levenberg et al., 2010). Manual translation of such streams of data is extremely expensive given the huge volume of translation required, and therefore various automatic machine translation methods have been proposed.

However, automatic statistical machine translation (SMT) systems are far from generating error-free translations, and their outputs usually require human post-editing in order to achieve high-quality translations. One way of taking advantage of SMT systems is to combine them with the knowledge of a human translator in the interactive-predictive machine translation (IMT) framework (Foster et al., 1998; Langlais and Lapalme, 2002; Barrachina et al., 2009), which is in turn a particular case of the computer-assisted translation paradigm (Isabelle and Church, 1997). In the IMT framework, a state-of-the-art SMT model and a human translator collaborate to obtain high-quality translations while minimizing the required human effort.

Unfortunately, the application of either post-editing or IMT to data streams with massive data volumes is still too expensive, simply because manual supervision of all instances requires huge amounts of manpower. For such massive data streams, the need to employ active learning (AL) is compelling. AL techniques for IMT selectively ask an oracle (e.g. a human translator) to supervise a small portion of the incoming sentences. Sentences are selected so that the SMT models estimated from them translate new sentences as accurately as possible. There are three challenges when applying AL to unbounded data streams (Zhu et al., 2010). These challenges can be instantiated to IMT as follows:

1. The pool of candidate sentences is dynamically changing, whereas existing AL algorithms deal with static datasets only.

2. Concepts such as the optimum translation and the translation probability distribution are continually evolving, whereas existing AL algorithms only deal with constant concepts.

3. The data volume is unbounded, which makes it impractical to batch-learn one single system from all previously translated sentences. Therefore, model training must be done in an incremental fashion.

In this work, we present a proposal of AL for IMT specifically designed to work with stream data. In short, our proposal divides the data stream into blocks, where AL techniques for static datasets are applied. Additionally, we implement an incremental learning technique to efficiently train the base SMT models as new data becomes available.

¹ http://www.europarl.europa.eu
² http://project-syndicate.org

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 245–254, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Related work

A body of work has recently been proposed to apply AL techniques to SMT (Haffari et al., 2009; Ambati et al., 2010; Bloodgood and Callison-Burch, 2010). The aim of these works is to build one single optimal SMT model from manually translated data extracted from static datasets. None of them fits in the setting of data streams.

Some of the above described challenges of AL from unbounded streams have been previously addressed in the MT literature. In order to deal with the evolutionary nature of the problem, Nepveu et al. (2004) propose an IMT system with dynamic adaptation via cache-based model extensions for language and translation models. Pursuing the same goal for SMT, Levenberg et al. (2010) study how to bound the space when processing (potentially) unbounded streams of parallel data and propose a method to incrementally retrain SMT models. Another method to efficiently retrain an SMT model with new data was presented in (Ortiz-Martínez et al., 2010). In this work, the authors describe an application of the online learning paradigm to the IMT framework.

To the best of our knowledge, the only previous work on AL for IMT is (González-Rubio et al., 2011). There, the authors present a naïve application of the AL paradigm for IMT that does not take into account the dynamic change in the probability distribution of the stream. Nevertheless, results show that even that simple AL framework halves the human effort required to obtain a certain translation quality.

In this work, the AL framework presented in (González-Rubio et al., 2011) is extended in an effort to address all the above described challenges. In short, we propose an AL framework for IMT that splits the data stream into blocks. This approach allows us to have more context to model the changing probability distribution of the stream (challenge 2) and results in a more accurate sampling of the changing pool of sentences (challenge 1). In contrast to the proposal described in (González-Rubio et al., 2011), we define sentence sampling strategies whose underlying models can be updated with the newly available data. This way, the sentences to be supervised by the user are chosen taking into account previously supervised sentences. To efficiently retrain the underlying SMT models of the IMT system (challenge 3), we follow the online learning technique described in (Ortiz-Martínez et al., 2010). Finally, we integrate all these elements to define an AL framework for IMT with the objective of obtaining an optimum balance between translation quality and human user effort.

3 Interactive machine translation

IMT can be seen as an evolution of the SMT framework. Given a sentence f from a source language to be translated into a sentence e of a target language, the fundamental equation of SMT (Brown et al., 1993) is defined as follows:

    ê = arg max_e Pr(e | f)    (1)

where Pr(e | f) is usually approximated by a log-linear translation model (Koehn et al., 2003). In this case, the decision rule is given by the expression:

    ê = arg max_e { Σ_{m=1}^{M} λ_m h_m(e, f) }    (2)

where each h_m(e, f) is a feature function representing a statistical model and λ_m its weight.

In the IMT framework, a human translator is introduced into the translation process to collaborate with an SMT model. For a given source sentence, the SMT model fully automatically generates an initial translation. The human user checks this translation, from left to right, correcting the first error. Then, the SMT model proposes a new extension taking the correct prefix, e_p, into account. These steps are repeated until the user accepts the translation. Figure 1 illustrates a typical IMT session. In the resulting decision rule, we have to find an extension e_s for a given prefix e_p. To do this we reformulate equation (1) as follows, where the term Pr(e_p | f) has been dropped since it does not depend on e_s:

    ê_s = arg max_{e_s} Pr(e_p, e_s | f)    (3)
        ≈ arg max_{e_s} p(e_s | f, e_p)    (4)

The search is restricted to those sentences e which contain e_p as a prefix. Since e ≡ e_p e_s, we can use the same log-linear SMT model, equation (2), whenever the search procedures are adequately modified (Barrachina et al., 2009).

    source (f):              Para ver la lista de recursos
    desired translation (ê): To view a listing of resources

    inter.-0   e_s: To view the resources list
    inter.-1   e_p: To view           k: a   e_s: list of resources
    inter.-2   e_p: To view a list    k: i   e_s: ng resources
    inter.-3   e_p: To view a listing k: o   e_s: f resources
    accept     e_p: To view a listing of resources

Figure 1: IMT session to translate a Spanish sentence into English. The desired translation is the translation the human user has in mind. At interaction-0, the system suggests a translation (e_s). At interaction-1, the user moves the mouse to accept the first eight characters "To view " and presses the a key (k); the system then suggests completing the sentence with "list of resources" (a new e_s). Interactions 2 and 3 are similar. In the final interaction, the user accepts the current translation.

4 Active learning for IMT

The aim of the IMT framework is to obtain high-quality translations while minimizing the required human effort. Despite the fact that IMT may reduce the required effort with respect to post-editing, it still requires the user to supervise all the translations. To address this problem, we propose to use AL techniques to select only a small number of sentences whose translations are worth being supervised by the human expert.

This approach implies a modification of the user-machine interaction protocol. For a given source sentence, the SMT model generates an initial translation. Then, if this initial translation is classified as incorrect or "worth of supervision", we perform a conventional IMT procedure as in Figure 1. If not, we directly return the initial automatic translation and no effort is required from the user. At the end of the process, we use the newly available sentence pair (f, e) to refine the SMT models used by the IMT system.

In this scenario, the user only checks a small number of sentences; thus, the final translations are not error-free as in conventional IMT. However, results in previous work (González-Rubio et al., 2011) show that this approach yields important reductions in human effort. Moreover, depending on the definition of the sampling strategy, we can modify the ratio of sentences that are interactively translated to adapt our system to the requirements of a specific translation task. For example, if the main priority is to minimize human effort, our system can be configured to translate all the sentences without user intervention.

Algorithm 1 describes the basic algorithm to implement AL for IMT. The algorithm receives as input an initial SMT model, M, a sampling strategy, S, a stream of source sentences, F, and the block size, B. First, a block of B sentences, X, is extracted from the data stream (line 3). From this block, we sample those sentences, Y, that are worth being supervised by the human expert (line 4). For each of the sentences in X, the current SMT model generates an initial translation, ê (line 6). If the sentence has been sampled as worthy of supervision, f ∈ Y, the user is required to interactively translate it (lines 8–13) as exemplified in Figure 1. The source sentence f and its human-supervised translation, e, are then used to retrain the SMT model (line 14). Otherwise, we directly output the automatic translation ê as our final translation (line 17).

Most of the functions in the algorithm denote different steps in the interaction between the human user and the machine:

• translate(M, f): returns the most probable automatic translation of f given by M.

• validPrefix(e): returns the prefix of e
The aim of the IMT framework is to obtain high- Most of the functions in the algorithm denote quality translations while minimizing the required different steps in the interaction between the hu- human effort. Despite the fact that IMT may man user and the machine: reduce the required effort with respect to post- • translate(M, f ): returns the most proba- editing, it still requires the user to supervise all ble automatic translation of f given by M . the translations. To address this problem, we pro- pose to use AL techniques to select only a small • validPrefix(e): returns the prefix of e 247 input : M (initial SMT model) 5 Sentence sampling strategies S (sampling strategy) F (stream of source sentences) A good sentence sampling strategy must be able B (block size) to select those sentences that along with their cor- auxiliar : X (block of sentences) rect translations improve most the performance of Y (sentences worth of supervision) the SMT model. To do that, the sampling strat- 1 begin egy have to correctly discriminate “informative” 2 repeat sentences from those that are not. We can make 3 X = getSentsFromStream (B, F); different approximations to measure the informa- 4 Y = S(X, M ); 5 foreach f ∈ X do tiveness of a given sentence. In the following 6 ˆ = translate(M, f ); e sections, we describe the three different sampling 7 if f ∈ Y then strategies tested in our experimentation. 8 e=e ˆ; 9 repeat 5.1 Random sampling 10 ep = validPrefix(e); 11 ˆs = genSuffix(M, f , ep ); e Arguably, the simplest sampling approach is ran- 12 ˆs ; e = ep e dom sampling, where the sentences are randomly 13 until validTranslation(e) ; selected to be interactively translated. Although 14 M = retrain(M, (f , e)); simple, it turns out that random sampling per- 15 output(e); form surprisingly well in practice. 
The success 16 else of random sampling stem from the fact that in 17 output(ˆ e); data stream environments the translation proba- 18 until True ; bility distributions may vary significantly through 19 end time. While general AL algorithms ask the user to translate informative sentences, they may signifi- Algorithm 1: Pseudo-code of the proposed cantly change probability distributions by favor- algorithm to implement AL for IMT from ing certain translations, consequently, the previ- unbounded data streams. ously human-translated sentences may no longer reveal the genuine translation distribution in the validated by the user as correct. This prefix current point of the data stream (Zhu et al., 2007). includes the correction k. This problem is less severe for static data where the candidate pool is fixed and AL algorithms are • genSuffix(M, f , ep ): returns the suffix of able to survey all instances. Random sampling maximum probability that extends prefix ep . avoids this problem by randomly selecting sen- tences for human supervision. As a result, it al- • validTranslation(e): returns True if ways selects those sentences with the most similar the user considers the current translation to distribution to the current sentence distribution in be correct and False otherwise. the data stream. 5.2 n-gram coverage sampling Apart from these, the two elements that define the performance of our algorithm are the sampling One technique to measure the informativeness strategy S(X, M ) and the retrain(M, (f , e)) of a sentence is to directly measure the amount function. On the one hand, the sampling strat- of new information that it will add to the SMT egy decides which sentences should be supervised model. This sampling strategy considers that by the user, which defines the human effort re- sentences with rare n-grams are more informa- quired by the algorithm. Section 5 describes our tive. 
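As an illustrative sketch (not the authors' code), a coverage-style score can be computed by counting how many of a sentence's n-grams are still rare under the counts accumulated so far; the threshold A = 10 and maximum order N = 4 follow the values reported in the experimentation, while the whitespace tokenisation and function names are assumptions:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of size n in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage_score(sentence, seen_counts, max_order=4, threshold=10):
    """Fraction of the sentence's 1..N-grams that are still inaccurately
    represented, i.e. seen fewer than `threshold` times in training."""
    rare, total = 0, 0
    tokens = sentence.split()
    for n in range(1, max_order + 1):
        for gram in ngrams(tokens, n):
            total += 1
            if seen_counts[gram] < threshold:
                rare += 1
    return rare / total if total > 0 else 0.0

def update_counts(sentence, seen_counts, max_order=4):
    """Keep the sampler up to date with each newly added training sentence."""
    tokens = sentence.split()
    for n in range(1, max_order + 1):
        seen_counts.update(ngrams(tokens, n))

seen = Counter()
for _ in range(10):                        # n-grams of 'the cat' seen 10 times
    update_counts("the cat", seen)

print(coverage_score("the cat", seen))     # all n-grams frequent -> 0.0
print(coverage_score("a rare cat", seen))  # mostly unseen n-grams -> high
```

Sentences scoring highest under this kind of measure are the ones selected for human supervision.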
The intuition for this approach is that rare n-grams need to be seen several times in order to accurately estimate their probability.

To do that, we store the counts for each n-gram present in the sentences used to train the SMT model. We assume that an n-gram is accurately represented when it appears A or more times in the training samples. Therefore, the score for a given sentence f is computed as:

    C(f) = ( Σ_{n=1}^{N} |N_n^{<A}(f)| ) / ( Σ_{n=1}^{N} |N_n(f)| )    (5)

where N_n(f) is the set of n-grams of size n in f, N_n^{<A}(f) is the set of n-grams of size n in f that are inaccurately represented in the training data, and N is the maximum n-gram order. In the experimentation, we assume N = 4 as the maximum n-gram order and a value of 10 for the threshold A. This sampling strategy works by selecting a given percentage of the highest-scoring sentences.

We update the counts of the n-grams seen by the SMT model with each new sentence pair. Hence, the sampling strategy is always up to date with the latest training data.

5.3 Dynamic confidence sampling

Another technique is to consider that the most informative sentence is the one that the current SMT model translates worst. The intuition behind this approach is that an SMT model cannot generate good translations unless it has enough information to translate the sentence.

The usual approach to compute the quality of a translation hypothesis is to compare it to a reference translation but, in this case, that is not a valid option, since reference translations are not available. Hence, we use confidence estimation (Gandrabur and Foster, 2003; Blatz et al., 2004; Ueffing and Ney, 2007) to estimate the probability of correctness of the translations. Specifically, we estimate the quality of a translation from the confidence scores of its individual words.

The confidence score of a word e_i of the translation e = e_1 . . . e_i . . . e_I generated from the source sentence f = f_1 . . . f_j . . . f_J is computed as described in (Ueffing and Ney, 2005):

    C_w(e_i, f) = max_{0 ≤ j ≤ |f|} p(e_i | f_j)    (6)

where p(e_i | f_j) is an IBM model 1 (Brown et al., 1993) bilingual lexicon probability and f_0 is the empty source word. The confidence score for the full translation e is computed as the ratio of its words classified as correct by the word confidence measure. Therefore, we define the confidence-based informativeness score as:

    C(e, f) = 1 − |{e_i | C_w(e_i, f) > τ_w}| / |e|    (7)

Finally, this sampling strategy works by selecting a given percentage of the highest-scoring sentences.

We dynamically update the confidence sampler each time a new sentence pair is added to the SMT model. The incremental version of the EM algorithm (Neal and Hinton, 1999) is used to incrementally train the IBM model 1.

6 Retraining of the SMT model

To retrain the SMT model, we implement the online learning techniques proposed in (Ortiz-Martínez et al., 2010). In that work, a state-of-the-art log-linear model (Och and Ney, 2002) and a set of techniques to incrementally train this model were defined. The log-linear model is composed of a set of feature functions governing different aspects of the translation process, including a language model, a source sentence-length model, inverse and direct translation models, a target phrase-length model, a source phrase-length model and a distortion model.

The incremental learning algorithm allows us to process each new training sample in constant time (i.e. the computational complexity of training a new sample does not depend on the number of previously seen training samples). To do that, a set of sufficient statistics is maintained for each feature function. If the estimation of the feature function does not require the use of the well-known expectation-maximization (EM) algorithm (Dempster et al., 1977) (e.g. n-gram language models), then it is generally easy to incrementally extend the model given a new training sample. By contrast, if the EM algorithm is required (e.g. word alignment models), the estimation procedure has to be modified, since the conventional EM algorithm is designed for use in batch learning scenarios. For such models, the incremental version of the EM algorithm (Neal and Hinton, 1999) is applied. A detailed description of the update algorithm for each of the models in the log-linear combination is presented in (Ortiz-Martínez et al., 2010).

7 Experiments

We carried out experiments to assess the performance of the proposed AL implementation for IMT. In each experiment, we started with an initial SMT model that is incrementally updated with the sentences selected by the current sampling strategy. Due to the unavailability of public benchmark data streams, we selected a relatively large corpus and treated it as a data stream for AL. To simulate the interaction with the user, we used the reference translations in the data stream corpus as the translation the human user would like to obtain. Since each experiment is carried out under the same conditions, if one sampling strategy outperforms its peers, then we can safely conclude that this is because the sentences selected to be translated are more informative.

7.1 Training corpus and data stream

The training data come from the Europarl corpus as distributed for the shared task of the NAACL 2006 workshop on statistical machine translation (Koehn and Monz, 2006).

    corpus           use     sentences   words (Spa/Eng)
    Europarl         train   731K        15M/15M
    Europarl         devel.  2K          60K/58K
    News Commentary  test    51K         1.5M/1.2M

Table 1: Size of the Spanish–English corpora used in the experiments. K and M stand for thousands and millions of elements respectively.

We used these data to estimate the initial log-linear model used by our IMT system (see Section 6). The weights of the different feature functions were tuned by means

7.2 Assessment criteria

We want to measure both the quality of the generated translations and the human effort required to obtain them.

We measure translation quality with the well-known BLEU (Papineni et al., 2002) score.

To estimate human user effort, we simulate the actions taken by a human user in their interaction with the IMT system. The first translation hypothesis for each given source sentence is compared with a single reference translation and the longest common character prefix (LCP) is obtained. The first non-matching character is replaced by the corresponding reference character and then a new translation hypothesis is produced (see Figure 1). This process is iterated until a full match with the reference is obtained. Each computation of the LCP corresponds to the user looking for the next error and moving the pointer to the corresponding position of the translation hypothesis. Each character replacement, on the other hand, corresponds to a keystroke of the user.

Bearing this in mind, we measure the user effort by means of the keystroke and mouse-action ratio (KSMR) (Barrachina et al., 2009). This measure has been extensively used to report results in the IMT literature. KSMR is calculated as the number of keystrokes plus the number of mouse movements divided by the total number of reference characters. From the user's point of view the two types of actions are different and require different types of effort (Macklovitch, 2006). In any case, as an approximation, KSMR assumes that both actions require a similar effort.
of minimum error–rate training (Och, 2003) exe- cuted on the Europarl development corpus. Once 7.3 Experimental results the SMT model was trained, we use the News In this section, we report results for three different Commentary corpus (Callison-Burch et al., 2007) experiments. First, we studied the performance to simulate the data stream. The size of these cor- of the sampling strategies when dealing with the pora is shown in Table 1. The reasons to choose sampling bias problem. In the second experiment, the News Commentary corpus to carry out our we carried out a typical AL experiment measur- experiments are threefold: first, its size is large ing the performance of the sampling strategies as enough to simulate a data stream and test our a function of the percentage of the corpus used AL techniques in the long term; second, it is to retrain the SMT model. Finally, we tested our out-of-domain data which allows us to simulate AL implementation for IMT in order to study the a real-world situation that may occur in a trans- tradeoff between required human effort and final lation company, and, finally, it consists in edito- translation quality. rials from eclectic domain: general politics, eco- nomics and science, which effectively represents 7.3.1 Dealing with the sampling bias the variations in the sentence distributions of the In this experiment, we want to study the perfor- simulated data stream. mance of the different sampling strategies when 250 DCS NS RS DCS NS SCS RS 22 23 22 21 21 20 20 BLEU BLEU 19 19 20 18 19 18 17 18 17 16 17 2 4 6 8 16 15 0 10 20 30 40 50 0 5 10 15 20 Block number Percentage (%) of the corpus in words Figure 2: Performance of the AL methods across dif- Figure 3: BLEU of the initial automatic translations ferent data blocks. Block size 500. Human supervision as a function of the percentage of the corpus used to 10% of the corpus. retrain the model. fact that NS is independent of the target language dealing with the sampling bias problem. 
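Before examining the results, the two non-random sampling scores being compared (Eqs. 5-7) can be sketched as follows. This is a minimal Python sketch, not the authors' implementation; the function names, the dictionary-based stand-in for the IBM model 1 lexicon, and the τ_w value are illustrative assumptions.

```python
def ngrams(words, n):
    """All n-grams of a given order in a list of tokens."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_coverage_score(sentence, train_counts, N=4, A=10):
    """NS score C(f) of Eq. (5): the fraction of the sentence's n-grams
    (orders 1..N) that occur fewer than A times in the training data."""
    words = sentence.split()
    rare = total = 0
    for n in range(1, N + 1):
        grams = ngrams(words, n)
        total += len(grams)
        rare += sum(1 for g in grams if train_counts.get(g, 0) < A)
    return rare / total if total else 0.0

def confidence_score(translation, source, lexicon, tau_w=0.5):
    """DCS score of Eqs. (6)-(7). `lexicon[(e, f)]` stands in for the IBM
    model 1 probability p(e|f); None models the empty source word f_0.
    Higher scores mean the current model translates the sentence worse."""
    src = [None] + source.split()
    trg = translation.split()
    correct = sum(
        1 for e in trg
        if max(lexicon.get((e, f), 0.0) for f in src) > tau_w  # Eq. (6)
    )
    return 1.0 - correct / len(trg)  # Eq. (7)
```

Under either score, the sampler would then ask the user to supervise the highest-scoring percentage of each incoming block.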
Figure 2 shows the evolution of the translation quality, in terms of BLEU, across different data blocks for the three sampling strategies described in section 5, namely dynamic confidence sampling (DCS), n-gram coverage sampling (NS) and random sampling (RS). The x-axis represents the data blocks in their temporal order, while the y-axis represents the BLEU score obtained when automatically translating each block. Such translation is obtained by the SMT model trained with the translations supervised by the user up to that point of the data stream. To fairly compare the different methods, we fixed the percentage of words supervised by the human user (10%). In addition, we used a block size of 500 sentences. Similar results were obtained for other block sizes.

Results in Figure 2 indicate that the performances for the data blocks fluctuate, and the fluctuations are quite significant. This phenomenon is due to the eclectic domain of the sentences in the data stream. Additionally, the steady increase in performance is caused by the increasing amount of data used to retrain the SMT model.

Regarding the results for the different sampling strategies, DCS consistently outperformed RS and NS. This observation asserts that for concept-drifting data streams with constantly changing translation distributions, DCS can adaptively ask the user to translate sentences to build a superior SMT model. On the other hand, NS obtains worse results than RS. This result can be explained by the fact that NS is independent of the target language and just looks into the source language, while DCS takes into account both the source sentence and its automatic translation. Similar phenomena have been reported in previous work on AL for SMT (Haffari et al., 2009).

7.3.2 AL performance

We carried out experiments to study the performance of the different sampling strategies. To this end, we compare the quality of the initial automatic translations generated in our AL implementation for IMT (line 6 in Algorithm 1). Figure 3 shows the BLEU score of these initial translations represented as a function of the percentage of the corpus used to retrain the SMT model. The percentage of the corpus is measured in number of running words.

In Figure 3, we present results for the three sampling strategies described in section 5. Additionally, we also compare our techniques with the AL technique for IMT proposed in (González-Rubio et al., 2011). That technique is similar to DCS, but it does not update the IBM model 1 used by the confidence sampler with the newly available human-translated sentences. This technique is referred to as the static confidence sampler (SCS).

Results in Figure 3 indicate that the performance of the retrained SMT models increased as more data was incorporated. Regarding the sampling strategies, DCS improved the results obtained by the other sampling strategies. NS obtained by far the worst results, which confirms the results shown in the previous experiment. Finally, as can be seen, SCS obtained slightly worse results than DCS, showing the importance of dynamically adapting the underlying model used by the sampling strategy.

7.3.3 Balancing human effort and translation quality

Finally, we studied the balance between required human effort and final translation error. This can be useful in a real-world scenario where a translation company is hired to translate a stream of sentences. Under these circumstances, it would be important to be able to predict the effort required from the human translators to obtain a certain translation quality.

The experiment simulates this situation using our proposed IMT system with AL to translate the stream of sentences. To have a broad view of the behavior of our system, we repeated this translation process multiple times, requiring an increasing human effort each time. Experiments range from a fully-automatic translation system with no need of human intervention to a system where the human is required to supervise all the sentences. Figure 4 presents results for SCS (see section 7.3.2) and the sentence selection strategies presented in section 5. In addition, we also present results for a static system without AL (w/o AL). This system is equal to SCS but it does not perform any SMT retraining.

[Figure 4: Quality of the data stream translation (BLEU) as a function of the required human effort (KSMR). w/o AL denotes a system with no retraining.]

Results in Figure 4 show a consistent reduction in required user effort when using AL. For a given human effort, the use of AL methods allowed us to obtain twice the translation quality. Regarding the different AL sampling strategies, DCS obtains the best results, but the differences with the other methods are slight.

By varying the sentence classifier, we can achieve a balance between final translation quality and required human effort. This feature allows us to adapt the system to suit the requirements of the particular translation task or the available economic or human resources. For example, if a translation quality of 60 BLEU points is satisfactory, then the human translators would need to modify only 20% of the characters of the automatically generated translations.

Finally, it should be noted that our IMT systems with AL are able to generate new suffixes and retrain with new sentence pairs in tenths of a second. Thus, they can be applied in real-time scenarios.

8 Conclusions and future work

In this work, we have presented an AL framework for IMT specially designed to process data streams with massive volumes of data. Our proposal splits the data stream into blocks of sentences of a certain size and applies AL techniques individually to each block. For this purpose, we implemented different sampling strategies that measure the informativeness of a sentence according to different criteria.

To evaluate the performance of our proposed sampling strategies, we carried out experiments comparing them with random sampling and the only previously proposed AL technique for IMT, described in (González-Rubio et al., 2011). According to the results, one of the proposed sampling strategies, specifically the dynamic confidence sampling strategy, consistently outperformed all the other strategies.

The results in the experimentation show that the use of AL techniques allows us to make a tradeoff between required human effort and final translation quality. In other words, we can adapt our system to meet the translation quality requirements of the translation task or the available human resources.

As future work, we plan to investigate more sophisticated sampling strategies, such as those based on information density or query-by-committee. Additionally, we will conduct experiments with real users to confirm the results obtained by our user simulation.

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287576. Work also supported by the EC (FEDER/FSE) and the Spanish MEC under the MIPRCV Consolider Ingenio 2010 program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project, and by the Generalitat Valenciana under grant ALMPR (Prometeo/2009/01).
References

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. 2010. Active learning and crowd-sourcing for machine translation. In Proc. of the conference on International Language Resources and Evaluation, pages 2169–2174.
Sergio Barrachina, Oliver Bender, Francisco Casacuberta, Jorge Civera, Elsa Cubel, Shahram Khadivi, Antonio Lagarda, Hermann Ney, Jesús Tomás, Enrique Vidal, and Juan-Miguel Vilar. 2009. Statistical approaches to computer-assisted translation. Computational Linguistics, 35:3–28.
John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proc. of the international conference on Computational Linguistics, pages 315–321.
Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: large-scale cost-focused active learning for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 854–864.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19:263–311.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proc. of the Workshop on Statistical Machine Translation, pages 136–158.
Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.
George Foster, Pierre Isabelle, and Pierre Plamondon. 1998. Target-text mediated interactive machine translation. Machine Translation, 12:175–194.
Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proc. of the Conference on Computational Natural Language Learning, pages 315–321.
Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2011. An active learning scenario for interactive machine translation. In Proc. of the 13th International Conference on Multimodal Interaction. ACM.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 415–423.
Pierre Isabelle and Kenneth Ward Church. 1997. Special issue on new tools for human translators. Machine Translation, 12(1-2):1–2.
Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proc. of the Workshop on Statistical Machine Translation, pages 102–121.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54.
Philippe Langlais and Guy Lapalme. 2002. TransType: development-evaluation cycles to boost translator's productivity. Machine Translation, 17:77–98.
Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 394–402, Los Angeles, California, June.
Elliott Macklovitch. 2006. TransType2: the last word. In Proc. of the conference on International Language Resources and Evaluation, pages 167–17.
Radford Neal and Geoffrey Hinton. 1999. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models, pages 355–368.
Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George Foster. 2004. Adaptive language and translation models for interactive machine translation. In Proc. of EMNLP, pages 190–197, Barcelona, Spain, July.
Franz Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 295–302.
Franz Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 160–167.
Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 546–554.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the Association for Computational Linguistics, pages 311–318.
Nicola Ueffing and Hermann Ney. 2005. Application of word-level confidence measures in interactive statistical machine translation. In Proc. of the European Association for Machine Translation conference, pages 262–270.
Nicola Ueffing and Hermann Ney. 2007. Word-level confidence estimation for machine translation. Computational Linguistics, 33:9–40.
Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2007. Active learning from data streams. In Proc. of the 7th IEEE International Conference on Data Mining, pages 757–762. IEEE Computer Society.
Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2010. Active learning from stream data using optimal weight classifier ensemble. Transactions on Systems, Man and Cybernetics Part B, 40:1607–1621, December.

Adapting Translation Models to Translationese Improves SMT

Gennadi Lembersky, Noam Ordan, Shuly Wintner
Dept. of Computer Science, University of Haifa, 31905 Haifa, Israel

[email protected] [email protected] [email protected]

Abstract

Translation models used for statistical machine translation are compiled from parallel corpora; such corpora are manually translated, but the direction of translation is usually unknown, and is consequently ignored. However, much research in Translation Studies indicates that the direction of translation matters, as translated language (translationese) has many unique properties. Specifically, phrase tables constructed from parallel corpora translated in the same direction as the translation task perform better than ones constructed from corpora translated in the opposite direction. We reconfirm that this is indeed the case, but emphasize the importance of also using texts translated in the ‘wrong’ direction. We take advantage of information pertaining to the direction of translation in constructing phrase tables, by adapting the translation model to the special properties of translationese. We define entropy-based measures that estimate the correspondence of target-language phrases to translationese, thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation. We show that incorporating these measures as features in the phrase tables of statistical machine translation systems results in consistent, statistically significant improvement in the quality of the translation.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Much research in Translation Studies indicates that translated texts have unique characteristics that set them apart from original texts (Toury, 1980; Gellerstam, 1986; Toury, 1995). Known as translationese, translated texts (in any language) constitute a genre, or a dialect, of the target language, which reflects both artifacts of the translation process and traces of the original language from which the texts were translated. Among the better-known properties of translationese are simplification and explicitation (Baker, 1993, 1995, 1996): translated texts tend to be shorter, to have a lower type/token ratio, and to use certain discourse markers more frequently than original texts. Incidentally, translated texts are so markedly different from original ones that automatic classification can identify them with very high accuracy (van Halteren, 2008; Baroni and Bernardini, 2006; Ilisei et al., 2010; Koppel and Ordan, 2011).

Contemporary Statistical Machine Translation (SMT) systems use parallel corpora to train translation models that reflect source- and target-language phrase correspondences. Typically, SMT systems ignore the direction of translation used to produce those corpora. Given the unique properties of translationese, however, it is reasonable to assume that this direction may affect the quality of the translation. Recently, Kurokawa et al. (2009) showed that this is indeed the case. They train a system to translate between French and English (and vice versa) using a French-translated-to-English parallel corpus, and then an English-translated-to-French one. They find that in translating into French the latter parallel corpus yields better results, whereas for translating into English it is better to use the former.

Usually, of course, the translation direction of a parallel corpus is unknown. Therefore, Kurokawa et al. (2009) train an SVM-based classifier to predict which side of a bi-text is the origin and which one is the translation, and only use the subset of the corpus that corresponds to the translation direction of the task in training their translation model.

We use these results as our departure point, but improve them in two major ways.
First, we demonstrate that the other subset of the corpus, reflecting translation in the ‘wrong’ direction, is also important for the translation task, and must not be ignored; second, we show that explicit information on the direction of translation of the parallel corpus, whether manually-annotated or machine-learned, is not mandatory. This is achieved by casting the problem in the framework of domain adaptation: we use domain-adaptation techniques to direct the SMT system toward producing output that better reflects the properties of translationese. We show that SMT systems adapted to translationese produce better translations than vanilla systems trained on exactly the same resources. We confirm these findings using an automatic evaluation metric, BLEU (Papineni et al., 2002), as well as through a qualitative analysis of the results.

Our departure point is the results of Kurokawa et al. (2009), which we successfully replicate in Section 3. First (Section 4), we explain why translation quality improves when the parallel corpus is translated in the ‘right’ direction. We do so by showing that the subset of the corpus that was translated in the direction of the translation task (the ‘right’ direction, henceforth source-to-target, or S → T) yields phrase tables that are better suited for translation of the original language than the subset translated in the reverse direction (the ‘wrong’ direction, henceforth target-to-source, or T → S). We use several statistical measures that indicate the better quality of the phrase tables in the former case.

Then (Section 5), we explore ways to build a translation model that is adapted to the unique properties of translationese. We first show that using the entire parallel corpus, including texts that are translated both in the ‘right’ and in the ‘wrong’ direction, improves the quality of the results. Furthermore, we show that the direction of translation used for producing the parallel corpus can be approximated by defining several entropy-based measures that correlate well with translationese and, consequently, with the quality of the translation.

Specifically, we use the entire corpus, create a single, unified phrase table and then use the statistical measures mentioned above, and in particular cross-entropy, as a clue for selecting phrase pairs from this table. The benefit of this method is that not only does it yield the best results, but it also eliminates the need to directly predict the direction of translation of the parallel corpus. The main contribution of this work, therefore, is a methodology that improves the quality of SMT by building translation models that are adapted to the nature of translationese.

2 Related Work

Kurokawa et al. (2009) are the first to address the direction of translation in the context of SMT. Their main finding is that using the S → T portion of the parallel corpus results in much better translation quality than when the T → S portion is used for training the translation model. We indeed replicate these results here (Section 3), and view them as a baseline. Additionally, we show that the T → S portion is also important for machine translation and thus should not be discarded. Using information-theoretic measures, and in particular cross-entropy, we gain statistically significant improvements in translation quality beyond the results of Kurokawa et al. (2009). Furthermore, we eliminate the need to (manually or automatically) detect the direction of translation of the parallel corpus.

Lembersky et al. (2011) also investigate the relations between translationese and machine translation. Focusing on the language model (LM), they show that LMs trained on translated texts yield better translation quality than LMs compiled from original texts. They also show that perplexity is a good discriminator between original and translated texts.

Our current work is closely related to research in domain adaptation. In a typical domain adaptation scenario, a system is trained on a large corpus of “general” (out-of-domain) training material, with a small portion of in-domain training texts. In our case, the translation model is trained on a large parallel corpus, of which some (generally unknown) subset is “in-domain” (S → T), and some other subset is “out-of-domain” (T → S). Most existing adaptation methods focus on selecting in-domain data from a general-domain corpus. In particular, perplexity is used to score the sentences in the general-domain corpus according to an in-domain language model. Gao et al. (2002) and Moore and Lewis (2010) apply this method to language modeling, while Foster et al. (2010) and Axelrod et al. (2011) use it on the translation model. Moore and Lewis (2010) suggest a slightly different approach, using cross-entropy difference as a ranking function.

Domain adaptation methods are usually applied at the corpus level, while we focus on an adaptation of the phrase table used for SMT. In this sense, our work follows Foster et al. (2010), who weigh out-of-domain phrase pairs according to their relevance to the target domain. They use multiple features that help distinguish between phrase pairs in the general domain and those in the specific domain.
We rely on features that are motivated by the findings of Translation Studies, having established their relevance through a comparative analysis of the phrase tables. In particular, we use measures such as translation model entropy, inspired by Koehn et al. (2009). Additionally, we apply the method suggested by Moore and Lewis (2010), using perplexity ratio instead of cross-entropy difference.

3 Experimental Setup

The tasks we focus on are translation between French and English, in both directions. We use the Hansard corpus, containing transcripts of the Canadian parliament from 1996–2007, as the source of all parallel data. The Hansard is a bilingual French–English corpus comprising approximately 80% English-original texts and 20% French-original texts. Crucially, each sentence pair in the corpus is annotated with the direction of translation. Both English and French are lowercased and tokenized using MOSES (Koehn et al., 2007). Sentences longer than 80 words are discarded.

To address the effect of the corpus size, we compile six subsets of different sizes (250K, 500K, 750K, 1M, 1.25M and 1.5M parallel sentences) from each portion (English-original and French-original) of the corpus. Additionally, we use the devtest section of the Hansard corpus to randomly select French-original and English-original sentences that are used for tuning (1,000 sentences each) and evaluation (5,000 sentences each). French-to-English MT systems are tuned and tested on French-original sentences and English-to-French systems on English-original ones.

To replicate the results of Kurokawa et al. (2009) and set up a baseline, we train twelve French-to-English and twelve English-to-French phrase-based (PB-) SMT systems using the MOSES toolkit (Koehn et al., 2007), each trained on a different subset of the corpus. We use GIZA++ (Och and Ney, 2000) with grow-diag-final alignment, and extract phrases of length up to 10 words. We prune the resulting phrase tables as in Johnson et al. (2007), using at most 30 translations per source phrase and discarding singleton phrase pairs.

We construct English and French 5-gram language models from the English and French subsections of the Europarl-V6 corpus (Koehn, 2005), using interpolated modified Kneser-Ney discounting (Chen, 1998) and no cut-off on all n-grams. Europarl consists of a large number of subsets translated from various languages, and is therefore unlikely to be biased towards a specific source language. The reordering model used in all MT systems is trained on the union of the 1.5M French-original and the 1.5M English-original subsets, using msd-bidirectional-fe reordering. We use the MERT algorithm (Och, 2003) for tuning and BLEU (Papineni et al., 2002) as our evaluation metric. We test the statistical significance of the differences between the results using the bootstrap resampling method (Koehn, 2004).

A word on notation: We use ‘English-original’ (EO) and ‘French-original’ (FO) to refer to the subsets of the corpus that are translated from English to French and from French to English, respectively. The translation tasks are English-to-French (E2F) and French-to-English (F2E). We thus use ‘S → T’ when the FO corpus is used for the F2E task or when the EO corpus is used for the E2F task; and ‘T → S’ when the FO corpus is used for the E2F task or when the EO corpus is used for the F2E task.

Table 1 depicts the BLEU scores of the baseline systems. The data are consistent with the findings of Kurokawa et al. (2009): systems trained on S → T parallel texts outperform systems trained on T → S texts, even when the latter are much larger. The difference in BLEU score can be as high as 3 points.
(2009) 500K 35.21 32.38 search through all possible segmentations of the 750K 36.12 32.90 source sentence to find the optimal covering set of 1M 35.73 33.07 test sentences that minimizes the average entropy 1.25M 36.24 33.23 of the source phrases in the covering set (hence- 1.5M 36.43 33.73 forth, covering set entropy or CovEnt). Task: English-to-French We also propose a metric that assesses the qual- Corpus subset S → T T → S ity of the source side of a phrase table. The met- ric finds the minimal covering set of a given text 250K 27.74 26.58 in the source language using source phrases from 500K 29.15 27.19 a particular phrase table, and outputs the average 750K 29.43 27.63 length of a phrase in the covering set (henceforth, 1M 29.94 27.88 covering set average length or CovLen). 1.25M 30.63 27.84 1.5M 29.89 27.83 Lembersky et al. (2011) show that perplexity distinguishes well between translated and origi- Table 1: BLEU scores of baseline systems nal texts. Moreover, perplexity reflects the de- gree of ‘relatedness’ of a given phrase to original language or to translationese. Motivated by this and translated texts. In this section we explain observation, we design two cross-entropy-based the better translation quality in terms of the bet- measures to assess how well each phrase table fits ter quality of the respective phrase tables, as de- the genre of translationese. Since MT systems are fined by a number of statistical measures. We first evaluated against human translations, we believe relate these measures to the unique properties of that this factor may have a significant impact on translationese. translation performance. The cross-entropy of a Translated texts tend to be simpler than original text T = w1 , w2 , · · · wN according to a language ones along a number of criteria. Generally, trans- model L is: lated texts are not as rich and variable as origi- nal ones, and in particular, their type/token ratio N is lower. 
Consequently, we expect S → T phrase 1 X H(T, L) = − log2 L(wi ) (2) tables (which are based on a parallel corpus whose N i=1 source is original texts, and whose target is trans- We build language models of translated texts lationese) to have more unique source phrases and as follows. For English translationese, we a lower number of translations per source phrase. extract 170,000 French-original sentences from A large number of unique source phrases suggests the English portion of Europarl, and 3,000 better coverage of the source text, while a small English-translated-from-French sentences from number of translations per source phrase means a the Hansard corpus (disjoint from the training, lower phrase table entropy. Entropy-based mea- development and test sets, of course). We use sures are well-established tools to assess the qual- each corpus to train a trigram language model ity of a phrase table. Phrase table entropy captures with interpolated modified Kneser-Ney discount- the amount of uncertainty involved in choosing ing and no cut-off. All out-of-vocabulary words candidate translation phrases (Koehn et al., 2009). are mapped to a special token, hunki. Then, Given a source phrase s and a phrase table T we interpolate the Hansard and Europarl language with translations t of s whose probabilities are models to minimize the perplexity of the target p(t|s), the entropy H of s is: side of the development set (λ = 0.58). For X French translationese, we use 270,000 sentences H(s) = − p(t|s) × log2 p(t|s) (1) t∈T from Europarl and 3,000 sentences from Hansard, λ = 0.81. Finally, we compute the cross-entropy There are two major flavors of the phrase table of each target phrase in the phrase tables accord- entropy metric: Lambert et al. (2011) calculate ing to these language models. 
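The phrase table entropy (PtEnt) measure of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the `phrase_entropy`/`pt_ent` names and the toy phrase table are hypothetical.

```python
import math

def phrase_entropy(translations):
    # Entropy H(s) of one source phrase, Eq. (1):
    # H(s) = -sum_t p(t|s) * log2 p(t|s)
    return -sum(p * math.log2(p) for p in translations.values() if p > 0)

def pt_ent(phrase_table):
    # PtEnt: average entropy over all source phrases in the table.
    return sum(phrase_entropy(t) for t in phrase_table.values()) / len(phrase_table)

# Toy phrase table: source phrase -> {target phrase: p(t|s)}.
table = {
    "maison": {"house": 0.7, "home": 0.2, "household": 0.1},
    "bonjour": {"hello": 1.0},
}
```

A source phrase with a single deterministic translation contributes zero entropy, so S -> T tables, which offer fewer translation options per source phrase, score lower on this measure.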
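The minimal covering set behind the CovLen measure can be found with a small dynamic program over split points. This is a sketch under the assumption that a cover always exists (single out-of-table tokens are admitted as a fallback); the function names are hypothetical, not taken from the paper.

```python
def min_cover(tokens, source_phrases, max_len=7):
    # best[i] = (number of phrases, segmentation) covering tokens[:i].
    n = len(tokens)
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] is None:
                continue
            phrase = " ".join(tokens[j:i])
            # Fallback: a single unknown token covers itself.
            if phrase in source_phrases or i - j == 1:
                cand = (best[j][0] + 1, best[j][1] + [phrase])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n][1]

def cov_len(tokens, source_phrases):
    # CovLen: average phrase length (in tokens) in the minimal cover.
    return len(tokens) / len(min_cover(tokens, source_phrases))
```

Fewer, longer phrases in the cover mean a higher CovLen, i.e. the table's source side matches the text in larger chunks.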
As with the entropy-based measures, we define two cross-entropy metrics: phrase table cross-entropy or PtCrEnt calculates the average cross-entropy over weighted cross-entropies of all translation options for each source phrase, and covering set cross-entropy or CovCrEnt finds the optimal covering set of test sentences that minimizes the weighted cross-entropy of the source phrase in the covering set. Given a phrase table T and a language model L, the weighted cross-entropy W for a source phrase s is:

    W(s, L) = - Sum_{t in T} H(t, L) * p(t|s)    (3)

where H(t, L) is the cross-entropy of t according to a language model L.

Table 2 depicts various statistical measures computed on the phrase tables corresponding to our 24 SMT systems (the phrase tables were pruned, retaining only phrases that are included in the evaluation set). The data meet our preliminary expectations: S -> T phrase tables have more unique source phrases, but fewer translation options per source phrase. They have lower entropy and cross-entropy, but higher covering set length.

Task: French-to-English
        Set    Total  Source  AvgTran  PtEnt  CovEnt  PtCrEnt  CovCrEnt  CovLen
S -> T  250K   231K   69K     3.35     0.86   0.36    3.94     1.64      2.44
        500K   360K   86K     4.21     0.98   0.35    3.52     1.30      2.64
        750K   461K   96K     4.81     1.05   0.35    3.24     1.10      2.77
        1M     544K   103K    5.27     1.10   0.34    3.09     0.99      2.85
        1.25M  619K   109K    5.66     1.14   0.34    2.98     0.91      2.92
        1.5M   684K   114K    6.01     1.18   0.33    2.90     0.85      2.97
T -> S  250K   199K   55K     3.65     0.92   0.45    4.00     1.87      2.25
        500K   317K   69K     4.56     1.05   0.43    3.57     1.52      2.42
        750K   405K   78K     5.19     1.12   0.43    3.39     1.35      2.53
        1M     479K   85K     5.66     1.16   0.42    3.21     1.21      2.61
        1.25M  545K   90K     6.07     1.20   0.41    3.11     1.12      2.67
        1.5M   602K   94K     6.43     1.24   0.41    3.04     1.07      2.71
Task: English-to-French
S -> T  250K   224K   49K     4.52     1.07   0.63    3.48     1.88      2.08
        500K   346K   61K     5.64     1.21   0.59    3.08     1.49      2.25
        750K   437K   68K     6.39     1.29   0.57    2.91     1.33      2.33
        1M     513K   74K     6.95     1.34   0.55    2.75     1.18      2.41
        1.25M  579K   78K     7.42     1.38   0.54    2.63     1.09      2.46
        1.5M   635K   81K     7.83     1.41   0.53    2.58     1.03      2.50
T -> S  250K   220K   46K     4.75     1.12   0.63    3.62     2.09      2.02
        500K   334K   57K     5.82     1.24   0.60    3.24     1.70      2.16
        750K   421K   64K     6.54     1.31   0.58    2.97     1.48      2.25
        1M     489K   69K     7.10     1.36   0.57    2.84     1.35      2.32
        1.25M  550K   73K     7.56     1.40   0.55    2.74     1.25      2.37
        1.5M   603K   76K     7.92     1.43   0.55    2.66     1.17      2.41

Table 2: Statistical measures computed on the phrase tables: total size, in tokens ('Total'); the number of unique source phrases ('Source'); the average number of translations per source phrase ('AvgTran'); phrase table entropy ('PtEnt') and covering set entropy ('CovEnt'); phrase table cross-entropy ('PtCrEnt') and covering set cross-entropy ('CovCrEnt'); and the covering set average length ('CovLen')

In order to assess the correspondence of each measure to translation quality, we compute the correlation of BLEU scores from Table 1 with each of the measures specified in Table 2; we compute the correlation coefficient R^2 (the square of Pearson's product-moment correlation coefficient) by fitting a simple linear regression model. Table 3 lists the results. Only the covering set cross-entropy measure shows stability over the French-to-English and English-to-French translation tasks, with R^2 equal to 0.56 and 0.54, respectively. Other measures are sensitive to the translation task: covering set entropy has the highest correlation with BLEU (R^2 = 0.94) when translating French-to-English, but it drops to 0.46 for the reverse task. The covering set average length measure shows similar behavior: R^2 drops from 0.75 in French-to-English to 0.56 in English-to-French. Still, the correlation of these measures with BLEU is high.

Measure    R^2 (FR-EN)   R^2 (EN-FR)
AvgTran    0.06          0.22
PtEnt      0.03          0.19
CovEnt     0.94          0.46
PtCrEnt    0.33          0.44
CovCrEnt   0.56          0.54
CovLen     0.75          0.56

Table 3: Correlation of BLEU scores with phrase table statistical measures

Consequently, we use the three best measures, namely covering set entropy, cross-entropy and average length, as indicators of better translations, more similar to translationese. Crucially, these measures are computed directly on the phrase table, and do not require reference translations or meta-information pertaining to the direction of translation of the parallel phrase.

5 Translation Model Adaptation

We have thus established the fact that S -> T phrase tables have an advantage over T -> S ones that stems directly from the different characteristics of original and translated texts. We have also identified three statistical measures that explain most of the variability in translation quality. We now explore ways for taking advantage of the entire parallel corpus, including translations in both directions, in light of the above findings. Our goal is to establish the best method to address the issue of different translation direction components in the parallel corpus.

First, we simply take the union of the two subsets of the parallel corpus. We create three different mixtures of FO and EO: 500K sentences each of FO and EO ('MIX1'), 500K sentences of FO and 1M sentences of EO ('MIX2'), and 1M sentences of FO and 500K sentences of EO ('MIX3'). We use these corpora to train French-to-English and English-to-French MT systems, evaluating their quality on the evaluation sets described in Section 3. We use the same Moses configuration as well as the same language and reordering models as in Section 3.

Table 4 reports the results, comparing them to the results obtained for the baseline MT systems trained on individual French-original and English-original bi-texts (see Section 3). (Recall that when translating from French to English, S -> T means that the bi-text is French-original; when translating from English to French, S -> T means it is English-original.) Note that the mixed corpus includes many more sentences than each of the baseline models; this is a realistic scenario, in which one can opt either to use the entire parallel corpus, or only its S -> T subset. Even with a corpus several times as large, however, the 'mixed' MT systems perform only slightly better than the S -> T ones. On one hand, this means that one can train MT systems on S -> T data only, at the expense of only a minor loss in quality. On the other hand, it is obvious that the T -> S component also contributes to translation quality. We now look at ways to better utilize this portion.

Task: French-to-English
System   MIX1    MIX2    MIX3
Union    35.27   35.36   35.94
S -> T   35.21   35.21   35.73
T -> S   32.38   33.07   32.38
Task: English-to-French
System   MIX1    MIX2    MIX3
Union    29.27   30.01   29.44
S -> T   29.15   29.94   29.15
T -> S   27.19   27.19   27.88

Table 4: Evaluation of the MIX systems

We compute the measures established in the previous section on phrase tables trained on the MIX corpora, and compare them with the same measures computed for phrase tables trained on the relevant S -> T corpus for both translation tasks. Table 5 displays the figures for the MIX1 corpus: phrase tables trained on mixed corpora have higher covering set average length, similar covering set entropy, but significantly worse covering set cross-entropy. Consequently, improving covering set cross-entropy has the greatest potential for improving translation quality. We therefore use this feature to 'encourage' the decoder to select translation options that are more related to the genre of translated texts.

French-to-English
Measure    MIX1   S -> T
CovLen     2.78   2.64
CovEnt     0.37   0.35
CovCrEnt   1.58   1.10
English-to-French
Measure    MIX1   S -> T
CovLen     2.40   2.25
CovEnt     0.55   0.58
CovCrEnt   2.09   1.48

Table 5: Statistical measures computed for mixed vs. source-to-target phrase tables

We do so by adding to each phrase pair in the phrase tables an additional factor, as a measure of its fitness to the genre of translationese. We experiment with two such factors. First, we use the language models described in Section 4 to compute the cross-entropy of each translation option according to this model. We add cross-entropy as an additional score of a translation pair that can be tuned by MERT (we refer to this system as CrEnt). Since cross-entropy is a 'the lower the better' metric, we adjust the range of values used by MERT for this score to be negative. Second, following Moore and Lewis (2010), we define an adapting feature that not only measures how close phrases are to translated language, but also how far they are from original language, and use it as a factor in a phrase table (this system is referred to as PplRatio).

We build two additional language models of original texts as follows. For original English, we extract 135,000 English-original sentences from the English portion of Europarl, and 2,700 English-original sentences from the Hansard corpus. We train a trigram language model with interpolated modified Kneser-Ney discounting on each corpus and we interpolate both models to minimize the perplexity of the source side of the development set for the English-to-French translation task (lambda = 0.49). For original French, we use 110,000 sentences from Europarl and 2,900 sentences from Hansard, lambda = 0.61. Finally, for each target phrase t in the phrase table we compute the ratio of the perplexity of t according to the original language model Lo and the perplexity of t with respect to the translated model Lt (see Section 4). In other words, the factor F is computed as follows:

    F(t) = H(t, Lo) / H(t, Lt)    (4)

We apply these techniques to the French-to-English and English-to-French phrase tables built from the mixed corpora and use each phrase table to train an SMT system. Table 6 summarizes the performance of these systems. All systems outperform the corresponding Union systems. 'CrEnt' systems show significant improvements (p < 0.05) on balanced scenarios ('MIX1') and on scenarios biased towards the S -> T component ('MIX2' in the French-to-English task, 'MIX3' in English-to-French). 'PplRatio' systems exhibit more consistent behavior, showing small, but statistically significant improvements (p < 0.05) in all scenarios.

Task: French-to-English
System     MIX1    MIX2    MIX3
Union      35.27   35.36   35.94
CrEnt      35.54   35.45   36.75
PplRatio   35.59   35.78   36.22
Task: English-to-French
System     MIX1    MIX2    MIX3
Union      29.27   30.01   29.44
CrEnt      29.47   30.44   29.45
PplRatio   29.65   30.34   29.62

Table 6: Evaluation of MT systems

Note again that all systems in the same column are trained on exactly the same corpus and have exactly the same phrase tables. The only difference is an additional factor in the phrase table that "encourages" the decoder to select translation options
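The PplRatio factor of Eq. (4) divides the cross-entropy of a target phrase under the original-language model by its cross-entropy under the translationese model. The sketch below stands in for the paper's trigram models with plain token-probability dictionaries, so it only illustrates the arithmetic; `ppl_ratio_factor` and the floor probability for unseen tokens are assumptions, not the authors' code.

```python
import math

def cross_entropy(tokens, lm, floor=1e-7):
    # Average negative log2-probability per token, as in Eq. (2); unseen
    # tokens get a small floor probability (a crude stand-in for <unk>).
    return -sum(math.log2(lm.get(w, floor)) for w in tokens) / len(tokens)

def ppl_ratio_factor(target_phrase, lm_original, lm_translated):
    # F(t) = H(t, Lo) / H(t, Lt): values above 1 mean the phrase fits
    # the translationese model better than the original-language model.
    toks = target_phrase.split()
    return cross_entropy(toks, lm_original) / cross_entropy(toks, lm_translated)
```

In a Moses-style phrase table this value would simply be appended as one more score per phrase pair, leaving the existing translation and lexical scores untouched.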
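The interpolation weights (lambda) reported above minimize development-set perplexity. Standard LM toolkits fit such weights with EM; a simple grid search over a two-model mixture (hypothetical names, toy unigram stand-ins for the real trigram models) illustrates the objective being minimized.

```python
import math

def mix_cross_entropy(tokens, lm_a, lm_b, lam, floor=1e-7):
    # Cross-entropy of dev tokens under the mixture lam*A + (1-lam)*B.
    return -sum(
        math.log2(lam * lm_a.get(w, floor) + (1 - lam) * lm_b.get(w, floor))
        for w in tokens
    ) / len(tokens)

def best_lambda(dev_tokens, lm_a, lm_b, steps=100):
    # Minimizing cross-entropy is equivalent to minimizing perplexity 2**H.
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda l: mix_cross_entropy(dev_tokens, lm_a, lm_b, l))
```

When one component model assigns uniformly higher probability to the development tokens, the search pushes lambda toward that model, mirroring how the paper's weights (0.49, 0.58, 0.61, 0.81) reflect the relative fit of the Hansard and Europarl components.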
tween languages, specifically the use of a 24-hour In many cases, the adapted system produces clock (common in French) vs. a 12-hour clock more fluent and accurate translations. In the fol- (common in English). The adapted system is lowing examples, the baseline system generates more consistent in translating the former to the common translations of French words that are ad- latter: equate for a wider context, whereas the adapted system chooses less common, but more suitable Source On avait d´ecid´e de poursuivre la s´eance translations: jusqu’ a` 18 heures, mais on n’aura pas le Source J’ai eu cette perception et j’´etais assez temps de faire un autre tour de table. certain que c¸a allait se faire. Baseline We had decided to continue the meeting Baseline I had that perception and I was enough until 18 hours, but we will not have the time certain it was going do. to do another round. Adapted I had that perception and I was quite Adapted We had decided to continue the meeting certain it was going do. until 6 p.m., but we won’t have the time to do Source J’attends donc que vous en demandiez la another round. permission, monsieur le Pr´esident. Source Vu qu’il est 17h 20, je suis d’accord Baseline I look so that you seek permission, mr. pour qu’on ne discute pas de ma motion chairman. imm´ediatement. Adapted I await, then, that you seek permission, Baseline Seen that it is 17h 20, I agree that we are mr. chairman. not talking about my motion immediately. In quite a few cases, the baseline system leaves Adapted Given that it is 5:20, I agree that we are out important words from the source sentence, not talking about my motion immediately. producing ungrammatical, even illegible transla- tions, whereas the adapted system generates good In (human) translation circles, translating out of translations. Careful traceback reveals that the one’s mother tongue is considered unprofessional, baseline system ‘splits’ the source sentence into even unethical (Beeby, 2009). 
Many professional phrases differently (and less optimally) than the associations in Europe urge translators to work adapted system. Apparently, when the decoder is exclusively into their mother tongue (Pavlovi´c, coerced to select translation options that are more 2007). The two kinds of automatic systems built adapted to translationese, it tends to select source in this paper reflect only partly the human sit- phrases that are more related to original texts, re- uation, but they do so in a crucial way. The sulting in more successful coverage of the source S → T systems learn examples from many hu- sentence: man translators who follow the decree according Source Pourtant, lorsqu’ on les avait pr´esent´es, to which translation should be made into one’s na- c’´etait pour corriger les probl`emes li´es au tive tongue. The T → S systems are flipped di- PCSRA. rections of humans’ input and output. The S → T Baseline Yet when they had presented, it was to direction proved to be more fluent, accurate and correct the problems the CAIS program. even more culturally sensitive. This has to do with Adapted Yet when they had presented, it was to fact that the translators ‘cover’ the source texts correct the problems associated with CAIS. more fully, having a better ‘translation model’. 262 7 Conclusion References Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Phrase tables trained on parallel corpora that were Domain adaptation via pseudo in-domain data se- translated in the same direction as the translation lection. In Proceedings of the 2011 Conference task perform better than ones trained on corpora on Empirical Methods in Natural Language Pro- translated in the opposite direction. Nonethe- cessing, pages 355–362. Association for Computa- less, even ‘wrong’ phrase tables contribute to the tional Linguistics, July 2011. URL http://www. translation quality. We analyze both ‘correct’ and aclweb.org/anthology/D11-1033. 
‘wrong’ phrase tables, uncovering a great deal of Michiel Bacchiani, Michael Riley, Brian Roark, and difference between them. We use insights from Richard Sproat. MAP adaptation of stochastic Translation Studies to explain these differences; grammars. Computer Speech and Language, 20:41– we then adapt the translation model to the nature 68, January 2006. ISSN 0885-2308. doi: 10.1016/ of translationese. j.csl.2004.12.001. URL http://dl.acm.org/ citation.cfm?id=1648820.1648854. We incorporate information-theoretic measures that correlate well with translationese into phrase Mona Baker. Corpus linguistics and translation stud- tables as an additional score that can be tuned ies: Implications and applications. In Gill Fran- cis Mona Baker and Elena Tognini-Bonelli, editors, by MERT, and show a statistically significant im- Text and technology: in honour of John Sinclair, provement in the translation quality over all base- pages 233–252. John Benjamins, Amsterdam, 1993. line systems. We also analyze the results qual- Mona Baker. Corpora in translation studies: An itatively, showing that SMT systems adapted to overview and some suggestions for future research. translationese tend to produce more coherent and Target, 7(2):223–243, September 1995. fluent outputs than the baseline systems. An addi- tional advantage of our approach is that it does not Mona Baker. Corpus-based translation studies: The challenges that lie ahead. In Gill Francis require an annotation of the translation direction Mona Baker and Elena Tognini-Bonelli, editors, of the parallel corpus. It is completely generic Terminology, LSP and Translation. Studies in lan- and can be applied to any language pair, domain guage engineering in honour of Juan C. Sager, or corpus. pages 175–186. John Benjamins, Amsterdam, 1996. This work can be extended in various direc- Marco Baroni and Silvia Bernardini. A new tions. 
We plan to further explore the use of two approach to the study of Translationese: Machine- phrase tables, one for each direction-determined learning the difference between original and subset of the parallel corpus. Specifically, we will translated text. Literary and Linguistic Com- interpolate the translation models as in Foster and puting, 21(3):259–274, September 2006. URL Kuhn (2007), including a maximum a posteriori http://llc.oxfordjournals.org/cgi/ content/short/21/3/259?rss=1. combination (Bacchiani et al., 2006). We also plan to upweight the S → T subset of the parallel Alison Beeby. Direction of translation (directional- corpus and train a single phrase table on the con- ity). In Mona Baker and Gabriela Saldanha, edi- tors, Routledge Encyclopedia of Translation Stud- catenated corpus. Finally, we intend to extend this ies, pages 84–88. Routledge (Taylor and Francis), work by combining the translation-model adap- New York, 2nd edition, 2009. tation we present here with the language-model adaptation suggested by Lembersky et al. (2011) Stanley F. Chen. An empirical study of smoothing techniques for language modeling. Technical report in a unified system that is more tuned to generat- 10-98, Computer Science Group, Harvard Univer- ing translationese. sity, November 1998. George Foster and Roland Kuhn. Mixture-model adap- Acknowledgments tation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages We are grateful to Cyril Goutte, George Foster 128–135. Association for Computational Linguis- and Pierre Isabelle for providing us with an anno- tics, June 2007. URL http://www.aclweb. tated version of the Hansard corpus. This research org/anthology/W/W07/W07-0717. was supported by the Israel Science Foundation George Foster, Cyril Goutte, and Roland Kuhn. Dis- (grant No. 137/06) and by a grant from the Israeli criminative instance weighting for domain adap- Ministry of Science and Technology. 
tation in statistical machine translation. In 263 Proceedings of the 2010 Conference on Em- Companion Volume Proceedings of the Demo and pirical Methods in Natural Language Process- Poster Sessions, pages 177–180, Prague, Czech Re- ing, pages 451–459, Stroudsburg, PA, USA, public, June 2007. Association for Computational 2010. Association for Computational Linguis- Linguistics. URL http://www.aclweb.org/ tics. URL http://dl.acm.org/citation. anthology/P07-2045. cfm?id=1870658.1870702. Philipp Koehn, Alexandra Birch, and Ralf Steinberger. Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai- 462 machine translation systems for Europe. In Ma- Fu Lee. Toward a unified approach to statistical lan- chine Translation Summit XII, 2009. guage modeling for Chinese. ACM Transactions Moshe Koppel and Noam Ordan. Translationese on Asian Language Information Processing, 1:3– and its dialects. In Proceedings of the 49th An- 33, March 2002. ISSN 1530-0226. doi: http://doi. nual Meeting of the Association for Computa- acm.org/10.1145/595576.595578. URL http:// tional Linguistics: Human Language Technolo- doi.acm.org/10.1145/595576.595578. gies, pages 1318–1326, Portland, Oregon, USA, Martin Gellerstam. Translationese in Swedish novels June 2011. Association for Computational Lin- translated from English. In Lars Wollin and Hans guistics. URL http://www.aclweb.org/ Lindquist, editors, Translation Studies in Scandi- anthology/P11-1132. navia, pages 88–95. CWK Gleerup, Lund, 1986. David Kurokawa, Cyril Goutte, and Pierre Isabelle. Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, Automatic detection of translated text and its im- and Ruslan Mitkov. Identification of translationese: pact on machine translation. In Proceedings of MT- A machine learning approach. In Alexander F. Summit XII, 2009. 
Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Patrik Lambert, Holger Schwenk, Christophe Ser- Linguistics and Intelligent Text Processing, vol- van, and Sadaf Abdul-Rauf. Investigations on ume 6008 of Lecture Notes in Computer Science, translation model adaptation using monolingual pages 503–511. Springer, 2010. ISBN 978-3- data. In Proceedings of the Sixth Workshop 642-12115-9. URL http://dx.doi.org/10. on Statistical Machine Translation, pages 284– 1007/978-3-642-12116-6. 293. Association for Computational Linguistics, July 2011. URL http://www.aclweb.org/ Howard Johnson, Joel Martin, George Foster, and anthology/W11-2132. Roland Kuhn. Improving translation quality by dis- carding most of the phrasetable. In Proceedings of Gennadi Lembersky, Noam Ordan, and Shuly Wint- the Joint Conference on Empirical Methods in Nat- ner. Language models for machine translation: ural Language Processing and Computational Nat- Original vs. translated texts. In Proceedings of the ural Language Learning (EMNLP-CoNLL), pages 2011 Conference on Empirical Methods in Natural 967–975. Association for Computational Linguis- Language Processing, pages 363–374, Edinburgh, tics, June 2007. URL http://www.aclweb. Scotland, UK, July 2011. Association for Computa- org/anthology/D/D07/D07-1103. tional Linguistics. URL http://www.aclweb. org/anthology/D11-1034. Philipp Koehn. Statistical significance tests for ma- chine translation evaluation. In Proceedings of Robert C. Moore and William Lewis. Intelligent EMNLP 2004, pages 388–395, Barcelona, Spain, selection of language model training data. In July 2004. Association for Computational Linguis- Proceedings of the ACL 2010 Conference, Short tics. Papers, pages 220–224, Stroudsburg, PA, USA, 2010. Association for Computational Linguis- Philipp Koehn. Europarl: A Parallel Corpus tics. URL http://dl.acm.org/citation. for Statistical Machine Translation. In Confer- cfm?id=1858842.1858883. 
ence Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. Franz Josef Och. Minimum error rate training in sta- AAMT. URL http://mt-archive.info/ tistical machine translation. In ACL ’03: Proceed- MTS-2005-Koehn.pdf. ings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Morris- Philipp Koehn, Hieu Hoang, Alexandra Birch, town, NJ, USA, 2003. Association for Computa- Chris Callison-Burch, Marcello Federico, Nicola tional Linguistics. doi: http://dx.doi.org/10.3115/ Bertoldi, Brooke Cowan, Wade Shen, Christine 1075096.1075117. Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Franz Josef Och and Hermann Ney. Improved statisti- Open source toolkit for statistical machine transla- cal alignment models. In ACL ’00: Proceedings of tion. In Proceedings of the 45th Annual Meeting the 38th Annual Meeting on Association for Com- of the Association for Computational Linguistics putational Linguistics, pages 440–447, Morristown, 264 NJ, USA, 2000. Association for Computational Lin- guistics. doi: http://dx.doi.org/10.3115/1075218. 1075274. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic eval- uation of machine translation. In ACL ’02: Proceed- ings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morris- town, NJ, USA, 2002. Association for Computa- tional Linguistics. doi: http://dx.doi.org/10.3115/ 1073083.1073135. Nataˇsa Pavlovi´c. Directionality in translation and in- terpreting practice. Report on a questionnaire sur- vey in Croatia. Forum, 5(2):79–99, 2007. Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980. Gideon Toury. Descriptive Translation Studies and be- yond. John Benjamins, Amsterdam / Philadelphia, 1995. Hans van Halteren. Source language markers in EU- ROPARL translations. 
In COLING ’08: Proceed- ings of the 22nd International Conference on Com- putational Linguistics, pages 937–944, Morristown, NJ, USA, 2008. Association for Computational Lin- guistics. ISBN 978-1-905593-44-6. 265 Aspectual Type and Temporal Relation Classification Francisco Costa Ant´onio Branco Universidade de Lisboa Universidade de Lisboa

[email protected] [email protected]

Abstract

In this paper we investigate the relevance of aspectual type for the problem of temporal information processing, i.e. the problems of the recent TempEval challenges.

For a large list of verbs, we obtain several indicators about their lexical aspect by querying the web for expressions where these verbs occur in contexts associated with specific aspectual types.

We then proceed to extend existing solutions for the problem of temporal information processing with the information extracted this way. The improved performance of the resulting models shows that (i) aspectual type can be data-mined with unsupervised methods with a level of noise that does not prevent this information from being useful and that (ii) temporal information processing can profit from information about aspectual type.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 266-275, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Extracting the temporal information present in a text is relevant to many natural language processing applications, including question-answering, information extraction, and even document summarization, as summaries may be more readable if they follow a chronological order.

Recent evaluation campaigns have focused on the extraction of temporal information from written text. TempEval (Verhagen et al., 2007), in 2007, and more recently TempEval-2 (Verhagen et al., 2010), in 2010, were concerned with this problem. Additionally, they provided data that can be used to develop and evaluate systems that can automatically temporally tag natural language text. These data are annotated according to the TimeML (Pustejovsky et al., 2003) scheme.

Figure 1 shows a small and slightly simplified fragment of the data from TempEval, with TimeML annotations. There, event terms, such as the term referring to the event of releasing the tapes, are annotated using EVENT tags. States (such as the situations denoted by verbs like want or love) are also considered events. Temporal expressions, such as today, are enclosed in TIMEX3 tags. The attribute value of time expressions holds a normalized representation of the date or time they refer to (e.g. the word today denotes the date 1998-01-14 in this example). The TLINK elements at the end describe temporal relations between events and temporal expressions. For instance, the event of the plane going down is annotated as temporally preceding the date denoted by the temporal expression today.

<s>In Washington <TIMEX3 tid="t53" type="DATE"
value="1998-01-14">today</TIMEX3>, the Federal
Aviation Administration <EVENT eid="e1"
class="OCCURRENCE" stem="release"
aspect="NONE" tense="PAST" polarity="POS"
pos="VERB">released</EVENT> air traffic
control tapes from <TIMEX3 tid="t54" type="TIME"
value="1998-XX-XXTNI">the night</TIMEX3> the TWA
Flight eight hundred <EVENT eid="e2"
class="OCCURRENCE" stem="go" aspect="NONE"
tense="PAST" polarity="POS"
pos="VERB">went</EVENT> down.</s>
<TLINK lid="l1" relType="BEFORE"
eventID="e2" relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP"
eventID="e2" relatedToTime="t54"/>

Figure 1: Sample of the data annotated for TempEval, corresponding to the fragment: In Washington today, the Federal Aviation Administration released air traffic control tapes from the night the TWA Flight eight hundred went down.

The major tasks of these two TempEval evaluation challenges were about guessing the type of temporal relations, i.e. the value of the relType attribute of the TLINK elements in Figure 1, all other annotations being given. Temporal relation classification is also the most interesting problem in temporal information processing. The other relevant tasks (identifying and normalizing temporal expressions and events) have a longer research history and show better evaluation results.

TempEval was organized in three tasks (TempEval-2 has four additional ones, that are not relevant to this work): task A was concerned with classifying temporal relations holding between an event and a time mentioned in the same sentence (although they could be syntactically unrelated, as the temporal relation represented by the TLINK whose lid attribute has the value l1 in Figure 1); task B focused on the temporal relation between events and the document's creation time, which is also annotated in TimeML (not shown in that figure); and task C was about classifying the temporal relation between the main events of two consecutive sentences. The possible values for the type of temporal relation are BEFORE, AFTER and OVERLAP.¹

Table 1 shows the results of the first TempEval evaluation. The results of TempEval-2 are fairly similar (Verhagen et al., 2010), but the data used are similar but not identical.

                               Task
                              A      B      C
Best system                   0.62   0.80   0.55
Average of all participants   0.56   0.74   0.51
Majority class baseline       0.57   0.56   0.47

Table 1: Results for English in TempEval (F-measure), from Verhagen et al. (2009)

The best system in TempEval for tasks A and B (Puşcaşu, 2007) combined statistical and knowledge based methods to propagate temporal constraints along parse trees coming from a syntactic parser. The best system for task C (Min et al., 2007) also combined rule-based and machine learning approaches. It employed sophisticated NLP to compute some of the features used; more specifically it used syntactic features.

Our goal with this work is to evaluate the impact of information about aspectual type on these tasks. The TimeML annotations include an attribute class for EVENTs that encodes some aspectual information, distinguishing between stative (annotated with the value STATE) and non-stative events (value OCCURRENCE). This attribute is relevant to the classification problem at hand, i.e. it is a useful feature for machine learned classifiers for the TempEval tasks (although this class attribute encodes other kinds of information as well). However, aspectual distinctions can be more fine-grained than a mere binary distinction, and so far no system has explored this sort of information to help improve the solutions to temporal relation classification.

In this paper we work with Portuguese, but in principle there is no reason to believe that our findings would not apply to other languages that display similar aspectual phenomena, such as English. Some of the details, such as the material in Section 4.2, are however language specific and would need adaptation.

2 Aspectual Type

Distinctions of aspectual type (also referred to as situation type, lexical aspect or Aktionsart) of the sort of Vendler (1967) and Dowty (1979) are expected to improve the existing solutions to the problem of temporal relation classification. The major aspectual distinctions are between (i) states (e.g. to hate beer, to know the answer, to own a car, to stink), (ii) processes, also called activities (to work, to eat ice cream, to grow, to play the piano), (iii) culminated processes, also called accomplishments (to paint a picture, to burn down, to deliver a sermon) and (iv) culminations, also called achievements (to explode, to win the game, to find the key). States and processes are atelic situations in that they do not make salient a specific instant in time. Culminated processes and culminations are telic situations: they have an intrinsic, instantaneous endpoint, called the culmination (e.g.

¹ There are the additional disjunctive values
in the case of to paint a picture, it is BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER and VAGUE, employed when the annotators could not make a the moment when the picture is ready; in the case more specific decision, but these affect a small number of of to explode, it is the moment of the explosion). instances. There are several reasons to think aspectual 267 type is relevant to temporal information pro- in he will read the book in three days but not with cessing. First, these distinctions are related to other aspectual types, as in he will be living there how long events last: culminations are punctual, in three days. whereas states can be very prolonged in time. A factor related to aspectual class, that is not States are thus more likely to temporally overlap trivial to account for, is the phenomenon of as- other temporal entities than culminations, for in- pectual shift, or aspectual coercion (Moens and stance. Steedman, 1988; de Swart, 1998; de Swart, 2000). Second, there are grammatical consequences Many linguistic contexts pose constraints on as- on how events are anchored in time. Consider pectual type. This does not mean, however, that the following examples, from Ritchie (1979) and clashes of aspectual type cause ungrammatical- Moens and Steedman (1988): ity. What often happens is that phrases associated with an incompatible aspectual type get their type (1) When they built the 59th Street bridge, changed in order to be of the required type, caus- they used the best materials. ing a change in meaning. (2) When they built that bridge, I was still a For instance, the progressive construction com- young lad. bines with processes. When it combines with e.g. a culminated process, the culmination is stripped The situation of building the bridge is a cul- off from this culminated process, which is thus minated processed, composed by the process of converted into a process. 
The result is that a sen- actively building a bridge followed by the culmi- tence like (5) does not say that the bridge was fin- nation of the bridge being finished. In sentence ished (the event has no culmination), whereas one (1), the event described in the main clause (that of such as (6) does say this (the event has a culmina- using the best materials) is a process, but in sen- tion). tence (2) it is a state (the state of being a young lad). Even though the two clauses in each sen- (5) They were building that bridge. tence are connected by when, the temporal rela- (6) They built that bridge. tions holding between the events of each clause are different. On the one hand, in sentence (1) Aspectual type is not a property of just words, the event of using the best materials (a process) but phrases as well. For example, while the overlaps with the process of actively building the progressive construction just mentioned combines bridge and precedes the culmination of finishing with processes, the resulting phrase behaves as a the bridge. On the other hand, in sentence (2) state (cf. the sentence When they built the 59th the event of being a young lad (which is a state) Street bridge, they were using the best materi- overlaps with both the process of actively build- als and what was mentioned above about when ing the bridge and the culmination of the bridge clauses). being built. This difference is arguably caused by the different aspectual types of the main events of 3 Strategy each sentence. Aspectual type is hard to annotate. This is partly As another example, states overlap with tem- because of what was just mentioned: it is not a poral location adverbials, as in (3), while culmi- property of just words, but rather phrases, and nations are included in them, as in (4). different phrases with the same head word can (3) He was happy last Monday. 
have different aspectual types; however anno- (4) He reached the top of Mount Everest last tation schemes like TimeML annotate the head Monday. word as denoting events, not full phrases or clauses. In other cases, differences in aspectual type can For this reason, our strategy is to obtain aspec- disambiguate ambiguous linguistic material. For tual type information from unannotated data. Be- instance, the preposition in is ambiguous as it can cause these data are gradient—an event-denoting be used to locate events in the future but also to word can be associated with different aspectual measure the duration of culminated processes; it types, depending on word sense—we do not aim is thus ambiguous with culminated processes, as to extract categorical information, but rather nu- 268 meric values for each event term that reflect as- data. Relevant to our work is that of Siegel and sociations to aspectual types. These may be seen McKeown (2000). The authors guess the aspec- as values that are indicative of the frequencies in tual type of verbs by searching for specific pat- which an event term denotes a state, or a process, terns in a one million word corpus that has been etc. syntactically parsed. They extract several linguis- In order to extract these indicators, we resort to tic indicators and combine them with machine a methodology sometimes referred to as Google learning algorithms. The indicators that they ex- Hits: large amounts of queries are sent to a web tract are naturally different from ours, since they search engine (not necessarily Google), and the have access to syntactic structure and we do not, number of search results (the number of web but our data are based on a much larger corpus. pages that match the query) is recorded and taken as a measure of the frequency of the queried ex- 3.2 Textual Patterns as Indicators of pression. 
Aspectual Type This methodology is not perfect, since multiple Because of aspectual shift phenomena (see Sec- occurrences of the queried expression in the same tion 2), full syntactic parsing is necessary in order web page are not reflected in the hit count, and to determine the aspectual type of a natural lan- in many cases the hit counts reported by search guage expression. However, this can be approxi- engines are just estimates and might not be very mated by frequencies: it is natural to expect that accurate. Additionally, uncarefully formulated e.g. stative verbs occur more frequently in stative queries can match expressions that are syntacti- contexts than non-stative verbs, even if there may cally and semantically very different from what be errors in determining these contexts if syntactic was intended. In any case, it has the advantages parsing is not a possibility. of being based on a very large amount of data and If one uses Google Hits, syntactic information not requiring any manual annotation, which can is not accessible. In return for its impreciseness, introduce errors. Google Hits have the advantage of being based on very large amounts of data. 3.1 The Web as a Very Large Corpus Hearst (1992) is one of the earliest studies where 4 Scope and Approach specific textual patterns are used to extract lexico- semantic information from very large corpora. In this study we focus exclusively on verbs, but The author’s goal was to extract hyponymy rela- events can be denoted by words belonging to tions. With the same goal, Kozareva et al. (2008) other parts-of-speech. This limitation is linked to apply similar textual patterns to the web. the fact that the textual patterns that are used to The web has been used as a corpus by many search for specific aspectual contexts are sensitive other authors with the purpose of extracting syn- to part-of-speech (i.e. 
what may work for a verb tactic or semantic properties of words or re- may not work equally well for a noun). lations between them, e.g. Ravichandran and In order to assess whether aspectual type in- Hovy (2002), Etzioni et al. (2004), etc. Some formation is relevant to the problem of temporal of this work is specially relevant to the problem relation classification, our approach is to check of temporal information processing. VerbOcean whether incorporating that kind of information (Chklovski and Pantel, 2004) is a database of into existing solutions for this problem can im- web mined relations between verbs. Among other prove their performance. TimeML annotated kinds of relations, it includes typical precedence data, such as those used for TempEval, can be relations, e.g. sleeping happens before waking up. used to train machine learned classifiers. These This type of information has in fact been used by can then be augmented with attributes encoding some of the participating systems of TempEval-2 aspectual type information and their performance (Ha et al., 2010), with good results. compared to the original classifiers. More generally, there is a large body of work Additionally, we work with Portuguese data. focusing on lexical acquisition from corpora. Just This is because our work is part of an effort to as an example, Mayol et al. (2005) learn subcate- implement a temporal processing system for Por- gorization frames of verbs from large amounts of tuguese. We briefly describe the data next. 269 <s>Em Washington, <TIMEX3 tid="t53" type="DATE" tool (Branco et al., 2009) to generate the specific value="1998-01-14">hoje</TIMEX3>, a Federal Aviation verb forms that are used in the queries. They are Administration <EVENT eid="e1" class="OCCURRENCE" stem="publicar" aspect="NONE" tense="PPI" mostly third person singular forms of several dif- polarity="POS" pos="VERB">publicou</EVENT> ferent tenses. 
gravac¸o˜ es do controlo de tr´afego a´ereo da <TIMEX3 tid="t54" type="TIME" The indicators that we used are ratios of Google value="1998-XX-XXTNI">noite</TIMEX3> em que o voo Hits. They compare two queries. TWA800 <EVENT eid="e2" class="OCCURRENCE" Several indicators were tested. We provide ex- stem="cair" aspect="NONE" tense="PPI" polarity="POS" pos="VERB">caiu</EVENT>.</s> amples with the verb fazer “do” for the queries <TLINK lid="l1" relType="BEFORE" eventID="e2" being compared by each indicator. The name of relatedToTime="t53"/> each indicator reflects the aspectual type being <TLINK lid="l2" relType="OVERLAP" tested, i.e. states should present high values for eventID="e2" relatedToTime="t54"/> State Indicators 1 and 2, processes should show high values for Process Indicators 1–4, etc. Figure 2: Sample of the Portuguese data adapted from the TempEval data, corresponding to the fragment: Em • State Indicator 1 (Indicator S1) is about im- Washington, hoje, a Federal Aviation Administration perfective and perfective past forms of verbs. publicou gravac¸o˜ es do controlo de tr´afego a´ereo da It compares the number of hits a for an im- noite em que o voo TWA800 caiu. perfective form fazia “did” to the number of a hits b for a perfective form fez “did”: a+b . 4.1 Data Assuming the imperfective past constrains the entire clause to be a state, and the perfec- Our experiments used TimeBankPT (Costa and tive past constrains it to be telic, the higher Branco, 2010; Costa and Branco, 2012; Costa, to this value the more frequently the verb ap- appear). This corpus is an adaptation of the orig- pears in stative clauses in a past tense.2 inal TempEval data to Portuguese, obtained by translating it and then adapting the annotations. • State Indicator 2 (Indicator S2) is about the Figure 2 shows the Portuguese equivalent to the co-occurrence with acaba de “has just fin- sample presented above in Figure 1. The two cor- ished”. 
It compares the number of hits a pora are quite similar, but there is of course the for acaba de fazer “has just finished doing” language difference. TimeBankPT contains a few to the number of hits b for fazer “to do”: corrections to the data (mostly the temporal rela- b a+b . In Portuguese, this construction does tions), but these corrections only changed around not seem to be felicitous with states. 1.2% of the total number of annotated temporal relations (Costa and Branco, 2012). Although we • Process Indicator 1 (Indicator P1) is about did not test our results on English data, we specu- past progressive forms and simple past forms late that our results carry over to other languages. (both imperfective). It compares the num- Just like the original English corpus for ber of hits a for fazia “did” to the number of TempEval, it is divided in a training part and a hits b for estava a fazer “was doing”: a+b b . testing part. The numbers (sentences, words, an- Assuming the progressive construction is a notated events, time expressions and temporal re- function from processes to states (see Sec- lations) are fairly similar for the two corpora (the tion 2), the higher this value, the more likely English one and the Portuguese one). the verb can occur with the interpretation of a process. 4.2 Extracting the Aspectual Indicators 2 We expect this frequency to be indicative of states be- We extracted the 4,000 most common verbs from cause states can appear in the imperfective past tense with a 180 million word corpus of Portuguese news- their interpretation unchanged, whereas non-stative events paper text, CETEMP´ublico. Because this corpus have their interpretation shifted to a stative one in that con- is not annotated, we used a part-of-speech tag- text (e.g. they get a habitual reading). 
In order to refer to an event occurring in the past with an on-going interpretation, ger and morphological analyzer (Barreto et al., non-stative verbs require the progressive construction to be 2006; Silva, 2007) to detect verbs and to obtain used in Portuguese, whereas states do not. Therefore, states their dictionary form. We then used an inflection should occur more freely in the simple imperfective past. 270 • Process Indicator 2 (Indicator P2) is about • Culmination Indicator1 (Indicator C1) is past progressive forms vs. simple past forms about differentiating culminations and cul- (perfective). It compares the number of hits minated processes. It compares the number a for fez “did” to the number of hits b for of hits a for fez de repente “did suddenly” to b esteve a fazer “was doing”: a+b . Similarly the number of hits b for fez num instante “did a to the previous indicator, this one tests the in an instant”: a+b . frequency of a verb appearing in a context typical of processes. For each of the 4,000 verbs, the necessary queries required by these indicators were gener- • Process Indicator 3 (Indicator P3) is about ated and then sent to a search engine. The queries the occurrence of for Adverbials. It com- were enclosed in quotes, so as to guarantee ex- pares the number of hits a for fez “did” to act matches. The number of hits was recorded for the number of hits b for fez durante muito each query. b tempo “did for a long time”: a+b . This We had some problems with outliers for a few number is also intended to be an indica- rather infrequent verbs. These could show very tion of how frequent a verb can be used extreme values for some indicators. In order with the interpretation of a process. Note to minimize their impact, for each indicator we that Portuguese allows modifiers to occur homogenized the 100 highest values that were freely between a verb and its complements, found. 
More specifically, for each indicator, each so this test should work for transitive verbs one of the highest 100 values was replaced by the (or any other subcategorization frame involv- 100th highest value. The bottom 100 values were ing complements), not just intransitive ones. similarly changed. This way the top 99 values and • Process Indicator 4 (Indicator P4) is about the bottom 99 values are replaced by the 100th the co-occurrence of a verb with parar de “to highest value and the 100th lowest value respec- stop”. It compares the number of hits a for tively. parou de fazer “stopped doing” to the num- Each indicator ranges between 0 and 1 in the- a ber of hits b for fazer “to do”: a+b . Just like ory. In practice, we seldom find values close to the the English verbs stop and finish are sensitive extremes, as this would imply that some queries to the aspectual type of their complement, so would have close to 0 hits, which does not occur is the Portuguese verb parar, which selects very often (after all, we intentionally used queries for processes. for which we would expect large hit counts, as these are more likely to be representative of true • Atelicity Indicator 1 (Indicator A1) is about language use). For this reason, each indicator is comparing in and for adverbials. It compares scaled so that its minimum (actual) value is 0 and the number of hits a for fez num instante “did its maximum (actual) value is 1. in an instant” to the number of hits b for fez durante muito tempo “did for a long time”: 5 Evaluation b a+b . Processes can be modified by for ad- As mentioned before, in order to assess the use- verbials, whereas culminated processes are fulness of these aspectual indicators for the tasks modified by in adverbials. This indicator of temporal relation classification, we checked tests the occurrence of a verb in contexts that whether they can improve machine learned clas- require these aspectual types. sifiers trained for this problem. 
We next describe • Atelicity Indicator 2 (Indicator A2) is about the classifiers that were used as the bases for com- comparing for Adverbials with suddenly. It parison. compares the number of hits a for fez de re- pente “did suddenly” to the number of hits 5.1 Experimental Setup b for fez durante muito tempo “did for a In order to obtain bases for comparison, we b long time”: a+b . De repente “suddenly” trained machine learned classifiers on the Por- seems to modify culminations, so this indi- tuguese corpus TimeBankPT, that is adapted from cator compares process readings with culmi- the TempEval data (see Section 4.1). We took nation readings. inspiration in the work of Hepple et al. (2007). 271 This was one of the participating systems of Task TempEval. It used machine learning algorithms Attribute A B C implemented in Weka (Witten and Frank, 1999). For our experiments, we used Weka’s implemen- event-aspect × X X tation of the C4.5 algorithm, trees.J48 (Quin- event-polarity X X X lan, 1993), the RIPPER algorithm as implemented event-POS × × X by Weka’s rules.JRip (Cohen, 1995), a near- event-stem × X × est neighbors classifier, lazy.KStar (Cleary event-string X × × and Trigg, 1995), a Na¨ıve Bayes classifier, namely event-class X × X Weka’s bayes.NaiveBayes (John and Lang- event-tense X X X ley, 1995), and a support vector classifier, Weka’s order-event-first X N/A N/A functions.SMO (Platt, 1998) . We chose these order-event-between X N/A N/A algorithms as they are representative of a wide order-timex3-between × N/A N/A range of machine learning approaches. order-adjacent X N/A N/A Recall that the tasks of TempEval are to guess the type of temporal relations. Each train or test timex3-mod X × N/A instance thus corresponds to a temporal relation, timex3-type × × N/A i.e. a TLINK element in the TimeML annota- tlink-relType X X X tions (see Figures 1 and 2). 
The classification problem is to determine the value of the attribute Table 2: Feature combinations used in the classifiers relType of TimeML TLINK elements. These used as comparison bases. Features inspired by the temporal relations relate an event (referred by the ones used by Hepple et al. (2007) in TempEval. eventID attribute of TLINK elements) to an- other temporal entity, that can be a time (pointed ement that represents the temporal relation to to by the relatedToTime attribute), in the case be classified. The order features are the at- of tasks A and B, or, in the case of task C, an- tributes computed from the document’s textual other event (given by the relatedToEvent at- content. The feature order-event-first tribute). encodes whether the event terms precedes in As for the features that were employed, we also the text the time expression it is related to by took inspiration in the approach of Hepple et al. the temporal relation to classify. The clas- (2007). These authors used as classifier attributes sifier attribute order-event-between de- two types of features. The first group of features scribes whether any other event is mentioned corresponds to TimeML attributes: for instance in the text between the two expressions for the value of the aspect attribute of EVENT el- the entities that are in the temporal relation, ements, for the events involved in the temporal and similarly order-timex3-between is relation to be classified. The second group of fea- about whether there is an intervening tempo- tures corresponds to simple features that can be ral expression. Finally, order-adjacent is computed with string manipulation and do not re- true iff both order-timex3-between and quire any kind of natural language processing. order-event-between are false (even if Table 2 shows the features that were tried and other linguistic material occurs between the ex- employed. 
pressions denoting the two entities in the temporal The event features correspond to attributes relation). of EVENT elements, with the exception of In order to arrive at the final set of features the event-string feature, which takes as (marked with a check mark in Table 2), we per- value the character data inside the correspond- formed exhaustive search on all possible combi- ing TimeML EVENT element. In a simi- nations of these features for each task, using the lar spirit, the timex3 features are taken from Na¨ıve Bayes algorithm. They were compared us- the attributes of TIMEX3 elements with the ing 10-fold cross-validation on the training data. same name. The tlink-relType feature The feature combinations shown in Table 2 are is the class attribute and corresponds to the the optimal combinations arrived at in this way. relType attribute of the TimeML TLINK el- These are the classifiers that we used for the 272 comparison with the aspectual type indicators. Task We chose this straightforward approach because it Classifier A B C forms a basis for comparison that is easily repro- ducible: the algorithm implementations that were trees.J48 0.57 0.77 0.53 used are part of freely available software, and the With best indicator 0.55 features that were employed are easily computed rules.JRip 0.60 0.76 0.51 from the annotated data, with no need to run any With best indicator 0.61 0.54 natural language processing tools whatsoever. As mentioned before in Section 4.1, the data lazy.KStar 0.54 0.70 0.52 used are organized in a training set and an evalu- With best indicator 0.73 0.53 ation set. The training part is around 60K words bayes.NaiveBayes 0.50 0.76 0.53 long, the test data containing around 9K words. With best indicator 0.53 0.54 When tested on held-out data, these classifiers functions.SMO 0.55 0.79 0.54 present the scores shown in italics in Table 3. With best indicator 0.56 0.55 These results are fairly similar to the scores that the system of Hepple et al. 
(2007) obtained in Table 3: Evaluation on held-out test data of classi- TempEval with English data: 0.59 for task A, 0.73 fiers trained on full train data. Values for the classi- for task B, and 0.54 for task C. They are also not fiers used as comparison bases are in italics. Boldface very far from the best results of TempEval. As highlights improvements resulting from incorporating such they represent interesting bases for compar- aspectual indicators as classifier features, and missing ison, as improving their performance is likely to values represent no improvement. be relevant to the best systems that have been de- veloped for temporal information processing. the event that is the first argument of this temporal 5.2 Results and Discussion relation. After adding each of these features, we retrained the classifiers on the training data and After obtaining the bases for comparison de- tested them on the held-out test data. In order to scribed above, we proceeded to check whether the keep the evaluation manageable, we did not test aspectual type indicators described in Section 4.2 combinations of multiple indicators. can improve these results. For each aspectual indicator, we implemented Table 3 shows the overall results. For task a classifier feature that encodes its value for the A, the best indicators were P4 (with JRip), A1 event term in the temporal relation (if it is not a (NaiveBayes) and S1 (SMO). For task B the verb, this value is missing). In the case of task C, best one was P4 (KStar). For task C, the best two features are added for each indicator, one for indicators were P3 (J48), A1 and P3 (JRip), each event term. C1 (KStar), A1 (NaiveBayes) and P2 (SMO). 
We extended each of these classifiers with one of these features at a time (two in the case of task C), and checked whether it improved the results on the test data. So, for instance, in order to test Indicator S1, we extended each of these classifiers with a feature that encodes the value that this indicator presents for the term that denotes the event present in the temporal relation to be classified. In the case of task C, two classifier features are added, one for each event term, and both for the same Indicator S1. For instance, for the (training) instance corresponding to the TLINK in Figure 2 with the lid attribute that has the value l1, the classifier feature for Indicator S1 has the value that was computed for the verb cair "go down", since this is the stem of the word that denotes the event.

Each of the indicators S2, P1 and A2 either does not improve the results or does so but not as much as another, better indicator for the same task and algorithm.

It seems clear from Table 3 that some tasks benefit from these indicators more than others. In particular, task C shows consistent improvements, whereas task B is hardly affected. Since task C is about relations involving two events, the classifiers may be picking up the sort of linguistic generalizations mentioned in Section 2 about when clauses.

J48 and JRip produce human-readable models. We checked how these classifiers are taking advantage of the aspectual indicators. For task C, the induced models generally associate high values of the indicators A1 and P3 with overlap relations and low values of these indicators with other types of relations. This is expected. On the one hand, high values for these indicators are associated with atelicity (i.e. the endpoint of the corresponding event is not presented). On the other hand, both indicators are based on queries containing the phrase durante muito tempo "for a long time", which, in addition to picking up events that can be modified by for-adverbials, more specifically picks up events that happen for a long time and are thus likely to overlap other events.

For task A, JRip also associates high values of the indicator P4—which constitute evidence that the corresponding events are processes (which are atelic)—with overlap relations. This is an especially interesting result, considering that the queries on which this indicator is based reflect a purely aspectual constraint.

6 Concluding Remarks

In this paper, we evaluated the relevance of information about aspectual type for temporal processing tasks.

Temporal information processing has received substantial attention recently with the two TempEval challenges in 2007 and 2010. The most interesting problem of temporal information processing, that of temporal relation classification, is still affected by high error rates.

Even though a very substantial part of the semantics literature on tense and aspect focuses on aspectual type, solutions to the problem of automatic temporal relation classification have not incorporated this sort of semantic information. In part this is expected, as aspectual type is closely interconnected with syntax (cf. the discussion of aspectual coercion in Section 2), and the phenomenon of aspect shift can make it hard to compute even when syntactic information is available.

Our contribution in this paper is to incorporate this sort of information into existing machine-learned classifiers that tackle this problem. Even though these classifiers do not have access to syntactic information, aspectual type information proved useful in improving the performance of these models. We hypothesize that combining aspectual type information with information about syntactic structure can further improve temporal information processing, but we leave that research to future work.

An interesting question that we hope will be addressed by future work is how these results extend to other languages. We cannot provide an answer to this question, as we do not have the data. However, this experiment can be replicated for any language that has (i) TimeML-annotated data, (ii) a reasonable number of documents on the Web and a search engine capable of separating them from the documents in other languages, and (iii) an aspectual system similar enough that the question being addressed in this paper makes sense (and useful patterns for queries can be constructed, even if not entirely identical to the ones that we used). The second criterion is met by many, many languages. The third one also seems to hold for many languages, as the existing literature on aspectual phenomena indicates that these phenomena are quite widespread. The first criterion is, at the moment, the hardest to fulfill, as not many languages have data with rich annotations about time (i.e. including events and temporal relations). We speculate that our results can extend to English, although a different set of query patterns may have to be used in order to extract the aspectual indicators that are employed. We believe this because the two languages largely overlap when it comes to aspectual phenomena.

References

Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes, and João Silva. 2006. Open resources and tools for the shallow processing of Portuguese: the TagShare project. In Proceedings of LREC 2006.

António Branco, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, João Silva, and Sara Silveira. 2009. LX-Center: a center of online linguistic services. In Proceedings of the Demo Session, ACL-IJCNLP 2009, Singapore.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of EMNLP 2004, Barcelona, Spain.

John G. Cleary and Leonard E. Trigg. 1995. K*: An instance-based learner using an entropic distance measure. In 12th International Conference on Machine Learning, pages 108–114.

William W. Cohen. 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123.

Francisco Costa and António Branco. 2010. Temporal information processing of a new language: Fast porting with minimal resources. In Proceedings of ACL 2010.

Francisco Costa and António Branco. 2012. TimeBankPT: A TimeML annotated corpus of Portuguese. In Proceedings of LREC 2012.

Francisco Costa. to appear. Processing Temporal Information in Unstructured Documents. Ph.D. thesis, Universidade de Lisboa, Lisbon.

Henriëtte de Swart. 1998. Aspect shift and coercion. Natural Language and Linguistic Theory, 16:347–385.

Henriëtte de Swart. 2000. Tense, aspect and coercion in a cross-linguistic perspective. In Proceedings of the Berkeley Formal Grammar conference, Stanford. CSLI Publications.

David R. Dowty. 1979. Word Meaning and Montague Grammar: the Semantics of Verbs and Times in Generative Semantics and Montague's PTQ. Reidel, Dordrecht.

Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web.

Eun Young Ha, Alok Baikadi, Carlyle Licata, and James C. Lester. 2010. NCSU: Modeling temporal relations with Markov logic and lexical ontology. In Proceedings of SemEval 2010.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, volume 2, pages 539–545, Nantes, France.

Mark Hepple, Andrea Setzer, and Rob Gaizauskas. 2007. USFD: Preliminary exploration of features and classifiers for the TempEval-2007 tasks. In Proceedings of SemEval-2007, pages 484–487, Prague, Czech Republic. Association for Computational Linguistics.

George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Mateo.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048–1056, Columbus, Ohio. Association for Computational Linguistics.

Laia Mayol, Gemma Boleda, and Toni Badia. 2005. Automatic acquisition of syntactic verb classes with basic resources. Language Resources and Evaluation, 39(4):295–312.

Congmin Min, Munirathnam Srikanth, and Abraham Fowler. 2007. LCC-TE: A hybrid approach to temporal relation identification in news text. In Proceedings of SemEval-2007, pages 219–222.

Marc Moens and Mark Steedman. 1988. Temporal ontology and temporal reference. Computational Linguistics, 14(2):15–28.

John Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Chris Burges, and Alexander J. Smola, editors, Advances in Kernel Methods—Support Vector Learning.

Georgiana Puşcaşu. 2007. WVALI: Temporal relation identification by syntactico-semantic analysis. In Proceedings of SemEval-2007, pages 484–487, Prague, Czech Republic. Association for Computational Linguistics.

James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003. TimeML: Robust specification of event and temporal expressions in text. In IWCS-5, Fifth International Workshop on Computational Semantics.

John Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of ACL 2002.

Graeme D. Ritchie. 1979. Temporal clauses in English. Theoretical Linguistics, 6:87–115.

Eric V. Siegel and Kathleen McKeown. 2000. Learning methods to combine linguistic indicators: Improving aspectual classification and revealing linguistic insights. Computational Linguistics, 24(4):595–627.

João Ricardo Silva. 2007. Shallow processing of Portuguese: From sentence chunking to nominal lemmatization. Master's thesis, Faculdade de Ciências da Universidade de Lisboa, Lisbon, Portugal.

Zeno Vendler. 1967. Verbs and times. Linguistics in Philosophy, pages 97–121.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of SemEval-2007.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Jessica Moszkowicz, and James Pustejovsky. 2009. The TempEval challenge: identifying temporal relations in text. Language Resources and Evaluation.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 task 13: TempEval-2. In Proceedings of SemEval-2010.

Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

Automatic generation of short informative sentiment summaries

Andrea Glaser and Hinrich Schütze
Institute for Natural Language Processing
University of Stuttgart, Germany

[email protected]

Abstract

In this paper, we define a new type of summary for sentiment analysis: a single-sentence summary that consists of a supporting sentence that conveys the overall sentiment of a review as well as a convincing reason for this sentiment. We present a system for extracting supporting sentences from online product reviews, based on a simple and unsupervised method. We design a novel comparative evaluation method for summarization, using a crowdsourcing service. The evaluation shows that our sentence extraction method performs better than a baseline of taking the sentence with the strongest sentiment.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 276–285, Avignon, France, April 23–27, 2012. © 2012 Association for Computational Linguistics

1 Introduction

Given the success of work on sentiment analysis in NLP, increasing attention is being focused on how to present the results of sentiment analysis to the user. In this paper, we address an important use case that has so far been neglected: quick scanning of short summaries of a body of reviews with the purpose of finding a subset of reviews that can be studied in more detail. This use case occurs in companies that want to quickly assess, perhaps on a daily basis, what consumers think about a particular product. One-sentence summaries can be quickly scanned – similar to the summaries that search engines give for search results – and the reviews that contain interesting and new information can then be easily identified. Consumers who want to quickly scan review summaries to pick out a few reviews that are helpful for a purchasing decision are a similar use case.

For a one-sentence summary to be useful in this context, it must satisfy two different "information needs": it must convey the sentiment of the review, but it must also provide a specific reason for that sentiment, so that the user can make an informed decision as to whether reading the entire review is likely to be worth the user's time – again similar to the purpose of the summary of a web page in search engine results.

We call a sentence that satisfies these two criteria a supporting sentence. A supporting sentence contains information on the sentiment as well as a specific reason for why the author arrived at this sentiment. Examples of supporting sentences are "The picture quality is very good" or "The battery life is 2 hours". Non-supporting sentences contain opinions without such reasons, such as "I like the camera" or "This camera is not worth the money".

To address use cases of sentiment analysis that involve quick scanning and selective reading of large numbers of reviews, we present a simple unsupervised system in this paper that extracts one supporting sentence per document and show that it is superior to a baseline of selecting the sentence with the strongest sentiment.

One problem we faced in our experiments was that standard evaluations of summarization would have been expensive to conduct for this study. We therefore used crowdsourcing to perform a new type of comparative evaluation method that is different from training set and gold standard creation, the dominant way crowdsourcing has been used in NLP so far.

In summary, our contributions in this paper are as follows. We define supporting sentences, a new type of sentiment summary that is appropriate in situations where both the sentiment of a review and a good reason for that sentiment need to be conveyed succinctly. We present a simple unsupervised method for extracting supporting sentences and show that it is superior to a baseline in a novel crowdsourcing-based evaluation.

In the next section, we describe related work that is relevant to our new approach. In Section 3 we present the approach we use to identify supporting sentences. Section 4 describes the feature representation of sentences and the classification method. In Section 5 we give an overview of the crowdsourcing evaluation. Section 6 discusses our experimental results. In Sections 7 and 8, we present our conclusions and plans for future work.

2 Related Work

Both sentiment analysis (Pang and Lee, 2008; Liu, 2010) and summarization (Nenkova and McKeown, 2011) are important subfields of NLP. The work most relevant to this paper is work on summarization methods that addresses the specific requirements of summarization in sentiment analysis. There are two lines of work in this vein with goals similar to ours: (i) aspect-based and pro/con summarization and (ii) approaches that extract summary sentences from reviews.

An aspect is a component or attribute of a product, such as "battery", "lens cap", "battery life", and "picture quality" for cameras. Aspect-oriented summarization (Hu and Liu, 2004; Zhuang et al., 2006; Kim and Hovy, 2006) collects sentiment assessments for a given set of aspects and returns a list of pros and cons about every aspect for a review or, in some cases, on a per-sentence basis.

Aspect-oriented summarization and pro/con summarization differ in a number of ways from supporting sentence summarization. First, aspects and pros & cons are taken from a fixed inventory. The inventory is typically small and does not cover the full spectrum of relevant information. Second, in its most useful form, aspect-oriented summarization requires classification of phrases and sentences according to the aspect they belong to; e.g., "The camera is very light" has to be recognized as being relevant to the aspect "weight". Developing a component that assigns phrases and sentences to their corresponding categories is time-consuming and has to be redone for each domain. Any such component will make mistakes, and undetected or incorrectly classified aspects can result in bad summaries.

Our approach enables us to find strong supporting sentences even if the reason given in that sentence does not fit well into the fixed inventory. No manual work like the creation of an aspect inventory is necessary, and there are no requirements on the format of the reviews such as author-provided pros and cons.

Aspect-oriented summarization also differs in that it does not differentiate along the dimension of quality of the reason given for a sentiment. For example, "I don't like the zoom" and "The zoom range is too limited" both give reasons for why a camera gets a negative evaluation, but only the latter reason is informative. In our work, we evaluate the quality of the reason given for a sentiment.

The use case we address in this paper requires a short, easy-to-read summary. A well-formed sentence is usually easier to understand than a pro/con table. It also has the advantage that the information conveyed accurately represents what the user wanted to say – this is not the case for a presentation that involves several complex processing steps and takes linguistic material out of the context that may be needed to understand it correctly.

Berend (2011) performs a form of pro/con summarization that does not rely on aspects. However, most of the problems of aspect-based pro/con summarization also apply to this paper: no differentiation between good and bad reasons, the need for human labels to train a classifier, and inferior readability compared to a well-formed sentence.

Two previous approaches that have attempted to extract sentences from reviews in the context of summarization are (Beineke et al., 2004) and (Arora et al., 2009). Beineke et al. (2004) train a classifier on rottentomatoes.com summary sentences provided by review authors. These sentences sometimes contain a specific reason for the overall sentiment of the review, but sometimes they are just catchy lines whose purpose is to draw moviegoers in to read the entire review; e.g., "El Bulli barely registers a pulse stronger than a book's" (which does not give a specific reason for why the movie does not register a strong pulse).
Arora et al. (2009) define two classes of sentences: qualified claims and bald claims. A qualified claim gives the reader more details (e.g., "This camera is small enough to fit easily in a coat pocket") while a bald claim is open to interpretation (e.g., "This camera is small"). Qualified/bald is a dimension of classification of sentiment statements that is to some extent orthogonal to quality of reason. Qualified claims do not have to contain a reason, and bald claims can contain an informative reason. For example, "I didn't like the camera, but I suspect it will be a great camera for first timers" is classified as a qualified claim, but the sentence does not give a good reason for the sentiment of the document. Both dimensions (qualified/bald, high-quality/low-quality reason) are important and can be valuable components of a complete sentiment analysis system.

Apart from the definition of the concept of supporting sentence, which we believe to be more appropriate for the application we have in mind than rottentomatoes.com summary sentences and qualified claims, there are two other important differences of our approach to these two papers. First, we directly evaluate the quality of the reasons in a crowdsourcing experiment. Second, our approach is unsupervised and does not require manual annotation of a training set of supporting sentences.

As we will discuss in Section 5, we propose a novel evaluation measure for summarization based on crowdsourcing in this paper. The most common use of crowdsourcing in NLP is to have workers label a training set and then train a supervised classifier on this training set. In contrast, we use crowdsourcing to directly evaluate the relative quality of the automatic summaries generated by the unsupervised method we propose.

3 Approach

Our approach is based on the following three premises.

(i) A good supporting sentence conveys both the review's sentiment and a supporting fact. We make this assumption because we want the sentence to be self-contained. If it only describes a fact about a product without evaluation, then it does not on its own explain which sentiment is conveyed by the article and why.

(ii) Supporting facts are most often expressed by noun phrases. We call a noun phrase that expresses a supporting fact a keyphrase. We are not assuming that all important words in the supporting sentence are nominal; the verb will be needed in many cases to accurately convey the reason for the sentiment expressed. However, it is a fairly safe assumption that part of the information is conveyed using noun phrases, since it is difficult to convey specific information without using specific noun phrases. Adjectives are often important when expressing a reason, but frequently a noun is also mentioned, or one would need to resolve a pronoun to make the sentence a self-contained supporting sentence. In a sentence like "It's easy to use" it is not clear what the adjective is referring to.

(iii) Noun phrases that express supporting facts tend to be domain-specific; they can be automatically identified by selecting noun phrases that are frequent in the domain – either in relative terms (compared to a generic corpus) or in absolute terms. By making this assumption we may fail to detect supporting sentences that are worded in an original way using ordinary words. However, in a specific domain there is usually a lot of redundancy, and most good reasons occur many times and are expressed by similar words.

Based on these assumptions, we select the supporting sentence in two steps. In the first step, we determine the n sentences with the strongest sentiment within every review by classifying the polarity of the sentences (where n is a parameter). In the second step, we select one of the n sentences as the best supporting sentence by means of a weighting function.

Step 1: Sentiment Classification

In this step, we apply a sentiment classifier to all sentences of the review to classify sentences as positive or negative. We then select the n sentences with the highest probability of conforming with the overall sentiment of the document. For example, if the document's polarity is negative, we select the n sentences that are most likely to be negative according to the sentiment classifier. We restrict the set of n sentences to sentences with the "right" sentiment because even an excellent supporting sentence is not a good characterization of the content of the review if it contradicts the overall assessment given by the review. Only in cases where there are fewer than n sentences with the correct sentiment do we also select sentences with the "wrong" sentiment to obtain a minimum of n sentences for each review.

Step 2: Weighting Function

Based on premises (ii) and (iii) above, we score a sentence based on the number of noun phrases that occur with high absolute and relative frequency in the domain. We only consider simple nouns and compound nouns consisting of two nouns in this paper. In general, compound nouns are more informative and specific. A compound noun may refer to a specific reason even if the head noun does not (e.g., "life" vs. "battery life"). This means that we need to compute scores in a way that allows us to give higher weight to compound nouns than to simple nouns.

In addition, we also include counts of nouns and compounds in the scoring that do not have high absolute/relative frequency, because frequency heuristics identify keyphrases with only moderate accuracy. However, these nouns and compounds are given a lower weight.

This motivates a scoring function that is a weighted sum of four variables: the number of simple nouns with high frequency, the number of infrequent simple nouns, the number of compound nouns with high frequency, and the number of infrequent compound nouns. High frequency is defined as follows. Let f_dom(p) be the domain-specific absolute frequency of phrase p, i.e., its frequency in the review corpus, and f_wiki(p) the frequency of p in the English Wikipedia. We view the distribution of terms in Wikipedia as domain-independent and define the relative frequency as in Equation 1:

    f_rel(p) = f_dom(p) / f_wiki(p)    (1)

We do not consider nouns and compound nouns that do not occur in Wikipedia when computing the relative frequency. A noun (resp. compound noun) is deemed to be of high frequency if it is one of the k% of nouns (resp. compound nouns) with the highest f_dom(p) and at the same time one of the k% of nouns (resp. compound nouns) with the highest f_rel(p), where k is a parameter.

Based on these definitions, we define four different sets: F1 (the set of nouns with high frequency), I1 (the set of infrequent nouns), F2 (the set of compounds with high frequency), and I2 (the set of infrequent compounds). An infrequent noun (resp. compound) is simply defined as a noun (resp. compound) that does not meet the frequency criterion.

We define the score s of a sentence with n tokens t_1 ... t_n (where the last token t_n is a punctuation mark) as follows:

    s = Σ_{i=1}^{n-1} [ w_f2 · [[(t_i, t_{i+1}) ∈ F2]] + w_i2 · [[(t_i, t_{i+1}) ∈ I2]] + w_f1 · [[t_i ∈ F1]] + w_i1 · [[t_i ∈ I1]] ]    (2)

where [[φ]] = 1 if φ is true and [[φ]] = 0 otherwise. Note that a noun in a compound will contribute to the overall score in two different summands.
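To make Equation 2 concrete, here is a minimal sketch of the scoring function. The token list and the sets F1, I1, F2, I2 below are hypothetical toy data (the real sets come from the corpus frequencies); the weights are the values reported in Table 2.

```python
# Minimal sketch of the sentence scoring function (Equation 2).
# F1/I1: frequent/infrequent simple nouns; F2/I2: frequent/infrequent
# noun-noun compounds, represented as token bigrams.

def score_sentence(tokens, F1, I1, F2, I2, w_f2, w_i2, w_f1, w_i1):
    """Weighted count of keyphrase nouns and compounds in a sentence.

    `tokens` includes a final punctuation token, so the sum ranges over
    positions 1..n-1, as in Equation 2. A noun inside a compound scores
    both as part of the bigram and as a simple noun (two summands).
    """
    s = 0.0
    for i in range(len(tokens) - 1):  # skip the final punctuation token
        bigram = (tokens[i], tokens[i + 1])
        if bigram in F2:
            s += w_f2
        if bigram in I2:
            s += w_i2
        if tokens[i] in F1:
            s += w_f1
        if tokens[i] in I1:
            s += w_i1
    return s

# Hypothetical example: "The battery life is short ."
F1, I1 = {"battery", "life"}, {"thing"}
F2, I2 = {("battery", "life")}, set()
tokens = ["The", "battery", "life", "is", "short", "."]
print(score_sentence(tokens, F1, I1, F2, I2, 1.07, 0.89, 0.46, 0.01))
```

In this toy example, "battery" and "life" each contribute w_f1 and the compound "battery life" contributes w_f2, illustrating how compounds dominate the score.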
The weights w_f2, w_i2, w_f1, and w_i1 are determined using logistic regression. The training set for the regression is created in an unsupervised fashion as follows. From each set of n sentences (one per review), we select the two highest scoring, i.e., the two sentences that were classified with the highest confidence. The two classes in the regression problem are then the top-ranked sentences vs. the sentences at rank 2. Since taking all sentences turned out to be too noisy, we eliminate sentence pairs where the top sentence is better than the second sentence on almost all of the set counts (i.e., the counts of members of F1, I1, F2, and I2). Our hypothesis in setting up this regression was that the sentence with the strongest sentiment often does not give a good reason. Our experiments confirm that this hypothesis is true.

The weights w_f2, w_i2, w_f1, and w_i1 estimated by the regression are then used to score sentences according to Equation 2. We give the same weight to all keyphrase compounds (and the same weight to all keyphrase nouns) – in future work one could attempt to give higher weights to keyphrases with higher absolute or relative frequency. In this paper, our goal is to establish a simple baseline for the task of extracting supporting sentences.

After computing the overall weight for each sentence in a review, the sentence with the highest weight is chosen as the supporting sentence – the sentence that is most informative for explaining the overall sentiment of the review.

4 Experiments

4.1 Data

We use part of the Amazon dataset from Jindal and Liu (2008). The dataset consists of more than 5.8 million consumer-written reviews of several products, taken from the Amazon website. For our experiment we used the digital camera domain and extracted 15,340 reviews covering a total of 740 products. See Table 1 for key statistics of the dataset.

    Type                        Number
    Brands                      17
    Products                    740
    Documents (all)             15,340
    Documents (cleaned)         11,624
    Documents (train)           9,880
    Documents (test)            1,744
    Short test documents        147
    Long test documents         1,597
    Average number of sents     13.36
    Median number of sents      10

    Table 1: Key statistics of our dataset

In addition to the review text, authors can give an overall rating (a number of stars) to the product. Possible ratings are 5 (very positive), 4 (positive), 3 (neutral), 2 (negative), and 1 (very negative). We unify ratings of 4 and 5 to "positive" and ratings of 1 and 2 to "negative" to obtain polarity labels for binary classification. Reviews with a rating of 3 are discarded.
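The star-to-polarity mapping just described is straightforward; a sketch (the review tuples are hypothetical):

```python
# Sketch of the star-rating to polarity mapping described above:
# 4-5 stars -> positive, 1-2 stars -> negative, 3 stars -> discarded.

def polarity_label(stars):
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # neutral reviews (3 stars) are discarded

reviews = [(5, "r1"), (3, "r2"), (1, "r3")]
labeled = [(text, polarity_label(s)) for s, text in reviews
           if polarity_label(s) is not None]
print(labeled)  # [('r1', 'positive'), ('r3', 'negative')]
```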
4.2 Preprocessing

We tokenized and part-of-speech (POS) tagged the corpus using TreeTagger (Schmid, 1994). We split each review into individual sentences using the sentence boundaries given by TreeTagger. One problem with user-written reviews is that they are often not written in coherent English, which results in wrong POS tags. To address some of these problems, we cleaned the corpus after the tokenization step. We separated word-punctuation clusters (e.g., word...word) and removed emoticons, HTML tags, and all sentences with three or fewer tokens, many of which were a result of wrong tokenization. We excluded all reviews with fewer than five sentences; short reviews are often low-quality and do not give good reasons. The cleaned corpus consists of 11,624 documents. Finally, we split the corpus into a training set (85%) and a test set (15%) as shown in Table 1. The average number of sentences per review is 13.36; the median is 10.

4.3 Sentiment Classification

We first build a sentence sentiment classifier by training the Stanford maximum entropy classifier (Manning and Klein, 2003) on the sentences in the training set. Sentences occurring in positive (resp. negative) reviews are labeled positive (resp. negative). We use a simple bag-of-words representation (without punctuation characters and frequent stop words). Propagating labels from documents to sentences creates a noisy training set because some sentences have a sentiment different from the sentiment of their documents; however, there is no alternative because we need per-sentence classification decisions but do not have per-sentence human labels. The accuracy of the classifier is 88.4% on "propagated" sentence labels.

We use the sentence classifier in two ways. First, it defines our baseline BL for extracting supporting sentences: the baseline simply proposes the sentence with the highest sentiment score that is compatible with the sentiment of the document as the supporting sentence. Second, the sentence classifier selects a subset of candidate sentences that is then further processed using the scoring function in Equation 2. This subset consists of the n = 5 sentences with the highest sentiment scores of the "right" polarity – that is, if the document is positive (resp. negative), then the n = 5 sentences with the highest positive (resp. negative) scores are selected.

4.4 Determining Frequencies and Weights

The absolute frequency of nouns and compound nouns is simply computed as their token frequency in the training set. For computing the relative frequency (as described in Section 3, Equation 1), we use the 20110405 dump of the English Wikipedia.
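A frequency-based keyphrase set of the kind defined in Section 3 can be assembled roughly as in this sketch. The counts are hypothetical toy values, and the exact tie-breaking and cutoff handling are our own assumptions, not details given in the paper.

```python
# Sketch: build the set of high-frequency phrases from the absolute
# domain frequency f_dom and the relative frequency
# f_rel = f_dom / f_wiki (Equation 1). A phrase is "high frequency"
# if it is in the top k% by BOTH criteria; phrases that do not occur
# in Wikipedia are ignored when computing f_rel.

def high_frequency_set(f_dom, f_wiki, k):
    candidates = [p for p in f_dom if f_wiki.get(p, 0) > 0]
    f_rel = {p: f_dom[p] / f_wiki[p] for p in candidates}
    cutoff = max(1, int(len(candidates) * k / 100.0))
    top_abs = set(sorted(candidates, key=f_dom.get, reverse=True)[:cutoff])
    top_rel = set(sorted(candidates, key=f_rel.get, reverse=True)[:cutoff])
    return top_abs & top_rel  # must rank high on both criteria

# Hypothetical toy counts for the camera domain vs. Wikipedia:
f_dom = {"battery life": 900, "picture quality": 800, "time": 700, "zoom": 50}
f_wiki = {"battery life": 30, "picture quality": 40, "time": 90000, "zoom": 600}
print(high_frequency_set(f_dom, f_wiki, k=50))
```

Note how "time" is frequent in the domain but is filtered out because its huge Wikipedia frequency gives it a low f_rel; this is the effect the relative-frequency criterion is designed to achieve.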
removed emoticons, html tags, and all sentences In the product review corpora we studied, with three or fewer tokens, many of which were the percentage of high-frequency keyphrase com- a result of wrong tokenization. We excluded all pound nouns was higher than that of simple reviews with fewer than five sentences. Short re- nouns. We therefore use two different thresh- views are often low-quality and do not give good olds for absolute and relative frequency. We de- 280 fine F1 as the set of nouns that are in the top supporting sentences. kn = 2.5% for both absolute and relative fre- quencies; and F2 as the set of compounds that are 5 Comparative Evaluation with Amazon in the top kp = 5% for both absolute and rela- Mechanical Turk tive frequencies. These thresholds are set to ob- One standard way to evaluate summarization sys- tain a high density of good keyphrases with few tems is to create hand-edited summaries and to false positives. Below the threshold there are still compute some measure of similarity (e.g., word other good keyphrases, but they cannot be sepa- or n-gram overlap) between automatic and human rated easily from non-keyphrases. summaries. An alternative for extractive sum- Sentences are scored according to Equation 2. maries is to classify all sentences in the document Recall that the parameters wf2 , wi2 , wf1 , and wi1 with respect to their appropriateness as summary are determined using logistic regression. The ob- sentences. An automatic summary can then be tained parameter values (see table 2) indicate the scored based on its ability to correctly identify relative importance of the four different types of good summary sentences. Both of these meth- terms. 
Compounds are the most important term ods require a large annotation effort and are most and even those with a frequency below the thresh- likely too complex to be outsourced to a crowd- old kp still provide more detailed information than sourcing service because the creation of manual simple nouns above the threshold kn ; the value of summaries requires skilled writers. For the sec- wi2 is approximately twice the value wf1 for this ond type of evaluation, ranking sentences accord- reason. Non-keyphrase nouns are least important ing to a criterion is a lot more time consuming and are weighted with only a very small value of than making a binary decision – so ranking the wi1 = 0.01. 13 or 14 sentences that a review contains on av- erage for the entire test set would be a signifi- Phrase Par Value cant annotation effort. It would also be difficult keyphrase compounds w f2 1.07 to obtain consistent and repeatable annotation in non-keyphrase compounds wi2 0.89 crowdsourcing on this task due to its subtlety. keyphrase nouns w f1 0.46 We therefore designed a novel evaluation non-keyphrase nouns wi1 0.01 methodology in this paper that has a much smaller startup cost. It is well known that relative judg- Table 2: Weight settings ments are easier to make on difficult tasks than ab- solute judgments. For example, much recent work The scoring function with these parameter val- on relevance ranking in information retrieval re- ues is applied to the n = 5 selected sentences of lies on relative relevance judgments (one docu- the review. The highest scoring sentence is then ment is more relevant than another) rather than ab- selected as the supporting sentence proposed by solute relevance judgments. We adopt this gen- our system. eral idea and only request such relative judgments For 1380 of the 1744 reviews, the sentence se- on supporting sentences from annotators. 
Unlike lected by our system is different from the baseline a complete ranking of the sentences (which would sentence; however, there are 364 cases (20.9%) require m(m − 1)/2 judgments where m is the where the two are the same. Only the 1380 cases length of the review), we choose a setup where where the two methods differ are included in the we need to only elicit a single relative judgment crowdsourcing evaluation to be described in the per review, one relative judgment on a sentence next section. As we will show below, our sys- pair (consisting of the baseline sentence and the tem selects better supporting sentences than the system sentence) for each of the 1380 reviews se- baseline in most cases. So if baseline and our sys- lected in the previous section. This is a manage- tem agree, then it is even more likely that the sen- able annotation task that can be run on a crowd- tence selected by both is a good supporting sen- sourcing service in a short time and at little cost. tence. However, there could also be cases where We use Amazon Mechanical Turk (AMT) for the n = 5 sentences selected by the sentiment this annotation task. The main advantage of AMT classifier are all bad supporting sentences or cases is that cost per annotation task is very low, so that where the document does not contain any good we can obtain large annotated datasets for an af- 281 file:///Users/hs0711/example2.html Task: Sentence 1: This 5 meg camera meets all my requirements. Sentence 2: Very good pictures, small bulk, long battery life. Which sentence gives the more convincing reason? Fill out exactly one field, please. Please type the blue word of the chosen sentence into the corresponding answer field. s1 s2 If both sentences do not give a convincing reason, type NOTCONV into this answer field. X Submit Figure 1: AMT interface for annotators fordable price. 
The disadvantage is the level of quality of the annotation, which will be discussed at the end of this section.

5.1 Task Design

We created a HIT (Human Intelligence Task) template including detailed annotation guidelines. Every HIT consists of a pair of sentences. One sentence is the baseline sentence; the other sentence is the system sentence, i.e., the sentence selected by the scoring function. The two sentences are presented in random order to avoid bias.

The workers are then asked to evaluate the relative quality of the sentences by selecting one of the following three options:

1. Sentence 1 has the more convincing reason
2. Sentence 2 has the more convincing reason
3. Neither sentence has a convincing reason

If both sentences contain reasons, the worker has to compare the two reasons and choose the sentence with the more convincing reason.

Each HIT was posted to three different workers to make it possible to assess annotator agreement. Every worker can process each HIT only once so that the three assignments are always done by three different people.

Based on the worker annotations, we compute a gold standard score for each sentence. This score is simply the number of times it was rated better than its competitor. The score can be 0, 1, 2 or 3. HITs for which the worker chooses the option "Neither sentence has a convincing reason" are ignored when computing sentence scores. The sentence with the higher score is then selected as the best supporting sentence for the corresponding review.

In cases of ties, we posted the sentence pair one more time for one worker. If one of the two sentences has a higher score after this reposting, we choose it as the winner. Otherwise we label this sentence pair "no decision" or "N-D".

5.2 Quality of AMT Annotations

Since our crowdsourcing-based evaluation is novel, it is important to investigate whether human annotators perform the annotation consistently and reproducibly. The Fleiss' κ agreement score for the final experiment is 0.17. AMT workers only have the instructions given by the requesters. If these are not clear enough or are too complicated, workers can misunderstand the task, which decreases the quality of the answers. There are also AMT workers who spam and give random answers to tasks. Moreover, ranking sentences according to the quality of the given reason is a subjective task. Even if a sentence contains a reason, it might not be convincing for the worker.

To ensure a high level of quality for our dataset, we took some precautions. To force workers to actually read the sentences and not just click a few boxes, we randomly marked one word of each sentence blue. The worker had to type the word of their preferred sentence into the corresponding answer field, or NOTCONV into the special field if neither sentence was convincing. Figure 1 shows our AMT interface design.

For each answer field we have a gold standard (the words we marked blue and the word NOTCONV) which enables us to look for spam. The analysis showed that some workers mistyped some words, which however only indicates that the worker actually typed the word instead of copying it from the task. Some workers submitted inconsistent answers; for instance, they typed a random word or filled out all three answer fields. In such cases we reposted the HIT again to receive a correct answer.

After the task, we counted how often a worker said that neither sentence is convincing, since a high number indicates that the worker might have only copied the word for several sentence pairs without checking the content of the sentences. We also analyzed the time a worker needed for every HIT. Since no task was done in less than 10 seconds, the possibility of just copying the word was rather low.

6 Results and discussion

The results of the AMT experiment are shown in Table 3. As described above, each of the 1380 sentence pairs was evaluated by three workers. Workers rated the system sentence as better for 57.9% of the reviews, and the baseline sentence as better for 27.4% of the reviews; for 14.7% of the reviews, the scores of the two sentences were tied (line 1 of Table 3). The 203 reviews in this category were reposted one more time (as described in Section 5). The responses were almost perfectly evenly split: about 47% of workers preferred the baseline sentence, 46% the system sentence; 7.4% of the responses were undecided (line 2). Line 3 presents the consolidated results where the 14.7% ties on line 1 are replaced by the ratings obtained on line 2 in the second pass.

Experiment             # Docs   BL     SY     N-D    B=S
1  AMT, first pass      1380    27.4   57.9   14.7    -
2  AMT, second pass      203    46.8   45.8    7.4    -
3  AMT, final           1380    34.3   64.6    1.1    -
4  AMT+[B=S]            1744    27.1   51.1    0.9   20.9

Table 3: AMT evaluation results. Numbers are percentages or counts. BL = baseline, SY = system, N-D = no decision, B=S = same sentence selected by baseline and system

The consolidated results (line 3) show that our system is clearly superior to the baseline of selecting the sentence with the strongest sentiment. Our system selected a better supporting sentence for 64.6% of the reviews; the baseline selected a better sentence for 34.3% of the reviews. These results exclude the reviews where the baseline and the system selected the same sentence. If we assume that these sentences are also acceptable sentences (since they score well on the traditional sentiment metrics as well as on our new content keyword metric), then our system finds a good supporting sentence for 72.0% of the reviews (51.1 + 20.9) whereas the baseline does so for only 48.0% (27.1 + 20.9).

6.1 Error Analysis

Our error analysis revealed that a significant proportion of system sentences that were worse than baseline sentences did contain a reason. However, the baseline sentence also contained a reason and was rated better by AMT annotators. Examples (1) and (2) show two such cases. The first sentence is the baseline sentence (BL), which was rated better. The system sentence (SY) contains a similar or different reason. Since rating reasons is a very subjective task, it is impossible to define which of these two sentences contains the better reason; it depends on how the workers think about it.

(1) BL: The best thing is that everything is just so easily displayed and one doesn't need a manual to start getting the work done.
    SY: The zoom is incredible, the video was so clear that I actually thought of making a 15 min movie.

(2) BL: The colors are horrible, indoor shots are horrible, and too much noise.
    SY: Who cares about 8 mega pixels and 1600 iso when it takes such bad quality pictures.

In example (3) the system sentence is an incomplete sentence consisting of only two noun phrases. These cut-off sentences are mainly caused by incorrect usage of grammar and punctuation by the reviewers, which results in wrongly determined sentence boundaries in the preprocessing step.

(3) BL: Gives peace of mind to have it fit perfectly.
    SY: battery and SD card.

In some cases, the two sentences that were presented to the worker in the evaluation had a different polarity. This can have two reasons: (i) due to noisy training input, the classifier misclassified some of the sentences, and (ii) for short reviews we also used sentences with the non-conforming polarity. Sentences with different polarity often confused the workers, and they tended to prefer the positive sentence even if the negative one contained a more convincing reason, as can be seen in example (4).

(4) BL: It shares same basic commands and setup, so the learning curve was minimal.
    SY: I was not blown away by the image quality, and as others have mentioned, the flash really is weak.

A general problem with our approach is that the weighting function favors sentences with many noun phrases. The system sentence in example (5) contains many noun phrases, including some highly frequent nouns (e.g., "lens", "battery"), but there is no convincing reason, and the baseline sentence has been selected by the workers.

Finally, there are a number of cases where our assumption that good supporting sentences contain keyphrases is incorrect. For example, sentence (6) does not contain any keyphrases indicative of good reasons. The information that makes it a good supporting sentence is mainly expressed using verbs and particles.

(6) I have had an occasional problem with the camera not booting up and telling me to turn it off and then on again.

7 Conclusion

In this work, we presented a system that extracts supporting sentences: single-sentence summaries of a document that contain a convincing reason for the author's opinion about a product. We used an unsupervised approach that extracts keyphrases of the given domain and then weights these keyphrases to identify supporting sentences. We used a novel comparative evaluation methodology with the crowdsourcing framework Amazon Mechanical Turk to evaluate this novel task since no gold standard is available. We showed that our keyphrase-based system performs better than a baseline of extracting the sentence with the highest sentiment score.

8 Future work

Our method failed to find a convincing reason for about 35% of the reviews, in part because of the noisiness of reviews. Reviews are user-generated content, contain grammatically incorrect sentences, and are full of typographical errors. This makes it hard to perform preprocessing steps like part-of-speech tagging and sentence boundary detection correctly and reliably. We plan to address these problems in future work by developing a more robust processing pipeline.
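As a worked illustration of the gold-standard scoring and tie-breaking rules from Section 5.1 (a sketch under the stated rules, not the actual evaluation code; the helper name `decide` is invented here): three workers vote on each pair, NOTCONV votes are ignored when counting, and a tie triggers one extra judgment before a pair is labeled "N-D".

```python
# Sketch of the Section 5.1 aggregation: three workers per HIT answer
# "s1", "s2", or "NOTCONV"; a sentence's score is the number of times it
# was rated better (0-3); ties get one repost vote, otherwise "N-D".
# The function name is an invented illustration, not the authors' code.

def decide(votes, repost_vote=None):
    """votes: the three worker answers; returns 's1', 's2', or 'N-D'."""
    s1 = votes.count("s1")           # NOTCONV votes are simply ignored
    s2 = votes.count("s2")
    if s1 != s2:
        return "s1" if s1 > s2 else "s2"
    if repost_vote in ("s1", "s2"):  # one extra judgment breaks the tie
        return repost_vote
    return "N-D"                     # still tied: no decision
```

For example, votes of ("s1", "s2", "NOTCONV") are a 1-1 tie, so the pair is reposted once; only if that extra vote also fails to separate the two sentences is the pair counted in the N-D column of Table 3.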
(5) BL: I have owned my cd300 for about 3 weeks and have already taken 700 plus pictures.
    SY: It has something to do with the lens because the manual says it only happens to the 300 and when I called Sony tech support the guy tried to tell me the battery was faulty and it wasn't.

Acknowledgments

This work was supported by Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 732, Project D7) and in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.
In Proceedings of the Tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 168–177, New York, NY, USA. ACM. Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In WSDM ’08: Proceedings of the international conference on Web search and web data mining, pages 219–230, New York, NY, USA. ACM. Soo-Min Kim and Eduard Hovy. 2006. Automatic identification of pro and con reasons in online re- views. In Proceedings of the COLING/ACL on Main conference poster sessions, COLING-ACL ’06, pages 483–490, Stroudsburg, PA, USA. Asso- ciation for Computational Linguistics. Bing Liu. 2010. Sentiment analysis and subjectivity. Handbook of Natural Language Processing, 2nd ed. Christopher Manning and Dan Klein. 2003. Opti- mization, maxent models, and conditional estima- tion without magic. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Hu- man Language Technology: Tutorials - Volume 5, NAACL-Tutorials ’03, pages 8–8, Stroudsburg, PA, USA. Association for Computational Linguistics. Ani Nenkova and Kathleen McKeown. 2011. Auto- matic summarization. Foundations and Trends in Information Retrieval, 5(2-3):103–233. Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in In- formation Retrieval, 2(1-2):1–135. Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the 285 Bootstrapped Training of Event Extraction Classifiers Ruihong Huang and Ellen Riloff School of Computing University of Utah Salt Lake City, UT 84112 {huangrh,riloff}@cs.utah.edu Abstract 2002; Maslennikov and Chua, 2007)). How- ever, manually generating answer keys for event Most event extraction systems are trained extraction is time-consuming and tedious. And with supervised learning and rely on a col- more importantly, event extraction annotations lection of annotated documents. 
Due to are highly domain-specific, so new annotations the domain-specificity of this task, event extraction systems must be retrained with must be obtained for each domain. new annotated data for each domain. In The goal of our research is to use bootstrap- this paper, we propose a bootstrapping so- ping techniques to automatically train a state-of- lution for event role filler extraction that re- the-art event extraction system without human- quires minimal human supervision. We aim generated answer key templates. The focus of our to rapidly train a state-of-the-art event ex- traction system using a small set of “seed work is the TIER event extraction model, which nouns” for each event role, a collection is a multi-layered architecture for event extrac- of relevant (in-domain) and irrelevant (out- tion (Huang and Riloff, 2011). TIER’s innova- of-domain) texts, and a semantic dictio- tion over previous techniques is the use of four nary. The experimental results show that different classifiers that analyze a document at in- the bootstrapped system outperforms previ- creasing levels of granularity. TIER progressively ous weakly supervised event extraction sys- zooms in on event information using a pipeline tems on the MUC-4 data set, and achieves of classifiers that perform document-level classi- performance levels comparable to super- vised training with 700 manually annotated fication, sentence classification, and noun phrase documents. classification. TIER outperformed previous event extraction systems on the MUC-4 data set, but re- lied heavily on a large collection of 1,300 docu- 1 Introduction ments coupled with answer key templates to train Event extraction systems process stories about its four classifiers. domain-relevant events and identify the role fillers In this paper, we present a bootstrapping solu- of each event. 
A key challenge for event extrac- tion that exploits a large unannotated corpus for tion is that recognizing role fillers is inherently training by using role-identifying nouns (Phillips contextual. For example, a PERSON can be a and Riloff, 2007) as seed terms. Phillips and perpetrator or a victim in different contexts (e.g., Riloff observed that some nouns, by definition, “John Smith assassinated the mayor” vs. “John refer to entities or objects that play a specific role Smith was assassinated”). Similarly, any COM - in an event. For example, “assassin”, “sniper”, PANY can be an acquirer or an acquiree depending and “hitman” refer to people who play the role on the context. of PERPETRATOR in a criminal event. Similarly, Many supervised learning techniques have “victim”, “casualty”, and “fatality” refer to peo- been used to create event extraction systems us- ple who play the role of VICTIM, by virtue of ing gold standard “answer key” event templates their lexical semantics. Phillips and Riloff called for training (e.g., (Freitag, 1998a; Chieu and Ng, these words role-identifying nouns and used them 286 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 286–295, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics to learn extraction patterns. Our research also and Riloff, 2009)). Other systems take a more uses role-identifying nouns to learn extraction pat- global view and consider discourse properties of terns, but the role-identifying nouns and patterns the document as a whole to improve performance are then used to create training data for event ex- (e.g., (Maslennikov and Chua, 2007; Ji and Gr- traction classifiers. Each classifier is then self- ishman, 2008; Liao and Grishman, 2010; Huang trained in a bootstrapping loop. and Riloff, 2011)). 
Currently, the learning-based Our weakly supervised training procedure re- event extraction systems that perform best all use quires a small set of “seed nouns” for each event supervised learning techniques that require a large role, and a collection of relevant (in-domain) and number of texts coupled with manually-generated irrelevant (out-of-domain) texts. No answer key annotations or answer key templates. templates or annotated texts are needed. The seed A variety of techniques have been explored nouns are used to automatically generate a set for weakly supervised training of event extrac- of role-identifying patterns, and then the nouns, tion systems, primarily in the realm of pattern or patterns, and a semantic dictionary are used to rule-based approaches (e.g., (Riloff, 1996; Riloff label training instances. We also propagate the and Jones, 1999; Yangarber et al., 2000; Sudo et event role labels across coreferent noun phrases al., 2003; Stevenson and Greenwood, 2005)). In within a document to produce additional train- some of these approaches, a human must man- ing instances. The automatically labeled texts are ually review and “clean” the learned patterns to used to train three components of TIER: its two obtain good performance. Research has also been types of sentence classifiers and its noun phrase done to learn extraction patterns in an unsuper- classifiers. To create TIER’s fourth component, vised way (e.g., (Shinyama and Sekine, 2006; its document genre classifier, we apply heuristics Sekine, 2006)). But these efforts target open do- to the output of the sentence classifiers. main information extraction. To extract domain- We present experimental results on the MUC- specific event information, domain experts are 4 data set, which is a standard benchmark for needed to select the pattern subsets to use. event extraction research. 
Our results show that There have also been weakly supervised ap- the bootstrapped system, TIERlite , outperforms proaches that use more than just local context. previous weakly supervised event extraction sys- (Patwardhan and Riloff, 2007) uses a semantic tems and achieves performance levels comparable affinity measure to learn primary and secondary to supervised training with 700 manually anno- patterns, and the secondary patterns are applied tated documents. only to event sentences. The event sentence clas- sifier is self-trained using seed patterns. Most 2 Related Work recently, (Chambers and Jurafsky, 2011) acquire Event extraction techniques have largely focused event words from an external resource, group the on detecting event “triggers” with their arguments event words to form event scenarios, and group for extracting role fillers. Classical methods are extraction patterns for different event roles. How- either pattern-based (Kim and Moldovan, 1993; ever, these weakly supervised systems produce Riloff, 1993; Soderland et al., 1995; Huffman, substantially lower performance than the best su- 1996; Freitag, 1998b; Ciravegna, 2001; Califf and pervised systems. Mooney, 2003; Riloff, 1996; Riloff and Jones, 3 Overview of TIER 1999; Yangarber et al., 2000; Sudo et al., 2003; Stevenson and Greenwood, 2005) or classifier- The goal of our research is to develop a weakly based (e.g., (Freitag, 1998a; Chieu and Ng, 2002; supervised training process that can successfully Finn and Kushmerick, 2004; Li et al., 2005; Yu et train a state-of-the-art event extraction system for al., 2005)). a new domain with minimal human input. We de- Recently, several approaches have been pro- cided to focus our efforts on the TIER event ex- posed to address the insufficiency of using only traction model because it recently produced bet- local context to identify role fillers. 
Some ap- ter performance on the MUC-4 data set than prior proaches look at the broader sentential context learning-based event extraction systems (Huang around a potential role filler when making a de- and Riloff, 2011). In this section, we briefly give cision (e.g., (Gu and Cercone, 2006; Patwardhan an overview of TIER’s architecture and its com- 287 and time-consuming. Furthermore, answer key templates for one domain are virtually never reusable for different domains, so a new set of answer keys must be produced from scratch for each domain. In the next section, we present our weakly supervised approach for training TIER’s Figure 1: TIER Overview event extraction classifiers. ponents. 4 Bootstrapped Training of Event TIER is a multi-layered architecture for event Extraction Classifiers extraction, as shown in Figure 1. Documents pass We adopt a two-phase approach to train TIER’s through a pipeline where they are analyzed at dif- event extraction modules using minimal human- ferent levels of granularity, which enables the sys- generated resources. The goal of the first phase tem to gradually “zoom in” on relevant facts. The is to automatically generate positive training ex- pipeline consists of a document genre classifier, amples using role-identifying seed nouns as input. two types of sentence classifiers, and a set of noun phrase (role filler) classifiers. The seed nouns are used to automatically gener- ate a set of role-identifying patterns for each event The lower pathway in Figure 1 shows that all role. Each set of patterns is then assigned a set documents pass through an event sentence clas- of semantic constraints (selectional restrictions) sifier. Sentences labeled as event descriptions that are appropriate for that event role. The se- then proceed to the noun phrase classifiers, which mantic constraints consist of the role-identifying are responsible for identifying the role fillers in seed nouns as well as general semantic classes each sentence. 
The upper pathway in Figure 1 in- that constrain the event role (e.g., a victim must volves a document genre classifier to determine be a HUMAN). A noun phrase will satisfy the se- whether a document is an “event narrative” story mantic constraints if its head noun is in the seed (i.e., an article that primarily discusses the details noun list or if it has the appropriate semantic type of a domain-relevant event). Documents that are (based on dictionary lookup). Each pattern is then classified as event narratives warrant additional matched against the unannotated texts, and if the scrutiny because they most likely contain a lot of extracted noun phrase satisfies its semantic con- event information. Event narrative stories are pro- straints, then the noun phrase is automatically la- cessed by an additional set of role-specific sen- beled as a role filler. tence classifiers that look for role-specific con- texts that will not necessarily mention the event. The second phase involves bootstrapped train- For example, a victim may be mentioned in a sen- ing of TIER’s classifiers. Using the labeled in- tence that describes the aftermath of a crime, such stances generated in the first phase, we iteratively as transportation to a hospital or the identifica- train three of TIER’s components: the two types tion of a body. Sentences that are determined to of sentential classifiers and the noun phrase clas- have “role-specific” contexts are passed along to sifiers. For the fourth component, the document the noun phrase classifiers for role filler extrac- classifier, we apply heuristics to the output of the tion. Consequently, event narrative documents sentence classifiers to assess the density of rel- pass through both the lower pathway and the up- evant sentences in a document and label high- per pathway. This approach creates an event ex- density stories as event narratives. 
In the fol- traction system that can discover role fillers in a lowing sections, we present the details of each of variety of different contexts by considering the these steps. type of document being processed. 4.1 Automatically Labeling Training Data TIER was originally trained with supervised learning using 1,300 texts and their corresponding Finding seeding instances of high precision and answer key templates from the MUC-4 data set reasonable coverage is important in bootstrap- (MUC-4 Proceedings, 1992). Human-generated ping. However, this is especially challenging answer key templates are expensive to produce for event extraction task because identifying role because the annotation process is both difficult fillers is inherently contextual. Furthermore, role 288 patterns automatically generated from unanno- tated texts to assess the similarity of nouns. First, Basilisk assigns a score to each pattern based on the number of seed words that co-occur with it. Basilisk then collects the noun phrases extracted by the highest-scoring patterns. Next, the head noun of each noun phrase is assigned a score Figure 2: Using Basilisk to Induce Role-Identifying based on the set of patterns that it co-occurred Patterns with. Finally, Basilisk selects the highest-scoring nouns, automatically labels them with the seman- fillers occur sparsely in text and in diverse con- tic class of the seeds, adds these nouns to the lex- texts. icon, and restarts the learning process in a boot- In this section, we explain how we gener- strapping fashion. ate role-identifying patterns automatically using For our work, we give Basilisk role-identifying seed nouns, and we discuss why we add seman- seed nouns for each event role. We run the boot- tic constraints to the patterns when producing la- strapping process for 20 iterations and then har- beled instances for training. 
Then, we discuss the vest the 40 best patterns that Basilisk identifies coreference-based label propagation that we used for each event role. We also tried using the addi- to obtain additional training instances. Finally, we tional role-identifying nouns learned by Basilisk, give examples to illustrate how we create training but found that these nouns were too noisy. instances. 4.1.2 Using the Patterns to Label NPs 4.1.1 Inducing Role-Identifying Patterns The induced role-identifying patterns can be The input to our system is a small set of matched against the unannotated texts to produce manually-defined seed nouns for each event role. labeled instances. However, relying solely on the Specifically, the user is required to provide pattern contexts can be misleading. For example, 10 role-identifying nouns for each event role. the pattern context <subject> caused damage (Phillips and Riloff, 2007) defined a noun as be- will extract some noun phrases that are weapons ing “role-identifying” if its lexical semantics re- (e.g., the bomb) but some noun phrases that are veal the role of the entity/object in an event. For not (e.g., the tsunami). example, the words “assassin” and “sniper” are Based on this observation, we add selectional people who participate in a violent event as a PER - restrictions to each pattern that requires a noun PETRATOR . Therefore, the entities referred to by phrase to satisfy certain semantic constraints in role-identifying nouns are probable role fillers. order to be extracted and labeled as a positive However, treating every context surrounding a instances for an event role. The selectional re- role-identifying noun as a role-identifying pattern strictions are satisfied if the head noun is among is risky. The reason is that many instances of role- the role-identifying seed nouns or if the semantic identifying nouns appear in contexts that do not class of the head noun is compatible with the cor- describe the event. 
But, if one pattern has been responding event role. In the previous example, seen to extract many role-identifying nouns and tsunami will not be extracted as a weapon because seldomly seen to extract other nouns, then the pat- it has an incompatible semantic class (EVENT), tern likely represents an event context. but bomb will be extracted because it has a com- As (Phillips and Riloff, 2007) did, we use patible semantic class (WEAPON). Basilisk to learn patterns for each event role. We use the semantic class labels assigned by Basilisk was originally designed for semantic the Sundance parser (Riloff and Phillips, 2004) in class learning (e.g., to learn nouns belonging to our experiments. Sundance looks up each noun semantic categories, such as building or human). in a semantic dictionary to assign the semantic As shown in Figure 2, beginning with a small set class labels. As an alternative, general resources of seed nouns for each semantic class, Basilisk (e.g., WordNet (Miller, 1990)) or a semantic tag- learns additional nouns belonging to the same se- ger (e.g., (Huang and Riloff, 2010)) could be mantic class. Internally, Basilisk uses extraction used. 289 propagate the perpetrator label from noun phrase men = Human terrorists was killed by <np> #1 to noun phrase #3. assassins <subject> attacked building = Object snipers <subject> fired shots ... ... ... 4.2 Creating TIERlite with Bootstrapping Semantic Role−Identifying Role−Identifying In this section, we explain how the labeled in- Dictionary Noun Patterns stances are used to train TIER’s classifiers with Constraints Constraints bootstrapping. In addition to the automatically labeled instances, the training process depends John Smith was killed by two armed men on a text corpus that consists of both relevant 1 in broad daylight this morning. (in-domain) and irrelevant (out-of-domain) doc- The assassins 2 attacked the mayor as he uments. 
Positive instances are generated from left his house to go to work about 8:00 am. Police arrested the unidentified men the relevant documents and negative instances are 3 an hour later. generated by randomly sampling from the irrele- vant documents. Figure 3: Automatic Training Data Creation The classifiers are all support vector machines (SVMs), implemented using the SVMlin software (Keerthi and DeCoste, 2005). When applying the 4.1.3 Propagating Labels with Coreference classifiers during bootstrapping, we use a sliding To enrich the automatically labeled training in- confidence threshold to determine which labels stances, we also propagate the event role labels are reliable based on the values produced by the across coreferent noun phrases within a docu- SVM. Initially, we set the threshold to be 2.0 to ment. The observation is that once a noun phrase identify highly confident predictions. But if fewer has been identified as a role filler, its corefer- than k instances pass the threshold, then we slide ent mentions in the same document likely fill the the threshold down in decrements of 0.1 until we same event role since they are referring to the obtain at least k labeled instances or the thresh- same real world entity. old drops below 0, in which case bootstrapping To leverage these coreferential contexts, we ends. We used k=10 for both sentence classifiers employ a simple head noun matching heuristic to and k=30 for the noun phrase classifiers. identify coreferent noun phrases. This heuristic The following sections present the details of the assumes that two noun phrases that have the same bootstrapped training process for each of TIER’s head noun are coreferential. We considered us- components. ing an off-the-shelf coreference resolver, but de- cided that the head noun matching heuristic would likely produce higher precision results, which is important to produce high-quality labeled data. 
4.1.4 Examples of Training Instance Creation

Figure 3 illustrates how we label training instances automatically. The text example shows three noun phrases that are automatically labeled as perpetrators. Noun phrases #1 and #2 occur in role-identifying pattern contexts ("was killed by <np>" and "<subject> attacked") and satisfy the semantic constraints for perpetrators, because "men" has a compatible semantic type and "assassins" is a role-identifying noun for perpetrators. Noun phrase #3 ("the unidentified men") does not occur in a pattern context, but it is deemed to be coreferent with "two armed men" because they have the same head noun. Consequently, we propagate the perpetrator label from noun phrase #1 to noun phrase #3.

[Figure 4: The Bootstrapping Process]

4.2.1 Noun Phrase Classifiers

The mission of the noun phrase classifiers is to determine whether a noun phrase is a plausible event role filler based on the local features surrounding the noun phrase (NP). A set of classifiers is needed, one for each event role.

As shown in Figure 4, to seed the classifier training, the positive noun phrase instances are generated from the relevant documents following Section 4.1. The negative noun phrase instances are drawn randomly from the irrelevant documents. Considering the sparsity of role fillers in texts, we set the negative:positive ratio to be 10:1. Once the classifier is trained, it is applied to the unlabeled noun phrases in the relevant documents. Noun phrases that are assigned role filler labels by the classifier with high confidence (using the sliding threshold) are added to the set of positive instances. New negative instances are drawn randomly from the irrelevant documents to maintain the 10:1 (negative:positive) ratio.

We extract features from each noun phrase (NP) and its surrounding context. The features include the NP head noun and its premodifiers. We also use the Stanford NER tagger (Finkel et al., 2005) to identify Named Entities within the NP. The context features include four words to the left of the NP, four words to the right of the NP, and the lexico-syntactic patterns generated by AutoSlog to capture expressions around the NP (see (Riloff, 1993) for details).

4.2.2 Event Sentence Classifier

The event sentence classifier is responsible for identifying sentences that describe a relevant event. Similar to the noun phrase classifier training, positive training instances are selected from the relevant documents and negative instances are drawn from the irrelevant documents. All sentences in the relevant documents that contain one or more labeled noun phrases (belonging to any event role) are labeled as positive training instances. We randomly sample sentences from the irrelevant documents to obtain a negative:positive training instance ratio of 10:1. The bootstrapping process is then identical to that of the noun phrase classifiers. The feature set for this classifier consists of unigrams, bigrams, and AutoSlog's lexico-syntactic patterns surrounding all noun phrases in the sentence.

4.2.3 Role-Specific Sentence Classifiers

The role-specific sentence classifiers are trained to identify the contexts specific to each event role. All sentences in the relevant documents that contain at least one labeled noun phrase for the appropriate event role are used as positive instances. Negative instances are randomly sampled from the irrelevant documents to maintain the negative:positive ratio of 10:1. The bootstrapping process and feature set are the same as for the event sentence classifier.

The difference between the two types of sentence classifiers is that the event sentence classifier uses positive instances from all event roles, while each role-specific sentence classifier only uses the positive instances for one particular event role. The rationale is similar to the supervised setting (Huang and Riloff, 2011): the event sentence classifier is expected to generalize over all event roles to identify event mention contexts, while the role-specific sentence classifiers are expected to learn to identify contexts specific to individual roles.

4.2.4 Event Narrative Document Classifier

TIER also uses an event narrative document classifier and only extracts information from role-specific sentences within event narrative documents. In the supervised setting, TIER uses heuristic rules derived from answer key templates to identify the event narrative documents in the training set, which are used to train an event narrative document classifier. The heuristic rules require that an event narrative have a high density of relevant information and tend to mention the relevant information within the first several sentences.

In our weakly supervised setting, we use the information density heuristic directly instead of training an event narrative classifier. We approximate the relevant information density heuristic by computing the ratio of relevant sentences (both event sentences and role-specific sentences) out of all the sentences in a document. Thus, the event narrative labeller relies only on the output of the two sentence classifiers. Specifically, we label a document as an event narrative if ≥ 50% of the sentences in the document are relevant (i.e., labeled positively by either sentence classifier).

5 Evaluation

In this section, we evaluate our bootstrapped system, TIERlite, on the MUC-4 event extraction data set. First, we describe the IE task, the data set, and the weakly supervised baseline systems that we use for comparison. Then we present the results of our fully bootstrapped system TIERlite, the weakly supervised baseline systems, and two fully supervised event extraction systems, TIER and GLACIER.
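Before turning to the evaluation, the sliding confidence threshold used throughout bootstrapping (Section 4.2) can be sketched as follows. This is an illustrative reconstruction; the function name and return convention are our assumptions.

```python
def select_confident(decision_values, k, start=2.0, step=0.1):
    """Select instances whose SVM decision value clears a sliding threshold.

    Start at a high threshold (2.0); if fewer than k instances qualify,
    lower the threshold by 0.1 until at least k instances pass or the
    threshold drops below 0, which signals that bootstrapping should end.
    Returns (selected_indices, threshold), or (None, None) to stop.
    """
    threshold = start
    while threshold >= 0:
        selected = [i for i, v in enumerate(decision_values) if v >= threshold]
        if len(selected) >= k:
            return selected, threshold
        threshold = round(threshold - step, 10)  # round to avoid float drift
    return None, None  # too few confident instances even at 0: end bootstrapping
```

In the paper's setting, k=10 for the sentence classifiers and k=30 for the noun phrase classifiers.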
In addition, we analyze the performance of TIERlite under different configurations to assess the impact of its components.

5.1 IE Task and Data

We evaluated the performance of our systems on the MUC-4 terrorism IE task (MUC-4 Proceedings, 1992) about Latin American terrorist events. We used 1,300 texts (DEV) as our training set and 200 texts (TST3+TST4) as the test set. All the documents have answer key templates. For the training set, we used the answer keys to separate the documents into relevant and irrelevant subsets. Any document containing at least one relevant event was considered to be relevant.

Following previous studies, we evaluate our system on five MUC-4 string event roles: perpetrator individuals (PerpInd), perpetrator organizations (PerpOrg), physical targets, victims, and weapons. Table 1 shows the distribution of role fillers in the MUC-4 test set.

PerpInd  PerpOrg  Target  Victim  Weapon
129      74       126     201     58

Table 1: # of Role Fillers in the MUC-4 Test Set

The complete IE task involves the creation of answer key templates, one template per event.[1] Our work focuses on extracting individual role fillers and not template generation, so we evaluate the accuracy of the role fillers irrespective of which template they occur in.

We used the same head noun scoring scheme as previous systems, where an extraction is correct if its head noun matches the head noun in the answer key.[2] Pronouns were discarded from both the system responses and the answer keys since no coreference resolution is done. Duplicate extractions were conflated before being scored, so they count as just one hit or one miss.

[1] Documents may contain multiple events per article.
[2] For example, "armed men" will match "5 armed men".

5.2 Weakly Supervised Baselines

We compared the performance of our system with three previous weakly supervised event extraction systems.

AutoSlog-TS (Riloff, 1996) generates lexico-syntactic patterns exhaustively from unannotated texts and ranks them based on their frequency and probability of occurring in relevant documents. A human expert then examines the patterns and manually selects the best patterns for each event role. During testing, the patterns are matched against unseen texts to extract event role fillers.

PIPER (Patwardhan and Riloff, 2007; Patwardhan, 2010) learns extraction patterns using a semantic affinity measure; it distinguishes between primary and secondary patterns and applies them selectively.

(Chambers and Jurafsky, 2011) (C+J) created an event extraction system by acquiring event words from WordNet (Miller, 1990), clustering the event words into different event scenarios, and grouping extraction patterns for different event roles.

5.3 Performance of TIERlite

Table 2 shows the seed nouns that we used in our experiments, which were generated by sorting the nouns in the corpus by frequency and manually identifying the first 10 role-identifying nouns for each event role.[3] Table 3 shows the number of training instances (noun phrases) that were automatically labeled for each event role using our training data creation approach (Section 4.1).

[3] We only found 9 weapon terms among the high-frequency terms.

Event Role               | Seed Nouns
Perpetrator Individual   | terrorists, assassins, criminals, rebels, murderers, death squads, guerrillas, member, members, individuals
Perpetrator Organization | FMLN, ELN, FARC, MRTA, M-19, Front, Shining Path, Medellin Cartel, The Extraditables, Army of National Liberation
Target                   | houses, residence, building, home, homes, offices, pipeline, hotel, car, vehicles
Victim                   | victims, civilians, children, jesuits, Galan, priests, students, women, peasants, Romero
Weapon                   | weapons, bomb, bombs, explosives, rifles, dynamite, grenades, device, car bomb

Table 2: Role-Identifying Seed Nouns

PerpInd  PerpOrg  Target  Victim  Weapon
296      157      522     798     248

Table 3: # of Automatically Labeled NPs

Table 4 shows how our bootstrapped system TIERlite compares with previous weakly supervised systems and two supervised systems: its supervised counterpart TIER (Huang and Riloff, 2011) and a model that jointly considers local and sentential contexts, GLACIER (Patwardhan and Riloff, 2009).

Weakly Supervised Baselines  PerpInd   PerpOrg   Target    Victim    Weapon    Average
AutoSlog-TS (1996)           33/49/40  52/33/41  54/59/56  49/54/51  38/44/41  45/48/46
PIPER_Best (2007)            39/48/43  55/31/40  37/60/46  44/46/45  47/47/47  44/46/45
C+J (2011)                   -         -         -         -         -         44/36/40
Supervised Models
GLACIER (2009)               51/58/54  34/45/38  43/72/53  55/58/56  57/53/55  48/57/52
TIER (2011)                  48/57/52  46/53/50  51/73/60  56/60/58  53/64/58  51/62/56
Weakly Supervised Models
TIERlite                     47/51/49  60/39/47  37/65/47  39/53/45  53/55/54  47/53/50

Table 4: Performance of the Bootstrapped Event Extraction System (Precision/Recall/F-score)

We see that TIERlite outperforms all three weakly supervised systems, with slightly higher precision and substantially more recall. When compared to the supervised systems, the performance of TIERlite is similar to GLACIER, with comparable precision but slightly lower recall. But the supervised TIER system, which was trained with 1,300 annotated documents, is still superior, especially in recall.

Figure 5 shows the learning curve for TIER when it is trained with fewer documents, ranging from 100 to 1,300 in increments of 100. Each data point represents five experiments where we randomly selected k documents from the training set and averaged the results. The bars show the range of results across the five runs. Figure 5 shows that TIER's performance increases from an F score of 34 when trained on just 100 documents up to an F score of 56 when trained on 1,300 documents. The circle shows the performance of our bootstrapped system, TIERlite, which achieves an F score comparable to supervised training with about 700 manually annotated documents.

[Figure 5: The Learning Curve of Supervised TIER (IE performance (F1) vs. # of training documents)]

5.4 Analysis

Table 6 shows the effect of the coreference propagation step described in Section 4.1.3 as part of training data creation. Without this step, the bootstrapped system yields an F score of 41. With the benefit of the additional training instances produced by coreference propagation, the system yields an F score of 50.

Seeding  | P/R/F
wo/Coref | 45/38/41
w/Coref  | 47/53/50

Table 6: Effects of Coreference Propagation

The new instances produced by coreference propagation seem to substantially enrich the diversity of the set of labeled instances.

In the evaluation section, we saw that the supervised event extraction systems achieve higher recall than the weakly supervised systems. Although our bootstrapped event extraction system TIERlite produces higher recall than previous weakly supervised systems, a substantial recall gap still exists.

Considering the pipeline structure of the event extraction system, as shown in Figure 1, the noun phrase extractors are responsible for identifying all candidate role fillers. The sentential classifiers and the document classifier effectively serve as filters to rule out candidates from irrelevant contexts. Consequently, there is no way to recover missing recall (role fillers) if the noun phrase extractors fail to identify them.

Since the noun phrase classifiers are so central to the performance of the system, we compared the performance of the bootstrapped noun phrase classifiers directly with their supervised counterparts. The results are shown in Table 5.
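The head noun scoring scheme from Section 5.1, which underlies all of these comparisons, can be sketched as below. This is an illustrative reconstruction; the last-token head approximation and the function name are our assumptions, not the paper's code.

```python
def score_extractions(responses, answer_keys):
    """Score extractions by head noun match (Section 5.1).

    An extraction is correct if its head noun matches a head noun in
    the answer key; duplicate extractions are conflated first so each
    head counts as one hit or one miss.
    """
    def head(np):
        return np.lower().split()[-1]  # simplification: head = last token

    response_heads = {head(np) for np in responses}  # conflate duplicates
    key_heads = {head(np) for np in answer_keys}
    correct = response_heads & key_heads
    precision = len(correct) / len(response_heads) if response_heads else 0.0
    recall = len(correct) / len(key_heads) if key_heads else 0.0
    return precision, recall
```

For example, "5 armed men" matches the answer key entry "armed men", and extracting it twice still counts as a single hit.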
                        PerpInd   PerpOrg   Target    Victim    Weapon    Average
Supervised Classifier   25/67/36  26/78/39  34/83/49  32/72/45  30/75/43  30/75/42
Bootstrapped Classifier 30/54/39  37/53/44  30/71/42  28/63/39  36/57/44  32/60/42

Table 5: Evaluation of Bootstrapped Noun Phrase Classifiers (Precision/Recall/F-score)

Both sets of classifiers produce low precision when used in isolation, but their precision levels are comparable. The TIER pipeline architecture is successful at eliminating many of the false hits. However, the recall of the bootstrapped classifiers is consistently lower than the recall of the supervised classifiers. Specifically, the recall is about 10 points lower for three event roles (PerpInd, Target, and Victim) and 20 points lower for the other two event roles (PerpOrg and Weapon). These results suggest that our bootstrapping approach to training instance creation does not fully capture the diversity of role filler contexts that are available in the supervised training set of 1,300 documents. This issue is an interesting direction for future work.

6 Conclusions

We have presented a bootstrapping approach for training a multi-layered event extraction model using a small set of "seed nouns" for each event role, a collection of relevant (in-domain) and irrelevant (out-of-domain) texts, and a semantic dictionary. The experimental results show that the bootstrapped system, TIERlite, outperforms previous weakly supervised event extraction systems on a standard event extraction data set, and achieves performance levels comparable to supervised training with 700 manually annotated documents. The minimal supervision required to train such a model increases the portability of event extraction systems.

7 Acknowledgments

We gratefully acknowledge the support of the National Science Foundation under grant IIS-1018314 and the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0172. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the U.S. government.

References

M.E. Califf and R. Mooney. 2003. Bottom-up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4:177–210.
Nathanael Chambers and Dan Jurafsky. 2011. Template-Based Information Extraction without the Templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-11).
H.L. Chieu and H.T. Ng. 2002. A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. In Proceedings of the 18th National Conference on Artificial Intelligence.
F. Ciravegna. 2001. Adaptive Information Extraction from Text by Rule Induction and Generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence.
J. Finkel, T. Grenager, and C. Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370, Ann Arbor, MI, June.
A. Finn and N. Kushmerick. 2004. Multi-level Boundary Classification for Information Extraction. In Proceedings of the 15th European Conference on Machine Learning, pages 111–122, Pisa, Italy, September.
Dayne Freitag. 1998a. Multistrategy Learning for Information Extraction. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers.
Dayne Freitag. 1998b. Toward General-Purpose Learning for Information Extraction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.
Z. Gu and N. Cercone. 2006. Segment-Based Hidden Markov Models for Information Extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 481–488, Sydney, Australia, July.
Ruihong Huang and Ellen Riloff. 2010. Inducing Domain-specific Semantic Class Taggers from (Almost) Nothing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010).
Ruihong Huang and Ellen Riloff. 2011. Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-11).
S. Huffman. 1996. Learning Information Extraction Patterns from Examples. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 246–260. Springer-Verlag, Berlin.
H. Ji and R. Grishman. 2008. Refining Event Extraction through Cross-Document Inference. In Proceedings of ACL-08: HLT, pages 254–262, Columbus, OH, June.
S. Keerthi and D. DeCoste. 2005. A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs. Journal of Machine Learning Research.
J. Kim and D. Moldovan. 1993. Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, pages 171–176, Los Alamitos, CA. IEEE Computer Society Press.
Y. Li, K. Bontcheva, and H. Cunningham. 2005. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 72–79, Ann Arbor, MI, June.
Shasha Liao and Ralph Grishman. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10).
M. Maslennikov and T. Chua. 2007. A Multi-Resolution Framework for Information Extraction from Free Text. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
G. Miller. 1990. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4).
MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann.
S. Patwardhan and E. Riloff. 2007. Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions. In Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-2007).
S. Patwardhan and E. Riloff. 2009. A Unified Model of Phrasal and Sentential Evidence for Information Extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-2009).
S. Patwardhan. 2010. Widening the Field of View of Information Extraction through Sentential Event Recognition. Ph.D. thesis, University of Utah.
W. Phillips and E. Riloff. 2007. Exploiting Role-Identifying Nouns and Expressions for Information Extraction. In Proceedings of the 2007 International Conference on Recent Advances in Natural Language Processing (RANLP-07), pages 468–473.
E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.
E. Riloff and W. Phillips. 2004. An Introduction to the Sundance and AutoSlog Systems. Technical Report UUCS-04-015, School of Computing, University of Utah.
E. Riloff. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the 11th National Conference on Artificial Intelligence.
E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049.
Satoshi Sekine. 2006. On-demand Information Extraction. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL-06).
Y. Shinyama and S. Sekine. 2006. Preemptive Information Extraction using Unrestricted Relation Discovery. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 304–311, New York City, NY, June.
S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert. 1995. CRYSTAL: Inducing a Conceptual Dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314–1319.
M. Stevenson and M. Greenwood. 2005. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 379–386, Ann Arbor, MI, June.
K. Sudo, S. Sekine, and R. Grishman. 2003. An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03).
R. Yangarber, R. Grishman, P. Tapanainen, and S. Huttunen. 2000. Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING 2000).
K. Yu, G. Guan, and M. Zhou. 2005. Resumé Information Extraction with Cascaded Hybrid Model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 499–506, Ann Arbor, MI, June.

Bootstrapping Events and Relations from Text

Ting Liu (ILS, University at Albany, USA)
Tomek Strzalkowski (ILS, University at Albany, USA; Polish Academy of Sciences)

[email protected] [email protected]

(2) self-adapting unsupervised multi-pass boot- Abstract strapping by which the system learns new rules as it reads un-annotated text using the rules learnt In this paper, we describe a new approach to in the first step and in the subsequent learning semi-supervised adaptive learning of event passes. When a sufficient quantity and quality of extraction from text. Given a set of exam- text material is supplied, the system will learn ples and an un-annotated text corpus, the many ways in which a specific class of events BEAR system (Bootstrapping Events And can be described. This includes the capability to Relations) will automatically learn how to recognize and understand descriptions of detect individual event mentions using a system complex semantic relationships in text, such of context-sensitive triggers and to isolate perti- as events involving multiple entities and nent attributes such as agent, object, instrument, their roles. For example, given a series of time, place, etc., as may be specific for each type descriptions of bombing and shooting inci- of event. This method produces an accurate and dents (e.g., in newswire) the system will highly adaptable event extraction that significant- learn to extract, with a high degree of accu- ly outperforms current information extraction racy, other attack-type events mentioned techniques both in terms of accuracy and robust- elsewhere in text, irrespective of the form of ness, as well as in deployment cost. description. A series of evaluations using the ACE data and event set show a signifi- 2 Learning by bootstrapping cant performance improvement over our baseline system. As a semi-supervised machine learning method, bootstrapping can start either with a set of prede- fined rules or patterns, or with a collection of 1 Introduction training examples (seeds) annotated by a domain expert on a (small) data set. 
These are normally We constructed a semi-supervised machine related to a target application domain and may be learning process that effectively exploits statisti- regarded as initial “teacher instructions” to the cal and structural properties of natural language learning system. The training set enables the sys- discourse in order to rapidly acquire rules to de- tem to derive initial extraction rules, which are tect mentions of events and other complex rela- applied to un-annotated text data in order to pro- tionships in text, extract their key attributes, and duce a much larger set of examples. The exam- construct template-like representations. The ples found by the initial rules will occur in a learning process exploits descriptive and struc- variety of linguistic contexts, and some of these tural redundancy, which is common in language; contexts may provide support for creating alter- it is often critical for achieving successful com- native extraction rules. When the new rules are munication despite distractions, different con- subsequently applied to the text corpus, addition- texts, or incompatible semantic models between al instances of the target concepts will be identi- a speaker/writer and a hearer/reader. We also fied, some of which will be positive and some take advantage of the high degree of referential not. As this process continues to iterate over, the consistency in discourse (e.g., as observed in system acquires more extraction rules, fanning word sense distribution by (Gale, et al. 1992), out from the seed set until no new rules can be and arguably applicable to larger linguistic learned. units), which enables the reader to efficiently Thus defined, bootstrapping has been used in correlate different forms of description across natural language processing research, notably in coherent spans of text. word sense disambiguation (Yarowsky, 1995). 
The method we describe here consists of two Strzalkowski and Wang (1996) were first to steps: (1) supervised acquisition of initial extrac- demonstrate that the technique could be applied tion rules from an annotated training corpus, and to adaptive learning of named entity extraction 296 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 296–305, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics rules. For example, given a “naïve” rule for iden- be found in essays and other narrative forms. The tifying company names in text, e.g., “capitalized system needs to recognize any of these forms and NP followed by Co.”, their system would first to do so we need to distill each description to a find a large number of (mostly) positive instanc- basic event pattern. This pattern will capture the es of company names, such as “Henry Kauffman heads of key phrases and their dependency struc- Co.” From the context surrounding each of these ture while suppressing modifiers and certain oth- instances it would isolate alternative indicators, er non-essential elements. Such skeletal such as “the president of”, which is noted to oc- representations cannot be obtained with keyword cur in front of many company names, as in “The analysis or linear processing of sentences at word president of American Electric Automobile Co. level (e.g., Agichtein and Gravano, 2000), be- …”. Such alternative indicators give rise to new cause such methods cannot distinguish a phrase extraction rules, e.g., “president of + CNAME”. head from its modifier. A shallow dependency The new rules find more entities, including com- parser, such as Minipar (Lin, 1998), that recog- pany names that do not end with Co., and the nizes dependency relations between words is process iterates until no further rules are found. 
quite sufficient for deriving head-modifier rela- The technique achieved a very high performance tions and thus for construction of event tem- (95% precision and 90% recall), which encour- plates. Event templates are obtained by stripping aged more research in IE area by using boot- the parse tree of modifiers while preserving the strapping techniques. Using a similar approach, basic dependency structure as shown in Figure 1, (Thelen and Riloff, 2002) generated new syntac- which is a stripped down parse tree of, “Also tic patterns by exploiting the context of known Monday, Israeli soldiers fired on four diplomatic seeds for learning semantic categories. vehicles in the northern Gaza town of Beit In Snowball (Agichtein and Gravano, 2000 ) Hanoun, said diplomats” and Yangarber’s IE system (2000), bootstrapping The model proposed here represents a signifi- technique was applied for extraction of binary cant advance over the current methods for rela- relations, such as Organization-Location, e.g., tion extraction, such as the SVO model between Microsoft and Redmond, WA. Then, Xu (Yangarber, et al. 2000) and its extension, e.g., (2007) extended the method for more complex the chain model (Sudo, et al. 2001) and other relations extraction by using sentence syntactic related variants (Riloff, 1996) all of which lack structure and a data driven pattern generation. In the expressive power to accurately recognize and this paper, we describe a different approach on represent complex event descriptions and to sup- building event patterns and adapting to the dif- port successful machine learning. While Sudo’s ferent structures of unseen events. subtree model (2003) overcomes some of the limitations of the chain models and is thus con- 3 Bootstrapping applied to event learn- ceptually closer to our method, it nonetheless ing lacks efficiency required for practical applica- tions. 
Our objective in this project was to expand the We represent complex relations as tree-like bootstrapping technique to learn extraction of structures anchored at an event trigger (which is events from text, irrespective of their form of usually but not necessarily the main verb) with description, a property essential for successful branches extending to the event attributes (which adaptability to new domains and text genres. The are usually named entities). Unlike the singular major challenge in advancing from entities and concepts (i.e., named entities such as ‘person’ or binary relations to event learning is the complex- ity of structures involved that not only consist of multiple elements but their linguistic context may now extend well beyond a few surrounding words, even past sentence boundaries. These considerations guided the design of the BEAR system (Bootstrapping Events And Relations), which is described in this paper. 3.1 Event representation An event description can vary from very concise, newswire-style to very rich and complex as may Figure 1. Skeletal dependency structure representation of an event mention. 297 ‘location’) or linear relations (i.e., tuples such as 3.2 Designating the sense of event triggers ‘Gates – CEO – Microsoft’), an event description An event trigger may have multiple senses but consists of elements that form non-linear de- only one of them is for the event representation. pendencies, which may not be apparent in the If the correct sense can be determined, we would word order and therefore require syntactic and be able to use its synonyms and hyponym as al- semantic analysis to extract. Furthermore, an ar- ternative event triggers, thus enabling extraction rangement of these elements in text can vary of more events. This, in turn, requires sense dis- greatly from one event mention to the next, and ambiguation to be performed on the event trig- there is usually other intervening material in- gers. volved. 
In MUC evaluations, participating groups (Yangarber and Grishman, 1998) used human experts to decide the correct sense of event triggers and then manually added correct synonyms to generalize the event patterns. Although accurate, this process is time-consuming and not portable to new domains.

We developed a new approach that utilizes WordNet to decide the correct sense of an event trigger. The method is based on the hypothesis that event triggers share the same sense when they represent the same type of event. For example, when the verbs attack, assail, strike, gas, and bomb are trigger words of a Conflict-Attack event, they share the same sense. The process consists of the following steps:

1) From the training corpus, collect all triggers, which specify the lemma, the POS tag, and the type of event, and get all of their possible senses from WordNet.

Consequently, we construe an event representation as a collection of paths linking the trigger to the attributes through the nodes of a parse tree¹.

To create an event pattern (which will become part of an extraction rule), we generalize the dependency paths that connect the event trigger with each of the event's key attributes (the roles). A dependency path consists of lexical and syntactic relations (POS and phrase dependencies), as well as semantic relations such as the entity tags (e.g., Person, Company, etc.) of event roles and the word-sense designations (based on WordNet senses) of event triggers. In addition to the trigger-role paths (which we shall call the sub-patterns), an event pattern also contains the following:

• Event Type and Subtype – inherited from the seed examples;
• Trigger class – an instance of the trigger must be found in text before any patterns
are applied;
• Confidence score – the expected accuracy of the pattern, established during the training process;
• Context profile – additional features collected from the context surrounding the event description, including references to other types of events near this event: in the same sentence, the same paragraph, or adjacent paragraphs.

We note that the trigger-attribute sub-patterns are defined over phrase structures rather than over linear text, as shown in Figure 2. In order to compose a complete event pattern, sub-patterns are collected across multiple mentions of the same type of event.

Attacker:    <N(subj, PER): Attacker> <V(fire): trigger>
Place:       <V(fire): trigger> <Prep> <N> <Prep(in)> <N(GPE): Place>
Target:      <V(fire): trigger> <Prep(on)> <N(VEH): Target>
Time-Within: <N(timex2): Time-Within> <SentHead> <V(fire): trigger>
Figure 2. Trigger-attribute sub-patterns for key roles in a Conflict-Attack event pattern.

2) Order the triggers by the trigger frequency TrF(t, w_pos)², which is calculated by dividing the number of times each word (w_pos) is used as a trigger for an event of type t by the total number of times this word occurs in the training corpus. Clearly, the greater the trigger frequency of a word, the more discriminative it is as a trigger for the given type of event. Once the senses of the triggers with high accuracy are defined, they can serve as the reference for the triggers with low accuracy.

3) From the top of the trigger list, select the first trigger whose sense is not yet defined (Tr1).

4) Again beginning from the top of the trigger list, for every trigger Tr2 (other than Tr1), look for a pair of compatible senses between Tr1 and Tr2. To do so, traverse synonym, hypernym, and hyponym links starting from the sense(s) of Tr2 (using either the sense already assigned to Tr2, if it has one, or all of its possible senses) and see whether there are paths that can reach the senses of Tr1.
If such converging paths exist, the compatible senses are identified and assigned to Tr1 and Tr2 (if Tr2's sense was not assigned before); then go back to step 3. However, if no such path exists between the senses of Tr1 and those of the other triggers, the first sense listed in WordNet is assigned to Tr1. This algorithm tries to assign the most appropriate sense to every trigger for a given type of event.

This form of learning, which also includes conditional rule relaxation, is particularly useful for rapid adaptation of the extraction capability to slightly altered, partly ungrammatical, or otherwise variant data. The basic idea is as follows: the patterns acquired in prior learning iterations (starting with those obtained from the seed examples) are matched against incoming text to extract new events. Along the way there will be a number of partial matches, i.e., cases where no existing pattern fully matches a span of text. This may simply mean that no event is present; however, depending upon the degree of the partial match, we may also consider that a novel structural variant has been found. BEAR automatically tests this hypothesis by attempting to construct a new pattern out of the elements of existing patterns in order to achieve a full match. If a match is achieved, the new "mutated" pattern is added to BEAR's learned collection, subject to a validation step.

Figure 3. A Conflict-Attack event pattern derived from a positive example in the training corpus.

¹ Details of how to derive the skeletal tree representation are described in (Liu, 2009).
² t – the type of the event; w_pos – the lemma of a word and its POS.
³ In this figure we omit the parse-tree trimming step, which was explained in the previous section.
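To make steps 2 and 4 of the sense-assignment procedure concrete, here is a minimal sketch in Python. The trigger-frequency function follows the TrF(t, w_pos) definition above; the sense graph is a tiny hand-built stand-in for WordNet (all sense identifiers and links below are invented for illustration), and the compatibility search is a plain breadth-first traversal:

```python
from collections import deque

def trigger_frequency(occurrences, word, etype):
    # TrF(t, w_pos): occurrences of `word` as a trigger of event type
    # `etype`, divided by its total occurrences in the training corpus.
    total = sum(1 for w, _ in occurrences if w == word)
    hits = sum(1 for w, t in occurrences if w == word and t == etype)
    return hits / total if total else 0.0

# Hand-built stand-in for WordNet synonym/hypernym/hyponym links;
# the sense identifiers are invented for this example.
LINKS = {
    "attack.v.01": ["assail.v.01"],
    "assail.v.01": ["attack.v.01", "strike.v.03"],
    "strike.v.03": ["assail.v.01"],
    "strike.v.01": [],  # an unrelated sense of "strike"
}
SENSES = {"attack": ["attack.v.01"], "strike": ["strike.v.01", "strike.v.03"]}

def compatible_senses(tr1, tr2):
    # Breadth-first search from each sense of Tr2 through the links,
    # returning the first (tr1_sense, tr2_sense) pair that is connected.
    targets = set(SENSES[tr1])
    for start in SENSES[tr2]:
        seen, queue = {start}, deque([start])
        while queue:
            sense = queue.popleft()
            if sense in targets:
                return sense, start
            for nxt in LINKS.get(sense, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return None  # no converging path: fall back to the first listed sense

occ = [("fire_V", "Conflict-Attack")] * 3 + [("fire_V", None)] * 3
trf = trigger_frequency(occ, "fire_V", "Conflict-Attack")  # 3/6 = 0.5
pair = compatible_senses("attack", "strike")
```

Here `compatible_senses("attack", "strike")` reaches the event-relevant sense `strike.v.03` via the `assail.v.01` link, mirroring step 4; a real implementation would traverse actual WordNet synsets rather than this toy graph.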
For example, the sense of fire as the trigger of a Conflict-Attack event is "start firing a weapon", while when it is used in a Personnel-End_Position event, its sense is "terminate the employment of". After the trigger sense is defined, we can expand the event triggers by adding their synonyms and hyponyms during event extraction.

3.3 Deriving initial rules from seed examples

Extraction rules are construed as transformations from the event patterns derived from text onto a formal representation of an event. The initial rules are derived from a manually annotated training corpus (seed data), supplied as part of an application task. Each rule contains the type of events it extracts, the trigger, a list of role sub-patterns, and the confidence score obtained through a validation process (see Section 3.6). Figure 3 shows an extraction pattern for the Conflict-Attack event derived from the training corpus (but not yet validated)³.

The validation step (discussed later in this paper) is meant to assure that the added pattern does not introduce an unacceptable drop in overall system precision. Specific pattern mutation techniques include the following:

• Adding a role subpattern: When a pattern matches an event mention and there is sufficient linguistic evidence (e.g., the presence of certain types of named entities) that additional roles may be present in the text, appropriate role subpatterns can be "imported" from other, non-matching patterns (Figure 4).

• Replacing a role subpattern: When a pattern matches but for one role, the system can replace this role subpattern by another subpattern for the same role, taken from a different pattern for the same event type.

• Adding or replacing a trigger: When a pattern matches but for the trigger, a new trigger can be added if it is either already present in another pattern for the same event type or is a synonym/hyponym/hypernym of the trigger (see Section 3.2).
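The first mutation above, importing a role sub-pattern from a non-matching pattern of the same event type, can be sketched as follows. The encoding of a pattern as a role-to-sub-pattern dictionary is a simplification invented for this example, and the sub-pattern strings are abbreviated from Figure 2:

```python
def import_role(base, donor, role):
    # Copy one role sub-pattern from a donor pattern of the same event
    # type into a copy of the base pattern; the mutated pattern must
    # still pass confidence validation before it is learned.
    if donor["type"] != base["type"] or role not in donor["roles"]:
        return None
    mutated = {"type": base["type"], "roles": dict(base["roles"])}
    mutated["roles"][role] = donor["roles"][role]
    return mutated

p1 = {"type": "Conflict-Attack",
      "roles": {"Attacker": "<N(subj, PER): Attacker> <V(fire): trigger>",
                "Target": "<V(fire): trigger> <Prep(on)> <N(VEH): Target>"}}
p2 = {"type": "Conflict-Attack",
      "roles": {"Place": "<V(fire): trigger> <Prep(in)> <N(GPE): Place>"}}

new_pattern = import_role(p1, p2, "Place")  # covers Attacker, Target, Place
```

The base pattern is copied rather than modified in place, so the original pattern stays available if the mutated one fails validation.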
3.4 Learning through pattern mutation

Given an initial set of extraction rules, a variety of pattern mutation techniques are applied to derive new patterns and new rules. This is done by selecting elements of previously learnt patterns, based on the history of partial matches, and combining them into new patterns.

We should point out that some of the same effects can be obtained by making patterns more general, i.e., by adding "optional" attributes (optional sub-patterns), etc. Nonetheless, pattern mutation is more efficient because it automatically learns such generalizations on an as-needed basis, in an entirely data-driven fashion, while also maintaining the high precision of the resulting pattern set. It is thus a more general method. Figure 4 illustrates the use of the element-combination technique. In this example, neither of the two existing patterns can fully match the new event description; however, by combining the first pattern with the Place role sub-pattern from the second pattern, we obtain a new pattern that fully matches the text. While this adjustment is quite simple, it is nonetheless performed automatically and without any human assistance. The new pattern is then "learned" by BEAR, subject to a verification step explained in a later section.

Figure 4. Deriving a new pattern by importing a role from another pattern.

The new pattern (shown in Figure 5B) is of course subject to confidence validation, after which it is immediately applied to extract more events.

Another way of getting at this kind of structural duality is to exploit co-referential consistency within coherent spans of discourse, e.g., a single news article or a similar document. Such documents may contain references to multiple events, but when the same type of event is mentioned
along with the same attributes, it is more likely than not in reference to the same event. This hypothesis is a variant of an argument advanced in Gale et al. (1992), namely that a polysemous word used multiple times within a single document is consistently used in the same sense. So if we extract an event mention (of type T) with trigger t in one part of a document, and then find that t occurs in another part of the same document, we may assume that this second occurrence of t has the same sense as the first. Since t is a trigger for an event of type T, we can hypothesize that its subsequent occurrences indicate additional mentions of type-T events that were not extracted by any of the existing patterns. Our objective is to exploit these unextracted mentions and automatically generate additional event patterns.

Indeed, Ji and Grishman (2008) showed that trigger co-occurrence helps to find new mentions of the same event.

3.5 Learning by exploiting structural duality

As the system reads through new text, extracting more events using the already learnt rules, each extracted event mention is analyzed for the presence of alternative trigger elements that can consistently predict the presence of a subset of events that includes the current one. Subsequently, an alternative sub-pattern structure is built, with branches extending from the new trigger to the already identified attributes, as shown schematically in Figure 5.

In this example, a Conflict-Attack-type event is extracted using a pattern (shown in Figure 5A) anchored at the "bombing" trigger. Nonetheless, an alternative trigger structure is discovered, anchored at the "an attack" NP, as shown on the right side of Figure 5. This "discovery" is based upon seeing the new trigger repeatedly – it needs to "explain" a subset of previously seen events to be adopted.

Pattern ID: 1207
Type: Conflict   Subtype: Attack
Trigger: bombing_N
Target: <N(bombing): trigger> <Prep(of)> <N(FAC): Target>
Attacker: <N(PER): Attacker> <V> <N(bombing): trigger>
Time-Within: <N(bombing): trigger> <Prep> <N> <Prep> <N> <E0> <V> <N(timex2): Time-Within>
Figure 5A. A pattern with the bombing trigger matches the event mention in Figure 5.

Pattern ID: 1286
Type: Conflict   Subtype: Attack
Trigger: attack_N
Target: <N(FAC): Target> <Prep(in)> <N(attack): trigger>
Attacker: <N(PER): Attacker> <V> <N> <Prep> <N> <Prep(in)> <N(attack): trigger>
Time-Within: <N(attack): trigger> <E0> <V> <N(timex2): Time-Within>
Figure 5B. A new pattern is derived for the event in Figure 5, with an attack as the trigger.

Figure 5. A new extraction pattern is derived by identifying an alternative trigger for an event.

The new trigger prompts BEAR to derive additional event patterns by computing alternative trigger-attribute paths in the dependency tree.

However, we found that when entity co-reference is used as an additional factor, more new mentions can be identified when the trigger has low projected accuracy (Liu, 2009; Hong et al., 2011).

Figure 6. The probability of a sentence containing a mention of the same type of event within a single document.

In this example, the two sentences share the same entities, "Howard G. Capek" and "UBS". The projected accuracy of resign_V as an End-Position trigger is 0.88. With a 100% argument-overlap rate, we estimate the probability that sentence R contains an event mention of the same type as sentence L (and in fact a co-referential mention) at 97% (we set 80% as the threshold). Thus a new event mention is found, and a new pattern for End-Position is automatically derived from R, as shown in Figure 7A.

3.6 Pattern validation

Extraction patterns are validated after each learning cycle against the already annotated data. In the first supervised learning step, pattern accuracy is tested against the training corpus, based on the similarity between the extracted events and the human-annotated events:
• A Full match is achieved when the event type is correctly identified and all of its roles are correctly matched. Full credit is added to the pattern score.
• A Partial match is achieved when the event type is correctly identified but only a subset of the roles is correctly extracted. A partial score, the ratio of the matched roles to all of the roles, is added.
• A False Alarm occurs when a wrong type of event is extracted (including when no event is present in the text). No credit is added to the pattern score.

In the subsequent steps, the validation is extended over parts of the unannotated corpus. In Riloff (1996) and Sudo et al. (2001), the pattern accuracy is mainly dependent on its occurrences in the relevant documents⁶ vs. the whole corpus.

Our experiments (Figure 6⁴), which compared the triggers and the roles across all event mentions within each document of the ACE training corpus, showed that when the trigger accuracy is 0.5 or higher, each of its occurrences within the document indicates an event mention of the same type with very high probability (mostly > 0.9). For triggers with lower accuracy, this high probability is only achieved when the two mentions share at least 60% of their roles, in addition to having a common trigger. Thus our approach uses the co-occurrence of both the trigger and the event arguments for detecting new event mentions.

In Figure 7, an End-Position event is extracted from the left sentence (L), with "resign" as the trigger and "Capek" and "UBS" assigned the Person and Entity roles, respectively⁵. The right sentence (R), taken from the same document, contains the same trigger word, "resigned", and also the same entities.
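As an illustration of how the trigger-accuracy and role-overlap evidence might be combined, here is a hedged sketch. The function name, the decision structure, and the flat 0.9 probability are our own simplified stand-in for the statistics summarized in Figure 6, not the paper's actual estimator:

```python
def likely_same_type(trigger_accuracy, role_overlap, threshold=0.8):
    # A repeated trigger signals another mention of the same event type
    # when its projected accuracy is 0.5 or higher; weaker triggers also
    # need the two mentions to share at least 60% of their roles.
    # The 0.9 value is an illustrative stand-in for the curve in Figure 6.
    if trigger_accuracy >= 0.5 or role_overlap >= 0.6:
        probability = 0.9
    else:
        probability = trigger_accuracy
    return probability >= threshold

# resign_V: projected accuracy 0.88 with full argument overlap
accept = likely_same_type(0.88, 1.0)   # clears the 0.8 acceptance threshold
reject = likely_same_type(0.2, 0.3)    # weak trigger, little role overlap
```

In the resign_V example above, the high trigger accuracy alone is enough; the role-overlap branch matters precisely for the low-accuracy triggers the text discusses.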
Figure 7. Two event mentions have different triggers and sub-pattern structures.

Pattern ID: -1
Type: Personnel   Subtype: End-Position
Trigger: resign_V
Person: <N(PER, subj): Person> <V(resign): trigger>
Entity: <V(resign): trigger> <E0> <N(ORG): Entity> <N> <V>
Figure 7A. A new pattern for End-Position learned by exploiting event co-reference.

⁴ The X-axis is the percentage of entities coreferred between the EVMs (event mentions) and the SEs (sentences), while the Y-axis shows the probability that the SE contains a mention of the same type as the EVM.
⁵ Entity is the employer in the event pattern.
⁶ If a document contains the same type of events extracted in previous steps, the document is a relevant document to the pattern.

The projected accuracy is then expected to increase, in some cases above the threshold. For example, the pattern in Figure 8 has an initially low projected accuracy score; however, we find that positive matches of this pattern show a very high (in fact 100%) degree of correlation with mentions of Demonstrate events. Therefore, limiting the application of this pattern to situations where a Justice-Arrest-Jail event is mentioned in a nearby text improves its projected accuracy to 91%, which is well above the required threshold.

Event id: 27 from: sample
Projected Accuracy: 0.1765
Adjusted Projected Accuracy: 0.91
Type: Justice   Subtype: Arrest-Jail
Trigger: capture
Person sub-pattern: <N(obj, PER): Person> <V(capture): trigger>
Co-occurrence ratio: {para_Conflict_Demonstrate=100%, …}
Mutually exclusive ratio: {sent_Conflict_Attack=100%, para_Conflict_Attack=96.3%, …}
Figure 8. An Arrest-Jail pattern with context profile information.

However, one document may contain multiple types of events; thus we set a more restricted validation measure on new rules:

• Good Match: a new rule "rediscovers" already extracted events of the same type; it is then counted as either a Full Match or a Partial Match, based on the previous rules.
• Possible Match: an already extracted event of the same type as the rule contains the same entities and trigger as the candidate extracted by the rule. This candidate is a possible match, so it will get a partial
score based on the statistics result from Figure 6.
• False Alarm: a new rule picks up an already extracted event of a different type.

Thus, event patterns are validated for overall expected precision by calculating the ratio of positive matches to all matches against known events. This produces pattern confidence scores, which are used to decide whether a pattern is to be learned or not. Learning only the patterns with sufficiently high confidence scores helps to guard the bootstrapping process from spinning off track; nonetheless, the overall objective is to maximize the performance of the resulting set of extraction rules, particularly by expanding its recall rate.

In addition to the confidence rate of each new pattern, we also calculate the projected accuracy of each of the role sub-patterns, because they may be used in the process of detecting new patterns, and it will be necessary to score partial matches as a function of the confidence weights of pattern components. To validate a sub-pattern, we apply it to the training corpus and calculate its projected accuracy score by dividing the number of correctly matched roles by the total number of matches returned. The projected accuracy score tells us how well a sub-pattern can distinguish a specific event role from other information when it is used independently of the other elements of the complete pattern.

Figure 9 shows three sub-pattern examples. The first sub-pattern extracts the Victim role in a Life-Die event with very high projected accuracy.

For the patterns whose projected accuracy score falls under the cutoff threshold, we may still be able to make some "repairs" by taking into account their context profile. To do so, we
applied an approach similar to (Liao, 2010), which showed that some types of events can appear frequently with each other. We collected all the matches produced by such a failed pattern and created a list of all the other events that occur in their immediate vicinity: in the same sentence, as well as in the sentences before and after it⁷. These other events, of different types and detected by different patterns, may be seen as co-occurring near the target event: those that co-occur near positive matches of our pattern are added to the positive context support of this pattern; conversely, events co-occurring near false alarms are added to the negative context support for this pattern. By collecting such contextual information, we can find contextually based indicators and non-indicators for the occurrence of event mentions.

Victim pattern: <N(obj, PER): Victim> <V(kill): trigger> (Life-Die)
  Projected Accuracy: 0.9390243902439024
  Number of positive matches: 77
  Number of negative matches: 5

Attacker pattern: <N(subj, GPE/PER/ORG): Attacker> <V> <V(use): trigger> (Conflict-Attack)
  Projected Accuracy: 0.025210084033613446
  Number of positive matches: 3
  Number of negative matches: 116

This sub-pattern is also a good candidate for the generation of additional patterns for this type of event, a process which we describe in section D. The second sub-pattern was built to extract the Attacker role in Conflict-Attack events, but it has very low projected accuracy. The third one shows another Attacker sub-pattern whose projected accuracy score is 0.417 after the first step
in the validation process. This is quite low; however, it can be repaired by constraining its entity type to GPE. This is because we note that with a GPE entity the subpattern is 80% on target, while with a PER entity it is 85% a false alarm. After this sub-pattern is restricted to GPE, its projected accuracy becomes 0.8. When these extra constraints are included in a previously failed pattern, its projected accuracy is expected to increase.

Attacker pattern: <N(subj, GPE/PER): Attacker> <V(attack): trigger> (Conflict-Attack)
  Projected Accuracy: 0.4166666666666667
  Number of positive matches: 5
  Number of negative matches: 7
  Categories of positive matches: GPE: 4 (GPE_Nation: 4); PER: 1 (PER_Individual: 1)
  Categories of negative matches: GPE: 1 (GPE_Nation: 1); PER: 6 (PER_Group: 1, PER_Individual: 5)

Figure 9. Sub-patterns with projected accuracy scores.

⁷ If a known event is detected in the same sentence (sent_…), the same paragraph (para_…), or an adjacent paragraph (adj_para_…) as the candidate event, it becomes an element of the pattern context support.

Our experiments showed that BEAR reached the best cross-validated score, 66.72%, when the pattern accuracy threshold is set at 0.5. The highest score of a single run is 67.62%. In the remainder of this section, we use the results of one single run to display the learning behavior of BEAR.

Table 1. Sub-patterns whose projected accuracy is significantly increased after noisy samples are removed

  Sub-pattern                                               Projected accuracy   Additional constraint    Revised accuracy
  Movement-Transport:
    <N(obj, PER/VEH): Artifact> <V(send): trigger>          0.475                removing PER             0.667
    <V(bring): trigger> <N(obj)> <Prep=to>
      <N(FAC/GPE): Destination>                             0.375                removing GPE             1.0
    …
  Conflict-Attack:
    <N(PER/ORG/GPE): Attacker> <N(attack): trigger>         0.682                removing PER             0.8
    <N(subj, GPE/PER): Attacker> <V(attack): trigger>       0.417                removing GPE             0.8
    <N(obj, VEH/PER/FAC): Target> <V(target): trigger>      0.364                removing PER_Individual  0.667
    …
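The repair illustrated above amounts to recomputing a sub-pattern's projected accuracy after filtering out a noisy entity type. A small sketch, using the match counts of the Figure 9 Attacker sub-pattern (4 positive and 1 negative GPE matches; 1 positive and 6 negative PER matches); the list-of-pairs encoding of matches is our own simplification:

```python
def projected_accuracy(matches):
    # Ratio of positive matches to all matches returned by a sub-pattern.
    if not matches:
        return 0.0
    return sum(1 for _, positive in matches if positive) / len(matches)

def restrict_entity_type(matches, keep):
    # Drop matches whose entity type is not in `keep`, then re-score.
    return projected_accuracy([m for m in matches if m[0] in keep])

# Attacker sub-pattern from Figure 9: 5 positive and 7 negative matches.
matches = ([("GPE", True)] * 4 + [("PER", True)] +
           [("GPE", False)] + [("PER", False)] * 6)

before = projected_accuracy(matches)            # 5/12, about 0.417
after = restrict_entity_type(matches, {"GPE"})  # 4/5 = 0.8
```

Restricting the sub-pattern to GPE entities keeps the 4 good GPE matches and discards the mostly spurious PER matches, reproducing the 0.417-to-0.8 improvement described in the text.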
Table 1 lists example sub-patterns for which the projected accuracy increases significantly after adding more constraints. When the projected accuracy of a sub-pattern is improved, all patterns containing this sub-pattern also improve their projected accuracy. If the adjusted projected accuracy rises above the predefined threshold, the repaired pattern is saved.

In the following section, we discuss the experiments conducted to evaluate the performance of the techniques underlying BEAR: how effectively it can learn and how accurately it can perform its extraction task.

4 Evaluation

We test the system's learning effectiveness by comparing its performance immediately following the first iteration (i.e., using rules derived from the training data) with its performance after N cycles of unsupervised learning.

In Figure 10, the X-axis shows the values of the learning threshold (in descending order), while the Y-axis is the average F-score achieved by the automatically learned patterns for all types of events against the test corpus. The red (lower) line represents BEAR's base run immediately after the first iteration (the supervised learning step); the blue (upper) line represents BEAR's performance after an additional 10 unsupervised learning cycles⁹ are completed. We note that the final performance of the bootstrapped system steadily increases as the learning threshold is lowered, peaking at about the 0.5 threshold value, and then declines as the threshold value is further decreased, although it remains solidly above the base run. Analyzing a few selected points on this chart more closely, we note, for example, that the base run at a threshold of 0 has an F-score of 34.5%, which represents 30.42% recall and 40% precision. At the other end of the curve, at a threshold of 0.9, the base-run precision is 91.8% but the recall is only 21.5%, which produces an F-score of 34.8%.
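The F-scores quoted for the two extremes follow from the usual harmonic mean of precision and recall:

```python
def f_score(precision, recall):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Base run at threshold 0.0: P = 0.40,  R = 0.3042 -> F1 of about 0.345
# Base run at threshold 0.9: P = 0.918, R = 0.215  -> F1 of about 0.348
low = f_score(0.40, 0.3042)
high = f_score(0.918, 0.215)
```

This is why the two extremes score almost identically despite very different precision/recall trade-offs: the harmonic mean is dominated by the smaller of the two values.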
We split the ACE training corpus⁸ randomly into 5 folds, trained BEAR on four of the folds, and evaluated it on the remaining one; we then performed 5-fold cross-validation.

Table 2. BEAR performance following different selections of learning steps

         Precision   Recall   F-score
  Base1  0.89        0.22     0.35
  Base2  0.87        0.28     0.42
  All    0.84        0.56     0.67
  PMM    0.84        0.48     0.61
  CBM    0.86        0.37     0.52

Base1 and Base2 show the results without and with adding trigger synonyms in event extraction, respectively. By introducing trigger synonyms, 27% more good events were extracted at the first iteration, and thus BEAR had more resources to use in the unsupervised learning steps.

Figure 10. BEAR cross-validated scores.
Figure 11. BEAR's unsupervised learning curve.

⁸ The ACE training data contains 599 documents from news, weblog, usenet, and conversational telephone speech. A total of 33 types of events are defined in the ACE corpus.
⁹ The learning process for one type of event stops when no new patterns can be generated, so the number of learning cycles for each event type is different. The highest number of learning cycles is 10 and the lowest is 2.

It is interesting to observe that at neither of these two extremes is the system's learning effectiveness particularly good; it is significantly lower than at the median threshold of 0.5 (based on the experiments conducted thus far), where the system performance improves from 42% to 66.86% F-score, which represents 83.9% precision and 55.57% recall.

Figure 11 explains BEAR's learning effectiveness at what we determined empirically to be the optimal confidence threshold (0.5) for pattern acquisition. We note that the performance of the system steadily increases until it reaches a plateau after about 10 learning cycles.

Figure 12 and Figure 13 show a detailed breakdown of BEAR's extraction performance after 10 learning cycles for the different types of events.
We note that while precision holds steady across the event types, the recall levels vary significantly. The main reason for the low recall on some types of events is the failure to find a sufficient number of high-confidence patterns. This may point to limitations of the current pattern discovery methods and may require new ways of reaching outside of the current feature set.

In the previous section we described several learning methods that BEAR uses to discover, validate, and adapt new event extraction rules. Some of them work by manipulating already learnt patterns and adapting them to new data in order to create new patterns; we shall call these pattern-mutation methods (PMM). The other methods described work by exploiting the broader linguistic context in which the events occur; we call these context-based methods (CBM). CB methods look for structural duality in the text surrounding the events and thus discover alternative extraction patterns.

The All run is the combination of PMM and CBM, which demonstrates that both methods contribute to the final results. Furthermore, as explained before, new extraction rules are learned in each iteration cycle based on what was learned in prior cycles, and new rules are adopted only after they are tested for their projected accuracy (confidence score), so that the overall precision of the resulting rule set is maintained at a high level relative to the base run.

5 Conclusion and future work

In this paper, we presented a semi-supervised method for learning new event extraction patterns from un-annotated text. The techniques described here add significant new tools that increase the capabilities of information extraction technology in general and, more specifically, of systems that are built by purely supervised methods or from manually designed rules. Our evaluation using the ACE dataset demonstrated that
bootstrapping can be effectively applied to learning event extraction rules for 33 different types of events, and that the resulting system can significantly outperform a supervised system (the base run). Some follow-up research issues include:

• New techniques are needed to recognize event descriptions that still evade the current pattern derivation techniques, especially for the events defined in the Personnel, Business, and Transactions classes.
• Adapting the bootstrapping method to extract events in a different language, e.g., Chinese or Arabic.
• Expanding this method to the extraction of larger "scenarios", i.e., groups of correlated events that form coherent "stories" often described in larger sections of text, e.g., an event and its immediate consequences.

In Table 2, we report the results of running BEAR with each of these two groups of learning methods separately and then in combination, to see how they contribute to the end performance.

Figure 12. Event mention extraction after learning: precision for each type of event.
Figure 13. Event mention extraction after learning: recall for each type of event.

References

Agichtein, E. and Gravano, L. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In Proceedings of the Fifth ACM International Conference on Digital Libraries.

Gale, W. A., Church, K. W., and Yarowsky, D. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, 233–237, Harriman, New York: Association for Computational Linguistics.

Thelen, M. and Riloff, E. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 214–222, Morristown, NJ: Association for Computational Linguistics.

Xu, F., Uszkoreit, H., and Li, H. 2007. A seed-driven bottom-up machine learning framework for extracting relations of various complexity. In Proceedings
of the 45th Annual Meeting of the Association for Computational Linguistics, 584–591, Prague, Czech Republic.

Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., and Zhu, Q. 2011. Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA.

Ji, H. and Grishman, R. 2008. Refining Event Extraction Through Unsupervised Cross-document Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2008), Ohio, USA.

Liao, S. and Grishman, R. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of ACL 2010, 789–797, Uppsala, Sweden.

Lin, D. 1998. Dependency-based evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada, Spain.

Liu, T. 2009. BEAR: Bootstrap Event and Relations from Text. Ph.D. Thesis.

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1044–1049. The AAAI Press/MIT Press.

Sudo, K., Sekine, S., and Grishman, R. 2001. Automatic Pattern Acquisition for Japanese Information Extraction. In Proceedings of the Human Language Technology Conference (HLT 2001).

Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7 ST. In Proceedings of the 7th Conference on Message Understanding.

Yangarber, R., Grishman, R., Tapanainen, P., and Huttunen, S. 2000. Unsupervised discovery of scenario-level patterns for information extraction. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-NAACL 2000), 282–289.

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189–196, Cambridge, Massachusetts: Association for Computational Linguistics.
Sudo, K., Sekine, S., and Grishman, R. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of ACL 2003, 224–231, Tokyo.

Strzalkowski, T. and Wang, J. 1996. A self-learning universal concept spotter. In Proceedings of the 16th Conference on Computational Linguistics, Volume 2, 931–936, Copenhagen, Denmark: Association for Computational Linguistics.

CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language

Svitlana Volkova, Johns Hopkins University, 3400 North Charles, Baltimore, MD 21218, USA
William B. Dolan, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
Theresa Wilson, HLTCOE, 810 Wyman Park Drive, Baltimore, MD 21211, USA

[email protected] [email protected] [email protected]

Abstract

Existing concept-color-emotion lexicons limit themselves to small sets of basic emotions and colors, which cannot capture the rich palette of color terms that humans use in communication. In this paper we begin to address this problem by building a novel color-emotion-concept association lexicon via crowdsourcing. This lexicon, which we call CLEX, has over 2,300 color terms, over 3,000 affect terms and almost 2,000 concepts. We investigate the relations between color and concept and between color and emotion, reinforcing results from previous studies as well as discovering new associations. We also investigate cross-cultural differences in color-emotion associations between US- and India-based annotators.

1 Introduction

People typically use color terms to describe the visual characteristics of objects, and certain colors often have strong associations with particular objects, e.g., blue - sky, white - snow. However, people also take advantage of color terms to strengthen their messages and convey emotions in natural interactions (Jacobson and Bender, 1996; Hardin and Maffi, 1997). Colors are both indicative of and have an effect on our feelings and emotions. Some colors are associated with positive emotions, e.g., joy, trust and admiration, and some with negative emotions, e.g., aggressiveness, fear, boredom and sadness (Ortony et al., 1988).

Given the importance of color and visual descriptions in conveying emotion, obtaining a deeper understanding of the associations between colors, concepts and emotions may be helpful for many tasks in language understanding and generation. A detailed set of color-concept-emotion associations (e.g., brown - darkness - boredom; red - blood - anger) could be quite useful for sentiment analysis, for example, in helping to understand what emotion a newspaper article, a fairy tale, or a tweet is trying to evoke (Alm et al., 2005; Mohammad, 2011b; Kouloumpis et al., 2011). Color-concept-emotion associations may also be useful for textual entailment, and for machine translation as a source of paraphrasing.

Color-concept-emotion associations also have the potential to enhance human-computer interactions in many real- and virtual-world domains, e.g., online shopping, and avatar construction in gaming environments. Such knowledge may allow for clearer and hopefully more natural descriptions by users, for example searching for a sky-blue shirt rather than a blue or light blue shirt. Our long-term goal is to use color-emotion-concept associations to enrich dialog systems with information that will help them generate more appropriate responses to users' different emotional states.

This work introduces a new lexicon of color-concept-emotion associations, created through crowdsourcing. We call this lexicon CLEX. It is comparable in size to only two known lexicons: WORDNET-AFFECT (Strapparava and Valitutti, 2004) and EMOLEX (Mohammad and Turney, 2010). In contrast to the development of these lexicons, we do not restrict our annotators to a particular set of emotions. This allows us to

CLEX is available for download at http://research.microsoft.com/en-us/downloads/. Questions about the data and the access process may be sent to

[email protected]

306 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 306-314, Avignon, France, April 23-27 2012. c 2012 Association for Computational Linguistics

collect more linguistically rich color-concept annotations associated with mood, cognitive state, behavior and attitude. We also place no restrictions on color naming, which helps us to discover a rich lexicon of color terms and collocations representing various hues, darkness, saturation and other natural language collocations.

We also perform a comprehensive analysis of the data by investigating several questions, including: What affect terms are evoked by a certain color, e.g., positive vs. negative? What concepts are frequently associated with a particular color? What is the distribution of part-of-speech tags over concepts and affect terms in data collected without any preselected set of affect terms and concepts? What affect terms are strongly associated with a certain concept or category of concepts, and is there any correlation with the semantic orientation of a concept?

Finally, we share our experience collecting data using crowdsourcing, describing its advantages and disadvantages as well as the strategies we used to ensure high-quality annotations.

2 Related Work

Interestingly, some color-concept associations vary by culture and are influenced by the traditions and beliefs of a society. As shown by Sable and Akcay (2010), green represents danger in Malaysia, envy in Belgium, and love and happiness in Japan; red is associated with luck in China and Denmark but with bad luck in Nigeria and Germany, and reflects ambition and desire in India.

Some expressions involving colors share the same meaning across many languages, for instance, white heat or red heat (a state of high physical and mental tension), blue-blood (an aristocrat, royalty), and white-collar or blue-collar (office clerks). However, there are some expressions where color associations differ across languages: a British or Italian black eye becomes blue in Germany, purple in Spain and black-butter in France; your French, Italian and English neighbors are green with envy while Germans are yellow with envy (Bortoli and Maroto, 2001).

There has been little academic work on constructing color-concept and color-emotion lexicons. The work most closely related to ours collects concept-color (Mohammad, 2011c) and concept-emotion (EMOLEX) associations, both relying on crowdsourcing. That project involved collecting color and emotion annotations for 10,170 word-sense pairs from the Macquarie Thesaurus (http://www.macquarieonline.com.au). The annotations were analyzed for associations with the 11 basic color terms of Berlin and Kay (1988). The set of emotion labels used in the annotations was restricted to the 8 basic emotions proposed by Plutchik (1980). The annotators were restricted to the US, and produced 4.45 annotations per word-sense pair on average.

There is also a commercial project, Cymbolism (http://www.cymbolism.com/), that collects concept-color associations. It has 561,261 annotations for a restricted set of 256 concepts, mainly nouns, adjectives and adverbs.

Other work on collecting emotional aspects of concepts includes WordNet-Affect (WNA) (Strapparava and Valitutti, 2004), the General Inquirer (GI) (Stone et al., 1966), Affective Norms for English Words (Bradley and Lang, 1999) and Elliott's Affective Reasoner (Elliott, 1992).

The WNA lexicon is a set of affect terms from WordNet (Miller, 1995). It contains emotions, cognitive states, personality traits, behavior, attitude and feelings, e.g., joy, doubt, competitive, cry, indifference, pain. A total of 289 affect terms were manually extracted, and the lexicon was later extended using WordNet semantic relationships. WNA covers 1,903 affect terms: 539 nouns, 517 adjectives, 238 verbs and 15 adverbs.

The General Inquirer covers 11,788 concepts labeled with 182 category labels, including certain affect categories (e.g., pleasure, arousal, feeling, pain) in addition to positive/negative semantic orientation for concepts (http://www.wjh.harvard.edu/~inquirer/).

Affective Norms for English Words is a manually collected set of normative emotional ratings for 1K English words, rated in terms of emotional arousal (ranging from calm to excited), affective valence (ranging from pleasant to unpleasant) and dominance (ranging from in control to dominated).

Elliott's Affective Reasoner is a collection of programs that is able to reason about human emotions. The system covers a set of 26 emotion categories from Ortony et al. (1988).

Kaya (2004) and Strapparava and Ozbal (2010) have both worked on inferring emotions associated with colors using semantic similarity. Their research found that Americans perceive red as excitement, yellow as cheer, purple as dignity, and associate blue with comfort and security. Other research includes work geared toward discovering culture-specific color-concept associations (Gage, 1993) and color preference, for example, in children vs. adults (Ou et al., 2011).

3 Data Collection

In order to collect color-concept and color-emotion associations, we use Amazon Mechanical Turk (http://www.mturk.com). It is a fast and relatively inexpensive way to get a large amount of data from many cultures all over the world.

3.1 MTurk and Data Quality

Amazon Mechanical Turk is a crowdsourcing platform that has been extensively used over the last few years for obtaining low-cost human annotations for various linguistic tasks (Callison-Burch, 2009). The quality of the data obtained from non-expert annotators, also referred to as workers or turkers, was investigated by Snow et al. (2008). Their empirical results show that the quality of non-expert annotations is comparable to that of expert annotations on a variety of natural language tasks, but the cost of annotation is much lower.

There are various quality control strategies that can be used to ensure annotation quality. For instance, one can restrict a "crowd" by creating a pilot task that allows only workers who pass it to proceed with annotations (Chen and Dolan, 2011). In addition, new quality control mechanisms have recently been introduced, e.g., Masters: groups of workers who are trusted for their consistently high-quality annotations, but employing them costs more.

Our task required direct natural language input from workers and did not include any multiple-choice questions (which tend to attract more cheating). Thus, we limited our quality control efforts to (1) checking for empty input fields and (2) blocking copy/paste functionality on the form. We did not ask workers to complete any qualification tasks because it is impossible to have gold-standard answers for color-emotion and color-concept associations. In addition, we limited our crowd to a set of trusted workers who had been consistently working on similar tasks for us.

3.2 Task Design

Our task was designed to collect a linguistically rich set of color terms, emotions, and concepts associated with a large set of colors, specifically the 152 RGB values corresponding to the facial features of cartoon human avatars. In total we had 36 colors for hair/eyebrows, 18 for eyes, 27 for lips, 26 for eye shadows, 27 for the facial mask and 18 for skin. These data are necessary to achieve our long-term goal, which is to model natural human-computer interactions in a virtual-world domain such as the avatar editor.

We designed two MTurk tasks. For Task 1, we showed a swatch for one RGB value and asked 50 workers to name the color, describe the emotions this color evokes, and define a set of concepts associated with that color. For Task 2, we showed a particular facial feature and a swatch in a particular color, and asked 50 workers to name the color and describe the concepts and emotions associated with that color. Figure 1 shows what would be presented to a worker for Task 2.

Q1. How would you name this color?
Q2. What emotion does this color evoke?
Q3. What concepts do you associate with it?

Figure 1: Example of MTurk Task 2. Task 1 is the same except that only a swatch is given.

The design that we suggested has a minor limitation in that a color swatch may display differently on different monitors. However, we hope to overcome this issue by collecting 50 annotations per RGB value. Example color →(e) emotion →(c) concept associations produced by different annotators a_i are shown below:

- [R=222, G=207, B=186] (a1) light golden yellow →(e) purity, happiness →(c) butter cookie, vanilla; (a2) gold →(e) cheerful, happy →(c) sun, corn; (a3) golden →(e) sexy →(c) beach, jewelery.
- [R=218, G=97, B=212] (a4) pinkish purple →(e) peace, tranquility, stressless →(c) justin bieber's headphones, someday perfume; (a5) pink →(e) happiness →(c) rose, bougainvillea.

In addition, we collected data about workers' gender, age, native language, number of years of experience with English, and color preferences. These data are useful for investigating variance in annotations of color-emotion-concept associations among workers from different cultural and linguistic backgrounds.

4 Data Analysis

We collected 15,200 annotations, evenly divided between the two tasks, over 12 days. In total, 915 workers (41% male, 51% female and 8% who did not specify), mainly from India and the United States, completed our tasks, as shown in Table 1. 18% of workers produced 20 or more annotations. Workers spent 78 seconds on average per annotation, with an average pay rate of $2.30 per hour ($0.05 per completed task).

Country          Annotations
India            7844
United States    5824
Canada           187
United Kingdom   172
Colombia         100

Table 1: Demographic information about annotators: top 5 countries represented in our dataset.

In total, we collected 2,315 unique color terms, 3,397 unique affect terms, and 1,957 unique concepts for the given 152 RGB values. In the sections below we discuss our findings on color naming and on color-emotion and color-concept associations. We also compare the annotated affect terms and concepts in CLEX with other existing lexicons.

4.1 Color Terms

Berlin and Kay (1988) state that as languages evolve they acquire new color terms in a strict chronological order. When a language has only two color terms, they are white (light, warm) and black (dark, cold). English is considered to have 11 basic colors: white, black, red, green, yellow, blue, brown, pink, purple, orange and gray, known as the B&K order.

In addition, colors can be distinguished along at most three independent dimensions: hue (olive, orange), darkness (dark, light, medium), saturation (grayish, vivid), and brightness (deep, pale) (Mojsilovic, 2002). Interestingly, we observe these dimensions in CLEX by looking at the B&K color terms and their frequent collocations. We present the top 10 collocations for each B&K color in Table 2. As can be seen, color terms truly are distinguished by darkness, saturation and brightness terms, e.g., light, dark, greenish, deep. In addition, we find that color terms are also associated with color-specific collocations, e.g., sky blue, chocolate brown, pea green, salmon pink, carrot orange. These collocations were produced by annotators to describe the colors of particular RGB values. We investigate these color-concept associations in more detail in Section 4.3.

Color    Co-occurrences                                                            P
white    off, antique, half, dark, black, bone, milky, pale, pure, silver          0.62
black    light, blackish brown, brownish, brown, jet, dark, green, off, ash,
         blackish grey                                                             0.43
red      dark, light, dish brown, brick, orange, brown, indian, dish, crimson,
         bright                                                                    0.59
green    dark, light, olive, yellow, lime, forest, sea, dark olive, pea, dirty     0.54
yellow   light, dark, green, pale, golden, brown, mustard, orange, deep, bright    0.63
blue     light, sky, dark, royal, navy, baby, grey, purple, cornflower, violet     0.55
brown    dark, light, chocolate, saddle, reddish, coffee, pale, deep, red,
         medium                                                                    0.67
pink     dark, light, hot, pale, salmon, baby, deep, rose, coral, bright           0.55
purple   light, dark, deep, blue, bright, medium, pink, pinkish, bluish, pretty    0.69
orange   light, burnt, red, dark, yellow, brown, brownish, pale, bright, carrot    0.68
gray     dark, light, blue, brown, charcoal, leaden, greenish, grayish blue,
         pale, grayish brown                                                       0.62

Table 2: Top 10 color term collocations for the 11 B&K colors; co-occurrences are sorted by frequency from left to right in decreasing order; P = sum_{i=1}^{10} p(c_i | color) is the total estimated probability of the top 10 co-occurrences.
In total, CLEX has 2,315 unique color names for the set of 152 RGB values. The inter-annotator agreement rate on color naming is shown in Table 3. We report free-marginal kappa (Randolph, 2005) because we did not force annotators to assign a certain number of RGB values to a certain number of color terms. Additionally, we report inter-annotator agreement for an exact string match, e.g., purple, green, and a substring match, e.g., pale yellow = yellow = golden yellow.

Agreement                Color Term
% of overall agreement   Exact match      0.492
                         Substring match  0.461
Free-marginal Kappa      Exact match      0.458
                         Substring match  0.424

Table 3: Inter-annotator agreement on assigning names to RGB values: 100 annotators, 152 RGB values and 16 color categories including the 11 B&K colors, 4 additional colors and "none of the above".

4.2 Color-Emotion Associations

In total, the CLEX lexicon has 3,397 unique affect terms representing feelings (calm, pleasure), emotions (joy, love, anxiety), attitudes (indifference, caution), and mood (anger, amusement). The affect terms in CLEX include the 8 basic emotions from Plutchik (1980): joy, sadness, anger, fear, disgust, surprise, trust and anticipation (a superset of the emotions from Ekman (1992)). CLEX is a very rich lexicon because we did not restrict our annotators to any specific set of affect terms. A wide range of parts of speech are represented, as shown in the first column of Table 4. For instance, the term love is represented by other semantically related terms such as lovely, loved, loveliness, loveless and love-able, and the term joy is represented as enjoy, enjoyable, enjoyment, joyful, joyfulness and overjoyed.

POS          Affect Terms, %   Concepts, %
Nouns        79                52
Adjectives   12                29
Adverbs      3                 5
Verbs        6                 12

Table 4: Main syntactic categories for affect terms and concepts in CLEX.

The manually constructed portion of WORDNET-AFFECT includes 101 positive and 188 negative affect terms (Strapparava and Valitutti, 2004). Of this set, 41% appeared at least once in CLEX. We also looked specifically at the set of terms labeled as emotions in the WORDNET-AFFECT hierarchy. Of these, 12 are positive emotions and 10 are negative emotions. We found that 9 of the 12 positive emotion terms (all except self-pride, levity and fearlessness) and 9 of the 10 negative emotion terms (all except ingratitude) also appear in CLEX, as shown in Table 5. Thus, we can conclude that annotators do not associate any colors with self-pride, levity, fearlessness and ingratitude. In addition, some emotions were associated with colors more frequently than others. For instance, positive emotions like calmness, joy and love are more frequent in CLEX than expectation and gratitude; negative emotions like sadness and fear are more frequent than shame, humility and daze.

Positive      Freq.   Negative     Freq.
calmness      1045    sadness      356
joy           527     fear         250
love          482     anxiety      55
hope          147     despair      19
affection     86      compassion   10
enthusiasm    33      dislike      8
liking        5       shame        5
expectation   3       humility     3
gratitude     3       daze         1

Table 5: WORDNET-AFFECT positive and negative emotion terms from CLEX. Emotions are sorted by frequency in decreasing order from the total of 27,802 annotations.

Next, we analyze the color-emotion associations in CLEX in more detail and compare them with the only other publicly available color-emotion lexicon, EMOLEX. Recall that EMOLEX (Mohammad, 2011a) has the 11 B&K colors associated with 8 basic positive and negative emotions from Plutchik (1980). Affect terms in CLEX are not labeled as conveying positive or negative emotions. Instead, we use the 289 affect terms that overlap between WORDNET-AFFECT and CLEX and propagate labels from WORDNET-AFFECT to the corresponding affect terms in CLEX. As a result we discover positive and negative affect term associations with the 11 B&K colors. Table 6 shows the percentage of positive and negative affect term associations with colors for both CLEX and EMOLEX.
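The free-marginal kappa used for the color-naming agreement in Table 3 (Randolph, 2005) replaces the chance-agreement term of Fleiss' fixed-marginal kappa with 1/k for k available categories. A minimal sketch of the computation (the ratings below are toy data, not our annotations):

```python
from collections import Counter

def free_marginal_kappa(item_labels, k):
    """Randolph's free-marginal multirater kappa.
    item_labels: one list of category labels per item (e.g. per RGB value);
    k: number of available categories, so chance agreement is 1/k."""
    agreements = []
    for labels in item_labels:
        n = len(labels)
        counts = Counter(labels)
        # proportion of agreeing rater pairs for this item
        p_i = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
        agreements.append(p_i)
    p_o = sum(agreements) / len(agreements)
    p_e = 1.0 / k
    return (p_o - p_e) / (1 - p_e)

ratings = [["blue", "blue", "blue", "purple"],
           ["green", "green", "green", "green"]]
kappa = free_marginal_kappa(ratings, k=16)
```

Here k=16 mirrors the 16 color categories of Table 3; the per-item pairwise agreement is averaged before the chance correction is applied.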
         Positive        Negative
         CLEX    EL      CLEX    EL
white    2.5     20.1    0.3     2.9
black    0.6     3.9     9.3     28.3
red      1.7     8.0     8.2     21.6
green    3.3     15.5    2.7     4.7
yellow   3.0     10.8    0.7     6.9
blue     5.9     12.0    1.6     4.1
brown    6.5     4.8     7.6     9.4
pink     5.6     7.8     1.1     1.2
purple   3.1     5.7     1.8     2.5
orange   1.6     5.4     1.7     3.8
gray     1.0     5.7     3.6     14.1

Table 6: The percentage of affect terms associated with B&K colors in CLEX and EMOLEX (similar color-emotion associations are shown in bold).

The percentage of color-emotion associations in CLEX and EMOLEX differs because the set of affect terms in CLEX consists of 289 positive and negative affect terms, compared to 8 affect terms in EMOLEX. Nevertheless, we observe the same pattern as Mohammad (2011a) for negative emotions: they are associated with the colors black, red and gray, except that yellow becomes a color of positive emotions in CLEX. Moreover, we found the associations with the color brown to be ambiguous, as it was associated with both positive and negative emotions. In addition, we did not observe strong associations between white and positive emotions. This may be because white is the color of grief in India. The rest of the positive emotions follow the EMOLEX pattern and are associated with the colors green, pink, blue and purple.

Next, we perform a detailed comparison between CLEX and EMOLEX color-emotion associations for the 11 B&K colors and the 8 basic emotions from Plutchik (1980) in Table 7. Recall that annotations in EMOLEX were done by workers from the USA only. Thus, we report two numbers for CLEX: annotations from workers from the USA (CA) and all annotations (C). We take the EMOLEX results from Mohammad (2011c). We observe a strong correlation between the CLEX and EMOLEX affect lexicons for some color-emotion associations. For instance, anger has a strong association with red and brown, anticipation with green, fear with black, joy with pink, sadness with black, brown and gray, surprise with yellow and orange, and finally, trust is associated with blue and brown. Nonetheless, we also found disagreements in color-emotion associations between CLEX and EMOLEX. For instance, anticipation is associated with orange in CLEX compared to white, red or yellow in EMOLEX. We also found quite a few inconsistent associations with the disgust emotion. This inconsistency may be explained by several factors: (a) EMOLEX associates emotions with colors through concepts, whereas CLEX has color-emotion associations obtained directly from annotators; (b) CLEX has 3,397 affect terms compared to the 8 basic emotions in EMOLEX, and may therefore introduce some ambiguous color-emotion associations.

Finally, we investigate cross-cultural differences in color-emotion associations between the two most representative groups of our annotators: US-based and India-based. We consider the 8 Plutchik emotions and allow associations with all possible color terms (rather than only the 11 B&K colors). We show the top 5 colors associated with each emotion for the two groups of annotators in Figure 2. For example, we found that US-based annotators associate pink with joy and dark brown with trust, vs. India-based annotators, who associate yellow with joy and blue with trust.

4.3 Color-Concept Associations

In total, workers annotated the 152 RGB values with 37,693 concepts, which is on average 2.47 concepts per annotation, compared to 1.82 affect terms. CLEX contains 1,957 unique concepts, including 1,667 nouns, 23 verbs, 28 adjectives, and 12 adverbs. We investigate the overlap of concepts by part-of-speech tag between CLEX and other lexicons, including EMOLEX (EL), Affective Norms for English Words (AN), and the General Inquirer (GI). The results are shown in Table 8.

Finally, we generate concept clusters associated with the colors yellow, white and brown in Figure 3. From the clusters, we observe that the most frequent k concepts associated with these colors correlate with either positive or negative emotion. For example, white is frequently associated with snow, milk and cloud, and all of these concepts evoke positive emotions. This observation helps resolve the ambiguity in color-emotion associations we found in Table 7.
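The per-POS overlaps and set differences reported in Table 8 reduce to plain set operations once each lexicon is represented as a mapping from POS tags to term sets. A minimal sketch with invented mini-lexicons (the terms below are illustrative, not the actual lexicon contents):

```python
def overlap_by_pos(lex_a, lex_b):
    """Per-POS intersection sizes, plus the number of terms unique
    to each lexicon (summed over POS tags)."""
    inter = {pos: len(lex_a[pos] & lex_b.get(pos, set())) for pos in lex_a}
    only_a = sum(len(lex_a[pos] - lex_b.get(pos, set())) for pos in lex_a)
    only_b = sum(len(lex_b[pos] - lex_a.get(pos, set())) for pos in lex_b)
    return inter, only_a, only_b

clex = {"noun": {"snow", "milk", "blood"}, "adj": {"calm", "dark"}}
gi = {"noun": {"snow", "fire"}, "adj": {"calm"}}
inter, clex_only, gi_only = overlap_by_pos(clex, gi)
```

Applied to the full lexicons, `inter` corresponds to the CLEX∩GI column and the two difference counts to the GI\CLEX and CLEX\GI rows.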
green, fear with black, joy with pink, sadness 5 Conclusions with black, brown and gray, surprise with yel- low and orange, and finally, trust is associated We have described a large-scale crowdsourcing with blue and brown. Nonetheless, we also found effort aimed at constructing a rich color-emotion- 311 white black red green yellow blue brown pink purple orange grey C - 3.6 43.4 0.3 0.3 0.3 3.3 0.6 0.3 1.5 2.1 anger CA - 3.8 40.6 0.8 - - 4.5 - 0.8 2.3 0.8 EA 2.1 30.7 32.4 5.0 5.0 2.4 6.6 0.5 2.3 2.5 9.9 C 0.3 24.0 0.3 0.6 0.3 4.2 11.4 0.3 2.2 0.3 10.3 sadness CA - 22.2 - 0.6 - 5.3 9.4 - 4.1 - 12.3 EA 3.0 36.0 18.6 3.4 5.4 5.8 7.1 0.5 1.4 2.1 16.1 C 0.8 43.0 8.9 2.0 1.2 0.4 6.1 0.4 0.8 0.4 2.0 fear CA - 29.5 10.5 3.2 1.1 - 3.2 - 1.1 1.1 4.2 EA 4.5 31.8 25.0 3.5 6.9 3.0 6.1 1.3 2.3 3.3 11.8 C - 2.3 1.1 11.2 1.1 1.1 24.7 1.1 3.4 1.1 - disgust CA - - - 14.8 1.8 - 33.3 - 1.8 - - EA 2.0 33.7 24.9 4.8 5.5 1.9 9.7 1.1 1.8 3.5 10.5 C 1.0 0.2 0.2 3.4 5.7 4.2 4.2 9.1 4.4 4.0 0.6 joy CA 0.9 - 0.3 3.3 4.5 4.8 2.7 10.6 4.2 3.9 0.6 EA 21.8 2.2 7.4 14.1 13.4 11.3 3.1 11.1 6.3 5.8 2.8 C - - 1.2 3.5 1.2 17.4 8.1 1.2 1.2 5.8 1.2 trust CA - - 3.0 6.1 3.0 3.0 9.1 - - 3.0 3.0 EA 22.0 6.3 8.4 14.2 8.3 14.4 5.9 5.5 4.9 3.8 5.8 C - - - 3.3 6.7 6.7 3.3 3.3 6.7 13.3 3.3 surprise CA - - - - 5.6 5.6 - 5.6 11.1 11.1 - EA 11.0 13.4 21.0 8.3 13.5 5.2 3.4 5.2 4.1 5.6 8.8 C - - - 5.3 5.3 - 5.3 5.3 - 15.8 5.3 anticipation CA - - - - - - - 10.0 - 10.0 10.0 EA 16.2 7.5 11.5 16.2 10.7 9.5 5.7 5.9 3.1 4.9 8.4 Table 7: The percentage of the 8 basic emotions associated with 11 B&K colors in CL EX vs. E MO L EX, e.g., sadness is associated with black by 36% of annotators in E MOLEX (EA ), 22.1% in CL EX (CA ) by US-based annotators only and 24% in CL EX (C) by all annotators; we report zero associations by “-”. 
(a) Joy - US: 331, I: 154 (b) Trust - US: 33, I: 47 (c) Surprise - US: 18, I: 12 (d) Anticipation - US: 10, I: 9 (e) Anger - US: 133, I: 160 (f) Sadness - US: 171, I: 142 (g) Fear - US: 95, I: 105 (h) Disgust - US: 54, I: 16 Figure 2: Apparent cross-cultural differences in color-emotion associations between US- and India- based annotators. 10.6% of US workers associated joy with pink, while 7.1% India-based workers associated joy with yellow (based on 331 joy associations from the US and from 154 India). 312 (a) Yellow (b) Brown (c) White Figure 3: Concept clusters of color-concept associations for ambiguous colors: yellow, white, brown. concept association lexicon, CL EX. This lexicon the way that colors are associated with concepts links concepts, color terms and emotions to spe- and emotions in languages other than English. cific RGB values. This lexicon may help to dis- ambiguate objects when modeling conversational Acknowledgments interactions in many domains. We have examined We are grateful to everyone in the NLP group the association between color terms and positive at Microsoft Research for helpful discussion and or negative emotions. feedback especially Chris Brocket, Piali Choud- Our work also investigated cross-cultural dif- hury, and Hassan Sajjad. We thank Natalia Rud ferences in color-emotion associations between from Tyumen State University, Center of Linguis- India- and US-based annotators. We identified tic Education for helpful comments and sugges- frequent color-concept associations, which sug- tions. gests that concepts associated with a particular color may express the same sentiment as the color. Our future work includes applying statistical References inference for discovering a hidden structure of Cecilia Ovesdotter Alm, Dan Roth, and Richard concept-emotion associations. Moreover, auto- Sproat. 2005. Emotions from text: machine matically identifying the strength of association learning for text-based emotion prediction. 
In between a particular concept and emotions is an- Proceedings of the conference on Human Lan- other task which is more difficult than just iden- guage Technology and Empirical Methods in Natu- tifying the polarity of the word. We are also in- ral Language Processing, HLT ’05, pages 579–586, terested in using a similar approach to investigate Stroudsburg, PA, USA. Association for Computa- tional Linguistics. CL EX∩AN CL EX∩EL CL EX∩GI Brent Berlin and Paul Kay. 1988. Basic Color Terms: their Universality and Evolution. Berkley: Univer- Noun 287 Noun 574 Noun 708 sity of California Press. Verb 4 Verb 13 Verb 17 M. Bortoli and J. Maroto. 2001. Translating colors in Adj 28 Adj 53 Adj 66 web site localisation. In In The Proceedings of Eu- Adv 1 Adv 2 Adv 3 ropean Languages and the Implementation of Com- 320 642 794 munication and Information Technologies (Elicit). AN\CL EX EL\CL EX GI\CL EX M. Bradley and P. Lang. 1999. Affective forms for 712 7,445 11,101 english words (anew): Instruction manual and af- CL EX\AN CL EX\EL CL EX\GI fective ranking. 1,637 1,315 1,163 Chris Callison-Burch. 2009. Fast, cheap, and creative: evaluating translation quality using amazon’s me- Table 8: An overlap of concepts by part-of- chanical turk. In EMNLP ’09: Proceedings of the speech tag between CL EX and existing lexicons. 2009 Conference on Empirical Methods in Natural Language Processing, pages 286–295, Stroudsburg, CL EX∩GI stands for the intersection of sets, PA, USA. Association for Computational Linguis- CL EX\GI denotes the difference of sets. tics. 313 David L. Chen and William B. Dolan. 2011. Building Aleksandra Mojsilovic. 2002. A method for color a persistent workforce on mechanical turk for mul- naming and description of color composition in im- tilingual data collection. In Proceedings of The 3rd ages. In Proc. IEEE Int. Conf. Image Processing, Human Computation Workshop (HCOMP 2011), pages 789–792. August. Andrew Ortony, Gerald L. Clore, and Allan Collins. Paul Ekman. 
1992. An argument for basic emotions. 1988. The Cognitive Structure of Emotions. Cam- Cognition & Emotion, 6(3):169–200. bridge University Press, July. Clark Davidson Elliott. 1992. The affective reasoner: Li-Chen Ou, M. Ronnier Luo, Pei-Li Sun, Neng- a process model of emotions in a multi-agent sys- Chung Hu, and Hung-Shing Chen. 2011. Age ef- tem. Ph.D. thesis, Evanston, IL, USA. UMI Order fects on colour emotion, preference, and harmony. No. GAX92-29901. Color Research and Application. R. Plutchik, 1980. A general psychoevolutionary the- J. Gage. 1993. Color and culture: Practice and mean- ory of emotion, pages 3–33. Academic press, New ing from antiquity to abstraction, univ. of calif. York. C. Hardin and L. Maffi. 1997. Color Categories in Justus J. Randolph. 2005. Author note: Free-marginal Thought and Language. multirater kappa: An alternative to fleiss fixed- N. Jacobson and W. Bender. 1996. Color as a deter- marginal multirater kappa. mined communication. IBM Syst. J., 35:526–538, P. Sable and O. Akcay. 2010. Color: Cross cultural September. marketing perspectives as to what governs our re- N. Kaya. 2004. Relationship between color and emo- sponse to it. In In The Proceedings of ASSBS, vol- tion: a study of college students. College Student ume 17. Journal. Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Efthymios Kouloumpis, Theresa Wilson, and Johanna Andrew Y. Ng. 2008. Cheap and fast—but is it Moore. 2011. Twitter sentiment analysis: The good good?: evaluating non-expert annotations for natu- the bad and the OMG! In Proc. ICWSM. ral language tasks. In Proceedings of the Confer- ence on Empirical Methods in Natural Language George A. Miller. 1995. Wordnet: A lexical database Processing, EMNLP ’08, pages 254–263, Strouds- for english. Communications of the ACM, 38:39– burg, PA, USA. Association for Computational Lin- 41. guistics. Saif M. Mohammad and Peter D. Turney. 2010. Emo- Philip J. Stone, Dexter C. Dunphy, Marshall S. 
Smith, tions evoked by common words and phrases: using and Daniel M. Ogilvie. 1966. The General In- mechanical turk to create an emotion lexicon. In quirer: A Computer Approach to Content Analysis. Proceedings of the NAACL HLT 2010 Workshop on MIT Press. Computational Approaches to Analysis and Gener- Carlo Strapparava and Gozde Ozbal. 2010. The color ation of Emotion in Text, CAAGET ’10, pages 26– of emotions in text. COLING, pages 28–32. 34, Stroudsburg, PA, USA. Association for Compu- C. Strapparava and A. Valitutti. 2004. Wordnet-affect: tational Linguistics. an affective extension of wordnet. In In: Proceed- Saif Mohammad. 2011a. Colourful language: Mea- ings of the 4th International Conference on Lan- suring word-colour associations. In Proceedings guage Resources and Evaluation (LREC 2004), Lis- of the 2nd Workshop on Cognitive Modeling and bon, pages 1083–1086. Computational Linguistics, pages 97–106, Port- land, Oregon, USA, June. Association for Compu- tational Linguistics. Saif Mohammad. 2011b. From once upon a time to happily ever after: Tracking emotions in novels and fairy tales. In Proceedings of the 5th ACL- HLT Workshop on Language Technology for Cul- tural Heritage, Social Sciences, and Humanities, pages 105–114, Portland, OR, USA, June. Associa- tion for Computational Linguistics. Saif M. Mohammad. 2011c. Even the abstract have colour: consensus in word-colour associations. In Proceedings of the 49th Annual Meeting of the As- sociation for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT ’11, pages 368–373, Stroudsburg, PA, USA. Association for Computational Linguistics. 314 Extending the Entity-based Coherence Model with Multiple Ranks Vanessa Wei Feng Graeme Hirst Department of Computer Science Department of Computer Science University of Toronto University of Toronto Toronto, ON, M5S 3G4, Canada Toronto, ON, M5S 3G4, Canada

[email protected] [email protected]

Abstract

We extend the original entity-based coherence model (Barzilay and Lapata, 2008) by learning from more fine-grained coherence preferences in training data. We associate multiple ranks with the set of permutations originating from the same source document, as opposed to the original pairwise rankings. We also study the effect of the permutations used in training, and the effect of the coreference component used in entity extraction. With no additional manual annotations required, our extended model is able to outperform the original model on two tasks: sentence ordering and summary coherence rating.

1 Introduction

Coherence is important in a well-written document; it helps make the text semantically meaningful and interpretable. Automatic evaluation of coherence is an essential component of various natural language applications. Therefore, the study of coherence models has recently become an active research area. A particularly popular coherence model is the entity-based local coherence model of Barzilay and Lapata (B&L) (2005; 2008). This model represents local coherence by transitions, from one sentence to the next, in the grammatical roles of references to entities. It learns a pairwise ranking preference between alternative renderings of a document based on the probability distribution of those transitions. In particular, B&L associated a lower rank with automatically created permutations of a source document, and learned a model to discriminate an original text from its permutations (see Section 3.1 below). However, coherence is a matter of degree rather than a binary distinction, so a model based only on such pairwise rankings is insufficiently fine-grained and cannot capture the subtle differences in coherence between the permuted documents.

Since the first appearance of B&L's model, several extensions have been proposed (see Section 2.3 below), primarily focusing on modifying or enriching the original feature set by incorporating other document information. By contrast, we wish to refine the learning procedure in such a way that the resulting model will be able to evaluate coherence at a more fine-grained level. Specifically, we propose a concise extension to the standard entity-based coherence model by learning not only from the original document and its corresponding permutations but also from ranking preferences among the permutations themselves.

We show that this can be done by assigning a suitable objective score to each permutation indicating its dissimilarity from the original document. We call this a multiple-rank model, since we train our model on a multiple-rank basis rather than taking the original pairwise ranking approach. This extension can also be easily combined with other extensions by incorporating their enriched feature sets. We show that our multiple-rank model outperforms B&L's basic model on two tasks, sentence ordering and summary coherence rating, evaluated on the same datasets as in Barzilay and Lapata (2008).

In sentence ordering, we experiment with different approaches to assigning dissimilarity scores and ranks (Section 5.1.1). We also experiment with different entity extraction approaches (Section 5.1.2) and different distributions of permutations used in training (Section 5.1.3). We show that these two aspects are crucial, depending on the characteristics of the dataset.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 315-324, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

         Manila   Miles   Island   Quake   Baco
    1      −        −       X        X       −
    2      S        −       O        −       −
    3      X        X       X        X       X

Table 1: A fragment of an entity grid for five entities across three sentences.
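To make the representation concrete, here is a minimal sketch of how an entity grid like the one in Table 1 can be encoded as transition-probability features (the encoding described in Section 2.1 below). The dictionary layout and function name are our own illustration, not B&L's code; only bigram transitions (k = 2) are shown.

```python
from itertools import product

# Entity grid from Table 1: one column (list of per-sentence roles) per entity.
grid = {
    "Manila": ["-", "S", "X"],
    "Miles":  ["-", "-", "X"],
    "Island": ["X", "O", "X"],
    "Quake":  ["X", "-", "X"],
    "Baco":   ["-", "-", "X"],
}

def transition_features(grid, k=2):
    """Probability of each role transition of length 2..k over all columns."""
    roles = "SOX-"
    counts = {t: 0 for n in range(2, k + 1) for t in product(roles, repeat=n)}
    totals = {n: 0 for n in range(2, k + 1)}
    for column in grid.values():
        for n in range(2, k + 1):
            for i in range(len(column) - n + 1):
                counts[tuple(column[i:i + n])] += 1
                totals[n] += 1
    # Normalize each transition count by the total number of transitions
    # of the same length, as in the Phi(d) feature vector.
    return {t: c / totals[len(t)] for t, c in counts.items()}

features = transition_features(grid, k=2)
print(features[("S", "X")])  # "Manila" contributes the only {S, X} bigram: 1/10 = 0.1
```

With three sentences and five entities there are ten bigram transitions in total, so each occurrence contributes probability mass 0.1.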
2 Entity-based Coherence Model

2.1 Document Representation

The original entity-based coherence model is based on the assumption that a document makes repeated reference to elements of a set of entities that are central to its topic. For a document d, an entity grid is constructed, in which the columns represent the entities referred to in d, and the rows represent the sentences. Each cell corresponds to the grammatical role of an entity in the corresponding sentence: subject (S), object (O), neither (X), or nothing (−). An example fragment of an entity grid is shown in Table 1; it shows the representation of three sentences from a text on a Philippine earthquake.

B&L define a local transition as a sequence {S, O, X, −}^n, representing the occurrence and grammatical roles of an entity in n adjacent sentences. Such transition sequences can be extracted from the entity grid as continuous subsequences in each column. For example, the entity "Manila" in Table 1 has a bigram transition {S, X} from sentence 2 to 3. The entity grid is then encoded as a feature vector Φ(d) = (p_1(d), p_2(d), ..., p_m(d)), where p_t(d) is the probability of the transition t in the entity grid, and m is the number of transitions with length no more than a predefined optimal transition length k. p_t(d) is computed as the number of occurrences of t in the entity grid of document d, divided by the total number of transitions of the same length in the entity grid.

For entity extraction, Barzilay and Lapata (2008) had two conditions: Coreference+ and Coreference−. In Coreference+, entity coreference relations in the document were resolved by an automatic coreference resolution tool (Ng and Cardie, 2002), whereas in Coreference−, nouns are simply clustered by string matching.

2.2 Evaluation Tasks

The two evaluation tasks for Barzilay and Lapata (2008)'s entity-based model are sentence ordering and summary coherence rating.

In sentence ordering, a set of random permutations is created for each source document, and the learning procedure is conducted on this synthetic mixture of coherent and incoherent documents. Barzilay and Lapata (2008) experimented on two datasets: news articles on the topic of earthquakes (Earthquakes) and narratives on the topic of aviation accidents (Accidents). A training data instance is constructed as a pair consisting of a source document and one of its random permutations, and the permuted document is always considered to be less coherent than the source document. The entity transition features are then used to train a support vector machine ranker (Joachims, 2002) to rank the source documents higher than the permutations. The model is tested on a different set of source documents and their permutations, and the performance is evaluated as the fraction of correct pairwise rankings in the test set.

In summary coherence rating, a similar experimental framework is adopted. However, in this task, rather than training and evaluating on a set of synthetic data, system-generated summaries and human-composed reference summaries from the Document Understanding Conference (DUC 2003) were used. Human annotators were asked to give a coherence score on a seven-point scale for each item. The pairwise ranking preferences between summaries generated from the same input document cluster (excluding the pairs consisting of two human-written summaries) are used by a support vector machine ranker to learn a discriminant function to rank each pair according to their coherence scores.

2.3 Extended Models

Filippova and Strube (2007) applied Barzilay and Lapata's model to a German corpus of newspaper articles with manual syntactic, morphological, and NP coreference annotations provided. They further clustered entities by semantic relatedness as computed by the WikiRelate! API (Strube and Ponzetto, 2006). Though the improvement was not significant, interestingly, a short subsection in their paper described their approach to extending pairwise rankings to longer rankings, by supplying the learner with rankings of all renderings as computed by Kendall's τ, which is one of the extensions considered in this paper. Although Filippova and Strube simply discarded this idea because it hurt accuracies when tested on their data, we found it a promising direction for further exploration. Cheung and Penn (2010) adapted the standard entity-based coherence model to the same German corpus, but replaced the original linguistic dimension used by Barzilay and Lapata (2008), grammatical role, with topological field information, and showed that for German text, such a modification improves accuracy.

For English text, two extensions have been proposed recently. Elsner and Charniak (2011) augmented the original features used in the standard entity-based coherence model with a large number of entity-specific features, and their extension significantly outperformed the standard model on two tasks: document discrimination (another name for sentence ordering) and sentence insertion. Lin et al. (2011) adapted the entity grid representation in the standard model into a discourse role matrix, in which additional discourse information about the document was encoded. Their extended model significantly improved ranking accuracies on the same two datasets used by Barzilay and Lapata (2008) as well as on the Wall Street Journal corpus.

However, while enriching or modifying the original features used in the standard model is certainly a direction for refinement of the model, it usually requires more training data or a more sophisticated feature representation. In this paper, we instead modify the learning approach and propose a concise and highly adaptive extension that can be easily combined with other extended features or applied to different languages.

3 Experimental Design

Following Barzilay and Lapata (2008), we wish to train a discriminative model to give the correct ranking preference between two documents in terms of their degree of coherence. We experiment on the same two tasks as in their work: sentence ordering and summary coherence rating.

3.1 Sentence Ordering

In the standard entity-based model, a discriminative system is trained on the pairwise rankings between source documents and their permutations (see Section 2.2). However, a model learned from these pairwise rankings is not sufficiently fine-grained, since the subtle differences between the permutations are not learned. Our major contribution is to further differentiate among the permutations generated from the same source document, rather than simply treating them all as being of the same degree of coherence.

Our fundamental assumption is that there exists a canonical ordering for the sentences of a document; therefore we can approximate the degree of coherence of a document by the similarity between its actual sentence ordering and that canonical sentence ordering. Practically, we automatically assign an objective score to each permutation to estimate its dissimilarity from the source document (see Section 4). By learning from all the pairs across a source document and its permutations, the effective size of the training data is increased while no further manual annotation is required, which is favorable in real applications, where available samples with manually annotated coherence scores are usually limited. For r source documents, each with m random permutations, the number of training instances in the standard entity-based model is r × m, while in our multiple-rank model learning process it is r × C(m+1, 2) = r × m(m+1)/2 ≈ (1/2) r × m², which is greater than r × m when m > 2.

3.2 Summary Coherence Rating

Compared to the standard entity-based coherence model, our major contribution in this task is to show that by automatically assigning an objective score to each machine-generated summary to estimate its dissimilarity from the human-generated summary from the same input document cluster, we are able to achieve performance competitive with, or even superior to, that of B&L's model, without knowing the true coherence scores given by human judges.

Evaluating our multiple-rank model on this task is crucial, since in summary coherence rating, the coherence violations that the reader might encounter in real machine-generated texts can be more precisely approximated, whereas the sentence ordering task is only partially capable of doing so.
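The growth in effective training-set size claimed in Section 3.1 can be checked with a quick calculation; the values of r and m below are arbitrary examples, not figures from the paper.

```python
from math import comb

r, m = 100, 20  # e.g., 100 source documents, 20 permutations each

standard = r * m                    # source-vs-permutation pairs only
multiple_rank = r * comb(m + 1, 2)  # all pairs among {source} + m permutations

print(standard, multiple_rank)  # 2000 21000
assert multiple_rank > standard  # holds whenever m > 2
```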
4 Dissimilarity Metrics

As mentioned previously, the subtle differences among the permutations of the same source document can be used to refine the model learning process. Considering an original document d and one of its permutations, we call σ = (1, 2, ..., N) the reference ordering, which is the sentence ordering in d, and π = (o_1, o_2, ..., o_N) the test ordering, which is the sentence ordering in that permutation, where N is the number of sentences being rendered in both documents.

In order to approximate different degrees of coherence among the set of permutations which bear the same content, we need a suitable metric to quantify the dissimilarity between the test ordering π and the reference ordering σ. Such a metric needs to satisfy the following criteria: (1) It can be automatically computed while being highly correlated with human judgments of coherence, since additional manual annotation is certainly undesirable. (2) It depends on the particular sentence ordering in a permutation while remaining independent of the entities within the sentences; otherwise our multiple-rank model might be trained to fit particular probability distributions of entity transitions rather than true coherence preferences.

In our work we use three different metrics: Kendall's τ distance, average continuity, and edit distance.

Kendall's τ distance: This metric has been widely used in the evaluation of sentence ordering (Lapata, 2003; Lapata, 2006; Bollegala et al., 2006; Madnani et al., 2007).[1] It measures the disagreement between two orderings σ and π in terms of the number of inversions of adjacent sentences necessary to convert one ordering into the other. Kendall's τ distance is defined as

    τ = 2m / (N(N − 1)),

where m is the number of sentence inversions necessary to convert σ to π.

[1] Filippova and Strube (2007) found that their performance dropped when using this metric for longer rankings; but they were using data in a different language and with manual annotations, so its effect on our datasets is worth trying nonetheless.

Average continuity (AC): Following Zhang (2011), we use average continuity as the second dissimilarity metric. It was first proposed by Bollegala et al. (2006). This metric estimates the quality of a particular sentence ordering by the number of correctly arranged continuous sentences, compared to the reference ordering. For example, if π = (..., 3, 4, 5, 7, ..., o_N), then {3, 4, 5} is considered continuous while {3, 4, 5, 7} is not. Average continuity is calculated as

    AC = exp( (1 / (n − 1)) × Σ_{i=2}^{n} log(P_i + α) ),

where n = min(4, N) is the maximum number of continuous sentences to be considered, and α = 0.01. P_i is the proportion of continuous sentences of length i in π that are also continuous in the reference ordering σ. To represent the dissimilarity between the two orderings π and σ, we use its complement AC′ = 1 − AC, such that the larger AC′ is, the more dissimilar the two orderings are.[2]

[2] We will refer to AC′ as AC from now on.

Edit distance (ED): Edit distance is a commonly used metric in information theory to measure the difference between two sequences. Given a test ordering π, its edit distance is defined as the minimum number of edits (i.e., insertions, deletions, and substitutions) needed to transform it into the reference ordering σ. For permutations, the edits are essentially movements, which can be considered as equal numbers of insertions and deletions.

5 Experiments

5.1 Sentence Ordering

Our first set of experiments is on sentence ordering. Following Barzilay and Lapata (2008), we use all transitions of length ≤ 3 for feature extraction. In addition, we explore three specific aspects in our experiments: rank assignment, entity extraction, and permutation generation.

5.1.1 Rank Assignment

In our multiple-rank model, pairwise rankings between a source document and its permutations are extended into a longer ranking with multiple ranks. We assign a rank to a particular permutation based on the result of applying a chosen dissimilarity metric from Section 4 (τ, AC, or ED) to the sentence ordering in that permutation.

We experiment with two different approaches to assigning ranks to permutations, while each source document is always assigned a zero (the highest) rank.
We are rectly by their dissimilarity scores to form a full not aware of how B&L’s permutations were gen- ranking for the set of permutations generated from erated, but we assume they are generated in a per- the same source document. fectly random fashion. Since a full ranking might be too sensitive to However, in reality, the probabilities of seeing noise in training, we also experiment with the documents with different degrees of coherence are stratified option, in which C ranks are assigned to not equal. For example, in an essay scoring task, the permutations generated from the same source if the target group is (near-) native speakers with document. The permutation with the smallest dis- sufficient education, we should expect their essays similarity score is assigned the same (zero, the to be less incoherent — most of the essays will highest) rank as the source document, and the one be coherent in most parts, with only a few minor with the largest score is assigned the lowest (C−1) problems regarding discourse coherence. In such rank; then ranks of other permutations are uni- a setting, the performance of a model trained from formly distributed in this range according to their permutations generated from a uniform distribu- raw dissimilarity scores. We experiment with 3 tion may suffer some accuracy loss. to 6 ranks (the case where C = 2 reduces to the Therefore, in addition to the set of permutations standard entity-based model). used by Barzilay and Lapata (2008) (PS BL ), we create another set of permutations for each source document (PS M ) by assigning most of the proba- 5.1.2 Entity Extraction bility mass to permutations which are mostly sim- ilar to the original source document. 
Besides its Barzilay and Lapata (2008)’s best results were capability of better approximating real-life situ- achieved by employing an automatic coreference ations, training our model on permutations gen- resolution tool (Ng and Cardie, 2002) for ex- erated in this way has another benefit: in the tracting entities from a source document, and the standard entity-based model, all permuted doc- permutations were generated only afterwards — uments are treated as incoherent; thus there are entity extraction from a permuted document de- many more incoherent training instances than co- pends on knowing the correct sentence order and herent ones (typically the proportion is 20:1). In the oracular entity information from the source contrast, in our multiple-rank model, permuted document — since resolving coreference relations documents are assigned different ranks to fur- in permuted documents is too unreliable for an au- ther differentiate the different degrees of coher- tomatic tool. ence within them. By doing so, our model will We implement our multiple-rank model with be able to learn the characteristics of a coherent full coreference resolution using Ng and Cardie’s document from those near-coherent documents as coreference resolution system, and entity extrac- well, and therefore the problem of lacking coher- tion approach as described above — the Coref- ent instances can be mitigated. erence+ condition. However, as argued by El- Our permutation generation algorithm is shown sner and Charniak (2011), to better simulate in Algorithm 1, where α = 0.05, β = 5.0, the real situations that human readers might en- MAX NUM = 50, and K and K 0 are two normal- counter in machine-generated documents, such ization factors to make p(swap num) and p(i, j) oracular information should not be taken into ac- proper probability distributions. For each source count. 
Therefore we also employ two alterna- document, we create the same number of permu- tive approaches for entity extraction: (1) use the tations as PS BL . same automatic coreference resolution tool on permuted documents — we call it the Corefer- 5.2 Summary Coherence Rating ence± condition; (2) use no coreference reso- In the summary coherence rating task, we are lution, i.e., group head noun clusters by simple dealing with a mixture of multi-document sum- string matching — B&L’s Coreference− condi- maries generated by systems and written by hu- tion. mans. Barzilay and Lapata (2008) did not assume 319 Algorithm 1 Permutation Generation. with the optimal transition length set to ≤ 2. Input: S 1 , S 2 , . . . , S N ; σ = (1, 2, . . . , N) Choose a number of sentence swaps 6 Results swap num with probability e−α×swap num /K 6.1 Sentence Ordering for i = 1 → swap num do Swap a pair of sentence (S i , S j ) In this task, we use the same two sets of source with probability p(i, j) = e−β×|i− j| /K 0 documents (Earthquakes and Accidents, see Sec- end for tion 3.1) as Barzilay and Lapata (2008). Each Output: π = (o1 , o2 , . . . , oN ) contains 200 source documents, equally divided between training and test sets, with up to 20 per- mutations per document. We conduct experi- a simple binary distinction among the summaries ments on these two domains separately. For each generated from the same input document clus- domain, we accompany each source document ter; rather, they had human judges give scores for with two different sets of permutations: the one each summary based on its degree of coherence used by B&L (PS BL ), and the one generated from (see Section 3.2). Therefore, it seems that the our model described in Section 5.1.3 (PS M ). We subtle differences among incoherent documents train our multiple-rank model and B&L’s standard (system-generated summaries in this case) have two-rank model on each set of permutations using already been learned by their model. 
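The stratified option of Section 5.1.1 can be sketched as a simple binning of raw dissimilarity scores. Treating "uniformly distributed in this range" as equal-width bins between the smallest and largest scores is our reading of the description, not a detail the paper specifies.

```python
def stratified_ranks(scores, C):
    """Map raw dissimilarity scores to C ranks: the permutation with the
    smallest score shares rank 0 with the source document, the one with
    the largest score gets rank C-1, and the rest are binned by score."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0] * len(scores)
    return [min(int((s - lo) / (hi - lo) * C), C - 1) for s in scores]

scores = [0.05, 0.20, 0.45, 0.80, 0.99]  # e.g., edit-distance-based scores
print(stratified_ranks(scores, 4))       # [0, 0, 1, 3, 3]
```

With C = 2 this collapses to the standard two-rank setup, matching the reduction noted in the text.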
the SVM rank package (Joachims, 2006), and eval- uate both systems on their test sets. Accuracy is But we wish to see if we can replace hu- measured as the fraction of correct pairwise rank- man judgments by our computed dissimilarity ings for the test set. scores so that the original supervised learning is converted into unsupervised learning and yet re- 6.1.1 Full Coreference Resolution with tain competitive performance. However, given Oracular Information a summary, computing its dissimilarity score is In this experiment, we implement B&L’s fully- a bit involved, due to the fact that we do not fledged standard entity-based coherence model, know its correct sentence order. To tackle this and extract entities from permuted documents us- problem, we employ a simple sentence align- ing oracular information from the source docu- ment between a system-generated summary and ments (see Section 5.1.2). a human-written summary originating from the Results are shown in Table 2. For each test sit- same input document cluster. Given a system- uation, we list the best accuracy (in Acc columns) generated summary D s = (S s1 , S s2 , . . . , S sn ) and for each chosen dissimilarity metric, with the cor- its corresponding human-written summary Dh = responding rank assignment approach. C repre- (S h1 , S h2 , . . . , S hN ) (here it is possible that n , sents the number of ranks used in stratifying raw N), we treat the sentence ordering (1, 2, . . . , N) scores (“N” if using raw configuration, see Sec- in Dh as σ (the original sentence ordering), and tion 5.1.1 for details). Baselines are accuracies compute π = (o1 , o2 , . . . , on ) based on D s . To trained using the standard entity-based coherence compute each oi in π, we find the most similar model3 . 
sentence S h j , j ∈ [1, N] in Dh by computing their Our model outperforms the standard entity- cosine similarity over all tokens in S h j and S si ; based model on both permutation sets for both if all sentences in Dh have zero cosine similarity datasets. The improvement is not significant with S si , we assign −1 to oi . when trained on the permutation set PS BL , and Once π is known, we can compute its “dissimi- is achieved only with one of the three metrics; larity” from σ using a chosen metric. But because 3 There are discrepancies between our reported accuracies now π is not guaranteed to be a permutation of σ and those of Barzilay and Lapata (2008). The differences are (there may be repetition or missing values, i.e., due to the fact that we use a different parser: the Stanford de- −1, in π), Kendall’s τ cannot be used, and we use pendency parser (de Marneffe et al., 2006), and might have only average continuity and edit distance as dis- extracted entities in a slightly different way than theirs, al- though we keep other experimental configurations as close similarity metrics in this experiment. as possible to theirs. But when comparing our model with The remaining experimental configuration is theirs, we always use the exact same set of features, so the the same as that of Barzilay and Lapata (2008), absolute accuracies do not matter. 
    Condition: Coreference+
                       Earthquakes        Accidents
    Perms    Metric     C     Acc         C     Acc
    PS_BL    τ          3     79.5        3     82.0
             AC         4     85.2        3     83.3
             ED         3     86.8        6     82.2
             Baseline         85.3              83.2
    PS_M     τ          3     86.8        3     85.2*
             AC         3     85.6        1     85.4*
             ED         N     87.9*       4     86.3*
             Baseline         85.3              81.7

Table 2: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in sentence ordering using the Coreference+ option. Accuracies which are significantly better than the baseline (p < .05) are indicated by *.

    Condition: Coreference±
                       Earthquakes        Accidents
    Perms    Metric     C     Acc         C     Acc
    PS_BL    τ          3     71.0        3     73.3
             AC         3     *76.8       3     74.5
             ED         4     *77.4       6     74.4
             Baseline         71.7              73.8
    PS_M     τ          3     55.9        3     51.5
             AC         4     53.9        6     49.0
             ED         4     53.9        5     52.3
             Baseline         49.2              53.2

Table 3: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in sentence ordering using the Coreference± option. Accuracies which are significantly better than the baseline (p < .05) are indicated by *.

[3] There are discrepancies between our reported accuracies and those of Barzilay and Lapata (2008). The differences are due to the fact that we use a different parser, the Stanford dependency parser (de Marneffe et al., 2006), and might have extracted entities in a slightly different way than theirs, although we keep other experimental configurations as close as possible to theirs. But when comparing our model with theirs, we always use the exact same set of features, so the absolute accuracies do not matter.

Our model outperforms the standard entity-based model on both permutation sets for both datasets. The improvement is not significant when trained on the permutation set PS_BL, and is achieved only with one of the three metrics; but when trained on PS_M (the set of permutations generated by our model), our model's performance significantly exceeds B&L's[4] for all three metrics, especially as their model's performance drops for dataset Accidents.

[4] Following Elsner and Charniak (2011), we use the Wilcoxon sign-rank test for significance.

From these results, we see that in the ideal situation, where we extract entities and resolve their coreference relations based on the oracular information from the source document, our model is effective in terms of improving ranking accuracies, especially when trained on our more realistic permutation sets PS_M.

6.1.2 Full Coreference Resolution without Oracular Information

In this experiment, we apply the same automatic coreference resolution tool (Ng and Cardie, 2002) to not only the source documents but also their permutations. We want to see how removing the oracular component in the original model affects the performance of our multiple-rank model and the standard model. Results are shown in Table 3.

First, we can see that when trained on PS_M, running full coreference resolution significantly hurts performance for both models. This suggests that, in real-life applications, where the distribution of training instances with different degrees of coherence is skewed (as in the set of permutations generated by our biased model), running full coreference resolution is not a good option, since it makes the accuracies almost no better than random guessing (50%).

Moreover, considering training on PS_BL, running full coreference resolution has a different influence on the two datasets. For Earthquakes, our model significantly outperforms B&L's, while the improvement is insignificant for Accidents. This is most probably due to the different ways that entities are realized in these two datasets. As analyzed by Barzilay and Lapata (2008), in dataset Earthquakes entities tend to be referred to by pronouns in subsequent mentions, while in dataset Accidents literal string repetition is more common.

Given a balanced permutation distribution, as we assumed in PS_BL, switching distant sentence pairs in Accidents may result in an entity distribution very similar to that produced by switching closer sentence pairs, as recognized by the automatic tool. Therefore, compared to Earthquakes, our multiple-rank model may be less powerful in indicating the dissimilarity between the sentence orderings in a permutation and its source document, and so can improve on the baseline only by a small margin.

6.1.3 No Coreference Resolution

In this experiment, we do not employ any coreference resolution tool, and simply cluster head nouns by string matching.
Results are shown in Table 4.

    Condition: Coreference−
                       Earthquakes        Accidents
    Perms    Metric     C     Acc         C     Acc
    PS_BL    τ          4     82.8        N     82.0
             AC         3     78.0        3     **84.2
             ED         N     78.2        3     *82.7
             Baseline         83.7              80.1
    PS_M     τ          3     **86.4      N     **85.7
             AC         4     *84.4       N     **86.6
             ED         5     **86.7      N     **84.6
             Baseline         82.6              77.5

Table 4: Accuracies (%) of extending the standard entity-based coherence model with multiple-rank learning in sentence ordering using the Coreference− option. Accuracies which are significantly better than the baseline are indicated by * (p < .05) and ** (p < .01).

[Figure 1: Effect of C on testing accuracies in selected sentence ordering experimental configurations. The plot shows accuracy (%), ranging from 68.0 to 88.0, against C ∈ {3, 4, 5, 6, N} for five configurations: Earthquakes ED Coref+, Earthquakes ED Coref±, Accidents ED Coref+, Accidents ED Coref±, and Accidents τ Coref−.]

Even with such a coarse approximation of coreference resolution, our model is able to achieve around 85% accuracy in most test cases, except that for dataset Earthquakes, training on PS_BL gives poorer performance than the standard model by a small margin. But such inferior performance should be expected because, as explained above, coreference resolution is crucial to this dataset, since entities tend to be realized through pronouns; simple string matching introduces too much noise into training, especially when our model aims to train a more fine-grained discriminative system than B&L's. However, we can see from the results of training on PS_M that if the permutations used in training do not involve swapping sentences which are too far apart, the resulting noise is reduced, and our model outperforms theirs. And for dataset Accidents, our model consistently outperforms the baseline model by a large margin (significant at p < .01).

6.1.4 Conclusions for Sentence Ordering

Considering the particular dissimilarity metric used in training, we find that edit distance usually stands out from the other two metrics. Kendall's τ distance proves to be a fairly weak metric, which is consistent with the findings of Filippova and Strube (2007) (see Section 2.3). Figure 1 plots the testing accuracies as a function of different choices of C for the configurations where our model outperforms the baseline model. In each configuration, we choose the dissimilarity metric which achieves the best accuracy reported in Tables 2 to 4 and the PS_BL permutation set. We can see that the dependency of accuracies on the particular choice of C is not consistent across all experimental configurations, which suggests that this free parameter C needs careful tuning in different experimental setups.

Combining our multiple-rank model with simple string matching for entity extraction is a robust option for coherence evaluation, regardless of the particular distribution of permutations used in training, and it significantly outperforms the baseline in most conditions.

6.2 Summary Coherence Rating

As explained in Section 3.2, we employ a simple sentence alignment between a system-generated summary and its corresponding human-written summary to construct a test ordering π and calculate its dissimilarity from the reference ordering σ given by the human-written summary. In this way, we convert B&L's supervised learning model into a fully unsupervised model, since human annotations for coherence scores are not required.

We use the same dataset as Barzilay and Lapata (2008), which includes multi-document summaries from 16 input document clusters generated by five systems, along with reference summaries composed by humans.

In this experiment, we consider only average continuity (AC) and edit distance (ED) as dissimilarity metrics, with the raw configuration for rank assignment, and compare our multiple-rank model with the standard entity-based model using either full coreference resolution[5] or no resolution for entity extraction.

[5] We run the coreference resolution tool on all documents.
Figure 1 plots ther full coreference resolution5 or no resolution the testing accuracies as a function of different 5 We run the coreference resolution tool on all documents. 322 Entities Metric Same Full vs. 72.3% on full test. When our model performs poorer than the AC 82.5 *72.6 baseline (using Coreference− configuration), the Coreference+ ED 81.3 **73.0 difference is not significant, which suggests that Baseline 78.8 70.9 our multiple-rank model with unsupervised score AC 76.3 72.0 assignment via simple cosine matching can re- Coreference− ED 78.8 71.7 main competitive with the standard model, which requires human annotations to obtain a more fine- Baseline 80.0 72.3 grained coherence spectrum. This observation is consistent with Banko and Vanderwende (2004)’s Table 5: Accuracies (%) of extending the stan- discovery that human-generated summaries look dard entity-based coherence model with multiple-rank learning in summary rating. Baselines are results of quite extractive. standard entity-based coherence model. Accuracies which are significantly better than the corresponding 7 Conclusions baseline are indicated by * (p < .05) and ** (p < .01). In this paper, we have extended the popular co- herence model of Barzilay and Lapata (2008) by for entity extraction. We train both models on adopting a multiple-rank learning approach. This the ranking preferences (144 in all) among sum- is inherently different from other extensions to maries originating from the same input document this model, in which the focus is on enriching cluster using the SVM rank package (Joachims, the set of features for entity-grid construction, 2006), and test on two different test sets: same- whereas we simply keep their original feature set cluster test and full test. Same-cluster test is the intact, and manipulate only their learning method- one used by Barzilay and Lapata (2008), in which ology. 
We show that this concise extension is only the pairwise rankings (80 in all) between effective and able to outperform B&L’s standard summaries originating from the same input doc- model in various experimental setups, especially ument cluster are tested; we also experiment with when experimental configurations are most suit- full test, in which pairwise rankings (1520 in all) able considering certain dataset properties (see between all summary pairs excluding two human- discussion in Section 6.1.4). written summaries are tested. We experimented with two tasks: sentence or- dering and summary coherence rating, following Results are shown in Table 5. Coreference+ B&L’s original framework. In sentence ordering, and Coreference− denote the configuration of we also explored the influence of removing the using full coreference resolution or no resolu- oracular component in their original model and tion separately. First, clearly for both models, dealing with permutations generated from differ- performance on full test is inferior to that on ent distributions, showing that our model is robust same-cluster test, but our model is still able to for different experimental situations. In summary achieve performance competitive with the stan- coherence rating, we further extended their model dard model, even if our fundamental assumption such that their original supervised learning is con- about the existence of canonical sentence order- verted into unsupervised learning with competi- ing in documents with same content may break tive or even superior performance. down on those test pairs not originating from the Our multiple-rank learning model can be easily same input document cluster. Secondly, for the adapted into other extended entity-based coher- baseline model, using the Coreference− configu- ence models with their enriched feature sets, and ration yields better accuracy in this task (80.0% further improvement in ranking accuracies should vs. 
78.8% on same-cluster test, and 72.3% vs. be expected. 70.9% on full test), which is consistent with the findings of Barzilay and Lapata (2008). But our Acknowledgments multiple-rank model seems to favor the Corefer- ence+ configuration, and our best accuracy even This work was financially supported by the Nat- exceeds B&L’s best when tested on the same set: ural Sciences and Engineering Research Council 82.5% vs. 80.0% on same-cluster test, and 73.0% of Canada and by the University of Toronto. 323 References Mirella Lapata. 2006. Automatic evaluation of in- formation ordering: Kendall’s tau. Computational Michele Banko and Lucy Vanderwende. 2004. Us- Linguistics, 32(4):471–484. ing n-grams to understand the nature of summaries. Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. In Proceedings of Human Language Technologies Automatically evaluating text coherence using dis- and North American Association for Computational course relations. In Proceedings of the 49th Annual Linguistics 2004: Short Papers, pages 1–4. Meeting of the Association for Computational Lin- Regina Barzilay and Mirella Lapata. 2005. Modeling guistics (ACL 2011), pages 997–1006. local coherence: An entity-based approach. In Pro- Nitin Madnani, Rebecca Passonneau, Necip Fazil ceedings of the 42rd Annual Meeting of the Asso- Ayan, John M. Conroy, Bonnie J. Dorr, Ju- ciation for Computational Linguistics (ACL 2005), dith L. Klavans, Dianne P. O’Leary, and Judith D. pages 141–148. Schlesinger. 2007. Measuring variability in sen- Regina Barzilay and Mirella Lapata. 2008. Modeling tence ordering for news summarization. In Pro- local coherence: an entity-based approach. Compu- ceedings of the Eleventh European Workshop on tational Linguistics, 34(1):1–34. Natural Language Generation (ENLG 2007), pages Danushka Bollegala, Naoaki Okazaki, and Mitsuru 81–88. Ishizuka. 2006. A bottom-up approach to sen- Vincent Ng and Claire Cardie. 2002. Improving ma- tence ordering for multi-document summarization. 
chine learning approaches to coreference resolution. In Proceedings of the 21st International Confer- In Proceedings of the 40th Annual Meeting on Asso- ence on Computational Linguistics and 44th Annual ciation for Computational Linguistics (ACL 2002), Meeting of the Association for Computational Lin- pages 104–111. guistics, pages 385–392. Michael Strube and Simone Paolo Ponzetto. 2006. Jackie Chi Kit Cheung and Gerald Penn. 2010. Entity- Wikirelate! Computing semantic relatedness using based local coherence modelling using topological Wikipedia. In Proceedings of the 21st National fields. In Proceedings of the 48th Annual Meet- Conference on Artificial Intelligence, pages 1219– ing of the Association for Computational Linguis- 1224. tics (ACL 2010), pages 186–195. Renxian Zhang. 2011. Sentence ordering driven by Marie-Catherine de Marneffe, Bill MacCartney, and local and global coherence for summary generation. Christopher D. Manning. 2006. Generating typed In Proceedings of the ACL 2011 Student Session, dependency parses from phrase structure parses. In pages 6–11. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Micha Elsner and Eugene Charniak. 2011. Extending the entity grid with entity-specific features. In Pro- ceedings of the 49th Annual Meeting of the Asso- ciation for Computational Linguistics (ACL 2011), pages 125–129. Katja Filippova and Michael Strube. 2007. Extend- ing the entity-grid coherence model to semantically related entities. In Proceedings of the Eleventh Eu- ropean Workshop on Natural Language Generation (ENLG 2007), pages 139–142. Thorsten Joachims. 2002. Optimizing search en- gines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), pages 133–142. Thorsten Joachims. 2006. Training linear SVMs in linear time. 
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), pages 217–226.
Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 545–552.

Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction

Michael Wiegand and Dietrich Klakow
Spoken Language Systems, Saarland University, D-66123 Saarbrücken, Germany
{Michael.Wiegand|Dietrich.Klakow}@lsv.uni-saarland.de

Abstract

In this paper, we compare three different generalization methods for in-domain and cross-domain opinion holder extraction: simple unsupervised word clustering, an induction method inspired by distant supervision, and the usage of lexical resources. The generalization methods are incorporated into diverse classifiers. We show that generalization causes significant improvements and that the impact of improvement depends on the type of classifier and on how much training and test data differ from each other. We also address the less common case of opinion holders being realized in patient position and suggest approaches, including a novel (linguistically informed) extraction method, to detect those opinion holders without labeled training data, as standard datasets contain too few instances of this type.

1 Introduction

Opinion holder extraction is one of the most important subtasks in sentiment analysis. The extraction of sources of opinions is an essential component for complex real-life applications, such as opinion question answering systems or opinion summarization systems (Stoyanov and Cardie, 2011). Common approaches designed to extract opinion holders are based on data-driven methods, in particular supervised learning.

In this paper, we examine the role of generalization for opinion holder extraction in both in-domain and cross-domain classification. Generalization may not only help to compensate for the limited availability of labeled training data but also conciliate domain mismatches. In order to illustrate this, compare for instance (1) and (2).

(1) Malaysia did not agree to such treatment of Al-Qaeda soldiers as they were prisoners-of-war and should be accorded treatment as provided for under the Geneva Convention.
(2) Japan wishes to build a $21 billion per year aerospace industry centered on commercial satellite development.

Though both sentences contain an opinion holder, the lexical items vary considerably. However, if the two sentences are compared on the basis of some higher-level patterns, some similarities become obvious. In both cases the opinion holder is an entity denoting a person, and this entity is an agent1 of some predictive predicate (i.e. agree in (1) and wishes in (2)); more specifically, an expression that indicates that the agent utters a subjective statement. Generalization methods ideally capture these patterns; for instance, they may provide a domain-independent lexicon for those predicates. In some cases, even higher-order features, such as certain syntactic constructions, may vary throughout the different domains. In (1) and (2), the opinion holders are agents of a predictive predicate, whereas the opinion holder her daughters in (3) is a patient2 of embarrasses.

(3) Mrs. Bennet does what she can to get Jane and Bingley together and embarrasses her daughters by doing so.

If only sentences such as (1) and (2) occur in the training data, a classifier will not correctly extract the opinion holder in (3), unless it obtains additional knowledge as to which predicates take opinion holders as patients.

1 By agent we always mean constituents labeled as A0 in PropBank (Kingsbury and Palmer, 2002).
2 By patient we always mean constituents labeled as A1 in PropBank.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 325–335, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

In this work, we will consider three different generalization methods: simple unsupervised word clustering, an induction method, and the usage of lexical resources. We show that generalization causes significant improvements and that the impact of improvement depends on how much training and test data differ from each other. We also address the issue of opinion holders in patient position and present methods, including a novel extraction method, to detect these opinion holders without any labeled training data, as standard datasets contain too few instances of them. In the context of generalization it is also important to consider different classification methods, as the incorporation of generalization may have a varying impact depending on how robust the classifier is by itself, i.e. how well it generalizes even with a standard feature set. We compare two state-of-the-art learning methods, conditional random fields and convolution kernels, and a rule-based method.

2 Data

As a labeled dataset we mainly use the MPQA 2.0 corpus (Wiebe et al., 2005). We adhere to the definition of opinion holders from previous work (Wiegand and Klakow, 2010; Wiegand and Klakow, 2011a; Wiegand and Klakow, 2011b), i.e. every source of a private state or a subjective speech event (Wiebe et al., 2005) is considered an opinion holder.

This corpus contains almost exclusively news texts. In order to divide it into different domains, we use the topic labels from (Stoyanov et al., 2004). By inspecting those topics, we found that many of them can be grouped into a cluster of news items discussing human rights issues, mostly in the context of combating global terrorism. This means that there is little point in considering every single topic as a distinct (sub)domain and, therefore, we consider this cluster as one single domain ETHICS.3 For our cross-domain evaluation, we want to have another topic that is fairly different from this set of documents. By visual inspection, we found that the topic discussing issues regarding the International Space Station would suit our purpose. It is henceforth called SPACE.

In addition to these two (sub)domains, we chose some text type that is not even news text in order to have a very distant domain. Therefore, we had to use some text not included in the MPQA corpus. Existing text collections containing product reviews (Kessler et al., 2010; Toprak et al., 2010), which are generally a popular resource for sentiment analysis, were not found suitable as they only contain few distinct opinion holders. We finally used a few summaries of fictional work (two Shakespeare plays and one novel by Jane Austen4), since their language is notably different from that of news texts and they contain a large number of different opinion holders (therefore opinion holder extraction is a meaningful task on this text type). These texts make up our third domain FICTION. We manually labeled it with opinion holder information by applying the annotation scheme of the MPQA corpus.

3 The cluster is the union of documents with the following MPQA-topic labels: axisofevil, guantanamo, humanrights, mugabe and settlements.
4 Available at: www.absoluteshakespeare.com/guides/{othello|twelfth night}/summary/{othello|twelfth night} summary.htm and www.wikisummaries.org/Pride and Prejudice

Table 1 lists the properties of the different domain corpora. Note that ETHICS is the largest domain. We consider it our primary (source) domain, as it serves both as a training and (in-domain) test set. Due to their size, the other domains only serve as test sets (target domains).

Domain    # Sentences   # Holders per sentence (average)
ETHICS    5700          0.79
SPACE     628           0.28
FICTION   614           1.49

Table 1: Statistics of the different domain corpora.

For some of our generalization methods, we also need a large unlabeled corpus. We use the North American News Text Corpus (LDC95T21).

3 The Different Types of Generalization

3.1 Word Clustering (Clus)

The simplest generalization method considered in this paper is word clustering. By that, we understand the automatic grouping of words occurring in similar contexts. Such clusters are usually computed on a large unlabeled corpus. Unlike lexical features, features based on clusters are less sparse and have been proven to significantly improve data-driven classifiers in related tasks, such as named-entity recognition (Turian et al., 2010). Such a generalization is, in particular, attractive as it is cheaply produced. As a state-of-the-art clustering method, we consider Brown clustering (Brown et al., 1992) as implemented in the SRILM toolkit (Stolcke, 2002). We induced 1000 clusters, which is also the configuration used in (Turian et al., 2010).5

5 We also experimented with other sizes but they did not produce a better overall performance.

Table 2 illustrates a few of the clusters induced from our unlabeled dataset introduced in Section 2. Some of these clusters represent location or person names (e.g. I. & II.). This exemplifies why clustering is effective for named-entity recognition. We also find clusters that intuitively seem to be meaningful for our task (e.g. III. & IV.) but, on the other hand, there are clusters that contain words that, with the exception of their part of speech, do not have anything in common (e.g. V.).

I. Madrid, Dresden, Bordeaux, Istanbul, Caracas, Manila, ...
II. Toby, Betsy, Michele, Tim, Jean-Marie, Rory, Andrew, ...
III. detest, resent, imply, liken, indicate, suggest, owe, expect, ...
IV. disappointment, unease, nervousness, dismay, optimism, ...
V. remark, baby, book, saint, manhole, maxim, coin, batter, ...

Table 2: Some automatically induced clusters.

3.2 Manually Compiled Lexicons (Lex)

The major shortcoming of word clustering is that it lacks any task-specific knowledge. The opposite type of generalization is the usage of manually compiled lexicons comprising predicates that indicate the presence of opinion holders, such as supported, worries or disappointed in (4)-(6).

(4) I always supported this idea. holder:agent
(5) This worries me. holder:patient
(6) He disappointed me. holder:patient

We follow Wiegand and Klakow (2011b), who found that those predicates can be best obtained using a subset of Levin's verb classes (Levin, 1993) and the strong subjective expressions of the Subjectivity Lexicon (Wilson et al., 2005). For those predicates it is also important to consider in which argument position they usually take an opinion holder. Bethard et al. (2004) found that the majority of holders are agents (4). A certain number of predicates, however, also have opinion holders in patient position, e.g. (5) and (6). Wiegand and Klakow (2011b) found that many of those latter predicates are listed in one of Levin's verb classes called amuse verbs. While on the evaluation on the entire MPQA corpus opinion holders in patient position are fairly rare (Wiegand and Klakow, 2011b), we may wonder whether the same applies to the individual domains that we consider in this work. Table 3 lists the proportion of those opinion holders (computed manually) based on a random sample of 100 opinion holder mentions from those corpora. The table shows indeed that on the domains from the MPQA corpus, i.e. ETHICS and SPACE, those opinion holders play a minor role, but there is a notably higher proportion on the FICTION domain.

ETHICS   SPACE   FICTION
1.47     2.70    11.59

Table 3: Percentage of opinion holders as patients.

3.3 Task-Specific Lexicon Induction (Induc)

3.3.1 Distant Supervision with Prototypical Opinion Holders

Lexical resources are potentially much more expressive than word clustering. This knowledge, however, is usually manually compiled, which makes this solution much more expensive. Wiegand and Klakow (2011a) present an intermediate solution for opinion holder extraction inspired by distant supervision (Mintz et al., 2009). The output of that method is also a lexicon of predicates, but it is automatically extracted from a large unlabeled corpus. This is achieved by collecting predicates that frequently co-occur with prototypical opinion holders, i.e. common nouns such as opponents (7) or critics (8), if they are an agent of that predicate. The rationale behind this is that those nouns act very much like actual opinion holders and can therefore be seen as a proxy.

(7) Opponents say these arguments miss the point.
(8) Critics argued that the proposed limits were unconstitutional.

This method reduces the human effort to specifying a small set of such prototypes. Following the best configuration reported in (Wiegand and Klakow, 2011a), we extract 250 verbs, 100 nouns and 100 adjectives from our unlabeled corpus (§2).

3.3.2 Extension for Opinion Holders in Patient Position

The downside of using prototypical opinion holders as a proxy for opinion holders is that it is limited to agentive opinion holders. Opinion holders in patient position, such as the ones taken by amuse verbs in (5) and (6), are not covered. Wiegand and Klakow (2011a) show that considering less restrictive contexts significantly drops classification performance, so the natural extension of looking for predicates having prototypical opinion holders in patient position is not effective. Sentences such as (9) would mar the result.

(9) They criticized their opponents.

In (9), the prototypical opinion holder opponents (in the patient position) is not a true opinion holder.

Our novel method to extract those predicates rests on the observation that the past participle of those verbs, such as shocked in (10), is very often identical to some predicate adjective (11) with a similar if not identical meaning. For the predicate adjective, however, the opinion holder is its subject/agent and not its patient.

(10) He had shocked_verb me. holder:patient
(11) I was shocked_adj. holder:agent

Instead of extracting those verbs directly (10), we take the detour via their corresponding predicate adjectives (11). This means that we collect all those verbs (from our large unlabeled corpus (§2)) for which there is a predicate adjective that coincides with the past participle of the verb. To increase the likelihood that our extracted predicates are meaningful for opinion holder extraction, we also need to check the semantic type in the relevant argument position, i.e. make sure that the agent of the predicate adjective (which would be the patient of the corresponding verb) is an entity likely to be an opinion holder. Our initial attempts with prototypical opinion holders were too restrictive, i.e. the number of prototypical opinion holders co-occurring with those adjectives was too small. Therefore, we widen the semantic type of this position from prototypical opinion holders to persons. This means that we allow personal pronouns (i.e. I, you, he, she and we) to appear in this position. We believe that this relaxation can be done in that particular case, as adjectives are a priori much more likely to convey opinions than verbs (Wiebe et al., 2004).

An intrinsic evaluation of the predicates that we thus extracted from our unlabeled corpus is difficult. The list of the 250 most frequent verbs exhibiting this special property of coinciding with adjectives (this will be the list that we use in our experiments) contains 42% entries of the amuse verbs (§3.2). However, we also found many other potentially useful predicates on this list that are not listed as amuse verbs (Table 4). As amuse verbs cannot be considered a complete gold standard for all predicates taking opinion holders as patients, we will focus on a task-based evaluation of our automatically extracted list (§6).

anguish*, astonish, astound, concern, convince, daze, delight, disenchant*, disappoint, displease, disgust, disillusion, dissatisfy, distress, embitter*, enamor*, engross, enrage, entangle*, excite, fatigue*, flatter, fluster, flummox*, frazzle*, hook*, humiliate, incapacitate*, incense, interest, irritate, obsess, outrage, perturb, petrify*, sadden, sedate*, shock, stun, tether*, trouble

Table 4: Examples of the automatically extracted verbs taking opinion holders as patients (*: not listed as amuse verb).

4 Data-driven Methods

In the following, we present the two supervised classifiers we use in our experiments. Both classifiers incorporate the same levels of representation, including the same generalization methods.

4.1 Conditional Random Fields (CRF)

The supervised classifier most frequently used for information extraction tasks in general is conditional random fields (CRF) (Lafferty et al., 2001). Using CRF, the task of opinion holder extraction is framed as a tagging problem in which, given a sequence of observations x = x1 x2 ... xn (the words in a sentence), a sequence of output tags y = y1 y2 ... yn indicating the boundaries of opinion holders is computed by modeling the conditional probability P(y|x).

The features we use (Table 5) are mostly inspired by Choi et al. (2005) and by the ones used for plain support vector machines (SVMs) in (Wiegand and Klakow, 2010). They are organized into groups. The basic group Plain does not contain any generalization method. Each other group is dedicated to one specific generalization method that we want to examine (Clus, Induc and Lex). Apart from considering generalization features indicating the presence of generalization types, we also consider those types in conjunction with semantic roles. As already indicated above, semantic roles are especially important for the detection of opinion holders. Unfortunately, the corresponding feature from the Plain feature group that also includes the lexical form of the predicate is most likely a sparse feature. For the opinion holder me in (10), for example, it would correspond to A1 shock. Therefore, we introduce for each generalization method an additional feature replacing the sparse lexical item by a generalization label, i.e. Clus: A1 CLUSTER-35265, Induc: A1 INDUC-PRED and Lex: A1 LEX-PRED.6

6 Predicates in patient position are given the same generalization label as the predicates in agent position. Specially marking them did not result in a notable improvement.

Group   Features
Plain   Token features: unigrams and bigrams
        POS/chunk/named-entity features: unigrams, bigrams and trigrams
        Constituency tree path to nearest predicate
        Nearest predicate
        Semantic role to predicate + lexical form of predicate
Clus    Cluster features: unigrams, bigrams and trigrams
        Semantic role to predicate + cluster-id of predicate
        Cluster-id of nearest predicate
Induc   Is there a predicate from the induced lexicon within a window of 5 tokens?
        Semantic role to predicate, if predicate is contained in induced lexicon
        Is nearest predicate contained in induced lexicon?
Lex     Is there a predicate from the manually compiled lexicons within a window of 5 tokens?
        Semantic role to predicate, if predicate is contained in manually compiled lexicons
        Is nearest predicate contained in manually compiled lexicons?

Table 5: Feature set for CRF.

For this learning method, we use CRF++.7 We choose a configuration that provides good performance on our source domain (i.e. ETHICS).8 For semantic role labeling we use SWIRL9, for chunk parsing CASS (Abney, 1991) and for constituency parsing the Stanford Parser (Klein and Manning, 2003). Named-entity information is provided by the Stanford Tagger (Finkel et al., 2005).

7 http://crfpp.sourceforge.net
8 The soft margin parameter -c is set to 1.0 and all features occurring less than 3 times are removed.
9 http://www.surdeanu.name/mihai/swirl

4.2 Convolution Kernels (CK)

Convolution kernels (CK) are special kernel functions. A kernel function K : X × X → R computes the similarity of two data instances xi and xj (xi, xj ∈ X). It is mostly used in SVMs, which estimate a hyperplane H(x) = w · x + b = 0, where w ∈ R^n and b ∈ R, to separate data instances from different classes (Joachims, 1999). In convolution kernels, the structures to be compared within the kernel function are not vectors comprising manually designed features but the underlying discrete structures, such as syntactic parse trees or part-of-speech sequences. Since they are directly provided to the learning algorithm, a classifier can be built without taking the effort of implementing an explicit feature extraction.

We take the best configuration from (Wiegand and Klakow, 2010), which comprises a combination of three different tree kernels: two tree kernels based on constituency parse trees (one with predicate and another with semantic scope) and a tree kernel encoding predicate-argument structures based on semantic role information. These representations are illustrated in Figure 1. The resulting kernels are combined by plain summation.

In order to integrate our generalization methods into the convolution kernels, the input structures, i.e. the linguistic tree structures, have to be augmented. For that we just add additional nodes whose labels correspond to the respective generalization types (i.e. Clus: CLUSTER-ID, Induc: INDUC-PRED and Lex: LEX-PRED). The nodes are added in such a way that they (directly) dominate the leaf node for which they provide a generalization.10 If several generalization methods are used and several of them apply to the same lexical unit, then the (vertical) order of the generalization nodes is LEX-PRED above INDUC-PRED above CLUSTER-ID.11 Figure 2 illustrates the predicate argument structure from Figure 1 augmented with INDUC-PRED and CLUSTER-IDs.

For this learning method, we use the SVMLight-TK toolkit.12 Again, we tune the parameters to our source domain (ETHICS).13

10 Note that even for the configuration Plain the trees are already augmented with named-entity information.
11 We chose this order as it roughly corresponds to the specificity of those generalization types.
12 disi.unitn.it/moschitti
13 The cost parameter -j (Morik et al., 1999) was set to 5.

5 Rule-based Classifiers (RB)

Finally, we also consider rule-based classifiers (RB). The main difference from CRF and CK is that RB is an unsupervised approach not requiring training data. We re-use the framework by Wiegand and Klakow (2011b). The candidate set comprises all noun phrases in a test set. A candidate is classified as an opinion holder if all of the following
conditions hold:
• The candidate denotes a person or group of persons.
• There is a predictive predicate in the same sentence.
• The candidate has a pre-specified semantic role in the event that the predictive predicate evokes (default: agent-role).

The set of predicates is obtained from a given lexicon. For predicates that take opinion holders as patients, the default agent-role is overruled.

We consider several classifiers that differ in the lexicon they use. RB-Lex uses the combination of the manually compiled lexicons presented in §3.2. RB-Induc uses the predicates that have been automatically extracted from a large unlabeled corpus using the methods presented in §3.3. RB-Induc+Lex considers the union of those lexicons.

In order to examine the impact of modeling opinion holders in patient position, we also introduce two versions of each lexicon. AG just considers predicates in agentive position while AG+PT also considers predicates that take opinion holders as patients. For example, RB-Induc_AG+PT is a classifier that uses automatically extracted predicates in order to detect opinion holders in both agent and patient argument position, i.e. RB-Induc_AG+PT also covers our novel extraction method for patients (§3.3.2).

The output of clustering will exclusively be evaluated in the context of learning-based methods, since there is no straightforward way of incorporating this output into a rule-based classifier.

Figure 1: The different structures (left: constituency trees, right: predicate argument structure) derived from Sentence (1) for the opinion holder candidate Malaysia used as input for convolution kernels (CK).

Figure 2: Predicate argument structure augmented with generalization nodes.

6 Experiments

CK and RB have an instance space that is different from the one of CRF. While CRF produces a prediction for every word token in a sentence, CK and RB only produce a prediction for every noun phrase. For evaluation, we project the predictions from RB and CK to word token level in order to ensure comparability. We evaluate the sequential output with precision, recall and F-score as defined in (Johansson and Moschitti, 2010; Johansson and Moschitti, 2011).

6.1 Rule-based Classifier

Table 6 shows the cross-domain performance of the different rule-based classifiers. RB-Lex performs better than RB-Induc. In comparison to the domains ETHICS and SPACE the difference is larger on FICTION. Presumably, this is due to the fact that the predicates in Induc are extracted from a news corpus (§2). Thus, Induc may slightly suffer from a domain mismatch. A combination of the two classifiers, i.e. RB-Lex+Induc, results in a notable improvement in the FICTION-domain. The approaches that also detect opinion holders as patients (AG+PT) including our novel approach (§3.3.2) are effective. A notable improvement can only be measured on the FICTION-domain since this is the only domain with a significant proportion of those opinion holders (Table 3).

                 Induc           Lex         Induc+Lex
Domains       AG     AG+PT    AG     AG+PT    AG+PT
ETHICS      50.77   50.99   52.22   52.27    53.07
SPACE       45.81   46.55   47.60   48.47    45.20
FICTION     46.59   49.97   54.84   59.35    63.11

Table 6: F-score of the different rule-based classifiers.

6.2 In-Domain Evaluation of Learning-based Methods

Table 7 shows the performance of the learning-based methods CRF and CK on an in-domain evaluation (ETHICS-domain) using different amounts of labeled training data. We carry out a 5-fold cross-validation and use n% of the training data in the training folds. The table shows that CK is more robust than CRF. The fewer training data are used the more important generalization becomes. CRF benefits much more from generalization than CK. Interestingly, the CRF configuration with the best generalization is usually as good as plain CK. This proves the effectiveness of CK. In principle, Lex is the strongest generalization method while Clus is by far the weakest. For Clus, systematic improvements towards no generalization (even though they are minor) can only be observed with CRF. As far as combinations are concerned, either Lex+Induc or All performs best. This in-domain evaluation proves that opinion holder extraction is different from named-entity recognition. Simple unsupervised generalization, such as word clustering, is not effective and popular sequential classifiers are less robust than margin-based tree-kernels.

                        Training Size (%)
Features      Alg.     5      10     20     50     100
Plain         CRF    32.14  35.24  41.03  51.05  55.13
              CK     42.15  46.34  51.14  56.39  59.52
+Clus         CRF    33.06  37.11  43.47  52.05  56.18
              CK     42.02  45.86  51.11  56.59  59.77
+Induc        CRF    37.28  42.31  46.54  54.27  56.71
              CK     46.26  49.35  53.26  57.28  60.42
+Lex          CRF    40.69  43.91  48.43  55.37  58.46
              CK     46.45  50.59  53.93  58.63  61.50
+Clus+Induc   CRF    37.27  42.19  47.35  54.95  57.14
              CK     45.14  48.20  52.39  57.37  59.97
+Clus+Lex     CRF    40.52  44.29  49.32  55.44  58.80
              CK     45.89  49.35  53.56  58.74  61.43
+Lex+Induc    CRF    42.23  45.92  49.96  55.61  58.40
              CK     47.46  51.44  54.80  58.74  61.58
All           CRF    41.56  45.75  50.39  56.24  59.08
              CK     46.18  50.10  54.04  58.92  61.44

Table 7: F-score of in-domain (ETHICS) learning-based classifiers.

Table 8 complements Table 7 in that it compares the learning-based methods with the best rule-based classifier and also displays precision and recall. RB achieves a high recall, whereas the learning-based methods always excel RB in precision.14 Applying generalization to the learning-based methods results in an improvement of both recall and precision if few training data are used. The impact on precision decreases, however, the more training data are added. There is always a significant increase in recall but learning-based methods may not reach the level of RB even though they use the same resources. This is a side-effect of preserving a much higher precision. It also explains why learning-based methods with generalization may have a lower F-score than RB.

14 The reason for RB having a high recall is extensively discussed in (Wiegand and Klakow, 2011b).

                      CRF                     CK
Size  Feat.    Prec   Rec    F1       Prec   Rec    F1
10    Plain   52.17  26.61  35.24    58.26  38.47  46.34
      All     62.85  35.96  45.75    63.18  41.50  50.10
50    Plain   59.85  44.50  51.05    59.60  53.50  56.39
      All     62.99  50.80  56.24    61.91  56.20  58.92
100   Plain   64.14  48.33  55.13    62.38  56.91  59.52
      All     64.75  54.32  59.08    63.81  59.24  61.44
RB            47.38  60.32  53.07    47.38  60.32  53.07

Table 8: Comparison of best RB with learning-based approaches on in-domain classification.

6.3 Out-of-Domain Evaluation of Learning-based Methods

Table 9 presents the results of out-of-domain classifiers. The complete ETHICS-dataset is used for training. Some properties are similar to the previous experiments: CK always outperforms CRF. RB provides a high recall whereas the learning-based methods maintain a higher precision. Similar to the in-domain setting using few labeled training data, the incorporation of generalization increases both precision and recall. Moreover, a combination of generalization methods is better than just using one method on average, although Lex is again a fairly robust individual generalization method. Generalization is more effective in this setting than on the in-domain evaluation using all training data, in particular for CK, since the training and test data are much more different from each other and suitable generalization methods partly close that gap.

There is a notable difference in precision between the SPACE- and FICTION-domain (and also the source domain ETHICS (Table 8)). We strongly assume that this is due to the distribution of opinion holders in those datasets (Table 1). The FICTION-domain contains much more opinion holders, therefore the chance that a predicted opinion holder is correct is much higher.

With regard to recall, a similar level of performance as in the ETHICS-domain can only be achieved in the SPACE-domain, i.e. CK achieves a recall of 60%. In the FICTION-domain, however, the recall is much lower (best recall of CK is below 47%). This is no surprise as the SPACE-domain is more similar to the source domain than the FICTION-domain since ETHICS and SPACE are news texts. FICTION contains more out-of-domain language. Therefore, RB (which exclusively uses domain-independent knowledge) outperforms both learning-based methods including the ones incorporating generalization. Similar results have been observed for rule-based classifiers from other tasks in cross-domain sentiment analysis, such as subjectivity detection and polarity classification. High-level information as it is encoded in a rule-based classifier generalizes better than learning-based methods (Andreevskaia and Bergler, 2008; Lambov et al., 2009).

We set up another experiment exclusively for the FICTION-domain in which we combine the output of our best learning-based method, i.e. CK, with the prediction of a rule-based classifier. The combined classifier will predict an opinion holder if either classifier predicts one. The motivation for this is the following: The FICTION-domain is the only domain to have a significant proportion of opinion holders appearing as patients.

Algorithms        Generalization   Prec   Rec    F
CK (Plain)                         66.90  41.48  51.21
CK                Induc            67.06  45.15  53.97
CK+RB_AG          Induc            60.22  54.52  57.23
CK+RB_AG+PT       Induc            61.09  58.14  59.58
CK                Lex              69.45  46.65  55.81
CK+RB_AG          Lex              67.36  59.02  62.91
CK+RB_AG+PT       Lex              68.25  63.28  65.67
CK                Induc+Lex        69.73  46.17  55.55
CK+RB_AG          Induc+Lex       61.41  65.56  63.42
CK+RB_AG+PT       Induc+Lex       62.26  70.56  66.15

Table 10: Combination of out-of-domain CK and rule-based classifiers on FICTION (i.e. distant domain).
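The classifier combination just described (predict an opinion holder wherever either CK or RB predicts one) and the token-level evaluation of Section 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: representing predictions as sets of word-token indices and the helper names are our own assumptions.

```python
# Sketch: union combination of a learning-based classifier (CK) and a
# rule-based classifier (RB), evaluated at word-token level.
# Predictions are modeled as sets of token indices (our simplification).

def combine_union(ck_tokens, rb_tokens):
    """Predict a token as opinion holder if CK or RB predicts it."""
    return ck_tokens | rb_tokens

def precision_recall_f1(predicted, gold):
    """Token-level precision, recall and F-score."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The union can only add predicted tokens, so recall never decreases, while precision may drop, matching the recall boost reported in Table 10.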
We want to know how much of them can be recognized with the best out-of-domain classifier using training data with only very few instances of this type and what benefit the addition of using various RBs which have a clearer notion of these constructions brings about. Moreover, we already observed that the learning-based methods have a bias towards preserving a high precision and this may have as a consequence that the generalization features incorporated into CK will not receive sufficiently large weights. Unlike the SPACE-domain where a sufficiently high recall is already achieved with CK (presumably due to its stronger similarity towards the source domain) the FICTION-domain may be more severely affected by this bias and evidence from RB may compensate for this.

Table 10 shows the performance of those combined classifiers. For all generalization types considered, there is, indeed, an improvement by adding information from RB resulting in a large boost in recall. Already the application of our induction approach Induc results in an increase of more than 8% points compared to plain CK. The table also shows that there is always some improvement if RB considers opinion holders as patients (AG+PT). This can be considered as some evidence that (given the available data we use) opinion holders in patient position can only be effectively extracted with the help of RBs. It is also further evidence that our novel approach to extract those predicates (§3.3.2) is effective.

The combined approach in Table 10 not only outperforms CK (discussed above) but also RB (Table 6). We manually inspected the output of the classifiers to find also cases in which CK detects opinion holders that RB misses. CK has the advantage that it is not only bound to the relationship between candidate holder and predicate. It learns further heuristics, e.g. that sentence-initial mentions of persons are likely opinion holders. In (12), for example, this heuristic fires while RB overlooks this instance as to give someone a share of advice is not part of the lexicon.

(12) She later gives Charlotte her share of advice on running a household.

              SPACE (similar target domain)                FICTION (distant target domain)
                    CRF                   CK                     CRF                   CK
Features      Prec   Rec    F1     Prec   Rec    F1       Prec   Rec    F1     Prec   Rec    F1
Plain        47.32  48.62  47.96  45.89  57.07  50.87    68.58  28.96  40.73  66.90  41.48  51.21
+Clus        49.00  48.62  48.81  49.23  57.64  53.10    71.85  32.21  44.48  67.54  41.21  51.19
+Induc       42.92  49.15  45.82  46.66  60.45  52.67    71.59  34.77  46.80  67.06  45.15  53.97
+Lex         49.65  49.07  49.36  49.60  59.88  54.26    71.91  35.83  47.83  69.45  46.65  55.81
+Clus+Induc  46.61  48.78  47.67  48.65  58.20  53.00    71.32  35.88  47.74  67.46  42.17  51.90
+Lex+Induc   48.75  50.87  49.78  49.92  58.76  53.98    74.02  37.37  49.67  69.73  46.17  55.55
+Clus+Lex    49.72  50.87  50.29  53.70  59.32  56.37    73.41  37.15  49.33  70.59  43.98  54.20
All          49.87  51.03  50.44  51.68  58.76  54.99    72.00  37.44  49.26  70.61  44.83  54.84
best RB      41.72  57.80  48.47  41.72  57.80  48.47    63.26  62.96  63.11  63.26  62.96  63.11

Table 9: Comparison of best RB with learning-based approaches on out-of-domain classification.

7 Related Work

The research on opinion holder extraction has been focusing on applying different data-driven approaches. Choi et al. (2005) and Choi et al. (2006) explore conditional random fields, Wiegand and Klakow (2010) examine different combinations of convolution kernels, while Johansson and Moschitti (2010) present a re-ranking approach modeling complex relations between multiple opinions in a sentence. A comparison of those methods has not yet been attempted. In this work, we compare the popular state-of-the-art learning algorithms conditional random fields and convolution kernels for the first time. All these data-driven methods have been evaluated on the MPQA corpus. Some generalization methods are incorporated but unlike this paper they are neither systematically compared nor combined. The role of resources that provide the knowledge of argument positions of opinion holders is not covered in any of these works. This kind of knowledge should be directly learnt from the labeled training data. In this work, we found, however, that the distribution of argument positions of opinion holders varies throughout the different domains and, therefore, cannot be learnt from any arbitrary out-of-domain training set.

The only cross-domain evaluation of opinion holder extraction is reported in (Li et al., 2007) using the MPQA corpus as a training set and the NTCIR collection as a test set. A low cross-domain performance is obtained and the authors conclude that this is due to the very different annotation schemes of those corpora.

Bethard et al. (2004) and Kim and Hovy (2006) explore the usefulness of semantic roles provided by FrameNet (Fillmore et al., 2003). Bethard et al. (2004) use this resource to acquire labeled training data while in (Kim and Hovy, 2006) FrameNet is used within a rule-based classifier mapping frame-elements of frames to opinion holders. Bethard et al. (2004) only evaluate on an artificial dataset (i.e. a subset of sentences from FrameNet and PropBank (Kingsbury and Palmer, 2002)). The only realistic test set on which Kim and Hovy (2006) evaluate their approach are news texts. Their method is compared against a simple rule-based baseline and, unlike this work, not against a robust data-driven algorithm.

(Wiegand and Klakow, 2011b) is similar to (Kim and Hovy, 2006) in that a rule-based approach is used relying on the relationship towards predictive predicates. Diverse resources are considered for obtaining such words, however, they are only evaluated on the entire MPQA corpus.

8 Conclusion

We examined different generalization methods for opinion holder extraction. We found that for in-domain classification, the more labeled training data are used, the smaller is the impact of generalization. Robust learning methods, such as convolution kernels, benefit less from generalization than weaker classifiers, such as conditional random fields. For cross-domain classification, generalization is always helpful. Distant domains are problematic for learning-based methods, however, rule-based methods provide a reasonable recall and can be effectively combined with the learning-based methods. The types of generalization that help best are manually compiled lexicons followed by an induction method inspired by distant supervision. Finally, we examined the case of opinion holders as patients and also presented a novel automatic extraction method that proved effective. Such dedicated extraction methods are important as common labeled datasets (from the news domain) do not provide sufficient training data for these constructions.

Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (Software-Cluster) under grant no. "01IC10S01". The authors thank Alessandro Moschitti, Benjamin Roth and Josef Ruppenhofer for their technical support and interesting discussions.

References

Steven Abney. 1991. Parsing By Chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.

Alina Andreevskaia and Sabine Bergler. 2008. When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT), Columbus, OH, USA.

Steven Bethard, Hong Yu, Ashley Thornton, Vasileios Hatzivassiloglou, and Dan Jurafsky. 2004. Extracting Opinion Propositions and Opinion Holders using Syntactic and Lexical Cues. In Computing Attitude and Affect in Text: Theory and Applications. Springer-Verlag.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.

Yejin Choi, Eric Breck, and Claire Cardie. 2006. Joint Extraction of Entities and Relations for Opinion Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA.

Thorsten Joachims. 1999. Making Large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning. MIT Press.

Richard Johansson and Alessandro Moschitti. 2010. Reranking Models in Fine-grained Opinion Analysis. In Proceedings of the International Conference on Computational Linguistics (COLING), Beijing, China.

Richard Johansson and Alessandro Moschitti. 2011. Extracting Opinion Expressions and Their Polarities – Exploration of Pipelines and Joint Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Portland, OR, USA.

Jason S. Kessler, Miriam Eckert, Lyndsay Clarke, and Nicolas Nicolov. 2010. The ICWSM JDPA 2010 Sentiment Corpus for the Automotive Domain. In Proceedings of the International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW), Washington, DC, USA.

Soo-Min Kim and Eduard Hovy. 2006. Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. In Proceedings of the ACL Workshop on Sentiment and Subjectivity in Text, Sydney, Australia.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Las Palmas, Spain.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML).

Dinko Lambov, Gaël Dias, and Veska Noncheva. 2009. Sentiment Classification across Domains. In Proceedings of the Portuguese Conference on Artificial Intelligence (EPIA), Aveiro, Portugal. Springer-Verlag.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Yangyong Li, Kalina Bontcheva, and Hamish Cunningham. 2007. Experiments of Opinion Analysis on the Corpora MPQA and NTCIR-6. In Proceedings of the NTCIR-6 Workshop Meeting, Tokyo, Japan.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP), Singapore.

Katharina Morik, Peter Brockhausen, and Thorsten Joachims. 1999. Combining Statistical Learning with a Knowledge-based Approach – A Case Study in Intensive Care Monitoring. In Proceedings of the International Conference on Machine Learning (ICML).

Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA.

Veselin Stoyanov and Claire Cardie. 2011. Automatically Creating General-Purpose Opinion Summaries from Text. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Veselin Stoyanov, Claire Cardie, Diane Litman, and Janyce Wiebe. 2004. Evaluating an Opinion Annotation Scheme Using a New Multi-Perspective Question and Answer Corpus. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text, Menlo Park, CA, USA.

Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and Expression Level Annotation of Opinions in User-Generated Discourse. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-supervised Learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. 2004. Learning Subjective Language. Computational Linguistics, 30(3).

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2/3):164–210.

Michael Wiegand and Dietrich Klakow. 2010. Convolution Kernels for Opinion Holder Extraction. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (HLT/NAACL), Los Angeles, CA, USA.

Michael Wiegand and Dietrich Klakow. 2011a. Prototypical Opinion Holders: What We can Learn from Experts and Analysts. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria.

Michael Wiegand and Dietrich Klakow. 2011b. The Role of Predicates in Opinion Holder Extraction. In Proceedings of the RANLP Workshop on Information Extraction and Knowledge Acquisition (IEKA), Hissar, Bulgaria.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.

Skip N-grams and Ranking Functions for Predicting Script Events

Bram Jans (KU Leuven, Leuven, Belgium)
Steven Bethard (University of Colorado Boulder, Boulder, Colorado, USA)

[email protected]    [email protected]

Ivan Vulić (KU Leuven, Leuven, Belgium)
Marie Francine Moens (KU Leuven, Leuven, Belgium)

[email protected]    [email protected]

Abstract

In this paper, we extend current state-of-the-art research on unsupervised acquisition of scripts, that is, stereotypical and frequently observed sequences of events. We design, evaluate and compare different methods for constructing models for script event prediction: given a partial chain of events in a script, predict other events that are likely to belong to the script. Our work aims to answer key questions about how best to (1) identify representative event chains from a source text, (2) gather statistics from the event chains, and (3) choose ranking functions for predicting new script events. We make several contributions, introducing skip-grams for collecting event statistics, designing improved methods for ranking event predictions, defining a more reliable evaluation metric for measuring predictiveness, and providing a systematic analysis of the various event prediction models.

1 Introduction

There has been recent interest in automatically acquiring world knowledge in the form of scripts (Schank and Abelson, 1977), that is, frequently recurring situations that have a stereotypical sequence of events, such as a visit to a restaurant. All of the techniques so far proposed for this task share a common sub-task: given an event or partial chain of events, predict other events that belong to the same script (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Chambers and Jurafsky, 2011; Manshadi et al., 2008; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010; Regneri et al., 2010). Such a model can then serve as input to a system that identifies the order of the events within that script (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009) or that generates a story using the selected events (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010).

In this article, we analyze and compare techniques for constructing models that, given a partial chain of events, predict other events that belong to the script. In particular, we consider the following questions:
• How should representative chains of events be selected from the source text?
• Given an event chain, how should statistics be gathered from it?
• Given event n-gram statistics, which ranking function best predicts the events for a script?

In the process of answering these questions, this article makes several contributions to the field of script and narrative event chain understanding:
• We explore for the first time the use of skip-grams for collecting narrative event statistics, and show that this approach performs better than classic n-gram statistics.
• We propose a new method for ranking events given a partial script, and show that it performs substantially better than ranking methods from prior work.
• We propose a new evaluation procedure (using Recall@N) for the cloze test, and advocate its usage instead of average rank used previously in the literature.
• We provide a systematic analysis of the interactions between the choices made when constructing an event prediction model.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 336–344, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Section 2 gives an overview of the prior work related to this task. Section 3 lists and briefly describes different approaches that try to provide answers to the three questions posed in this introduction, while Section 4 presents the results of our experiments and reports on our findings. Finally, Section 5 provides a conclusive discussion along with ideas for future work.

2 Prior Work

Our work is primarily inspired by the work of Chambers and Jurafsky, which combined a dependency parser with coreference resolution to collect event script statistics and predict script events (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009). For each document in their training corpus, they used coreference resolution to identify all the entities, and a dependency parser to identify all verbs that had an entity as either a subject or object. They defined an event as a verb plus a dependency type (either subject or object), and collected for each entity the chain of events that it participated in. They then calculated pointwise mutual information (PMI) statistics over all the pairs of events that occurred in the event chains in their corpus. To predict a new script event given a partial chain of events, they selected the event with the highest sum of PMIs with all the events in the partial chain.

The work of McIntyre and Lapata followed in this same paradigm (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), collecting chains of events by looking at entities and the sequence of verbs for which they were a subject or object. They also calculated statistics over the collected event chains, though they considered both event bigram and event trigram counts. Rather than predicting an event for a script however, they used these simple counts to predict the next event that should be generated for a children's story.

Manshadi and colleagues were concerned about the scalability of running parsers and coreference over a large collection of story blogs, and so used a simplified version of event chains – just the main verb of each sentence (Manshadi et al., 2008). Rather than rely on an ad-hoc summation of PMIs, they apply language modeling techniques (specifically, a smoothed 5-gram model) over the sequence of events in the collected chains. However, they only tested these language models on sequencing tasks (e.g. is the real sequence better than a random sequence?) rather than on prediction tasks (e.g. which event should follow these events?).

In the current article, we attempt to shed some light on these previous works by comparing different ways of collecting and using event chains.

3 Methods

Models that predict script events typically have three stages. First, a large corpus is processed to find event chains in each of the documents. Next, statistics over these event chains are gathered and stored. Finally, the gathered statistics are used to create a model that takes as input a partial script and produces as output a ranked list of events for that script. The following sections give more details about each of these stages and identify the decisions that must be made in each step, and an overview of the whole process with an example source text is displayed in Figure 1.

Figure 1: An overview of the whole linear work flow showing the three key steps – identifying event chains, collecting statistics out of the chains and predicting a missing event in a script. The figure also displays how a partial script for evaluation (Section 4.3) is constructed. We show the whole process for Mary's event chain only, but the same steps are followed for John's event chain. (Example source text: John woke up. He opened his eyes and yawned. Then he crossed the room and walked to the door. There he saw Mary. Mary smiled and kissed him. Then they both blushed.)

3.1 Identifying Event Chains

Event chains are typically defined as a sequence of actions performed by some actor. Formally, an event chain C for some actor a is a partially ordered set of events (v, d) where each v is a verb that has the actor a as its dependency d. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), these event chains are identified by running a coreference system and a dependency parser. Then for each entity identified by the coreference system, all verbs that have a mention of that entity as one of their dependencies are collected.1 The event chain is then the sequence of (verb, dependency-type) tuples. For example, given the sentence A Crow was sitting on a branch of a tree when a Fox observed her, the event chain for the Crow would be (sitting, SUBJECT), (observed, OBJECT).

1 Also following prior work, we consider only the dependencies subject and object.

Once event chains have been identified, the most appropriate event chains for training the model must be selected. The goal of this process is to select the subset of the event chains identified by the coreference system and the dependency parser that look to be the most reliable. Both the coreference system and the dependency parser make some errors, so not all event chains are necessarily useful for training a model. The three strategies we consider for this selection process are:
• Select all event chains, that is, all sequences of two or more events linked by common actors. This strategy will produce the largest number of event chains to train a model from, but it may produce noisier training data as the very short chains included by this strategy may be less likely to represent real scripts.

• Select all long event chains consisting of 5 or more events. This strategy will produce a smaller number of event chains, but as they are longer, they may be more likely to represent scripts.

• Select only the longest event chain. This strategy will produce the smallest number of event chains from a corpus. However, they may be of higher quality, since this strategy looks for the key actor in each story, and only uses the events that are tied together by that key actor. Since this is the single actor that played the largest role in the story, its actions may be the most likely to represent a real script.

3.2 Gathering Event Chain Statistics

Once event chains have been collected from the corpus, the statistics necessary for constructing the event prediction model must be gathered. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Manshadi et al., 2008; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), we focus on gathering statistics about the n-grams of events that occur in the collected event chains. Specifically, we look at strategies for collecting bigram statistics, the most common type of statistics gathered in prior work. We consider three strategies for collecting bigram statistics:

• Regular bigrams. We find all pairs of events that are adjacent in an event chain and collect the number of times each event pair was observed. For example, given the chain of events (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract the two event bigrams: ((saw, SUBJ), (kissed, OBJ))
In addition to the event pair counts, we also collect the number of times each event was observed individually, to allow for various conditional probability calculations. This strategy follows the classic approach for most language models.

• 1-skip bigrams. We collect pairs of events that occur with 0 or 1 events intervening between them. For example, given the chain (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract three bigrams: the two regular bigrams ((saw, SUBJ), (kissed, OBJ)) and ((kissed, OBJ), (blushed, SUBJ)), plus the 1-skip bigram ((saw, SUBJ), (blushed, SUBJ)). This approach to collecting n-gram statistics is sometimes called skip-gram modeling, and it can reduce data sparsity by extracting more event pairs per chain (Guthrie et al., 2006). It has not previously been applied to the task of predicting script events, but it may be quite appropriate to this task, because in most scripts it is possible to skip some events in the sequence.

• 2-skip bigrams. We collect pairs of events that occur with 0, 1 or 2 intervening events, similar to what was done in the 1-skip bigrams strategy. This will extract even more pairs of events from each chain, but it is possible the statistics over these pairs of events will be noisier.

3.3 Predicting Script Events

Once statistics over event chains have been collected, it is possible to construct the model for predicting script events. The input of this model will be a partial script c of n events, where c = c_1 c_2 ... c_n = (v_1, d_1), (v_2, d_2), ..., (v_n, d_n), and the output of this model will be a ranked list of events where the highest ranked events are the ones most likely to belong to the event sequence in the script. Thus, the key issue for this model is to define the function f for ranking events. We consider three such ranking functions:

• Chambers & Jurafsky PMI. Chambers and Jurafsky (2008) define their event ranking function based on pointwise mutual information. Given a partial script c as defined above, they consider each event e = (v′, d′) collected from their corpus, and score it as the sum of the pointwise mutual informations between the event e and each of the events in the script:

    f(e, c) = \sum_{i=1}^{n} \log \frac{P(c_i, e)}{P(c_i) P(e)}

Chambers and Jurafsky's description of this score suggests that it is unordered, such that P(a, b) = P(b, a). Thus the probabilities must be defined as:

    P(e_1, e_2) = \frac{C(e_1, e_2) + C(e_2, e_1)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

    P(e) = \frac{C(e)}{\sum_{e'} C(e')}

where C(e_1, e_2) is the number of times that the ordered event pair (e_1, e_2) was counted in the training data, and C(e) is the number of times that the event e was counted.

• Ordered PMI. A variation on the approach of Chambers and Jurafsky is to have a score that takes the order of the events in the chain into account. In this scenario, we assume that in addition to the partial script of events, we are given an insertion point, m, where the new event should be added. The score is then defined as:

    f(e, c) = \sum_{k=1}^{m} \log \frac{P(c_k, e)}{P(c_k) P(e)} + \sum_{k=m+1}^{n} \log \frac{P(e, c_k)}{P(e) P(c_k)}

where the probabilities are defined as:

    P(e_1, e_2) = \frac{C(e_1, e_2)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

    P(e) = \frac{C(e)}{\sum_{e'} C(e')}

This approach uses pointwise mutual information but also models the event chain in the order it was observed.

• Bigram probabilities. Finally, a natural ranking function, which has not been applied to the script event prediction task (but has been applied to related tasks (Manshadi et al., 2008)), is to use the bigram probabilities of language modeling rather than pointwise mutual information scores.
Again, given an insertion point m for the event in the script, we define the score as:

    f(e, c) = \sum_{k=1}^{m} \log P(e \mid c_k) + \sum_{k=m+1}^{n} \log P(c_k \mid e)

where the conditional probability is defined as:²

    P(e_1 \mid e_2) = \frac{C(e_1, e_2)}{C(e_2)}

This approach scores an event based on the probability that it was observed following all the events before it in the chain and preceding all the events after it in the chain. This approach most directly models the event chain in the order it was observed.

² Note that predicted bigram probabilities are calculated in this way for both classic language modeling and skip-gram modeling. In skip-gram modeling, skips in the n-grams are only used to increase the size of the training data; prediction is performed exactly as in classic language modeling.

4 Experiments

Our experiments aimed to answer three questions: Which event chains are worth keeping? How should event bigram counts be collected? And which ranking method is best for predicting script events? To answer these questions we use two corpora, the Reuters Corpus and the Andrew Lang Fairy Tale Corpus, to evaluate our three different chain selection methods, {all chains, long chains, the longest chain}, our three different bigram counting methods, {regular bigrams, 1-skip bigrams, 2-skip bigrams}, and our three different ranking methods, {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}.

4.1 Corpora

We consider two corpora for evaluation:

• Reuters Corpus, Volume 1³ (Lewis et al., 2004) – a large collection of 806,791 news stories written in English concerning a number of different topics such as politics, economics, sports, etc., strongly varying in length, topics and narrative structure.

• Andrew Lang Fairy Tale Corpus⁴ – a small collection of 437 children's stories with an average length of 125 sentences, used previously for story generation by McIntyre and Lapata (2009).

In general, the Reuters Corpus is much larger and allows us to see how well script events can be predicted when a lot of data is available, while the Andrew Lang Fairy Tale Corpus is much smaller but has a more straightforward narrative structure that may make identifying scripts simpler.

4.2 Corpus Processing

Constructing a model for predicting script events requires a corpus that has been parsed with a dependency parser, and whose entities have been identified via a coreference system. We therefore processed our corpora by (1) filtering out non-narrative articles, (2) applying a dependency parser, (3) applying a coreference resolution system, and (4) identifying event chains via entities and dependencies.

First, articles that had no narrative content were removed from the corpora. In the Reuters Corpus, we removed all files solely listing stock exchange values, interest rates, etc., as well as all articles that were simply summaries of headlines from different countries or cities. After removing these files, the Reuters corpus was reduced to 788,245 files. Removing files from the Fairy Tale corpus was not necessary – all 437 stories were retained.

We then applied the Stanford Parser (Klein and Manning, 2003) to identify the dependency structure of each sentence in each article in the corpus. This parser produces a constituent-based syntactic parse tree for each sentence, and then converts this tree to a collapsed dependency structure via a set of tree patterns.

Next we applied the OpenNLP coreference engine⁵ to identify the entities in each article, and the noun phrases that were mentions of each entity.

Finally, to identify the event chains, we took each of the entities proposed by the coreference system, walked through each of the noun phrases associated with that entity, retrieved any subject
or object dependencies that linked a verb to that noun phrase, and created an event chain from the sequence of (verb, dependency-type) tuples in the order that they appeared in the text.

³ http://trec.nist.gov/data/reuters/reuters.html
⁴ http://www.mythfolklore.net/andrewlang/
⁵ http://incubator.apache.org/opennlp/

4.3 Evaluation Metrics

We follow the approach of Chambers and Jurafsky (2008), evaluating our models for predicting script events in a narrative cloze task. The narrative cloze task is inspired by the classic psychological cloze task, in which subjects are given a sentence with a word missing and asked to fill in the blank (Taylor, 1953). Similarly, in the narrative cloze task, the system is given a sequence of events from a script where one event is missing, and asked to predict the missing event. The difficulty of a cloze task depends a lot on the context around the missing item – in some cases it may be quite predictable, but in many cases there is no single correct answer, though some answers are more probable than others. Thus, performing well on a cloze task is more about ranking the missing event highly, and not about proposing a single "correct" event.

In this way, narrative cloze is like perplexity in a language model. However, where perplexity measures how good the model is at predicting a script event given the previous events in the script, narrative cloze measures how good the model is at predicting what is missing between events in the script. Thus narrative cloze is somewhat more appropriate to our task, and at the same time simplifies comparisons to prior work.

Rather than manually constructing a set of scripts on which to run the cloze test, we follow Chambers and Jurafsky in reserving a section of our parsed corpora for testing, and then using the event chains from that section as the scripts for which the system must predict events. Given an event chain of length n, we run n cloze tests, with a different one of the n events removed each time to create a partial script from the remaining n − 1 events (see Figure 1). Given a partial script as input, an accurate event prediction model should rank the missing event highly in the guess list that it generates as output.

We consider two approaches to evaluating the guess lists produced in response to narrative cloze tests. Both are defined in terms of a test collection C, consisting of |C| partial scripts, where for each partial script c with missing event e, rank_sys(c) is the rank of e in the system's guess list for c.

• Average rank. The average rank of the missing event across all of the partial scripts:

    \frac{1}{|C|} \sum_{c \in C} \mathrm{rank}_{sys}(c)

This is the evaluation metric used by Chambers and Jurafsky (2008).

• Recall@N. The fraction of partial scripts where the missing event is ranked N or less⁶ in the guess list:

    \frac{1}{|C|} \left| \{ c : c \in C \wedge \mathrm{rank}_{sys}(c) \le N \} \right|

In our experiments we use N = 50, but results are roughly similar for lower and higher values of N.

⁶ Rank 1 is the event that the system predicts is most probable, so we want the missing event to have the smallest rank possible.

Recall@N has not been used before for evaluating models that predict script events; however, we suggest that it is a more reliable metric than average rank. When calculating the average rank, the length of the guess lists will have a significant influence on results. For instance, if a small model is trained with only a small vocabulary of events, its guess lists will usually be shorter than a larger model's, but if both models predict the missing event at the bottom of the list, the larger model will be penalized more. Recall@N does not have this issue – it is not influenced by the length of the guess lists.

An alternative evaluation metric would have been mean average precision (MAP), a metric commonly used to evaluate information retrieval. Mean average precision reduces to mean reciprocal rank (MRR) when there is only a single answer, as in the case of narrative cloze, and would have scored the ranked lists as:

    \frac{1}{|C|} \sum_{c \in C} \frac{1}{\mathrm{rank}_{sys}(c)}

Note that mean reciprocal rank has the same issues with guess list length that average rank does. Thus, since it does not aid us in comparing to prior work, and it has the same deficiencies as average rank, we do not report MRR in this article.

Table 1: Chain selection methods for the Reuters corpus – comparison of average ranks and Recall@50 (2-skip bigrams + bigram probabilities).

    Chain selection     Av. rank    Recall@50
    all chains          502         0.5179
    long chains         549         0.4951
    the longest chain   546         0.4984

Table 2: Chain selection methods for the Fairy Tale corpus – comparison of average ranks and Recall@50 (2-skip bigrams + bigram probabilities).

    Chain selection     Av. rank    Recall@50
    all chains          1650        0.3376
    long chains         452         0.3461
    the longest chain   1534        0.3376

Table 3: Event bigram selection methods for the Reuters corpus – comparison of average ranks and Recall@50 (all chains + bigram probabilities).

    Bigram selection    Av. rank    Recall@50
    regular bigrams     789         0.4886
    1-skip bigrams      630         0.4951
    2-skip bigrams      502         0.5179

Table 4: Event bigram selection methods for the Fairy Tale corpus – comparison of average ranks and Recall@50 (all chains + bigram probabilities).

    Bigram selection    Av. rank    Recall@50
    regular bigrams     2363        0.3227
    1-skip bigrams      1690        0.3418
    2-skip bigrams      1650        0.3376

4.4 Results

We considered all 27 combinations of our chain selection methods, bigram counting methods, and ranking methods: {all chains, long chains, the
longest chain} × {regular bigrams, 1-skip bigrams, 2-skip bigrams} × {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}. The best among these 27 combinations for the Reuters corpus was {all chains} × {2-skip bigrams} × {bigram probabilities}, achieving an average rank of 502 and a Recall@50 of 0.5179.

Since viewing all the combinations at once would be confusing, the following sections instead investigate each decision (selection, counting, ranking) one at a time. While one decision is varied across its three choices, the other decisions are held to their values in the best model above.

4.4.1 Identifying Event Chains

We first try to answer the question: How should representative chains of events be selected from the source text? Tables 1 and 2 show performance when we vary the strategy for selecting event chains, while fixing the counting method to 2-skip bigrams, and fixing the ranking method to bigram probabilities.

For the Reuters collection, we see that using all chains gives a lower average rank and a higher Recall@50 than either of the strategies that select a subset of the event chains. The explanation is probably simple: using all chains produces more than 700,000 bigrams from the Reuters corpus, while using only the long chains produces only around 300,000. So more data is better data for predicting script events.

For the Fairy Tale collection, long chains gives the lowest average rank and highest Recall@50. In this collection, there is apparently some benefit to filtering the shorter event chains, probably because the collection is small enough that the noise introduced from dependency and coreference errors plays a larger role.

4.4.2 Gathering Event Chain Statistics

We next try to answer the question: Given an event chain, how should statistics be gathered from it? Tables 3 and 4 show performance when we vary the strategy for counting event pairs, while fixing the selection method to all chains, and fixing the ranking method to bigram probabilities.

For the Reuters corpus, 2-skip bigrams achieves the lowest average rank and the highest Recall@50. For the Fairy Tale corpus, 1-skip bigrams and 2-skip bigrams perform similarly, and both have lower average rank and higher Recall@50 than regular bigrams.

Skip-grams probably outperform regular n-grams on both of these corpora because the skip-grams provide many more event pairs over which to calculate statistics: in the Reuters corpus, regular bigrams extracts 737,103 bigrams, while 2-skip bigrams extracts 1,201,185 bigrams. Though skip-grams have not been applied to predicting script events before, it seems that they are a good fit, and better capture statistics about narrative event chains than regular n-grams do.

4.4.3 Predicting Script Events

Finally, we try to answer the question: Given event n-gram statistics, which ranking function best predicts the events for a script? Tables 5 and 6 show performance when we vary the strategy for ranking event predictions, while fixing the selection method to all chains, and fixing the counting method to 2-skip bigrams.

Table 5: Ranking methods for the Reuters corpus – comparison of average ranks and Recall@50 (all chains + 2-skip bigrams).

    Ranking method   Av. rank    Recall@50
    C&J PMI          2052        0.1954
    ordered PMI      3584        0.1694
    bigram prob.     502         0.5179

Table 6: Ranking methods for the Fairy Tale corpus – comparison of average ranks and Recall@50 (all chains + 2-skip bigrams).

    Ranking method   Av. rank    Recall@50
    C&J PMI          1455        0.1975
    ordered PMI      2460        0.0467
    bigram prob.     1650        0.3376

For both Reuters and the Fairy Tale corpus, Recall@50 identifies bigram probabilities as the best ranking function by far. On the Reuters corpus the Chambers & Jurafsky PMI ranking method achieves a Recall@50 of only 0.1954, while the bigram probabilities ranking method achieves 0.5179. The gap is also quite large on the Fairy Tale corpus: 0.1975 vs. 0.3376.

On the Reuters corpus, average rank also identifies bigram probabilities as the best ranking function, yet for the Fairy Tale corpus, Chambers & Jurafsky PMI and bigram probabilities have similar average ranks. This inconsistency is probably due to the flaws in the average rank evaluation measure that were discussed in Section 4.3 – the measure is overly sensitive to the length of the guess list, particularly when the missing event is ranked lower, as it is likely to be when training on a smaller corpus like the Fairy Tale corpus.

5 Discussion

Our experiments have led us to several important conclusions. First, we have introduced skip-grams and proved their utility for acquiring script knowledge – our models that employ skip bigrams score consistently higher on event prediction. By following the intuition that events do not have to appear strictly one after another to be closely semantically related, skip-grams decrease data sparsity and increase the size of the training data.

Second, our novel bigram probabilities ranking function outperforms the other ranking methods. In particular, it outperforms the state-of-the-art pointwise mutual information method introduced by Chambers and Jurafsky (2008), and it does so by a large margin, more than doubling the Recall@50 on the Reuters corpus. The key insight here is that, when modeling events in a script, a language-model-like approach better fits the task than a mutual information approach.

Third, we have discussed why Recall@N is a better and more consistent evaluation metric than average rank. However, both evaluation metrics suffer from the strictness of the narrative cloze test, which accepts only one event as the correct event, while it is sometimes very difficult, even for humans, to predict the missing events, and sometimes more solutions are possible and equally correct. In future research, our goal is to design a better evaluation framework which is more suitable for this task, where credit can be given for proposed script events that are appropriate but not identical to the ones observed in a text.

Fourth, we have observed some differences in results between the Reuters and the Fairy Tale corpora. The results for Reuters are consistently better (higher Recall@50, lower average rank), although fairy tales have a plainer narrative structure, which should be more appropriate to our task. This again leads us to the conclusion that more data (even with more noise, as in Reuters) leads to a greater coverage of events, better overall models and, consequently, more accurate predictions. Still, the Reuters corpus seems to be far from a perfect corpus for research in the automatic acquisition of scripts, since only a small portion of the corpus contains true narratives. Future work must therefore gather a large corpus of true narratives, like fairy tales and children's stories, whose simple plot structures should provide better learning material, both for models predicting script events and for related tasks like automatic storytelling (McIntyre and Lapata, 2009).

One of the limitations of the work presented here is that it takes a fairly linear, n-gram-based approach to characterizing story structure. We think such an approach is useful because it forms a natural baseline for the task (as it does in many other tasks such as named entity tagging and language modeling).
However, story structure is seldom strictly linear, and future work should consider models based on grammatical or discourse links that can capture the more complex nature of script events and story structure.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments. This research was carried out as a master thesis in the framework of the TERENCE European project (EU FP7-257410).

References

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 789–797.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 976–986.

David Guthrie, Ben Allison, W. Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 1222–1225.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Mehdi Manshadi, Reid Swanson, and Andrew S. Gordon. 2008. Learning a probabilistic model of event sequences from internet weblog stories. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference.

Neil McIntyre and Mirella Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 217–225.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562–1572.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 979–988.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, plans, goals, and understanding: an inquiry into human knowledge structures. Lawrence Erlbaum Associates.

Wilson L. Taylor. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly, 30:415–433.

The Problem with Kappa

David M W Powers
Centre for Knowledge & Interaction Technology, CSEM
Flinders University

[email protected]

Abstract

It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, have been proposed to fill the void.

This paper aims to clear up some of the confusion relating to evaluation, by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions.

Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.

[Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 345–355, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics]

Introduction

Research in Computational Linguistics usually requires some form of quantitative evaluation. A number of traditional measures borrowed from Information Retrieval (Manning & Schütze, 1999) are in common use, but there has been considerable critical evaluation of these measures themselves over the last decade or so (Entwisle & Powers, 1998; Flach, 2003; Ben-David, 2008). Receiver Operating Characteristics (ROC) analysis has been advocated as an alternative by many, and in particular has been used by Fürnkranz and Flach (2005), Ben-David (2008) and Powers (2008) to better understand both learning algorithms and the relationship between the various measures, and the inherent biases that make many of them suspect. One of the key advantages of ROC is that it provides a clear indication of chance level performance, as well as a less well known indication of the relative cost weighting of positive and negative cases for each possible system or parameterization represented.

ROC Area Under the Curve (Fig. 1) has also been used as a performance measure, but it averages over the false positive rate (Fallout) and is thus a function of cost that is dependent on the classifier rather than the application. For this reason it has come in for considerable criticism, and a number of variants and alternatives have been proposed (e.g. AUK, Kaymak et al., 2010, and H-measure, Hand, 2009). An AUC curve that is at least as good as a second curve at all points is said to dominate it, and indicates that the first classifier is equal to or better than the second for all plotted values of the parameters, and all cost ratios. However, AUC being greater for one classifier than another does not have such a property – indeed, deconvexities within or intersections of ROC curves are both prima facie evidence that fusion of the parameterized classifiers will be useful (cf. Provost and Fawcett, 2001; Flach and Wu, 2005).

AUK stands for Area Under Kappa, and represents a step in the advocacy of Kappa (Ben-David, 2008ab) as an alternative to the traditional measures and ROC AUC. Powers (2003, 2007) has also proposed a Kappa-like measure (Informedness) and analysed it in terms of ROC, and there are many more, Warrens (2010) analyzing the relationships between some of the others. Systems like RapidMiner (2011) and Weka (Witten and Frank, 2005) provide almost all of the measures we have considered, and many more besides. This encourages the use of multiple measures, and indeed it is now becoming routine to display tables of multiple results for each system; this is in particular true for the frameworks of some of the challenges and competitions brought to the communities (e.g. 2nd i2b2 Challenge in NLP for Clinical Data, 2011; 2nd Pascal Challenge on HTC, 2011).

This use of multiple statistics is no doubt in response to the criticism levelled at the evaluation mechanisms used in earlier generations of competitions and the above mentioned critiques, but the proliferation of alternate measures in some ways merely compounds the problem. Researchers have the temptation of choosing those that favour their system, as they face the dilemma of what to do about competing (and often disagreeing) evaluation measures that they do not completely understand. These systems and competitions also exhibit another issue, the tendency to macro-average over multiple classes, even of measures that are not denominated in class (e.g. that are proportions of predicted labels rather than real classes, as with Precision).

This paper is directed at better understanding some of these new and old measures, as well as providing recommendations as to which measures are appropriate in which circumstances.

1 What's in a Kappa?

In this paper we focus on the Kappa family of measures, as well as some closely related statistics named for other letters of the Greek alphabet, and some measures that we will show behave as Kappa measures although they were not originally defined as such. These include Informedness, Gini Coefficient and single-point ROC AUC, which are in fact all equivalent to DeltaP′ in the dichotomous case, which we deal with first, and to the other Kappas when the marginal prevalences (or biases) match.

1.1 Two classes and non-negative Kappa

Kappa was originally proposed (Cohen, 1960) to compare human ratings in a binary, or dichotomous, classification task. Cohen (1960) recognized that Rand Accuracy did not take chance into account and therefore proposed to subtract off the chance level of Accuracy and then renormalize to the form of a probability:

    K(Acc) = [Acc − E(Acc)] / [1 − E(Acc)]    (1)

This leaves the question of how to estimate the expected Accuracy, E(Acc). Cohen (1960) made the assumption that the raters would have different distributions, which could be estimated as the products of the corresponding marginal coefficients of the contingency table:

                      +ve Class    −ve Class
    +ve Prediction    A = TP       B = FP       PP
    −ve Prediction    C = FN       D = TN       PN
                      RP           RN           N

    Table 1. Statistical and IR Contingency Notation

In order to discuss this further it is important to discuss our notational conventions, and it is noted that in statistics the letters A-D (upper case or lower case) are conventionally used to label the cells, and their sums may be used to label the marginal cells. However, in the literature on ROC analysis, which we follow here, it is usual to talk about true and false positives (that is, positive predictions that are correct or incorrect), and conversely true and false negatives. Often upper case is used to indicate counts in the contingency table, which sum to the number of instances, N. In this case lower case letters are used to indicate probabilities, which means that the corresponding upper case values in the contingency table are all divided by N, and n = 1.

Statistics relative to (the total numbers of items in) the real classes are called Rates, and have the number (or proportion) of Real Positives (RP) or Real Negatives (RN) in the denominator. In this notation, we have Recall = TPR = TP/RP. Conversely, statistics relative to the (number of) predictions are called Accuracies, so relative to the predictions that label instances positively, Predicted Positives (PP), we have Precision = TPA = TP/PP.
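In this notation, the margins and the Rate/Accuracy statistics can be computed directly from the four cells. A minimal sketch with illustrative counts (the numbers and variable names are ours, not from the paper):

```python
# Contingency counts in the notation of Table 1 (illustrative values).
TP, FP, FN, TN = 60, 10, 20, 110
N = TP + FP + FN + TN

RP, RN = TP + FN, FP + TN   # Real Positives / Negatives (class margins)
PP, PN = TP + FP, FN + TN   # Predicted Positives / Negatives (label margins)

recall = TP / RP            # TPR: a "Rate", denominated in the real classes
precision = TP / PP         # TPA: an "Accuracy", denominated in the predictions
inverse_precision = TN / PN # probability that negative predictions are correct
prevalence = RP / N         # class prevalence of positive instances (rp)
bias = PP / N               # label bias to positive predictions (pp)

rand_accuracy = (TP + TN) / N
# Rand Accuracy equals the bias-weighted average of Precision and
# Inverse Precision: bias*precision + (1 - bias)*inverse_precision.
```

The final comment can be checked by hand: with these counts, bias·precision = 0.3 and (1 − bias)·inverse_precision = 0.55, summing to the Rand Accuracy of 0.85.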
These include Predicted Positives (PP), we have Precision = Informedness, Gini Coefficient and single point TPA = TP/PP. 346 ! the weighting is made according to the number of predictions made for the corresponding labels. Rand Accuracy is also the weighted average of Recall and Inverse Recall (probability that negative instances are correctly predicted), where the weighting is made according to the number of instances in the corresponding classes. The marginal probabilities rp and pp are also known as Prevalence (the class prevalence of positive instances) and Bias (the label bias to positive predictions), and the corresponding probabilities of negative classes and labels are the Inverse Prevalence and Inverse Bias respectively. In the ROC literature, the ratios of negative to positive classes is often referred to as the class ratio or skew. We can similarly also refer to a label ratio, prediction ratio or Figure 1. Illustration of ROC Analysis. The prediction skew. Note that optimal performance solid diagonal represents chance performance can only be achieved if class skew = label skew. for different rates of guessing positive or The Expected True Positives and Expected negative labels. The dotted line represent the True Negatives for Cohen Kappa, as well as Chi- convex hull enclosing the results of different squared significance, are estimated as the systems, thresholds or parameters tested. The product of Bias and Prevalence, and the product (0,0) and (1,1) points represent guessing always of Inverse Bias and Inverse Prevalence, resp., negative and always positive and are always where traditional uses of Kappa for agreement of nominal systems in a ROC curve. The points human raters, the contingency table represents along any straight line segment of a convex hull one rater as providing the classification to be are achievable by probabilistic interpolation of predicted by the other rater. 
Cohen assumes that the systems at each end, the gradient represents their distribution of ratings are independent, as the cost ratio and all points along the segment, including the endpoints have the same effective reflected both by the margins and the cost benefit. AUC is the area under the curve contingencies: ETP = RP*PP; ETN = RN*NN. joining the systems with straight edges and This gives us E(Acc) = (ETP+ETN)/N=etp+etn. AUCH is the area under the convex hull where By contrast the two rater two class form of points within it are ignored. The height above Fleiss (1981) Kappa, also known as Scott Pi, the chance line of any point represents DeltaP’, assumes that both raters are labeling the Gini Coefficient and also the Dichotomous independently using the same distribution, and Informedness of the corresponding system, and that the margins reflect this potential variation. also corresponds to twice the area of the triangle The expected number of positives is thus between it and the chance line, and thus 2AUC-1 effectively estimated as the average of the two where AUC is calculated on this single point raters’ counts, so that EP = (RP+PP)/2, and EN = curve (not shown) joining it to (0,0) and (1,1). (RN+PN)/2, ETP = EP2 and ETN = EN2. The (1,0) point represents perfect performance with 100% True Positive Rate and 0% False 1.2 Inverting Kappa Negative Rate. The definition of Kappa in Eqn (1) can be seen to be applicable to arbitrary definitions of The accuracy of all our predictions, positive or Expected Accuracy, and in order to discover how negative, is given by Rand Accuracy = other measures relate to the family of Kappa (TF+TN)/N = tf+tn, and this is what is meant in measures it is useful to invert Kappa to discover general by the unadorned term Accuracy, or the the implicit definition of Expected Accuracy that abbreviation Acc. allows a measure to be interpreted as a form of Rand Accuracy is the weighted average of Kappa. 
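The notation above lends itself to a direct implementation. Below is a minimal Python sketch (our illustration, not code from the paper): the cells (tp, fp, fn, tn) are the probability-normalized contingency cells (summing to 1), and the two expected-accuracy functions encode the Cohen (independent margins) and Fleiss/Scott (shared average margins) assumptions just described. The function names are ours.

```python
# Sketch: dichotomous contingency statistics in this section's
# probability notation; the four cells tp, fp, fn, tn sum to 1.

def margins(tp, fp, fn, tn):
    rp, rn = tp + fn, fp + tn    # Real Positives/Negatives (Prevalence side)
    pp, pn = tp + fp, fn + tn    # Predicted Positives/Negatives (Bias side)
    return rp, rn, pp, pn

def recall(tp, fp, fn, tn):          # TPR = TP/RP
    rp, _, _, _ = margins(tp, fp, fn, tn)
    return tp / rp

def precision(tp, fp, fn, tn):       # TPA = TP/PP
    _, _, pp, _ = margins(tp, fp, fn, tn)
    return tp / pp

def rand_accuracy(tp, fp, fn, tn):   # Acc = tp + tn
    return tp + tn

def expected_accuracy_cohen(tp, fp, fn, tn):
    # independent margins: etp = rp*pp, etn = rn*pn
    rp, rn, pp, pn = margins(tp, fp, fn, tn)
    return rp * pp + rn * pn

def expected_accuracy_fleiss(tp, fp, fn, tn):
    # shared average margins: ep = (rp+pp)/2, etp = ep**2, etc.
    rp, rn, pp, pn = margins(tp, fp, fn, tn)
    ep, en = (rp + pp) / 2, (rn + pn) / 2
    return ep ** 2 + en ** 2
```

For example, the table (tp, fp, fn, tn) = (0.4, 0.1, 0.2, 0.3) gives Recall 2/3, Precision 0.8, Accuracy 0.7, and expected accuracies 0.5 (Cohen) and 0.505 (Fleiss).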
We simply make E(Acc) the subject by multiplying out Eqn (1) to a common denominator and associating factors of E(Acc):

K(Acc) = [Acc − E(Acc)] / [1 − E(Acc)] (1)
E(Acc) = [Acc − K(Acc)] / [1 − K(Acc)] (2)

Note that for a given value of Acc the function connecting E(Acc) and K(Acc) is its own inverse:

E(Acc) = fAcc(K(Acc)) (3)
K(Acc) = fAcc(E(Acc)) (4)

In future we will tend to drop the Acc argument or subscript when it is clear, and we will also subscript E and K with the name or initial of the corresponding definition of Expectation and thus Kappa (viz. Fleiss and Cohen so far). Note that given Acc and E(Acc) are in the range 0..1 as probabilities, Kappa is also restricted to this range, and takes the form of a probability.
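The self-inverse property of Eqns (2)-(4) is easy to confirm numerically; the following sketch (our illustration, with our own function names) applies the same rational map fAcc in both directions.

```python
# Sketch: Kappa as in Eqn (1), and the map f_Acc of Eqns (2)-(4),
# which for a fixed Acc exchanges K(Acc) and E(Acc) and is its
# own inverse (an involution).

def kappa(acc, e_acc):               # Eqn (1)
    return (acc - e_acc) / (1 - e_acc)

def f(acc, x):                       # Eqns (2)-(4): same rational form
    return (acc - x) / (1 - x)

acc, e = 0.7, 0.5
k = kappa(acc, e)                         # 0.4
assert abs(f(acc, k) - e) < 1e-12         # f maps K(Acc) back to E(Acc)
assert abs(f(acc, f(acc, e)) - e) < 1e-12 # involution: f(f(x)) = x
```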
1.3 Multiclass multirater Kappa

Fleiss (1981) and others sought to generalize the Cohen (1960) definition of Kappa to handle both multiple classes (not just positive/negative) and multiple raters (not just two – one of which we have called real and the other prediction). Fleiss in fact generalized Scott's (1955) Pi in both senses, not Cohen Kappa. The Fleiss Kappa is not formulated as we have done here for exposition, but in terms of pairings (agreements) amongst the raters, who are each assumed to have rated the same number of items, N, but not necessarily all. Krippendorff's (1970, 1978) Alpha effectively generalizes further by dealing with arbitrary numbers of raters assessing different numbers of items. Light (1971) and Hubert (1977) successfully generalized Cohen Kappa. Another approach to estimating E(Acc) was taken by Bennett et al. (1955), which basically assumed all classes were equilikely (effectively what use of Accuracy, F-Measure etc. do, although they don't subtract off the chance component). The Bennett Kappa was generalized by Randolph (2005), but as our starting point is that we need to take the actual margins into account, we do not pursue these further. However, Warrens (2010a) shows that, under certain conditions, Fleiss Kappa is a lower bound of both the Hubert generalization of Cohen Kappa and the Randolph generalization of Bennett Kappa, which is itself correspondingly an upper bound of both the Hubert and the Light generalizations of Cohen Kappa. Unfortunately the conditions are that there is some agreement between the class and label skews (viz. the prevalence and bias of each class/label). Our focus in this paper is the behaviour of the various Kappa measures as we move from strongly matched to strongly mismatched biases.

Cohen (1968) also introduced a weighted variant of Kappa. We have also discussed cost weighting in the context of ROC, and Hand (2009) seeks to improve on ROC AUC by introducing a beta distribution as an estimated cost profile, but we will not discuss them further here as we are more interested in the effectiveness of the classifier overall rather than matching a particular cost profile, and are skeptical about any generic cost distribution. In particular, the beta distribution gives priority to central tendency rather than boundary conditions, but boundary conditions are frequently encountered in optimization. Similarly, Kaymak et al.'s (2010) proposal to replace AUC by AUK corresponds to a Cohen Kappa reweighting of ROC that eliminates many of its useful properties, without any expectation that the measure, as an integration across a surrogate cost distribution, has any validity for system selection. Introducing alternative weights is also allowed in the definition of F-Measure, although in practice this is almost invariably employed as the equally weighted harmonic mean of Recall and Precision. Introducing additional weight or distribution parameters just multiplies the confusion as to which measure to believe.

Powers (2003) derived a further multiclass Kappa-like measure from first principles, dubbing it Informedness, based on an analogy of a Bookmaker associating costs/payoffs based on the odds. This is then proven to measure the proportion of time (or probability) that a decision is informed versus random, based on the same assumptions re expectation as Cohen Kappa; we will thus call it Powers Kappa, and derive a formulation of the corresponding expectation. Powers (2007) further identifies that the dichotomous form of Powers Kappa is equivalent to the Gini coefficient as a deskewed version of the weighted Relative Accuracy proposed by Flach (2003), based on his analysis and deskewing of common evaluation measures in the ROC paradigm. Powers (2007) also identifies that Dichotomous Informedness is equivalent to an empirically derived psychological measure called DeltaP' (Perruchet et al. 2004). DeltaP' (and its dual DeltaP) were derived based on analysis of human word association data – the combination of this empirical observation with the place of DeltaP' as the dichotomous case of Powers' 'Informedness' suggests that human association is in some sense optimal. Powers (2007) also introduces a dual of Informedness that he names Markedness, and shows that the geometric mean of Informedness and Markedness is Matthews Correlation, the nominal analog of Pearson Correlation.
Powers' Informedness is in fact a variant of Kappa with some similarities to Cohen Kappa, but also some advantages over both Cohen and Fleiss Kappa due to its asymmetric relation with Recall; in the dichotomous form of Powers (2007),

Informedness = Recall + InverseRecall − 1
             = (Recall − Bias) / (1 − Prevalence).

If we think of Kappa as assessing the relationship between two raters, Powers' statistic is not evenhanded, and the Informedness and Markedness duals measure the two directions of prediction, normalizing Recall and Precision respectively. In fact, the relationship with Correlation allows these to be interpreted as regression coefficients for the prediction function and its inverse.

1.4 Kappa vs Correlation

It is often asked why we don't just use Correlation to measure. In fact, Castellan (1966) uses Tetrachoric Correlation, another generalization of Pearson Correlation, which assumes that the two class variables are given by underlying normal distributions. Uebersax (1987), Hutchison (1993) and Bonett and Price (2005) each compare Kappa and Correlation and conclude that there does not seem to be any situation where Kappa would be preferable to Correlation. However, all the Kappa and Correlation variants considered were symmetric, and it is thus interesting to consider the separate regression coefficients underlying Correlation that represent the Powers Kappa duals of Informedness and Markedness, which have the advantage of separating out the influences of Prevalence and Bias (which then allows macro-averaging, which is not admissible for any symmetric form of Correlation or Kappa, as we will discuss shortly). Powers (2007) regards Matthews Correlation as an appropriate measure for symmetric situations (like rater agreement) and generalizes the relationships between Correlation and Significance to the Markedness and Informedness measures. The differences between Informedness and Markedness, which relate to mismatches in Prevalence and Bias, mean that the pair of numbers provides further information about the nature of the relationship between the two classifications or raters, whilst the ability to take the geometric mean of (macro-averaged) Informedness and Markedness means that a single Correlation can be provided when appropriate.

Our aim now is therefore to characterize Informedness (and hence its dual Markedness) as a Kappa measure in relation to the families of Kappa measures represented by Cohen and Fleiss Kappa in the dichotomous case. Note that Warrens (2011) shows that a linearly weighted version of Cohen's (1968) Kappa is in fact a weighted average of dichotomous Kappas. Similarly, Powers (2003) shows that his Kappa (Informedness) has this property. Thus it is appropriate to consider the dichotomous case, and from this we can generalize as required.

1.5 Kappa vs Determinant

Warrens (2010c) discusses another commonly used measure, the Odds Ratio ad/bc (in Epidemiology rather than Computer Science or Computational Linguistics). Closely related to this is the Determinant of the Contingency Matrix, dtp = ad−bc = tp−etp (in the Chi-Sqr, Cohen and Powers sense based on independent marginal probabilities). Both show whether the odds favour positives over negatives more for the first rater (real) than the second (predicted) – for the ratio it is if it is greater than one, for the difference it is if it is greater than 0. Note that taking logs of all coefficients would maintain the same relationship, and that the difference of the logs corresponds to the log of the ratio, mapping into the information domain.

Warrens (2010c) further shows (in cost-weighted form) that Cohen Kappa is given by the following (in the notation of this paper, but preferring the notations Prevalence and Inverse Prevalence to rp and rn for clarity):

KC = dtp / [(Prev*IBias + Bias*IPrev)/2]. (5)

Based on the previous characterization of Fleiss Kappa, we can further characterize it by

KF = dtp / [(Prev+Bias)*(IBias+IPrev)/4]. (6)

Powers (2007) also showed corresponding formulations for Bookmaker Informedness (B, or Powers Kappa = KP), Markedness and Matthews Correlation:

B = dtp / [Prev*IPrev]. (7)
M = dtp / [Bias*IBias]. (8)
C = dtp / [√(Prev*IPrev*Bias*IBias)]. (9)

These elegant dichotomous forms are straightforward, with the independence assumptions on Bias and Prevalence clear in Cohen Kappa, the arithmetic means of Bias and Prevalence clear in Fleiss Kappa, and the geometric means of Bias and Prevalence in the Matthews Correlation. Further, the independence of Bias is apparent for Powers Kappa in the Informedness form, and the independence of Prevalence is clear in the Markedness direction.

Note that the names Powers uses suggest that we are measuring something about the information conveyed by the prediction about the class in the case of Informedness, and the information conveyed to the predictor by the class state in the case of Markedness. To the extent that Prevalence and Bias can be controlled independently, Informedness and Markedness are independent, and Correlation represents the joint probability of information being passed in both directions! Powers (2007) further proposes using log formulations of these measures to take them into the information domain, as well as relating them to mutual information, G-squared and chi-squared significance.
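The determinant-normalized forms are easy to check numerically. The sketch below (ours, not the paper's) verifies Eqns (5), (7) and (8) against the direct definitions on one example table.

```python
import math

# Sketch verifying determinant-normalized forms against direct
# definitions, with dtp = ad - bc for cells (a,b,c,d) = (tp,fn,fp,tn)
# summing to 1.

tp, fn, fp, tn = 0.4, 0.2, 0.1, 0.3
prev, iprev = tp + fn, fp + tn       # rp, rn
bias, ibias = tp + fp, fn + tn       # pp, pn
dtp = tp * tn - fn * fp

kc = dtp / ((prev * ibias + bias * iprev) / 2)   # Eqn (5), Cohen
b  = dtp / (prev * iprev)                        # Eqn (7), Informedness
m  = dtp / (bias * ibias)                        # Eqn (8), Markedness
c  = dtp / math.sqrt(prev * iprev * bias * ibias)  # Eqn (9), Matthews

# direct definitions for comparison
e_cohen = prev * bias + iprev * ibias
assert abs(kc - (tp + tn - e_cohen) / (1 - e_cohen)) < 1e-9
assert abs(b - (tp / prev + tn / iprev - 1)) < 1e-9
assert abs(m - (tp / bias + tn / ibias - 1)) < 1e-9
```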
1.6 Kappa vs Concordance

The pairwise approach used by Fleiss Kappa and its relatives does not assume raters use a common distribution, but does assume they are using the same set, and number, of categories. When undertaking comparison of unconstrained ratings or unsupervised learning, this constraint is removed and we need to use a measure of concordance to compare clusterings against each other or against a Gold Standard. Some of the concordance measures use operators in probability space and relate closely to the techniques here, whilst others operate in information space. See Pfitzner et al. (2009) for reviews of clustering comparison/concordance.

A complete coverage of evaluation would also cover significance and the multiple testing problem, but we will confine our focus in this paper to the issue of choice of Kappa or Correlation statistic, as well as addressing some issues relating to the use of macro-averaging. In this paper we are regarding the choice of Bias as under the control of the experimenter, as we have a focus on learned or hand crafted computational linguistics systems. In fact, when we are using bootstrapping techniques or dealing with multiple real samples or different subjects or ecosystems, Prevalence may also vary. Thus the simple marginal assumptions of the Cohen or Powers statistics are the appropriate ones.

1.7 Averaging

We now consider the issue of dealing with multiple measures and results of multiple classifiers by averaging. We first consider averages of some of the individual measures we have seen. The averages need not be arithmetic means, and may represent means over the Prevalences and Biases.

We will be punctuating our theoretical discussions and explanations with empirical demonstrations, where we use 1:1 and 4:1 prevalence versus matching and mismatching bias to generate the chance level contingency based on marginal independence. We then mix in a proportion of informed decisions, with the remaining decisions made by chance. Table 2 compares Accuracy and F-Measure for an informed decision percentage of 0, 100, 15 and -15. Note that Powers Kappa or 'Informedness' purports to recover this proportion or probability.

F-Measure is one of the most common measures in Computational Linguistics and Information Retrieval, being a Harmonic Mean of Recall and Precision, which in the common unweighted form is also interpretable with respect to a mean of Prevalence and Bias:

F = tp / [(Prev+Bias)/2] (10)

Note that like Recall and Precision, F-Measure totally ignores cell D, corresponding to tn. This is an issue when Prevalence and Bias are uneven or mismatched. In Information Retrieval, it is often justified on the basis that the number of irrelevant documents is large and not precisely known, but in fact this is due to lack of knowledge of the number of relevant documents, which affects Recall. In fact, if tn is large with respect to both rp and pp, and thus with respect to the components tp, fp and fn, then both tn/pn and tn/rn approach 1 as tn increases without bound.

As discussed earlier, Rand Accuracy is a bias (prediction label) weighted average of Precision and Inverse Precision, as well as a prevalence (real class) weighted average of Recall and Inverse Recall. It reflects the D (tn) cell, unlike F, and while it does not remove the effect of chance it does not have the positive bias of F.

Acc = tp + tn (11)

We also point out that the differences between the various Kappas shown in Determinant-normalized form in Eqns (5-9) lie only in the way prevalences and biases are averaged together in the normalizing denominator.

Informed        1:1/1:1   4:1/4:1   4:1/1:4
  0%     Acc      50%       68%       32%
         F        50%       80%       32%
100%     Acc     100%      100%      100%
         F       100%      100%      100%
 15%     Acc     57.5%     72.8%     42.2%
         F       57.5%     83%       46.97%
-15%     Acc     42.5%     57.8%     27.2%
         F       42.5%     72%       27.2%

Table 2. Accuracy and F-Measure for different mixes of prevalence and bias skew (odds ratios shown) as well as different proportions of correct (informed) answers versus guessing – negative proportions imply that the informed decisions are deliberately made incorrectly (an oracle tells me what to do and I do the opposite).
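Table 2 can be regenerated with a short mixture sketch. The construction below is our reading of the setup described above (a fraction x of informed, always-correct decisions; the remainder guessed with the stated chance bias, independently of the true class), and the function names and the guess_pp parameter are ours.

```python
# Sketch of the informed-plus-chance mixture behind Table 2 (our
# reconstruction, not the paper's code): fraction x informed and
# correct; the rest guessed positive with probability guess_pp,
# independently of the true class (prevalence rp).

def mixture(x, rp, guess_pp):
    rn = 1 - rp
    tp = x * rp + (1 - x) * rp * guess_pp
    tn = x * rn + (1 - x) * rn * (1 - guess_pp)
    return tp, rn - tn, rp - tp, tn          # tp, fp, fn, tn

def acc_and_f(tp, fp, fn, tn):
    rp, pp = tp + fn, tp + fp
    return tp + tn, tp / ((rp + pp) / 2)     # Rand Accuracy; F, Eqn (10)

# 15% informed, prevalence 4:1, matched chance bias 4:1
a, f = acc_and_f(*mixture(0.15, 0.8, 0.8))   # Acc 72.8%, F 83%
assert abs(a - 0.728) < 1e-9 and abs(f - 0.83) < 1e-9

# 15% informed, prevalence 4:1, mismatched chance bias 1:4
a, f = acc_and_f(*mixture(0.15, 0.8, 0.2))   # Acc 42.2%, F 46.97%
assert abs(a - 0.422) < 1e-9 and abs(f - 0.4697) < 1e-3
```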
From Table 2 we note, in the first set of statistics, that the chance level varies from the 50% expected for Bias = Prevalence = 50%. This is in fact the E(Acc) used in calculating Cohen Kappa. Where Prevalences and Biases are equal and balanced, all common statistics agree – Recall = Precision = Accuracy = F – and they are interpretable with respect to this 50% chance level. All the Kappas will also agree, as the different averages of the identical prevalences and biases all come down to 50% as well. So, subtracting 50% from 57.5% and normalizing (dividing) by the average effective prevalence of 50%, we return 15% informed decisions in all cases (as seen in detail in Table 3).

However, F-Measure gives an inflated estimate when it focuses on the more prevalent positive class, with corresponding bias in the chance component. Worse still is the strength of the Acc and F scores under conditions of matched bias and prevalence when the deviation from chance is -15% – that is, making the wrong decision 15% of the time and guessing the rest of the time. In academic terms, if we bump these rates up to ±25%, F-Measure gives a High Distinction for guessing 75% of the time and putting the right answer for the other 25%, a Distinction for 100% guessing, and a Credit for guessing 75% of the time and putting a wrong answer for the other 25%! In fact, the Powers Kappa corresponds to the methodology of multiple choice marking, where for questions with k+1 choices, a right answer gets 1 mark and a wrong answer gets -1/k, so that guessing achieves an expected mark of 0.
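The multiple-choice analogy can be made concrete. The sketch below is our illustration (not the paper's code), assuming informed answers are always right and the remaining answers are uniform guesses over the k+1 choices.

```python
# Sketch of the multiple choice marking scheme: k+1 choices, +1 for a
# right answer, -1/k for a wrong one. Pure guessing nets an expected
# mark of 0, while a fraction b_informed of informed answers nets
# exactly b_informed.

def expected_mark(b_informed, k):
    # informed answers are always right; the rest guess uniformly
    p_right = b_informed + (1 - b_informed) / (k + 1)
    return p_right - (1 - p_right) / k

assert abs(expected_mark(0.0, 3)) < 1e-12          # guessing scores 0
assert abs(expected_mark(0.15, 3) - 0.15) < 1e-12  # recovers informedness
assert abs(expected_mark(1.0, 4) - 1.0) < 1e-12
```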
Cohen Kappa achieves a very similar result for unbiased guessing strategies.

We now turn to macro-averaging across multiple classifiers or raters. The Area Under the Curve measures are all of this form, whether we are talking about ROC, Kappa, or Recall-Precision curves. The controversy over these averages, and macro-averaging in general, relates to one of two issues: 1. the averages are not in general over the appropriate units or denominators of the individual statistics; or 2. the averages are over a classifier-determined cost function rather than an externally or standardly defined cost function. AUK and H-Measure seek to address these issues, as discussed earlier. In fact, they both boil down to averaging with an inappropriate distribution of weights.

Commonly, macro-averaging averages across classes, i.e. statistics derived for each class are averaged, weighted by the cardinality of the class (viz. prevalence). In our review above we cited four examples, but we will refer only to WEKA (Witten et al., 2005) here, as a commonly used system and associated text book that employs and advocates macro-averaging. WEKA averages over tpr, fpr, Recall (yes, redundantly), Precision, F-Factor and ROC AUC. Only the average over tpr = Recall is actually meaningful, because only it has the number of members of the class, or its prevalence, as its denominator. Precision needs to be macro-averaged over the number of predictions for each class, in which case it is equivalent to micro-averaging. Other micro-averaged statistics are also shown, including Kappa (with the expectation determined from ZeroR – predicting the majority class, leading to a Cohen-like Kappa).

AUC will be pointwise for classifiers that don't provide any probabilistic information associated with label prediction, and thus don't allow varying a threshold for additional points on the ROC or other threshold curves. In the case where multiple threshold points are available, ROC AUC cannot be interpreted as having any relevance to any particular classifier, but is an average over a range of classifiers. Even then it is not so meaningful as AUCH, which should be used, as classifiers on the convex hull are usually available. The AUCH measure will then dominate any individual classifier, as if the convex hull is not the same as the single classifier it must include points that are above the classifier curve, and thus its enclosed area totally includes the area that is enclosed by the individual classifier.

Macro-averaging of the curve, based on each class in turn as the Positive Class and weighted by the size of the positive class, is not meaningful, as effectively shown by Powers (2003) for the special case of the single point curve given its equivalence to Powers Kappa. In fact, Markedness does admit averaging over classes, whilst Informedness requires averaging over predicted labels, as does Precision. The other Kappas and Correlations are more complex (note the denominators in Eqns 5-9), and how they might be meaningfully macro-averaged is an open question. However, micro-averaging can always be done quickly and easily by simply summing all the contingency tables (the true contingency tables are tables of counts, not probabilities, as shown in Table 1). Macro-averaging should never be done except for the special cases of Recall and Markedness, when it is equivalent to the micro-average, which is only slightly more expensive/complicated to do.

Comparison of Kappas

We now turn to exploring the different definitions of Kappas, using the same approach employed with Accuracy and F-Factor in Table 2: we will consider 0%, 100%, 15% and -15% informed decisions, with random decisions modelled on the basis of independent Bias and Prevalence. This clearly biases against the Fleiss family of Kappas, which is entirely appropriate. As pointed out by Entwisle and Powers (1998), the practice of deliberately skewing bias to achieve better statistics is to be deprecated – they used the real-life example of a CL researcher choosing to say water was always a noun because it was a noun more often than not. With the Cohen or Powers measures, any actual power of the system to determine PoS, however weak, would be reflected in an improvement in the scores versus any random choice, whatever the distribution. Recall that choosing one answer all the time corresponds to the extreme points of the chance line in the ROC curve. Studies like Fitzgibbon et al. (2007) and Leibbrandt and Powers (2012) show divergences amongst the conventional and debiased measures, but it is tricky to prove which is better.
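The comparison just described can be sketched in code. Using the same mixture reconstruction as before (our reading of the setup, not the paper's code), the mismatched column of Table 3 with prevalence 4:1, chance bias 1:4 and 15% informed decisions comes out as kPowers 15%, kCohen roughly 8% and kFleiss roughly -17%.

```python
# Sketch comparing Powers, Cohen and Fleiss Kappas on the mismatched
# informed-plus-chance mixture (our reconstruction of one Table 3
# column): prevalence 4:1, chance bias 1:4, 15% informed.

def kappas(tp, fp, fn, tn):
    rp, rn = tp + fn, fp + tn
    pp, pn = tp + fp, fn + tn
    acc = tp + tn
    k = lambda e: (acc - e) / (1 - e)
    e_cohen = rp * pp + rn * pn
    ep, en = (rp + pp) / 2, (rn + pn) / 2
    b = tp / rp + tn / rn - 1        # Powers Kappa = Informedness
    return b, k(e_cohen), k(ep ** 2 + en ** 2)

x, rp, guess_pp = 0.15, 0.8, 0.2
rn = 1 - rp
tp = x * rp + (1 - x) * rp * guess_pp
tn = x * rn + (1 - x) * rn * (1 - guess_pp)
kp, kc, kf = kappas(tp, rn - tn, rp - tp, tn)
assert abs(kp - 0.15) < 1e-9     # Powers recovers the informed proportion
assert round(kc * 100) == 8      # ~7.7%, as rounded in Table 3
assert round(kf * 100) == -17    # ~-16.5%: Fleiss goes negative
```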
Kappa in the Limit

It is however straightforward to derive limits for the various Kappas and Expectations under extreme and central conditions of bias and prevalence, including both match and mismatch. The 36 theoretical results match the mixture model results in Table 3; however, due to space constraints, formal treatment will be limited to two of the more complex cases, which both relate to Fleiss Kappa with its mismatch to the marginal independence assumptions we prefer. These will provide informedness of probability B plus a remaining proportion 1−B of random responses exhibiting extreme bias, versus both neutral and contrary prevalence. Note that we consider only |B|<1, as all Kappas give Acc=1 and thus K=1 for B=1, and only Powers Kappa is designed to work for B<0, giving K=−1 for B=−1.

Recall that the general calculation of Expected Accuracy is

E(Acc) = etp + etn

For Fleiss Kappa we must calculate the expected values of the correct contingencies as discussed previously, with expected probabilities

ep = (rp+pp)/2 & en = (rn+pn)/2 (12)
etp = ep² & etn = en² (13)

We first consider cases where prevalence is extreme and the chance component exhibits inverse bias. We thus consider limits as rp→0, rn→1, pp→1−B, pn→B, in which Acc→B. This gives us (assuming |B|<1)

EF(Acc) = ((1−B)/2)² + ((1+B)/2)² = (1+B²)/2 (14)
KF(Acc) = (1−B)² / (B²−1) (15)

We second consider cases where the prevalence is balanced and the chance component extreme, with rp→0.5, rn→0.5, pp→1−B, pn→B (again with Acc→B), giving

EF(Acc) = 1/2 + (B−1/2)²/2 = 5/8 + B(B−1)/2 (16)
KF(Acc) = [(B−1/2) − (B−1/2)²/2] / [1/2 − (B−1/2)²/2] (17)
        = [B − (5/8 + B(B−1)/2)] / [1 − (5/8 + B(B−1)/2)]
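The first limit can be verified numerically. The sketch below is ours (not the paper's code), and assumes the construction just described: near-zero prevalence, informed fraction B, and the chance component guessing all-positive, so that pp approaches 1−B and Acc approaches B.

```python
# Sketch checking Eqn (15) numerically: as rp -> 0 with an all-positive
# chance component, Fleiss Kappa tends to (1-B)^2/(B^2-1), which is
# negative for all |B| < 1.

def fleiss_kappa(tp, fp, fn, tn):
    rp, rn = tp + fn, fp + tn
    pp, pn = tp + fp, fn + tn
    ep, en = (rp + pp) / 2, (rn + pn) / 2
    e = ep ** 2 + en ** 2
    return (tp + tn - e) / (1 - e)

B, rp = 0.3, 1e-6
rn = 1 - rp
tp = rp          # informed and all-positive guesses both catch the rare positives
tn = B * rn      # only the informed fraction predicts negative
kf = fleiss_kappa(tp, rn - tn, 0.0, tn)
assert abs(kf - (1 - B) ** 2 / (B ** 2 - 1)) < 1e-3
assert kf < 0    # negative despite genuine informedness B
```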
Conclusions

The asymmetric Powers Informedness gives the clearest measure of the predictive value of a system, while the Matthews Correlation (as its geometric mean with the Powers Markedness dual) is appropriate for comparing equally valid classifications or ratings into an agreed number of classes. Concordance measures should be used if the number of classes is not agreed or specified. For the mismatch case (Eqn 15), Fleiss is always negative (for |B|<1) and thus fails to adequately reward good performance under these marginal conditions. For the chance case (Eqn 17), the first form we provide shows that the deviation from matching Prevalence is a driver in a Kappa-like function. Cohen, on the other hand (Table 3), tends to multiply the weight given to error in even mild prevalence-bias mismatch conditions. None of the symmetric Kappas designed for raters are suitable for classifiers.

(Prev skew/Bias skew)  1:1/1:1 4:1/4:1 4:1/1:4 | 1:1/1:1 4:1/4:1 4:1/1:4 | 1:1/1:1 4:1/4:1 4:1/1:4  (all values %)

Informedness     0    0    0 |    0    0    0 |    0    0    0
Prevalence      50   80   80 |   50   80   80 |   50   20   20
IPrevalence     50   20   20 |   50   20   20 |   50   80   80
Bias            50   80   20 |   50   80   20 |   50   20   80
IBias           50   20   80 |   50   20   80 |   50   80   20
SkewR          100   25   25 |  100   25   25 |  100  400  400
SkewP          100   25  400 |  100   25  400 |  100  400   25
OddsRatio      100  100    6 |  100  100    6 |  100  100 1600
ePowers         50   68   32 |   50   68   32 |   50   68   32
eCohen          50   68   32 |   50   68   32 |   50   68   32
eFleiss         50   68   50 |   50   68   50 |   50   68   50
kPowers          0    0    0 |    0    0    0 |    0    0    0
kCohen           0    0    0 |    0    0    0 |    0    0    0
kFleiss          0    0  -36 |    0    0  -36 |    0    0  -36

Informedness   100  100  100 |  100  100  100 |  100  100  100
Prevalence      50   80   80 |   50   80   80 |   50   20   20
IPrevalence     50   20   20 |   50   20   20 |   50   80   80
Bias            50   80   80 |   50   80   80 |   50   20   20
IBias           50   20   20 |   50   20   20 |   50   80   80
SkewR          100   25   25 |  100   25   25 |  100  400  400
SkewP          100   25   25 |  100   25   25 |  100  400  400
OddsRatio      100  100  100 |  100  100  100 |  100  100  100
ePowers         50   68   68 |   50   68   68 |   50   68   68
eCohen          50   68   68 |   50   68   68 |   50   68   68
eFleiss         50   68   68 |   50   68   68 |   50   68   68
kPowers        100  100  100 |  100  100  100 |  100  100  100
kCohen         100  100  100 |  100  100  100 |  100  100  100
kFleiss        100  100  100 |  100  100  100 |  100  100  100

Informedness    15   15   15 |   99   99   99 |    1    1    1
Prevalence      50   80   80 |   50   80   80 |   50   20   20
IPrevalence     50   20   20 |   50   20   20 |   50   80   80
Bias            50   80   29 |   50   80   79 |   50   20   79
IBias           50   20   71 |   50   20   21 |   50   80   21
SkewR          100   25   25 |  100   25   25 |  100  400  400
SkewP          100   25  245 |  100   25   26 |  100  400   26
OddsRatio      100  100    6 |  100  100    6 |  100  100 1600
ePowers         50   68   32 |   50   68   32 |   50   68   32
eCohen          50   68   37 |   50   68   68 |   50   68   32
eFleiss         50   68   50 |   50   68   68 |   50   68   50
kPowers         15   15   15 |   99   99   99 |    1    1    1
kCohen          15   15    8 |   99   99   98 |    1    1    0
kFleiss         15   15  -17 |   99   99   98 |    1    1  -35

Informedness   -15  -15  -15 |  -99  -99  -99 |   -1   -1   -1
Prevalence      50   80   20 |   50   80   80 |   50   20   20
IPrevalence     50   20   80 |   50   20   20 |   50   80   80
Bias            50   71   80 |   50   21   20 |   50   21   80
IBias           50   29   20 |   50   79   80 |   50   79   20
SkewR          100   25  400 |  100   25   25 |  100  400  400
SkewP          100   41   25 |  100  385  400 |  100  385   25
OddsRatio      100   65 1038 |  100   25   25 |  100  104 1542
ePowers         50   63   37 |   50   50   50 |   50   68   32
eCohen          50   63   32 |   50   32   32 |   50   68   32
eFleiss         50   63   50 |   50   50   50 |   50   68   50
kPowers        -15  -15  -15 |  -99  -99  -99 |   -1   -1   -1
kCohen         -15  -13   -7 |  -99  -47  -47 |   -1   -1    0
kFleiss        -15  -14  -46 |  -99  -99  -99 |   -1   -1  -37

Table 3. Empirical results for Accuracy and Kappa for Fleiss/Scott, Cohen and Powers. Shaded cells indicate misleading results, which occur for both the Cohen and Fleiss Kappas.

References

2nd i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data (2008). http://gnode1.mib.man.ac.uk/awards.html (accessed 4 November 2011)
2nd Pascal Challenge on Hierarchical Text Classification. http://lshtc.iit.demokritos.gr/node/48 (accessed 4 November 2011)

N. Ailon and M. Mohri (2010). Preference-based learning to rank. Machine Learning 80:189-211.

A. Ben-David (2008a). About the relationship between ROC curves and Cohen's kappa. Engineering Applications of AI 21:874-882.

A. Ben-David (2008b). Comparison of classification accuracy using Cohen's Weighted Kappa. Expert Systems with Applications 34:825-832.

Y. Benjamini and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57(1):289-300.

D. G. Bonett and R. M. Price (2005). Inferential Methods for the Tetrachoric Correlation Coefficient. Journal of Educational and Behavioral Statistics 30(2):213-225.

J. Carletta (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2):249-254.

N. J. Castellan (1966). On the estimation of the tetrachoric correlation coefficient. Psychometrika 31(1):67-73.

J. Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960:37-46.

J. Cohen (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70:213-220.

B. Di Eugenio and M. Glass (2004). The Kappa Statistic: A Second Look. Computational Linguistics 30(1):95-101.

J. Entwisle and D. M. W. Powers (1998). The Present Use of Statistics in the Evaluation of NLP Parsers. pp. 215-224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998.

S. Fitzgibbon, D. M. W. Powers, K. Pope and C. R. Clark (2007). Removal of EEG noise and artefact using blind source separation. Journal of Clinical Neurophysiology 24(3):232-243.

P. A. Flach (2003). The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, pp. 226-233.

J. L. Fleiss (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.

A. Fraser and D. Marcu (2007). Measuring Word Alignment Quality for Statistical Machine Translation. Computational Linguistics 33(3):293-303.

J. Fürnkranz and P. A. Flach (2005). ROC 'n' Rule Learning – Towards a Better Understanding of Covering Algorithms. Machine Learning 58(1):39-77.

D. J. Hand (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77:103-123.

T. P. Hutchinson (1993). Focus on Psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Research in Nursing & Health 16(4):313-316.

U. Kaymak, A. Ben-David and R. Potharst (2010). AUK: a simple alternative to the AUC. Technical Report, Erasmus Research Institute of Management, Erasmus School of Economics, Rotterdam, NL.

K. Krippendorff (1970). Estimating the reliability, systematic error, and random error of interval data. Educational and Psychological Measurement 30(1):61-70.

K. Krippendorff (1978). Reliability of binary attribute data. Biometrics 34(1):142-144.

J. Lafferty, A. McCallum and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML-2001), San Francisco, CA: Morgan Kaufmann, pp. 282-289.

R. Leibbrandt and D. M. W. Powers (2012). Robust Induction of Parts-of-Speech in Child-Directed Language by Co-Clustering of Words and Contexts. EACL Joint Workshop of ROBUS (Robust Unsupervised and Semi-supervised Methods in NLP) and UNSUP (Unsupervised Learning in NLP).

P. J. G. Lisboa, A. Vellido and H. Wong (2000). Bias reduction in skewed binary classification with Bayesian neural networks. Neural Networks 13:407-410.

R. Lowry (1999). Concepts and Applications of Inferential Statistics. (Published on the web as http://faculty.vassar.edu/lowry/webtext.html.)

C. D. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

J. H. McDonald (2007). The Handbook of Biological Statistics. (Course handbook web published as http://udel.edu/~mcdonald/statpermissions.html)

J. C. Nunnally and I. H. Bernstein (1994). Psychometric Theory (Third ed.). McGraw-Hill.

K. Pearson and D. Heron (1912). On Theories of Association. J. Royal Stat. Soc. LXXV:579-652.

P. Perruchet and R. Peereman (2004). The exploitation of distributional information in syllable processing. J. Neurolinguistics 17:97-119.

D. Pfitzner, R. E. Leibbrandt and D. M. W. Powers (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems 19(3):361-394.

D. M. W. Powers (2003). Recall and Precision versus the Bookmaker. Proceedings of the International Conference on Cognitive Science (ICSC-2003), Sydney, Australia, pp. 529-534. (See http://david.wardpowers.info/BM/index.htm.)

D. M. W. Powers (2008). Evaluation Evaluation. The 18th European Conference on Artificial Intelligence (ECAI'08).

D. M. W. Powers (2007/2011). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. School of Informatics and Engineering, Flinders University, Adelaide, Australia, TR SIE-07-001; Journal of Machine Learning Technologies 2(1):37-63. https://dl-web.dropbox.com/get/Public/201101-Evaluation_JMLT_Postprint-Colour.pdf?w=abcda988

D. M. W. Powers (2012). The Problem of Area Under the Curve. International Conference on Information Science and Technology, ICIST2012, in press.

D. M. W. Powers and A. Atyabi (2012). The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance. SCET2012, in press.

F. Provost and T. Fawcett (2001). Robust classification for imprecise environments. Machine Learning 44:203-231.

L. H. Reeker (2000). Theoretic Constructs and Measurement of Performance and Intelligence in Intelligent Systems. PerMIS 2000. (See http://www.isd.mel.nist.gov/research_areas/research_engineering/PerMIS_Workshop/ accessed 22 December 2007.)

W. A. Scott (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19:321-325.

D. R. Shanks (1995). Is human learning rational? Quarterly Journal of Experimental Psychology 48A:257-279.

T. Sellke, M. J. Bayarri and J. Berger (2001). Calibration of P-values for testing precise null hypotheses. American Statistician 55:62-71. (See http://www.stat.duke.edu/%7Eberger/papers.html#p-value accessed 22 December 2007.)

P. J. Smith, D. S. Rae, R. W. Manderscheid and S. Silbergeld (1981). Approximating the moments and distribution of the likelihood ratio statistic for multinomial goodness of fit. Journal of the American Statistical Association 76(375):737-740.

R. R. Sokal and F. J. Rohlf (1995). Biometry: The principles and practice of statistics in biological research, 3rd ed. New York: WH Freeman and Company.

J. Uebersax (1987). Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin 101:140-146.

J. Uebersax (2009). http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm (accessed 24 February 2011)

M. J. Warrens (2010a). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification 4:271-286.

M. J. Warrens (2010b). A formal proof of a paradox associated with Cohen's kappa. Journal of Classification 27:322-332.

M. J. Warrens (2010c). A Kraemer-type rescaling that transforms the Odds Ratio into the Weighted Kappa Coefficient. Psychometrika 75(2):328-330.

M. J. Warrens (2011). Cohen's linearly weighted Kappa is a weighted average of 2x2 Kappas. Psychometrika 76(3):471-486.

D. A. Williams (1976). Improved Likelihood Ratio Tests for Complete Contingency Tables,
User Edits Classification Using Document Revision Histories

Amit Bronner and Christof Monz
Informatics Institute, University of Amsterdam

[email protected] [email protected]

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 356-366, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics.

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.

1 Introduction

Many online collaborative editing projects such as Wikipedia (http://www.wikipedia.org) keep track of complete revision histories. These contain valuable information about the evolution of documents in terms of content as well as language, style and form. Such data is publicly available in large volumes and constantly growing. According to Wikipedia statistics, in August 2011 the English Wikipedia contained 3.8 million articles with an average of 78.3 revisions per article. The average number of revision edits per month is about 4 million in English and almost 11 million in total for all languages (average for the 5-year period between August 2006 and August 2011; the count includes edits by registered users, anonymous users, software bots and reverts; source: http://stats.wikimedia.org).

Exploiting document revision histories has proven useful for a variety of natural language processing (NLP) tasks, including sentence compression (Nelken and Yamangil, 2008; Yamangil and Nelken, 2008) and simplification (Yatskar et al., 2010; Woodsend and Lapata, 2011), information retrieval (Aji et al., 2010; Nunes et al., 2011), textual entailment recognition (Zanzotto and Pennacchiotti, 2010), and paraphrase extraction (Max and Wisniewski, 2010; Dutrey et al., 2011).

The ability to distinguish between factual edits, which alter the meaning, and fluency edits, which improve the style or readability, is a crucial requirement for approaches exploiting revision histories. The need for an automated classification method has been identified (Nelken and Yamangil, 2008; Max and Wisniewski, 2010), but to the best of our knowledge it has not been directly addressed. Previous approaches have either applied simple heuristics (Yatskar et al., 2010; Woodsend and Lapata, 2011) or manual annotations (Dutrey et al., 2011) to restrict the data to the type of edits relevant to the NLP task at hand.

The work described in this paper shows that it is possible to automatically distinguish between factual and fluency edits. This is very desirable as it does not rely on heuristics, which often generalize poorly, and does not require manual annotation beyond a small collection of training data, thereby allowing for much larger data sets of revision histories to be used for NLP research.

In this paper, we make the following novel contributions. We address the problem of automated classification of user edits as factual or fluency edits by defining the scope of user edits, extracting a large collection of such user edits from the English Wikipedia, constructing a manually labeled dataset, and setting up a classification baseline. A set of features is designed and integrated into a supervised machine learning framework. It is composed of language model probabilities and string similarity measured over different representations, including part-of-speech tags and named entities. Despite their relative simplicity, the features achieve high classification accuracy when applied to contiguous edit segments. We go beyond labeled data and exploit large amounts of unlabeled data. First, we demonstrate that the trained classifier generalizes to thousands of examples identified by user comments as specific types of fluency edits. Furthermore, we introduce a new method for extracting features from an evolving set of unlabeled user edits. This method is successfully evaluated as an alternative or supplement to the initial supervised approach.

2 Related Work

The need for user edits classification is implicit in studies of Wikipedia edit histories. For example, Viegas et al. (2004) use revision size as a simplified measure for the change of content, and Kittur et al. (2007) use metadata features to predict user edit conflicts.

Classification becomes an explicit requirement when exploiting edit histories for NLP research. Yamangil and Nelken (2008) use edits as training data for sentence compression. They make the simplifying assumption that all selected edits retain the core meaning. Zanzotto and Pennacchiotti (2010) use edits as training data for textual entailment recognition. In addition to manually labeled edits, they use Wikipedia user comments and a co-training approach to leverage unlabeled edits. Woodsend and Lapata (2011) and Yatskar et al. (2010) use Wikipedia comments to identify relevant edits for learning sentence simplification.

The work by Max and Wisniewski (2010) is closely related to the approach proposed in this paper. They extract a corpus of rewritings, distinguish between weak and strong semantic differences, and present a typology of multiple subclasses. Spelling corrections are heuristically identified but the task of automatic classification is deferred. Follow-up work by Dutrey et al. (2011) focuses on automatic paraphrase identification using a rule-based approach and manually annotated examples.

Wikipedia vandalism detection is a user edits classification problem addressed by a yearly competition (since 2010) in conjunction with the CLEF conference (Potthast et al., 2010; Potthast and Holfeld, 2011). State-of-the-art solutions involve supervised machine learning using various content and metadata features. Content features use spelling, grammar, and character- and word-level attributes. Many of them are relevant for our approach. Metadata features allow detection by patterns of usage, time and place, which are generally useful for the detection of online malicious activities (West et al., 2010; West and Lee, 2011). We deliberately refrain from using such features.

A wide range of methods and approaches has been applied to the similar tasks of textual entailment and paraphrase recognition; see Androutsopoulos and Malakasiotis (2010) for a comprehensive review. These are all related because paraphrases and bidirectional entailments represent types of fluency edits.

A different line of research uses classifiers to predict sentence-level fluency (Zwarts and Dras, 2008; Chae and Nenkova, 2009). These could be useful for fluency edit detection. Alternatively, user edits could be a potential source of human-produced training data for fluency models.

3 Definition of User Edits Scope

Within our approach we distinguish between edit segments, which represent the comparison (diff) between two document revisions, and user edits, which are the input for classification.

An edit segment is a contiguous sequence of deleted, inserted or equal words. The difference between two document revisions (v_i, v_j) is represented by a sequence of edit segments E. Each edit segment (δ, w_1^m) ∈ E is a pair, where δ ∈ {deleted, inserted, equal} and w_1^m is an m-word substring of v_i, of v_j, or of both (respectively).

A user edit is a minimal set of sentences overlapping with deleted or inserted segments. Given the two sets of revision sentences (S_vi, S_vj), let

    φ(δ, w_1^m) = {s ∈ S_vi ∪ S_vj | w_1^m ∩ s ≠ ∅}    (1)

be the subset of sentences overlapping with a given edit segment, and let

    ψ(s) = {(δ, w_1^m) ∈ E | w_1^m ∩ s ≠ ∅}    (2)

be the subset of edit segments overlapping with a given sentence. A user edit is a pair (pre ⊆ S_vi, post ⊆ S_vj) where

    ∀s ∈ pre ∪ post, ∀δ ∈ {deleted, inserted}, ∀w_1^m:
        (δ, w_1^m) ∈ ψ(s) → φ(δ, w_1^m) ⊆ pre ∪ post    (3)

    ∃s ∈ pre ∪ post, ∃δ ∈ {deleted, inserted}, ∃w_1^m:
        (δ, w_1^m) ∈ ψ(s)    (4)
Table 1 illustrates different types of edit segments and user edits. The term replaced segment refers to adjacent deleted and inserted segments. Example (1) contains a replaced segment because the deleted segment ("1700s") is adjacent to the inserted segment ("18th century"). Example (2) contains an inserted segment ("and largest professional"), a replaced segment ("(est." → "established in") and a deleted segment (")"). The user edits of both examples consist of a single pre sentence and a single post sentence because the deleted and inserted segments do not cross any sentence boundary. Example (3) contains a replaced segment (". He" → "who"). In this case the deleted segment (". He") overlaps with two sentences and therefore the user edit consists of two pre sentences.

(1) Revisions 368209202 & 378822230
  pre:  "By the mid 1700s, Medzhybizh was the seat of power in Podilia Province."
  post: "By the mid 18th century, Medzhybizh was the seat of power in Podilia Province."
  diff: (equal, "By the mid"), (deleted, "1700s"), (inserted, "18th century"), (equal, ", Medzhybizh was the seat of power in Podilia Province.")

(2) Revisions 148109085 & 149440273
  pre:  "Original Society of Teachers of the Alexander Technique (est. 1958)."
  post: "Original and largest professional Society of Teachers of the Alexander Technique established in 1958."
  diff: (equal, "Original"), (inserted, "and largest professional"), (equal, "Society of Teachers of the Alexander Technique"), (deleted, "(est."), (inserted, "established in"), (equal, "1958"), (deleted, ")"), (equal, ".")

(3) Revisions 61406809 & 61746002
  pre:  "Fredrik Modin is a Swedish ice hockey left winger.", "He is known for having one of the hardest slap shots in the NHL."
  post: "Fredrik Modin is a Swedish ice hockey left winger who is known for having one of the hardest slap shots in the NHL."
  diff: (equal, "Fredrik Modin is a Swedish ice hockey left winger"), (deleted, ". He"), (inserted, "who"), (equal, "is known for having one of the hardest slap shots in the NHL.")

Table 1: Examples of user edits and the corresponding edit segments (revision numbers correspond to the English Wikipedia).

4 Features for Edits Classification

We design a set of features for supervised classification of user edits. The design is guided by two main considerations: simplicity and interoperability. Simplicity is important because there are potentially hundreds of millions of user edits to be classified. This amount continues to grow at a rapid pace and a scalable solution is required. Interoperability is important because millions of user edits are available in multiple languages. Wikipedia is a flagship project, but there are other collaborative editing projects. The solution should preferably be language- and project-independent. Consequently, we refrain from deeper syntactic parsing, Wikipedia-specific features, and language resources that are limited to English.

Our basic intuition is that longer edits are likely to be factual and shorter edits are likely to be fluency edits. The baseline method is therefore character-level edit distance (Levenshtein, 1966) between pre- and post-edited text.

Six feature categories are added to the baseline. Most features take the form of threefold counts referring to deleted, inserted and equal elements of each user edit. For instance, example (1) in Table 1 has one deleted token, two inserted tokens and 14 equal tokens. Many features use string similarity calculated over alternative representations.

Character-level features include counts of deleted, inserted and equal characters of different types, such as word and non-word characters or digits and non-digits. Character types may help identify edit types. For example, the change of digits may suggest a factual edit while the change of non-word characters may suggest a fluency edit.

Word-level features count deleted, inserted and equal words using three parallel representations: original case, lower case, and lemmas. Word-level edit distance is calculated for each representation. Table 2 illustrates how edit distance may vary across different representations.

Rep.       User Edit                                   Dist
Words      pre:  Branch lines were built in Kenya       4
           post: A branch line was built in Kenya
Lowcase    pre:  branch lines were built in kenya       3
           post: a branch line was built in kenya
Lemmas     pre:  branch line be build in Kenya          1
           post: a branch line be build in Kenya
PoS tags   pre:  NN NNS VBD VBN IN NNP                  2
           post: DT NN NN VBD VBN IN NNP
NE tags    pre:  LOCATION                               0
           post: LOCATION

Table 2: Word- and tag-level edit distance measured over different representations (example from Wikipedia revisions 2678278 & 2682972).

Fluency edits may shift words, which sometimes may be slightly modified. Fluency edits may also add or remove words that already appear in context. Optimal calculation of edit distance with shifts is computationally expensive (Shapira and Storer, 2002). Translation error rate (TER) provides an approximation but it is designed for the needs of machine translation evaluation (Snover et al., 2006). To have a more sensitive estimation of the degree of edit, we compute the minimal character-level edit distance between every pair of words that belong to different edit segments. For each pair of edit segments (δ, w_1^m), (δ', w'_1^k) overlapping with a user edit, if δ ≠ δ' we compute:

    ∀w ∈ w_1^m:  min_{w' ∈ w'_1^k} EditDist(w, w')    (5)

Binned counts of the number of words with a minimal edit distance of 0, 1, 2, 3 or more characters are accumulated per edit segment type (equal, deleted or inserted).

Part-of-speech (PoS) features include counts of deleted, inserted and equal PoS tags (per tag) and edit distance at the tag level between PoS tags before and after the edit. Similarly, named-entity (NE) features include counts of deleted, inserted and equal NE tags (per tag, excluding OTHER) and edit distance at the tag level between NE tags before and after the edit. Table 2 illustrates the edit distance at different levels of representation. We assume that a deleted NE tag, e.g. PERSON or LOCATION, could indicate a factual edit. It could however be a fluency edit where the NE is replaced by a co-referent like "she" or "it". Even if we encounter an inserted PRP PoS tag, the features do not capture the explicit relation between the deleted NE tag and the inserted PoS tag. This is an inherent weakness of these features when compared to parsing-based alternatives.

An additional set of counts, NE values, describes the number of deleted, inserted and equal normalized values of numeric entities such as numbers and dates. For instance, if the word "100" is replaced by "200" and the respective numeric values 100.0 and 200.0 are normalized, the counts of deleted and inserted NE values will be incremented and suggest a factual edit. If on the other hand "100" is replaced by "hundred" and the latter is normalized as having the numeric value 100.0, then the count of equal NE values will be incremented, rather suggesting a fluency edit.

Acronym features count deleted, inserted and equal acronyms. Potential acronyms are extracted from word sequences that start with a capital letter and from words that contain multiple capital letters. If, for example, "UN" is replaced by "United Nations", "MicroSoft" by "MS" or "Jean Pierre" by "J.P", the count of equal acronyms will be incremented, suggesting a fluency edit.

The last category, language model (LM) features, takes a different approach. These features look at n-gram based sentence probabilities before and after the edit, with and without normalization with respect to sentence lengths. The ratio of the two probabilities, P̂_ratio(pre, post), is computed as follows:

    P̂(w_1^m) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n+1}^{i-1})    (6)

    P̂_norm(w_1^m) = P̂(w_1^m)^(1/m)    (7)

    P̂_ratio(pre, post) = P̂_norm(post) / P̂_norm(pre)    (8)

    log P̂_ratio(pre, post) = log P̂_norm(post) - log P̂_norm(pre)
                           = (1/|post|) log P̂(post) - (1/|pre|) log P̂(pre)    (9)

where P̂ is the sentence probability estimated as a product of n-gram conditional probabilities and P̂_norm is the sentence probability normalized by the sentence length. We hypothesize that the relative change of normalized sentence probabilities is related to the edit type. As an additional feature, the number of out-of-vocabulary (OOV) words before and after the edit is computed. The intuition is that unknown words are more likely to be indicative of factual edits.
ment types and edit distance intervals is detailed. Manual labeling of user edits is carried out by a group of annotators with near native or native level of English. All annotators receive the same is that unknown words are more likely to be in- written guidelines. In short, fluency labels are dicative of factual edits. assigned to edits of letter case, spelling, gram- mar, synonyms, paraphrases, co-referents, lan- 5 Experiments guage and style. Factual labels are assigned to 5.1 Experimental Setup edits of dates, numbers and figures, named enti- ties, semantic change or disambiguation, addition First, we extract a large amount of user edits from or removal of content. A random set of 2,676 in- revision histories of the English Wikipedia.3 The stances is labeled: 2,008 instances with a majority extraction process scans pairs of subsequent re- agreement of at least two annotators are selected visions of article pages and ignores any revision as training set, 270 instances are held out as de- that was reverted due to vandalism. It parses the velopment set, 164 trivial fluency corrections of a Wikitext and filters out markup, hyperlinks, tables single letter’s case and 234 instances with no clear and templates. The process analyzes the clean text agreement among annotators are excluded. The of the two revisions4 and computes the difference last group (8.7%) emphasizes that the task is, to between them.5 The process identifies the overlap a limited extent, subjective. It suggests that auto- between edit segments and sentence boundaries mated classification of certain user edits would be and extracts user edits. Features are calculated difficult. Nevertheless, inter-rater agreement be- and user edits are stored and indexed. LM features tween annotators is high to very high. 
Kappa val- are calculated against a large English 4-gram lan- ues between 0.74 to 0.84 are measured between 3 Dump of all pages with complete edit history as of Jan- six pairs of annotators, each pair annotated a com- uary 15, 2011 (342GB bz2), http://dumps.wikimedia.org. mon subset of at least 100 instances. Table 3 de- 4 Tokenization, sentence split, PoS & NE tags by Stanford scribes the resulting dataset, which we also make CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml. 5 Myers’ O(N D) difference algorithm (Myers, 1986), available to the research community.6 http://code.google.com/p/google-diff-match-patch. 6 Available for download at http://staff. 360 Character-level Edit Distance Feature set SVM RF Logit flu. / fac. flu. / fac. flu. / fac. .≤4 >4& Baseline 0.85 / 0.67 0.74 / 0.79 0.85 / 0.67 + Char-level 0.85 / 0.82 0.83 / 0.86 0.86 / 0.82 Fluency (725) Factual (821) + Word-level 0.88 / 0.69 0.81 / 0.82 0.86 / 0.70 Factual (179) Fluency (283) + PoS 0.85 / 0.68 0.78 / 0.76 0.84 / 0.72 + NE 0.86 / 0.79 0.79 / 0.87 0.87 / 0.78 Figure 1: A decision tree that uses character-level + Acronyms 0.87 / 0.66 0.83 / 0.70 0.86 / 0.68 edit distance as a sole feature. The tree correctly + LM 0.85 / 0.67 0.79 / 0.76 0.84 / 0.69 All Features 0.88 / 0.86 0.86 / 0.88 0.87 / 0.84 classifies 76% of the labeled user edits. Table 5: Fraction of correctly classified edits per Feature set SVM RF Logit type: fluency edits (left) and factual edits (right), Baseline 76.26% 76.26% 76.34% using the baseline, each feature set added to the + Char-level 83.71%† 84.45%† 84.01%† baseline, and all features combined. + Word-level 78.38%†∨ 81.38%†∧ 78.13%†∨ + PoS 76.58%∨ 76.97% 78.35%†∧ + NE 82.71%† 83.12%† 82.38%† + Acronyms 76.55% 76.61% 76.96% line. Then each one of the feature groups is sep- + LM 76.20% 77.42% 76.52% arately added to the baseline. Finally, all features All Features 87.14%†∧ 87.14%† 85.64%†∨ are evaluated together. 
Table 4 reports the per- centage of correctly classified edits (classifiers’ Table 4: Classification accuracy using the base- accuracy), and Table 5 reports the fraction of cor- line, each feature set added to the baseline, and rectly classified edits per type. All results are for all features combined. Statistical significance at 10-fold cross validation. Statistical significance p < 0.05 is indicated by † w.r.t the baseline (us- against the baseline and between classifiers is cal- ing the same classifier), and by ∧ w.r.t to another culated at p < 0.05 using paired t-test. classifier marked by ∨ (using the same features). The first interesting result is the highly predic- Highest accuracy per classifier is marked in bold. tive power of the single-feature baseline. It con- firms the intuition that longer edits are mainly fac- tual. Figure 1 shows that the edit distance of 72% 5.2 Feature Analysis of the user edits labeled as fluency is between 1 to We experiment with three classifiers: Support 4, while the edit distance of 82% of those labeled Vector Machines (SVM), Random Forests (RF) as factual is greater than 4. The cut-off value is and Logistic Regression (Logit).7 SVMs (Cortes found by a single-node decision tree that uses edit and Vapnik, 1995) and Logistic Regression (or distance as a sole feature. The tree correctly clas- Maximum Entropy classifiers) are two widely sifies 76% of the instances. This result implies used machine learning techniques. SVMs have that the actual challenge is to correctly classify been applied to many text classification problems short factual edits and long fluency edits. (Joachims, 1998). Maximum Entropy classifiers Character-level features and named-entity fea- have been applied to the similar tasks of para- tures lead to significant improvements over the phrase recognition (Malakasiotis, 2009) and tex- baseline for all classifiers. Their strength lies in tual entailment (Hickl et al., 2006). 
their ability to identify short factual edits such as changes of numeric values or proper names. Word-level features also significantly improve the baseline, but their contribution is smaller. PoS and acronym features lead to small, statistically insignificant improvements over the baseline.

Random Forests (Breiman, 2001), as well as other decision tree algorithms, have been used successfully for classifying Wikipedia edits for the purpose of vandalism detection (Potthast et al., 2010; Potthast and Holfeld, 2011). Experiments begin with the edit-distance baseline.

[Footnote: science.uva.nl/~abronner/uec/data.]
[Footnote 7: Using Weka classifiers: SMO (SVM), RandomForest, and Logistic (Hall et al., 2009). Classifier parameters are tuned using the held-out development set.]

The poor contribution of LM features is surprising. It might be due to the limited context of n-grams, but it might also be that LM probabilities are not a good predictor for the task. Removing LM features from the set of all features leads to a small decrease in classification accuracy, namely 86.68% instead of 87.14% for SVM. This decrease is not statistically significant.

The highest accuracy is achieved by both SVM and RF, and there are few significant differences among the three classifiers. The fraction of correctly classified edits per type (Table 5) reveals that for SVM and Logit, most fluency edits are correctly classified by the baseline, and most improvements over the baseline are attributed to better classification of factual edits. This is not the case for RF, where the fraction of correctly classified factual edits is higher and the fraction of correctly classified fluency edits is lower. This insight motivates further experimentation: repeating the experiment with a meta-classifier that uses a majority voting scheme achieves an improved accuracy of 87.58%. This improvement is not statistically significant.

5.3 Error Analysis

To gain a better understanding of the errors made by the classifier, 50 fluency edit misclassifications and 50 factual edit misclassifications are randomly selected and manually examined. The errors are grouped into categories as summarized in Table 6. These explain certain limitations of the classifier and suggest possible improvements.

Table 6: Error types based on manual examination of 50 fluency edit misclassifications and 50 factual edit misclassifications.

  Fluency Edits Misclassified as Factual
    Equivalent or redundant in context       14
    Paraphrases                              13
    Equivalent numeric patterns               7
    Replacing first name with last name       4
    Acronyms                                  4
    Non-specific adjectives or adverbs        3
    Other                                     5
  Factual Edits Misclassified as Fluency
    Short correction of content              35
    Opposites                                 3
    Similar names                             3
    Noise (unfiltered vandalism)              3
    Other                                     6

Table 7: Examples of correctly classified user edits. Deleted segments are struck out and inserted segments are bold in the original; here they are rendered as [deleted -> inserted] (revision numbers are omitted for brevity).

  Correctly Classified Fluency Edits
    "Adventure education [makes intentional use of -> intentionally uses] challenging experiences for learning."
    "He served as president from October 1, 1985 [and retired -> through his retirement] on June 30, 2002."
    "In 1973, he [helped organize -> assisted in organizing] his first ever visit to the West."
  Correctly Classified Factual Edits
    "Over the course of the next [two years -> five months], the unit completed a series of daring raids."
    "Scottish born David Tennant has reportedly said he would like his Doctor to wear a kilt."
    "This family joined the strip in [late 1990 -> around March 1991]."

Fluency edit misclassifications: 14 instances (28%) are phrases (often co-referents) that are either equivalent or redundant in the given context. For example: "in 1986" -> "that year", "when she returned" -> "when Ruffa returned", and "the core member of the group are" -> "the core members are". 13 (26%) are paraphrases misclassified as factual edits; examples are "made cartoons" -> "produced animated cartoons" and "with the implication that they are similar to" -> "implying a connection to". 7 modify numeric patterns that do not change the meaning, such as the year "37" -> "1937". 4 replace the first name of a person with the last name. 4 contain acronyms, e.g., "Display PostScript" -> "Display PostScript (or DPS)"; the acronym features are correctly identified, but the classifier fails to recognize a fluency edit. 3 modify adjectives or adverbs that do not change the meaning, such as "entirely" and "various".

Factual edit misclassifications: the large majority, 35 instances (70%), could be characterized as short corrections, often replacing a similar word, that make the content more accurate or more precise. Examples (context is omitted): "city" -> "village", "emigrated" -> "immigrated", and "electrical" -> "electromagnetic". 3 are opposites or antonyms, such as "previous" -> "next" and "lived" -> "died". 3 are modifications of similar person or entity names, e.g., "Kelly" -> "Kate". 3 are instances of unfiltered vandalism, i.e., noisy examples. Other misclassifications include verb tense modifications such as "is" -> "was" and "consists" -> "consisted". These are difficult to classify because the modification of verb tense in a given context is sometimes factual and sometimes a fluency edit.

Table 8: Classifying unlabeled data selected by user comments that suggest a fluency edit. The SVM classifier is trained using the labeled data. User comments are not used as features.

  Comment      Test Set Size   Classified as Fluency Edits
  "grammar"        1,122            88.9%
  "spelling"       2,893            97.6%
  "typo"           3,382            91.6%
  "copyedit"       3,437            68.4%
  Random set       5,000            49.4%

Table 9: User edits replacing the word "first" with another single word: most frequent 5 out of 524.

  Replaced by   Frequency   Edit class
  "second"        144        Factual
  "First"          38        Fluency
  "last"           31        Factual
  "1st"            22        Fluency
  "third"          22        Factual

Table 10: Fluency edits replacing the word "He" with a proper noun: most frequent 10 out of 1,381.

  Replaced by   Frequency     Replaced by    Frequency
  "Adams"           7         "Squidward"        6
  "Joseph"          7         "Alexander"        5
  "Einstein"        6         "Davids"           5
  "Galland"         6         "Haim"             5
  "Lowe"            6         "Hickes"           5
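The majority-voting meta-classifier mentioned above combines the per-edit decisions of the three base classifiers. A minimal illustrative sketch follows; it is not the authors' Weka-based implementation, and the stub classifiers and feature names are hypothetical stand-ins so that the example is self-contained.

```python
# Illustrative sketch of a majority-voting meta-classifier over three base
# classifiers. The real system uses Weka's SMO, RandomForest and Logistic;
# the threshold rules below are hypothetical. Labels: 0 = fluency, 1 = factual.
from collections import Counter

def majority_vote(classifiers, edit_features):
    """Return the label predicted by the majority of the base classifiers."""
    votes = Counter(clf(edit_features) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for the three trained classifiers.
svm_like      = lambda f: 1 if f["word_edit_distance"] >= 3 else 0
forest_like   = lambda f: 1 if f["char_edit_distance"] >= 8 else 0
logistic_like = lambda f: 1 if f["numeric_value_changed"] else 0

edit = {"word_edit_distance": 3, "char_edit_distance": 4,
        "numeric_value_changed": True}
label = majority_vote([svm_like, forest_like, logistic_like], edit)  # two of three vote factual
```

With an odd number of voters there are no ties; in the experiment above this scheme improved accuracy to 87.58%, though not significantly.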
These findings agree with the feature analysis: fluency edit misclassifications are typically longer phrases that carry the same meaning, while factual edit misclassifications are typically single words or short phrases that carry a different meaning. The main conclusion is that the classifier should take explicit content and context into account. Putting aside the considerations of simplicity and interoperability, features based on co-reference resolution and paraphrase recognition are likely to improve fluency edit classification, and features from language resources that describe synonymy and antonymy relations are likely to improve factual edit classification. While this conclusion may come as no surprise, it is important to highlight the high classification accuracy that is achieved without such capabilities and resources. Table 7 presents several examples of correct classifications produced by our classifier.

6 Exploiting Unlabeled Data

We extracted a large set of user edits, but our approach has so far been limited to a restricted number of labeled examples. This section attempts to find out whether the classifier generalizes beyond labeled data and whether unlabeled data could be used to improve classification accuracy.

6.1 Generalizing Beyond Labeled Data

The aim of the next experiment is to test how well the supervised classifier generalizes beyond the labeled test set. The problem is the availability of test data: there is no shared task for user edits classification and no common test set to evaluate against. We resort to Wikipedia user comments. This is a problematic option because it is unreliable. Users may add a comment when submitting an edit, but it is not mandatory; the comment is free text with no predefined structure and could be meaningful or nonsense; and the comment is per revision, so it may refer to one, some, or all edits submitted for a given revision. Nevertheless, we identify several keywords that represent certain types of fluency edits: "grammar", "spelling", "typo", and "copyedit". The first three clearly indicate grammar and spelling corrections. The last indicates a correction of format and style, but also of the accuracy of the text, and therefore only represents a bias towards fluency edits.

We extract unlabeled edits whose comment is equal to one of the keywords and construct a test set per keyword. An additional test set consists of randomly selected unlabeled edits with any comment. The five test sets are classified by the SVM classifier trained using the labeled data and the set of all features. To remove any doubt: user comments are not part of any feature of the classifier.

The results in Table 8 show that most unlabeled edits whose comments are "grammar", "spelling" or "typo" are indeed classified as fluency edits. The classification of edits whose comment is "copyedit" is biased towards fluency edits, but, as expected, the result is less distinct. The classification of the random set is balanced, as expected.

6.2 Features from Unlabeled Data

The purpose of the last experiment is to exploit unlabeled data in order to extract additional features for the classifier. The underlying assumption is that reoccurring patterns may indicate whether a user edit is factual or a fluency edit.

We could assume that fluency edits would reoccur across many revisions, while factual edits would only appear in revisions of specific documents. However, this assumption does not necessarily hold: Table 9 gives a simple example of single-word replacements for which the most reoccurring edit is actually factual, and other factual and fluency edits reoccur with similar frequencies.

Finding user edit reoccurrence is not trivial. We could rely on exact matches of surface forms, but this may lead to data sparseness issues. Fluency edits that exchange co-referents and proper nouns, as illustrated by the example in Table 10, may reoccur frequently, but this fact could not be revealed by exact matching of specific proper nouns. On the other hand, using a bag-of-words approach may find too many unrelated edits.

We introduce a two-step method that measures the reoccurrence of edits in unlabeled data using exact and approximate matching over multiple representations. The method provides a set of frequencies that is fed into the classifier and allows for learning subtle patterns of reoccurrence. Staying consistent with our initial design considerations, the method is simple and interoperable. Given a user edit (pre, post), the method does not compare pre with post in any way. It only compares pre with pre-edited sentences of other unlabeled edits, and post with post-edited sentences of other unlabeled edits. The first step is to select candidates using a bag-of-words approach. The second step is a comparison of the user edit with each one of the candidates while incrementing counts of similarity measures. These account for exact matches between different representations (original and lower case, lemmas, PoS and NE tags) as well as for approximate matches using character- and word-level edit distance between those representations. An additional feature is the number of distinct documents in the candidate set.

We compute the set of features for the labeled dataset based on the unlabeled data. The number of candidates is set to 1,000 per user edit. We re-train the classifiers using five configurations: Baseline and All Features are identical to the first experiment; Unlabeled only uses the new feature set without any other features; Base + Unlabeled adds the new feature set to the baseline; All + Unlabeled uses all available features. All results are for 10-fold cross-validation with statistical significance at p < 0.05 by paired t-test; see Table 11.

Table 11: Classification accuracy using features from unlabeled data. The first two rows are identical to Table 4. Statistical significance at p < 0.05 is indicated by: † w.r.t. the baseline; ‡ w.r.t. all features excluding features from unlabeled data; and ∧ w.r.t. another classifier marked by ∨ (using the same features). The best result is marked in bold.

  Feature set        SVM         RF           Logit
  Baseline           76.26%      76.26%       76.34%
  All Features       87.14%†∧    87.14%†      85.64%†∨
  Unlabeled only     78.11%∨     83.49%†∧     78.78%†∨
  Base + unlabeled   80.86%†∨    85.45%†∧     81.83%†∨
  All + unlabeled    87.23%      88.35%‡†∧    85.92%∨

We find that features extracted from unlabeled data outperform the baseline and lead to statistically significant improvements when added to it. The combination of all features allows Random Forests to achieve the highest, statistically significant accuracy level of 88.35%.

7 Conclusions

This work addresses the task of classifying user edits as factual or fluency edits. It adopts a supervised machine learning approach and uses character- and word-level features, part-of-speech tags, named entities, language model probabilities, and a set of features extracted from large amounts of unlabeled data. Our experiments with contiguous user edits extracted from revision histories of the English Wikipedia achieve high classification accuracy and demonstrate generalization to data beyond the labeled edits.

Our approach shows that machine learning techniques can successfully distinguish between user edit types, making them a favorable alternative to heuristic solutions. The simple and adaptive nature of our method allows for application to large and evolving sets of user edits.

Acknowledgments. This research was funded in part by the European Commission through the CoSyne project FP7-ICT-4-248531.

References

A. Aji, Y. Wang, E. Agichtein, and E. Gabrilovich. 2010. Using the past to score the present: Extending term weighting models through revision history analysis. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 629-638.

I. Androutsopoulos and P. Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38(1):135-187.

L. Breiman. 2001. Random forests. Machine Learning, 45(1):5-32.

J. Chae and A. Nenkova. 2009. Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 139-147.

C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273-297.

C. Dutrey, D. Bernhard, H. Bouamor, and A. Max. 2011. Local modifications and paraphrases in Wikipedia's revision history. Procesamiento del Lenguaje Natural, Revista no 46:51-58.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18.

A. Hickl, J. Williams, J. Bensley, K. Roberts, B. Rink, and Y. Shi. 2006. Recognizing textual entailment with LCC's GROUNDHOG system. In Proceedings of the Second PASCAL Challenges Workshop.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, pages 137-142.

A. Kittur, B. Suh, B.A. Pendleton, and E.H. Chi. 2007. He says, she says: Conflict and coordination in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 453-462.

V.I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707-710.

P. Malakasiotis. 2009. Paraphrase recognition using machine learning to combine similarity measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 27-35.

A. Max and G. Wisniewski. 2010. Mining naturally-occurring corrections and paraphrases from Wikipedia's revision history. In Proceedings of LREC, pages 3143-3148.

E.W. Myers. 1986. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251-266.

R. Nelken and E. Yamangil. 2008. Mining Wikipedia's article revision history for training computational linguistics algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 31-36.

S. Nunes, C. Ribeiro, and G. David. 2011. Term weighting based on document revision history. Journal of the American Society for Information Science and Technology, 62(12):2471-2478.

M. Potthast and T. Holfeld. 2011. Overview of the 2nd international competition on Wikipedia vandalism detection. Notebook for PAN at CLEF 2011.

M. Potthast, B. Stein, and T. Holfeld. 2010. Overview of the 1st international competition on Wikipedia vandalism detection. Notebook Papers of CLEF, pages 22-23.

D. Shapira and J. Storer. 2002. Edit distance with move operations. In Combinatorial Pattern Matching, pages 85-98.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223-231.

A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 901-904.

F.B. Viegas, M. Wattenberg, and K. Dave. 2004. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 575-582.

A.G. West and I. Lee. 2011. Multilingual vandalism detection using language-independent & ex post facto evidence. Notebook for PAN at CLEF 2011.

A.G. West, S. Kannan, and I. Lee. 2010. Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. In Proceedings of the Third European Workshop on System Security, pages 22-28.

K. Woodsend and M. Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409-420.

E. Yamangil and R. Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of ACL-08: HLT, Short Papers, pages 137-140.

M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 365-368.

F.M. Zanzotto and M. Pennacchiotti. 2010. Expanding textual entailment corpora from Wikipedia using co-training. In Proceedings of the 2nd Workshop on Collaboratively Constructed Semantic Resources, COLING 2010.

S. Zwarts and M. Dras. 2008. Choosing the right translation: A syntactically informed classification approach. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 1153-1160.

User Participation Prediction in Online Forums

Zhonghua Qu and Yang Liu
The University of Texas at Dallas
{qzh,

[email protected]

} Abstract ommendation systems are built. Content based rec- ommendation systems use the textual information Online community is an important source of news articles and user generated content to rank for latest news and information. Accurate items. Collaborative filtering, on the other hand, prediction of a user’s interest can help pro- uses co-occurrence information from a collection vide better user experience. In this paper, of users for recommendation. we develop a recommendation system for online forums. There are a lot of differ- During the past few years, online community ences between online forums and formal me- has become a large part of internet. More often, dia. For example, content generated by users latest information and knowledge appear at on- in online forums contains more noise com- line community earlier than other formal media. pared to formal documents. Content topics This makes it a favorable place for people seeking in the same forum are more focused than timely update and latest information. Online com- sources like news websites. Some of these munity sites appear in many forms, for example, differences present challenges to traditional online forums, blogs, and social networking web- word-based user profiling and recommenda- tion systems, but some also provide oppor- sites. Here we focus our study on online forums. It tunities for better recommendation perfor- is very helpful to build an automatic system to sug- mance. In our recommendation system, we gest latest information a user would be interested propose to (a) use latent topics to interpo- in. However, unlike formal news media, user gen- late with content-based recommendation; (b) erated content in forums is usually less organized model latent user groups to utilize informa- and not well formed. This presents a great chal- tion from other users. We have collected lenge to many existing news article recommenda- three types of forum data sets. 
Our experi- mental results demonstrate that our proposed tion systems. In addition, what makes online fo- hybrid approach works well in all three types rums different from other media is that users of of forums. online communities are not only the information consumers but also active providers as participants. Therefore in this study we develop a recommen- 1 Introduction dation system to account for these characteristics of forums. We propose several improvements over Internet is an important source of information. It previous work: has become a habit of many people to go to the in- ternet for latest news and updates. However, not all • Latent topic interpolation: This is to address articles are equally interesting for different users. the issue with the word-based content repre- In order to intelligently predict interesting articles sentation. In this paper we used Latent Dirich- for individual users, personalized news recommen- let Allocation (LDA), a generative multino- dation systems have been developed. There are in mial mixture model, for topic inference inside general two types of approaches upon which rec- threads. We build a system based on words 367 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 367–376, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics and latent topics, and linearly interpolate their according to its “informativeness”. Then, base on results. this “personal profile” a ranking machine is applied to give a ranked recommendation list. In Fabs sys- • User modeling: We model users’ participa- tem, Rocchio’ algorithm (Rocchio, 1971) is used tion inside threads as latent user groups. Each to learn the average TF-IDF vector of highly rated latent group is a multinomial distribution on documents. Skyskill & Webert’s system uses Naive users. 
Then LDA is used to infer the group Bayes classifiers to give the probability of docu- mixture inside each thread, based on which ments being liked. Winnow’s algorithm (Little- the probability of a user’s participation can be stone, 1988), which is similar to perception algo- derived. rithm, has been shown to perform well when there • Hybrid system: Since content and user- are many features. An adaptive framework is intro- based methods rely on different information duced in (Li et al., 2010) using forum comments sources, we combine the results from them for for news recommendation. In (Wu et al., 2010), further improvement. a topic-specific topic flow model is introduced to rank the likelihood of user participating in a thread We have evaluated our proposed method using in online forums. three data sets collected from three representative Collaborative-filtering based systems, unlike forums. Our experimental results show that in all content-based systems, predict the recommending forums, by using latent topics information, system items using co-occurrence information between can achieve better accuracy in predicting threads users. For example, in a news recommendation for recommendation. In addition, by modeling la- system, in order to recommend an article to user tent user groups in thread participation, further im- c, the system tries to find users with similar taste provement is achieved in the hybrid system. Our as c. Items favored by similar users would be rec- analysis also showed that each forum has its nature, ommended. Grundy (Rich, 1979) is known to be resulting in different optimal parameters in the dif- one of the first collaborative-filtering based sys- ferent forums. tems. Collaborative filtering systems can be ei- ther model based or memory based (Breese et al., 2 Related Work 1998). 
Memory-based algorithms, such as (Del- Recommendation systems can help make informa- gado and Ishii, 1999; Nakamura and Abe, 1998; tion retrieving process more intelligent. Generally, Shardanand and Maes, 1995), use a utility function recommendation methods are categorized into two to measure the similarity between users. Then rec- types (Adomavicius and Tuzhilin, 2005), content- ommendation of an item is made according to the based filtering and collaborative filtering. sum of the utility values of active users that partic- Systems using content-based filtering use the ipate in it. Model-based algorithms, on the other content information of recommendation items a hand, try to formulate the probability function of user is interested in to recommend new items to one item being liked statistically using active user the user. For example, in a news recommendation information. (Ungar et al., 1998) clustered sim- system, in order to recommend appropriate news ilar users into groups for recommendation. Dif- articles to a user, it finds the most prominent fea- ferent clustering methods have been experimented, tures (e.g., key words, tags, category) in the docu- including K-means and Gibbs Sampling. Other ment that a user likes, then suggests similar articles probabilistic models have also been used to model based on this “personal profile”. In Fabs system collaborative relationships, including a Bayesian (Balabanovic and Shoham, 1997), Skyskill & We- model (Chien and George, 1999), linear regres- bert system (Pazzani et al., 1997), documents are sion model (Sarwar et al., 2001), Gaussian mix- represented using a set of most important words ture models (Hofmann, 2003; Hofmann, 2004). In according to a weighting measure. The most popu- (Blei et al., 2001) a collaborative filtering appli- lar measure of word “importance” is TF-IDF (term cation is discussed using LDA. 
However in this frequency, inverse document frequency) (Salton model, re-estimation of parameters for the whole and Buckley, 1988), which gives weights to words system is needed when a new item comes in. In 368 this paper, we formulate users’ participation differ- duce some bias toward negative instances in terms ently using the LDA mixture model. of user interests. A users’ absence from a thread Some previous work has also evaluated using does not necessarily mean the user is not interested a hybrid model with both content and collabora- in that thread; it may be a result of the user being tive features and showed outstanding performance. offline by that time or the thread is too behind in For example, in (Basu et al., 1998), hybrid features pages. As a matter of fact, we found most users are used to make recommendation using inductive read only the threads on the first page during their learning. time of visit of a forum. This makes participation prediction an even harder task than interest predic- 3 Forum Data tion. In online forums, threads are ordered by the time We have collected data from three forums in this stamp of their last participating post. Provided with study.1 Ubuntu community forum is a technical the time stamp for each post, we can calculate the support forum; World of Warcraft (WoW) forum is order of a thread on its board during a user’s par- about gaming; Fitness forum is about how to live ticipation. Figure 1 shows the distribution of post a healthy life. These three forums are quite rep- location during users’ participation. We found that resentative of online forums on the internet. Us- most of the users read only the posts on the first ing three different types of forums for task eval- page. In order to minimize the false negative in- uation helps to demonstrate the robustness of our stances from the data set, we did thread location proposed method. In addition, it can show how the filtering. 
That is, we want to filter out messages same method could have substantial performance that actually interest the user but do not have the difference on forums of different nature. Users’ user’s participation because they are not on the first behaviors in these three forums are very differ- page. For any user, only those threads appearing in ent. Casual forums like “Wow gaming” have much the first 10 entries on a page during a user’s visit more posts in each thread. However its posts are are included in the data set. the shortest in length. This is because discussions inside these types of forums are more like casual conversation, and there is not much requirement on the user’s background, and thus there is more user participation. In contrast, technical forums like “Ubuntu” have fewer average posts in each thread, and have the longest post length. This is because a Question and Answer (QA) forum tends to be very goal oriented. If a user finds the thread is unrelated, then there will be no motivation for participation. Inside forums, different boards are created to categorize the topics allowed for discussion. From Figure 1: Thread position during users’ participation. the data we find that users tend to participate in a few selected boards of their choices. To create a In the pre-processing step of the experiment, first data set for user interest prediction in this study, we use online status filtering discussed above to we pick the most popular boards in each forum. remove threads that a user does not see while of- Even within the same board, users tend to partici- fline. The statistics of the boards we have used in pate in different threads base on their interest. We each forum are shown in Table 1. The statistics use a user’s participation information as an indica- are consistent with the full forum statistics. For tion whether a thread is interesting to a user or not. 
example, users in technical forums tend to post Hence, our task is to predict the user participation less than casual forums. We define active users as in forum threads. Note this approach could intro- those who have participated in 10 or more threads. 1 Please contact the authors to obtain the data. Column “Part. @300” shows the average number 369 of threads the top 300 users have participated in. that normalization by document length yielded “Filt. Threads@300” shows the average number of good empirical results in approximating a well cal- threads after using online filtering with a window ibrated posterior probability for Naive Bayes clas- of 10. Thread participation in “Ubuntu” forum is sifier. The normalized Naive Bayes classifier they very sparse for each user, having only 10.01% par- used is as follows: ticipating threads for each user after filtering. “Fit- 1 Y 1 ness” and “Wow Forum” have denser participation, P (Ci |f1..k ) = P (Ci ) P (fj |Ci ) |f | (2) Z at 18.97% and 13.86% respectively. j 4 Interesting Thread Prediction In this equation, the probability of generat- ing each word is normalized by the length of In the task of interesting thread prediction, the sys- the feature vector |f |. The posterior probabil- tem generates a ranked list of threads a user is ity P (interested|f1..k ) from (normalized) Naive likely to be interested in based on users’ past his- Bayes classifier is used for recommendation item tory of thread participation. Here, instead of pre- ranking. dicting the true interestedness, we predict the par- ticipation of the user, which is a sufficient condi- 4.1.2 Latent Topics based Interpolation tion for interestedness. This approach is also used Because of noisy forum writing and limited by (Wu et al., 2010) for their task evaluation. In training data, the above bag-of-word model used in this section, we describe our proposed approaches naive Bayes classifier may suffer from data sparsity for thread participation prediction. issues. 
We thus propose to use latent topic model- ing to alleviate this problem. Latent Dirichlet Allo- 4.1 Content-based Filtering cation (LDA) is a generative model based on latent In the content-based filtering approach, only con- topics. The major difference between LDA and tent of a thread is used as features for prediction. previous methods such as probabilistic Latent Se- Recommendation through content-based filtering mantic Analysis (pLSA) is that LDA can efficiently has its deep root in information retrieval. Here we infer topic composition of new documents, regard- use a Naive Bayes classifier for ranking the threads less of the training data size (Blei et al., 2001). This using information based on the words and the la- makes it ideal for efficiently reducing the dimen- tent topic analysis. sion of incoming documents. In an online forum, words contained in threads 4.1.1 Naive Bayes Classification tend to be very noisy. Irregular words, such as In (Pazzani et al., 1997) Naive Bayesian classi- abbreviation, misspelling and synonyms, are very fier showed outstanding performance in web page common in an online environment. From our ex- recommendation compared to several other clas- periments, we observe that LDA seems to be quite sifiers. A Naive Bayes classifier is a generative robust to these phenomena and able to capture model in which words inside a document are as- word relationship semantically. To illustrate the sumed to be conditionally independent. That is, words inside latent topics in the LDA model in- given the class of a document, words are generated ferred from online forums, we show in Table 2 the independently. The posterior probability of a test top words in 3 out of 20 latent topics inferred from instance in Naive Bayes classifier takes the follow- “Ubuntu” forum according to its multinomial dis- ing form: tribution. We can see that variations of the same 1 Y words are grouped into the same topic. 
P (Ci |f1..k ) = P (Ci ) P (fj |Ci ) (1) Since each post could be very short and LDA is Z j generally known not to work well with short docu- where Z is the class label independent normaliza- ments, we concatenated the content of posts inside tion term, f1..k is the bag-of-word feature vector each thread to form documents. In order to build for the document. Naive Bayes classifier is known a valid evaluation configuration, only posts before for not having a well calibrated posterior probabil- the first time the testing user participated are used ity (Bennett, 2000). (Pavlov et al., 2004) showed for model fitting and inference. 370 Forum Name Threads Posts Active Users Part. @300 Filt. Threads @300 Ubuntu 185,747 940,230 1,700 464.72 4641.25 Fitness 27,250 529,201 2,808 613.15 3231.04 Wow Gaming 34,187 1,639,720 19,173 313.77 2264.46 Table 1: Data statistics after filtering. Topic 1 Topic 2 Topic 3 where θ1 , ..., θt is the multinomial distribution of lol’d wine email topics for the thread. lol. Wine mail imo. game Thunderbird 4.2 Collaborative Filtering ,’ fixme evolution Collaborative filtering techniques make prediction -, stub send using information from similar users. It has ad- lulz. not emails vantages over content-based filtering in that it can lmao. WINE gmail correctly predict items that are vastly different in rofl. play postfix content but similar in concepts indicated by users’ participation. Table 2: Example of LDA topics that capture words In some previous work, clustering methods were with different variations. used to partition users into several groups, Then, predictions were made using information from users in the same group. However, in the case After model fitting for LDA, the topic distri- of thread recommendation, we found that users’ butions on new threads can be inferred using the interest does not form clean clusters. Figure 2 model. 
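A minimal sketch of the length-normalized Naive Bayes score of Equation 2, computed in log space with toy probabilities; this is an illustration of the formula, not the paper's implementation, and the word probabilities below are made up.

```python
# Illustrative sketch of the length-normalized Naive Bayes score of Equation 2:
# raising each word likelihood to 1/|f| is equivalent to averaging the word
# log-likelihoods over the document length, keeping scores comparable across
# documents of different lengths. Probabilities below are toy values.
import math

def normalized_nb_score(words, class_prior, word_prob):
    """log P(C) + (1/|f|) * sum_j log P(f_j | C): Equation 2 in log space."""
    avg_loglik = sum(math.log(word_prob[w]) for w in words) / len(words)
    return math.log(class_prior) + avg_loglik

word_prob = {"ubuntu": 0.2, "kernel": 0.1}  # toy P(word | interested)
short = normalized_nb_score(["ubuntu", "kernel"], 0.5, word_prob)
long = normalized_nb_score(["ubuntu", "kernel"] * 10, 0.5, word_prob)
# length normalization: the repeated document gets the same score as the short one
```

Without the 1/|f| exponent, the longer document's unnormalized log-likelihood would be ten times larger, which is the calibration problem the normalization addresses.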
After model fitting for LDA, the topic distributions of new threads can be inferred using the model. Compared to the original bag-of-words feature vector, the topic distribution vector is not only more robust against noise but also closer to human interpretation of words. For example, in Topic 3 in Table 2, people who care about "Thunderbird", an email client, are also very likely to show interest in "postfix", a Linux email service. Such closely related words might not be captured by the bag-of-words model, since that would require the exact words to appear in the training set.

In order to take advantage of the topic-level information while not losing the "fine-grained" word-level features, we use the topic distribution as additional features in combination with the bag-of-words features. To tune the contribution of the topic-level features in classifiers like Naive Bayes, we normalize the topic-level features to a length of Lt = γ|f| and the bag-of-words features to Lw = (1 − γ)|f|, where γ is a tuning parameter from 0 to 1 that determines the proportion of topic information used in the features, and |f| is the length of the original bag-of-words feature vector. The final feature vector for each thread can be represented as:

  F = {Lw·w1, ..., Lw·wk} ∪ {Lt·θ1, ..., Lt·θT}    (3)

where θ1, ..., θT is the multinomial distribution of topics for the thread.

4.2 Collaborative Filtering

Collaborative filtering techniques make predictions using information from similar users. Collaborative filtering has an advantage over content-based filtering in that it can correctly predict items that are vastly different in content but similar in the concepts indicated by users' participation.

In some previous work, clustering methods were used to partition users into several groups; predictions were then made using information from users in the same group. However, in the case of thread recommendation, we found that users' interests do not form clean clusters. Figure 2 shows the mutual information between users after average-link clustering on their pairwise mutual information. In a clean clustering, intra-cluster mutual information should be high while inter-cluster mutual information is very low, so we would expect the figure to show clear rectangles along the diagonal. Unfortunately, the figure shows that users far apart in the hierarchy tree still have a lot of common thread participation. Hence, we propose instead to model user similarity based on latent user groups.

Figure 2: Mutual information between users in average-link hierarchical clustering.

4.2.1 Latent User Groups

In this paper, we model users' participation inside threads as an LDA generative model. We model each user group as a multinomial distribution; users inside each group are assumed to have common interests in certain topics. A thread in an online forum typically contains several such topics, so we model a user's participation in a thread as a mixture of several different user groups. Since one thread typically attracts a subset of the user groups, it is reasonable to add a Dirichlet prior on the user group mixture.

The generative process is the same as in the LDA used above for topic modeling, except that now users are "words" and user groups are "topics". Using LDA to model user participation can be viewed as a soft clustering of users, in the sense that one user can appear in multiple groups at the same time. The generative process for the participating users is as follows:

1. Choose θ ∼ Dir(α)
2. For each of the N participating users un:
   (a) Choose a group zn ∼ Multinomial(θ)
   (b) Choose a user un ∼ p(un | zn)

One thing worth noting is that in the LDA model, a document is assumed to consist of many words, whereas a thread typically has far fewer participating users than a document has words. This could potentially cause problems during parameter estimation and inference. However, we show that this approach actually works well in practice (see the experimental results in Section 5).

4.2.2 Using Latent User Groups for Prediction

For an incoming new thread, the latent group distribution is first inferred using collapsed Gibbs sampling (Griffiths and Steyvers, 2004). The posterior probability of a user ui participating in thread j, given the user group distribution, is:

  P(ui | θj, φ) = Σ_{k=1}^{T} P(ui | φk) · P(k | θj)    (4)

In the equation, φk is the multinomial distribution of users in group k, T is the number of latent user groups, and θj is the group composition of thread j after inference using the training data. In general, the probability of user ui appearing in thread j is proportional to the user's membership probabilities in the groups that compose the participating users.

4.3 Hybrid System

Up to this point we have two separate systems that can generate ranked recommendation lists based on different aspects of threads. In order to generate the final ranked list, we give each item a score according to the ranked lists from the two systems. The two scores are then linearly interpolated using a tuning parameter λ, as shown in Equation 5, and the final ranked list is generated accordingly:

  Ci = (1 − λ)·Score_content + λ·Score_collaborative    (5)

We propose several different rescoring methods to generate the scores in the above formula for the two individual systems.

• Posterior: The posterior probability of each item from the two systems is used directly as the score:

  Score_dir = p(c_like | item_i)    (6)

This preserves the confidence of "how likely" an item is to be interesting. The downside is that the two systems calibrate their posterior probabilities differently, which can be problematic when adding them together directly.

• Linear rescore: To counter the calibration problem associated with posterior probabilities, we use linear rescoring based on the ranked list:

  Score_lin = 1 − pos_i / N    (7)

In the formula, pos_i is the position of item i in the ranked list, and N is the total number of items being ranked. The resulting score is between 0 and 1, with 1 being the first item on the list and 0 the last.

• Sigmoid rescore: In a ranked list, the items at the top and bottom usually have higher confidence than those in the middle; that is to say, more "emphasis" should be put on both ends of the list. Hence we use a sigmoid function on Score_lin to capture this:
  Score_sig = 1 / (1 + e^(−l·(Score_lin − 0.5)))    (8)

A sigmoid function is relatively flat on both ends while being steep in the middle. In the equation, l is a tuning parameter that decides how "flat" the scores at the two ends of the list will be. Determining the best value for l is not a trivial problem; here we empirically set l = 10.

5 Experiment and Evaluation

In this section, we evaluate our approach empirically on the three forum data sets described in Section 3. We pick the top 300 most active users from each forum for the evaluation. Among these 300 users, 100 are randomly selected as the development set for parameter tuning, while the rest form the test set. All the data sets are filtered using an online filter as previously described, with a window size of 10 threads.

During evaluation, a 3-fold cross-validation is performed for each user in the test set. In each fold, the MAP@10 score is calculated from the ranked list generated by the system; the average over all the folds and all the users is then reported as the final result. To make a proper evaluation configuration, for each user, only posts up to the first participation of the testing user are used for the test set.
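The MAP@10 computation used in this protocol can be sketched as follows. This is one common definition of truncated average precision — the paper's exact truncation convention may differ slightly — and the relevance data in the example are invented.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: mean of precision@i over the ranks i <= k that hold a
    relevant item, normalized by min(|relevant|, k)."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i            # precision at rank i
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(rankings, relevants, k=10):
    """MAP@k: mean AP@k over all (user, fold) ranked lists."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# A perfect ranking scores 1.0; putting a distracter first scores less.
perfect = average_precision_at_k(["a", "b", "c"], {"a", "b", "c"})
worse = average_precision_at_k(["x", "a", "b"], {"a", "b"})
```

Averaging `average_precision_at_k` over every user's per-fold list, as `map_at_k` does, mirrors the cross-validation averaging described above.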
Threads are tokenized into words and filtered using a simple English stop word list. All words are then ordered by their occurrence counts multiplied by their inverse document frequencies (IDF):

  idf_w = log( |D| / |{d : w ∈ d}| )    (9)

The top 4,000 words from this list are then used to form the vocabulary.

We used standard mean average precision (MAP) as the evaluation metric. This standard information retrieval metric measures the quality of the ranked lists returned by a system: entries higher in the ranking count more than lower ones. For an interesting-thread recommendation system, it is preferable to provide a short, high-quality list of recommendations; therefore, instead of reporting full-range MAP, we report MAP over the top 10 relevant threads (MAP@10). The reason we picked 10 as the number of relevant documents for MAP evaluation is that users might not have time to read many posts, even relevant ones.

5.1 Content-based Results

Here we evaluate the performance of interesting-thread prediction using only features from text. First we use the ranking model with latent topic information alone on the development set to determine an optimal number of topics. Empirically, we use hyperparameters β = 0.1 and α = 1/K, where K is the number of topics. We use the performance of content-based recommendation directly to determine the optimal topic number K. We varied K from 10 to 100, and found that the best performance was achieved with 30 topics in all three forums. Hence we use K = 30 for content-based recommendation unless otherwise specified.

Next, we show how topic information helps content-based recommendation achieve better results. We tune the parameter γ described in Section 4.1.2 and show the corresponding performance, comparing the Naive Bayes classifier before and after normalization. The MAP@10 results on the test set are shown in Figure 3 for the three forums. When γ = 0, no latent topic information is used; when γ = 1, latent topics are used without any word features.

When using the Naive Bayes classifier without normalization, we find a relatively larger performance gain from adding topic information for γ values close to 0. This phenomenon is probably due to the poor posterior probabilities of the Naive Bayes classifier, which are close to either 1 or 0.

For the normalized Naive Bayes classifier, interpolating with latent-topic-based ranking consistently improves performance over the word-based results in all three forums. In the "Wow Gaming" corpus, the optimal performance is achieved at a relatively high γ value (around 0.5), and it is even higher for the "Fitness" forum. This means that the system relies more on the latent topic information.
This is because in these forums, casual conversation contains more irregular words, causing a more severe data sparsity problem than in the others.

Between the two Naive Bayes classifiers, we can see that using normalized probabilities outperforms the original one in the "Wow Gaming" and "Ubuntu" forums. This observation is consistent with previous work (e.g., (Pavlov et al., 2004)). However, we found that in the "Fitness" forum, the performance degrades with normalization. Further work is still needed to understand why this is the case.

5.2 Latent User Group Classification

In this section, collaborative filtering using latent user groups is evaluated. First, the participating users from the training set are used to estimate an LDA model. Then, the users participating in a thread are used to infer the group distribution of the thread. Candidate threads are then sorted by the probability of the target user's participation according to Equation 4. Note that all the users in the forum are used to estimate the latent user groups, but only the top 300 active users are used in the evaluation. Here, we vary the number of latent user groups G from 5 to 100. Hyperparameters were set empirically: α = 1/G, β = 0.1.

Figure 4 shows the MAP@10 results using different numbers of latent groups for the three forums. We compare the performance using latent groups with a baseline using SVM ranking. In the baseline system, users' participation in a thread is used as a binary feature, and LibSVM with a radial basis function (RBF) kernel is used to estimate the probability of a user's participation.

From the results, we find that ranking using latent group information outperforms the baseline in almost all non-trivial cases. In the case of the "Ubuntu" forum, the performance gain is smaller than in the other forums. We believe this is because in this technical support forum, the average user participation per thread is much lower, making it hard to infer a reliable group distribution for a thread. In addition, the optimal number of user groups differs greatly between the "Fitness" forum and the "Wow Gaming" forum. We conjecture that the reason is that in the "Fitness" forum, users may be interested in a larger variety of topics, so the user distribution across topics is less pronounced, whereas people in the gaming forum are more specific about the topics they are interested in.

It is known that LDA tends to perform poorly when there are too few words (here, users). To get a general idea of how much user participation is "enough" for decent prediction, we show a graph (Figure 5) depicting the relationships among the number of users, the number of words, and the position of the positive instances in the ranked lists. In this graph, every dot is a positive thread instance in the "Wow Gaming" forum; red indicates that the positive thread is indeed ranked higher than the others. We observe that threads with around 16 participants can already achieve decent performance.

Figure 5: Position of items with different #users and #words in a ranked list (red = 0 being higher on the ranked list and green being lower).

5.3 Hybrid System Performance

In this section, we evaluate the performance of the hybrid system output. The parameters used for each forum data set are the optimal parameters found in the previous sections. Here we show the effect of the tuning parameter λ (described in Section 4.3), and we compare the three different scoring schemes used to generate the final ranked list. The performance of the hybrid system is shown in Table 3. We can see that the combination of the two systems always outperforms either model alone.
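To make the interpolation concrete, here is a sketch of Equations (5), (7), and (8) applied to two hypothetical ranked lists. The thread ids are invented, and the 0-indexed position convention (top item scores 1.0, bottom item 0.0) is our reading of Equation (7); this is illustrative code, not the authors' implementation.

```python
import math

def linear_rescore(pos, n):
    """Eq. (7) with 0-indexed positions: 1.0 for the top item,
    0.0 for the bottom item of an n-item ranked list."""
    return 1.0 - pos / (n - 1)

def sigmoid_rescore(pos, n, l=10.0):
    """Eq. (8): squash the linear score so both ends of the list
    receive more 'emphasis' than the middle; l controls flatness."""
    return 1.0 / (1.0 + math.exp(-l * (linear_rescore(pos, n) - 0.5)))

def hybrid_score(content, collaborative, lam):
    """Eq. (5): linear interpolation of the two systems' scores."""
    return (1.0 - lam) * content + lam * collaborative

# Hypothetical ranked lists of four thread ids from the two systems.
content_rank = ["t3", "t1", "t4", "t2"]
collab_rank = ["t1", "t3", "t2", "t4"]
n = len(content_rank)
final = {t: hybrid_score(linear_rescore(content_rank.index(t), n),
                         linear_rescore(collab_rank.index(t), n),
                         lam=0.25)
         for t in content_rank}
best = max(final, key=final.get)
```

With a low λ such as 0.25, the content-based ranking dominates, matching the behavior reported for the "Ubuntu" forum in Table 3.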
Figure 3: Content-based filtering results: MAP@10 vs. γ for the Ubuntu, Wow Gaming, and Fitness forums, for the Naive Bayes and normalized Naive Bayes classifiers.

Figure 4: Collaborative filtering results: MAP@10 vs. the number of user groups, compared with the SVM baseline.

  Forum     λ = 0.0   λ = 1.0   Optimal
  Ubuntu    0.523     0.198     0.534 (λ = 0.9)
  Wow       0.278     0.283     0.304 (λ = 0.1)
  Fitness   0.545     0.457     0.551 (λ = 0.85)

Table 3: Performance of the hybrid system with different λ values.

This is intuitive, since the two models use different information sources. A MAP@10 score of 0.5 means that around half of the suggested results do have user participation; we think this is a good result considering that this is not a trivial task.

We also notice that, depending on the nature of the forum, the optimal λ value can be substantially different. For example, in the "Wow Gaming" forum, where people participate in more threads, a higher λ is observed, favoring the collaborative filtering score. In contrast, in the "Ubuntu" forum, where people participate in far fewer threads, the content-based system is more reliable for thread prediction, hence a lower λ is used. This observation also shows that the hybrid system is more robust to differences among forums than the single-model systems.

6 Conclusion

In this paper, we proposed a new system that can intelligently recommend threads from an online community according to a user's interest. The system uses both content-based filtering and collaborative filtering techniques. In content-based filtering, we address the data sparsity of online content by smoothing with latent topic information. In collaborative filtering, we model users' participation in threads with latent groups under an LDA framework. The two systems complement each other, and their combination achieves better performance than either one individually. Our experiments across different forums demonstrate the robustness of our methods and the differences among forums. In future work, we plan to explore how social information could help further refine a user's interest.

References

Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749.

Marko Balabanovic and Yoav Shoham. 1997. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40:66–72.

Chumki Basu, Haym Hirsh, and William Cohen. 1998. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 714–720. AAAI Press.

Paul N. Bennett. 2000. Assessing the calibration of naive Bayes' posterior estimates.

David Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

John S. Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43–52. Morgan Kaufmann.

Y. H. Chien and E. I. George. 1999. A Bayesian model for collaborative filtering.

Joaquin Delgado and Naohiro Ishii. 1999. Memory-based weighted-majority prediction for recommender systems.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, April.

Thomas Hofmann. 2003. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 259–266, New York, NY, USA. ACM.

Thomas Hofmann. 2004. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89–115.

Qing Li, Jia Wang, Yuanzhu Peter Chen, and Zhangxi Lin. 2010. User comments for news recommendation in forum-based social media. Information Sciences, 180:4929–4939, December.

Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318.

Atsuyoshi Nakamura and Naoki Abe. 1998. Collaborative filtering using weighted majority prediction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 395–403, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dmitry Pavlov, Ramnath Balasubramanyan, Byron Dom, Shyam Kapur, and Jignashu Parikh. 2004. Document preprocessing for naive Bayes classification and clustering with mixture of multinomials. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 829–834, New York, NY, USA. ACM.

Michael Pazzani, Daniel Billsus, S. Michalski, and Janusz Wnek. 1997. Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27:313–331.

Elaine Rich. 1979. User modeling via stereotypes. Cognitive Science, 3(4):329–354.

J. Rocchio. 1971. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295, New York, NY, USA. ACM.

Upendra Shardanand and Pattie Maes. 1995. Social information filtering: Algorithms for automating "word of mouth". In CHI, pages 210–217.

Lyle Ungar and Dean Foster. 1998. Clustering methods for collaborative filtering. In AAAI Workshop on Recommendation Systems. AAAI Press.

Hao Wu, Jiajun Bu, Chun Chen, Can Wang, Guang Qiu, Lijun Zhang, and Jianfeng Shen. 2010. Modeling dynamic multi-topic discussions in online forums. In AAAI.

Inferring Selectional Preferences from Part-Of-Speech N-grams

Hyeju Jang and Jack Mostow
Project LISTEN (www.cs.cmu.edu/~listen), School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213, USA

[email protected]

,

[email protected]

Abstract

We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a given grammatical relation to the other. PONG estimates this probability by combining both distributions, whether or not either word occurs in the labeled corpus. PONG achieves higher average precision on 16 relations than a state-of-the-art baseline in a pseudo-disambiguation task, but lower coverage and recall.

  Notation    Description
  R           a relation between words
  t           a target word
  r, r'       possible relatives of t
  g           a word N-gram
  gi and gj   the ith and jth words of g
  p           the POS N-gram of g

Table 1: Notation used throughout this paper

1 Introduction

Selectional preferences specify plausible fillers for the arguments of a predicate, e.g., celebrate. Can you celebrate a birthday? Sure. Can you celebrate a pencil? Arguably yes: Today the Acme Pencil Factory celebrated its one-billionth pencil. However, such a contrived example is unnatural because, unlike birthday, pencil lacks a strong association with celebrate. How can we compute the degree to which birthday or pencil is a plausible and typical object of celebrate?

Formally, we are interested in computing the probability Pr(r | t, R), where (as Table 1 specifies) t is a target word such as celebrate, r is a word possibly related to it, such as birthday or pencil, and R is a possible relation between them, whether a semantic role such as the agent of an action, or a grammatical dependency such as the object of a verb. We call t the "target" because originally it referred to a vocabulary word targeted for instruction, and r its "relative."

Previous work on selectional preferences has used them primarily for natural language analytic tasks such as word sense disambiguation (Resnik, 1997), dependency parsing (Zhou et al., 2011), and semantic role labeling (Gildea and Jurafsky, 2002). However, selectional preferences can also apply to natural language generation tasks such as sentence generation and question generation. For generation tasks, choosing the right word to express a specified argument of a relation requires knowing its connotations – that is, its selectional preferences. Therefore, it is useful to know selectional preferences for many different relations. Such knowledge could have many uses. In education, it could help teach word connotations. In machine learning, it could help computers learn languages. In machine translation, it could help generate more natural wording.

This paper introduces a method named PONG (for Part-Of-Speech N-Grams) to compute selectional preferences for many different relations by combining part-of-speech information and Google N-grams. PONG achieves higher precision on a pseudo-disambiguation task than the best previous model (Erk et al., 2010), but lower coverage.

The paper is organized as follows. Section 2 describes the relations for which we compute selectional preferences. Section 3 describes PONG. Section 4 evaluates PONG. Section 5 relates PONG to prior work. Section 6 concludes.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 377–386, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics
2 Relations Used

Selectional preferences characterize constraints on the arguments of predicates. Selectional preferences for semantic roles (such as agent and patient) are generally more informative than for grammatical dependencies (such as subject and object). For example, consider these semantically equivalent but grammatically distinct sentences:

  Pat opened the door.
  The door was opened by Pat.

In both sentences the agent of opened, namely Pat, must be capable of opening something – an informative constraint on Pat. In contrast, knowing that the grammatical subject of opened is Pat in the first sentence and the door in the second tells us only that they are nouns.

Despite this limitation, selectional preferences for grammatical dependencies are still useful, for a number of reasons. First, in practice they approximate semantic role labels. For instance, typically the grammatical subject of opened is its agent. Second, grammatical dependencies can be extracted by parsers, which tend to be more accurate than current semantic role labelers. Third, the number of different grammatical dependencies is large enough to capture diverse relations, but not so large as to have sparse data for individual relations. Thus in this paper, we use grammatical dependencies as relations.

A parse tree determines the basic grammatical dependencies between the words in a sentence. For instance, in the parse of Pat opened the door, the verb opened has Pat as its subject and door as its object, and door has the as its determiner. Besides these basic dependencies, we use two additional types of dependencies.

Composing two basic dependencies yields a collapsed dependency (de Marneffe and Manning, 2008). For example, consider this sentence:

  The airplane flies in the sky.

Here sky is the prepositional object of in, which is the head of a prepositional phrase attached to flies. Composing these two dependencies yields the collapsed dependency prep_in between flies and sky, which captures an important semantic relation between these two content words: sky is the location where flies occurs. Other function words yield different collapsed dependencies. For example, consider these two sentences:

  The airplane flies over the ocean.
  The airplane flies and lands.

Collapsed dependencies for the first sentence include prep_over between flies and ocean, which characterizes their relative vertical position, and, for the second, conj_and between flies and lands, which links two actions that an airplane can perform. As these examples illustrate, collapsing dependencies involving prepositions and conjunctions can yield informative dependencies between content words.

Besides collapsed dependencies, PONG infers inverse dependencies. Inverse selectional preferences are selectional preferences of arguments for their predicates, such as the preference of a subject or object for its verb. They capture semantic regularities such as the set of verbs that an agent can perform, which tend to outnumber the possible agents for a verb (Erk et al., 2010).

3 Method

To compute selectional preferences, PONG combines information from a limited corpus labeled with the grammatical dependencies described in Section 2, and a much larger unlabeled corpus. The key idea is to abstract word sequences labeled with grammatical relations into POS N-grams, in order to learn a mapping from POS N-grams to those relations. For instance, PONG abstracts the parsed sentence Pat opened the door as NN VB DT NN, with the first and last NN as the subject and object of the VB. To estimate the distribution of POS N-grams containing particular target and relative words, PONG POS-tags Google N-grams (Franz and Brants, 2006).

Section 3.1 derives PONG's probabilistic model for combining information from labeled and unlabeled corpora. Sections 3.2 and 3.3 describe how PONG estimates probabilities from each corpus. Section 3.4 discusses a sparseness problem revealed during probability estimation, and how we address it in PONG.

3.1 Probabilistic model

We quantify the selectional preference for a relative r to instantiate a relation R of a target t as the probability Pr(r | t, R), estimated as follows. By the definition of conditional probability:

  Pr(r | t, R) = Pr(r, t, R) / Pr(t, R)

We care only about the relative probability of different r for fixed t and R, so we rewrite it as:

  Pr(r | t, R) ∝ Pr(r, t, R)

We use the chain rule:

  Pr(r, t, R) = Pr(R | r, t) · Pr(r | t) · Pr(t)

and notice that t is held constant:

  Pr(r | t, R) ∝ Pr(R | r, t) · Pr(r | t)

We estimate the second factor as follows:

  Pr(r | t) = Pr(t, r) / Pr(t) = freq(t, r) / freq(t)

We calculate the denominator freq(t) as the number of N-grams in the Google N-grams corpus that contain t, and the numerator freq(t, r) as the number of N-grams containing both t and r.

To estimate the factor Pr(R | r, t) directly from a corpus of text labeled with grammatical relations, it would be trivial to count how often a word r bears relation R to target word t. However, the results would be limited to the words in the corpus, and many relation frequencies would be estimated sparsely or missing altogether; t or r might not even occur. Instead, we abstract each word in the corpus as its part-of-speech (POS) label. Thus we abstract The big boy ate meat as DT JJ NN VB NN. We call this sequence of POS tags a POS N-gram.

We use POS N-grams to predict word relations. For instance, we predict that in any word sequence with this POS N-gram, the JJ will modify (amod) the first NN, and the second NN will be the direct object (dobj) of the VB. This prediction is not 100% reliable. For example, the initial 5-gram of The big boy ate meat pie has the same POS 5-gram as before. However, the dobj of its VB (ate) is not the second NN (meat), but the subsequent NN (pie). Thus POS N-grams predict word relations only in a probabilistic sense.

To transform Pr(R | r, t) into a form we can estimate, we first apply the definition of conditional probability:

  Pr(R | t, r) = Pr(R, t, r) / Pr(t, r)

To estimate the numerator Pr(R, t, r), we first marginalize over the POS N-gram p:

  Pr(R, t, r) = Σp Pr(R, t, r, p)

We expand the numerator using the chain rule:

  Pr(R | t, r) = [ Σp Pr(R | t, r, p) · Pr(p | t, r) · Pr(t, r) ] / Pr(t, r)

Cancelling the common factor yields:

  Pr(R | t, r) = Σp Pr(R | p, t, r) · Pr(p | t, r)

We approximate the first term Pr(R | p, t, r) as Pr(R | p), based on the simplifying assumption that R is conditionally independent of t and r, given p. In other words, we assume that given a POS N-gram, the target and relative words t and r give no additional information about the probability of a relation. However, their respective positions i and j in the POS N-gram p matter, so we condition the probability on them:

  Pr(R | p, t, r) ≈ Pr(R | p, i, j)

Summing over their possible positions, we get:

  Pr(R | r, t) ≈ Σp Σi Σj Pr(R | p, i, j) · Pr(p | t = gi, r = gj)

As Figure 1 shows, we estimate Pr(R | p, i, j) by abstracting the labeled corpus into POS N-grams. We estimate Pr(p | t = gi, r = gj) based on the frequency of partially lexicalized POS N-grams like DT JJ:red NN:hat VB NN among Google N-grams with t and r in the specified positions. Sections 3.2 and 3.3 describe how we estimate Pr(R | p, i, j) and Pr(p | t = gi, r = gj), respectively.

Note that PONG estimates relative rather than absolute probabilities. Therefore it cannot (and does not) compare them against a fixed threshold to make decisions about selectional preferences.

3.2 Mapping POS N-grams to relations

To estimate Pr(R | p, i, j), we use the Penn Treebank Wall Street Journal (WSJ) corpus, which is labeled with grammatical relations using the Stanford dependency parser (Klein and Manning, 2003).

To estimate the probability Pr(R | p, i, j) of a relation R between a target at position i and a relative at position j in a POS N-gram p, we compute what fraction of the word N-grams g with POS N-gram p have relation R between some target t and relative r at positions i and j:

  Pr(R | p, i, j) = freq(g s.t. POS(g) = p ∧ relation(gi, gj) = R) / freq(g s.t. POS(g) = p ∧ relation(gi, gj))
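Under the model above, scoring a candidate relative reduces to one multiplication and a sum over POS N-grams. The sketch below hand-codes tiny probability tables — every number and table entry is invented for illustration — where PONG would estimate them from the WSJ corpus and POS-tagged Google N-grams.

```python
def pong_score(t, r, R, rel_given_pos, pos_given_words, freq):
    """Relative preference score for relative r of target t under
    relation R (Section 3.1):
        Pr(r | t) * sum over (p, i, j) of
            Pr(R | p, i, j) * Pr(p | t = g_i, r = g_j)
    rel_given_pos:   (p, i, j) -> {relation: probability}
    pos_given_words: (t, r)    -> {(p, i, j): probability}
    freq:            word and word-pair counts, for Pr(r | t)."""
    pr_r_given_t = freq.get((t, r), 0) / freq.get(t, 1)
    total = 0.0
    for (p, i, j), pr_p in pos_given_words.get((t, r), {}).items():
        total += rel_given_pos.get((p, i, j), {}).get(R, 0.0) * pr_p
    return pr_r_given_t * total

# Toy tables (all values invented) for the celebrate/birthday example.
rel_given_pos = {("VB DT NN", 0, 2): {"dobj": 0.9, "nsubj": 0.05}}
pos_given_words = {
    ("celebrate", "birthday"): {("VB DT NN", 0, 2): 0.8},
    ("celebrate", "pencil"): {("VB DT NN", 0, 2): 0.1},
}
freq = {"celebrate": 1000, ("celebrate", "birthday"): 50,
        ("celebrate", "pencil"): 2}
birthday = pong_score("celebrate", "birthday", "dobj",
                      rel_given_pos, pos_given_words, freq)
pencil = pong_score("celebrate", "pencil", "dobj",
                    rel_given_pos, pos_given_words, freq)
```

Since PONG estimates only relative probabilities, the two scores are meaningful only in comparison with each other, not against a fixed threshold.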
We 4 Evaluation then obtain their POS N-grams from the Stanford POS tagger (Toutanova et al., 2003), and count To evaluate PONG, we use a standard pseudo- how many of them have the POS N-gram p. disambiguation task, detailed in Section 4.1. Section 4.2 describes our test set. Section 4.3 3.4 Reducing POS N-gram sparseness lists the metrics we evaluate on this test set. We abstract word N-grams into POS N-grams to Section 4.4 describes the baselines we compare address the sparseness of the labeled corpus, but PONG against on these metrics, and Section 4.5 even the POS N-grams can be sparse. For n=5, describes the relations we compare them on. the rarer ones occur too sparsely (if at all) in our Section 4.6 reports our results. Section 4.7 labeled corpus to estimate their frequency. analyzes sources of error. To address this issue, we use a coarser POS tag set than the Penn Treebank POS tag set. As 4.1 Evaluation task Table 2 shows, we merge tags for adjectives, The pseudo-disambiguation task (Gale et al., nouns, adverbs, and verbs into four coarser tags. 1992; Schutze, 1992) is as follows: given a Coarse Original target word t, a relation R, a relative r, and a random distracter r', prefer either r or r', ADJ JJ, JJR, JJS whichever is likelier to have relation R to word t. ADVERB RB, RBR, RBS This evaluation does not use a threshold: just NOUN NN, NNS, NNP, NNPS prefer whichever word is likelier according to the VERB VB, VBD, VBG, VBN, VBP, VBZ model being evaluated. If the model assigns only Table 2: Coarser POS tag set used in PONG one of the words a probability, prefer it, based on the assumption that the unknown probability of To gauge the impact of the coarser POS tags, the other word is lower. If the model assigns the we calculated Pr(r | t, R) for 76 test instances same probability to both words, or no probability used in an earlier unpublished study by Liu Liu, to either word, do not prefer either word. a former Project LISTEN graduate student. 
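The counting estimates of Sections 3.2 and 3.4 can be sketched in code. The following is an illustrative sketch with toy data, not the authors' implementation; the corpus representation and function names are hypothetical:

```python
from collections import Counter

# Table 2: merge fine-grained Penn Treebank tags into coarser ones.
COARSE = {**{t: "ADJ" for t in ("JJ", "JJR", "JJS")},
          **{t: "ADVERB" for t in ("RB", "RBR", "RBS")},
          **{t: "NOUN" for t in ("NN", "NNS", "NNP", "NNPS")},
          **{t: "VERB" for t in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")}}

def coarse(tag):
    # Tags outside the four merged classes (e.g. DT) are kept as-is.
    return COARSE.get(tag, tag)

def relation_probs(labeled_ngrams):
    """Estimate Pr(R | p, i, j): among word N-grams with (coarse) POS
    N-gram p, the fraction whose fixed positions i, j stand in relation R.
    Each input item is (POS tags of one word N-gram, relation or None)."""
    joint = Counter()   # freq of (p, R)
    total = Counter()   # freq of p with any relation between i and j
    for tags, rel in labeled_ngrams:
        if rel is None:
            continue
        p = tuple(coarse(t) for t in tags)
        joint[(p, rel)] += 1
        total[p] += 1
    return {(p, r): n / total[p] for (p, r), n in joint.items()}

# Toy labeled 5-grams with the relation between positions 1 (JJ) and 2 (NN/NNS):
corpus = [(("DT", "JJ", "NN", "VB", "NN"), "amod"),
          (("DT", "JJ", "NNS", "VB", "NN"), "amod")]
probs = relation_probs(corpus)
print(probs[(("DT", "ADJ", "NOUN", "VERB", "NOUN"), "amod")])  # 1.0
```

Note how the coarse tag set of Table 2 collapses the distinct Penn Treebank 5-grams DT JJ NN VB NN and DT JJ NNS VB NN into one POS N-gram, pooling their counts; this pooling is the effect that raised pilot-set coverage from 69% to 92%.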
4.2 Test set

As a source of evaluation data, we used the British National Corpus (BNC). As a common test corpus for all the methods we evaluated, we selected one half of BNC by sorting filenames alphabetically and using the odd-numbered files. We used the other half of BNC as a training corpus for the baseline methods we compared PONG to.

A test set for the pseudo-disambiguation task consists of tuples of the form (R, t, r, r'). To construct a test set, we adapted the process used by Rooth et al. (1999) and Erk et al. (2010). First, we chose 100 (R, t) pairs for each relation R at random from the test corpus. Rooth et al. (1999) and Erk et al. (2010) chose such pairs from a training corpus to ensure that it contained the target t. In contrast, choosing pairs from an unseen test corpus includes target words whether or not they occur in the training corpus.

To obtain a sample stratified by frequency, rather than skewed heavily toward high-frequency pairs, Erk et al. (2010) drew (R, t) pairs from each of five frequency bands in the entire British National Corpus (BNC): 50-100 occurrences; 101-200; 201-500; 500-1000; and more than 1000. However, we use only half of BNC as our test corpus, so to obtain a comparable test set, we drew 20 (R, t) pairs from each of the corresponding frequency bands in that half: 26-50 occurrences; 51-100; 101-250; 251-500; and more than 500.

For each chosen (R, t) pair, we drew a separate (R, t, r) triple from each of six frequency bands: 1-25 occurrences; 26-50; 51-100; 101-250; 251-500; and more than 500. We necessarily omitted frequency bands that contained no such triples. We filtered out triples where r did not have the most frequent part of speech for the relation R. For example, this filter would exclude the triple (dobj, celebrate, the) because a direct object is most frequently a noun, but the is a determiner.

Then, like Erk et al. (2010), we paired the relative r in each (R, t, r) triple with a distracter r' with the same (most frequent) part of speech as the relative r, yielding the test tuple (R, t, r, r'). Rooth et al. (1999) restricted distracter candidates to words with between 30 and 3,000 occurrences in BNC; accordingly, we chose only distracters with between 15 and 1,500 occurrences in our test corpus. We selected r' from these candidates randomly, with probability proportional to their frequency in the test corpus. Like Rooth et al. (1999), we excluded as distracters any actual relatives, i.e. candidates r' where the test corpus contained the triple (R, t, r'). Table 3 shows the resulting number of (R, t, r, r') test tuples for each relation.

    Relation R    # tuples for R    # tuples for RT
    advmod        121               131
    amod          162               128
    conj_and      155               151
    dobj          145               167
    nn            173               158
    nsubj         97                124
    prep_of       144               153
    xcomp         139               140

Table 3: Test set size for each relation

4.3 Metrics

We report four evaluation metrics: precision, coverage, recall, and F-score. Precision (called "accuracy" in some papers on selectional preferences) is the percentage of all covered tuples where the original relative r is preferred. Coverage is the percentage of tuples for which the model prefers r to r' or vice versa. Recall is the percentage of all tuples where the original relative is preferred, i.e., precision times coverage. F-score is the harmonic mean of precision and recall.

4.4 Baselines

We compare PONG to two baseline methods.

EPP is a state-of-the-art model for which Erk et al. (2010) reported better performance than both Resnik's (1996) WordNet model and Rooth et al.'s (1999) EM clustering model. EPP computes selectional preferences using distributional similarity, based on the assumption that relatives are likely to appear in the same contexts as relatives seen in the training corpus. EPP computes the similarity of a potential relative's vector space representation to relatives in the training corpus.

EPP has various options for its vector space representation, similarity measure, weighting scheme, generalization space, and whether to use PCA. In re-implementing EPP, we chose the options that performed best according to Erk et al. (2010), with one exception. To save work, we chose not to use PCA, which Erk et al. (2010) described as performing only slightly better in the dependency-based space.

    Relation    Target   Relative    Description
    advmod      verb     adverb      Adverbial modifier
    amod        noun     adjective   Adjective modifier
    conj_and    noun     noun        Conjunction with "and"
    dobj        verb     noun        Direct object
    nn          noun     noun        Noun compound modifier
    nsubj       verb     noun        Nominal subject
    prep_of     noun     noun        Prepositional modifier
    xcomp       verb     verb        Open clausal complement

Table 4: Relations tested in the pseudo-disambiguation experiment. Relation names and descriptions are from de Marneffe and Manning (2008) except for prep_of. Target and relative POS are the most frequent POS pairs for the relations in our labeled WSJ corpus.

    Relation     Precision (%)        Coverage (%)         Recall (%)           F-score (%)
                 PONG  EPP   DEP      PONG  EPP   DEP      PONG  EPP   DEP      PONG  EPP   DEP
    advmod       78.7  -     98.6     72.1  -     69.2     56.7  -     68.3     65.9  -     80.7
    advmodT      89.0  71.0  97.4     69.5  100   59.5     61.8  71.0  58.0     73.0  71.0  72.7
    amod         78.8  -     99.0     90.1  -     61.1     71.0  -     60.5     74.7  -     75.1
    amodT        84.1  74.0  97.3     83.6  99.2  57.0     70.3  73.4  55.5     76.6  73.7  70.6
    conj_and     77.2  74.2  100      73.6  100   52.3     56.8  74.2  52.3     65.4  74.2  68.6
    conj_andT    80.5  70.2  97.3     74.8  100   49.7     60.3  70.2  48.3     68.9  70.2  64.6
    dobj         87.2  80.0  97.7     80.7  100   60.0     70.3  80.0  58.6     77.9  80.0  73.3
    dobjT        89.6  80.2  98.1     92.2  100   64.1     82.6  80.2  62.9     86.0  80.2  76.6
    nn           86.7  73.8  97.2     95.3  99.4  63.0     82.7  73.4  61.3     84.6  73.6  75.2
    nnT          83.8  79.7  99.0     93.7  100   60.8     78.5  79.7  60.1     81.0  79.7  74.8
    nsubj        76.1  77.3  100      69.1  100   42.3     52.6  77.3  42.3     62.2  77.3  59.4
    nsubjT       78.5  66.9  95.0     86.3  100   48.4     67.7  66.9  46.0     72.7  66.9  62.0
    prep_of      88.4  77.8  98.4     84.0  100   44.4     74.3  77.8  43.8     80.3  77.8  60.6
    prep_ofT     79.2  76.5  97.4     81.7  100   50.3     64.7  76.5  49.0     71.2  76.5  65.2
    xcomp        84.0  61.9  95.3     85.6  100   61.2     71.9  61.9  58.3     77.5  61.9  72.3
    xcompT       86.4  78.6  98.9     89.3  100   63.6     77.1  78.6  62.9     81.5  78.6  76.9
    average      83.0  74.4  97.9     82.6  99.9  56.7     68.7  74.4  55.5     75.0  74.4  70.5

Table 5: Precision, Coverage, Recall, and F-score for various relations; RT is the inverse of relation R. PONG uses POS N-grams, EPP uses distributional similarity, and DEP uses dependency parses.

To score a potential relative r0, EPP uses this formula:

    Selpref_R,t(r0) = Σ_{r ∈ Seenargs(R,t)} sim(r0, r) · wt_R,t(r) / Z_R,t

Here sim(r0, r) is the nGCM similarity defined below between vector space representations of r0 and a relative r seen in the training data:

    sim_nGCM(a, a') = exp( − Σ_{i=1..n} ( a_bi / |a| − a'_bi / |a'| )² )

where

    |a| = sqrt( Σ_{i=1..n} a_bi² )

The weight function wt_R,t(a) is analogous to inverse document frequency in Information Retrieval.

DEP, our second baseline method, runs the Stanford dependency parser to label the training corpus with grammatical relations, and uses their frequencies to predict selectional preferences. To do the pseudo-disambiguation task, DEP compares the frequencies of (R, t, r) and (R, t, r').

4.5 Relations tested

To test PONG, EPP, and DEP, we chose the eight most frequent relations between content words in the WSJ corpus, which occur over 10,000 times and are described in Table 4. We also tested their inverse relations. However, EPP does not compute selectional preferences for adjectives and adverbs as relatives. For this reason, we did not test EPP on advmod and amod relations with adverbs and adjectives as relatives.

4.6 Experimental results

Table 5 displays results for all 16 relations. To compute statistical significance conservatively in comparing methods, we used paired t-tests with N = 16 relations. PONG's precision was significantly better than EPP's (p<0.001) but worse than DEP's (p<0.0001). Still, PONG's high precision validates its underlying assumption that POS N-grams strongly predict grammatical dependencies. On coverage and recall, EPP beat PONG, which beat DEP (p<0.0001). PONG's F-score was higher, but not significantly, than EPP's (p>0.5) or DEP's (p>0.02).

4.7 Error analysis

In the pseudo-disambiguation task of choosing which of two words is related to a target, PONG makes errors of coverage (preferring neither word) and precision (preferring the wrong word).

Coverage errors, which occurred 17.4% of the time on average, arose only when PONG failed to estimate a probability for either word. PONG fails to score a potential relative r of a target t with a specified relation R if the labeled corpus has no POS N-grams that (a) map to R, (b) contain the POS of t and r, and (c) match Google word N-grams with t and r at those positions. Every relation has at least one POS N-gram that maps to it, so condition (a) never fails. PONG uses the most frequent POS of t and r, and we believe that condition (b) never fails. However, condition (c) can and does fail when t and r do not co-occur in any Google N-grams, at least none that match a POS N-gram that can map to relation R. For example, oversee and diet do not co-occur in any Google N-grams, so PONG cannot score diet as a potential dobj of oversee.

Precision errors, which occur 17% of the time on average, arose when (a) PONG scored the distracter but failed to score the true relative, or (b) scored them both but preferred the distracter. Case (a) accounted for 44.62% of the errors on the covered test tuples.

One likely cause of errors in case (b) is over-generalization when PONG abstracts a word N-gram labeled with a relation by mapping its POS N-gram to that relation. In particular, the coarse POS tag set may discard too much information.

Another likely cause of errors is probabilities estimated poorly due to sparse data. The probability of a relation for a POS N-gram rare in the training corpus is likely to be inaccurate. So is the probability of a POS N-gram for rare co-occurrences of a target and relative in Google word N-grams. Using a smaller tag set may reduce the sparse data problem but increase the risk of over-generalization.

5 Relation to Prior Work

In predicting selectional preferences, a key issue is generalization. Our DEP baseline simply counts co-occurrences of target and relative words in a corpus to predict selectional preferences, but only for words seen in the corpus. Prior work, summarized in Table 6, has therefore tried to infer the similarity of unseen relatives to seen relatives. To illustrate, consider the problem of inducing that the direct objects of celebrate tend to be days or events.

Resnik (1996) combined WordNet with a labeled corpus to model the probability that relatives of a predicate belong to a particular conceptual class. This method could notice, for example, that the direct objects of celebrate tend to belong to the conceptual class event. Thus it could prefer anniversary or occasion as the object of celebrate even if unseen in its training corpus. However, this method depends strongly on the WordNet taxonomy.

Rather than use linguistic resources such as WordNet, Rooth et al. (1999) and im Walde et al. (2008) induced semantically annotated subcategorization frames from unlabeled corpora. They modeled semantic classes as hidden variables, which they estimated using EM-based clustering. Ritter et al. (2010) computed selectional preferences by using unsupervised topic models such as LinkLDA, which infers semantic classes of words automatically instead of requiring a pre-defined set of classes as input.

The contexts in which a linguistic unit occurs provide information about its meaning. Erk (2007) and Erk et al. (2010) modeled the contexts of a word as the distribution of words that co-occur with it. They calculated the semantic similarity of two words as the similarity of their context distributions according to various measures. Erk et al. (2010) reported the state-of-the-art method we used as our EPP baseline.

In contrast to prior work that explored various solutions to the generalization problem, we don't so much solve this problem as circumvent it. Instead of generalizing from a training corpus directly to unseen words, PONG abstracts a word N-gram to a POS N-gram and maps it to the relations that the word N-gram is labeled with.

Resnik, 1996. Relations: Verb-object; Verb-subject; Adjective-noun; Modifier-head; Head-modifier. Lexical resource: senses in WordNet noun taxonomy. Labeled corpus and information used: target, relative, and relation in a parsed, partially sense-tagged corpus (Brown corpus). Unlabeled corpus: none. Method: information theoretic model.

Rooth et al., 1999. Relations: Verb-object; Verb-subject. Lexical resource: none. Labeled corpus: target, relative, and relation in a parsed corpus (parsed BNC). Unlabeled corpus: none. Method: EM-based clustering.

Ritter, 2010. Relations: Verb-subject; Verb-object. Lexical resource: none. Labeled corpus: subject-verb-object tuples from 500 million web-pages. Unlabeled corpus: none. Method: LDA model.

Erk, 2007. Relations: predicate and semantic roles. Lexical resource: none. Labeled corpus: target, relative, and relation in a semantic role labeled corpus (FrameNet). Unlabeled corpus: words and their relations in a parsed corpus (BNC). Method: similarity model based on word co-occurrence.

Erk et al., 2010. Relations: SYN option: Verb-subject, Verb-object, and their inverse relations; SEM option: a verb and semantic roles that have nouns as their headword. Lexical resource: none. Labeled corpus: SYN option: a parsed corpus (parsed BNC); SEM option: a semantic role labeled corpus (FrameNet). Unlabeled corpus, two options: WORDSPACE: an unlabeled corpus (BNC); DEPSPACE: words and their subject and object relations in a primary corpus (parsed BNC), and their inverse relations. Method: similarity model using vector space representation of words.

Zhou et al., 2011. Relations: any (relations not distinguished). Lexical resource: none. Labeled corpus: none; counts of words in Web or Google N-gram. Method: PMI (Pointwise Mutual Information).

This paper. Relations: all grammatical dependencies in a parsed corpus, and their inverse relations. Lexical resource: none. Labeled corpus: POS N-gram distribution for relations in parsed WSJ corpus. Unlabeled corpus: POS N-gram distribution for target and relative in Google N-gram. Method: combine both POS N-gram distributions.
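To make the evaluation protocol of Section 4 concrete, the preference rule of Section 4.1 and the metrics of Section 4.3 can be sketched as follows. This is an illustrative sketch; the score function, returning a probability or None when a model cannot score a pair, is a hypothetical stand-in for PONG, EPP, or DEP:

```python
def evaluate(test_tuples, score):
    """Pseudo-disambiguation over (R, t, r, r') tuples.
    score(R, t, x) returns a probability, or None if the model
    cannot score x as a relative of t under relation R."""
    covered = correct = 0
    for R, t, r, rp in test_tuples:
        s_r, s_rp = score(R, t, r), score(R, t, rp)
        if s_r is None and s_rp is None:
            continue                      # neither word scored: no preference
        if s_r is not None and s_rp is not None and s_r == s_rp:
            continue                      # tie: no preference
        covered += 1                      # a scored word beats an unscored one
        if s_rp is None or (s_r is not None and s_r > s_rp):
            correct += 1                  # the original relative r was preferred
    precision = correct / covered if covered else 0.0
    coverage = covered / len(test_tuples)
    recall = precision * coverage         # as defined in Section 4.3
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, coverage, recall, f_score

scores = {"meat": 0.9, "the": 0.1, "idea": 0.5, "dog": 0.7, "rock": 0.7}
tuples = [("dobj", "eat", "meat", "the"),   # both scored, r preferred
          ("dobj", "eat", "pie", "idea"),   # only the distracter scored: error
          ("nsubj", "run", "dog", "rock")]  # tie: not covered
p, c, rec, f = evaluate(tuples, lambda R, t, x: scores.get(x))
print(p, round(c, 2), round(rec, 2), round(f, 2))  # 0.5 0.67 0.33 0.4
```

The second tuple illustrates a precision error of case (a): the model scores the distracter but not the true relative, so the distracter is preferred.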
Table 6: Comparison with prior methods to compute selectional preferences

To compute selectional preferences, whether the words are in the training corpus or not, PONG applies these abstract mappings to word N-grams in the much larger Google N-grams corpus.

Some prior work on selectional preferences has used POS N-grams and a large unlabeled corpus. The most closely related work we found was by Gormley et al. (2011). They used patterns in POS N-grams to generate test data for their selectional preferences model, but not to infer preferences. Zhou et al. (2011) identified selectional preferences of one word for another by using Pointwise Mutual Information (PMI) (Fano, 1961) to check whether the two words co-occur more frequently in a large corpus than predicted by their unigram frequencies. However, their method did not distinguish among different relations.

6 Conclusion

This paper describes, derives, and evaluates PONG, a novel probabilistic model of selectional preferences. PONG uses a labeled corpus to map POS N-grams to grammatical relations. It combines this mapping with probabilities estimated from a much larger POS-tagged but unlabeled Google N-grams corpus.

We tested PONG on the eight most common relations in the WSJ corpus, and their inverses – more relations than evaluated in prior work. Compared to the state-of-the-art EPP baseline (Erk et al., 2010), PONG averaged higher precision but lower coverage and recall. Compared to the DEP baseline, PONG averaged lower precision but higher coverage and recall. All these differences were substantial (p < 0.001). Compared to both baselines, PONG's average F-score was higher, though not significantly.

Some directions for future work include: First, improve PONG by incorporating models of lexical similarity explored in prior work. Second, use the universal tag set to extend PONG to other languages, or to perform better in English. Third, in place of grammatical relations, use rich, diverse semantic roles, while avoiding sparsity. Finally, use selectional preferences to teach word connotations by using various relations to generate example sentences or useful questions.

Acknowledgments

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education. We thank the helpful reviewers and Katrin Erk for her generous assistance.

References

de Marneffe, M.-C. and Manning, C.D. 2008. Stanford Typed Dependencies Manual. http://nlp.stanford.edu/software/dependencies_manual.pdf, Stanford University, Stanford, CA.

Erk, K. 2007. A Simple, Similarity-Based Model for Selectional Preferences. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, June, 2007, 216-223.

Erk, K., Padó, S. and Padó, U. 2010. A Flexible, Corpus-Driven Model of Regular and Inverse Selectional Preferences. Computational Linguistics 36(4), 723-763.

Fano, R. 1961. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA.

Franz, A. and Brants, T. 2006. All Our N-Gram Are Belong to You.

Gale, W.A., Church, K.W. and Yarowsky, D. 1992. Work on Statistical Methods for Word Sense Disambiguation. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, October 23–25, 1992, 54-60.

Gildea, D. and Jurafsky, D. 2002. Automatic Labeling of Semantic Roles. Computational Linguistics 28(3), 245-288.

Gormley, M.R., Dredze, M., Durme, B.V. and Eisner, J. 2011. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, Sierra Nevada, Spain.

im Walde, S.S., Hying, C., Scheible, C. and Schmid, H. 2008. Combining EM Training and the MDL Principle for an Automatic Verb Classification Incorporating Selectional Preferences. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH, 2008, 496-504.

Klein, D. and Manning, C.D. 2003. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003, E.W. Hinrichs and D. Roth, Eds.

Petrov, S., Das, D. and McDonald, R.T. 2011. A Universal Part-of-Speech Tagset. ArXiv 1104.2086.

Resnik, P. 1996. Selectional Constraints: An Information-Theoretic Model and Its Computational Realization. Cognition 61, 127-159.

Resnik, P. 1997. Selectional Preference and Sense Disambiguation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How, Washington, DC, April 4-5, 1997, 52-57.

Ritter, A., Mausam and Etzioni, O. 2010. A Latent Dirichlet Allocation Method for Selectional Preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, 424-434.

Rooth, M., Riezler, S., Prescher, D., Carroll, G. and Beil, F. 1999. Inducing a Semantically Annotated Lexicon via EM-Based Clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, MD, 1999, Association for Computational Linguistics, 104-111.

Schutze, H. 1992. Context Space. In Proceedings of the AAAI Fall Symposium on Intelligent Probabilistic Approaches to Natural Language, Cambridge, MA, 1992, 113-120.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.
In Proceedings of the Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, 2003, 252–259.

Zhou, G., Zhao, J., Liu, K. and Cai, L. 2011. Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, 2011, 1556–1565.

WebCAGe – A Web-Harvested Corpus Annotated with GermaNet Senses

Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova
University of Tübingen, Department of Linguistics
{firstname.lastname}@uni-tuebingen.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 387–396, Avignon, France, April 23 - 27 2012. © 2012 Association for Computational Linguistics

Abstract

This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. As a proof of concept, this method has been applied to German, a language for which sense-annotated corpora are still in short supply. The sense inventory is taken from the German wordnet GermaNet. The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. The data obtained by this method constitute WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), a resource which currently represents the largest sense-annotated corpus available for German. While the present paper focuses on one particular language, the method as such is language-independent.

1 Motivation

The availability of large sense-annotated corpora is a necessary prerequisite for any supervised and many semi-supervised approaches to word sense disambiguation (WSD). There has been steady progress in the development and in the performance of WSD algorithms for languages such as English for which hand-crafted sense-annotated corpora have been available (Agirre et al., 2007; Erk and Strapparava, 2012; Mihalcea et al., 2004), while WSD research for languages that lack these corpora has lagged behind considerably or has been impossible altogether.

Thus far, sense-annotated corpora have typically been constructed manually, making the creation of such resources expensive and the compilation of larger data sets difficult, if not completely infeasible. It is therefore timely and appropriate to explore alternatives to manual annotation and to investigate automatic means of creating sense-annotated corpora. Ideally, any automatic method should satisfy the following criteria:

(1) The method used should be language independent and should be applicable to as many languages as possible for which the necessary input resources are available.

(2) The quality of the automatically generated data should be extremely high so as to be usable as is or with minimal amount of manual post-correction.

(3) The resulting sense-annotated materials (i) should be non-trivial in size and should be dynamically expandable, (ii) should not be restricted to a narrow subject domain, but be as domain-independent as possible, and (iii) should be freely available for other researchers.

The method presented below satisfies all of the above criteria and relies on the following resources as input: (i) a sense inventory and (ii) a mapping between the sense inventory in question and a web-based resource such as Wiktionary1 or Wikipedia2.

1 http://www.wiktionary.org/

As a proof of concept, this automatic method has been applied to German, a language for which sense-annotated corpora are still in short supply and fail to satisfy most if not all of the criteria under (3) above. While the present paper focuses on one particular language, the method as such is language-independent. In the case of German, the sense inventory is taken from the German wordnet GermaNet3 (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002). The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. This mapping is described in Henrich et al. (2011). The resulting resource consists of a web-harvested corpus WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), which is freely available at: http://www.sfs.uni-tuebingen.de/en/webcage.shtml

The remainder of this paper is structured as follows: Section 2 provides a brief overview of the resources GermaNet and Wiktionary. Section 3 introduces the mapping of GermaNet to Wiktionary and how this mapping can be used to automatically harvest sense-annotated materials from the web. The algorithm for identifying the target words in the harvested texts is described in Section 4. In Section 5, the approach of automatically creating a web-harvested corpus annotated with GermaNet senses is evaluated and compared to existing sense-annotated corpora for German. Related work is discussed in Section 6, together with concluding remarks and an outlook on future work.

2 Resources

2.1 GermaNet

GermaNet (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002) is a lexical semantic network that is modeled after the Princeton WordNet for English (Fellbaum, 1998). It partitions the lexical space into a set of concepts that are interlinked by semantic relations. A semantic concept is represented as a synset, i.e., as a set of words whose individual members (referred to as lexical units) are taken to be (near) synonyms. Thus, a synset is a set-representation of the semantic relation of synonymy.

There are two types of semantic relations in GermaNet. Conceptual relations hold between two semantic concepts, i.e. synsets. They include relations such as hypernymy, part-whole relations, entailment, or causation. Lexical relations hold between two individual lexical units. Antonymy, a pair of opposites, is an example of a lexical relation.

GermaNet covers the three word categories of adjectives, nouns, and verbs, each of which is hierarchically structured in terms of the hypernymy relation of synsets. The development of GermaNet started in 1997, and is still in progress. GermaNet's version 6.0 (release of April 2011) contains 93407 lexical units, which are grouped into 69594 synsets.

2.2 Wiktionary

Wiktionary is a web-based dictionary that is available for many languages, including German. As is the case for its sister project Wikipedia, it is written collaboratively by volunteers and is freely available4. The dictionary provides information such as part-of-speech, hyphenation, possible translations, inflection, etc. for each word. It includes, among others, the same three word classes of adjectives, nouns, and verbs that are also available in GermaNet. Distinct word senses are distinguished by sense descriptions and accompanied with example sentences illustrating the sense in question.

Further, Wiktionary provides relations to other words, e.g., in the form of synonyms, antonyms, hypernyms, hyponyms, holonyms, and meronyms. In contrast to GermaNet, the relations are (mostly) not disambiguated.
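The synset organization described in Section 2.1 can be modeled minimally as follows. This is an illustrative sketch with hypothetical class names, not GermaNet's actual API, and the example hypernymy link is likewise a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class LexicalUnit:
    orth_form: str                  # e.g. "Bogen"
    # Lexical relations (e.g. antonymy) hold between individual units.
    lexical_rels: dict = field(default_factory=dict)

@dataclass
class Synset:
    # A synset groups (near-)synonymous lexical units of one word class.
    units: list
    word_class: str                 # "adjective", "noun", or "verb"
    # Conceptual relations (hypernymy, part-whole, ...) hold between synsets.
    conceptual_rels: dict = field(default_factory=dict)

waffe = Synset([LexicalUnit("Waffe")], "noun")
bogen = Synset([LexicalUnit("Bogen")], "noun")
bogen.conceptual_rels["hypernymy"] = [waffe]   # hypothetical example link
print(bogen.conceptual_rels["hypernymy"][0].units[0].orth_form)  # Waffe
```

The two-level design mirrors the distinction drawn above: conceptual relations attach to synsets, while lexical relations attach to the individual lexical units inside them.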
2 http://www.wikipedia.org/
3 Using a wordnet as the gold standard for the sense inventory is fully in line with standard practice for English where the Princeton WordNet (Fellbaum, 1998) is typically taken as the gold standard.
4 Wiktionary is available under the Creative Commons Attribution/Share-Alike license http://creativecommons.org/licenses/by-sa/3.0/deed.en

For the present project, a dump of the German Wiktionary as of February 2, 2011 is utilized, consisting of 46457 German words comprising 70339 word senses. The Wiktionary data was extracted by the freely available Java-based library JWKTL5.

5 http://www.ukp.tu-darmstadt.de/software/jwktl

3 Creation of a Web-Harvested Corpus

The starting point for creating WebCAGe is an existing mapping of GermaNet senses with Wiktionary sense definitions as described in Henrich et al. (2011). This mapping is the result of a two-stage process: i) an automatic word overlap alignment algorithm in order to match GermaNet senses with Wiktionary sense descriptions, and ii) a manual post-correction step of the automatic alignment. Manual post-correction can be kept at a reasonable level of effort due to the high accuracy (93.8%) of the automatic alignment.

The original purpose of this mapping was to automatically add Wiktionary sense descriptions to GermaNet. However, the alignment of these two resources opens up a much wider range of possibilities for data mining community-driven resources such as Wikipedia and web-generated content more generally. It is precisely this potential that is fully exploited for the creation of the WebCAGe sense-annotated corpus.

Figure 1: Sense mapping of GermaNet and Wiktionary using the example of Bogen.

Fig. 1 illustrates the existing GermaNet-Wiktionary mapping using the example word Bogen. The polysemous word Bogen has three distinct senses in GermaNet which directly correspond to three separate senses in Wiktionary6. Each Wiktionary sense entry contains a definition and one or more example sentences illustrating the sense in question. The examples in turn are often linked to external references, including sentences contained in the German Gutenberg text archive7 (see link in the topmost Wiktionary sense entry in Fig. 1), Wikipedia articles (see link for the third Wiktionary sense entry in Fig. 1), and other textual sources (see the second sense entry in Fig. 1). It is precisely this collection of heterogeneous material that can be harvested for the purpose of compiling a sense-annotated corpus. Since the target word (rendered in Fig. 1 in bold face) in the example sentences for a particular Wiktionary sense is linked to a GermaNet sense via the sense mapping of GermaNet with Wiktionary, the example sentences are automatically sense-annotated and can be included as part of WebCAGe.

6 Note that there are further senses in both resources not displayed here for reasons of space.
7 http://gutenberg.spiegel.de/

Additional material for WebCAGe is harvested by following the links to Wikipedia, the Gutenberg archive, and other web-based materials. The external webpages and the Gutenberg texts are obtained from the web by a web-crawler that takes some URLs as input and outputs the texts of the corresponding web sites. The Wikipedia articles are obtained by the open-source Java Wikipedia Library JWPL8. Since the links to Wikipedia, the Gutenberg archive, and other web-based materials also belong to particular Wiktionary sense entries that in turn are mapped to GermaNet senses, the target words contained in these materials are automatically sense-annotated.

8 http://www.ukp.tu-darmstadt.de/software/jwpl/

Notice that the target word often occurs more than once in a given text. In keeping with the widely used heuristic of "one sense per discourse", multiple occurrences of a target word in a given text are all assigned to the same GermaNet sense. An inspection of the annotated data shows that this heuristic has proven to be highly reliable in practice. It is correct in 99.96% of all target word occurrences in the Wiktionary example sentences, in 96.75% of all occurrences in the external webpages, and in 95.62% of the Wikipedia files.

WebCAGe is developed primarily for the purpose of the word sense disambiguation task. Therefore, only those target words that are genuinely ambiguous are included in this resource. Since WebCAGe uses GermaNet as its sense inventory, this means that each target word has at least two GermaNet senses, i.e., belongs to at least two distinct synsets.

Figure 2: Sense mapping of GermaNet and Wiktionary using the example of Archiv.

The GermaNet-Wiktionary mapping is not always one-to-one. Sometimes one GermaNet sense is mapped to more than one sense in Wiktionary. Fig. 2 illustrates such a case. For the word Archiv each resource records three distinct senses. The first sense ('data repository') in GermaNet corresponds to the first sense in Wiktionary, and the second sense in GermaNet ('archive') corresponds to both the second and third senses in Wiktionary. The third sense in GermaNet ('archived file') does not map onto any sense in Wiktionary at all. As a result, the word Archiv is included in the WebCAGe resource with precisely the sense mappings connected by the arrows shown in Fig. 2. The fact that the second GermaNet sense corresponds to two sense descriptions in Wiktionary simply means that the target words in the examples are both annotated by the same sense. Furthermore, note that the word Archiv is still genuinely ambiguous since there is a second (one-to-one) mapping between the first senses recorded in GermaNet and Wiktionary, respectively. However, since the third GermaNet sense is not mapped onto any Wiktionary sense at all, WebCAGe will not contain any example sentences for this particular GermaNet sense.

The following section describes how the target words within these textual materials can be automatically identified.

4 Automatic Detection of Target Words

For highly inflected languages such as German, target word identification is more complex compared to languages with an impoverished inflectional morphology, such as English, and thus requires automatic lemmatization. Moreover, the target word in a text to be sense-annotated is not always a simplex word but can also appear as subpart of a complex word such as a compound. For this purpose, components of the Apache OpenNLP tools9 and the TreeTagger (Schmid, 1994) are used. Further, compounds are split by using BananaSplit10. Since the automatic lemmatization obtained by the tagger and the compound splitter are not 100% accurate, target word identification also utilizes the full set of inflected forms for a target word whenever such information is available. As it turns out, Wiktionary can often be used for this purpose as well since the German version of Wiktionary often contains the full set of word forms in tables11 such as the one shown in Fig. 3 for the word Bogen.

Figure 3: Wiktionary inflection table for Bogen.

Fig. 4 shows an example of such a sense-annotated text for the target word Bogen 'violin bow'. The text is an excerpt from the Wikipedia article Violine 'violin', where the target word (rendered in bold face) appears many times. Only the second occurrence shown in the figure (marked with a 2 on the left) exactly matches the word Bogen as is. All other occurrences are either the plural form Bögen (4 and 7), the genitive form Bogens (8), part of a compound such as Bogenstange (3), or the plural form as part
Since the constituent parts of a compound of a compound such as in Fernambukb¨ogen and are not usually separated by blank spaces or hy- Sch¨ulerb¨ogen (5 and 6). The first occurrence phens, German compounding poses a particular of the target word in Fig. 4 is also part of a challenge for target word identification. Another compound. Here, the target word occurs in the challenging case for automatic target word detec- singular as part of the adjectival compound bo- tion in German concerns particle verbs such as an- gengestrichenen. k¨undigen ‘announce’. Here, the difficulty arises For expository purposes, the data format shown when the verbal stem (e.g., k¨undigen) is separated in Fig. 4 is much simplified compared to the ac- from its particle (e.g., an) in German verb-initial tual, XML-based format in WebCAGe. The infor- and verb-second clause types. 9 As a preprocessing step for target word identi- http://incubator.apache.org/opennlp/ 10 http://niels.drni.de/s9y/pages/bananasplit.html fication, the text is split into individual sentences, 11 The inflection table cannot be extracted with the Java tokenized, and lemmatized. For this purpose, the Wikipedia Library JWPL. It is rather extracted from the Wik- sentence detector and the tokenizer of the suite tionary dump file. 391 Figure 4: Excerpt from Wikipedia article Violine ‘violin’ tagged with target word Bogen ‘violin bow’. mation for each occurrence of a target word con- ysemous words contained in GermaNet, among sists of the GermaNet sense, i.e., the lexical unit which there are 211 adjectives, 1499 nouns, and ID, the lemma of the target word, and the Ger- 897 verbs (see Table 2). On average, these words maNet word category information, i.e., ADJ for have 2.9 senses in GermaNet (2.4 for adjectives, adjectives, NN for nouns, and VB for verbs. 2.6 for nouns, and 3.6 for verbs). 
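The paper's pipeline relies on OpenNLP, the TreeTagger, and BananaSplit; as a rough illustration of why the Wiktionary inflection tables help, the following sketch matches a target word's known forms in running text, including inside compounds, with a regular expression. The function name and example data are hypothetical, and real compound splitting is far more involved than this substring matching.

```python
import re

def find_target_occurrences(text, inflected_forms):
    """Locate a target word's occurrences, including inside compounds.

    `inflected_forms` is the set of known word forms (e.g. taken from a
    Wiktionary inflection table). A form may appear embedded in a longer
    word, so that compounds such as 'Bogenstange' are also found.
    """
    # Try longer forms first, so 'Bogens' is preferred over 'Bogen'.
    forms = sorted(inflected_forms, key=len, reverse=True)
    pattern = re.compile(
        r"\w*(" + "|".join(map(re.escape, forms)) + r")\w*",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical example with three forms of 'Bogen':
forms = {"Bogen", "Bogens", "Bögen"}
text = "Der Bogen lag neben der Bogenstange; alte Fernambukbögen sind teuer."
print(find_target_occurrences(text, forms))
# ['Bogen', 'Bogenstange', 'Fernambukbögen']
```

Such a surface-form search cannot, of course, handle the separated particle verbs mentioned above; those require the tagger's syntactic analysis.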
5 Evaluation

In order to assess the effectiveness of the approach, we examine the overall size of WebCAGe and the relative size of the different text collections (see Table 1), compare WebCAGe to other sense-annotated corpora for German (see Table 2), and present a precision- and recall-based evaluation of the algorithm that is used for automatically identifying target words in the harvested texts (see Table 3).

Table 1 shows that Wiktionary (7644 tagged word tokens) and Wikipedia (1732) contribute by far the largest subsets of the total number of tagged word tokens (10750), compared with the external webpages (589) and the Gutenberg texts (785). These tokens belong to 2607 distinct polysemous words contained in GermaNet, among which there are 211 adjectives, 1499 nouns, and 897 verbs (see Table 2). On average, these words have 2.9 senses in GermaNet (2.4 for adjectives, 2.6 for nouns, and 3.6 for verbs).

Table 1: Current size of WebCAGe.

                                   Wiktionary  External  Wikipedia  Gutenberg  All
                                   examples    webpages  articles   texts      texts
Number of        adjectives        575         31        79         28         713
tagged word      nouns             4103        446       1643       655        6847
tokens           verbs             2966        112       10         102        3190
                 all word classes  7644        589       1732       785        10750
Number of        adjectives        565         31        76         26         698
tagged           nouns             3965        420       1404       624        6413
sentences        verbs             2945        112       10         102        3169
                 all word classes  7475        563       1490       752        10280
Total            adjectives        623         1297      430        65030      67380
number of        nouns             4184        9630      6851       376159     396824
sentences        verbs             3087        5285      263        146755     155390
                 all word classes  7894        16212     7544       587944     619594

Table 2 also shows that WebCAGe is considerably larger than the other two sense-annotated corpora available for German (Broscheit et al., 2010; Raileanu et al., 2002). It is important to keep in mind, though, that the other two resources were manually constructed, whereas WebCAGe is the result of an automatic harvesting method. Such an automatic method will only constitute a viable alternative to the labor-intensive manual method if the results are of sufficient quality so that the harvested data set can be used as is or can be further improved with a minimal amount of manual post-editing.

Table 2: Comparing WebCAGe to other sense-tagged corpora of German.

                                   WebCAGe  Broscheit et al., 2010  Raileanu et al., 2002
Sense-tagged     adjectives        211      6                       0
words            nouns             1499     18                      25
                 verbs             897      16                      0
                 all word classes  2607     40                      25
Number of tagged word tokens       10750    approx. 800             2421
Domain independent                 yes      yes                     medical domain

For the purpose of the present evaluation, we conducted a precision- and recall-based analysis for the text types of Wiktionary examples, external webpages, and Wikipedia articles separately for the three word classes of adjectives, nouns, and verbs. Table 3 shows that precision and recall for all three word classes that occur in Wiktionary examples, external webpages, and Wikipedia articles lie above 92%. The only sizeable deviations are the results for verbs that occur in the Gutenberg texts. Apart from this one exception, the results in Table 3 prove the viability of the proposed method for automatic harvesting of sense-annotated data. The average precision for all three word classes is of sufficient quality to be used as-is if approximately 2-5% noise in the annotated data is acceptable. In order to eliminate such noise, manual post-editing is required. However, such post-editing is within acceptable limits: it took an experienced research assistant a total of 25 hours to hand-correct all the occurrences of sense-annotated target words and to manually sense-tag any missing target words for the four text types.

Table 3: Evaluation of the algorithm of identifying the target words.

                              Wiktionary  External  Wikipedia  Gutenberg
                              examples    webpages  articles   texts
Precision  adjectives         97.70%      95.83%    99.34%     100%
           nouns              98.17%      98.50%    95.87%     92.19%
           verbs              97.38%      92.26%    100%       69.87%
           all word classes   97.32%      96.19%    96.26%     87.43%
Recall     adjectives         97.70%      97.22%    98.08%     97.14%
           nouns              98.30%      96.03%    92.70%     97.38%
           verbs              97.51%      99.60%    100%       89.20%
           all word classes   97.94%      97.32%    93.36%     95.42%

6 Related Work and Future Directions

With relatively few exceptions to be discussed shortly, the construction of sense-annotated corpora has focussed on purely manual methods. This is true for SemCor, the WordNet Gloss Corpus, and for the training sets constructed for English as part of the SensEval and SemEval shared task competitions (Agirre et al., 2007; Erk and Strapparava, 2010; Mihalcea et al., 2004). Purely manual methods were also used for the German sense-annotated corpora constructed by Broscheit et al. (2010) and Raileanu et al. (2002) as well as for other languages including the Bulgarian and the Chinese sense-tagged corpora (Koeva et al., 2006; Wu et al., 2006). The only previous attempts at harvesting corpus data for the purpose of constructing a sense-annotated corpus are the semi-supervised method developed by Yarowsky (1995), the knowledge-based approach of Leacock et al. (1998), later also used by Agirre and Lopez de Lacalle (2004), and the automatic association of Web directories (from the Open Directory Project, ODP) to WordNet senses by Santamaría et al. (2003).

The latter study (Santamaría et al., 2003) is closest in spirit to the approach presented here. It also relies on an automatic mapping between wordnet senses and a second web resource. While our approach is based on automatic mappings between GermaNet and Wiktionary, their mapping algorithm maps WordNet senses to ODP subdirectories. Since these ODP subdirectories contain natural language descriptions of websites relevant to the subdirectory in question, this textual material can be used for harvesting sense-specific examples. The ODP project also covers German so that, in principle, this harvesting method could be applied to German in order to collect additional sense-tagged data for WebCAGe.

The approach of Yarowsky (1995) first collects all example sentences that contain a polysemous word from a very large corpus. In a second step, a small number of examples that are representative for each of the senses of the polysemous target word is selected from the large corpus from step 1. These representative examples are manually sense-annotated and then fed into a decision-list supervised WSD algorithm as a seed set for iteratively disambiguating the remaining examples collected in step 1. The selection and annotation of the representative examples in Yarowsky's approach is performed completely manually and is therefore limited to the amount of data that can reasonably be annotated by hand.

Leacock et al. (1998), Agirre and Lopez de Lacalle (2004), and Mihalcea and Moldovan (1999) propose a set of methods for automatic harvesting of web data for the purposes of creating sense-annotated corpora. By focusing on web-based data, their work resembles the research described in the present paper. However, the underlying harvesting methods differ. While our approach relies on a wordnet to Wiktionary mapping, their approaches all rely on the monosemous relative heuristic. This heuristic works as follows: In order to harvest corpus examples for a polysemous word, WordNet relations such as synonymy and hypernymy are inspected for the presence of unambiguous words, i.e., words that only appear in exactly one synset. The examples found for these monosemous relatives can then be sense-annotated with the particular sense of their ambiguous word relative. In order to increase coverage of the monosemous relatives approach, Mihalcea and Moldovan (1999) have developed a gloss-based extension, which relies on word overlap of the gloss and the WordNet sense in question for all those cases where a monosemous relative is not contained in the WordNet dataset.

The approaches of Leacock et al., Agirre and Lopez de Lacalle, and Mihalcea and Moldovan as well as Yarowsky's approach provide interesting directions for further enhancing the WebCAGe resource. It would be worthwhile to use the automatically harvested sense-annotated examples as the seed set for Yarowsky's iterative method for creating a large sense-annotated corpus. Another fruitful direction for further automatic expansion of WebCAGe is to use the heuristic of monosemous relatives used by Leacock et al., by Agirre and Lopez de Lacalle, and by Mihalcea and Moldovan. However, we have to leave these matters for future research.

In order to validate the language independence of our approach, we plan to apply our method to sense inventories for languages other than German. A precondition for such an experiment is an existing mapping between the sense inventory in question and a web-based resource such as Wiktionary or Wikipedia. With BabelNet, Navigli and Ponzetto (2010) have created a multilingual resource that allows the testing of our approach on languages other than German. As a first step in this direction, we applied our approach to English using the mapping between the Princeton WordNet and the English version of Wiktionary provided by Meyer and Gurevych (2011). The results of these experiments, which are reported in Henrich et al. (2012), confirm the general applicability of our approach.

To conclude: This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. The data obtained by this method for German have resulted in the WebCAGe resource, which currently represents the largest sense-annotated corpus available for this language. The publication of this paper is accompanied by making WebCAGe freely available.

Acknowledgements

The research reported in this paper was jointly funded by the SFB 833 grant of the DFG and by the CLARIN-D grant of the BMBF. We would like to thank Christina Hoppermann and Marie Hinrichs as well as three anonymous EACL 2012 reviewers for their helpful comments on earlier versions of this paper. We are very grateful to Reinhild Barkey, Sarah Schulz, and Johannes Wahle for their help with the evaluation reported in Section 5. Special thanks go to Yana Panchenko and Yannick Versley for their support with the web-crawler and to Emanuel Dima and Klaus Suttner for helping us to obtain the Gutenberg and Wikipedia texts.

References

Agirre, E., Lopez de Lacalle, O. 2004. Publicly available topic signatures for all WordNet nominal senses. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, pp. 1123-1126.
Agirre, E., Marquez, L., Wicentowski, R. 2007. Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, Stroudsburg, PA, USA.
Broscheit, S., Frank, A., Jehle, D., Ponzetto, S. P., Rehl, D., Summa, A., Suttner, K., Vola, S. 2010. Rapid bootstrapping of Word Sense Disambiguation resources for German. Proceedings of the 10. Konferenz zur Verarbeitung Natürlicher Sprache, Saarbrücken, Germany, pp. 19-27.
Erk, K., Strapparava, C. 2010. Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Stroudsburg, PA, USA.
Fellbaum, C. (ed.). 1998. WordNet: An Electronic Lexical Database. The MIT Press.
Henrich, V., Hinrichs, E. 2010. GernEdiT - The GermaNet Editing Tool. Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, pp. 2228-2235.
Henrich, V., Hinrichs, E., Vodolazova, T. 2011. Semi-Automatic Extension of GermaNet with Sense Definitions from Wiktionary. Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'11), Poznan, Poland, pp. 126-130.
Henrich, V., Hinrichs, E., Vodolazova, T. 2012. An Automatic Method for Creating a Sense-Annotated Corpus Harvested from the Web. Poster presented at the 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2012), New Delhi, India, March 2012.
Koeva, S., Leseva, S., Todorova, M. 2006. Bulgarian Sense Tagged Corpus. Proceedings of the 5th SALTMIL Workshop on Minority Languages: Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy, pp. 79-87.
Kunze, C., Lemnitzer, L. 2002. GermaNet - representation, visualization, application. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 1485-1491.
Leacock, C., Chodorow, M., Miller, G. A. 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147-165.
Meyer, C. M., Gurevych, I. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, pp. 883-892.
Mihalcea, R., Moldovan, D. 1999. An Automatic Method for Generating Sense Tagged Corpora. Proceedings of the American Association for Artificial Intelligence (AAAI'99), Orlando, Florida, pp. 461-466.
Mihalcea, R., Chklovski, T., Kilgarriff, A. 2004. Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain.
Navigli, R., Ponzetto, S. P. 2010. BabelNet: Building a Very Large Multilingual Semantic Network. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'10), Uppsala, Sweden, pp. 216-225.
Raileanu, D., Buitelaar, P., Vintar, S., Bay, J. 2002. Evaluation Corpora for Sense Disambiguation in the Medical Domain. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 609-612.
Santamaría, C., Gonzalo, J., Verdejo, F. 2003. Automatic Association of Web Directories to Word Senses. Computational Linguistics, 29(3):485-502.
Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.
Wu, Y., Jin, P., Zhang, Y., Yu, S. 2006. A Chinese Corpus with Word Sense Annotation. Proceedings of the 21st International Conference on Computer Processing of Oriental Languages (ICCPOL'06), Singapore, pp. 414-421.
Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), Stroudsburg, PA, USA, pp. 189-196.

Learning to Behave by Reading

Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

[email protected]

Abstract In this talk, I will address the problem of grounding linguistic analysis in control applications, such as game playing and robot navigation. We assume access to natural language documents that describe the desired behavior of a control algorithm (e.g., game strategy guides). Our goal is to demonstrate that knowledge automatically extracted from such documents can dramatically improve performance of the target application. First, I will present a reinforcement learning algorithm for learning to map natural language instructions to executable actions. This technique has enabled automation of tasks that until now have required human participation — for example, automatically configuring software by consulting how-to guides. Next, I will present a Monte-Carlo search algorithm for game playing that incorporates information from game strategy guides. In this framework, the task of text inter- pretation is formulated as a probabilistic model that is trained based on feedback from Monte-Carlo search. When applied to the Civilization strategy game, a language-empowered player outperforms its traditional counterpart by a significant margin. 397 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 397, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics Lexical surprisal as a general predictor of reading time Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco Division of Psychology and Language Sciences University College London {ucjtife, s.frank, g.vigliocco}@ucl.ac.uk Abstract predictive. 
For example, on-line detection of se- mantic or syntactic anomalies can be observed in Probabilistic accounts of language process- the brain’s EEG signal (Hagoort et al., 2004) and ing can be psychologically tested by com- eye gaze is directed in anticipation at depictions paring word-reading times (RT) to the con- of plausible sentence completions (Kamide et al., ditional word probabilities estimated by 2003). Moreover, probabilistic accounts of lan- language models. Using surprisal as a link- guage processing have identified unpredictability ing function, a significant correlation be- as a major cause of processing difficulty in lan- tween unlexicalized surprisal and RT has been reported (e.g., Demberg and Keller, guage comprehension. In such incremental pro- 2008), but success using lexicalized models cessing, parsing would entail a pre-allocation of has been limited. In this study, phrase struc- resources to expected interpretations, so that ef- ture grammars and recurrent neural net- fort would be related to the suitability of such works estimated both lexicalized and unlex- an allocation to the actually encountered stimulus icalized surprisal for words of independent (Levy, 2008). sentences from narrative sources. These same sentences were used as stimuli in Possible sentence interpretations can be con- a self-paced reading experiment to obtain strained by both linguistic and extra-linguistic RTs. The results show that lexicalized sur- context, but while the latter is difficult to evalu- prisal according to both models is a signif- ate, the former can be easily modeled: The pre- icant predictor of RT, outperforming its un- dictability of a word for the human parser can be lexicalized counterparts. expressed as the conditional probability of a word given the sentence so far, which can in turn be es- timated by language models trained on text cor- 1 Introduction pora. 
These probabilistic accounts of language Context-sensitive, prediction-based processing processing difficulty can then be validated against has been proposed as a fundamental mechanism empirical data, by taking reading time (RT) on a of cognition (Bar, 2007): Faced with the prob- word as a measure of the effort involved in its pro- lem of responding in real-time to complex stim- cessing. uli, the human brain would use basic information Recently, several studies have followed this ap- from the environment, in conjunction with previ- proach, using “surprisal” (see Section 1.1) as the ous experience, in order to extract meaning and linking function between effort and predictabil- anticipate the immediate future. Such a cognitive ity. These can be computed for each word in a style is a well-established finding in low level sen- text, or alternatively for the words’ parts of speech sory processing (e.g., Kveraga et al., 2007), but (POS). In the latter case, the obtained estimates has also been proposed as a relevant mechanism can give an indication of the importance of syn- in higher order processes, such as language. In- tactic structure in developing upcoming-word ex- deed, there is ample evidence to show that human pectations, but ignore the rich lexical information language comprehension is both incremental and that is doubtlessly employed by the human parser 398 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 398–408, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics to constrain predictions. However, whereas such 1.2 Empirical evidence for surprisal an unlexicalized (i.e., POS-based) surprisal has been shown to significantly predict RTs, success The simplest statistical language models that can with lexical (i.e., word-based) surprisal has been be used to estimate surprisal values are n-gram limited. 
This can be attributed to data sparsity models or Markov chains, which condition the (larger training corpora might be needed to pro- probability of a given word only on its n − 1 pre- vide accurate lexical surprisal than for the unlex- ceding ones. Although Markov models theoret- icalized counterpart), or to the noise introduced ically limit the amount of prior information that by participant’s world knowledge, inaccessible to is relevant for prediction of the next step, they the models. The present study thus sets out to find are often used in linguistic context as an approx- such a lexical surprisal effect, trying to overcome imation to the full conditional probability. The possible limitations of previous research. effect of bigram probability (or forward transi- tional probability) has been repeatedly observed 1.1 Surprisal theory (e.g. McDonald and Shillcock, 2003), and Smith The concept of surprisal originated in the field of and Levy (2008) report an effect of lexical sur- information theory, as a measure of the amount of prisal as estimated by a trigram model on RTs information conveyed by a particular event. Im- for the Dundee corpus (a collection of newspaper probable (‘surprising’) events carry more infor- texts with eye-tracking data from ten participants; mation than expected ones, so that surprisal is in- Kennedy and Pynte, 2005). versely related to probability, through a logarith- Phrase structure grammars (PSGs) have also mic function. In the context of sentence process- been amply used as language models (Boston et ing, if w1 , ..., wt−1 denotes the sentence so far, al., 2008; Brouwer et al., 2010; Demberg and then the cognitive effort required for processing Keller, 2008; Hale, 2001; Levy, 2008). 
PSGs the next word, wt , is assumed to be proportional can combine statistical exposure effects with ex- to its surprisal: plicit syntactic rules, by annotating norms with their respective probabilities, which can be es- effort(t) ∝ surprisal(wt ) timated from occurrence counts in text corpora. Information about hierarchical sentence structure = − log(P (wt |w1 , ..., wt−1 )) (1) can thus be included in the models. In this way, Different theoretical groundings for this rela- Brouwer et al. trained a probabilistic context- tionship have been proposed (Hale, 2001; Levy free grammar (PCFG) on 204,000 sentences ex- 2008; Smith and Levy, 2008). Smith and Levy tracted from Dutch newspapers to estimate lexi- derive it by taking a scale free assumption: Any cal surprisal (using an Earley-Stolcke parser; Stol- linguistic unit can be subdivided into smaller en- cke, 1995), showing that it could account for tities (e.g., a sentence is comprised of words, a the noun phrase coordination bias previously de- word of phonemes), so that time to process the scribed and explained by Frazier (1987) in terms whole will equal the sum of processing times for of a minimal-attachment preference of the human each part. Since the probability of the whole can parser. In contrast, Demberg and Keller used texts be expressed as the product of the probabilities of from a naturalistic source (the Dundee corpus) as the subunits, the function relating probability and the experimental stimuli, thus evaluating surprisal effort must be logarithmic. Levy (2008), on the as a wide-coverage account of processing diffi- other hand, grounds surprisal in its information- culty. They also employed a PSG, trained on a theoretical context, describing difficulty encoun- one-million-word language sample from the Wall tered in on-line sentence processing as a result of Street Journal (part of the Penn Treebank II, Mar- the need to update a probability distribution over cus et al., 1993). 
possible parses, being directly proportional to the difference between the previous and updated distributions. By expressing the difference between these in terms of relative entropy, Levy shows that difficulty at each newly encountered word should be equal to its surprisal.

Using Roark's (2001) incremental parser, they found significant effects of unlexicalized surprisal on RTs (see also Boston et al., 2008, for a similar approach and results for German texts). However, they failed to find an effect for lexicalized surprisal over and above forward transitional probability. Roark et al. (2009) also looked at the effects of syntactic and lexical surprisal, using RT data for short narrative texts. However, their estimates of these two surprisal values differ from those described above: In order to tease apart semantic and syntactic effects, they used Demberg and Keller's lexicalized surprisal as a total surprisal measure, which they decompose into syntactic and lexical components. Their results show significant effects of both syntactic and lexical surprisal, although the latter was found to hold only for closed-class words. Lack of a wider effect was attributed to data sparsity: The models were trained on the relatively small Brown corpus (over one million words from 500 samples of American English text), so that surprisal estimates for the less frequent content words would not have been accurate enough.

Using the same training and experimental language samples as Demberg and Keller (2008), and only unlexicalized surprisal estimates, Frank (2009) and Frank and Bod (2011) focused on comparing different language models, including various n-gram models, PSGs and recurrent neural networks (RNNs). The latter were found to be the better predictors of RTs, and PSGs could not explain any variance in RT over and above the RNNs, suggesting that human processing relies on linear rather than hierarchical representations.

Summing up, the only models taking into account actual words that have consistently been shown to simulate human behaviour with naturalistic text samples are bigram models.[1] A possible limitation in previous studies can be found in the stimuli employed. In reading real newspaper texts, prior knowledge of current affairs is likely to strongly influence RTs; however, this source of variability cannot be accounted for by the models. In addition, whereas the models treat each sentence as an independent unit, the sentences in the text corpora employed make up coherent texts, and are therefore clearly dependent. Thirdly, the stimuli used by Demberg and Keller (2008) comprise a very particular linguistic style, journalistic editorials, reducing the ability to generalize conclusions to language in general. Finally, failure to find lexical surprisal effects can also be attributed to the training texts. Larger corpora are likely to be needed for training language models on actual words than on POS (both the Brown corpus and the WSJ are relatively small), and in addition, the particular journalistic style of the WSJ might not be the best alternative for modeling human behaviour. Although similarity between the training and experimental data sets (both from newspaper sources) can improve the linguistic performance of the models, their ability to simulate human behaviour might be limited: Newspaper texts probably form just a small fraction of a person's linguistic experience.

This study thus aims to tackle some of the identified limitations: Rather than cohesive texts, independent sentences from a narrative style are used as experimental stimuli, for which word-reading times are collected (as explained in Section 3). In addition, as discussed in the following section, language models are trained on a larger corpus, from a more representative language sample. Following Frank (2009) and Frank and Bod (2011), two contrasting types of models are employed: hierarchical PSGs and linear RNNs.

2 Models

2.1 Training data

The training texts were extracted from the written section of the British National Corpus (BNC), a collection of language samples from a variety of sources, designed to provide a comprehensive representation of current British English. A total of 702,412 sentences, containing only the 7,754 most frequent words (the open-class words used by Andrews et al., 2009, plus the 200 most frequent words in English), were selected, making up a 7.6-million-word training corpus. In addition to providing a larger amount of data than the WSJ, this training set thus provides a more representative language sample.

2.2 Experimental sentences

Three hundred and sixty-one sentences, all comprehensible out of context and containing only words included in the subset of the BNC used to train the models, were randomly selected from three freely accessible on-line novels[2] (for additional details, see Frank, 2012). The fictional narrative provides a good contrast to the previously examined newspaper editorials from the Dundee corpus, since participants did not need prior knowledge regarding the details of the stories, and a less specialised language and style were employed.

[1] Although Smith and Levy (2008) report an effect of trigrams, they did not check if it exceeded that of simpler bigrams.
[2] Obtained from www.free-online-novels.com. Having not been published elsewhere, it is unlikely participants had read the novels previously.
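The bigram surprisal referred to above reduces to the negative log of a forward transitional probability. A minimal sketch, assuming a hypothetical toy corpus (the models in this study were trained on the BNC, not on anything like this):

```python
import math
from collections import Counter

# Hypothetical toy corpus; purely illustrative.
corpus = "the cat sat on the mat the cat slept on the mat".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])              # counts of left-context words

def bigram_surprisal(prev, word):
    """Surprisal in bits: -log2 P(word | prev), using unsmoothed MLE estimates."""
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])

print(bigram_surprisal("the", "cat"))  # 2 of the 4 "the" tokens precede "cat" -> 1.0 bit
```

A real model would smooth these counts (the study uses Good-Turing smoothing for its co-occurrence statistics); the unsmoothed version above assigns infinite surprisal to unseen bigrams.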
In addition, the randomly selected sentences did not make up coherent texts (in contrast, Roark et al., 2009, employed short stories), so that they were independent from each other, both for the models and the readers.

2.3 Part-of-speech tagging

In order to produce POS-based surprisal estimates, versions of both the training and experimental texts with their words replaced by POS were developed: The BNC sentences were parsed by the Stanford Parser, version 1.6.7 (Klein and Manning, 2003), whilst the experimental texts were tagged by an automatic tagger (Tsuruoka and Tsujii, 2005), with posterior review and correction by hand following the Penn Treebank Project guidelines (Santorini, 1991). By training language models and subsequently running them on the POS versions of the texts, unlexicalized surprisal values were estimated.

Figure 1: Architecture of neural network language model, and its three learning stages. Numbers indicate the number of units in each network layer.

2.4 Phrase-structure grammars

The treebank formed by the parsed BNC sentences served as training data for Roark's (2001) incremental parser. Following Frank and Bod (2011), a range of grammars was induced, differing in the features of the tree structure upon which rule probabilities were conditioned. In four grammars, probabilities depended on the left-hand side's ancestors, from one up to four levels up in the parse tree (these grammars will be denoted a1 to a4). In four other grammars (s1 to s4), the ancestors' left siblings were also taken into account. In addition, probabilities were conditioned on the current head node in all grammars. Subsequently, Roark's (2001) incremental parser parsed the experimental sentences under each of the eight grammars, obtaining eight surprisal values for each word. Since earlier research (Frank, 2009) showed that decreasing the parser's base beam width parameter improves performance, it was set to 10^-18 (the default being 10^-12).

2.5 Recurrent neural network

The RNN (see Figure 1) was trained in three stages, each taking the selected (unparsed) BNC sentences as training data.

Stage 1: Developing word representations

Neural network language models can benefit from using distributed word representations: Each word is assigned a vector in a continuous, high-dimensional space, such that words that are paradigmatically more similar are closer together (e.g., Bengio et al., 2003; Mnih and Hinton, 2007). Usually, these representations are learned together with the rest of the model, but here we used a more efficient approach in which word representations are learned in an unsupervised manner from simple co-occurrences in the training data. First, vectors of word co-occurrence frequencies were developed, using Good-Turing (Gale and Sampson, 1995) smoothed frequency counts from the training corpus. Values in the vector corresponded to the smoothed frequencies with which each word directly preceded or followed the represented word. Thus, each word w was assigned a vector (f_{w,1}, ..., f_{w,15508}), such that f_{w,v} is the number of times word v directly precedes (for v ≤ 7754) or follows (for v > 7754) word w. Next, the frequency counts were transformed into Pointwise Mutual Information (PMI) values (see Equation 2), following Bullinaria and Levy's (2007) findings that PMI produced more psychologically accurate predictions than other measures:

    PMI(w, v) = log [ ( f_{w,v} Σ_{i,j} f_{i,j} ) / ( Σ_i f_{i,v} Σ_j f_{w,j} ) ]    (2)

Finally, the 400 columns with the highest variance were selected from the 7754 × 15508 matrix of row vectors, making them more computationally manageable but not significantly less informative.

Stage 2: Learning temporal structure

Using the standard backpropagation algorithm, a simple recurrent network (SRN) learned to predict, at each point in the training corpus, the next word's vector given the sequence of word vectors corresponding to the sentence so far. The total corpus was presented five times, each time with the sentences in a different random order.

Stage 3: Decoding predicted word representations

The distributed output of the trained SRN served as training input to the feedforward "decoder" network, which learned to map the distributed representations back to localist ones. This network, too, used standard backpropagation. Its output units had softmax activation functions, so that the output vector constitutes a probability distribution over word types. These translate directly into surprisal values, which were collected over the experimental sentences at ten intervals over the course of Stage 3 training (after presenting 2K, 5K, 10K, 20K, 50K, 100K, 200K, and 350K sentences, and after presenting the full training corpus once and twice). These will be denoted RNN-1 to RNN-10.

A much simpler RNN model suffices for obtaining unlexicalized surprisal. Here, we used the same models as described by Frank and Bod (2011), albeit trained on the POS tags of our BNC training corpus. These models employed so-called Echo State Networks (ESNs; Jaeger and Haas, 2004), which are RNNs that do not develop internal representations, because the weights of input and recurrent connections remain fixed at random values (only the output connection weights are trained). Networks of six different sizes were used. Of each size, three networks were trained, using different random weights. The best and worst model of each size were discarded, to reduce the effect of the random weights.

3 Experiment

3.1 Procedure

Text display followed a self-paced reading paradigm: Sentences were presented on a computer screen one word at a time, with the onset of the next word being controlled by the subject through a key press. The time between word onset and subsequent key press was recorded as the RT (measured in milliseconds) on that word by that subject.[3] Words were presented centrally aligned on the screen, and punctuation marks appeared with the word that preceded them. A fixed-width font type (Courier New) was used, so that the physical size of a word equalled its number of characters. Order of presentation was randomized for each subject. The experiment was time-bounded to 40 minutes, and the number of sentences read by each participant varied between 120 and 349, with an average of 224. Yes-no comprehension questions followed 46% of the sentences.

[3] The collected RT data are available for download at www.stefanfrank.info/EACL2012.

3.2 Participants

A total of 117 first-year psychology students took part in the experiment. Subjects unable to answer correctly to more than 20% of the questions, and 47 participants who were non-native English speakers, were excluded from the analysis, leaving a total of 54 subjects.

3.3 Design

The obtained RTs served as the dependent variable against which a mixed-effects multiple regression analysis with crossed random effects for subjects and items (Baayen et al., 2008) was performed. In order to control for low-level lexical factors that are known to influence RTs, such as word length or frequency, a baseline regression model taking them into account was built. Subsequently, the decrease in the model's deviance after the inclusion of surprisal as a fixed factor to the baseline was assessed using likelihood tests. The resulting χ² statistic indicates the extent to which each surprisal estimate accounts for RT, and can thus serve as a measure of the psychological accuracy of each model.

However, this kind of analysis assumes that the RT for a word reflects processing of only that word, but spill-over effects (in which processing difficulty at word w_t shows up in the RT on w_{t+1}) have been found in self-paced and natural reading (Just et al., 1982; Rayner, 1998; Rayner and Pollatsek, 1987). To evaluate these effects, the decrease in deviance after adding surprisal of the previous item to the baseline was also assessed.

The following control predictors were included in the baseline regression model:

Lexical factors:

• Number of characters: Both physical size and number of characters have been found to affect RTs for a word (Rayner and Pollatsek, 1987), but the fixed-width font used in the experiment assured that number of characters also encoded physical word length.

• Frequency and forward transitional probability: The effects of these two factors have been repeatedly reported (e.g. Juhasz and Rayner, 2003; Rayner, 1998). Given the high correlations between surprisal and these two measures, their inclusion in the baseline assures that the results can be attributed to predictability in context, over and above frequency and bigram probability. Frequency was estimated from occurrence counts of each word in the full BNC corpus (written section). The same transformation (negative logarithm) was applied as for computing surprisal, thus obtaining "unconditional" and bigram surprisal values.

• Previous word lexical factors: Lexical factors for the previous word were included in the analysis to control for spill-over effects.

Temporal factors and autocorrelation: RT data over naturalistic texts violate the regression assumption of independence of observations in several ways, and important word-by-word sequential correlations exist. In order to ensure validity of the statistical analysis, as well as to provide a better model fit, the following factors were also included:

• Sentence position: Fatigue and practice effects can influence RTs. Sentence position in the experiment was included as both a linear and a quadratic factor, allowing for the modeling of an initial speed-up due to practice, followed by a slowing down due to fatigue.

• Word position: Low-level effects of word order, not related to predictability itself, were modeled by including word position in the sentence, both as a linear and a quadratic factor (some of the sentences were quite long, so that the effect of word position is unlikely to be linear).

• Reading time for previous word: As suggested by Baayen and Milin (2010), including the RT on the previous word can control for several autocorrelation effects.

4 Results

Data were analysed using the free statistical software package R (R Development Core Team, 2009) and the lme4 library (Bates et al., 2011). Two analyses were performed for each language model, using surprisal for either the current or the previous word as predictor. Unlikely reading times (lower than 50 ms or over 3000 ms) were removed from the analysis, as were clitics, words followed by punctuation, words following punctuation or clitics (since factors for the previous word were included in the analysis), and sentence-initial words, leaving a total of 132,298 data points (between 1,335 and 3,829 per subject).

4.1 Baseline model

Theoretical considerations guided the selection of the initial predictors presented above, but an empirical approach led the actual regression model building. Initial models with the original set of fixed effects, all two-way interactions, plus random intercepts for subjects and items were evaluated, and the least significant factors were removed one at a time, until only significant predictors were left (|t| > 2). A different strategy was used to assess which by-subject and by-item random slopes to include in the model. Given the large number of predictors, starting from the saturated model with all random slopes generated non-convergence problems and excessively long running times. By-subject and by-item random slopes for each fixed effect were therefore assessed individually, using likelihood tests. The final baseline model included by-subject random intercepts, by-subject random slopes for sentence position and word position, and by-item slopes for previous RT. All factors (random slopes and fixed effects) were centred and standardized to avoid multicollinearity-related problems.

Figure 2: Psychological accuracy (χ², combined effect of current and previous surprisal) against linguistic accuracy (negative average surprisal) of the different models. Numbered labels denote the maximum number of levels up in the tree from which conditional information is used (PSG); the point in training when estimates were collected (word-based RNN); or network size (POS-based RNN).
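The PMI transformation of Equation 2 (Section 2.5) amounts to comparing each observed co-occurrence count with what the row and column totals alone would predict. A minimal sketch, assuming a hypothetical 2×3 count matrix (the study itself used Good-Turing smoothed counts over the 7,754-word BNC subset):

```python
import math

# Hypothetical co-occurrence counts f[w][v]: rows = target words, columns = contexts.
f = [
    [10, 0, 2],
    [2, 6, 4],
]

total = sum(sum(row) for row in f)            # sum_{i,j} f_{i,j}
row_sums = [sum(row) for row in f]            # sum_j f_{w,j}
col_sums = [sum(f[i][v] for i in range(len(f)))
            for v in range(len(f[0]))]        # sum_i f_{i,v}

def pmi(w, v):
    """PMI(w, v) = log( f[w][v] * total / (col_sums[v] * row_sums[w]) ), as in Eq. 2."""
    return math.log(f[w][v] * total / (col_sums[v] * row_sums[w]))

print(pmi(0, 0))  # cell (0, 0) co-occurs more often than chance -> positive PMI
```

Note that zero-count cells are undefined under PMI (log 0); the smoothed counts used in the study avoid this problem.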
4.2 Surprisal effects

All model categories (PSGs and RNNs) produced lexicalized surprisal estimates that led to a significant (p < 0.05) decrease in deviance when included as a fixed factor in the baseline, with positive coefficients: Higher surprisal led to longer RTs. Significant effects were also found for their unlexicalized counterparts, albeit with considerably smaller χ²-values.

Both for the lexicalized and the unlexicalized versions, these effects persisted whether surprisal for the previous or the current word was taken as the independent variable. However, the effect size was much larger for previous surprisal, indicating the presence of strong spill-over effects (e.g. lexicalized PSG-s3: current surprisal: χ²(1) = 7.29, p = 0.007; previous surprisal: χ²(1) = 36.73, p < 0.001).

From here on, only results for the combined effect of both (inclusion of previous and current surprisal as fixed factors in the baseline) are reported. Figure 2 shows the psychological accuracy of each model (χ²(2) values) plotted against its linguistic accuracy (i.e., its quality as a language model, measured by the negative average surprisal on the experimental sentences: the higher this value, the "less surprised" the model is by the test corpus). For the lexicalized models, RNNs clearly outperform PSGs. Moreover, the RNN's accuracy increases as training progresses (the highest psychological accuracy is achieved at point 8, when 350K training sentences were presented). The PSGs taking into account sibling nodes are slightly better than their ancestor-only counterparts (the best psychological model is PSG-s3). Contrary to the trend reported by Frank and Bod (2011), the unlexicalized PSGs and RNNs reach similar levels of psychological accuracy, with PSG-s4 achieving the highest χ²-value.

    Model comparison    χ²(2)   p-value
    PSG over RNN        12.45   0.002
    RNN over PSG        30.46   < 0.001

Table 1: Model comparison between the best performing word-based PSG and RNN.

Although RNNs outperform PSGs in the lexicalized estimates, comparisons between the best performing model (i.e. highest χ²) in each category showed that both were able to explain variance over and above each other (see Table 1). It is worth noting, however, that if comparisons are made amongst models including surprisal for the current, but not the previous, word, the PSG is unable to explain a significant amount of variance over and above the RNN (χ²(1) = 2.28; p = 0.13).[4] Lexicalized models achieved greater psychological accuracy than their unlexicalized counterparts, but the latter could still explain a small amount of variance over and above the former (see Table 2).[5]

    Model comparison           χ²(2)   p-value
    Best models overall:
      POS- over word-based     10.40   0.006
      word- over POS-based     47.02   < 0.001
    PSGs:
      POS- over word-based      6.89   0.032
      word- over POS-based     25.50   < 0.001
    RNNs:
      POS- over word-based      5.80   0.055
      word- over POS-based     49.74   < 0.001

Table 2: Word- vs. POS-based models: comparisons between the best models overall, and the best models within each category.

4.3 Differences across word classes

In order to make sure that the lexicalized surprisal effects found were not limited to closed-class words (as Roark et al., 2009, report), a further model comparison was performed by adding by-POS random slopes of surprisal to the models containing the baseline plus surprisal. If particular syntactic categories were contributing to the overall effect of surprisal more than others, including such random slopes would lead to additional variance being explained. However, this was not the case: Inclusion of by-POS random slopes of surprisal did not lead to a significant improvement in model fit (PSG: χ²(1) = 0.86, p = 0.35; RNN: χ²(1) = 3.20, p = 0.07).[6]

[4] The best models in this case were PSG-a3 and RNN-7.
[5] Since the best performing lexicalized and unlexicalized models belonged to different groups (RNN and PSG, respectively), Table 2 also shows comparisons within model type.
[6] Comparison was made on the basis of previous-word surprisal (the best models in this case were PSG-s3 and RNN-9).

5 Discussion

The present study aimed to find further evidence for surprisal as a wide-coverage account of language processing difficulty, and indeed, the results show the ability of lexicalized surprisal to explain a significant amount of variance in RT data for naturalistic texts, over and above that accounted for by other low-level lexical factors, such as frequency, length, and forward transitional probability. Although previous studies had presented results supporting such a probabilistic language processing account, evidence for word-based surprisal was limited: Brouwer et al. (2010) only examined a specific psycholinguistic phenomenon, rather than a random language sample; Demberg and Keller (2008) reported effects that were only significant for POS-based but not word-based surprisal; and Smith and Levy (2008) found an effect of lexicalized surprisal (according to a trigram model), but did not assess whether simpler predictability estimates (i.e., by a bigram model) could have accounted for those effects.

Demberg and Keller's (2008) failure to find lexicalized surprisal effects can be attributed both to the corpus used to train the language models and to the experimental texts used. Both were sourced from newspaper texts: As training corpora these are unrepresentative of a person's linguistic experience, and as experimental texts they are heavily dependent on participants' world knowledge. Roark et al. (2009), in contrast, used a more representative, albeit relatively small, training corpus, as well as narrative-style stimuli, thus obtaining RTs less dependent on participants' prior knowledge. With such an experimental set-up, they were able to demonstrate effects of lexical surprisal on the RT of closed-class, but not open-class, words, which they attributed to their differential frequency and to training-data sparsity: The limited Brown corpus would have been enough to produce accurate estimates of surprisal for function words, but not for the less frequent content words. A larger training corpus, constituting a broad language sample, was used in our study, and the detected surprisal effects were shown to hold across syntactic categories (modeling slopes for POS separately did not improve model fit). However, direct comparison with Roark et al.'s results is not possible: They employed alternative definitions of structural and lexical surprisal, which they derived by decomposing the total surprisal as obtained with a fully lexicalized PSG model.

In the current study, an approach similar to that taken by Demberg and Keller (2008) was used to define structural (or unlexicalized) and lexicalized surprisal, but the results are strikingly different: Whereas Demberg and Keller report a significant effect for POS-based estimates, but not for word-based surprisal, our results show that lexicalized surprisal is a far better predictor of RTs than its unlexicalized counterpart. This is not surprising, given that while the unlexicalized models only have access to syntactic sources of information, the lexicalized models, like the human parser, can also take into account lexical co-occurrence trends. However, when a training corpus is not large enough to accurately capture the latter, it might still be able to model the former, given the higher frequency of occurrence of each possible item (POS vs. word) in the training data. Roark et al. (2009) also included in their analysis a POS-based surprisal estimate, which lost significance when the two components of the lexicalized surprisal were present, suggesting that such unlexicalized estimates can be interpreted only as a coarse version of the fully lexicalized surprisal, incorporating both syntactic and lexical sources of information at the same time. The results presented here do not replicate this finding: The best unlexicalized estimates were able to explain additional variance over and above the best word-based estimates. However, this comparison contrasted two different model types, a word-based RNN and a POS-based PSG, so that the observed effects could be attributed to the model representations (hierarchical vs. linear) rather than to the item of analysis (POS vs. words). Within-model comparisons showed that unlexicalized estimates were still able to account for additional variance, although only reaching significance at the 0.05 level for the PSGs.

Previous results reported by Frank (2009) and Frank and Bod (2011), regarding the higher psychological accuracy of RNNs and the inability of the PSGs to explain any additional variance in RT, were not replicated. Although for the word-based estimates RNNs outperform the PSGs, we found both to have independent effects. Furthermore, in the POS-based analysis, performance of PSGs and RNNs reaches similarly high levels of psychological accuracy, with the best-performing PSG producing slightly better results than the best-performing RNN. This discrepancy in the results could reflect contrasting reading styles in the two studies: natural reading of newspaper texts, or self-paced reading of independent, narrative sentences. The absence of global context, or the unnatural reading methodology employed in the current experiment, could have led to an increased reliance on hierarchical structure for sentence comprehension. The sources and structures relied upon by the human parser to elaborate upcoming-word expectations could therefore be task-dependent. On the other hand, our results show that the independent effects of word-based PSG estimates only become apparent when investigating the effect of surprisal of the previous word. That is, considering only the current word's surprisal, as in Frank and Bod's analysis, did not reveal a significant contribution of PSGs over and above RNNs. Thus, additional effects of PSG surprisal might only be apparent when spill-over effects are investigated, by taking previous-word surprisal as a predictor of RT.

6 Conclusion

The results presented here show that lexicalized surprisal can indeed model RT over naturalistic texts, thus providing a wide-coverage account of language processing difficulty. Failure of previous studies to find such an effect could be attributed to the size or nature of the training corpus, suggesting that larger and more general corpora are needed to successfully model both the structural and lexical regularities used by the human parser to generate predictions. Another crucial finding presented here is the importance of spill-over effects: Surprisal of a word had a much larger influence on the RT of the following item than on that of the word itself. Previous studies in which lexicalized surprisal was only analysed in relation to current RT could have missed a significant effect only manifested on the following item. Whether spill-over effects are as important for different RT collection paradigms (e.g., eye-tracking) remains to be tested.
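The χ² statistics reported in Sections 4 and 5 convert to p-values through the chi-square survival function, which has a closed form for the one and two degrees of freedom used in these comparisons. A sketch using the standard formulas (illustrative only; the analyses themselves were run in R with lme4):

```python
import math

def chi2_p(x, df):
    """P(chi-square with df degrees of freedom > x), closed forms for df in {1, 2}."""
    if df == 1:
        # chi2(1) is the square of a standard normal variable.
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # chi2(2) is exponential with mean 2.
        return math.exp(-x / 2.0)
    raise ValueError("closed form given only for df = 1 or 2")

# The PSG-s3 current-surprisal effect: chi2(1) = 7.29 -> p = 0.007
print(round(chi2_p(7.29, 1), 3))
```

For instance, chi2_p(7.29, 1) reproduces the p = 0.007 reported for the PSG-s3 current-word surprisal effect, and chi2_p(36.73, 1) confirms p < 0.001 for the previous-word effect.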
Acknowledgments

The research presented here was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant number 253803. The authors acknowledge the use of the UCL Legion High Performance Computing Facility, and associated support services, in the completion of this work.

References

Gerry T.M. Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73:247–264.

Mark Andrews, Gabriella Vigliocco, and David P. Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116:463–498.

R. Harald Baayen and Petar Milin. 2010. Analyzing reaction times. International Journal of Psychological Research, 3:12–28.

R. Harald Baayen, Doug J. Davidson, and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59:390–412.

Moshe Bar. 2007. The proactive brain: using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11:280–289.

Douglas Bates, Martin Maechler, and Ben Bolker. 2011. lme4: Linear mixed-effects models using S4 classes. Available from: http://CRAN.R-project.org/package=lme4 (R package version 0.999375-39).

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Marisa Ferrara Boston, John Hale, Reinhold Kliegl, Umesh Patil, and Shravan Vasishth. 2008. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2:1–12.

Harm Brouwer, Hartmut Fitz, and John C. J. Hoeks. 2010. Modeling the noun phrase versus sentence coordination ambiguity in Dutch: evidence from surprisal theory. In Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics, pages 72–80, Stroudsburg, PA, USA.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193–210.

Stefan L. Frank and Rens Bod. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22:829–834.

Stefan L. Frank. 2009. Surprisal-based comparison between a symbolic and a connectionist model of sentence processing. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1139–1144, Austin, TX.

Stefan L. Frank. 2012. Uncertainty reduction as a measure of cognitive processing load in sentence comprehension. Manuscript submitted for publication.

Peter Hagoort, Lea Hald, Marcel Bastiaansen, and Karl Magnus Petersson. 2004. Integration of word meaning and world knowledge in language comprehension. Science, 304:438–441.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Stroudsburg, PA.

Herbert Jaeger and Harald Haas. 2004. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80.

Barbara J. Juhasz and Keith Rayner. 2003. Investigating the effects of a set of intercorrelated variables on eye fixation durations in reading. Journal of Experimental Psychology: Learning, Memory and Cognition, 29:1312–1318.

Marcel A. Just, Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111:228–238.

Yuki Kamide, Christoph Scheepers, and Gerry T. M. Altmann. 2003. Integration of syntactic and semantic information in predictive processing: cross-linguistic evidence from German and English. Journal of Psycholinguistic Research, 32:37–55.

Alan Kennedy and Joël Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45:153–168.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430.

Kestutis Kveraga, Avniel S. Ghuman, and Moshe Bar. 2007. Top-down predictions in the cognitive brain. Brain and Cognition, 65:145–168.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106:1126–1177.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

Scott A. McDonald and Richard C. Shillcock. 2003. Low-level predictive inference in reading: the influence of transitional probabilities on eye movements. Vision Research, 43:1735–1751.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 25th International Conference on Machine Learning, pages 641–648.

Keith Rayner and Alexander Pollatsek. 1987. Eye movements in reading: A tutorial review. In M. Coltheart, editor, Attention and Performance XII: The Psychology of Reading, pages 327–362. Lawrence Erlbaum Associates, London, UK.

Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372–422.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 324–333, Stroudsburg, PA.

Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27:249–276.

Beatrice Santorini. 1991. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, Philadelphia, PA.

Nathaniel J. Smith and Roger Levy. 2008. Optimal processing times in reading: a formal model and empirical investigation. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 595–600, Austin, TX.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21:165–201.

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 467–474, Stroudsburg, PA.

Spectral Learning for Non-Deterministic Dependency Parsing

Franco M. Luque
Universidad Nacional de Córdoba and CONICET
Córdoba X5000HUA, Argentina

Ariadna Quattoni, Borja Balle, and Xavier Carreras
Universitat Politècnica de Catalunya
Barcelona E-08034
{aquattoni,bballe,carreras}@lsi.upc.edu
[email protected]

Abstract

In this paper we study spectral learning methods for non-deterministic split head-automata grammars, a powerful hidden-state formalism for dependency parsing. We present a learning algorithm that, like other spectral methods, is efficient and not susceptible to local minima. We show how this algorithm can be formulated as a technique for inducing hidden structure from distributions computed by forward-backward recursions. Furthermore, we also present an inside-outside algorithm for the parsing model that runs in cubic time, hence maintaining the standard parsing costs for context-free grammars.

[Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 409-419, Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics]

1 Introduction

Dependency structures of natural language sentences exhibit a significant amount of non-local phenomena. Historically, there have been two main approaches to modeling non-locality: (1) increasing the order of the factors of a dependency model (e.g. with sibling and grandparent relations (Eisner, 2000; McDonald and Pereira, 2006; Carreras, 2007; Martins et al., 2009; Koo and Collins, 2010)), and (2) using hidden states to pass information across factors (Matsuzaki et al., 2005; Petrov et al., 2006; Musillo and Merlo, 2008).

Higher-order models have the advantage that they are relatively easy to train, because estimating the parameters of the model can be expressed as a convex optimization. However, they have two main drawbacks. (1) The number of parameters grows significantly with the size of the factors, leading to potential data-sparsity problems. A solution to address the data-sparsity problem is to explicitly tell the model what properties of higher-order factors need to be remembered. This can be achieved by means of feature engineering, but compressing such information into a state of bounded size will typically be labor intensive, and will not generalize across languages. (2) Increasing the size of the factors generally results in polynomial increases in the parsing cost.

In principle, hidden variable models could solve some of the problems of feature engineering in higher-order factorizations, since they could automatically induce the information in a derivation history that should be passed across factors. Potentially, they would require less feature engineering since they can learn from an annotated corpus an optimal way to compress derivations into hidden states. For example, one line of work has added hidden annotations to the non-terminals of a phrase-structure grammar (Matsuzaki et al., 2005; Petrov et al., 2006; Musillo and Merlo, 2008), resulting in compact grammars that obtain parsing accuracies comparable to lexicalized grammars. A second line of work has modeled hidden sequential structure, like in our case, but using PDFA (Infante-Lopez and de Rijke, 2004). Finally, a third line of work has induced hidden structure from the history of actions of a parser (Titov and Henderson, 2007).

However, the main drawback of the hidden variable approach to parsing is that, to the best of our knowledge, there has not been any convex formulation of the learning problem. As a result, training a hidden-variable model is both expensive and prone to local minima issues.

In this paper we present a learning algorithm for hidden-state split head-automata grammars (SHAG) (Eisner and Satta, 1999). In this formalism, head-modifier sequences are generated by a collection of finite-state automata. In our case, the underlying machines are probabilistic non-deterministic finite-state automata (PNFA), which we parameterize using the operator model representation. This representation allows the use of simple spectral algorithms for estimating the model parameters from data (Hsu et al., 2009; Bailly, 2011; Balle et al., 2012). In all previous work, the algorithms used to induce hidden structure require running repeated inference on training data, e.g. Expectation-Maximization (Dempster et al., 1977) or split-merge algorithms. In contrast, spectral methods are simple and very efficient: parameter estimation is reduced to computing some data statistics, performing SVD, and inverting matrices.

The main contributions of this paper are:

• We present a spectral learning algorithm for inducing PNFA with applications to head-automata dependency grammars. Our formulation is based on thinking about the distribution generated by a PNFA in terms of the forward-backward recursions.

• Spectral learning algorithms in previous work only use statistics of prefixes of sequences. In contrast, our algorithm is able to learn from substring statistics.

• We derive an inside-outside algorithm for non-deterministic SHAG that runs in cubic time, keeping the costs of CFG parsing.

• In experiments we show that adding non-determinism improves the accuracy of several baselines. When we compare our algorithm to EM we observe a reduction of two orders of magnitude in training time.

The paper is organized as follows. The next section describes the necessary background on SHAG and operator models. Section 3 introduces operator SHAG for parsing and presents a spectral learning algorithm. Section 4 presents a parsing algorithm. Section 5 presents experiments and analysis of results, and Section 6 concludes.

2 Preliminaries

2.1 Head-Automata Dependency Grammars

In this work we use split head-automata grammars (SHAG) (Eisner and Satta, 1999; Eisner, 2000), a context-free grammatical formalism whose derivations are projective dependency trees. We will use x_{i:j} = x_i x_{i+1} ··· x_j to denote a sequence of symbols x_t with i ≤ t ≤ j. A SHAG generates sentences s_{0:N}, where symbols s_t ∈ X with 1 ≤ t ≤ N are regular words and s_0 = ⋆ ∉ X is a special root symbol. Let X̄ = X ∪ {⋆}. A derivation y, i.e. a dependency tree, is a collection of head-modifier sequences ⟨h, d, x_{1:T}⟩, where h ∈ X̄ is a word, d ∈ {LEFT, RIGHT} is a direction, and x_{1:T} is a sequence of T words, where each x_t ∈ X is a modifier of h in direction d. We say that h is the head of each x_t. Modifier sequences x_{1:T} are ordered head-outwards, i.e. among x_{1:T}, x_1 is the word closest to h in the derived sentence, and x_T is the furthest. A derivation y of a sentence s_{0:N} consists of a LEFT and a RIGHT head-modifier sequence for each s_t. As special cases, the LEFT sequence of the root symbol is always empty, while the RIGHT one consists of a single word corresponding to the head of the sentence. We denote by Y the set of all valid derivations.

Assume a derivation y contains ⟨h, LEFT, x_{1:T}⟩ and ⟨h, RIGHT, x′_{1:T′}⟩. Let L(y, h) be the derived sentence headed by h, which can be expressed as L(y, x_T) ··· L(y, x_1) h L(y, x′_1) ··· L(y, x′_{T′}).¹ The language generated by a SHAG is the set of strings L(y, ⋆) for any y ∈ Y.

In this paper we use probabilistic versions of SHAG, where the probabilities of the head-modifier sequences in a derivation are independent of each other:

  P(y) = ∏_{⟨h,d,x_{1:T}⟩ ∈ y} P(x_{1:T} | h, d) .   (1)

In the literature, standard arc-factored models further assume that

  P(x_{1:T} | h, d) = ∏_{t=1}^{T+1} P(x_t | h, d, σ_t) ,   (2)

where x_{T+1} is always a special STOP word, and σ_t is the state of a deterministic automaton generating x_{1:T+1}. For example, setting σ_1 = FIRST and σ_{t>1} = REST corresponds to first-order models, while setting σ_1 = NULL and σ_{t>1} = x_{t−1} corresponds to sibling models (Eisner, 2000; McDonald et al., 2005; McDonald and Pereira, 2006).

¹ Throughout the paper we assume we can distinguish the words in a derivation, irrespective of whether two words at different positions correspond to the same symbol.
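The two deterministic state functions just mentioned (first-order and sibling) are easy to make concrete. The sketch below is ours, not from any released implementation; `cond_prob` stands in for a conditional table that would in practice be estimated from a treebank:

```python
STOP, FIRST, REST = "<stop>", "FIRST", "REST"

def sigma_first_order(t, prev_mod):
    # First-order models: sigma_1 = FIRST, sigma_{t>1} = REST.
    return FIRST if t == 1 else REST

def sigma_sibling(t, prev_mod):
    # Sibling models: sigma_1 = NULL, sigma_{t>1} = previous modifier.
    return None if t == 1 else prev_mod

def sequence_prob(mods, head, direction, cond_prob, sigma):
    """P(x_{1:T} | h, d) under Eq. (2): a product over the modifiers plus a
    final STOP event, conditioning on the deterministic state sigma_t."""
    prob, prev = 1.0, None
    for t, x in enumerate(mods + [STOP], start=1):
        prob *= cond_prob(x, head, direction, sigma(t, prev))
        prev = x
    return prob
```

With a uniform toy `cond_prob` of 0.5, a sequence of one modifier scores 0.5 · 0.5 = 0.25 (one modifier plus the STOP event).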
2.2 Operator Models

An operator model A with n states is a tuple ⟨α_1, α_∞ᵀ, {A_a}_{a∈X}⟩, where A_a ∈ ℝ^{n×n} is an operator matrix and α_1, α_∞ ∈ ℝ^n are vectors. A computes a function f : X* → ℝ as follows:

  f(x_{1:T}) = α_∞ᵀ A_{x_T} ··· A_{x_1} α_1 .   (3)

One intuitive way of understanding operator models is to consider the case where f computes a probability distribution over strings. Such a distribution can be described in two equivalent ways: by making some independence assumptions and providing the corresponding parameters, or by explaining the process used to compute f. This is akin to describing the distribution defined by an HMM in terms of a factorization and its corresponding transition and emission parameters, or using the inductive equations of the forward algorithm. The operator model representation takes the latter approach.

Operator models have had numerous applications. For example, they can be used as an alternative parameterization of the function computed by an HMM (Hsu et al., 2009). Consider an HMM with n hidden states and initial-state probabilities π ∈ ℝ^n, transition probabilities T ∈ ℝ^{n×n}, and observation probabilities O_a ∈ ℝ^{n×n} for each a ∈ X, with the following meaning:

• π(i) is the probability of starting at state i,
• T(i, j) is the probability of transitioning from state j to state i,
• O_a is a diagonal matrix, such that O_a(i, i) is the probability of generating symbol a from state i.

Given an HMM, an equivalent operator model can be defined by setting α_1 = π, A_a = T O_a and α_∞ = 1 (the all-ones vector). To see this, let us show that the forward algorithm computes the expression in equation (3). Let σ_t denote the state of the HMM at time t. Consider a state-distribution vector α_t ∈ ℝ^n, where α_t(i) = P(x_{1:t−1}, σ_t = i). Initially α_1 = π. At each step in the chain of products (3), α_{t+1} = A_{x_t} α_t updates the state distribution from position t to position t + 1 by applying the appropriate operator, i.e. by emitting symbol x_t and transitioning to the new state distribution. The probability of x_{1:T} is given by Σ_i α_{T+1}(i). Hence, A_a(i, j) is the probability of generating symbol a and moving to state i given that we are at state j.

HMM are only one example of distributions that can be parameterized by operator models. In general, operator models can parameterize any PNFA, where the parameters of the model correspond to probabilities of emitting a symbol from a state and moving to the next state.

The advantage of working with operator models is that, under certain mild assumptions on the operator parameters, there exist algorithms that can estimate the operators from observable statistics of the input sequences. These algorithms are extremely efficient and are not susceptible to local minima issues. See (Hsu et al., 2009) for theoretical proofs of the learnability of HMM under the operator model representation.

In the following, we write x = x_{i:j} ∈ X* to denote sequences of symbols, and use A_{x_{i:j}} as a shorthand for A_{x_j} ··· A_{x_i}. Also, for convenience we assume X = {1, ..., l}, so that we can index vectors and matrices by symbols in X.

3 Learning Operator SHAG

We will define a SHAG using a collection of operator models to compute probabilities. Assume that for each possible head h in the vocabulary X̄ and each direction d ∈ {LEFT, RIGHT} we have an operator model that computes probabilities of modifier sequences as follows:

  P(x_{1:T} | h, d) = (α_∞^{h,d})ᵀ A_{x_T}^{h,d} ··· A_{x_1}^{h,d} α_1^{h,d} .

Then, this collection of operator models defines an operator SHAG that assigns a probability to each y ∈ Y according to (1). To learn the model parameters, namely ⟨α_1^{h,d}, α_∞^{h,d}, {A_a^{h,d}}_{a∈X}⟩ for h ∈ X̄ and d ∈ {LEFT, RIGHT}, we use spectral learning methods based on the works of Hsu et al. (2009), Bailly (2011) and Balle et al. (2012).

The main challenge of learning an operator model is to infer a hidden-state space from observable quantities, i.e. quantities that can be computed from the distribution of sequences that we observe. As it turns out, we cannot recover the actual hidden-state space used by the operators we wish to learn. The key insight of the spectral learning method is that we can recover a hidden-state space that corresponds to a projection of the original hidden space. Such a projected space is equivalent to the original one in the sense that we can find operators in the projected space that parameterize the same probability distribution over sequences.

In the rest of this section we describe an algorithm for learning an operator model. We will assume a fixed head word and direction, and drop h and d from all terms. Hence, our goal is to learn the following distribution, parameterized by operators α_1, {A_a}_{a∈X} and α_∞:

  P(x_{1:T}) = α_∞ᵀ A_{x_T} ··· A_{x_1} α_1 .   (4)

Our algorithm shares many features with the previous spectral algorithms of Hsu et al. (2009) and Bailly (2011), though the derivation given here is based upon the general formulation of Balle et al. (2012).
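The HMM-to-operator conversion can be checked numerically. The following sketch (all numbers are toy values assumed for illustration) implements Eq. (3) and the construction α_1 = π, A_a = T O_a, α_∞ = 1:

```python
import numpy as np

def operator_prob(ops, alpha1, alpha_inf, seq):
    # Eq. (3): apply one operator per symbol, scanning the string left to right.
    v = alpha1
    for a in seq:
        v = ops[a] @ v
    return float(alpha_inf @ v)

# A toy 2-state HMM over the alphabet {0, 1} (assumed numbers).
pi = np.array([0.6, 0.4])                      # initial-state probabilities
T = np.array([[0.7, 0.2],                      # T[i, j] = P(next = i | current = j)
              [0.3, 0.8]])
O = {0: np.diag([0.5, 0.1]),                   # O_a(i, i) = P(emit a | state i)
     1: np.diag([0.5, 0.9])}

ops = {a: T @ O[a] for a in O}                 # A_a = T O_a
alpha1, alpha_inf = pi, np.ones(2)             # alpha_inf = the all-ones vector
```

Because the chain of operators reproduces the forward recursion, the values of `operator_prob` over all strings of a fixed length sum to one, which is a quick sanity check on the conversion.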
The main difference is that our algorithm is able to learn operator models from substring statistics, while the algorithms in previous works were restricted to statistics on prefixes. In principle, our algorithm should extract much more information from a sample.

3.1 Preliminary Definitions

The spectral learning algorithm will use statistics estimated from samples of the target distribution. More specifically, consider the function that computes the expected number of occurrences of a substring x in a random string x′ drawn from P:

  f(x) = E(x ⊑ x′) = Σ_{x′∈X*} (x ⊑ x′) P(x′) = Σ_{p,s∈X*} P(pxs) ,   (5)

where x ⊑ x′ denotes the number of times x appears in x′. Here we assume that the true values of f(x) for bigrams are known, though in practice the algorithm will work with empirical estimates of these.

The information about f known by the algorithm is organized in matrix form as follows. Let P ∈ ℝ^{l×l} be a matrix containing the value of f(x) for all strings of length two, i.e. bigrams.² That is, each entry in P contains the expected number of occurrences of a given bigram:

  P(b, a) = E(ab ⊑ x) .   (6)

Furthermore, for each b ∈ X let P_b ∈ ℝ^{l×l} denote the matrix whose entries contain the expected number of occurrences of trigrams:

  P_b(c, a) = E(abc ⊑ x) .   (7)

Finally, we define vectors p_1 ∈ ℝ^l and p_∞ ∈ ℝ^l as follows: p_1(a) = Σ_{s∈X*} P(as), the probability that a string begins with a particular symbol, and p_∞(a) = Σ_{p∈X*} P(pa), the probability that a string ends with a particular symbol.

Now we show a particularly useful way to express the quantities defined above in terms of the operators ⟨α_1, α_∞ᵀ, {A_a}_{a∈X}⟩ of P. First, note that each entry of P can be written in this form:

  P(b, a) = Σ_{p,s∈X*} P(pabs)   (8)
          = Σ_{p,s∈X*} α_∞ᵀ A_s A_b A_a A_p α_1
          = (Σ_{s∈X*} α_∞ᵀ A_s) A_b A_a (Σ_{p∈X*} A_p α_1) .

It is not hard to see that, since P is a probability distribution over X*, actually Σ_{s∈X*} α_∞ᵀ A_s = 1ᵀ. Furthermore, since Σ_{p∈X*} A_p = Σ_{k≥0} (Σ_{a∈X} A_a)^k = (I − Σ_{a∈X} A_a)^{−1}, we write α̃_1 = (I − Σ_{a∈X} A_a)^{−1} α_1. From (8) it is natural to define a forward matrix F ∈ ℝ^{n×l} whose ath column contains the sum of all hidden-state vectors obtained after generating all prefixes ending in a:

  F(:, a) = A_a Σ_{p∈X*} A_p α_1 = A_a α̃_1 .   (9)

Conversely, we also define a backward matrix B ∈ ℝ^{l×n} whose ath row contains the probability of generating a from any possible state:

  B(a, :) = Σ_{s∈X*} α_∞ᵀ A_s A_a = 1ᵀ A_a .   (10)

By plugging the forward and backward matrices into (8) one obtains the factorization P = BF. With similar arguments it is easy to see that one also has P_b = B A_b F, p_1 = B α_1, and p_∞ᵀ = α_∞ᵀ F. Hence, if B and F were known, one could in principle invert these expressions in order to recover the operators of the model from empirical estimates computed from a sample. In the next section we show that in fact one does not need to know B and F to learn an operator model for P, but rather that having a "good" factorization of P is enough.

² In fact, while we restrict ourselves to strings of length two, an analogous algorithm can be derived that considers longer strings to define P. See (Balle et al., 2012) for details.

3.2 Inducing a Hidden-State Space

We have shown that an operator model A computing P induces a factorization of the matrix P, namely P = BF. More generally, it turns out that when the rank of P equals the minimal number of states of an operator model that computes P, then one can prove a duality relation between operators and factorizations of P. In particular, one can show that, for any rank factorization P = QR, the operators given by ᾱ_1 = Q⁺ p_1, ᾱ_∞ᵀ = p_∞ᵀ R⁺, and Ā_a = Q⁺ P_a R⁺ yield an operator model for P. A key fact in proving this result is that the function P is invariant to the basis chosen to represent operator matrices. See (Balle et al., 2012) for further details.

Thus, we can recover an operator model for P from any rank factorization of P, provided a rank assumption on P holds (which hereafter we assume to be the case). Since we only have access to an approximation of P, it seems reasonable to choose a factorization which is robust to estimation errors. A natural such choice is the thin SVD decomposition of P (i.e. using the top n singular vectors), given by P = U(ΣVᵀ) = U(UᵀP). Intuitively, we can think of U and UᵀP as projected backward and forward matrices. Now that we have a factorization of P we can construct an operator model for P as follows:³

  ᾱ_1 = Uᵀ p_1 ,   (11)
  ᾱ_∞ᵀ = p_∞ᵀ (Uᵀ P)⁺ ,   (12)
  Ā_a = Uᵀ P_a (Uᵀ P)⁺ .   (13)

Algorithm 1 presents pseudo-code for an algorithm learning the operators of a SHAG from training head-modifier sequences using this spectral method. Note that each operator model in the SHAG is learned separately. The running time of the algorithm is dominated by two computations. First, a pass over the training sequences to compute statistics over unigrams, bigrams and trigrams. Second, SVD and matrix operations for computing the operators, which run in time cubic in the number of symbols l. However, note that when dealing with sparse matrices many of these operations can be performed more efficiently.

Algorithm 1 Learn Operator SHAG
inputs:
  • An alphabet X
  • A training set TRAIN = {⟨h_i, d_i, x_{1:T}^i⟩}_{i=1}^M
  • The number of hidden states n
1: for each h ∈ X̄ and d ∈ {LEFT, RIGHT} do
2:   Compute an empirical estimate from TRAIN of the statistics p̂_1, p̂_∞, P̂, and {P̂_a}_{a∈X}
3:   Compute the SVD of P̂ and let Û be the matrix of top n left singular vectors of P̂
4:   Compute the observable operators for h and d:
5:     α̂_1^{h,d} = Ûᵀ p̂_1
6:     (α̂_∞^{h,d})ᵀ = p̂_∞ᵀ (Ûᵀ P̂)⁺
7:     Â_a^{h,d} = Ûᵀ P̂_a (Ûᵀ P̂)⁺ for each a ∈ X
8: end for
9: return Operators ⟨α̂_1^{h,d}, α̂_∞^{h,d}, {Â_a^{h,d}}⟩ for each h ∈ X̄, d ∈ {LEFT, RIGHT}, a ∈ X
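Steps 5-7 of Algorithm 1, i.e. equations (11)-(13), amount to a few lines of linear algebra. The sketch below is ours, not the paper's code: it builds exact statistics P, P_b, p_1, p_∞ for a small synthetic operator model (assumed parameters) using the closed forms of Section 3.1, and then recovers an equivalent model spectrally:

```python
import numpy as np

# A toy 2-state operator model over {0, 1}: from each state the process stops
# with probability alpha_inf[j], otherwise emits a symbol and transitions.
alpha1 = np.array([0.5, 0.5])
alpha_inf = np.array([0.4, 0.4])
C = 0.6 * np.array([[0.7, 0.2], [0.3, 0.8]])      # continue-and-transition part
Obs = {0: np.diag([0.8, 0.3]), 1: np.diag([0.2, 0.7])}
A = {a: C @ Obs[a] for a in Obs}                  # operators A_a

def string_prob(a1, ainf, ops, seq):
    v = a1
    for a in seq:
        v = ops[a] @ v
    return float(ainf @ v)

# Exact substring statistics via the closed forms of Section 3.1:
# sum_p A_p = (I - sum_a A_a)^{-1}; sum_s alpha_inf^T A_s = the ones vector.
tilde1 = np.linalg.inv(np.eye(2) - sum(A.values())) @ alpha1
ones = np.ones(2)
F = np.column_stack([A[a] @ tilde1 for a in (0, 1)])   # Eq. (9)
B = np.vstack([ones @ A[a] for a in (0, 1)])           # Eq. (10)
P = B @ F                                              # P = B F
Pa = {b: B @ A[b] @ F for b in (0, 1)}                 # P_b = B A_b F
p1 = B @ alpha1                                        # p_1 = B alpha_1
pinf = alpha_inf @ F                                   # p_inf^T = alpha_inf^T F

# Spectral recovery, equations (11)-(13); here n = l = 2, so U is the full
# matrix of left singular vectors.
U, _, _ = np.linalg.svd(P)
UP_pinv = np.linalg.pinv(U.T @ P)
a1_bar = U.T @ p1
ainf_bar = pinf @ UP_pinv
A_bar = {a: U.T @ Pa[a] @ UP_pinv for a in (0, 1)}
```

With exact statistics the recovery is exact up to floating point; with empirical counts from a sample, the same code applies after truncating the SVD to the top n singular vectors.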
³ To see that equations (11)-(13) define a model for P, one must first see that the matrix M = F(ΣVᵀ)⁺ is invertible with inverse M⁻¹ = UᵀB. Using this, and recalling that p_1 = B α_1, P_a = B A_a F and p_∞ᵀ = α_∞ᵀ F, one obtains:

  ᾱ_1 = Uᵀ B α_1 = M⁻¹ α_1 ,
  ᾱ_∞ᵀ = α_∞ᵀ F (Uᵀ B F)⁺ = α_∞ᵀ M ,
  Ā_a = Uᵀ B A_a F (Uᵀ B F)⁺ = M⁻¹ A_a M .

Finally:

  P(x_{1:T}) = α_∞ᵀ A_{x_T} ··· A_{x_1} α_1
             = α_∞ᵀ M M⁻¹ A_{x_T} M ··· M⁻¹ A_{x_1} M M⁻¹ α_1
             = ᾱ_∞ᵀ Ā_{x_T} ··· Ā_{x_1} ᾱ_1 .

4 Parsing Algorithms

Given a sentence s_{0:N} we would like to find its most likely derivation, ŷ = argmax_{y∈Y(s_{0:N})} P(y). This problem, known as MAP inference, is intractable for hidden-state structure prediction models, as it involves finding the most likely tree structure while summing out over hidden states. We use a common approximation to MAP based on first computing posterior marginals of tree edges (i.e. dependencies) and then maximizing over the tree structure (see (Park and Darwiche, 2004) for the complexity of general MAP inference and approximations). For parsing, this strategy is sometimes known as MBR decoding; previous work has shown that empirically it gives good performance (Goodman, 1996; Clark and Curran, 2004; Titov and Henderson, 2006; Petrov and Klein, 2007). In our case, we use the non-deterministic SHAG to compute posterior marginals of dependencies. We first explain the general strategy of MBR decoding, and then present an algorithm to compute marginals.

Let (s_i, s_j) denote a dependency between head word i and modifier word j. The posterior marginal probability of a dependency (s_i, s_j) given a sentence s_{0:N} is defined as

  µ_{i,j} = P((s_i, s_j) | s_{0:N}) = Σ_{y∈Y(s_{0:N}) : (s_i,s_j)∈y} P(y) .

To compute marginals, the sum over derivations can be decomposed into a product of inside and outside quantities (Baker, 1979). Below we describe an inside-outside algorithm for our grammars.
Given a sentence s_{0:N} and marginal scores µ_{i,j}, we compute the parse tree for s_{0:N} as

  ŷ = argmax_{y∈Y(s_{0:N})} Σ_{(s_i,s_j)∈y} log µ_{i,j}   (14)

using the standard projective parsing algorithm for arc-factored models (Eisner, 2000). Overall we use a two-pass parsing process, first to compute marginals and then to compute the best tree.

4.1 An Inside-Outside Algorithm

In this section we sketch an algorithm to compute the marginal probabilities of dependencies. Our algorithm is an adaptation of the parsing algorithm for SHAG by Eisner and Satta (1999) to the case of non-deterministic head automata, and has a runtime cost of O(n²N³), where n is the number of states of the model and N is the length of the input sentence. Hence the algorithm maintains the standard cubic cost on the sentence length, while the quadratic cost on n is inherent to the computations defined by our model in Eq. (3). The main insight behind our extension is that, because the computations of our model involve state-distribution vectors, we need to extend the standard inside/outside quantities to be in the form of such state-distribution quantities.⁴

Throughout this section we assume a fixed sentence s_{0:N}. Let Y(x_{i:j}) be the set of derivations that yield a subsequence x_{i:j}. For a derivation y, we use root(y) to indicate its root word, and use (x_i, x_j) ∈ y to refer to a dependency in y from head x_i to modifier x_j. Following Eisner and Satta (1999), we use decoding structures related to complete half-constituents (or "triangles", denoted C) and incomplete half-constituents (or "trapezoids", denoted I), each decorated with a direction (denoted L and R). We assume familiarity with their algorithm.

We define θ_{i,j}^{I,R} ∈ ℝ^n as the inside score-vector of a right trapezoid dominated by the dependency (s_i, s_j),

  θ_{i,j}^{I,R} = Σ_{y∈Y(s_{i:j}) : (s_i,s_j)∈y, y={⟨s_i,R,x_{1:t}⟩}∪y′, x_t=s_j} P(y′) α^{s_i,R}(x_{1:t}) .   (15)

The term P(y′) is the probability of the head-modifier sequences in the range s_{i:j} that do not involve s_i. The term α^{s_i,R}(x_{1:t}) is a forward state-distribution vector: its qth coordinate is the probability that s_i generates right modifiers x_{1:t} and remains at state q. Similarly, we define φ_{i,j}^{I,R} ∈ ℝ^n as the outside score-vector of a right trapezoid, as

  φ_{i,j}^{I,R} = Σ_{y∈Y(s_{0:i} s_{j:N}) : root(y)=s_0, y={⟨s_i,R,x_{t:T}⟩}∪y′, x_t=s_j} P(y′) β^{s_i,R}(x_{t+1:T}) ,   (16)

where β^{s_i,R}(x_{t+1:T}) ∈ ℝ^n is a backward state-distribution vector: its qth coordinate is the probability of being at state q of the right automaton of s_i and generating x_{t+1:T}. Analogous inside-outside expressions can be defined for the rest of the structures (left/right triangles and trapezoids). With these quantities, we can compute marginals as

  µ_{i,j} = (φ_{i,j}^{I,R})ᵀ θ_{i,j}^{I,R} Z⁻¹ if i < j ,
  µ_{i,j} = (φ_{i,j}^{I,L})ᵀ θ_{i,j}^{I,L} Z⁻¹ if j < i ,   (17)

where Z = Σ_{y∈Y(s_{0:N})} P(y) = (α_∞^{⋆,R})ᵀ θ_{0,N}^{C,R}.

Finally, we sketch the equations for computing inside scores in O(N³) time. The outside equations can be derived analogously (see (Paskin, 2001)). For 0 ≤ i < j ≤ N:

  θ_{i,i}^{C,R} = α_1^{s_i,R}   (18)
  θ_{i,j}^{C,R} = Σ_{k=i+1}^{j} θ_{i,k}^{I,R} (α_∞^{s_k,R})ᵀ θ_{k,j}^{C,R}   (19)
  θ_{i,j}^{I,R} = Σ_{k=i}^{j−1} A_{s_j}^{s_i,R} θ_{i,k}^{C,R} (α_∞^{s_j,L})ᵀ θ_{k+1,j}^{C,L}   (20)

⁴ Technically, when working with the projected operators the state-distribution vectors will not be distributions in the formal sense. However, they correspond to a projection of a state distribution, for some projection that we do not recover from data (namely M⁻¹ in footnote 3). This projection has no effect on the computations because it cancels out.

5 Experiments

The goal of our experiments is to show that incorporating hidden states in a SHAG using operator models can consistently improve parsing accuracy. A second goal is to compare the spectral learning algorithm to EM, a standard learning method that also induces hidden states.

The first set of experiments involves fully unlexicalized models, i.e. parsing part-of-speech tag sequences. While this setting falls behind the state-of-the-art, it is nonetheless valid for analyzing empirically the effect of incorporating hidden states via operator models, which results in large improvements. In a second set of experiments, we combine the unlexicalized hidden-state models with simple lexicalized models. Finally, we present some analysis of the automaton learned by the spectral algorithm to see the information that is captured in the hidden-state space.

5.1 Fully Unlexicalized Grammars

We trained fully unlexicalized dependency grammars from dependency treebanks; that is, X are PoS tags and we parse PoS tag sequences. In all cases, our modifier sequences include special START and STOP symbols at the boundaries.⁵ ⁶ We compare the following SHAG models:

• DET: a baseline deterministic grammar with a single state.

• DET+F: a deterministic grammar with two states, one emitting the first modifier of a sequence and another emitting the rest (see (Eisner and Smith, 2010) for a similar deterministic baseline).

• SPECTRAL: a non-deterministic grammar with n hidden states trained with the spectral algorithm; n is a parameter of the model.

• EM: a non-deterministic grammar with n states trained with EM. Here, we estimate the operators ⟨α̂_1, α̂_∞, Â_a^{h,d}⟩ using forward-backward for the E step. To initialize, we mimicked an HMM initialization: (1) we set α̂_1 and α̂_∞ randomly; (2) we created a random transition matrix T ∈ ℝ^{n×n}; (3) we created a diagonal matrix O_a^{h,d} ∈ ℝ^{n×n}, where O_a^{h,d}(i, i) is the probability of generating symbol a from h and d (estimated from training); (4) we set Â_a^{h,d} = T O_a^{h,d}.

We trained SHAG models using the standard WSJ sections of the English Penn Treebank (Marcus et al., 1994). Figure 1 shows the Unlabeled Attachment Score (UAS) curve on the development set, in terms of the number of hidden states, for the spectral and EM models.

[Figure 1: Accuracy curve on the English development set for fully unlexicalized models.]

We can see that DET+F largely outperforms DET,⁷ while the hidden-state models obtain much larger improvements. For the EM model, we show the accuracy curve after 5, 10, 25 and 100 iterations.⁸

In terms of peak accuracies, EM gives a slightly better result than the spectral method (80.51% for EM with 15 states versus 79.75% for the spectral method with 9 states). However, the spectral algorithm is much faster to train. With our Matlab implementation, it took about 30 seconds, while each iteration of EM took from 2 to 3 minutes, depending on the number of states. To give a concrete example, to reach an accuracy close to 80%, there is a factor of 150 between the training times of the spectral method and EM (where we compare the peak performance of the spectral method versus EM at 25 iterations with 13 states).

⁵ Even though the operators α_1 and α_∞ of a PNFA account for start and stop probabilities, in preliminary experiments we found that having explicit START and STOP symbols results in more accurate models.

⁶ Note that, for parsing, the operators for the START and STOP symbols can be packed into α_1 and α_∞ respectively: one just defines α′_1 = A_START α_1 and (α′_∞)ᵀ = α_∞ᵀ A_STOP.

⁷ For parsing with deterministic SHAG we employ MBR inference, even though Viterbi inference can be performed exactly. In experiments on development data, DET improved from 62.65% using Viterbi to 68.52% using MBR, and DET+F improved from 72.72% to 74.80%.

⁸ We ran EM 10 times under different initial conditions and selected the run that gave the best absolute accuracy after 100 iterations. We did not observe significant differences between the runs.
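The HMM-style initialization of the EM baseline, steps (1)-(4) above, can be sketched as follows (function name and shapes are our assumptions):

```python
import numpy as np

def em_init(n, emission_probs, rng):
    """Random HMM-style initialization of an n-state operator model.
    `emission_probs[a]` is the probability of generating symbol a for a fixed
    head and direction, estimated from training counts."""
    alpha1 = rng.random(n)                     # (1) random start vector,
    alpha1 /= alpha1.sum()                     #     normalized to a distribution
    alpha_inf = rng.random(n)                  # (1) random stop vector
    T = rng.random((n, n))                     # (2) random transition matrix,
    T /= T.sum(axis=0)                         #     made column-stochastic
    ops = {a: T @ (p * np.eye(n))              # (3) O_a = p * I  (4) A_a = T O_a
           for a, p in emission_probs.items()}
    return alpha1, alpha_inf, ops
```

When the emission probabilities sum to one, the operators satisfy Σ_a A_a = T, so the initialization starts from a well-formed stochastic process.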
gorithm is much faster to train. With our Matlab • S PECTRAL: a non-deterministic grammar implementation, it took about 30 seconds, while with n hidden states trained with the spectral each iteration of EM took from 2 to 3 minutes, algorithm. n is a parameter of the model. depending on the number of states. To give a con- • EM: a non-deterministic grammar with n crete example, to reach an accuracy close to 80%, states trained with EM. Here, we estimate there is a factor of 150 between the training times operators hb α1 , α bh,d b∞ , A a i using forward- of the spectral method and EM (where we com- backward for the E step. To initialize, we pare the peak performance of the spectral method mimicked an HMM initialization: (1) we set versus EM at 25 iterations with 13 states). α b1 and α b∞ randomly; (2) we created a ran- 7 dom transition matrix T ∈ Rn×n ; (3) we For parsing with deterministic SHAG we employ MBR inference, even though Viterbi inference can be performed 5 Even though the operators α1 and α∞ of a PNFA ac- exactly. In experiments on development data D ET improved count for start and stop probabilities, in preliminary experi- from 62.65% using Viterbi to 68.52% using MBR, and ments we found that having explicit START and STOP sym- D ET +F improved from 72.72% to 74.80%. 8 bols results in more accurate models. We ran EM 10 times under different initial conditions 6 Note that, for parsing, the operators for the START and and selected the run that gave the best absolute accuracy after STOP symbols can be packed into α1 and α∞ respectively. 100 iterations. We did not observe significant differences One just defines α10 = ASTART α1 and α∞ 0> = α∞> ASTOP . between the runs. 415 D ET D ET +F S PECTRAL EM 86 WSJ 69.45% 75.91% 80.44% 81.68% 84 unlabeled attachment score 82 Table 1: Unlabeled Attachment Score of fully unlexi- calized models on the WSJ test set. 
80 78 Table 1 shows results on WSJ test data, se- 76 Lex lecting the models that obtain peak performances Lex+F Lex+FCP 74 in development. We observe the same behavior: Lex + Spectral Lex+F + Spectral 72 Lex+FCP + Spectral hidden-states largely improve over deterministic baselines, and EM obtains a slight improvement 2 3 4 5 6 7 8 9 10 number of states over the spectral algorithm. Comparing to previ- ous work on parsing WSJ PoS sequences, Eisner Figure 2: Accuracy curve on English development set and Smith (2010) obtained an accuracy of 75.6% for lexicalized models. using a deterministic SHAG that uses informa- tion about dependency lengths. However, they coarse level conditions on {th , d, σ}. For PB we used Viterbi inference, which we found to per- use three levels, which from fine to coarse are form worse than MBR inference (see footnote 7). {ta , wh , d, σ}, {ta , th , d, σ} and {ta }. We follow 5.2 Experiments with Lexicalized Collins (1999) to estimate PA and PB from a tree- Grammars bank using a back-off strategy. We now turn to combining lexicalized determinis- We use a simple approach to combine lexical tic grammars with the unlexicalized grammars ob- models with the unlexical hidden-state models we tained in the previous experiment using the spec- obtained in the previous experiment. Namely, we tral algorithm. The goal behind this experiment use a log-linear model that computes scores for is to show that the information captured in hidden head-modifier sequences as states is complimentary to head-modifier lexical preferences. s(hh, d, x1:T i) = log Psp (x1:T |h, d) (21) In this case X consists of lexical items, and we + log Pdet (x1:T |h, d) , assume access to the PoS tag of each lexical item. We will denote as ta and wa the PoS tag and word where Psp and Pdet are respectively spectral and of a symbol a ∈ X¯ . We will estimate condi- deterministic probabilistic models. 
We tested tional distributions P(a | h, d, σ), where a ∈ X combinations of each deterministic model with is a modifier, h ∈ X¯ is a head, d is a direction, the spectral unlexicalized model using different and σ is a deterministic state. Following Collins number of states. Figure 2 shows the accuracies of (1999), we use three configurations of determin- single deterministic models, together with combi- istic states: nations using different number of states. In all cases, the combinations largely improve over the • L EX: a single state. purely deterministic lexical counterparts, suggest- • L EX +F: two distinct states for first modifier ing that the information encoded in hidden states and rest of modifiers. is complementary to lexical preferences. • L EX +FCP: four distinct states, encoding: first modifier, previous modifier was a coor- 5.3 Results Analysis dination, previous modifier was punctuation, We conclude the experiments by analyzing the and previous modifier was some other word. state space learned by the spectral algorithm. To estimate P we use a back-off strategy: Consider the space Rn where the forward-state vectors lie. Generating a modifier sequence corre- P(a|h, d, σ) = PA (ta |h, d, σ)PB (wa |ta , h, d, δ) sponds to a path through the n-dimensional state space. We clustered sets of forward-state vectors To estimate PA we use two back-off levels, in order to create a DFA that we can use to visu- the fine level conditions on {wh , d, σ} and the alize the phenomena captured by the state space. 416 nns ments in accuracy with respect to the baselines. STOP , I A DFA for the automaton (NN, LEFT) is shown cc in Figure 3. The vectors were originally divided prp$ vbg jjs rb vbn pos $ nn in ten clusters, but the DFA construction required jjr nnp cd jj in dt cd two state mergings, leading to a eight state au- 9 2 tomaton. The state named I is the initial state. 
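The clustering-based DFA construction described next can be sketched in a simplified form. Here prefixes are assigned to fixed centroids by cosine similarity, a stand-in for the group-average agglomerative clustering actually used, and conflicting transitions (which would trigger state merging) are simply overwritten:

```python
import math

def cosine(u, v):
    # Cosine similarity: compares angles, ignoring vector magnitude.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build_dfa(prefix_vectors, centroids):
    """Assign each prefix's forward vector to its closest centroid (the DFA
    state), then read transitions off consecutive prefixes:
    state(x_{1:t-1}) --x_t--> state(x_{1:t}).
    `prefix_vectors` maps modifier tuples (including the empty tuple) to
    forward vectors."""
    state = {p: max(range(len(centroids)),
                    key=lambda i: cosine(v, centroids[i]))
             for p, v in prefix_vectors.items()}
    transitions = {}
    for p, s in state.items():
        if p:
            transitions[(state[p[:-1]], p[-1])] = s
    return state, transitions
```

On a toy example with two centroids, a prefix whose forward vector stays near the first centroid keeps its state, while a vector that swings toward the second centroid induces a transition to the other state.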
prp$ nn pos nn $ nnp jj dt nnp Clearly, we can see that there are special states cc for punctuation (state 9) and coordination (states , , STOP 1 and 5). States 0 and 2 are harder to interpret. nn 1 0 cc To understand them better, we computed an esti- 5 mation of the probabilities of the transitions, by cd nns cc prp$ rb pos counting the number of times each of them is jj dt nnp used. We found that our estimation of generating STOP 3 STOP from state 0 is 0.67, and from state 2 it is STOP 0.15. Interestingly, state 2 can transition to state 0 generating prp$, POS or DT, that are usual end- ings of modifier sequences for nouns (recall that 7 modifiers are generated head-outwards, so for a left automaton the final modifier is the left-most Figure 3: DFA approximation for the generation of NN modifier in the sentence). left modifier sequences. 6 Conclusion To build a DFA, we computed the forward vec- Our main contribution is a basic tool for inducing tors corresponding to frequent prefixes of modi- sequential hidden structure in dependency gram- fier sequences of the development set. Then, we mars. Most of the recent work in dependency clustered these vectors using a Group Average parsing has explored explicit feature engineering. Agglomerative algorithm using the cosine simi- In part, this may be attributed to the high cost of larity measure (Manning et al., 2008). This simi- using tools such as EM to induce representations. larity measure is appropriate because it compares Our experiments have shown that adding hidden- the angle between vectors, and is not affected by structure improves parsing accuracy, and that our their magnitude (the magnitude of forward vec- spectral algorithm is highly scalable. tors decreases with the number of modifiers gen- Our methods may be used to enrich the rep- erated). Each cluster i defines a state in the DFA, resentational power of more sophisticated depen- and we say that a sequence x1:t is in state i if its dency models. 
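The DFA-construction procedure described above (assign each prefix to a cluster, read transitions off the traversals, merge states whenever the same state-and-label pair points to two different targets, repeat until deterministic) can be sketched as follows. This is only an illustration: the clustering itself is abstracted into a `state_of` function, which in the paper is a group-average agglomerative clustering of forward vectors under cosine similarity.

```python
def build_dfa(sequences, state_of):
    """Build a deterministic transition table over cluster ids.
    `state_of(prefix)` returns the cluster (state) of a sequence prefix.
    Conflicting transitions force target-state merges (tracked with a
    union-find), repeated until the automaton is deterministic."""
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    changed = True
    while changed:
        changed = False
        trans = {}
        for seq in sequences:
            for t in range(1, len(seq) + 1):
                src = find(state_of(tuple(seq[:t - 1])))
                dst = find(state_of(tuple(seq[:t])))
                key = (src, seq[t - 1])
                if key in trans and trans[key] != dst:
                    union(trans[key], dst)  # conflict: merge the two targets
                    changed = True
                else:
                    trans[key] = dst
    return trans

# Toy example: two prefixes share a cluster, but their extensions by "b"
# land in different clusters, forcing a merge of states 2 and 3.
clusters = {(): 0, ("x",): 1, ("y",): 1, ("x", "b"): 2, ("y", "b"): 3}
dfa = build_dfa([["x", "b"], ["y", "b"]], lambda p: clusters[p])
```

After a merge invalidates earlier transitions, the outer loop rebuilds the table from scratch, mirroring the paper's remark that "after doing a merge, new merges may be required."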
For example, future work should corresponding forward vector at time t is in clus- consider enhancing lexicalized dependency gram- ter i. Then, transitions in the DFA are defined us- mars with hidden states that summarize lexical ing a procedure that looks at how sequences tra- dependencies. Another line for future research verse the states. If a sequence x1:t is at state i at should extend the learning algorithm to be able time t − 1, and goes to state j at time t, then we to capture vertical hidden relations in the depen- define a transition from state i to state j with la- dency tree, in addition to sequential relations. bel xt . This procedure may require merging states Acknowledgements We are grateful to Gabriele to give a consistent DFA, because different se- Musillo and the anonymous reviewers for providing us quences may define different transitions for the with helpful comments. This work was supported by same states and modifiers. After doing a merge, a Google Research Award and by the European Com- new merges may be required, so the procedure mission (PASCAL2 NoE FP7-216886, XLike STREP must be repeated until a DFA is obtained. FP7-288342). Borja Balle was supported by an FPU fellowship (AP2008-02064) of the Spanish Ministry For this analysis, we took the spectral model of Education. The Spanish Ministry of Science and with 9 states, and built DFA from the non- Innovation supported Ariadna Quattoni (JCI-2009- deterministic automata corresponding to heads 04240) and Xavier Carreras (RYC-2008-02223 and and directions where we saw largest improve- “KNOW2” TIN2009-14715-C04-04). 417 References Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2009. A spectral algorithm for learning hidden markov Raphael Bailly. 2011. Quadratic weighted automata: models. In COLT 2009 - The 22nd Conference on Spectral algorithm and likelihood maximization. Learning Theory. JMLR Workshop and Conference Proceedings – Gabriel Infante-Lopez and Maarten de Rijke. 2004. ACML. 
Alternative approaches for generating bodies of grammar rules. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 454–461, Barcelona, Spain, July.

James K. Baker. 1979. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pages 547–550.

Borja Balle, Ariadna Quattoni, and Xavier Carreras. 2012. Local loss optimization in operator models: A new insight into spectral learning. Technical Report LSI-12-5-R, Departament de Llenguatges i Sistemes Informàtics (LSI), Universitat Politècnica de Catalunya (UPC).

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957–961, Prague, Czech Republic, June. Association for Computational Linguistics.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 103–110, Barcelona, Spain, July.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head-automaton grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 457–464, University of Maryland, June.

Jason Eisner and Noah A. Smith. 2010. Favor short dependencies: Parsing with soft and hard constraints on dependency length. In Harry Bunt, Paola Merlo, and Joakim Nivre, editors, Trends in Parsing Technology: Dependency Parsing, Domain Adaptation, and Deep Parsing, chapter 8, pages 121–150. Springer.

Jason Eisner. 2000. Bilexical grammars and their cubic-time parsing algorithms. In Harry Bunt and Anton Nijholt, editors, Advances in Probabilistic and Other Parsing Technologies, pages 29–62. Kluwer Academic Publishers, October.

Joshua Goodman. 1996. Parsing algorithms and metrics. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 177–183, Santa Cruz, California, USA, June. Association for Computational Linguistics.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, first edition, July.

Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19.

Andre Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec, Singapore, August. Association for Computational Linguistics.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 75–82, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 81–88.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523–530, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Gabriele Antonio Musillo and Paola Merlo. 2008. Unlexicalised hidden variable models of split dependency grammars. In Proceedings of ACL-08: HLT, Short Papers, pages 213–216, Columbus, Ohio, June. Association for Computational Linguistics.

James D. Park and Adnan Darwiche. 2004. Complexity results and approximation strategies for MAP explanations. Journal of Artificial Intelligence Research, 21:101–133.

Mark Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD-01-1148, University of California, Berkeley.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York, April. Association for Computational Linguistics.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440, Sydney, Australia, July. Association for Computational Linguistics.

Ivan Titov and James Henderson. 2006. Loss minimization in parse reranking. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 560–567, Sydney, Australia, July. Association for Computational Linguistics.
Ivan Titov and James Henderson. 2007. A latent vari- able model for generative dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic, June. Association for Computa- tional Linguistics. 419 Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction Md. Faisal Mahbub Chowdhury † ‡ and Alberto Lavelli ‡ ‡ Fondazione Bruno Kessler (FBK-irst), Italy † University of Trento, Italy {chowdhury,lavelli}@fbk.eu Abstract and tree kernels have been designed and experi- mented. Kernel based methods dominate the current trend for various relation extraction tasks Early RE approaches more or less fall in one of including protein-protein interaction (PPI) the following categories: (i) exploitation of statis- extraction. PPI information is critical in un- tics about co-occurrences of entities, (ii) usage of derstanding biological processes. Despite patterns and rules, and (iii) usage of flat features considerable efforts, previously reported to train machine learning (ML) classifiers. These PPI extraction results show that none of the approaches have been studied for a long period approaches already known in the literature is consistently better than other approaches and have their own pros and cons. Exploitation when evaluated on different benchmark PPI of co-occurrence statistics results in high recall corpora. In this paper, we propose a but low precision, while rule or pattern based ap- novel hybrid kernel that combines (auto- proaches can increase precision but suffer from matically collected) dependency patterns, low recall. Flat feature based ML approaches em- trigger words, negative cues, walk fea- ploy various kinds of linguistic, syntactic or con- tures and regular expression patterns along textual information and integrate them into the with tree kernel and shallow linguistic ker- feature space. They obtain relatively good results nel. 
The proposed kernel outperforms the exiting state-of-the-art approaches on the but are hindered by drawbacks of limited feature BioInfer corpus, the largest PPI benchmark space and excessive feature engineering. Kernel corpus available. On the other four smaller based approaches have become an attractive alter- benchmark corpora, it performs either bet- native solution, as they can exploit huge amount ter or almost as good as the existing ap- of features without an explicit representation. proaches. Moreover, empirical results show In this paper, we propose a new hybrid kernel that the proposed hybrid kernel attains con- siderably higher precision than the existing for RE. We apply the kernel to Protein–protein approaches, which indicates its capability interaction (PPI) extraction, the most widely re- of learning more accurate models. This also searched topic in biomedical relation extraction. demonstrates that the different types of in- PPI1 information is very critical in understanding formation that we use are able to comple- biological processes. Considerable progress has ment each other for relation extraction. been made for this task. Nevertheless, empirical results of previous studies show that none of the 1 Introduction approaches already known in the literature is con- Kernel methods are considered the most effective sistently better than other approaches when evalu- techniques for various relation extraction (RE) ated on different benchmark PPI corpora (see Ta- tasks on both general (e.g. newspaper text) and ble 4). This demands further study and innovation specialized (e.g. biomedical text) domains. In 1 PPIs occur when two or more proteins bind together, particular, as the importance of syntactic struc- and are integral to virtually all cellular processes, such as tures for deriving the relationships between en- metabolism, signalling, regulation, and proliferation (Tikk tities in text has been growing, several graph et al., 2010). 
420 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 420–429, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics of new approaches that are sensitive to the varia- The remainder of the paper is organized as fol- tions of complex linguistic constructions. lows. In Section 2, we briefly review previous The proposed hybrid kernel is the composition work. Section 3 lists the datasets. Then, in Sec- of one tree kernel and two feature based kernels tion 4, we define our proposed hybrid kernel and (one of them is already known in the literature describe its individual component kernels. Sec- and the other is proposed in this paper for the first tion 5 outlines the experimental settings. Follow- time). The novelty of the newly proposed feature ing that, empirical results are discussed in Section based kernel is that it envisages to accommodate 6. Finally, we conclude with a summary of our the advantages of pattern based approaches. More study as well as suggestions for further improve- precisely: ment of our approach. 1. We propose a new feature based kernel (de- 2 Related Work tails in Section 4.1) by using syntactic de- In this section, we briefly discuss some of the pendency patterns, trigger words, negative recent work on PPI extraction. Several RE ap- cues, regular expression (henceforth, regex) proaches have been reported to date for the PPI patterns and walk features (i.e. e-walks and task, most of which are kernel based methods. v-walks)2 . Tikk et al. (2010) reported a benchmark evalu- ation of various kernels on PPI extraction. An 2. 
The syntactic dependency patterns are au- interesting finding is that the Shallow Linguis- tomatically collected from a type of depen- tic (SL) kernel (Giuliano et al., 2006) (to be dis- dency subgraph (we call it reduced graph, cussed in Section 4.2), despite its simplicity, is on more details in Section 4.1.1) during run- par with the best kernels in most of the evaluation time. settings. 3. We only use the regex patterns, trigger words Kim et al. (2010) proposed walk-weighted sub- and negative cues mentioned in the literature sequence kernel using e-walks, partial matches, (Ono et al., 2001; Fundel et al., 2007; Bui et non-contiguous paths, and different weights for al., 2010). The objective is to verify whether different sub-structures (which are used to capture we can exploit knowledge which is already structural similarities during kernel computation). known and used. Miwa et al. (2009a) proposed a hybrid kernel, which combines the all-paths graph (APG) kernel 4. We propose a hybrid kernel by combin- (Airola et al., 2008), the bag-of-words kernel, and ing the proposed feature based kernel (out- the subset tree kernel (Moschitti, 2006) (applied lined above) with the Shallow Linguistic on the shortest dependency paths between target (SL) kernel (Giuliano et al., 2006) and the protein pairs). They used multiple parser inputs. Path-enclosed Tree (PET) kernel (Moschitti, The system is regarded as the current state-of-the- 2004). art PPI extraction system because of its high re- sults on different PPI corpora (see the results in The aim of our work is to take advantage of Table 4). 
different types of information (i.e., dependency As an extension of their work, they boosted sys- patterns, regex patterns, trigger words, negative tem performance by training on multiple PPI cor- cues, syntactic dependencies among words and pora instead of on a single corpus and adopting constituent parse trees) and their different repre- a corpus weighting concept with support vector sentations (i.e. flat features, tree structures and machine (SVM) which they call SVM-CW (Miwa graphs) which can complement each other to learn et al., 2009b). Since most of their results are re- more accurate models. ported by training on the combination of multi- 2 The syntactic dependencies of the words of a sentence ple corpora, it is not possible to compare them create a dependency graph. A v-walk feature consists of directly with the results published in the other re- (wordi − dependency typei,i+1 − wordi+1 ), and an e- lated works (that usually adopt 10-fold cross vali- walk feature is composed of (dependency typei−1,i − wordi − dependency typei,i+1 ). Note that, in a depen- dation on a single PPI corpus). To be comparable dency graph, the words are nodes while the dependency with the vast majority of the existing work, we types are edges. also report results using 10-fold cross validation 421 Corpus Sentences Positive pairs Negative pairs (PET) kernel respectively. w is a multiplicative BioInfer 1,100 2,534 7,132 constant used for the PET kernel. It allows the AIMed 1,955 1,000 4,834 hybrid kernel to assign more (or less) weight to IEPA 486 335 482 the information obtained using tree structures de- HPRD50 145 163 270 pending on the corpus. The proposed hybrid ker- LLL 77 164 166 nel is valid according to the closure properties of kernels. Table 1: Basic statistics of the 5 benchmark PPI cor- Both the TPWF and SL kernels are linear ker- pora. nels, while PET kernel is computed using Unlex- icalized Partial Tree (uPT) kernel (Severyn and on single corpora. Moschitti, 2010). 
The following subsections ex- Apart from the approaches described above, plain each of the individual kernels in more detail. there also exist other studies that used kernels for 4.1 Proposed TPWF Kernel PPI extraction (e.g. subsequence kernel (Bunescu and Mooney, 2006)). 4.1.1 Reduced graph, trigger words, A notable exception is the work published by negative cues and dependency patterns Bui et al. (2010). They proposed an approach that For each of the candidate entity pairs, we consists of two phases. In the first phase, their construct a type of subgraph from the depen- system categorizes the data into different groups dency graph formed by the syntactic dependen- (i.e. subsets) based on various properties and pat- cies among the words of a sentence. We call it terns. Later they classify candidate PPI pairs in- “reduced graph” and define it in the follow- side each of the groups using SVM trained with ing way: features specific for the corresponding group. A reduced graph is a subgraph 3 Data of the dependency graph of a sentence which includes: There are 5 benchmark corpora for the PPI task • the two candidate entities and their that are frequently used: HPRD50 (Fundel et al., governor nodes up to their least 2007), IEPA (Ding et al., 2002), LLL (N´edellec, common governor (if exists). 2005), BioInfer (Pyysalo et al., 2007) and AIMed (Bunescu et al., 2005). These corpora adopt dif- • dependent nodes (if exist) of all the ferent PPI annotation formats. For a comparative nodes added in the previous step. evaluation Pyysalo et al. (2008) put all of them • the immediate governor(s) (if ex- in a common format which has become the stan- ists) of the least common governor. dard evaluation format for the PPI task. In our Figure 1 shows an example of a reduced graph. experiments, we use the versions of the corpora A reduced graph is an extension of the smallest converted to such format. 
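Because a weighted sum of valid kernels is itself a valid kernel (the closure property the definition of KHybrid relies on), the hybrid kernel can be evaluated by simply adding the component Gram matrices entry-wise. A minimal sketch with toy 2x2 Gram matrices standing in for the real TPWF, SL and PET components:

```python
def hybrid_gram(gram_tpwf, gram_sl, gram_pet, w=1.0):
    """Entry-wise K_Hybrid = K_TPWF + K_SL + w * K_PET over
    precomputed Gram matrices (given as lists of rows)."""
    n = len(gram_tpwf)
    return [[gram_tpwf[i][j] + gram_sl[i][j] + w * gram_pet[i][j]
             for j in range(n)] for i in range(n)]

# Toy Gram matrices for two candidate relation instances.
K = hybrid_gram([[1.0, 0.2], [0.2, 1.0]],
                [[1.0, 0.5], [0.5, 1.0]],
                [[1.0, 0.1], [0.1, 1.0]], w=2.0)
```

The multiplier w plays exactly the role described in the text: scaling the contribution of the tree-structure information per corpus without affecting kernel validity.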
common subgraph of the dependency graph that Table 1 shows various statistics regarding the 5 aims at overcoming its limitations. It is a known (converted) corpora. issue that the smallest common subgraph (or sub- 4 Proposed Hybrid Kernel tree) sometimes does not contain cue words. Pre- viously, Chowdhury et al. (2011a) proposed a lin- The hybrid kernel that we propose is as follows: guistically motivated extension of the minimal KHybrid (R1 , R2 ) = KT P W F (R1 , R2 ) (i.e. smallest) common subtree (which includes + KSL (R1 , R2 ) + w * KP ET (R1 , R2 ) the candidate entity pairs), known as Mildly Ex- tended Dependency Tree (MEDT). However, the where KT P W F stands for the new feature rules used for MEDT are too constrained. Our ob- based kernel (henceforth, TPWF kernel) com- jective in constructing the reduced graph is to in- puted using flat features collected by exploiting clude any potential modifier(s) or cue word(s) that patterns, trigger words, negative cues and walk describes the relation between the given pair of features. KSL and KP ET stand for the Shallow entities. Sometimes such modifiers or cue words Linguistic (SL) kernel and the Path-enclosed Tree are not directly dependent (syntactically) on any 422 BioInfer AIMed IEPA HPRD50 LLL P R F P R F P R F P R F P R F Only walk features 51.8 71.2 60.0 48.7 63.2 55.0 61.0 75.2 67.4 60.2 65.0 62.5 64.6 87.8 74.4 Features: dep. patterns, 53.8 68.8 60.4 50.6 63.9 56.5 63.9 74.6 68.9 65.0 71.8 68.2 66.5 89.6 76.4 trigger, neg. cues, walks Features: dep. patterns, 53.5 68.6 60.1 52.5 62.9 57.2 63.8 74.6 68.8 65.1 69.9 67.5 67.4 88.4 76.5 trigger, neg. cues, walks, regex patterns Table 2: Results of the proposed TPWF feature based kernel on 5 benchmark PPI corpora before and after adding features collected using dependency patterns, regex patterns, trigger words and negative cues to the walk features. The TPWF kernel is a component of the new hybrid kernel. 
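The three-step reduced-graph definition can be sketched directly over a dependency tree. This is an illustrative reading of the definition, not the authors' implementation; the toy tree below loosely mirrors the Figure 1 example, where "not" attaches to the least common governor rather than to either entity:

```python
def reduced_graph(governor, dependents, e1, e2):
    """Node set of a reduced graph: (1) both entities and their governor
    chains up to the least common governor (LCG); (2) dependents of every
    node added so far; (3) the immediate governor of the LCG, if any.
    `governor` maps node -> governor (or None); `dependents` maps
    node -> list of dependents. Assumes the dependencies form a tree."""
    def chain(n):
        path = [n]
        while governor.get(n) is not None:
            n = governor[n]
            path.append(n)
        return path

    c1, c2 = chain(e1), chain(e2)
    lcg = next((n for n in c1 if n in c2), None)
    if lcg is None:
        nodes = {e1, e2}
    else:
        nodes = set(c1[:c1.index(lcg) + 1]) | set(c2[:c2.index(lcg) + 1])
    for n in list(nodes):                    # step 2: add dependents
        nodes.update(dependents.get(n, []))
    if lcg is not None and governor.get(lcg) is not None:
        nodes.add(governor[lcg])             # step 3: governor of the LCG
    return nodes

# Toy tree: "not" depends on the LCG "promote", not on either entity.
gov = {"promote": None, "mutant": "promote", "degradation": "promote",
       "not": "promote", "pVHL": "mutant", "HIF1-Alpha": "degradation"}
deps = {"promote": ["mutant", "not", "degradation"],
        "mutant": ["pVHL"], "degradation": ["HIF1-Alpha"]}
g = reduced_graph(gov, deps, "pVHL", "HIF1-Alpha")
```

Step 2 is what pulls in cue words like "not" that the smallest common subgraph would miss.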
Figure 1: Dependency graph for the sentence “A pVHL mutant containing a P154L substitution does not promote degradation of HIF1-Alpha” generated by the Stanford parser. The edges with blue dots form the smallest common subgraph for the candidate entity pair pVHL and HIF1-Alpha, while the edges with red dots form the reduced graph for the pair. of the entities (of the candidate pair). Rather they of a (positive or negative) entity pair in the train- are dependent on some other word(s) which is de- ing data. For example, the dependency pattern for pendent on one (or both) of the entities. The word the reduced graph in Figure 1 is {det, amod, part- “not” in Figure 1 is one such example. The re- mod, nsubj, aux, neg, dobj, prep of }. The same duced graph aims to preserve these cue words. dependency pattern might be constructed for mul- The following types of features are collected tiple (positive or negative) entity pairs. However, from the reduced graph of a candidate pair: if it is constructed for both positive and negative pairs, it has to be discarded from the pattern list. 1. HasTriggerWord: whether the least common The dependency patterns allow some kind of governor(s) of the target entity pairs inside underspecification as they do not contain the lex- the reduced graph matches any trigger word. ical items (i.e. words) but contain the likely com- 2. Trigger-X: whether the least common gov- bination of syntactic dependencies that a given re- ernor(s) of the target entity pairs inside the lated pair of entities would pose inside their re- reduced graph matches the trigger word ‘X’. duced graph. The list of trigger words contains 144 words 3. HasNegWord: whether the reduced graph previously used by Bui et al. (2010) and Fundel contains any negative word. et al. (2007). The list of negative cues contain 18 4. DepPattern-i: whether the reduced graph words, most of which are mentioned in Fundel et contains all the syntactic dependencies of the al. (2007). 
i-th pattern of dependency pattern list. 4.1.2 Walk features The dependency pattern list is automatically We extract e-walk and v-walk features from constructed from the training data during the the Mildly Extended Dependency Tree (MEDT) learning phase. Each pattern is a set of syntactic (Chowdhury et al., 2011a) of each candidate pair. dependencies of the corresponding reduced graph Reduced graphs sometimes include some unin- 423 BioInfer AIMed IEPA HPRD50 LLL Pos. / Neg. 2,534 / 7,132 1,000 / 4,834 335 / 482 163 / 270 164 / 166 P R F P R F P R F P R F P R F Proposed TPWF kernel 53.8 68.8 60.4 50.6 63.9 56.5 63.9 74.6 68.9 65.0 71.8 68.2 66.5 89.6 76.4 (without regex) Proposed TPWF kernel 53.5 68.6 60.1 52.5 62.9 57.2 63.8 74.6 68.8 65.1 69.9 67.5 67.4 88.4 76.5 (with regex) SL kernel 60.8 65.8 63.2 56.2 64.4 60.0 73.3 71.9 72.6 62.0 65.0 63.5 74.9 85.4 79.8 PET kernel 72.8 74.9 73.9 44.8 72.8 55.5 70.7 77.9 74.2 65.0 73.0 68.8 72.1 89.6 79.9 Proposed hybrid kernel 80.0 71.4 75.5 64.2 58.2 61.1 81.1 69.3 74.7 72.9 59.5 65.5 70.4 95.7 81.1 (PET + SL + TPWF (without regex)) Proposed hybrid kernel 80.1 72.0 75.9 64.4 58.3 61.2 79.3 69.6 74.1 71.9 61.4 66.2 70.6 95.1 81.0 (PET + SL + TPWF (with regex)) Table 3: Results of the proposed hybrid kernel and its individual components. Pos. and Neg. refer to number positive and negative relations respectively. PET refers to the path-enclosed tree kernel, SL refers to the shallow linguistic kernel, and TPWF refers to the kernel computed using trigger, pattern, negative cue and walk features. formative words which produce uninformative (i.e. {node X, dependent 1 of X} and walk features. Hence, they are not suitable for {node X, dependent 2 of X}). walk feature generation. MEDT suits better for Apart from the above types of features, we also this purpose. The walk features extracted from add features for lemmas of the immediate preced- MEDTs have the following properties: ing and following words of the candidate entities. 
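The pattern-list construction described in Section 4.1.1 — collect each pair's set of dependency types, then discard any pattern observed with both positive and negative pairs — can be sketched as:

```python
def build_pattern_list(training_pairs):
    """Collect dependency patterns (sets of dependency types from each
    pair's reduced graph) and keep only those seen exclusively with
    positive or exclusively with negative pairs.
    `training_pairs` is a list of (dep_types, is_positive) items."""
    pos, neg = set(), set()
    for dep_types, is_positive in training_pairs:
        (pos if is_positive else neg).add(frozenset(dep_types))
    return (pos | neg) - (pos & neg)  # drop patterns seen in both classes

patterns = build_pattern_list([
    ({"det", "amod", "nsubj", "neg", "dobj"}, True),
    ({"det", "nsubj", "dobj"}, False),
    ({"det", "amod", "nsubj", "neg", "dobj"}, False),  # seen in both -> drop
])
```

Representing each pattern as a frozenset makes the "contains all the syntactic dependencies of the i-th pattern" feature a simple subset test at classification time.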
These feature names are augmented with -1 or +1 • The directionality of the edges (or nodes) in depending on whether the corresponding words an e-walk (or v-walk) is not considered. In are preceded or followed by a candidate entity. other words, e.g., pos(stimulatory)−amod− pos(ef f ects) and pos(ef f ects) − amod − 4.1.3 Regular expression patterns pos(stimulatory) are treated as the same fea- We use a set of 22 regex patterns as binary ture. features. These patterns were previously used by Ono et al. (2001) and Bui et al. (2010). • The v-walk features are of the form (posi − If there is a match for a pattern (e.g. “En- dependency typei,i+1 −posi+1 ). Here, posi is tity 1.*activates.*Entity 2” where Entity 1 and the POS tag of wordi , i is the governor node Entity 2 form the candidate entity pair) in a given and i + 1 is the dependent node. sentence, value 1 is added for the feature (i.e., pat- tern) inside the feature vector. • The e-walk features are of the form (dep. typei−1,i − posi − dep. typei,i+1 ) and 4.2 Shallow Linguistic (SL) Kernel (dep. typei−1,i − lemmai − dep. typei,i+1 ). The Shallow Linguistic (SL) kernel was proposed Here, lemmai is the lemmatized form of by Giuliano et al. (2006). It is one of the best wordi . performing kernels applied on different biomedi- • Usually, the e-walk features are con- cal RE tasks such as PPI and DDI (drug-drug in- structed using dependency types be- teraction) extraction (Tikk et al., 2010; Segura- tween {governor of X, node X} and Bedmar et al., 2011; Chowdhury and Lavelli, {node X, dependent of X}. However, 2011b; Chowdhury et al., 2011c). It is defined we also extract e-walk features from as follows: the dependency types between any two KSL (R1 , R2 ) = KLC (R1 , R2 ) + KGC dependents and their common governor (R1 , R2 ) 424 BioInfer AIMed IEPA HPRD50 LLL Pos. / Neg. 
2,534 / 7,132 1,000 / 4,834 335 / 482 163 / 270 164 / 166 P R F P R F P R F P R F P R F SL kernel – – – 60.9 57.2 59.0 – – – – – – – – – (Giuliano et al., 2006) APG kernel 56.7 67.2 61.3 52.9 61.8 56.4 69.6 82.7 75.1 64.3 65.8 63.4 72.5 87.2 76.8 (Airola et al., 2008) Hybrid kernel and 65.7 71.1 68.1 55.0 68.8 60.8 67.5 78.6 71.7 68.5 76.1 70.9 77.6 86.0 80.1 multiple parser input (Miwa et al., 2009a) SVM-CW, multiple – – 67.6 – – 64.2 – – 74.4 – – 69.7 – – 80.5 parser input and graph, walk and BOW features (Miwa et al., 2009b) kBSPS kernel 49.9 61.8 55.1 50.1 41.4 44.6 58.8 89.7 70.5 62.2 87.1 71.0 69.3 93.2 78.1 (Tikk et al., 2010) Walk weighted 61.8 54.2 57.6 61.4 53.3 56.6 73.8 71.8 72.9 66.7 69.2 67.8 76.9 91.2 82.4 subsequence kernel (Kim et al., 2010) 2 phase extraction 61.7 57.5 60.0 55.3 68.5 61.2 – – – – – – – – – (Bui et al., 2010) Our proposed hybrid 80.0 71.4 75.5 64.2 58.2 61.1 81.1 69.3 74.7 72.9 59.5 65.5 70.4 95.7 81.1 kernel (PET + SL + TPWF without regex) Table 4: Comparison of the results on the 5 benchmark PPI corpora. Pos. and Neg. refer to number positive and negative relations respectively. The underlined numbers indicate the best results for the corresponding corpus reported by any of the existing state-of-the-art approaches. The results of Bui et al. (2010) on LLL, HPRD50, and IEPA are not reported since thy did not use all the positive and negative examples during cross validation. Miwa et al. (2009b) showed that better results can be obtained using multiple corpora for training. However, we consider only those results of their experiments where they used single training corpus as it is the standard evaluation approach adopted by all the other studies on PPI extraction for comparing results. All the results of the previous approaches reported in this table are directly quoted from their respective original papers. where KSL , KGC and KLC correspond to SL, main). 
A PET is the smallest common subtree of a global context (GC) and local context (LC) ker- phrase structure tree that includes the two entities nels respectively. The GC kernel exploits contex- involved in a relation. tual information of the words occurring before, between and after the pair of entities (to be in- vestigated for RE) in the corresponding sentence; while the LC kernel exploits contextual informa- A tree kernel calculates the similarity between tion surrounding individual entities. two input trees by counting the number of com- mon sub-structures. Different techniques have 4.3 Path-enclosed tree (PET) Kernel been proposed to measure such similarity. We use the Unlexicalized Partial Tree (uPT) kernel (Sev- The path-enclosed tree (PET) kernel3 was first eryn and Moschitti, 2010) for the computation of proposed by Moschitti (2004) for semantic role the PET kernel since a comparative evaluation by labeling. It was later successfully adapted by Chowdhury et al. (2011a) reported that uPT ker- Zhang et al. (2005) and other works for relation nels achieve better results for PPI extraction than extraction on general texts (such as newspaper do- the other techniques used for tree kernel compu- 3 Also known as shortest path-enclosed tree (SPT) kernel. tation. 425 5 Experimental Settings 6 Results and Discussion We have followed the same criteria commonly To measure the contribution of the features col- used for the PPI extraction tasks, i.e. abstract- lected from the reduced graphs (using dependency wise 10-fold cross validation on individual corpus patterns, trigger words and negative cues) and and one-answer-per-occurrence criterion. In fact, regex patterns, we have applied the new TPWF we have used exactly the same (abstract-wise) kernel on the 5 PPI corpora before and after using fold splitting of the 5 benchmark (converted) cor- these features. Results shown in Table 2 clearly pora used by Tikk et al. 
The Charniak-Johnson reranking parser (Charniak and Johnson, 2005), along with a self-trained biomedical parsing model (McClosky, 2010), has been used for tokenization, POS-tagging and parsing of the sentences. Before parsing the sentences, all the entities are blinded by assigning names of the form EntityX, where X is the entity index. In each example, the POS tags of the two candidate entities are also changed to EntityX. The parse trees produced by the Charniak-Johnson reranking parser are then processed by the Stanford parser[5] (Klein and Manning, 2003) to obtain syntactic dependencies according to the Stanford Typed Dependency format.

The Stanford parser often skips some syntactic dependencies in its output. We use the following two rules to add some of such dependencies:

• If there is a "conj_and" or "conj_or" dependency between two words X and Y, then X should be dependent on any word Z on which Y is dependent, and vice versa.

• If there are two verbs X and Y such that inside the corresponding sentence they have only the word "and" or "or" between them, then any word Z dependent on X should also be dependent on Y, and vice versa.

Our system exploits SVM-LIGHT-TK[6] (Moschitti, 2006; Joachims, 1999). We made minor changes in the toolkit to compute the proposed hybrid kernel. The ratio of negative and positive examples has been used as the value of the cost-ratio-factor parameter. We have done parameter tuning following the approach described by Hsu et al. (2003).

[4] Downloaded from http://informatik.hu-berlin.de/forschung/gebiete/wbi/ppi-benchmark .
[5] http://nlp.stanford.edu/software/lex-parser.shtml
[6] http://disi.unitn.it/moschitti/Tree-Kernel.htm

6 Results and Discussion

To measure the contribution of the features collected from the reduced graphs (using dependency patterns, trigger words and negative cues) and of the regex patterns, we have applied the new TPWF kernel on the 5 PPI corpora before and after adding these features. The results shown in Table 2 clearly indicate that usage of these features improves the performance. The improvement is primarily due to the usage of dependency patterns, which resulted in higher precision for all the corpora.

We have also tried to measure the contribution of the regex patterns. However, a clear trend does not emerge from the empirical results (see Table 2).

Table 3 shows a comparison among the results of the proposed hybrid kernel and its individual components. As we can see, the overall results of the hybrid kernel (with and without regex pattern features) are better than those obtained by any of its individual component kernels. Interestingly, the precision achieved on the 4 benchmark corpora (other than the smallest corpus, LLL) is much higher for the hybrid kernel than for the individual components. This strongly indicates that these different types of information (i.e. dependency patterns, regex patterns, triggers, negative cues, syntactic dependencies among words and constituent parse trees) and their different representations (i.e. flat features, tree structures and graphs) can complement each other to learn more accurate models.

Table 4 shows a comparison of the PPI extraction results of our proposed hybrid kernel with those of other state-of-the-art approaches. Since the contribution of the regex patterns to the performance of the hybrid kernel was not relevant (as Tables 2 and 3 show), we used the results of the proposed hybrid kernel without regex for the comparison. As we can see, the proposed kernel achieves significantly higher results on the BioInfer corpus, the largest benchmark PPI corpus available (2,534 positive PPI pair annotations), than any of the existing approaches. Moreover, the results of the proposed hybrid kernel are on par with the state-of-the-art results on the other, smaller corpora.

Furthermore, empirical results show that the proposed hybrid kernel attains considerably higher precision than the existing approaches.
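The two dependency-completion rules described in the experimental settings can be sketched as follows. The representation is invented for illustration (dependencies as (head, label, dependent) triples over 0-based word indices, tokens as (word, POS) pairs); this is not the authors' actual implementation:

```python
def add_missing_dependencies(deps, tokens):
    """Heuristically add dependencies skipped by the parser.

    deps: set of (head, label, dependent) triples (0-based indices).
    tokens: list of (word, POS) pairs.
    """
    deps = set(deps)
    # Rule 1: if X and Y are linked by conj_and / conj_or, X inherits
    # any governor of Y and vice versa (single pass, not a fixed point).
    for (x, label, y) in list(deps):
        if label in ("conj_and", "conj_or"):
            for (z, l2, d2) in list(deps):
                if (z, l2, d2) == (x, label, y):
                    continue
                if d2 == x:
                    deps.add((z, l2, y))
                elif d2 == y:
                    deps.add((z, l2, x))
    # Rule 2: two verbs separated only by "and"/"or" share dependents.
    for i in range(len(tokens) - 2):
        if (tokens[i][1].startswith("VB")
                and tokens[i + 2][1].startswith("VB")
                and tokens[i + 1][0].lower() in ("and", "or")):
            for (h, l, d) in list(deps):
                if l.startswith("conj"):
                    continue
                if h == i:
                    deps.add((i + 2, l, d))
                elif h == i + 2:
                    deps.add((i, l, d))
    return deps

# "Mary opens and closes doors": the parser attaches nsubj/dobj only
# to "opens"; rule 2 copies them to "closes" as well.
tokens = [("Mary", "NNP"), ("opens", "VBZ"), ("and", "CC"),
          ("closes", "VBZ"), ("doors", "NNS")]
deps = {(1, "nsubj", 0), (1, "dobj", 4), (1, "conj_and", 3)}
deps = add_missing_dependencies(deps, tokens)
```

Propagating governors and dependents across coordinated words in this way recovers edges that a single-headed dependency output leaves implicit.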
Since a dependency pattern, by construction, contains all the syntactic dependencies inside the corresponding reduced graph, it may happen that some of the dependencies (e.g. det, i.e. determiner) are not informative for classifying the class label (i.e., positive or negative relation) of the pattern. Their presence inside a pattern might make it unnecessarily rigid and less general. So, we tried to identify and discard such non-informative dependencies by measuring the probabilities of the dependencies with respect to the class label and then removing any dependency with probability lower than a threshold (we experimented with different threshold values). But doing so decreased the performance. This suggests that the syntactic dependencies of a dependency pattern are not independent of each other, even if some of them might individually have low probability with respect to the class label. We plan to further investigate whether there could be different criteria for identifying non-informative dependencies. For the work reported in this paper, we used the dependency patterns as they are initially constructed.

We also did experiments to see whether collecting features for trigger words from the whole reduced graph would help. But that also decreased performance. This suggests that trigger words are more likely to appear in the least common governors.

We believe there are at least three ways to further improve the proposed approach. First of all, the 22 regular expression patterns (collected from Ono et al. (2001) and Bui et al. (2010)) are applied at the level of the sentences, and this sometimes produces unwanted matches. For example, consider the sentence "X activates Y and inhibits Z", where X, Y and Z are entities. The pattern "Entity1.∗activates.∗Entity2" matches both the X–Y and the X–Z pairs in the sentence, but only the X–Y pair should be considered. So, the patterns should be constrained to reduce the number of unwanted matches. For example, they could be applied on smaller linguistic units than full sentences. Secondly, different techniques could be used to identify less-informative syntactic dependencies inside dependency patterns to make them more accurate and effective. Thirdly, the usage of automatically collected paraphrases of the regular expression patterns, instead of the patterns themselves, could also be helpful. Weakly supervised collection of paraphrases for RE has already been investigated (e.g. Romano et al. (2006)) and, hence, could be tried for improving the TPWF kernel (which is a component of the proposed hybrid kernel).

7 Conclusion

In this paper, we have proposed a new hybrid kernel for RE that combines two vector based kernels and a tree kernel. The proposed kernel outperforms any of the existing approaches by a wide margin on the BioInfer corpus, the largest PPI benchmark corpus available. On the other four, smaller benchmark corpora, it performs either better than or almost as well as the existing state-of-the-art approaches.

We have also proposed a novel feature based kernel, called the TPWF kernel, using (automatically collected) dependency patterns, trigger words, negative cues, walk features and regular expression patterns. The TPWF kernel is used as a component of the new hybrid kernel.

Empirical results show that the proposed hybrid kernel achieves considerably higher precision than the existing approaches, which indicates its capability of learning more accurate models. This also demonstrates that the different types of information that we use are able to complement each other for relation extraction.

Acknowledgments

This work was carried out in the context of the project "eOnco - Pervasive knowledge and data management in cancer care". The authors are grateful to Alessandro Moschitti for his help in the use of SVM-LIGHT-TK. We also thank the anonymous reviewers for helpful suggestions.

References

Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008.
All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics, 9(Suppl 11):S2.

Quoc-Chinh Bui, Sophia Katrenko, and Peter M.A. Sloot. 2010. A hybrid approach to extract protein-protein interactions. Bioinformatics.

Razvan Bunescu and Raymond J. Mooney. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS 2006, pages 171–178.

Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139–155.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL 2005.

Md. Faisal Mahbub Chowdhury and Alberto Lavelli. 2011b. Drug-drug interaction extraction using composite kernels. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 27–33, Huelva, Spain, September.

Md. Faisal Mahbub Chowdhury, Alberto Lavelli, and Alessandro Moschitti. 2011a. A study on dependency tree kernels for automatic extraction of protein-protein interaction. In Proceedings of the BioNLP 2011 Workshop, pages 124–133, Portland, Oregon, USA, June.

Md. Faisal Mahbub Chowdhury, Asma Ben Abacha, Alberto Lavelli, and Pierre Zweigenbaum. 2011c. Two different machine learning techniques for drug-drug interaction extraction. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 19–26, Huelva, Spain, September.

J. Ding, D. Berleant, D. Nettleton, and E. Wurtele. 2002. Mining MEDLINE: abstracts, sentences, or phrases? Pacific Symposium on Biocomputing, pages 326–337.

Katrin Fundel, Robert Küffner, and Ralf Zimmer. 2007. RelEx - relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.

Claudio Giuliano, Alberto Lavelli, and Lorenza Romano. 2006. Exploiting shallow linguistic information for relation extraction from biomedical literature. In Proceedings of EACL 2006, pages 401–408.

CW Hsu, CC Chang, and CJ Lin. 2003. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Thorsten Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, USA.

Seonho Kim, Juntae Yoon, Jihoon Yang, and Seog Park. 2010. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics, 11(1).

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL 2003, pages 423–430, Sapporo, Japan.

David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Jun'ichi Tsujii. 2009a. Protein-protein interaction extraction by leveraging multiple kernels and parsers. International Journal of Medical Informatics, 78.

Makoto Miwa, Rune Sætre, Yusuke Miyao, and Jun'ichi Tsujii. 2009b. A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of EMNLP 2009, pages 121–130, Singapore.

Alessandro Moschitti. 2004. A study on convolution kernels for shallow semantic parsing. In Proceedings of ACL 2004, Barcelona, Spain.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL 2006, Trento, Italy.

Claire Nédellec. 2005. Learning language in logic - genic interaction extraction challenge. In Proceedings of the ICML 2005 Workshop: Learning Language in Logic (LLL05), pages 31–37.

Toshihide Ono, Haretsugu Hishigaki, Akira Tanigami, and Toshihisa Takagi. 2001. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155–161.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8(1):50.

Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, and Tapio Salakoski. 2008. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl 3):S6.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of EACL 2006, pages 409–416.

Isabel Segura-Bedmar, Paloma Martínez, and Cesar de Pablo-Sánchez. 2011. Using a shallow linguistic kernel for drug-drug interaction extraction. Journal of Biomedical Informatics, in press, corrected proof, available online 24 April.

Aliaksei Severyn and Alessandro Moschitti. 2010. Fast cutting plane training for structural kernels. In Proceedings of ECML-PKDD 2010.

Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. 2010. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology, 6(7), July.

Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 2005. Discovering relations between named entities from a large raw corpus using tree similarity-based clustering.
In Natural Language Processing - IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science, pages 378–389. Springer, Berlin/Heidelberg.

Coordination Structure Analysis using Dual Decomposition

Atsushi Hanamoto (1)   Takuya Matsuzaki (1)   Jun'ichi Tsujii (2)
1. Department of Computer Science, University of Tokyo, Japan
2. Web Search & Mining Group, Microsoft Research Asia, China
{hanamoto, matuzaki}@is.s.u-tokyo.ac.jp

[email protected]

Abstract

Coordination disambiguation remains a difficult sub-problem in parsing despite the frequency and importance of coordination structures. We propose a method for disambiguating coordination structures. In this method, dual decomposition is used as a framework to take advantage of both HPSG parsing and coordination structure analysis with alignment-based local features. We evaluate the performance of the proposed method on the Genia corpus and the Wall Street Journal portion of the Penn Treebank. Results show it increases the percentage of sentences in which coordination structures are detected correctly, compared with each of the two algorithms alone.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 430–438, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Coordination structures often give rise to syntactic ambiguity in natural language. Although a wrong analysis of a coordination structure often leads to a totally garbled parsing result, coordination disambiguation remains a difficult sub-problem in parsing, even for state-of-the-art parsers.

One approach to solve this problem is a grammatical approach. This approach, however, often fails in noun and adjective coordinations because there are many possible structures in these coordinations that are grammatically correct. For example, a noun sequence of the form "n0 n1 and n2 n3" has as many as five possible structures (Resnik, 1999). Therefore, a grammatical approach is not sufficient to disambiguate coordination structures. In fact, the Stanford parser (Klein and Manning, 2003) and Enju (Miyao and Tsujii, 2004) fail to disambiguate the sentence "I am a freshman advertising and marketing major." Table 1 shows their output and the correct coordination structure.

  Stanford parser/Enju:
    I am a ( freshman advertising ) and ( marketing major )
  Correct coordination structure:
    I am a freshman ( ( advertising and marketing ) major )

Table 1: Output from the Stanford parser, Enju and the correct coordination structure

The coordination structure above is obvious to humans because there is a symmetry of conjuncts (-ing) in the sentence. Coordination structures often have such structural and semantic symmetry of conjuncts. One approach is to capture the local symmetry of conjuncts. However, this approach fails in VP and sentential coordinations, which can easily be detected by a grammatical approach. This is because conjuncts in these coordinations do not necessarily have local symmetry.

It is therefore natural to think that considering both the syntax and the local symmetry of conjuncts would lead to a more accurate analysis. However, it is difficult to consider both of them in a dynamic programming algorithm, which has often been used for each of them, because doing so explodes the computational and implementational complexity. Thus, previous studies on coordination disambiguation often dealt only with a restricted form of coordination (e.g. noun phrases) or used a heuristic approach for simplicity.

In this paper, we present a statistical analysis model for coordination disambiguation that uses dual decomposition as a framework. We consider both the syntax and the structural and semantic symmetry of conjuncts, so that the model outperforms existing methods that consider only either of them. Moreover, it is still simple and requires only O(n^4) time per iteration, where n is the number of words in a sentence. This is equal to that of coordination structure analysis with alignment-based local features. The overall system still has a quite simple structure because this approach requires only slight modifications of existing models, so we can easily add other modules or features in the future.

The structure of this paper is as follows. First, we describe the three basic methods required in the technique we propose: 1) coordination structure analysis with alignment-based local features, 2) HPSG parsing, and 3) dual decomposition. Finally, we show experimental results that demonstrate the effectiveness of our approach. We compare three methods: coordination structure analysis with alignment-based local features, HPSG parsing, and the dual-decomposition-based approach that combines both.

2 Related Work

Many previous studies of coordination disambiguation have focused on a particular type of NP coordination (Hogan, 2007). Resnik (1999) disambiguated coordination structures by using semantic similarity of the conjuncts in a taxonomy. He dealt with two kinds of patterns, [n0 n1 and n2 n3] and [n1 and n2 n3], where the ni are all nouns. He detected coordination structures based on similarity of form, meaning and conceptual association between n1 and n2 and between n1 and n3. Nakov and Hearst (2005) used the Web as a training set and applied it to a task similar to Resnik's.

In terms of integrating coordination disambiguation with an existing parsing model, our approach resembles that of Hogan (2007). She detected noun phrase coordinations by finding symmetry in conjunct structure and the dependency between the lexical heads of the conjuncts. These are used to rerank the n-best outputs of the Bikel parser (2004), whereas the two models interact with each other in our method.

Shimbo and Hara (2007) proposed an alignment-based method for detecting and disambiguating non-nested coordination structures. They disambiguated coordination structures based on the edit distance between two conjuncts. Hara et al. (2009) extended the method, dealing with nested coordinations as well. We use their method as one of the two sub-models.

3 Background

3.1 Coordination structure analysis with alignment-based local features

Coordination structure analysis with alignment-based local features (Hara et al., 2009) is a hybrid approach to coordination disambiguation that combines a simple grammar, to ensure a consistent global structure of coordinations in a sentence, with features based on sequence alignment, to capture the local symmetry of conjuncts. In this section, we describe the method briefly.

A sentence is denoted by x = x1...xk, where xi is the i-th word of x. A coordination boundaries set is denoted by y = y1...yk, where

    yi = (bl, el, br, er)  if xi is a coordinating conjunction having left conjunct xbl...xel and right conjunct xbr...xer
    yi = null              otherwise

In other words, yi has a non-null value only when xi is a coordinating conjunction. For example, the sentence "I bought books and stationery" has the coordination boundaries set (null, null, null, (3, 3, 5, 5), null).

The score of a coordination boundaries set is defined as the sum of the scores of all coordinating conjunctions in the sentence:

    score(x, y) = Σ_{m=1}^{k} score(x, ym) = Σ_{m=1}^{k} w · f(x, ym)    (1)

where f(x, ym) is a real-valued feature vector of the coordinating conjunction xm. We used almost the same feature set as Hara et al. (2009): namely, the surface word, part-of-speech, suffix and prefix of the words, and their combinations. We used the averaged perceptron to tune the weight vector w.

Hara et al. (2009) proposed to use a context-free grammar to find a properly nested coordination structure. That is, the scoring function Eq (1) is only defined on the coordination structures that are licensed by the grammar.
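To make the boundary notation and Eq (1) concrete, here is a toy sketch. The feature map and the weights below are invented stand-ins for the real surface/POS/affix features and the perceptron-trained vector w described above:

```python
# A coordination boundaries set for "I bought books and stationery":
# y[i] is non-null only at the coordinating conjunction, and holds
# (bl, el, br, er), the 1-based spans of the left and right conjuncts.
sentence = ["I", "bought", "books", "and", "stationery"]
y = [None, None, None, (3, 3, 5, 5), None]

def features(x, ym):
    """Toy stand-in for f(x, y_m); the real features use words, POS
    tags and affixes of the conjuncts (Hara et al., 2009)."""
    bl, el, br, er = ym
    return {
        "left_head=" + x[el - 1]: 1.0,
        "right_head=" + x[er - 1]: 1.0,
        "same_length": 1.0 if (el - bl) == (er - br) else 0.0,
    }

def score(x, y, w):
    """Eq. (1): sum of w . f(x, y_m) over coordinating conjunctions."""
    total = 0.0
    for ym in y:
        if ym is not None:
            total += sum(w.get(k, 0.0) * v
                         for k, v in features(x, ym).items())
    return total

w = {"same_length": 2.0, "right_head=stationery": 0.5}
total = score(sentence, y, w)
```

A decoder then searches over candidate boundary sets for the one maximizing this score, subject to the grammar described next.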
Figure 1: subject-head schema (left) and head-complement schema (right); taken from Miyao et al. (2004).

  COORD   Coordination.
  CJT     Conjunct.
  N       Non-coordination.
  CC      Coordinating conjunction like "and".
  W       Any word.

Table 2: Non-terminals

  Rules for coordinations:
      COORDi,m → CJTi,j CCj+1,k−1 CJTk,m
  Rules for conjuncts:
      CJTi,j → (COORD|N)i,j
  Rules for non-coordinations:
      Ni,k → COORDi,j Nj+1,k
      Ni,j → Wi,i (COORD|N)i+1,j
      Ni,i → Wi,i
  Rules for pre-terminals:
      CCi,i → (and|or|but|,|;|+|+/−)i
      CCi,i+1 → (,|;)i (and|or|but)i+1
      CCi,i+2 → (as)i (well)i+1 (as)i+2
      Wi,i → ∗i

Table 3: Production rules

We only slightly extended their grammar to cover a wider variety of coordinating conjunctions. Table 2 and Table 3 show the non-terminals and production rules used in the model. The only objective of the grammar is to ensure the consistency of two or more coordinations in a sentence, which means that any two coordinations must be either non-overlapping or nested. We use a bottom-up chart parsing algorithm to output the coordination boundaries with the highest score. Note that these production rules do not need to be isomorphic to those of HPSG parsing, and actually they are not. This is because the two methods interact only through dual decomposition, and the search spaces defined by the two methods are considered separately.

This method requires O(n^4) time, where n is the number of words. This is because there are O(n^2) possible coordination structures in a sentence, and the method requires O(n^2) time to compute the feature vector of each coordination structure.

3.2 HPSG parsing

HPSG (Pollard and Sag, 1994) is one of the linguistic theories based on the lexicalized grammar formalism. In a lexicalized grammar, quite a small number of schemata is used to explain general grammatical constraints, compared with other theories. On the other hand, rich word-specific characteristics are embedded in the lexical entries. Both schemata and lexical entries are represented by typed feature structures, and constraints in parsing are checked by unification among them. Figure 1 shows examples of the schemata.

Figure 2 shows an HPSG parse tree of the sentence "Spring has come." First, the lexical entries of "has" and "come" are joined by the head-complement schema. Unification gives the HPSG sign of the mother. After applying schemata to HPSG signs repeatedly, the HPSG sign of the whole sentence is output.

Figure 2: HPSG parsing; taken from Miyao et al. (2004).

We use Enju as an English HPSG parser (Miyao et al., 2004). Figure 3 shows how a coordination structure is built in the Enju grammar. First, a coordinating conjunction and the right conjunct are joined by the coord_right schema. Afterwards, the parent and the left conjunct are joined by the coord_left schema.

The Enju parser is equipped with a disambiguation model trained by the maximum entropy method (Miyao and Tsujii, 2008). Since we do not need the probability of each parse tree, we treat the model just as a linear model that defines the score of a parse tree as the sum of feature weights. The features of the model are defined on local subtrees of a parse tree.

The Enju parser takes O(n^3) time since it uses the CKY algorithm, and each cell in the CKY parse table has at most a constant number of edges because we use a beam search algorithm. Thus, we can regard the parser as a decoder for a weighted CFG.

3.3 Dual decomposition

Dual decomposition is a classical method to solve complex optimization problems that can be decomposed into efficiently solvable sub-problems. It is becoming popular in the NLP community and has been shown to work effectively on several NLP tasks (Rush et al., 2010).

We consider an optimization problem

    arg max_x (f(x) + g(x))    (2)

which is difficult to solve (e.g. NP-hard), while arg max_x f(x) and arg max_x g(x) are efficiently solvable. In dual decomposition, we solve

    min_u max_{x,y} (f(x) + g(y) + u · (x − y))

instead of the original problem. To find the minimum value, we can use a subgradient method (Rush et al., 2010). The subgradient method is given in Table 4.

    u(1) ← 0
    for k = 1 to K do
        x(k) ← arg max_x (f(x) + u(k) · x)
        y(k) ← arg max_y (g(y) − u(k) · y)
        if x(k) = y(k) then
            return u(k)
        end if
        u(k+1) ← u(k) − a_k (x(k) − y(k))
    end for
    return u(K)

Table 4: The subgradient method

As the algorithm shows, one can reuse existing algorithms for the sub-problems and does not need an exact algorithm for the combined optimization problem; these are the attractive features of dual decomposition. If x(k) = y(k) occurs during the algorithm, we simply take x(k) as the primal solution, which is the exact answer. If not, we simply take x(K), the answer of coordination structure analysis with alignment-based features, as an approximate answer to the primal solution. This answer does not always solve the original problem Eq (2), but previous work (e.g., Rush et al. (2010)) has shown that it is effective in practice. We use it in this paper.
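The subgradient loop of Table 4 can be sketched on a toy problem. Everything below is invented for illustration: the "structures" are dictionaries of 0/1 indicator components and f, g are simple scorers, whereas in the paper the two arg max calls are the coordination analyzer and the HPSG parser:

```python
def dual_decompose(candidates, f, g, K=50, step=1.0):
    """Sketch of the subgradient method over an explicit candidate
    list; u holds one dual weight per indicator component."""
    indices = {i for c in candidates for i in c}
    u = dict.fromkeys(indices, 0.0)

    def dot(u, c):
        return sum(u[i] * c.get(i, 0) for i in u)

    for k in range(1, K + 1):
        x = max(candidates, key=lambda c: f(c) + dot(u, c))
        y = max(candidates, key=lambda c: g(c) - dot(u, c))
        if x == y:
            return x, True              # agreement: exact answer
        a_k = step / k                  # decaying step size a_k
        for i in u:
            u[i] -= a_k * (x.get(i, 0) - y.get(i, 0))
    return x, False                     # heuristic fallback (f-side)

# Two toy structures; f mildly prefers the first, g strongly the second.
c1, c2 = {1: 1, 2: 0}, {1: 0, 2: 1}
f = lambda c: 1.0 * c[1]
g = lambda c: 3.0 * c[2]
result, agreed = dual_decompose([c1, c2], f, g)
```

After one disagreement the dual updates penalize the components on which the two sub-problems differ, and on the next round both arg max calls pick the same structure, which is then returned as the exact answer.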
Second, always solve the original problem Eq (2), but pre- ovide sharing of Le3,Conjunct vious works (e.g., (Rush et al., 2010)) has shown SLASH/REL features Coordina(on are required as described in values. our previous study (Miyao et al., 2003a). Finally, that it is effective in practice. We use it in this ← coord_right_schema HPSG parsing the SUBJ feature of the complement daughter in paper. e.” First, each the Head-Complement Coordina(ng, Schema must be specified Right, nd “come” are since this Conjunc(on Conjunct an unsatu- schema may subcategorize 4 Proposed method tructure of the rated constituent, i.e., a constituent with a non- cation provides In this section, we describe how we apply dual empty SUBJ feature. When the corpus is anno- The sign of the decomposition to the two models. tated with Figure 3: at least these offeatures, Construction the lexical coordination in Enjuen- peatedly apply- tries required to explain the sentence are uniquely ns. Finally, the 4.1 Notation determined. In this study, we define partially- is output on the composed into efficiently specified derivation solvable trees as sub-problems. tree structures anno- We define some notations here. First we describe Ittated is becoming popular in the with schema names and HPSG signs NLP community includ- weighted CFG parsing, which is used for both and inghas been shown to the specifications of work the aboveeffectively features.on sev- coordination structure analysis with alignment- e Penn eral We NLP tasks (Rush et al., 2010). describe the process of grammar develop- based features and HPSG parsing. We follows the We consider an optimization ment in terms of the four phases: problemspecification, formulation by Rush et al., (2010). We assume a rammar devel- externalization, extraction, and verification. context-free grammar in Chomsky normal form, arg max(f (x) + g(x)) (2) o be annotated x with a set of non-terminals N . 
All rules of the 3.1 Specification ons, and ii) ad- grammar are either the form A → BC or A → w which is difficult to solve (e.g. NP-hard), while grammar rules General grammatical constraints are defined in where A, B, C ∈ N and w ∈ V . For rules of the arg maxx f (x) and arg maxx g(x) are effectively history of rule this phase, and in HPSG, they are represented form A → w we refer to A as the pre-terminal for tree annotated solvable. through theIn dual decomposition, design of the sign andwe solve Fig- schemata. w. annotations are uremin 1 shows max(f the(x) definition + g(y) + foru(x the−typed y)) feature Given a sentence with n words, w1 w2 ...wn , a nted for simplicity, structure of a sign used in this study. Some more u x,y parse tree is a set of rule productions of the form been omitted. features are defined for each syntactic category al- instead of the original problem. ⟨A → BC, i, k, j⟩ where A, B, C ∈ N , and To find the minimum value, we can use a sub- 1 ≤ i ≤ k ≤ j ≤ n. Each rule production rep- gradient method (Rush et al., 2010). The subgra- resents the use of CFG rule A → BC where non- dient method is given in Table 4. As the algorithm terminal A spans words wi ...wj , non-terminal B 433 spans word wi ...wk , and non-terminal C spans 1 if rule COORDa,c → CJTa,b CC , CJT ,c or word wk+1 ...wj if k < j, and the use of CFG COORD ,c → CJT , CCa,b CJT ,c is in the parse rule A → wi if i = k = j. tree; otherwise it is 0. We now define the index set for the coordina- We apply the same extension to the HPSG in- tion structure analysis as dex set, also giving an over-complete representa- tion. We define za,b,c analogously to ya,b,c . Icsa = {⟨A → BC, i, k, j⟩ : A, B, C ∈ N, 1 ≤ i ≤ k ≤ j ≤ n} 4.2 Proposed method We now describe the dual decomposition ap- Each parse tree is a vector y = {yr : r ∈ Icsa }, proach for coordination disambiguation. First, we with yr = 1 if rule r is in the parse tree, and yr = define the set Q as follows: 0 otherwise. 
Each parse tree is thus represented as a vector in {0, 1}^m, where m = |Icsa|. We use Y to denote the set of all valid parse-tree vectors; the set Y is a subset of {0, 1}^m. In addition, we assume a vector theta_csa = {theta_r^csa : r in Icsa} that specifies a score for each rule production. Each theta_r^csa can take any real value. The optimal parse tree is y* = arg max_{y in Y} y . theta_csa, where y . theta_csa = sum_r yr theta_r^csa is the inner product between y and theta_csa.

We use similar notation for HPSG parsing: we define Ihpsg, Z, and theta_hpsg as the index set, the set of all valid parse-tree vectors, and the weight vector for HPSG parsing, respectively.

We extend the index sets of both the coordination structure analysis with alignment-based features and HPSG parsing to impose a constraint between the two sub-problems. For the coordination structure analysis with alignment-based features, we define the extended index set to be I'csa = Icsa U Iuni, where

  Iuni = {(a, b, c) : a, b, c in {1 ... n}}

First, we define the set Q as follows:

  Q = {(y, z) : y in Y, z in Z, y_{a,b,c} = z_{a,b,c} for all (a, b, c) in Iuni}

Therefore, Q is the set of all (y, z) pairs that agree on their coordination structures. The joint problem of coordination structure analysis with alignment-based features and HPSG parsing is then to solve

  max_{(y,z) in Q} (y . theta_csa + gamma z . theta_hpsg)    (3)

where gamma > 0 is a parameter dictating the relative weight of the two models, chosen to optimize performance on the development set. This problem is equivalent to

  max_{z in Z} (g(z) . theta_csa + gamma z . theta_hpsg)    (4)

where g : Z -> Y is a function that maps an HPSG tree z to its set of coordination structures y = g(z).

We solve this optimization problem by using dual decomposition. Figure 4 shows the resulting algorithm, which tries to optimize the combined objective by repeatedly solving the two sub-problems separately.
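In outline, the agreement-seeking loop can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the two real parsers are replaced by hypothetical toy sub-problems that score a handful of candidates, and a simple 1/k step size is used instead of the halving schedule described later in Section 5.2.1; all function and variable names are ours.

```python
# Sketch of a dual decomposition loop in the style of Figure 4.
# Each sub-problem is a toy stand-in for a parser: given penalties u
# over a shared index set of triples, it returns its best-scoring
# candidate, represented as (base_score, indicators), where
# indicators maps each triple (a, b, c) to 0/1.

def solve_sub(candidates, u, sign):
    # sign = -1 for the y-side (score - u), +1 for the z-side (score + u)
    def penalized(cand):
        score, ind = cand
        return score + sign * sum(u[t] * ind.get(t, 0) for t in u)
    return max(candidates, key=penalized)

def dual_decomposition(y_cands, z_cands, triples, K=50, step=1.0):
    u = {t: 0.0 for t in triples}            # u^(1) <- 0
    y = {}
    for k in range(1, K + 1):
        _, y = solve_sub(y_cands, u, -1)     # step (1): CSA side
        _, z = solve_sub(z_cands, u, +1)     # step (2): HPSG side
        if all(y.get(t, 0) == z.get(t, 0) for t in triples):
            return y, True                   # agreement: exact answer
        a_k = step / k                       # simple diminishing step size
        for t in triples:
            u[t] -= a_k * (z.get(t, 0) - y.get(t, 0))
    return y, False                          # no agreement: heuristic answer
```

When the two sides agree on every triple, the returned structure comes with a certificate of optimality for the combined objective; otherwise the y-side answer is returned heuristically, mirroring the behaviour described in the text.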
After each iteration, the algorithm updates the weights u(a, b, c). These updates modify the objective functions of the two sub-problems, encouraging them to agree on the same coordination structures. If y(k) = z(k) occurs during the iterations, the algorithm simply returns y(k) as the exact answer. If not, the algorithm returns the answer of the coordination structure analysis with alignment-based features as a heuristic answer.

Here each triple (a, b, c) represents that word wc is recognized as the last word of the right conjunct and that the scope of the left conjunct or the coordinating conjunction is wa ... wb.[1] Thus each parse-tree vector y has additional components y_{a,b,c}. Note that this representation is over-complete, since a parse tree is enough to determine unique coordination structures for a sentence.

[1] This definition is derived from the structure of a coordination in Enju (Figure 3). The triples show where the coordinating conjunction and the right conjunct are in coord_right_schema, and where the left conjunct and the partial coordination are in coord_left_schema. Thus they alone enable not only the coordination structure analysis with alignment-based features but also Enju to uniquely determine the structure of a coordination.

The original sub-problems must be modified for calculating (1) and (2) in Table 4: we modified them to treat the score u(a, b, c) as a bonus/penalty on the coordination.

  u(1)(a, b, c) <- 0 for all (a, b, c) in Iuni
  for k = 1 to K do
    y(k) <- arg max_{y in Y} (y . theta_csa - sum_{(a,b,c) in Iuni} u(k)(a, b, c) y_{a,b,c})   ... (1)
    z(k) <- arg max_{z in Z} (z . theta_hpsg + sum_{(a,b,c) in Iuni} u(k)(a, b, c) z_{a,b,c})   ... (2)
    if y(k)(a, b, c) = z(k)(a, b, c) for all (a, b, c) in Iuni then
      return y(k)
    end if
    for all (a, b, c) in Iuni do
      u(k+1)(a, b, c) <- u(k)(a, b, c) - ak (z(k)(a, b, c) - y(k)(a, b, c))
    end for
  end for
  return y(K)

Figure 4: Proposed algorithm

The modified coordination structure analysis with alignment-based features adds u(k)(i, j, m) and u(k)(j+1, l-1, m), as well as w . f(x, (i, j, l, m)), to the score of the subtree when the rule production COORD_{i,m} -> CJT_{i,j} CC_{j+1,l-1} CJT_{l,m} is applied.

The modified Enju adds u(k)(a, b, c) when coord_right_schema is applied, where wa ... wb is recognized as a coordinating conjunction and the last word of the right conjunct is wc, or when coord_left_schema is applied, where wa ... wb is recognized as the left conjunct and the last word of the right conjunct is wc.

Table 6: The percentage of each conjunct type (%) in each test set

  COORD    WSJ    Genia
  NP       63.7   66.3
  VP       13.8   11.4
  ADJP      6.8    9.6
  S        11.4    6.0
  PP        2.4    5.1
  Others    1.9    1.5

5 Experiments

5.1 Test/Training data
We trained the alignment-based coordination analysis model on both the Genia corpus (Kim et al., 2003) and the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), and evaluated the performance of our method on (i) the Genia corpus and (ii) the Wall Street Journal portion of the Penn Treebank. More precisely, we used HPSG treebanks converted from the Penn Treebank and Genia, and further extracted the training/test data for the coordination structure analysis with alignment-based features using the annotation in the treebanks. Table 5 shows the corpora used in the experiments.

The Wall Street Journal portion of the Penn Treebank has 2317 sentences from WSJ articles, and there are 1356 COOD tags in these sentences, while the Genia test set has 1764 sentences from MEDLINE abstracts, with 1848 COOD tags. Coordinations are further subcategorized into phrase types such as NP-COOD or VP-COOD. Table 6 shows the percentage of each phrase type among all coordinations; it indicates that the Penn Treebank has more VP and S coordinations, while the Genia corpus has more NP and ADJP coordinations.

5.2 Implementation of sub-problems

We used Enju (Miyao and Tsujii, 2004) for the implementation of HPSG parsing, as it provides a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, while we re-implemented the algorithm of Hara et al. (2009) with slight modifications.

5.2.1 Step size

We used the following step size in our algorithm (Figure 4). First, we initialized a0, which is chosen to optimize performance on the development set.
Then we defined ak = a0 * 2^(-eta_k), where eta_k is the number of times that L(u(k')) > L(u(k'-1)) for k' <= k.

Table 5: The corpora used in the experiments

               Task (i)                               Task (ii)
  Training     WSJ (sec. 2-21) + Genia (No. 1-1600)   WSJ (sec. 2-21)
  Development  Genia (No. 1601-1800)                  WSJ (sec. 22)
  Test         Genia (No. 1801-1999)                  WSJ (sec. 23)

Table 7: Results of Task (i) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and the coordination structure analysis with alignment-based features (CSA).

             Proposed   Enju   CSA
  Precision  72.4       66.3   65.3
  Recall     67.8       65.5   60.5
  F1         70.0       65.9   62.8
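The halving step-size rule of Section 5.2.1 can be sketched in a few lines. This is a minimal illustration; the name dual_values and its contents are hypothetical, standing for the sequence of dual objective values L(u^(k)) observed so far.

```python
# Sketch of the step-size rule a_k = a0 * 2^(-eta_k), where eta_k
# counts how many times the dual objective L(u) has increased so far.
def step_size(a0, dual_values):
    # dual_values: [L(u^(1)), ..., L(u^(k))], oldest first
    eta = sum(1 for prev, cur in zip(dual_values, dual_values[1:])
              if cur > prev)
    return a0 * 2.0 ** (-eta)
```

Each increase of the dual objective signals that the current step overshot, so the step is halved; a0 itself is tuned on the development set, as stated above.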
Figure 5: Performance of the approach as a function of K for Task (i) on the development set. accuracy (%): the percentage of sentences that are correctly parsed. certificates (%): the percentage of sentences for which a certificate of optimality is obtained.

5.3 Evaluation metric

We evaluated the performance of the tested methods by the accuracy of coordination-level bracketing (Shimbo and Hara, 2007); i.e., we count each of the coordination scopes as one output of the system, and a system output is regarded as correct if both the beginning of the first output conjunct and the end of the last conjunct match the annotations in the Treebank (Hara et al., 2009).

5.4 Experimental results of Task (i)

We ran the dual decomposition algorithm with a limit of K = 50 iterations. We found that the two sub-problems return the same answer during the algorithm in over 95% of sentences.

We compare the accuracy of the dual decomposition approach to two baselines: Enju and the coordination structure analysis with alignment-based features. Table 7 shows all three results.

Figure 5 shows the performance of the approach as a function of K, the maximum number of iterations of dual decomposition. The graphs show that values of K much smaller than 50 produce almost identical performance to K = 50 (with K = 50 the accuracy of the method is 73.4%, with K = 20 it is 72.6%, and with K = 1 it is 69.3%). This means that a smaller K can be used in practice for speed.
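The coordination-level bracketing metric of Section 5.3 can be sketched as follows, assuming each coordination scope is reduced to a pair (begin of first conjunct, end of last conjunct); the function name and the spans in the test are illustrative, not taken from the paper.

```python
# Sketch of coordination-level bracketing: one output per coordination
# scope, counted correct iff both the beginning of the first conjunct
# and the end of the last conjunct match the gold annotation.
def coordination_prf(gold_scopes, pred_scopes):
    gold, pred = set(gold_scopes), set(pred_scopes)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```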
The dual decomposition method gives a statistically significant gain in precision and recall over the two baseline methods.[2]

Table 8 shows the recall of coordinations of each type. It indicates that our re-implementation of CSA and Hara et al. (2009) have roughly similar performance, although their experimental settings are different. It also shows that the proposed method takes advantage of both Enju and CSA for NP coordination, while it is likely just to take the answer of Enju for VP and sentential coordinations. This means we might well use dual decomposition only on NP coordinations to obtain a better result.

5.5 Experimental results of Task (ii)

We also ran the dual decomposition algorithm with a limit of K = 50 iterations on Task (ii). Tables 9 and 10 show the results of Task (ii); the proposed method statistically significantly outperformed the two baseline methods in precision and recall.[3]

Figure 6 shows the performance of the approach as a function of K, the maximum number of iterations of dual decomposition. The convergence speed for WSJ was faster than that for Genia. This is because a WSJ sentence often has a simpler coordination structure than a Genia sentence.

[2] p < 0.01 (by chi-square test)
[3] p < 0.01 (by chi-square test)

Table 8: The number of coordinations of each type (#) and the recall (%) for the proposed method, Enju, the coordination structure analysis with alignment-based features (CSA), and Hara et al. (2009), for Task (i) on the development set. Note that Hara et al. (2009) uses a different test set and different annotation rules, although its test data is also taken from the Genia corpus; thus the results are not directly comparable.

  COORD    #     Proposed  Enju   CSA      #     Hara et al. (2009)
  Overall  1848  67.7      63.3   61.9     3598  61.5
  NP       1213  67.5      61.4   64.1     2317  64.2
  VP        208  79.8      78.8   66.3      456  54.2
  ADJP      193  58.5      59.1   54.4      312  80.4
  S         111  51.4      52.3   34.2      188  22.9
  PP        110  64.5      59.1   57.3      167  59.9
  Others     13  78.3      73.9   65.2      140  49.3
Table 9: Results of Task (ii) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and the coordination structure analysis with alignment-based features (CSA).

             Proposed   Enju   CSA
  Precision  76.3       70.7   66.0
  Recall     70.6       69.0   60.1
  F1         73.3       69.9   62.9

Figure 6: Performance of the approach as a function of K for Task (ii) on the development set. accuracy (%): the percentage of sentences that are correctly parsed. certificates (%): the percentage of sentences for which a certificate of optimality is provided.

Table 10: The number of coordinations of each type (#) and the recall (%) for the proposed method, Enju, and the coordination structure analysis with alignment-based features (CSA), for Task (ii) on the development set.

  COORD    #     Proposed  Enju   CSA
  Overall  1017  71.6      68.1   60.7
  NP        573  76.1      71.0   67.7
  VP        187  62.0      62.6   47.6
  ADJP       73  82.2      75.3   79.5
  S         141  64.5      62.4   42.6
  PP         19  52.6      47.4   47.4
  Others     24  62.5      70.8   54.2

6 Conclusion and Future Work

In this paper, we presented an efficient method for detecting and disambiguating coordinate structures. Our basic idea was to consider both grammar and the symmetries of conjuncts by using dual decomposition.

Further study is needed from the following points of view. First, we should evaluate our method on corpora from different domains: because the characteristics of coordination structures differ from corpus to corpus, experiments on other corpora could lead to different results. Second, we would like to add further features, such as ontology-based features, to the coordination structure analysis with alignment-based local features. Finally, we can add other methods (e.g., dependency parsing) as sub-problems by using the extension of dual decomposition that can deal with more than two sub-problems.
Experiments on the Genia corpus and the Wall Street Journal portion of the Penn Treebank showed that we could obtain a statistically significant improvement in accuracy when using dual decomposition.

Acknowledgments

The second author is partially supported by KAKENHI Grant-in-Aid for Scientific Research C 21500131 and Microsoft CORE Project 7.

References

Kazuo Hara, Masashi Shimbo, Hideharu Okuma, and Yuji Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 967-975.

Deirdre Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 680-687.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. Advances in Neural Information Processing Systems, 15:3-10.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Yusuke Miyao and Jun'ichi Tsujii. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of COLING 2004, pages 1392-1397.

Yusuke Miyao and Jun'ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1):35-80.

Yusuke Miyao, Takashi Ninomiya, and Jun'ichi Tsujii. 2004.
Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP 2004).

Preslav Nakov and Marti Hearst. 2005. Using the web as an implicit training set: Application to structural ambiguity resolution. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP 2005), pages 835-842.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press.

Philip Resnik. 1999. Semantic similarity in a taxonomy. Journal of Artificial Intelligence Research, 11:95-130.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Masashi Shimbo and Kazuo Hara. 2007. A discriminative learning model for coordinate conjunctions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 610-619.

Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation

Arianna Bisazza and Marcello Federico
Fondazione Bruno Kessler
Trento, Italy
{bisazza,federico}@fbk.eu

Abstract

In this paper, we address statistical machine translation of public conference talks. Modeling the style of this genre can be very challenging given the shortage of available in-domain training data. We investigate the use of a hybrid LM, where infrequent words are mapped into classes. Hybrid LMs are used to complement word-based LMs with statistics about the language style of the talks. Extensive experiments comparing different settings of the hybrid LM are reported on publicly available benchmarks based on TED talks, from Arabic to English and from English to French. The proposed models are shown to better exploit in-domain data than conventional word-based LMs for the target language modeling component of a phrase-based statistical machine translation system.

Hybrid class-based LMs are trained on text where only infrequent words are mapped to Part-of-Speech (POS) classes. In this way, topic-specific words are discarded and the model focuses on generic words that we assume are more useful for characterizing the language style. The factorization of similar expressions made possible by this mixed text representation yields better n-gram coverage,
but with much higher discriminative power than POS-level LMs. A hybrid LM also differs from a POS-level LM in that it uses a word-to-class mapping to determine POS tags. Consequently, it requires neither the decoding overhead of factored models nor the tagging of all the parallel data used to build phrase tables. A hybrid LM trained on in-domain data can thus be easily added to an existing baseline system trained on large amounts of background data. The proposed models are used in addition to standard word-based LMs, in the framework of log-linear phrase-based SMT.

1 Introduction

The translation of TED conference talks[1] is an emerging task in the statistical machine translation (SMT) community (Federico et al., 2011). The variety of topics covered by the speeches, as well as their specific language style, make this a very challenging problem.

The remainder of this paper is organized as follows. After discussing the language style adaptation problem, we give an overview of relevant work. In the following sections we describe hybrid LMs and their possible variants in detail. Finally, we present an empirical analysis of the proposed technique, including an intrinsic evaluation and SMT experiments.
Fixed expressions, colloquial terms, figures of speech, and other phenomena recurrent in the talks should be properly modeled to produce translations that are not only fluent but also employ the right register. In this paper, we propose a language modeling technique that leverages in-domain training data for style adaptation.

2 Background

Our working scenario is the translation of TED talk transcripts as proposed by the IWSLT Evaluation Campaign.[2] This genre covers a variety of topics ranging from business to psychology. The available training material - both parallel and

[1] http://www.ted.com/talks
[2] http://www.iwslt2011.org

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 439-448, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

Table 1 (sentence-initial 5-grams; [s] marks the sentence start, numbers denote the frequency rank):

  TED                            NEWS
  1   [s] Thank you .            1     [s] ( AP ) -
  2   [s] Thank you very much    2     [s] WASHINGTON ( ...
  3   [s] I 'm going to          3     [s] NEW YORK ( AP
  4   [s] And I said ,           4     [s] ( CNN ) --
  5   [s] I don 't know          5     [s] NEW YORK ( R...
  6   [s] He said , "            6     [s] He said : "
  7   [s] I said , "             7     [s] " I don 't
  8   [s] And of course ,        8     [s] It was last updated
  9   [s] And one of the         9     [s] At the same time
  10  [s] And I want to          69    [s] I don 't know
  11  [s] And that 's what       612   [s] I 'm going to
  12  [s] We 're going to        2434  [s] " I said ,
  13  [s] And I think that       7034  [s] He said , "
  14  [s] And you can see        8199  [s] And I said ,
  15  [s] And this is a          8233  [s] Thank you very much
  16  [s] And this is the       (unattested) [s] Thank you .
  17  [s] And he said ,
  18  [s] So this is a
Table 1 (sentence-final 5-grams; [/s] marks the sentence end, numbers denote the frequency rank):

  TED                           NEWS
  1   Thank you . [/s]          1     " he said . [/s]
  2   you very much . [/s]      2     " she said . [/s]
  3   in the world . [/s]       3     , he said . [/s]
  4   and so on . [/s]          4     " he said . [/s]
  5   , you know . [/s]         5     in a statement . [/s]
  6   of the world . [/s]       6     the United States . [/s]
  7   around the world . [/s]   7     to this report . [/s]
  8   . Thank you . [/s]        8     " he added . [/s]
  9   the United States . [/s]  9     , police said . [/s]
  10  all the time . [/s]       10    , officials said . [/s]
  11  to do it . [/s]           13    in the world . [/s]
  12  and so forth . [/s]       17    around the world . [/s]
  13  don 't know . [/s]        46    of the world . [/s]
  14  to do that . [/s]         129   all the time . [/s]
  15  in the future . [/s]      157   and so on . [/s]
  16  the same time . [/s]      1652  , you know . [/s]
  17  , you know ? [/s]         5509  you very much . [/s]
  18  to do this . [/s]

Table 1: Common sentence-initial and sentence-final 5-grams, as ranked by frequency, in the TED and NEWS corpora. Numbers denote the frequency rank.

monolingual - consists of a rather small collection of TED talks plus a variety of large out-of-domain corpora, such as news stories and UN proceedings. Given the diversity of topics, the in-domain data alone cannot ensure sufficient coverage for an SMT system. The addition of background data can certainly improve n-gram coverage and thus the fluency of our translations, but it may also move our system towards an unsuitable language style, such as that of written news. In our study, we focus on the subproblem of target language modeling and consider two English text collections, namely the in-domain TED and the out-of-domain NEWS,[3] summarized in Table 2.

Note that the average context length actually used to score the test's reference translations is bounded by the LM order minus one, and is inversely proportional to the number of back-offs performed by the model; hence, we use this value to estimate how well an n-gram LM fits the test data. Indeed, despite the genre mismatch, the perplexity of a NEWS 5-gram LM on the TED-2010 test reference translations is 104, versus 112 for the in-domain LM, and the average history size is 2.5 versus 1.7 words.
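The average-history statistic h used above can be illustrated with a small sketch: for each test token we record the length of the longest training n-gram context that still matches, i.e., how much history the LM can use before backing off. The function names and the toy corpus are ours, not the paper's data.

```python
# Sketch of the average word history h: the mean, over test tokens,
# of the longest matched context length before the LM must back off.
# Note h is bounded by order - 1, as stated in the text.
def collect_ngrams(tokens, order=5):
    return {tuple(tokens[i:i + n])
            for n in range(2, order + 1)
            for i in range(len(tokens) - n + 1)}

def average_history(train_ngrams, test_tokens, order=5):
    total = 0
    for i in range(1, len(test_tokens)):
        h = 0
        for n in range(2, order + 1):
            if i - n + 1 < 0:
                break
            if tuple(test_tokens[i - n + 1:i + 1]) in train_ngrams:
                h = n - 1      # an n-gram match means n-1 words of history
        total += h
    return total / (len(test_tokens) - 1)
```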
Because of its larger size - two orders of magnitude - the NEWS corpus can provide better LM coverage than the TED on the test data. This is reflected both in perplexity and in the average length of the context (or history h) actually used by the two LMs to score the test's reference translations.

Table 2: Training data and coverage statistics of the two 5-gram LMs used for the TED task: number of sentences and tokens, vocabulary size, perplexity, and average word history.

  LM data   |S|     |W|    |V|    PP    h5g
  TED-En    124K    2.4M   51K    112   1.7
  NEWS-En   30.7M   782M   2.2M   104   2.5

Table 3: Excerpts from the TED and NEWS training vocabularies, as ranked by frequency. Numbers denote the frequency rank.

  TED             NEWS
  1    ,          1     the
  ...             ...
  9    I          40    I
  12   you        64    you
  90   actually   965   actually
  268  stuff      2479  guy
  370  guy        2861  stuff
  436  amazing    4706  amazing

Yet we observe that the style of public speeches is much better represented in the in-domain corpus than in the out-of-domain one. For instance, let us consider the vocabulary distributions[4] of the two corpora (Table 3). The very first forms, as ranked by frequency, are quite similar in the two corpora. However, there are important exceptions: the pronouns I and you are among the top 20 frequent forms in the TED, while in the NEWS they are ranked only 40th and 64th respectively. Other interesting cases are the words actually, stuff, guy, and amazing, all ranked about 10 times higher in the TED than in the NEWS corpus.

[3] http://www.statmt.org/wmt11/translation-task.html
[4] Hesitations and filler words, typical of spoken language, are not covered in our study because they are generally not reported in the TED talk transcripts.

The adoption of the log-linear modeling framework in many NLP tasks has recently introduced the use of multiple LM components (features), which permit naturally factoring out and integrating different aspects of language into one model. In SMT, the factored model (Koehn and Hoang, 2007), for instance, permits better tailoring of the LM to the task syntax by complementing word-based n-grams
with a part-of-speech (POS) LM that can be estimated even on a limited amount of task-specific data. Besides the many works addressing holistic LM domain adaptation for SMT, e.g., Foster and Kuhn (2007), methods were recently proposed to explicitly adapt the LM to the discourse topic of a talk (Ruiz and Federico, 2011). Our work makes another step in this direction by investigating hybrid LMs that try to explicitly represent the speaking style of the talk genre. Unlike standard class-based LMs (Brown et al., 1992) or the more recent local LMs (Monz, 2011), which are used to predict sequences of classes or word-class pairs, our hybrid LM is devised to predict sequences of classes interleaved with words. While we do not claim any technical novelty in the model itself, to our knowledge a deep investigation of hybrid LMs for the sake of style adaptation is new. Finally, the term hybrid LM was inspired by Yazgan and Saraçlar (2004), who used this name for an LM predicting sequences of words and sub-word units, devised to let a speech recognizer detect out-of-vocabulary words.

We can also analyze the most typical ways to start and end a sentence in the two text collections. As shown in Table 1, the frequency ranking of sentence-initial and sentence-final 5-grams in the in-domain corpus is notably different from that of the out-of-domain one. TED's most frequent sentence-initial 5-gram, "[s] Thank you . [/s]", is not at all attested in the NEWS corpus; the 4th most common sentence start, "[s] And I said ,", is only ranked 8199th in the NEWS, and so on. Notably, the top-ranked NEWS 5-grams include names of cities (Washington, New York) and news agencies (AP, Reuters). As regards sentence endings, we observe similar contrasts: for instance, the word sequence "and so on . [/s]" is ranked 4th in the TED and 157th in the NEWS, while ", you know . [/s]" is 5th in the TED and only 1652nd in the NEWS. These figures confirm that the talks have a specific language style, remarkably different from that of the written news genre.
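A Table 1-style ranking can be reproduced with a short sketch; the boundary marker [s] follows the paper's notation, while the function name and the toy sentences in the test are ours.

```python
# Sketch of ranking sentence-initial n-grams by corpus frequency,
# as done for Table 1 ([s] marks the sentence start).
from collections import Counter

def initial_ngram_ranks(sentences, n=5):
    counts = Counter(tuple(["[s]"] + s.split())[:n] for s in sentences)
    return {gram: rank + 1
            for rank, (gram, _) in enumerate(counts.most_common())}
```

Running the same count on two corpora and comparing the ranks of each gram gives exactly the kind of contrast discussed above.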
In summary, talks are characterized by a massive use of the first and second persons, by shorter sentences, and by more colloquial lexical and syntactic constructions.

3 Related Work

The brittleness of n-gram LMs in case of a mismatch between training and task data is a well-known issue (Rosenfeld, 2000). So-called domain adaptation methods (Bellegarda, 2004) can improve the situation once a limited amount of task-specific data becomes available. Ideally, domain-adaptive LMs aim to improve model robustness under changing conditions, involving possible variations in vocabulary, syntax, content, and style. Most of the known LM adaptation techniques (Bellegarda, 2004), however, address all these variations in a holistic way. A possible reason for this is that LM adaptation methods were originally developed under the automatic speech recognition framework, which typically assumes the presence of one single LM.

4 Hybrid Language Model

Hybrid LMs are n-gram models trained on a mixed text representation where each word is either mapped to a class or left as-is. This choice is made according to a measure of word commonness and is univocal for each word type. The rationale is to discard topic-specific words while preserving those words that best characterize the language style (note that word frequency is computed on the in-domain corpus only). Mapping non-frequent terms to classes naturally leads to a shorter tail in the frequency distribution, as visualized by Figure 1. A model trained on such data has better n-gram coverage of the test set and may take advantage of a larger context when scoring translation hypotheses.

As classes, we use deterministically assigned POS tags, obtained by first tagging the data with Tree Tagger (Schmid, 1994) and then choosing the most likely tag for each word type.
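The mapping step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tiny sentence, frequencies, and tag dictionary are invented for the example, and a real implementation would take both from the in-domain training data and the tagger output.

```python
# Sketch of the hybrid mapping: word types whose in-domain frequency
# falls below the threshold are replaced by their most likely POS tag;
# frequent words are left as-is.
def hybrid_map(tokens, freq, pos_of, threshold=500):
    return [w if freq.get(w, 0) >= threshold else pos_of.get(w, "UNK")
            for w in tokens]

# usage (all data below is invented for illustration)
sentence = "now you laugh but that quote has a sting".split()
freq = {"now": 900, "you": 5000, "laugh": 40, "but": 8000,
        "that": 7000, "quote": 12, "has": 3000, "a": 9000, "sting": 3}
pos_of = {"laugh": "VB", "quote": "NN", "sting": "NN"}
print(hybrid_map(sentence, freq, pos_of))
# -> ['now', 'you', 'VB', 'but', 'that', 'NN', 'has', 'a', 'NN']
```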
The progressive POS tags, obtained by first tagging the data with 441 !""""""# !"""""# 4.1 Word commonness criteria The most intuitive way to measure word common- !""""# )*+,-# ness is by absolute term frequency (F ). We will !"""# $'./01# use this criterion in most of our experiments. A finer solution would be to also consider the com- !""# monness of a word across different talks. At this end, we propose to use the fdf statistics, that is the !"# product of relative term f requency and document "# !"""# $"""# %"""# &"""# '"""# ("""# f requency5 : c(w) c(dw ) Figure 1: Type frequency distribution in the English f dfw = P 0) × TED corpus before and after POS-mapping of words w 0 c(w c(d) with less than 500 occurrences (25% of tokens). The where dw are the documents (talks) containing at rank in the frequency list (x-axis) is plotted against the respective frequency in logarithmic scale. Types with least one occurrence of the word w. less than 20 occurrences are omitted from the graph. If available, real talk boundaries can be used to define the documents. Alternatively, we can simply split the corpus into chunks of fixed size. Tree Tagger (Schmid, 1994) and then choosing In this work we use this approximation. the most likely tag for each word type. In this Another issue is how to set the threshold. In- way, we avoid the overload of searching for the dependently from the chosen commonness mea- best tagging decisions at run-time at the cost of sure, we can reason in terms of the ratio of tokens a slightly higher imprecision (see Section 5.1). that are mapped to POS classes (WP ). For in- The hybridly mapped data is used to train a high- stance, in our experiments with English, we can order n-gram LM that is plugged into an SMT de- set the threshold to F =500 and observe that WP coder as an additional feature on target word se- corresponds to 25% of the tokens (and 99% of the quences. During the translation process, words types). 
In the same corpus, a similar ratio is obtained with fdf = 0.012.

In our study, we consider three ratios WP = {.25, .50, .75} that correspond to different levels of language modeling: from a domain-generic word-level LM to a lexically anchored POS-level LM.

As exemplified in Table 4, hybrid LMs can draw useful statistics on the context of common words even from a small corpus such as the TED. To get an idea of the data sparseness, consider that in the unprocessed TED corpus the most frequent 5-gram containing the common word guy occurs only 3 times. After the mapping of words with frequency < 500, the highest 5-gram frequency grows to 17, the second to 9, and so on.

Table 4: Most common hybrid 5-grams containing the words guy (frequency 598) and actually (frequency 3978), along with their absolute frequencies.

  a guy VBN NP NP        17    [s] This is actually a    20
  guy VBN NP NP ,         9    [s] It 's actually a      17
  guy , NP NP ,           8    , you can actually VB     13
  a guy called NP NP      8    is actually a JJ NN       13
  this guy , NP NP        6    This is actually a NN     12
  guy VBN NP NP .         6    [s] And this is actually  12
  by a guy VBN NP         5    [s] And that 's actually  10
  a JJ guy . [/s]         5    , but it 's actually      10
  I was VBG this guy      4    NN , it 's actually        9
  guy VBN NP . [/s]       4    we 're actually going to   8

4.2 Handling morphology

Token frequency-based measures may not be suitable for languages other than English. When translating into French, for instance, we have to deal with a much richer morphology. As a solution we can use lemmas, univocally assigned to word types in the same manner as POS tags. Lemmas can be employed in two ways: only for word selection, as a frequency measure, or also for word representation, as a mapping for common words. In the former, we preserve inflected variants that may be useful to model the language style, but we also risk a decrease in n-gram coverage due to the presence of rare types. In the latter, only canonical forms and POS tags
[/s] 4 we’re actually going to 8 5 This differs from the tf-idf widely used in information retrieval, which is used to measure the relevance of a term in Table 4: Most common hybrid 5-grams containing the a document. Instead, we measure commonness of a term in words guy and actually, along with absolute frequency. the whole corpus. 442 appear in the processed text, thus introducing a Hybrid 10g LM |V | POS-Err h10g further level of abstraction from the original text. all words 51299 0.0% 1.7 all lemmas 38486 0.0% 1.9 Here follows a TED sentence in its original .25 POS/words 475 1.9% 2.7 version (first line) and after three different hy- .50 POS/words 93 4.1% 3.5 brid mappings – namely WP =.25, WP =.25 with .75 POS/words 50 5.7% 4.1 lemma forms, and WP =.50: allPOS 43 6.6% 4.4 .25 POS/lemmas 302 1.8% 2.8 Now you laugh, but that quote has kind of a sting to it, right. .25 POS/words(fdf) 301 1.9% 2.7 Now you VB , but that NN has kind of a NN to it, right. Now you VB , but that NN have kind of a NN to it, right. Table 5: Comparison of LMs obtained from different RB you VB , CC that NN VBZ NN of a NN to it, RB . hybrid mappings of the English TED corpus: vocabu- lary size, POS error rate, and average word history on IWSLT–tst2010’s reference translations. 5 Evaluation In this section we perform an intrinsic evaluation course, the more words are mapped, the less dis- of the proposed LM technique, then we measure criminative our model will be. Thus, choosing the its impact on translation quality when integrated best hybrid mapping means finding the best trade- into a state-of-the-art phrase-based SMT system. off between coverage and informativeness. We also applied hybrid LM to the French lan- 5.1 Intrinsic evaluation guage, again using Tree Tagger to create the POS We analyze here a set of hybrid LMs trained on mapping. 
5 Evaluation

In this section we perform an intrinsic evaluation of the proposed LM technique, then we measure its impact on translation quality when integrated into a state-of-the-art phrase-based SMT system.

5.1 Intrinsic evaluation

We analyze here a set of hybrid LMs trained on the English TED corpus by varying the ratio of POS-mapped words and the word representation technique (word vs. lemma). All models were trained with the IRSTLM toolkit (Federico et al., 2008), using a very high n-gram order (10) and Witten-Bell smoothing.

First, we estimate an upper bound of the POS tagging errors introduced by deterministic tagging. To this end, the hybridly mapped data is compared with the actual output of TreeTagger on the TED training corpus (see Table 5). Naturally, the impact of tagging errors correlates with the ratio of POS-mapped tokens, as no error is counted on non-mapped tokens. For instance, we note that the POS error rate is only 1.9% in our primary setting, WP=.25 with word representation, whereas on a fully POS-mapped text it is 6.6%. Note that the English tag set used by TreeTagger includes 43 classes.

    Hybrid 10g LM         |V|     POS-Err   h10g
    all words            51299      0.0%     1.7
    all lemmas           38486      0.0%     1.9
    .25 POS/words          475      1.9%     2.7
    .50 POS/words           93      4.1%     3.5
    .75 POS/words           50      5.7%     4.1
    allPOS                  43      6.6%     4.4
    .25 POS/lemmas         302      1.8%     2.8
    .25 POS/words(fdf)     301      1.9%     2.7

Table 5: Comparison of LMs obtained from different hybrid mappings of the English TED corpus: vocabulary size, POS error rate, and average word history on IWSLT-tst2010's reference translations.

Now we focus on the main goal of hybrid text representation, namely increasing the coverage of the in-domain LM on the test data. Here too, we measure coverage by the average length of the word history h used to score the test reference translations (see Section 2). We do not provide perplexity figures, since these are not directly comparable across models with different vocabularies. As shown by Table 5, n-gram coverage increases with the ratio of POS-mapped tokens, ranging from 1.7 on an all-words LM to 4.4 on an all-POS LM. Of course, the more words are mapped, the less discriminative our model will be. Thus, choosing the best hybrid mapping means finding the best trade-off between coverage and informativeness.

We also applied hybrid LMs to the French language, again using TreeTagger to create the POS mapping. The tag set in this case comprises 34 classes, and the POS error rate with WP=.25 is 1.2% (compare with 1.9% in English). As previously discussed, morphology has a notable effect on the modeling of French. In fact, the vocabulary reduction obtained by mapping all the words to their most probable lemma is -45% (57959 to 31908 types in the TED corpus), while in English it is only -25%.

5.2 SMT baseline

Our SMT experiments address the translation of TED talks from Arabic to English and from English to French. The training and test datasets were provided by the organizers of the IWSLT11 evaluation, and are summarized in Table 6. Marked in bold are the corpora used for hybrid LM training. Dev and test sets have a single reference translation.

    Task     Corpus     |S|      |W|     ℓ
    AR-EN    TED         90K    1.7M   18.9
             UN         7.9M    220M   27.8
    EN       TED        124K    2.4M   19.5
             NEWS      30.7M    782M   25.4
    AR test  dev2010     934     19K   20.0
             tst2010    1664     30K   18.1
    EN-FR    TED        105K    2.0M   19.5
             UN          11M    291M   26.5
             NEWS       111K    3.1M   27.6
    FR       TED        107K    2.2M   20.6
             NEWS      11.6M    291M   25.2
    EN test  dev2010     934     20K   21.5
             tst2010    1664     32K   19.1

Table 6: IWSLT11 training and test data statistics: number of sentences |S|, number of tokens |W|, and average sentence length ℓ. Token numbers are computed on the target language, except for the test sets.
For both language pairs, we set up competitive phrase-based systems[6] using the Moses toolkit (Koehn et al., 2007). The decoder features a statistical log-linear model including a phrase translation model and a phrase reordering model (Tillmann, 2004; Koehn et al., 2005), two word-based language models, distortion, and word and phrase penalties. The translation and reordering models are obtained by combining models independently trained on the available parallel corpora: namely TED and NEWS for Arabic-English; TED, NEWS and UN for English-French. To this end we applied the fill-up method (Nakov, 2008; Bisazza et al., 2011), in which out-of-domain phrase tables are merged with the in-domain table by adding only new phrase pairs. Out-of-domain phrases are marked with a binary feature whose weight is tuned together with the SMT system weights.

[6] The SMT systems used in this paper are thoroughly described in (Ruiz et al., 2011).

For each target language, two standard 5-gram LMs are trained separately on the monolingual TED and NEWS datasets, and log-linearly combined at decoding time. In the Arabic-English task, we use a hierarchical reordering model (Galley and Manning, 2008; Hardmeier et al., 2011), while in the English-French task we use a default word-based bidirectional model. The distortion limit is set to the default value of 6.
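The fill-up merging described above can be sketched schematically as follows. Phrase tables are modeled here as plain dicts and the provenance marker as a 0/1 value; the actual feature encoding in Moses phrase tables differs, so this is only an illustration of the merging logic.

```python
def fill_up(in_domain, out_of_domain):
    """Merge an out-of-domain phrase table into an in-domain one:
    in-domain entries are kept untouched, and only phrase pairs that
    are NOT already present are added. Every entry gets an extra
    binary provenance feature (1.0 = out-of-domain), whose weight
    is later tuned together with the other log-linear weights."""
    merged = {pair: feats + [0.0] for pair, feats in in_domain.items()}
    for pair, feats in out_of_domain.items():
        if pair not in merged:
            merged[pair] = feats + [1.0]
    return merged
```

Unlike linear interpolation, the in-domain scores are never diluted by the (much larger) out-of-domain table; the latter only contributes coverage.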
Note that the use of large n-gram LMs and of lexicalized reordering models was shown to wipe out the improvement achievable by POS-level LMs (Kirchhoff and Yang, 2005; Birch et al., 2007).

Concerning data preprocessing, we apply standard tokenization to the English and French text, while for Arabic we use an in-house tokenizer that removes diacritics and normalizes special characters and digits. Arabic text is then segmented with AMIRA (Diab et al., 2004) according to the ATB scheme[7]. The Arabic-English system uses cased translation models, while the English-French system uses lowercased models and a standard recasing post-process.

[7] The Arabic Treebank tokenization scheme isolates conjunctions w+ and f+, prepositions l+, k+, b+, the future marker s+, and pronominal suffixes, but not the article Al+.

Feature weights are tuned on dev2010 by means of a minimum error rate training procedure (MERT) (Och, 2003). Following suggestions by Clark et al. (2011) and Cettolo et al. (2011) on controlling optimizer instability, we run MERT four times on the same configuration and use the average of the resulting weights to evaluate translation performance.

5.3 Hybrid LM integration

As previously stated, hybrid LMs are trained only on in-domain data and are added to the log-linear decoder as an additional target LM. To this end, we use the class-based LM implementation provided in Moses and IRSTLM, which applies the word-to-class mapping to translation hypotheses before LM querying[8]. The order of the additional LM is set to 10 in the Arabic-English evaluation and to 7 in the English-French one, as these appeared to be the best settings in preliminary tests.

[8] Detailed instructions on how to build and use hybrid LMs can be found at http://hlt.fbk.eu/people/bisazza.

Translation quality is measured by BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and TER (Snover et al., 2006)[9]. To test whether differences among systems are statistically significant, we use approximate randomization as done in (Riezler and Maxwell, 2005)[10].

[9] We use case-sensitive BLEU and TER, but case-insensitive METEOR, to enable the use of the paraphrase tables distributed with the tool (version 1.3).

[10] Translation scores and significance tests were computed with the Multeval toolkit (Clark et al., 2011): https://github.com/jhclark/multeval.

Model variants. The effect on MT quality of various hybrid LM variants is shown in Table 7. Note that allPOS and allLemmas refer to deterministically assigned POS tags and lemmas, respectively. Concerning the ratio of POS-mapped tokens, the best performing values are WP=.25 in Arabic-English and WP=.50 in English-French. These hybrid mappings outperform all the uniform representations (words, lemmas and POS) with statistically significant BLEU and METEOR improvements.
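The approximate randomization test mentioned above can be sketched over paired per-sentence scores. This is a simplification: for corpus-level metrics such as BLEU, per-sentence sufficient statistics must be aggregated before scoring each shuffled corpus, rather than summing sentence scores as done here.

```python
import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired scores.
    Swapping each pair with probability 0.5 simulates the null
    hypothesis that the two systems are interchangeable; the returned
    value is a smoothed p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a              # swap the pair under the null
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A small p-value (e.g. below .01 or .05, the levels used in Table 8) indicates that the observed score difference is unlikely under the null hypothesis.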
The fdf experiment involves the use of document frequency for the selection of common words. Its performance is very close to that of hybrid LMs simply based on term frequency; only METEOR gains 0.1 points in Arabic-English. A possible reason for this is that document frequency was computed on fixed-size text chunks rather than on real document boundaries (see Section 4.1). The lemmaF experiment refers to the use of canonical forms for frequency measuring: this technique does not seem to help in either language pair. Finally, we compare the use of lemmas versus surface forms to represent common words. As expected, lemmas appear to be helpful for French language modeling. Interestingly, this is also the case for English, even if by a small margin (+0.2 METEOR, -0.1 TER).

    (a) Arabic to English, IWSLT-tst2010
    Added InDomain 10g LM          BLEU↑   MET↑   TER↓
    .00 POS/words (all words)†      26.1    30.5   55.4
    .00 POS/lemmas (all lem.)       26.0    30.5   55.4
    1.0 POS/words (all POS)†        25.9    30.6   55.3
    .25 POS/words†                  26.5    30.6   54.7
    .50 POS/words                   26.5    30.6   54.9
    .75 POS/words                   26.3    30.7   55.0
    .25 POS/words(fdf)              26.5    30.7   54.7
    .25 POS/lemmaF                  26.4    30.6   54.8
    .25 POS/lemmas                  26.5    30.8   54.6

    (b) English to French, IWSLT-tst2010
    Added InDomain 7g LM           BLEU↑   MET↑   TER↓
    .00 POS/words (all words)       31.1    52.5   49.9
    .00 POS/lemmas (all lem.)†      31.2    52.6   49.7
    1.0 POS/words (all POS)†        31.4    52.8   49.8
    .25 POS/lemmas†                 31.5    52.9   49.7
    .50 POS/lemmas                  31.9    53.3   49.5
    .75 POS/lemmas                  31.7    53.2   49.6
    .50 POS/lemmas(fdf)             31.9    53.3   49.5
    .50 POS/lemmaF                  31.6    53.0   49.6
    .50 POS/words                   31.7    53.1   49.5

Table 7: Comparison of various hybrid LM variants. Translation quality is measured with BLEU, METEOR and TER (all in percentage form). The settings used for weight tuning are marked with †. Best models according to all metrics are highlighted in bold.

Summing up, hybrid mapping appears as a winning strategy compared to uniform mapping. Although differences among LM variants are small, the best model in Arabic-English is .25-POS/lemmas, which can be thought of as a domain-generic lemma-level LM. In English-French, instead, the highest scores are achieved by .50-POS/lemmas or .50-POS/lemmas(fdf), that is, a POS-level LM with few frequently occurring lexical anchors (vocabulary size 59). An interpretation of this result is that, for French, modeling the syntax is more helpful than modeling the style. We also suspect that the French TED corpus is more irregular and diverse with respect to style than its English counterpart. In fact, while the English corpus includes transcripts of talks given by English speakers, the French one is mostly a collection of (human) translations. Typical features of the speech style may have been lost in this process.

Comparison with baseline. In Table 8 the best performing hybrid LM is compared against the baseline, which only includes the standard LMs described in Section 5.2. To complete our evaluation, we also report the effect of an in-domain LM trained on 50 word classes induced from the corpus by maximum-likelihood based clustering (Och, 1999).

In the two language pairs, both types of LM result in consistent improvements over the baseline. However, the gains achieved by the hybrid approach are larger and all statistically significant. The hybrid approach is significantly better than the unsupervised one by TER in Arabic-English and by BLEU and METEOR in English-French (these significances are not reported in the table for clarity). The proposed method appears to better leverage the available in-domain data, achieving improvements according to all metrics: +0.5/+0.4/-1.0 BLEU/METEOR/TER in Arabic-English and +0.7/+0.6/-0.3 in English-French, without requiring any bitext annotation or decoder modification.

    (a) Arabic to English, IWSLT-tst2010
    Added InDomain 10g LM   BLEU↑         MET↑          TER↓
    none (baseline)         26.0          30.4          55.6
    unsup. classes          26.4◦         30.8•         55.1◦
    hybrid                  26.5• (+.5)   30.8• (+.4)   54.6• (-1.0)

    (b) English to French, IWSLT-tst2010
    Added InDomain 7g LM    BLEU↑         MET↑          TER↓
    none (baseline)         31.2          52.7          49.8
    unsup. classes          31.5          52.9          49.6
    hybrid                  31.9• (+.7)   53.3• (+.6)   49.5◦ (-.3)

Table 8: Final MT results: baseline vs. unsupervised word-classes-based LM and best hybrid LM. Statistically significant improvements over the baseline are marked with • at the p < .01 and ◦ at the p < .05 level.

Talk-level analysis. To conclude the study, we analyze the effect of our best hybrid LM on Arabic-English translation quality at the single-talk level. The test set used in the experiments (tst2010) consists of 11 transcripts with an average length of 151±73 sentences. For each talk, we compare the baseline BLEU score with that obtained by adding a .25-POS/lemmas hybrid LM. Results are presented in Figure 2. The dark and light columns denote baseline and hybrid-LM BLEU scores, respectively, and refer to the left y-axis. Additional data points, plotted on the right y-axis in reverse order, represent talk-level perplexities (PP) of a standard 5-gram LM trained on TED (◦) and those of the .25-POS/lemmas 10-gram hybrid LM (△), computed on reference translations.

What emerges first is a dramatic variation of performance among the speeches, with baseline BLEU scores ranging from 33.95 on talk "00" to only 12.42 on talk "02". The latter talk appears as a corner case also according to perplexities (397 by the word LM and 111 by the hybrid LM). Notably, the perplexities of the two LMs correlate well with each other, but the hybrid LM's PP is much more stable across talks: its standard deviation is only 14 points, while that of the word-based PP is 79. The BLEU improvement given by the hybrid LM, however modest, is consistent across the talks, with only two outliers: a drop of -0.2 on talk "00" and a drop of -0.7 on talk "02". The largest gain (+1.1) is observed on talk "10", from 16.8 to 17.9 BLEU.
[Figure 2: Talk-level evaluation on Arabic-English (IWSLT-tst2010). Left y-axis: BLEU impact of a .25-POS/lemmas hybrid LM. Right y-axis: perplexities by word LM and by hybrid LM.]

6 Conclusions

We have proposed a language modeling technique that leverages the in-domain data for SMT style adaptation. Trained to predict mixed sequences of POS classes and frequent words, hybrid LMs are devised to capture typical lexical and syntactic constructions that characterize the style of speech transcripts.

Compared to standard language models, hybrid LMs generalize better to the test data and partially compensate for the disproportion between in-domain and out-of-domain training data. At the same time, hybrid LMs show more discriminative power than merely POS-level LMs. The integration of hybrid LMs into a competitive phrase-based SMT system is straightforward and leads to consistent improvements on the TED task, according to three different translation quality metrics.

Target language modeling is only one aspect of the statistical translation problem. Now that the usability of the proposed method has been assessed for language modeling, future work will address the extension of the idea to the modeling of phrase translation and reordering.

Acknowledgments

This work was supported by the T4ME network of excellence (IST-249119), funded by the DG INFSO of the European Commission through the 7th Framework Programme. We thank the anonymous reviewers for their valuable suggestions.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Jerome R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93–108.
Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus interpolation methods for phrase-based SMT adaptation. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. In MT Summit XIII: the Thirteenth Machine Translation Summit, pages 32–39, Xiamen, China.

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL 2011, Portland, Oregon, USA. Association for Computational Linguistics.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 149–152, Boston, Massachusetts, USA, May. Association for Computational Linguistics.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, pages 1618–1621, Melbourne, Australia.

Marcello Federico, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2011. Overview of the IWSLT 2011 evaluation campaign. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128–135, Prague, Czech Republic, June. Association for Computational Linguistics.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848–856, Morristown, NJ, USA. Association for Computational Linguistics.

Christian Hardmeier, Jörg Tiedemann, Markus Saers, Marcello Federico, and Mathur Prashant. 2011. The Uppsala-FBK systems at WMT 2011. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 372–378, Edinburgh, Scotland, July. Association for Computational Linguistics.

Katrin Kirchhoff and Mei Yang. 2005. Improved language modeling for statistical machine translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 125–128, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, October.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic.

Christof Monz. 2011. Statistical machine translation with local language models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 869–879, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Preslav Nakov. 2008. Improving English-Spanish statistical machine translation: Experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Franz Josef Och. 1999. An efficient method for determining bilingual word classes. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 71–76.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan, June. Association for Computational Linguistics.

R. Rosenfeld. 2000. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, 88(8):1270–1278.

Nick Ruiz and Marcello Federico. 2011. Topic adaptation for lecture translation through bilingual latent semantic models. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 294–302, Edinburgh, Scotland, July. Association for Computational Linguistics.

Nick Ruiz, Arianna Bisazza, Fabio Brugnara, Daniele Falavigna, Diego Giuliani, Suhel Jaber, Roberto Gretter, and Marcello Federico. 2011. FBK @ IWSLT 2011. In International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In 5th Conference of the Association for Machine Translation in the Americas (AMTA), Boston, Massachusetts, August.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).

A. Yazgan and M. Saraçlar. 2004. Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition. In Proceedings of ICASSP, volume 1, pages 745–748, May.


Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge

Ivan Vulić and Marie-Francine Moens
Department of Computer Science, KU Leuven
Celestijnenlaan 200A, Leuven, Belgium
{ivan.vulic,marie-francine.moens}@cs.kuleuven.be

Abstract
In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint. We report results for the Italian-English and Dutch-English language pairs that outperform the current state-of-the-art results by a significant margin. In addition, we show how to use the algorithm for the construction of high-quality initial seed lexicons of translations.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 449–459, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Bilingual lexicons serve as an invaluable resource of knowledge in various natural language processing tasks, such as dictionary-based cross-language information retrieval (Carbonell et al., 1997; Levow et al., 2005) and statistical machine translation (SMT) (Och and Ney, 2003). In order to construct high-quality bilingual lexicons for different domains, one usually needs to possess parallel corpora or build such lexicons by hand. Compiling such lexicons manually is often an expensive and time-consuming task, whereas the methods for mining lexicons from parallel corpora are not applicable for language pairs and domains where such corpora are unavailable or missing. Therefore the focus of researchers has turned to comparable corpora, which consist of documents with partially overlapping content, usually available in abundance. Thus, it is much easier to build a high-volume comparable corpus. A representative example of such a comparable text collection is Wikipedia, where one may observe articles discussing the same topic but strongly varying in style, length and vocabulary, while still sharing a certain amount of main concepts (or topics).

Over the years, several approaches for mining translations from non-parallel corpora have emerged (Rapp, 1995; Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000; Déjean et al., 2002; Chiao and Zweigenbaum, 2002; Gaussier et al., 2004; Fung and Cheung, 2004; Morin et al., 2007; Haghighi et al., 2008; Shezaf and Rappoport, 2010; Laroche and Langlais, 2010), all sharing the same Firthian assumption, often called the distributional hypothesis (Harris, 1954), which states that words with a similar meaning are likely to appear in similar contexts across languages. All these methods have examined different representations of word contexts and different methods for matching words across languages, but they all have in common a need for a seed lexicon of translations to efficiently bridge the gap between languages. That seed lexicon is usually crawled from the Web or obtained from parallel corpora.

Recently, Li et al. (2011) have proposed an approach that improves the precision of the existing methods for bilingual lexicon extraction, based on improving the comparability of the corpus under consideration prior to extracting actual bilingual lexicons. Other methods such as (Koehn and Knight, 2002) try to design a bootstrapping algorithm based on an initial seed lexicon of translations and various lexical evidences. However, the quality of their initial seed lexicon is disputable, since the construction of their lexicon is language-pair biased and cannot be completely employed on distant languages. It solely relies on unsatisfactory language-pair independent cross-language clues such as words shared across languages.

Recent work from Vulić et al. (2011) utilized the distributional hypothesis in a different direction. It attempts to abrogate the need for a seed lexicon as a prerequisite for bilingual lexicon extraction. They train a cross-language topic model on document-aligned comparable corpora and introduce different methods for identifying word translations across languages, underpinned by per-topic word distributions from the trained topic model. Due to the fact that they deal with comparable Wikipedia data, their translation model contains a lot of noise, and some words are poorly translated simply because there are not enough occurrences in the corpus. The goal of this work is to design an algorithm which will learn to harvest only the most probable translations from the per-word topic distributions. The translations learned by the algorithm might then serve as a highly accurate, precision-based initial seed lexicon, which can then be used as a tool for translating source word vectors into the target language. The key advantage of such a lexicon lies in the fact that there is no language-pair dependent prior knowledge involved in its construction (e.g., orthographic features). Hence, it is completely applicable to any language pair for which there exist sufficient comparable data for training of the topic model.

Since comparable corpora often constitute a very noisy environment, it is of the utmost importance for a precision-oriented algorithm to learn when to stop the process of matching words, and which candidate pairs are surely not translations of each other. The method described in this paper follows this intuition: while extracting a bilingual lexicon, we try to rematch words, keeping only the most confident candidate pairs and disregarding all the others. After that step, the most confident candidate pairs might be used with some of the existing context-based techniques to find translations for the words discarded in the previous step. The algorithm is based on: (1) the assumption of symmetry, and (2) the one-to-one constraint. The idea of symmetrization has been borrowed from the symmetrization heuristics introduced for word alignments in SMT (Och and Ney, 2003), where the intersection heuristic is employed for a precision-oriented algorithm. In our setting, it basically means that we keep a translation pair (w_i^S, w_j^T) if and only if, after the symmetrization process, the top translation candidate for the source word w_i^S is the target word w_j^T and vice versa. The one-to-one constraint aims at matching the most confident candidates during the early stages of the algorithm, and then excluding them from further search. The utility of the constraint for parallel corpora has already been evaluated by Melamed (2000).
The remainder of the paper is structured as follows. Section 2 gives a brief overview of the methods relying on per-topic word distributions, which serve as the tool for computing cross-language similarity between words. In Section 3, we motivate the main assumptions of the algorithm and describe the full algorithm. Section 4 justifies the underlying assumptions of the algorithm by providing comparisons with a current state-of-the-art system for the Italian-English and Dutch-English language pairs. It also contains another set of experiments which investigates the potential of the algorithm in building a language-pair unbiased seed lexicon, and compares the lexicon with other seed lexicons. Finally, Section 5 lists conclusions and possible paths of future work.

2 Calculating Initial Cross-Language Word Similarity

This section gives a quick overview of the Cue method, the TI method, and their combination, described by Vulić et al. (2011), which proved to be the most efficient and accurate for identifying potential word translations once the cross-language BiLDA topic model is trained and the associated per-topic distributions are obtained for both the source and target corpora. The BiLDA model we use is a natural extension of the standard LDA model and, along with the definition of per-topic word distributions, has been presented in (Ni et al., 2009; De Smet and Moens, 2009; Mimno et al., 2009). BiLDA takes advantage of the document alignment by using a single variable that contains the topic distribution θ. This variable is language-independent, because it is shared by each of the paired bilingual comparable documents. Topics for each document are sampled from θ, from which the words are then sampled in conjunction with the vocabulary distribution φ (for language S) and ψ (for language T).

[Figure 1: Plate model for bilingual Latent Dirichlet Allocation (BiLDA).]
promises between word frequency and semantic 2.1 Cue Method relatedness since higher frequency words tend to A straightforward approach to express similarity have higher probability across all topics, but the between words tries to emphasize the associative distribution over topics P (zk |w1S ) ensures that se- relation in a natural way - modeling the proba- mantically related topics dominate the sum. The bility P (w2T |w1S ), i.e. the probability that a tar- similar phenomenon is captured by the TI method get word w2T will be generated as a response to a by the usage of TF, which rewards high frequency cue source word w1S , where the link between the words, and ITF, which assigns a higher impor- 1 words is established via the shared topic space: tance for words semantically more related to a T S PK P (w2 |w1 ) = k=1 P (w2 |zk )P (zk |w1S ), where T specific topic. These properties are incorporated K denotes the number of cross-language topics. in the combination of the methods. As the final result, the combined method provides, for each 2.2 TI Method source word, a ranked list of target words with as- This approach constructs word vectors over a sociated scores that measure the strength of cross- shared space of cross-language topics, where val- language similarity. The higher the score, the ues within vectors are the TF-ITF scores (term more confident a translation pair is. We will use frequency - inverse topic frequency), computed this observation in the next section during the al- in a completely analogical manner as the TF- gorithm construction. IDF scores for the original word-document space The lexicon constructed by solely applying the (Manning and Sch¨utze, 1999). Term frequency, combination of these methods without any addi- given a source word wiS and a topic zk , measures tional assumptions will serve as a baseline in the the importance of the word wiS within the particu- results section. 
lar topic zk , while inverse topical frequency (ITF) 3 Constructing the Algorithm of the word wiS measures the general importance of the source word wiS across all topics. The fi- This section explains the underlying assumptions nal TF-ITF score for the source word wiS and the of the algorithm: the assumption of symmetry topic zk is given by T F − IT Fi,k = T Fi,k · IT Fi . and the one-to-one assumption. Finally, it pro- The TF-ITF scores for target words associated vides the complete outline of the algorithm. with target topics are calculated in an analogical manner and the standard cosine similarity is then 3.1 Assumption of Symmetry used to find the most similar target word vectors First, we start with the intuition that the assump- for a given source word vector. tion of symmetry strengthens the confidence of a translation pair. In other words, if the most prob- 2.3 Combining the Methods able translation candidate for a source word w1S is Topic models have the ability to build clusters of a target word w2T and, vice versa, the most prob- words which might not always co-occur together able translation candidate of the target word w2T 451 is the source word w1S , and their TI+Cue scores T , GM ) to the list F inal . ple (ws,i i s are above a certain threshold, we can claim that (b) If we have reached the end of the list the words w1S and w2T are a translation pair. The for the target candidate word ws,i T with- definition of the symmetric relation can also be out finding the given source word wsS , relaxed. Instead of observing only one top can- and i < N , continue with the next word didate from the lists, we can observe top N can- T ws,i+1 . Do not add any tuple to F inals didates from both sides and include them in the in this step. search space, and then re-rank the potential candi- 5. 
If the list F inals is not empty, sort the tuples dates taking into account their associated TI+Cue in the list in descending order according to scores and their respective positions in the list. their GMi scores. The first element of the We will call N the search space depth. Here is T sorted list contains a word ws,high , the final the outline of the re-ranking method if the search translation candidate of the source word wsS . space consists of the top N candidates on both If the list F inals is not empty, the final re- sides: sult of this process will be the cross-language 1. Given is a source word wsS , for which we ac- word translation pair (wsS , ws,high T ). tually want to find the most probable trans- We will call this symmetrization process the lation candidate. Initialize an empty list symmetrizing re-ranking. It attempts at push- F inals = {} in which target language ing the correct cross-language synonym to the top candidates with their recalculated associated of the candidates list, taking into account both scores will be stored. the strength of similarities defined through the 2. Obtain TI+Cue scores for all target words. TI+Cue scores in both directions, and positions Keep only N best scoring target candidates: in ranked lists. A blatant example depicting how T , . . . , w T } along with their respective {ws,1 s,N this process helps boost precision is presented in scores. Figure 2. We can also design a thresholded variant 3. For each target candidate from of this procedure by imposing an extra constraint. T T {ws,1 , . . . , ws,N } acquire TI+Cue scores When calculating target language candidates for over the entire source vocabulary. Keep only the source word wsS in Step 2, we proceed fur- N best scoring source language candidates. ther only if the first target candidate scores above Each word ws,i T ∈ {ws,1 T , . . . 
, w T } now s,N a certain threshold P and, additionally, in Step 3, has a list of N source language candidates we keep lists of N source language candidates associated with it: {wi,1 S , w S . . . , w S }. i,2 i,N for only those target words for which the first 4. For each target candidate word ws,i T ∈ source language candidate in their respective list T T {ws,1 , . . . , ws,N }, do as follows: scored above the same threshold P . We will call (a) If one of the words from the associated this procedure the thresholded symmetrizing re- list is the given source word wsS , re- ranking, and this version will be employed in the member: (1) the position m, denoting final algorithm. how high in the list the word wsS was found, and (2) the associated TI+Cue 3.2 One-to-one Assumption score SimT I+Cue (ws,i T , wS S i,m = ws ). Melamed (2000) has already established that most Calculate: source words in parallel corpora tend to translate (i) G1,i = SimT I+Cue (wsS , ws,i T )/i to only one target word. That tendency is modeled (ii) G2,i = SimT I+Cue (ws,i T , w S )/m by the one-to-one assumption, which constrains i,m Following that, calculate GMi , the ge- each source word to have at most one translation on the target side. Melamed’s paper reports that ometric mean of 1 p the values G1,i and G2,i : GMi = G1,i · G2,i . Add a tu- this bias leads to a significant positive impact on precision and recall of bilingual lexicon extraction 1 Scores G1,i and G2,i are structured in such a way to from parallel corpora. This assumption should balance between positions in the ranked lists and the TI+Cue scores, since they reward candidate words which have high also be reasonable for many types of comparable TI+Cue scores associated with them, and penalize words if corpora such as Wikipedia or news corpora, which they are found lower in the list of potential candidates. are topically aligned or cover similar themes. 
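The re-ranking outline above (Steps 1-5 of Subsection 3.1) can be condensed into a short sketch. This is only an illustration, not the authors' implementation: sim_st and sim_ts are hypothetical dict-of-dicts holding precomputed TI+Cue scores in the source-to-target and target-to-source directions.

```python
import math

def symmetrizing_rerank(ws, sim_st, sim_ts, n, threshold):
    """Thresholded symmetrizing re-ranking for one source word ws.

    sim_st[s][t] / sim_ts[t][s] are TI+Cue scores in the two directions,
    n is the search space depth N, threshold is P.  Returns the best
    (target word, GM score) pair, or None if no candidate is confident enough.
    """
    # Step 2: keep the n best-scoring target candidates for ws.
    targets = sorted(sim_st[ws], key=sim_st[ws].get, reverse=True)[:n]
    if not targets or sim_st[ws][targets[0]] < threshold:
        return None
    finals = []
    for i, wt in enumerate(targets, start=1):
        # Step 3: n best source candidates for each target candidate.
        sources = sorted(sim_ts[wt], key=sim_ts[wt].get, reverse=True)[:n]
        if not sources or sim_ts[wt][sources[0]] < threshold:
            continue
        # Step 4(a): reward high scores, penalize low list positions.
        if ws in sources:
            m = sources.index(ws) + 1
            g1 = sim_st[ws][wt] / i
            g2 = sim_ts[wt][ws] / m
            finals.append((wt, math.sqrt(g1 * g2)))  # geometric mean GM_i
    # Step 5: the candidate with the highest geometric mean wins.
    return max(finals, key=lambda pair: pair[1]) if finals else None
```

With toy scores of the shape shown in Figure 2, the bidirectional geometric mean promotes a mutually confirmed pair such as (abdij, abbey) over a merely one-directional association such as (abdij, monastery).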
We will prove that the assumption leads to better precision scores even for bilingual lexicon extraction from such comparable data. The intuition behind introducing this constraint is fairly simple. Without the assumption, the similarity scores between source and target words are calculated independently of each other. We will illustrate the problem arising from the independence assumption with an example.

Suppose we have an Italian word arcipelago, and we would like to detect its correct English translation (archipelago). However, after the TI+Cue method is employed, and even after the symmetrizing re-ranking process from the previous step is used, we still acquire a wrong translation candidate pair (arcipelago, island). Why is that so? The word arcipelago (or its translation) and the acquired translation (island) are semantically very close, and therefore have similar distributions over cross-language topics, but island is a much more frequent term. The TI+Cue method concludes that two words are potential translations whenever their distributions over cross-language topics are much more similar than expected by chance. Moreover, it gives a preference to more frequent candidates, so it will eventually end up learning an indirect association between the words arcipelago and island. (Footnote 2: A direct association, as defined in (Melamed, 2000), is an association between two words (in this setting found by the TI+Cue method) where the two words are indeed mutual translations. Otherwise, it is an indirect association.) The one-to-one assumption should mitigate the problem of such indirect associations if we design our algorithm in such a way that it learns the most confident direct associations first:

1. Learn the correct direct association pair (isola, island).
2. Remove the words isola and island from their respective vocabularies.
3. Since island is not in the vocabulary, the indirect association between arcipelago and island is not present any more. The algorithm learns the correct direct association (arcipelago, archipelago).

[Figure 2: ranked lists of TI+Cue translation candidates and their scores between the Dutch words klooster, abdij, monnik (and benedictijn) and the English words monastery, monk, abbey.]

Figure 2: An example where the assumption of symmetry and the one-to-one assumption clearly help boost precision. If we keep the top Nc = 3 candidates from both sides, the algorithm is able to detect that the correct Dutch-English translation pair is (abdij, abbey). The TI+Cue method without any assumptions would result in an indirect association (abdij, monastery). If only the one-to-one assumption was present, the algorithm would greedily learn the correct direct association (monastery, klooster), remove those words from their respective vocabularies and then again result in another indirect association (abdij, monk). By additionally employing the assumption of symmetry with the re-ranking method from Subsection 3.1, the algorithm correctly learns the translation pair (abdij, abbey). Correct translation pairs (klooster, monastery) and (monnik, monk) are also obtained. Again, the pair (monnik, monk) would not be obtained without the one-to-one assumption.

3.3 The Algorithm

3.3.1 One-Vocabulary-Pass

First, we will provide a version of the algorithm with a fixed threshold P which completes only one pass through the source vocabulary. Let V^S denote a given source vocabulary, and let V^T denote a given target vocabulary. We need to define several parameters of the algorithm. Let N0 be the initial maximum search space depth for the thresholded symmetrizing re-ranking procedure. In Figure 2, the current depth Nc is 3, while the maximum depth might be set to a value higher than 3. The algorithm with the fixed threshold P proceeds as follows:

1. Initialize the maximum search space depth NM = N0. Initialize an empty lexicon L.
2. For each source word w_s^S ∈ V^S do:
(a) Set the current search space depth Nc = 1. (Footnote 3: The intuition here is simple: we are trying to detect a direct association as high as possible in the list. In other words, if the first translation candidate for the source word isola is the target word island, and, vice versa, the first translation candidate for the target word island is isola, we do not need to expand our search depth, because these two words are the most likely translations.)
(b) Perform the thresholded symmetrizing re-ranking procedure with the current search space set to Nc and the threshold P. If a translation pair (w_s^S, w_{s,high}^T) is found, go to Sub-step 2(d).
(c) If a translation pair is not found, and Nc < NM, increment the current search space Nc = Nc + 1 and return to the previous Sub-step 2(b). If a translation pair is not found and Nc = NM, return to Step 2 and proceed with the next word.
(d) For the found translation pair (w_s^S, w_{s,high}^T), remove the words w_s^S and w_{s,high}^T from their respective vocabularies: V^S = V^S − {w_s^S} and V^T = V^T − {w_{s,high}^T} to satisfy the one-to-one constraint. Add the pair (w_s^S, w_{s,high}^T) to the lexicon L.

We will name this procedure the one-vocabulary-pass and employ it later in an iterative algorithm with a varying threshold and a varying maximum search space depth.

3.3.2 The Final Algorithm

Let us now define P0 as the initial threshold, let Pf be the threshold at which we stop decreasing the threshold value and start expanding our maximum search space depth for the thresholded symmetrizing re-ranking, and let dec_p be the value by which we decrease the current threshold in each step. Finally, let Nf be the limit for the maximum search space depth, and let NM denote the current maximum search space depth. The final algorithm is given by:

1. Initialize the maximum search space depth NM = N0 and the starting threshold P = P0. Initialize an empty lexicon L_final.
2. Check the stopping criterion: if NM > Nf, go to Step 5, otherwise continue with Step 3.
3. Perform the one-vocabulary-pass with the current values of P and NM. Whenever a translation pair is found, it is added to the lexicon L_final. Additionally, we can also save the threshold and the depth at which that pair was found.
4. Decrease P: P = P − dec_p, and check if P < Pf. If P ≥ Pf still holds, go to Step 3 and perform the one-vocabulary-pass again. Otherwise, if P < Pf and there are still unmatched words in the source vocabulary, reset P: P = P0, increment NM: NM = NM + 1 and go to Step 2.
5. Return L_final as the final output of the algorithm.

The parameters of the algorithm model its behavior. Typically, we would like to set P0 to a high value, and N0 to a low value, which makes our constraints strict and narrows our search space, and consequently extracts fewer translation pairs in the first steps of the algorithm, but the set of those translation pairs should be highly accurate. Once it is not possible to extract any more pairs with such strict constraints, the algorithm relaxes them by lowering the threshold and expanding the search space by incrementing the maximum search space depth. The algorithm may leave some of the source words unmatched, which is also dependent on the parameters of the algorithm, but, due to the one-to-one assumption, that scenario also occurs whenever a target vocabulary contains more words than a source vocabulary. The number of operations of the algorithm also depends on the parameters, but it mostly depends on the sizes of the given vocabularies. The complexity is O(|V^S||V^T|), but the algorithm is computationally feasible even for large vocabularies.

4 Results and Discussion

4.1 Training Collections

The data used for training of the models is collected from various sources and varies strongly in theme, style, length and comparableness. In order to reduce data sparsity, we keep only lemmatized non-proper noun forms.

For the Italian-English language pair, we use 18,898 Wikipedia article pairs to train BiLDA, covering different themes with different scopes and subtopics being addressed. Document alignment is established via interlingual links from the Wikipedia metadata. Our vocabularies consist of 7,160 Italian nouns and 9,116 English nouns.

For the Dutch-English language pair, we use 7,602 Wikipedia article pairs and 6,206 Europarl document pairs, and combine them for training. (Footnote 4: In the case of Europarl, we use only the evidence of document alignment during the training and do not benefit from the parallelness of the sentences in the corpus.) Our final vocabularies consist of 15,284 Dutch nouns and 12,715 English nouns.

Unlike, for instance, Wikipedia articles, where document alignment is established via interlingual links, in some cases it is necessary to perform document alignment as the initial step. Since our work focuses on Wikipedia data, we will not go into detail on algorithms for document alignment. An IR-based method for document alignment is given in (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), and a feature-based method can be found in (Vu et al., 2009).

4.2 Experimental Setup

All our experiments rely on BiLDA training with comparable data. Corpora and software for BiLDA training are obtained from Vulić et al. (2011). We train the BiLDA model with 2000 topics using Gibbs sampling, since that number of topics displays the best performance in their paper. The linear interpolation parameter for the combined TI+Cue method is set to λ = 0.1. The parameters of the algorithm, adjusted on a set of 500 randomly sampled Italian words, are set to the following values in all experiments, except where noted differently: P0 = 0.20, Pf = 0.00, dec_p = 0.01, N0 = 3, and Nf = 10.

The initial ground truth for our source vocabularies has been constructed with the freely available Google Translate tool. The final ground truth for our test sets has been established after we have manually revised the list of pairs obtained by Google Translate, deleting incorrect entries and adding additional correct entries. All translation candidates are evaluated against this benchmark lexicon.

4.3 Experiment I: Do Our Assumptions Help Lexicon Extraction?

With this set of experiments, we wanted to test whether both the assumption of symmetry and the one-to-one assumption are useful in improving the precision of the initial TI+Cue lexicon extraction method. We compare three different lexicon extraction algorithms: (1) the basic TI+Cue extraction algorithm (LALG-BASIC), which serves as the baseline algorithm (see Footnote 5); (2) the algorithm from Section 3, but without the one-to-one assumption (LALG-SYM), meaning that if we find a translation pair, we still keep the words from the translation pair in their respective vocabularies; and (3) the complete algorithm from Section 3 (LALG-ALL). In order to evaluate these lexicon extraction algorithms for both Italian-English and Dutch-English, we have constructed a test set of 650 Italian nouns and a test set of 1000 Dutch nouns of high and medium frequency. Precision scores for both language pairs and for all lexicon extraction algorithms are provided in Table 1.

Based on these results, it is clearly visible that both assumptions our algorithm makes are valid

Footnote 5: We have also tested whether LALG-BASIC outperforms a method modeling direct co-occurrence, which uses the cosine to detect similarity between word vectors consisting of TF-IDF scores in the shared document space (Cimiano et al., 2009).
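For concreteness, the iterative procedure of Subsection 3.3.2 (repeated one-vocabulary-passes with a falling threshold and a growing maximum search space depth) can be compressed into a loop. The sketch below is ours, not the authors' code: one_pass is a hypothetical stand-in for the one-vocabulary-pass of Subsection 3.3.1, and the keyword defaults mirror the parameter values P0 = 0.20, Pf = 0.00, dec_p = 0.01, N0 = 3, Nf = 10 used in the experiments.

```python
def extract_lexicon(v_s, v_t, one_pass,
                    p0=0.20, pf=0.00, dec_p=0.01, n0=3, nf=10):
    """Sketch of the final algorithm of Subsection 3.3.2.

    one_pass(v_s, v_t, p, n_m) stands in for the one-vocabulary-pass and
    returns (source, target) pairs found at threshold p and maximum
    search space depth n_m.  Matched words are removed from both
    vocabularies (sets) to enforce the one-to-one constraint.
    """
    lexicon = {}
    n_m = n0
    while n_m <= nf:                       # Step 2: stopping criterion
        p = p0
        while p >= pf - 1e-9:              # Steps 3-4: lower the threshold
            for ws, wt in list(one_pass(v_s, v_t, p, n_m)):
                lexicon[ws] = wt
                v_s.discard(ws)            # one-to-one constraint
                v_t.discard(wt)
            p -= dec_p
        if not v_s:                        # no unmatched source words left
            break
        n_m += 1                           # relax: widen the search space
    return lexicon
```

The nested loops make the precision-first design explicit: all passes at strict thresholds run before the search space depth is ever widened.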
Precision using that method is significantly lower, e.g. 0.5538 vs. 0.6708 of LALG-BASIC for Italian-English.

Algorithm     Italian-English   Dutch-English
LALG-BASIC    0.6708            0.6560
LALG-SYM      0.6862            0.6780
LALG-ALL      0.7215            0.7170

Table 1: Precision scores on our test sets for the 3 different lexicon extraction algorithms.

and contribute to better overall scores. Therefore, in all further experiments we will use the LALG-ALL extraction algorithm.

4.4 Experiment II: How Does Thresholding Affect Precision?

The next set of experiments aims at exploring how precision scores change while we gradually decrease threshold values. The main goal of these experiments is to detect when to stop with the extraction of translation candidates in order to preserve a lexicon of only highly accurate translations. We have fixed the maximum search space depth N0 = Nf = 3. We used the same test sets from Experiment I. Figure 3 displays the change of precision in relation to different threshold values, where we start harvesting translations from the threshold P0 = 0.2 down to Pf = 0.0. Since our goal is to extract as many correct translation pairs as possible, but without decreasing the precision scores, we have also examined what impact this gradual decrease of the threshold has on the number of extracted translations. We have opted for the Fβ measure (van Rijsbergen, 1979):

Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)    (2)

Since our task is precision-oriented, we have set β = 0.5. The F0.5 measure values precision twice as much as recall. The F0.5 scores are also provided in Figure 3.

[Figure 3: precision and F0.5 curves for IT-EN and NL-EN as the threshold decreases from 0.2 to 0.] Figure 3: Precision and F0.5 scores in relation to threshold values. We can observe that the algorithm retrieves only highly accurate translations for both language pairs while the threshold goes down from 0.2 to 0.1, while precision starts to drop significantly after the threshold of 0.1. F0.5 scores also reach their peaks within that threshold region.

4.5 Experiment III: Building a Seed Lexicon

Finally, we wanted to test how many accurate translation pairs our best scoring LALG-ALL algorithm is able to acquire from the entire source vocabulary, with very high precision still remaining paramount. The obtained highly-precise seed lexicon might then be employed for an additional bootstrapping procedure similar to (Koehn and Knight, 2002; Fung and Cheung, 2004) or simply for translating context vectors as in (Gaussier et al., 2004).

If we do not know anything about a given language pair, we can only use words shared across languages as lexical clues for the construction of a seed lexicon. This often leads to a low precision lexicon, since many false friends are detected. For Italian-English, we have found 431 nouns shared between the two languages, of which 350 were correct translations, leading to a precision of 0.8121. As an illustration, if we take the first 431 translation pairs retrieved by LALG-ALL, there are 427 correct translation pairs, leading to a precision of 0.9907. Some pairs do not share any orthographic similarities: (uccello, bird), (tastiera, keyboard), (salute, health), (terremoto, earthquake), etc.

Following Koehn and Knight (2002), we have also employed simple transformation rules for the adoption of words from one language to another. The rules specific to the Italian-English translation process that have been employed are: (R1) if an Italian noun ends in -ione, but not in -zione, strip the final e to obtain the corresponding English noun; otherwise, strip the suffix -zione and append -tion; (R2) if a noun ends in -ia, but not in -zia or -fia, replace the suffix -ia with -y; if a noun ends in -zia, replace the suffix with -cy, and if a noun ends in -fia, replace it with -phy. Similar rules have been introduced for Dutch-English: the suffix -tie is replaced by -tion, -sie by -sion, and -teit by -ty.

Finally, we have compared the results of the following constructed lexicons:

• A lexicon containing only words shared across languages (LEX-1).
• A lexicon containing shared words and translation pairs found by applying the language-specific transformation rules (LEX-2).
• A lexicon containing only translation pairs obtained by the LALG-ALL algorithm that score above a certain threshold P (LEX-LALG).
• A combination of the lexicons LEX-1 and LEX-LALG (LEX-1+LEX-LALG). Non-matching duplicates are resolved by taking the translation pair from LEX-LALG as the correct one. Note that this lexicon is completely language-pair independent.
• A lexicon combining only translation pairs found by applying the language-specific transformation rules and LEX-LALG (LEX-R+LEX-LALG).
• A combination of the lexicons LEX-2 and LEX-LALG, where non-matching duplicates are resolved by taking the translation pair from LEX-LALG if it is present in LEX-1, and from LEX-2 otherwise (LEX-2+LEX-LALG).

                  Italian-English                Dutch-English
Lexicon           # Correct  Precision  F0.5    # Correct  Precision  F0.5
LEX-1             350        0.8121     0.1876  898        0.8618     0.2308
LEX-2             766        0.8938     0.3473  1376       0.9011     0.3216
LEX-LALG          782        0.8958     0.3524  1106       0.9559     0.2778
LEX-1+LEX-LALG    1070       0.8785     0.4290  1860       0.9082     0.3961
LEX-R+LEX-LALG    1141       0.9239     0.4548  1507       0.9642     0.3500
LEX-2+LEX-LALG    1429       0.8926     0.5102  2261       0.9217     0.4505

Table 2: A comparison of different lexicons. For lexicons employing our LALG-ALL algorithm, only translation candidates that scored above the threshold P = 0.11 have been kept.

According to the results from Table 2, we can conclude that adding translation pairs extracted by our LALG-ALL algorithm has a major positive impact on both precision and coverage. Obtaining results for two different language pairs proves that the approach is generic and applicable to any other language pairs. The previous approach relying on the work from Koehn and Knight (2002) has been outperformed in terms of precision and coverage. Additionally, we have shown that adding simple translation rules for languages sharing the same roots might lead to even better scores (LEX-2+LEX-LALG). However, it is not always possible to rely on such knowledge, and the usefulness of the designed LALG-ALL algorithm really comes to the fore when the algorithm is applied to distant language pairs which do not share many words and cognates, and word translation rules cannot be easily established. In such cases, without any prior knowledge about the languages involved in the translation process, one is left with the linguistically unbiased LEX-1+LEX-LALG lexicon, which also displays a promising performance.

5 Conclusions and Future Work

We have designed an algorithm that focuses on acquiring and keeping only highly confident translation candidates from multilingual comparable corpora. By employing the algorithm we have improved the precision scores of the methods relying on per-topic word distributions from a cross-language topic model. We have shown that the algorithm is able to produce a highly reliable bilingual seed lexicon even when all other lexical clues are absent, thus making our algorithm suitable even for unrelated language pairs. In future work, we plan to further improve the algorithm and use it as a source of translational evidence for different alignment tasks in the setting of non-parallel corpora.

Acknowledgments

The research has been carried out in the framework of the TermWise Knowledge Platform (IOF-KP/09/001) funded by the Industrial Research Fund K.U. Leuven, Belgium.

References

Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee. 1997. Translingual information retrieval: A comparative evaluation. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 708–714.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–5.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit versus latent concept models for cross-language information retrieval. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1513–1518.

Wim De Smet and Marie-Francine Moens. 2009. Cross-language linking of news stories on the Web using interlingual topic modeling. In Proceedings of the CIKM 2009 Workshop on Social Web Search and Mining, pages 57–64.

Hervé Déjean, Éric Gaussier, and Fatia Sadat. 2002. An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7.

Mona T. Diab and Steve Finch. 2000. A statistical translation model using comparable corpora. In Proceedings of the 6th Triennial Conference on Recherche d'Information Assistée par Ordinateur (RIAO), pages 1500–1508.

Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 57–63.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 17th International Conference on Computational Linguistics, pages 414–420.

Éric Gaussier, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Hervé Déjean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 526–533.

Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211–244.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 771–779.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9–16.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 617–625.

Gina-Anne Levow, Douglas W. Oard, and Philip Resnik. 2005. Dictionary-based techniques for cross-language information retrieval. Information Processing and Management, 41:523–547.

Bo Li, Éric Gaussier, and Akiko Aizawa. 2011. Clustering comparable corpora for bilingual lexicon extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 473–478.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

I. Dan Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26:221–249.

David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880–889.

Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. 2007. Bilingual terminology mining - using brain, not brawn comparable corpora. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 664–671.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31:477–504.

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2009. Mining multilingual topics from Wikipedia. In Proceedings of the 18th International World Wide Web Conference, pages 1155–1156.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 320–322.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 519–526.

Daphna Shezaf and Ari Rappoport. 2010. Bilingual lexicon generation using non-aligned signatures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 98–107.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 72–79.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworth.

Thuy Vu, Ai Ti Aw, and Min Zhang. 2009. Feature-based method for document alignment in comparable news corpora. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 843–851.

Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2011. Identifying word translations from comparable corpora using latent topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 479–484.

Efficient Parsing with Linear Context-Free Rewriting Systems

Andreas van Cranenburgh
Huygens ING & ILLC, University of Amsterdam
Royal Netherlands Academy of Arts and Sciences
Postbus 90754, 2509 LT The Hague, the Netherlands

[email protected]

Abstract

Previous work on treebank parsing with discontinuous constituents using Linear Context-Free Rewriting Systems (LCFRS) has been limited to sentences of up to 30 words, for reasons of computational complexity. There have been some results on binarizing an LCFRS in a manner that minimizes parsing complexity, but the present work shows that parsing long sentences with such an optimally binarized grammar remains infeasible. Instead, we introduce a technique which removes this length restriction, while maintaining a respectable accuracy. The resulting parser has been applied to a discontinuous treebank with favorable results.

[Figure 1: A tree with WH-movement ("What should I do ?") from the Penn treebank, in which traces have been converted to discontinuity. Taken from Evang and Kallmeyer (2011).]

1 Introduction

Discontinuity in constituent structures (cf. figures 1 & 2) is important for a variety of reasons. For one, it allows a tight correspondence between syntax and semantics by letting constituent structure express argument structure (Skut et al., 1997). Other reasons are phenomena such as extraposition and word-order freedom, which arguably require discontinuous annotations to be treated systematically in phrase-structures (McCawley, 1982; Levy, 2005). Empirical investigations demonstrate that discontinuity is present in non-negligible amounts: around 30% of sentences contain discontinuity in two German treebanks (Maier and Søgaard, 2008; Maier and Lichte, 2009). Recent work on treebank parsing with discontinuous constituents (Kallmeyer and Maier, 2010; Maier, 2010; Evang and Kallmeyer, 2011; van Cranenburgh et al., 2011) shows that it is feasible to directly parse discontinuous constituency annotations, as given in the German Negra (Skut et al., 1997) and Tiger (Brants et al., 2002) corpora, or those that can be extracted from traces such as in the Penn treebank (Marcus et al., 1993) annotation. However, the computational complexity is such that until now, the length of sentences needed to be restricted. In the case of Kallmeyer and Maier (2010) and Evang and Kallmeyer (2011) the limit was 25 words. Maier (2010) and van Cranenburgh et al. (2011) manage to parse up to 30 words with heuristics and optimizations, but no further.

Algorithms have been suggested to binarize the grammars in such a way as to minimize parsing complexity, but the current paper shows that these techniques are not sufficient to parse longer sentences. Instead, this work presents a novel form of coarse-to-fine parsing which does alleviate this limitation.

The rest of this paper is structured as follows. First, we introduce linear context-free rewriting systems (LCFRS). Next, we discuss and evaluate binarization strategies for LCFRS. Third, we present a technique for approximating an LCFRS by a PCFG in a coarse-to-fine framework. Lastly, we evaluate this technique on a large corpus without the usual length restrictions.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 460-470, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

[Figure 2: A discontinuous tree from the Negra corpus, for the sentence "Danach habe Kohlenstaub Feuer gefangen ." (gloss: Afterwards had coal-dust fire caught .). Translation: After that coal dust had caught fire.]

    ROOT(ab) → S(a) $.(b)
    S(abcd) → VAFIN(b) NN(c) VP2(a, d)
    VP2(a, bc) → PROAV(a) NN(b) VVPP(c)
    PROAV(Danach) → ε
    VAFIN(habe) → ε
    NN(Kohlenstaub) → ε
    NN(Feuer) → ε
    VVPP(gefangen) → ε
    $.(.) → ε

Figure 3: The productions that can be read off from the tree in figure 2. Note that lexical productions rewrite to ε, because they do not rewrite to any non-terminals.

2 Linear Context-Free Rewriting Systems

Linear Context-Free Rewriting Systems (LCFRS; Vijay-Shanker et al., 1987; Weir, 1988) subsume a wide variety of mildly context-sensitive formalisms, such as Tree-Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), Minimalist Grammar, Multiple Context-Free Grammar (MCFG) and synchronous CFG (Vijay-Shanker and Weir, 1994; Kallmeyer, 2010). Furthermore, they can be used to parse dependency structures (Kuhlmann and Satta, 2009). Since LCFRS subsumes various synchronous grammars, they are also important for machine translation. This makes it possible to use LCFRS as a syntactic backbone with which various formalisms can be parsed by compiling grammars into an LCFRS, similar to the TuLiPa system (Kallmeyer et al., 2008). As all mildly context-sensitive formalisms, LCFRS are parsable in polynomial time, where the degree depends on the productions of the grammar. Intuitively, LCFRS can be seen as a generalization of context-free grammars to rewriting other objects than just continuous strings: productions are context-free, but instead of strings they can rewrite tuples, trees or graphs.

We focus on the use of LCFRS for parsing with discontinuous constituents. This follows up on recent work on parsing the discontinuous annotations in German corpora with LCFRS (Maier, 2010; van Cranenburgh et al., 2011) and work on parsing the Wall Street Journal corpus in which traces have been converted to discontinuous constituents (Evang and Kallmeyer, 2011). In the case of parsing with discontinuous constituents a non-terminal may cover a tuple of discontinuous strings instead of a single, contiguous sequence of terminals. The number of components in such a tuple is called the fan-out of a rule, which is equal to the number of gaps plus one; the fan-out of the grammar is the maximum fan-out of its productions. A context-free grammar is an LCFRS with a fan-out of 1. For convenience we will use the rule notation of simple RCG (Boullier, 1998), which is a syntactic variant of LCFRS with an arguably more transparent notation.
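The notions of components and fan-out can be made concrete in a few lines of code. The following sketch is our own illustration, not code from the parser; representing a node's yield simply as a set of terminal positions is an assumption made for exposition:

```python
# Our own illustration (not code from the parser) of components and
# fan-out; a node's yield is represented as a set of terminal positions.

def components(positions):
    """Group a set of terminal positions into maximal contiguous spans."""
    spans, start, prev = [], None, None
    for p in sorted(positions):
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            spans.append((start, prev))
            start = prev = p
    if start is not None:
        spans.append((start, prev))
    return spans

def fanout(positions):
    """Number of components: the number of gaps plus one."""
    return len(components(positions))

# In figure 2, VP covers "Danach" and "Kohlenstaub Feuer gefangen"
# (positions 0 and 2-4), with a gap for the finite verb at position 1:
print(components({0, 2, 3, 4}))  # [(0, 0), (2, 4)]
print(fanout({0, 2, 3, 4}))      # 2, hence the label VP2
print(fanout({0, 1, 2, 3, 4}))   # 1: the S node is continuous
```
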
An LCFRS is a tuple G = ⟨N, T, V, P, S⟩. N is a finite set of non-terminals; a function dim : N → ℕ specifies the unique fan-out for every non-terminal symbol. T and V are disjoint finite sets of terminals and variables. S is the distinguished start symbol with dim(S) = 1. P is a finite set of rewrite rules (productions) of the form:

    A(α_1, ..., α_dim(A)) → B_1(X^1_1, ..., X^1_dim(B_1)) ... B_m(X^m_1, ..., X^m_dim(B_m))

for m ≥ 0, where A, B_1, ..., B_m ∈ N, each X^i_j ∈ V for 1 ≤ i ≤ m, 1 ≤ j ≤ dim(B_i), and α_i ∈ (T ∪ V)* for 1 ≤ i ≤ dim(A).

Productions must be linear: if a variable occurs in a rule, it occurs exactly once on the left hand side (LHS), and exactly once on the right hand side (RHS). A rule is ordered if for any two variables X_1 and X_2 occurring in a non-terminal on the RHS, X_1 precedes X_2 on the LHS iff X_1 precedes X_2 on the RHS.

Every production has a fan-out determined by the fan-out of the non-terminal symbol on the left-hand side. Apart from the fan-out, productions also have a rank: the number of non-terminals on the right-hand side. These two variables determine the time complexity of parsing with a grammar. A production can be instantiated when its variables can be bound to non-overlapping spans such that for each component α_i of the LHS, the concatenation of its terminals and bound variables forms a contiguous span in the input, while the endpoints of each span are non-contiguous.

As in the case of a PCFG, we can read off LCFRS productions from a treebank (Maier and Søgaard, 2008), and the relative frequencies of productions form a maximum likelihood estimate, for a probabilistic LCFRS (PLCFRS), i.e., a (discontinuous) treebank grammar. As an example, figure 3 shows the productions extracted from the tree in figure 2.

3 Binarization

A probabilistic LCFRS can be parsed using a CKY-like tabular parsing algorithm (cf. Kallmeyer and Maier, 2010; van Cranenburgh et al., 2011), but this requires a binarized grammar. [1] Any LCFRS can be binarized. Crescenzi et al. (2011) state "while CFGs can always be reduced to rank two (Chomsky Normal Form), this is not the case for LCFRS with any fan-out greater than one." However, this assertion is made under the assumption of a fixed fan-out. If this assumption is relaxed then it is easy to binarize either deterministically or, as will be investigated in this work, optimally with a dynamic programming approach. Binarizing an LCFRS may increase its fan-out, which results in an increase in asymptotic complexity. Consider the following production:

    X(pqrs) → A(p, r) B(q) C(s)    (1)

[1] Other algorithms exist which support n-ary productions, but these are less suitable for statistical treebank parsing.

Henceforth, we assume that non-terminals on the right-hand side are ordered by the order of their first variable on the left-hand side. There are two ways to binarize this production. The first is from left to right:

    X(ps) → X_AB(p) C(s)    (2)
    X_AB(pqr) → A(p, r) B(q)    (3)

This binarization maintains the fan-out of 1. The second way is from right to left:

    X(pqrs) → A(p, r) X_BC(q, s)    (4)
    X_BC(q, s) → B(q) C(s)    (5)

This binarization introduces a production with a fan-out of 2, which could have been avoided. After binarization, an LCFRS can be parsed in O(|G| · |w|^p) time, where |G| is the size of the grammar and |w| is the length of the sentence. The degree p of the polynomial is the maximum parsing complexity of a rule, defined as:

    parsing complexity := ϕ + ϕ_1 + ϕ_2    (6)

where ϕ is the fan-out of the left-hand side and ϕ_1 and ϕ_2 are the fan-outs of the right-hand side of the rule in question (Gildea, 2010). As Gildea (2010) shows, there is no one to one correspondence between fan-out and parsing complexity: it is possible that parsing complexity can be reduced by increasing the fan-out of a production. In other words, there can be a production which can be binarized with a parsing complexity that is minimal while its fan-out is sub-optimal. Therefore we focus on parsing complexity rather than fan-out in this work, since parsing complexity determines the actual time complexity of parsing with a grammar. There has been some work investigating whether the increase in complexity can be minimized effectively (Gómez-Rodríguez et al., 2009; Gildea, 2010; Crescenzi et al., 2011).

More radically, it has been suggested that the power of LCFRS should be limited to well-nested structures, which gives an asymptotic improvement in parsing time (Gómez-Rodríguez et al., 2010). However, there is linguistic evidence that not all language use can be described in well-nested structures (Chen-Main and Joshi, 2010). Therefore we will use the full power of LCFRS in this work—parsing complexity is determined by the treebank, not by a priori constraints.
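To make the definition of parsing complexity concrete, the following sketch (ours, not the paper's implementation) evaluates eq. (6) for the two binarizations of production (1) given in eqs. (2)-(5); each binary rule is encoded simply by the fan-out of its LHS and the fan-outs of its two RHS non-terminals:

```python
# Our own illustration of eq. (6): the parsing complexity of a binary
# rule is the fan-out of the LHS plus the fan-outs of the two RHS
# non-terminals.

def complexity(lhs_fanout, rhs_fanouts):
    return lhs_fanout + sum(rhs_fanouts)

# Left-to-right binarization of production (1), eqs. (2)-(3):
left_to_right = [
    (1, (1, 1)),  # X(ps) -> X_AB(p) C(s)
    (1, (2, 1)),  # X_AB(pqr) -> A(p,r) B(q)
]
# Right-to-left binarization, eqs. (4)-(5):
right_to_left = [
    (1, (2, 2)),  # X(pqrs) -> A(p,r) X_BC(q,s)
    (2, (1, 1)),  # X_BC(q,s) -> B(q) C(s)
]

print(max(complexity(l, r) for l, r in left_to_right))  # 4
print(max(complexity(l, r) for l, r in right_to_left))  # 5
```

The worst-case rule of the right-to-left binarization thus has complexity 5 rather than 4, reflecting the avoidable fan-out-2 production.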
3.1 Further binarization strategies

Apart from optimizing for parsing complexity, for linguistic reasons it can also be useful to parse the head of a constituent first, yielding so-called head-driven binarizations (Collins, 1999). Additionally, such a head-driven binarization can be 'Markovized', i.e., the resulting production can be constrained to apply to a limited amount of horizontal context as opposed to the full context in the original constituent (e.g., Klein and Manning, 2003), which can have a beneficial effect on accuracy. In the notation of Klein and Manning (2003) there are two Markovization parameters: h and v.

The first parameter describes the amount of horizontal context for the artificial labels of a binarized production. In a normal form binarization, this parameter equals infinity, because the binarized production should only apply in the exact same context as the context in which it originally belongs, as otherwise the set of strings accepted by the grammar would be affected. An artificial label will have the form X_{A,B,C} for a binarized production of a constituent X that has covered children A, B, and C of X. The other extreme, h = 1, enables generalizations by stringing parts of binarized constituents together, as long as they share one non-terminal. In the previous example, the label would become just X_A, i.e., the presence of B and C would no longer be required, which enables switching to any binarized production that has covered A as the last node. Limiting the amount of horizontal context on which a production is conditioned is important when the treebank contains many unique constituents which can only be parsed by stringing together different binarized productions; in other words, it is a way of dealing with the data sparseness about n-ary productions in the treebank.

The second parameter describes parent annotation, which will not be investigated in this work; the default value is v = 1, which implies only including the immediate parent of the constituent that is being binarized; including grandparents is a way of weakening independence assumptions.

Crescenzi et al. (2011) also remark that an optimal head-driven binarization allows for Markovization. However, it is questionable whether such a binarization is worthy of the name Markovization, as the non-terminals are not introduced deterministically from left to right, but in an arbitrary fashion dictated by concerns of parsing complexity; as such there is not a Markov process based on a meaningful (e.g., temporal) ordering, and there is no probabilistic interpretation of Markovization in such a setting.

To summarize, we have at least four binarization strategies (cf. figure 4 for an illustration):

1. right branching: A right-to-left binarization. No regard for optimality or statistical tweaks.
2. optimal: A binarization which minimizes parsing complexity, introduced in Gildea (2010). Binarizing with this strategy is exponential in the resulting optimal fan-out (Gildea, 2010).
3. head-driven: Head-outward binarization with horizontal Markovization. No regard for optimality.
4. optimal head-driven: Head-outward binarization with horizontal Markovization. Minimizes parsing complexity. Introduced in and proven to be NP-hard by Crescenzi et al. (2011).

[Figure 4: The four binarization strategies applied to an example constituent: original (p = 4, ϕ = 2), right branching (p = 5, ϕ = 2), optimal (p = 4, ϕ = 2), head-driven (p = 5, ϕ = 2), and optimal head-driven (p = 4, ϕ = 2). C is the head node. Underneath each tree is the maximum parsing complexity and fan-out among its productions.]

3.2 Finding optimal binarizations

An issue with the minimal binarizations is that the algorithm for finding them has a high computational complexity, and has not been evaluated empirically on treebank data. [2] Empirical investigation is interesting for two reasons. First of all, the high computational complexity may not be relevant with constant factors of constituents, which can reasonably be expected to be relatively small. Second, it is important to establish whether an asymptotic improvement is actually obtained through optimal binarizations, and whether this translates to an improvement in practice.

[2] Gildea (2010) evaluates on a dependency bank, but does not report whether any improvement is obtained over a naive binarization.

Gildea (2010) presents a general algorithm to binarize an LCFRS while minimizing a given scoring function. We will use this algorithm with two different scoring functions.

The first directly optimizes parsing complexity. Given a (partially) binarized constituent c, the function returns a tuple of scores, for which a linear order is defined by comparing elements starting from the most significant (left-most) element. The tuples contain the parsing complexity p, and the fan-out ϕ to break ties in parsing complexity; if there are still ties after considering the fan-out, the sum s of the parsing complexities of the subtrees of c is considered, which will give preference to a binarization where the worst case complexity occurs once instead of twice. The formula is then:

    opt(c) = ⟨p, ϕ, s⟩

The second function is similar, except that only head-driven strategies are accepted. A head-driven strategy is a binarization in which the head is introduced first, after which the rest of the children are introduced one at a time:

    opt-hd(c) = ⟨p, ϕ, s⟩ if c is head-driven
                ⟨∞, ∞, ∞⟩ otherwise

Given a (partial) binarization c, the score should reflect the maximum complexity and fan-out in that binarization, to optimize for the worst case, as well as the sum, to optimize the average case. This aspect appears to be glossed over by Gildea (2010). Considering only the score of the last production in a binarization produces suboptimal binarizations.
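The linear order on score tuples corresponds directly to lexicographic tuple comparison. A minimal sketch (our own illustration; the values of p, ϕ and s are assumed to be given rather than computed from a binarization):

```python
# Our own illustration of the two scoring functions; p, phi and s are
# assumed to be given rather than computed from a binarization. Python
# tuples compare lexicographically from the left-most element, matching
# the linear order defined in the text.
INF = float('inf')

def opt(p, phi, s):
    return (p, phi, s)

def opt_hd(p, phi, s, head_driven):
    return (p, phi, s) if head_driven else (INF, INF, INF)

# Worst-case parsing complexity p is compared first:
print(min(opt(4, 2, 7), opt(5, 2, 6)))  # (4, 2, 7)
# A binarization that is not head-driven is never selected by opt-hd:
print(min(opt_hd(4, 2, 7, False), opt_hd(5, 2, 9, True)))  # (5, 2, 9)
```
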
The result is that punctuation opt(c) = hp, ϕ, si can be introduced into the phrase-structure with- out any additional discontinuity, and thus without The second function is the similar except that artificially inflating the fan-out and complexity of only head-driven strategies are accepted. A head- grammars read off from the treebank. This new driven strategy is a binarization in which the head heuristic provides a significant improvement: in- is introduced first, after which the rest of the chil- stead of a fan-out of 9 and a parsing complexity of dren are introduced one at a time. 19, we obtain values of 4 and 9 respectively. hp, ϕ, si if c is head-driven The parser is presented with the gold part-of- opt-hd(c) = h∞, ∞, ∞i otherwise speech tags from the corpus. For reasons of effi- ciency we restrict sentences to 25 words (includ- Given a (partial) binarization c, the score should ing punctuation) in this experiment: NEGRA -25. reflect the maximum complexity and fan-out in A grammar was read off from the training part that binarization, to optimize for the worst case, as of NEGRA -25, and sentences of up to 25 words well as the sum, to optimize the average case. This in the development set were parsed using the re- aspect appears to be glossed over by Gildea (2010). sulting PLCFRS, using the different binarization Considering only the score of the last production in schemes. First with a right-branching, right-to-left a binarization produces suboptimal binarizations. binarization, and second with the minimal bina- 3.3 Experiments rization according to parsing complexity and fan- As data we use version 2 of the Negra (Skut et al., 3 Available from http://www.wolfgang-maier.net/ 1997) treebank, with the common training, devel- rparse/downloads. 
Retrieved March 25th, 2011 464 right optimal branching optimal head-driven head-driven Markovization v=1, h=∞ v=1, h=∞ v=1, h=2 v=1, h=2 fan-out 4 4 4 4 complexity 8 8 9 8 labels 12861 12388 4576 3187 clauses 62072 62097 53050 52966 time to binarize 1.83 s 46.37 s 2.74 s 28.9 s time to parse 246.34 s 193.94 s 2860.26 s 716.58 s coverage 96.08 % 96.08 % 98.99 % 98.73 % F1 score 66.83 % 66.75 % 72.37 % 71.79 % Table 1: The effect of binarization strategies on parsing efficiency, with sentences from the development section of NEGRA -25. out. The last two binarizations are head-driven rizations is exponential (Gildea, 2010) and NP- and Markovized—the first straightforwardly from hard (Crescenzi et al., 2011), they can be computed left-to-right, the latter optimized for minimal pars- relatively quickly on this data set.5 Importantly, in ing complexity. With Markovization we are forced the first case there is no improvement on fan-out to add a level of parent annotation to tame the or parsing complexity, while in the head-driven increase in productivity caused by h = 1. case there is a minimal improvement because of a The distribution of parsing complexity (mea- single production with parsing complexity 15 with- sured with eq. 6) in the grammars with different out optimal binarization. On the other hand, the binarization strategies is shown in figure 5 and optimal binarizations might still have a significant 6. Although the optimal binarizations do seem effect on the average case complexity, rather than to have some effect on the distribution of parsing the worst-case complexities. Indeed, in both cases complexities, it remains to be seen whether this parsing with the optimal grammar is faster; in the can be cashed out as a performance improvement first case, however, when the time for binariza- in practice. To this end, we also parse using the tion is considered as well, this advantage mostly binarized grammars. disappears. 
In this work we binarize and parse with The difference in F1 scores might relate to the disco-dop introduced in van Cranenburgh et al. efficacy of Markovization in the binarizations. It (2011).4 In this experiment we report scores of the should be noted that it makes little theoretical (exact) Viterbi derivations of a treebank PLCFRS; sense to ‘Markovize’ a binarization when it is not cf. table 1 for the results. Times represent CPU a left-to-right or right-to-left binarization, because time (single core); accuracy is given with a gener- with an optimal binarization the non-terminals of alization of PARSEVAL to discontinuous structures, a constituent are introduced in an arbitrary order. described in Maier (2010). More importantly, in our experiments, these Instead of using Maier’s implementation of dis- techniques of optimal binarizations did not scale continuous F1 scores in rparse, we employ a vari- to longer sentences. While it is possible to obtain ant that ignores (a) punctuation, and (b) the root an optimal binarization of the unrestricted Negra node of each tree. This makes our evaluation in- corpus, parsing long sentences with the resulting comparable to previous results on discontinuous grammar remains infeasible. Therefore we need to parsing, but brings it in line with common practice look at other techniques for parsing longer sen- on the Wall street journal benchmark. Note that tences. We will stick with the straightforward this change yields scores about 2 or 3 percentage points lower than those of rparse. 5 The implementation exploits two important optimiza- Despite the fact that obtaining optimal bina- tions. The first is the use of bit vectors to keep track of which non-terminals are covered by a partial binarization. The sec- 4 All code is available from: http://github.com/ ond is to skip constituents without discontinuity, which are andreasvc/disco-dop. equivalent to CFG productions. 
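A minimal version of such a search can be sketched as dynamic programming over subsets of children encoded as bit vectors, in the spirit of Gildea's (2010) algorithm and the first optimization mentioned in footnote 5. This is our own simplified illustration: children are given as the sets of terminal positions they cover, and only the minimal worst-case parsing complexity is returned:

```python
# Our own simplified illustration of searching for an optimal
# binarization: dynamic programming over subsets of children encoded as
# bit vectors, in the spirit of Gildea's (2010) algorithm. Children are
# given as the sets of terminal positions they cover; the minimal
# worst-case parsing complexity (eq. 6) is returned.
from itertools import combinations

def fanout(positions):
    """Number of contiguous blocks covered by a set of positions."""
    pos = sorted(positions)
    return 1 + sum(1 for a, b in zip(pos, pos[1:]) if b > a + 1)

def nonempty_proper_subsets(s):
    """All nonempty proper subsets of the bit vector s."""
    t = (s - 1) & s
    while t:
        yield t
        t = (t - 1) & s

def optimal_binarization(children):
    n = len(children)
    cover = {1 << i: set(c) for i, c in enumerate(children)}
    best = {1 << i: 0 for i in range(n)}  # worst complexity so far
    for size in range(2, n + 1):
        for combo in combinations(range(n), size):
            s = sum(1 << i for i in combo)
            cover[s] = set().union(*(cover[1 << i] for i in combo))
            best[s] = min(
                max(best[t], best[s ^ t],
                    fanout(cover[s]) + fanout(cover[t])
                    + fanout(cover[s ^ t]))
                for t in nonempty_proper_subsets(s))
    return best[(1 << n) - 1]

# Production (1): A covers {0, 2}, B covers {1}, C covers {3}.
# The best binarization (combine A and B first) has complexity 4,
# matching eqs. (2)-(3); combining B and C first would yield 5.
print(optimal_binarization([{0, 2}, {1}, {3}]))  # 4
```
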
The difference in F1 scores might relate to the efficacy of Markovization in the binarizations. It should be noted that it makes little theoretical sense to 'Markovize' a binarization when it is not a left-to-right or right-to-left binarization, because with an optimal binarization the non-terminals of a constituent are introduced in an arbitrary order.

More importantly, in our experiments, these techniques of optimal binarizations did not scale to longer sentences. While it is possible to obtain an optimal binarization of the unrestricted Negra corpus, parsing long sentences with the resulting grammar remains infeasible. Therefore we need to look at other techniques for parsing longer sentences. We will stick with the straightforward head-driven, head-outward binarization strategy, despite this being a computationally sub-optimal binarization.

One technique for efficient parsing of LCFRS is the use of context-summary estimates (Kallmeyer and Maier, 2010), as part of a best-first parsing algorithm. This allowed Maier (2010) to parse sentences of up to 30 words. However, the calculation of these estimates is not feasible for longer sentences and large grammars (van Cranenburgh et al., 2011).

Another strategy is to perform an online approximation of the sentence to be parsed, after which parsing with the LCFRS can be pruned effectively. This is the strategy that will be explored in the current work.

4 Context-free grammar approximation for coarse-to-fine parsing

Coarse-to-fine parsing (Charniak et al., 2006) is a technique to speed up parsing by exploiting the information that can be gained from parsing with simpler, coarser grammars—e.g., a grammar with a smaller set of labels on which the original grammar can be projected. Constituents that do not contribute to a full parse tree with a coarse grammar can be ruled out for finer grammars as well, which greatly reduces the number of edges that need to be explored. However, by changing just the labels only the grammar constant is affected. With discontinuous treebank parsing the asymptotic complexity of the grammar also plays a major role. Therefore we suggest to parse not just with a coarser grammar, but with a coarser grammar formalism, following a suggestion in van Cranenburgh et al. (2011).

This idea is inspired by the work of Barthélemy et al. (2001), who apply it in a non-probabilistic setting where the coarse grammar acts as a guide to the non-deterministic choices of the fine grammar. Within the coarse-to-fine approach the technique becomes a matter of pruning with some probabilistic threshold. Instead of using the coarse grammar only as a guide to solve non-deterministic choices, we apply it as a pruning step which also discards the most suboptimal parses. The basic idea is to extract a grammar that defines a superset of the language we want to parse, but with a fan-out of 1. More concretely, a context-free grammar can be read off from discontinuous trees that have been transformed to context-free trees by the procedure introduced in Boyd (2007). Each discontinuous node is split into a set of new nodes, one for each component; for example a node NP2 will be split into two nodes labeled NP*1 and NP*2 (like Barthélemy et al., we mark components with an index to reduce overgeneration). Because Boyd's transformation is reversible, chart items from this grammar can be converted back to discontinuous chart items, and can guide parsing of an LCFRS.

This guiding takes the form of a white list. After parsing with the coarse grammar, the resulting chart is pruned by removing all items that fail to meet a certain criterion. In our case this is whether a chart item is part of one of the k-best derivations—we use k = 50 in all experiments (as in van Cranenburgh et al., 2011). This has similar effects as removing items below a threshold of marginalized posterior probability; however, the latter strategy requires computation of outside probabilities from a parse forest, which is more involved with an LCFRS than with a PCFG. When parsing with the fine grammar, whenever a new item is derived, the white list is consulted to see whether this item is allowed to be used in further derivations; otherwise it is immediately discarded. This coarse-to-fine approach will be referred to as CFG-CTF, and the transformed, coarse grammar will be referred to as a split-PCFG.

Splitting discontinuous nodes for the coarse grammar introduces new nodes, so obviously we need to binarize after this transformation. On the other hand, the coarse-to-fine approach requires a mapping between the grammars, so after reversing the transformation of splitting nodes, the resulting discontinuous trees must be binarized (and optionally Markovized) in the same manner as those on which the fine grammar is based.

To resolve this tension we elect to binarize twice. The first time is before splitting discontinuous nodes, and this is where we introduce Markovization. This same binarization will be used for the fine grammar as well, which ensures the models make the same kind of generalizations. The second binarization is after splitting nodes, this time with a binary normal form (2NF; all productions are either unary, binary, or lexical).
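The splitting of discontinuous nodes described above can be sketched as follows (our own illustration, not Boyd's exact algorithm; encoding a node as a label plus the set of terminal positions it covers is an assumption):

```python
# Our own illustration (not Boyd's exact algorithm) of resolving
# discontinuity: a discontinuous node is split into one new node per
# component, marked with an index, as in NP2 -> NP*1, NP*2.

def split_node(label, positions):
    """Split (label, positions) into one (label, span) per component."""
    spans, start, prev = [], None, None
    for p in sorted(positions):
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            spans.append((start, prev))
            start = prev = p
    spans.append((start, prev))
    if len(spans) == 1:  # continuous nodes are left as-is
        return [(label, spans[0])]
    return [('%s*%d' % (label, i + 1), span)
            for i, span in enumerate(spans)]

print(split_node('NP', {3, 4, 7, 8}))  # [('NP*1', (3, 4)), ('NP*2', (7, 8))]
print(split_node('PP', {0, 1, 2}))     # [('PP', (0, 2))]
```
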
Treebank tree: Original (discontinuous) tree 20 2. Binarization: Binarize discontinuous tree, op- 15 tionally with Markovization 10 3. Resolve discontinuity: Split discontinuous 5 nodes into components, marked with indices 0 4. 2NF: A binary normal form is applied; all pro- 0 5 10 15 20 25 ductions are either unary, binary, or lexical. Sentence length 5 Evaluation Figure 8: Efficiency of parsing PLCFRS with and with- We evaluate on Negra with the same setup as in out coarse-to-fine. The latter includes time for both section 3.3. We report discontinuous F1 scores as coarse & fine grammar. Datapoints represent the aver- well as exact match scores. For previous results on age time to parse sentences of that length; each length discontinuous parsing with Negra, see table 3. For is made up of 20–40 sentences. results with the CFG - CTF method see table 4. We first establish the viability of the CFG - CTF method on NEGRA -25, with a head-driven v = 1, sentences of length > 22 despite its overhead of h = 2 binarization, and reporting again the scores parsing twice. of the exact Viterbi derivations from a treebank The second experiment demonstrates the CFG - PLCFRS versus a PCFG using our transformations. CTF technique on longer sentences. We restrict the Figure 8 compares the parsing times of LCFRS length of sentences in the training, development with and without the new CFG - CTF method. The and test corpora to 40 words: NEGRA -40. As a first graph shows a steep incline for parsing with LCFRS step we apply the CFG - CTF technique to parse with directly, which makes it infeasible to parse longer a PLCFRS as the fine grammar, pruning away all sentences, while the CFG - CTF method is faster for items not occurring in the 10,000 best derivations 467 words PARSEVAL Exact (F1 ) match DPSG :Plaehn (2004) ≤ 15 73.16 39.0 PLCFRS :Maier (2010) ≤ 30 71.52 31.65 Disco-DOP: van Cranenburgh et al. (2011) ≤ 30 73.98 34.80 Table 3: Previous work on discontinuous parsing of Negra. 
words PARSEVAL Exact (F1 ) match PLCFRS , dev set ≤ 25 72.37 36.58 Split-PCFG, dev set ≤ 25 70.74 33.80 Split-PCFG, dev set ≤ 40 66.81 27.59 CFG - CTF , PLCFRS , dev set ≤ 40 67.26 27.90 CFG - CTF , Disco- DOP , dev set ≤ 40 74.27 34.26 CFG - CTF , Disco- DOP , test set ≤ 40 72.33 33.16 CFG - CTF , Disco- DOP , dev set ∞ 73.32 33.40 CFG - CTF , Disco- DOP , test set ∞ 71.08 32.10 Table 4: Results on NEGRA -25 and NEGRA -40 with the CFG - CTF method. NB: As explained in section 3.3, these F1 scores are incomparable to the results in table 3; for comparison, the F1 score for Disco-DOP on the dev set ≤ 40 is 77.13 % using that evaluation scheme. from the PCFG chart. The result shows that the same model from NEGRA -40 can also be used to PLCFRS gives a slight improvement over the split-- parse the full development set, without length re- pcfg, which accords with the observation that the strictions, establishing that the CFG - CTF method latter makes stronger independence assumptions effectively eliminates any limitation of length for in the case of discontinuity. parsing with LCFRS. In the next experiments we turn to an all- 6 Conclusion fragments grammar encoded in a PLCFRS using Goodman’s (2003) reduction, to realize a (dis- Our results show that optimal binarizations are continuous) Data-Oriented Parsing (DOP; Scha, clearly not the answer to parsing LCFRS efficiently, 1990) model—which goes by the name of Disco- as they do not significantly reduce parsing com- DOP (van Cranenburgh et al., 2011). This provides plexity in our experiments. While they provide an effective yet conceptually simple method to some efficiency gains, they do not help with the weaken the independence assumptions of treebank main problem of longer sentences. grammars. Table 2 gives statistics on the gram- mars, including the parsing complexities. 
The fine We have presented a new technique for large- grammar has a parsing complexity of 9, which scale parsing with LCFRS, which makes it possible means that parsing with this grammar has com- to parse sentences of any length, with favorable plexity O(|w|9 ). We use the same parameters as accuracies. The availability of this technique may van Cranenburgh et al. (2011), except that unlike lead to a wider acceptance of LCFRS as a syntactic van Cranenburgh et al., we can use v = 1, h = 1 backbone in computational linguistics. Markovization, in order to obtain a higher cover- age. The DOP grammar is added as a third stage in Acknowledgments the coarse-to-fine pipeline. This gave slightly bet- ter results than substituting the the DOP grammar I am grateful to Willem Zuidema, Remko Scha, for the PLCFRS stage. Parsing with NEGRA -40 Rens Bod, and three anonymous reviewers for took about 11 hours and 4 GB of memory. The comments. 468 References Proceedings of NAACL HLT 2010., pages 769– 776. Franc¸ois Barth´elemy, Pierre Boullier, Philippe De- ´ de la Clergerie. 2001. Guided schamp, and Eric Carlos G´omez-Rodr´ıguez, Marco Kuhlmann, and parsing of range concatenation languages. In Giorgio Satta. 2010. Efficient parsing of well- Proc. of ACL, pages 42–49. nested linear context-free rewriting systems. In Proceedings of NAACL HLT 2010., pages 276– Pierre Boullier. 1998. Proposal for a natural lan- 284. guage processing syntactic backbone. Techni- cal Report RR-3342, INRIA-Rocquencourt, Le Carlos G´omez-Rodr´ıguez, Marco Kuhlmann, Gior- Chesnay, France. URL http://www.inria. gio Satta, and David Weir. 2009. Optimal reduc- fr/RRRT/RR-3342.html. tion of rule length in linear context-free rewrit- ing systems. In Proceedings of NAACL HLT Adriane Boyd. 2007. Discontinuity revisited: An 2009, pages 539–547. improved conversion to context-free representa- Joshua Goodman. 2003. Efficient parsing of tions. In Proceedings of the Linguistic Annota- DOP with PCFG-reductions. 
In Rens Bod, tion Workshop, pages 41–44. Remko Scha, and Khalil Sima’an, editors, Data- Sabine Brants, Stefanie Dipper, Silvia Hansen, Oriented Parsing. The University of Chicago Wolfgang Lezius, and George Smith. 2002. The Press. Tiger treebank. In Proceedings of the workshop Laura Kallmeyer. 2010. Parsing Beyond Context- on treebanks and linguistic theories, pages 24– Free Grammars. Cognitive Technologies. 41. Springer Berlin Heidelberg. Eugene Charniak, Mark Johnson, M. Elsner, Laura Kallmeyer, Timm Lichte, Wolfgang Maier, J. Austerweil, D. Ellis, I. Haxton, C. Hill, Yannick Parmentier, Johannes Dellert, and Kil- R. Shrivaths, J. Moore, M. Pozar, et al. 2006. ian Evang. 2008. Tulipa: Towards a multi- Multilevel coarse-to-fine PCFG parsing. In Pro- formalism parsing environment for grammar ceedings of NAACL-HLT, pages 168–175. engineering. In Proceedings of the Workshop Joan Chen-Main and Aravind K. Joshi. 2010. Un- on Grammar Engineering Across Frameworks, avoidable ill-nestedness in natural language and pages 1–8. the adequacy of tree local-mctag induced depen- Laura Kallmeyer and Wolfgang Maier. 2010. Data- dency structures. In Proceedings of TAG+. URL driven parsing with probabilistic linear context- http://www.research.att.com/∼srini/ free rewriting systems. In Proceedings of the TAG+10/papers/chenmainjoshi.pdf. 23rd International Conference on Computa- Michael Collins. 1999. Head-driven statistical tional Linguistics, pages 537–545. models for natural language parsing. Ph.D. the- Dan Klein and Christopher D. Manning. 2003. Ac- sis, University of Pennsylvania. curate unlexicalized parsing. In Proc. of ACL, Pierluigi Crescenzi, Daniel Gildea, Aandrea volume 1, pages 423–430. Marino, Gianluca Rossi, and Giorgio Satta. Marco Kuhlmann and Giorgio Satta. 2009. Tree- 2011. Optimal head-driven parsing complex- bank grammar techniques for non-projective de- ity for linear context-free rewriting systems. In pendency parsing. In Proceedings of EACL, Proc. of ACL. 
pages 478–486. Amit Dubey and Frank Keller. 2003. Parsing ger- Roger Levy. 2005. Probabilistic models of word man with sister-head dependencies. In Proc. of order and syntactic discontinuity. Ph.D. thesis, ACL, pages 96–103. Stanford University. Kilian Evang and Laura Kallmeyer. 2011. Wolfgang Maier. 2010. Direct parsing of discon- PLCFRS parsing of English discontinuous con- tinuous constituents in German. In Proceedings stituents. In Proceedings of IWPT, pages 104– of the SPMRL workshop at NAACL HLT 2010, 116. pages 58–66. Daniel Gildea. 2010. Optimal parsing strategies Wolfgang Maier and Timm Lichte. 2009. Charac- for linear context-free rewriting systems. In terizing discontinuity in constituent treebanks. 469 In Proceedings of Formal Grammar 2009, pages Ph.D. thesis, University of Pennsylvania. 167–182. Springer. URL http://repository.upenn.edu/ Wolfgang Maier and Anders Søgaard. 2008. Tree- dissertations/AAI8908403/. banks and mild context-sensitivity. In Proceed- ings of Formal Grammar 2008, page 61. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large an- notated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330. James D. McCawley. 1982. Parentheticals and discontinuous constituent structure. Linguistic Inquiry, 13(1):91–106. Oliver Plaehn. 2004. Computing the most prob- able parse for a discontinuous phrase structure grammar. In Harry Bunt, John Carroll, and Gior- gio Satta, editors, New developments in parsing technology, pages 91–106. Kluwer Academic Publishers, Norwell, MA, USA. Remko Scha. 1990. Language theory and language technology; competence and performance. In Q.A.M. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere, the Netherlands. Original title: Taaltheorie en taaltechnologie; competence en performance. Translation avail- able at http://iaaa.nl/rs/LeerdamE.html. Stuart M. Shieber. 1985. 
Evidence against the context-freeness of natural language. Linguis- tics and Philosophy, 8:333–343. Wojciech Skut, Brigitte Krenn, Thorten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Pro- ceedings of ANLP, pages 88–95. Andreas van Cranenburgh, Remko Scha, and Federico Sangati. 2011. Discontinuous data- oriented parsing: A mildly context-sensitive all- fragments grammar. In Proceedings of SPMRL, pages 34–44. K. Vijay-Shanker and David J. Weir. 1994. The equivalence of four extensions of context-free grammars. Theory of Computing Systems, 27(6):511–546. K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descrip- tions produced by various grammatical for- malisms. In Proc. of ACL, pages 104–111. David J. Weir. 1988. Characterizing mildly context-sensitive grammar formalisms. 470 Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system Myroslava O. Dzikovska and Peter Bell and Amy Isard and Johanna D. Moore Institute for Language, Cognition and Computation School of Informatics, University of Edinburgh, United Kingdom {m.dzikovska,peter.bell,amy.isard,j.moore}@ed.ac.uk Abstract task-based evaluation can be used to complement intrinsic evaluations. For example, NLP com- It is not always clear how the differences ponents such as parsers and co-reference resolu- in intrinsic evaluation metrics for a parser or classifier will affect the performance of tion algorithms could be compared in terms of the system that uses it. We investigate the how much they contribute to the performance of relationship between the intrinsic evalua- a textual entailment (RTE) system (Sammons et tion scores of an interpretation component al., 2010; Yuret et al., 2010); parser performance in a tutorial dialogue system and the learn- could be evaluated by how well it contributes to ing outcomes in an experiment with human an information retrieval task (Miyao et al., 2008). 
users. Following the PARADISE method- However, task-based evaluation can be difficult ology, we use multiple linear regression to build predictive models of learning gain, and expensive for interactive applications. Specif- an important objective outcome metric in ically, task-based evaluation for dialogue systems tutorial dialogue. We show that standard typically involves collecting data from a number intrinsic metrics such as F-score alone do of people interacting with the system, which is not predict the outcomes well. However, time-consuming and labor-intensive. Thus, it is we can build predictive performance func- desirable to develop an off-line evaluation pro- tions that account for up to 50% of the vari- cedure that relates intrinsic evaluation metrics to ance in learning gain by combining fea- predicted interaction outcomes, reducing the need tures based on standard evaluation scores and on the confusion matrix entries. We to conduct experiments with human participants. argue that building such predictive mod- This problem can be addressed via the use of els can help us better evaluate performance the PARADISE evaluation methodology for spo- of NLP components that cannot be distin- ken dialogue systems (Walker et al., 2000). In a guished based on F-score alone, and illus- PARADISE study, after an initial data collection trate our approach by comparing the cur- with users, a performance function is created to rent interpretation component in the system to a new classifier trained on the evaluation predict an outcome metric (e.g., user satisfaction) data. which can normally only be measured through user surveys. 
Typically, a multiple linear regres- sion is used to fit a predictive model of the desired 1 Introduction metric based on the values of interaction param- Much of the work in natural language processing eters that can be derived from system logs with- relies on intrinsic evaluation: computing standard out additional user studies (e.g., dialogue length, evaluation metrics such as precision, recall and F- word error rate, number of misunderstandings). score on the same data set to compare the perfor- PARADISE models have been used extensively mance of different approaches to the same NLP in task-oriented spoken dialogue systems to estab- problem. However, once a component, such as lish which components of the system most need a parser, is included in a larger system, it is not improvement, with user satisfaction as the out- always clear that improvements in intrinsic eval- come metric (M¨oller et al., 2007; M¨oller et al., uation scores will translate into improved over- 2008; Walker et al., 2000; Larsen, 2003). In tu- all system performance. Therefore, extrinsic or torial dialogue, PARADISE studies investigated 471 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 471–481, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics which manually annotated features predict learn- in series or in parallel”. Explanation and defi- ing outcomes, to justify new features needed in nition questions require longer answers that con- the system (Forbes-Riley et al., 2007; Rotaru and sist of 1-2 sentences, e.g., “Why was bulb A on Litman, 2006; Forbes-Riley and Litman, 2006). 
when switch Z was open?” (expected answer “Be- We adapt the PARADISE methodology to eval- cause it was still in a closed path with the bat- uating individual NLP components, linking com- tery”) or “What is voltage?” (expected answer monly used intrinsic evaluation scores with ex- “Voltage is the difference in states between two trinsic outcome metrics. We describe an evalua- terminals”). We focus on the performance of the tion of an interpretation component of a tutorial system on these long-answer questions, since re- dialogue system, with student learning gain as the acting to them appropriately requires processing target outcome measure. We first describe the more complex input than factual questions. evaluation setup, which uses standard classifica- We collected a corpus of 35 dialogues from tion accuracy metrics for system evaluation (Sec- paid undergraduate volunteers interacting with the tion 2). We discuss the results of the intrinsic sys- system as part of a formative system evaluation. tem evaluation in Section 3. We then show that Each student completed a multiple-choice test as- standard evaluation metrics do not serve as good sessing their knowledge of the material before and predictors of system performance for the system after the session. In addition, system logs con- we evaluated. However, adding confusion matrix tained information about how each student’s utter- features improves the predictive model (Section ance was interpreted. The resulting data set con- 4). We argue that in practical applications such tains 3426 student answers grouped into 35 sub- predictive metrics should be used alongside stan- sets, paired with test results. The answers were dard metrics for component evaluations, to bet- then manually annotated to create a gold standard ter predict how different components will perform evaluation corpus. in the context of a specific task. 
We demonstrate how this technique can help differentiate the out- 2.2 B EETLE II Interpretation Output put quality between a majority class baseline, the The interpretation component of B EETLE II uses system’s output, and the output of a new classifier a syntactic parser and a set of hand-authored rules we trained on our data (Section 5). Finally, we to extract the domain-specific semantic represen- discuss some limitations and possible extensions tations of student utterances from the text. The to this approach (Section 6). student answer is first classified with respect to its domain-specific speech act, as follows: 2 Evaluation Procedure • Answer: a contentful expression to which 2.1 Data Collection the system responds with a tutoring action, We collected transcripts of students interacting either accepting it as correct or remediating with B EETLE II (Dzikovska et al., 2010b), a tu- the problems as discussed in (Dzikovska et torial dialogue system for teaching conceptual al., 2010a). knowledge in the basic electricity and electron- • Help request: any expression indicating that ics domain. The system is a learning environment the student does not know the answer and with a self-contained curriculum targeted at stu- without domain content. dents with no knowledge of high school physics. When interacting with the system, students spend • Social: any expression such as “sorry” which 3-5 hours going through pre-prepared reading ma- appears to relate to social interaction and has terial, building and observing circuits in a simula- no recognizable domain content. tor, and talking with a dialogue-based computer tutor via a text-based chat interface. • Uninterpretable: the system could not arrive During the interaction, students can be asked at any interpretation of the utterance. It will two types of questions. 
Factual questions require respond by identifying the likely source of them to name a set of objects or a simple prop- error, if possible (e.g., a word it does not un- erty, e.g., “Which components in circuit 1 are in derstand) and asking the student to rephrase a closed path?” or “Are bulbs A and B wired their utterance (Dzikovska et al., 2009). 472 If the student utterance was determined to be an the tutoring strategy based on the general answer answer, it is further diagnosed for correctness as class (correct, incomplete, or contradictory). In discussed in (Dzikovska et al., 2010b), using a do- addition, this allows us to cast the problem in main reasoner together with semantic representa- terms of classifier evaluation, and to use standard tions of expected correct answers supplied by hu- classifier evaluation metrics. If more detailed an- man tutors. The resulting diagnosis contains the notations were available, this approach could eas- following information: ily be extended, as discussed in Section 6. We employed a hierarchical annotation scheme • Consistency: whether the student statement shown in Figure 1, which is a simplification of correctly describes the facts mentioned in the DeMAND coding scheme (Campbell et al., the question and the simulation environment: 2009). Student utterances were first annotated e.g., student saying “Switch X is closed” is as either related to domain content, or not con- labeled inconsistent if the question stipulated taining any domain content, but expressing the that this switch is open. student’s metacognitive state or attitudes. Utter- • Diagnosis: an analysis of how well the stu- ances expressing domain content were then coded dent’s explanation matches the expected an- with respect to their correctness, as being fully swer. It consists of 4 parts correct, partially correct but incomplete, contain- ing some errors (rather than just omissions) or – Matched: parts of the student utterance irrelevant1 . 
The “irrelevant” category was used that matched the expected answer for utterances which were correct in general but – Contradictory: parts of the student ut- which did not directly answer the question. Inter- terance that contradict the expected an- annotator agreement for this annotation scheme swer on the corpus was κ = 0.69. – Extra: parts of the student utterance that The speech acts and diagnoses logged by the do not appear in the expected answer system can be automatically mapped into our an- – Not-mentioned: parts of the expected notation labels. Help requests and social acts answer missing from the student utter- are assigned the “non-content” label; answers ance. are assigned a label based on which diagnosis The speech act and the diagnosis are passed to fields were filled: “contradictory” for those an- the tutorial planner which makes decisions about swers labeled as either inconsistent, or contain- feedback. They constitute the output of the inter- ing something in the contradictory field; “incom- pretation component, and its quality is likely to plete” if there is something not mentioned, but affect the learning outcomes, therefore we need something matched as well, and “irrelevant” if an effective way to evaluate it. In future work, nothing matched (i.e., the entire expected answer performance of individual pipeline components is in not-mentioned). Finally, uninterpretable ut- could also be evaluated in a similar fashion. terances are treated as unclassified, analogous to a situation where a statistical classifier does not out- 2.3 Data Annotation put a label for an input because the classification The general idea of breaking down the student an- probability is below its confidence threshold. 
swer into correct, incorrect and missing parts is This mapping was then compared against the common in tutorial dialogue systems (Nielsen et manually annotated labels to compute the intrin- al., 2008; Dzikovska et al., 2010b; Jordan et al., sic evaluation scores for the B EETLE II interpreter 2006). However, representation details are highly described in Section 3. system specific, and difficult and time-consuming to annotate. Therefore we implemented a simpli- 3 Intrinsic Evaluation Results fied annotation scheme which classifies whole an- The interpretation component of B EETLE II was swers as correct, partially correct but incomplete, developed based on the transcripts of 8 sessions or contradictory, without explicitly identifying the 1 Several different subcategories of non-content utter- correct and incorrect parts. This makes it easier to ances, and of contradictory utterances, were recorded. How- create the gold standard and still retains useful in- ever, they resulting classes were too small and so were col- formation, because tutoring systems often choose lapsed into a single category for purposes of this study. 473 Category Subcategory Description Non-content Metacognitive and social expressions without domain content, e.g., “I don’t know”, “I need help”, “you are stupid” Content The utterance includes domain content. correct The student answer is fully correct pc incomplete The student said something correct, but incomplete, with some parts of the expected answer missing contradictory The student’s answer contains something incorrect or contradicting the expected answer, rather than just an omission irrelevant The student’s statement is correct in general, but it does not answer the question. Figure 1: Annotation scheme used in creating the gold standard Label Count Frequency 43%, the same as B EETLE II. 
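The mapping from logged speech acts and diagnosis fields to annotation labels described in Section 2.3 can be sketched as follows. The field and label names here are illustrative, not BEETLE II's actual log format:

```python
def map_to_label(speech_act, diagnosis):
    """Map a logged speech act plus answer diagnosis to an annotation label.

    diagnosis: dict with a boolean 'consistent' flag and lists under
    'matched', 'contradictory' and 'not_mentioned' (names are illustrative).
    """
    if speech_act in ("help_request", "social"):
        return "non_content"
    if speech_act == "uninterpretable":
        return None  # treated as unclassified, like a classifier below threshold
    # speech_act == "answer": decide from the diagnosis fields
    if not diagnosis["consistent"] or diagnosis["contradictory"]:
        return "contradictory"
    if not diagnosis["matched"]:
        return "irrelevant"  # entire expected answer is in not-mentioned
    if diagnosis["not_mentioned"]:
        return "pc_incomplete"
    return "correct"
```

For instance, an answer with a matched part and a non-empty not-mentioned list maps to "pc incomplete", while an inconsistent answer maps to "contradictory" regardless of what else matched.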
However, this is correct 1438 0.43 obviously not a good choice for a tutoring sys- pc incomplete 796 0.24 tem, since students who make mistakes will never contradictory 808 0.24 get tutoring feedback. This is reflected in a much irrelevant 105 0.03 lower value of the F score (0.12 macroaverage F non content 232 0.07 score for baseline vs. 0.44 for B EETLE II). Note also that there is a large difference in the micro- Table 1: Distribution of annotated labels in the evalu- and macro- averaged scores. It is not immediately ation corpus clear which of these metrics is the most important, and how they relate to actual system performance. of students interacting with earlier versions of the We discuss machine learning models to help an- system. These sessions were completed prior to swer this question in the next section. the beginning of the experiment during which our evaluation corpus was collected, and are not in- 4 Linking Evaluation Measures to cluded in the corpus. Thus, the corpus constitutes Outcome Measures unseen testing data for the B EETLE II interpreter. Table 1 shows the distribution of codes in Although the intrinsic evaluation shows that the the annotated data. The distribution is unbal- B EETLE II interpreter performs better than the anced, and therefore in our evaluation results we baseline on the F score, ultimately system devel- use two different ways to average over per-class opers are not interested in improving interpreta- evaluation scores. Macro-average combines per- tion for its own sake: they want to know whether class scores disregarding the class sizes; micro- the time spent on improvements, and the compli- average weighs the per-class scores by class size. cations in system design which may accompany The overall classification accuracy (defined as the them, are worth the effort. 
Specifically, do such number of correctly classified instances out of all changes translate into improvement in overall sys- instances) is mathematically equivalent to micro- tem performance? averaged recall; however, macro-averaging better To answer this question without running expen- reflects performance on small classes, and is com- sive user studies we can build a model which pre- monly used for unbalanced classification prob- dicts likely outcomes based on the data observed lems (see, e.g., (Lewis, 1991)). so far, and then use the model’s predictions as an The detailed evaluation results are presented additional evaluation metric. We chose a multiple in Table 2. We will focus on two metrics: the linear regression model for this task, linking the overall classification accuracy (listed as “micro- classification scores with learning gain as mea- averaged recall” as discussed above), and the sured during the data collection. This approach macro-averaged F score. follows the general PARADISE approach (Walker The majority class baseline is to assign “cor- et al., 2000), but while PARADISE is typically rect” to every instance. Its overall accuracy is used to determine which system components need 474 Label baseline B EETLE II prec. recall F1 prec. recall F1 correct 0.43 1.00 0.60 0.93 0.52 0.67 pc incomplete 0.00 0.00 0.00 0.42 0.53 0.47 contradictory 0.00 0.00 0.00 0.57 0.22 0.31 irrelevant 0.00 0.00 0.00 0.17 0.15 0.16 non-content 0.00 0.00 0.00 0.91 0.41 0.57 macroaverage 0.09 0.20 0.12 0.60 0.37 0.44 microaverage 0.18 0.43 0.25 0.70 0.43 0.51 Table 2: Intrinsic Evaluation Results for the B EETLE II and a majority class baseline the most improvement, we focus on finding a bet- rate confusion matrices for each student. We nor- ter performance metric for a single component malized each confusion matrix cell by the total (interpretation), using standard evaluation scores number of incorrect classifications for that stu- as features. dent. 
We then added features based on confusion Recall from Section 2.1 that each participant frequencies to our feature set.2 in our data collection was given a pre-test and Ideally, we should add 20 different features to a post-test, measuring their knowledge of course our model, corresponding to every possible con- material. The test score was equal to the propor- fusion. However, we are facing a sparse data tion of correctly answered questions. The normal- problem, illustrated by the overall confusion ma- ized learning gain, post−pre 1−pre is a metric typically trix for the corpus in Table 3. For example, used to assess system quality in intelligent tutor- we only observed 25 instances where a contra- ing, and this is the metric we are trying to model. dictory utterance was miscategorized as correct Thus, the training data for our model consists of (compared to 200 “contradictory–pc incomplete” 35 instances, each corresponding to a single dia- confusions), and so for many students this mis- logue and the learning gain associated with it. We classification was never observed, and predictions can compute intrinsic evaluation scores for each based on this feature are not likely to be reliable. dialogue, in order to build a model that predicts Therefore, we limited our features to those mis- that student’s learning gain based on these scores. classifications that occurred at least twice for each If the model’s predictions are sufficiently reliable, student (i.e., at least 70 times in the entire cor- we can also use them for predicting the learning pus). The list of resulting features is shown in the gain that a student could achieve when interacting “conf” row of Table 4. Since only a small num- with a new version of the interpretation compo- ber of features was included, this limits the appli- nent for the system, not yet tested with users. 
We cability of the model we derived from this data can then use the predicted score to compare dif- set to the systems which make similar types of ferent implementations and choose the one with confusions. However, it is still interesting to in- the highest predicted learning gain. vestigate whether confusion probabilities provide additional information compared to standard eval- 4.1 Features uation metrics. We discuss how better coverage Table 4 lists the feature sets we used. We tried two could be obtained in Section 6. basic types of features. First, we used the eval- uation scores reported in the previous section as 4.2 Regression Models features. Second, we hypothesized that some er- Table 5 shows the regression models we obtained rors that the system makes are likely to be worse using different feature sets. All models were ob- than others from a tutoring perspective. For ex- tained using stepwise linear regression, using the ample, if the student gives a contradictory answer, Akaike information criterion (AIC) for variable accepting it as correct may lead to student miscon- 2 We also experimented with using % unclassified as an ceptions; on the other hand, calling an irrelevant additional feature, since % of rejections is known to be a answer “partially correct but incomplete” may be problem for spoken dialogue systems. However, it did not less of a problem. Therefore, we computed sepa- improve the models, and we do not report it here for brevity. 475 Actual Predicted contradictory correct irrelevant non-content pc incomplete contradictory 175 86 3 0 43 correct 25 752 1 4 26 irrelevant 31 12 16 4 29 non-content 1 3 2 95 3 pc incomplete 200 317 40 28 419 Table 3: Confusion matrix for B EETLE II. System predicted values are in rows; actual values in columns. selection implemented in the R stepwise regres- full set of intrinsic evaluation scores with confu- sion library. As measures of model quality, we re- sion frequencies. 
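The per-student confusion features of Section 4.1 can be sketched in a few lines: count (predicted, actual) pairs for one student's answers and normalise each misclassification count by that student's total number of incorrect classifications. Feature names are illustrative, and the "at least twice per student" filter would be applied afterwards:

```python
from collections import Counter

def confusion_features(pairs):
    """Confusion-frequency features for one student.

    pairs: (predicted, actual) label pairs for that student's answers.
    Each misclassification count is normalised by the student's total
    number of incorrect classifications.
    """
    conf = Counter(pairs)
    wrong = sum(n for (p, a), n in conf.items() if p != a)
    if not wrong:
        return {}
    return {"pred_%s_actual_%s" % (p, a): n / wrong
            for (p, a), n in conf.items() if p != a}
```

A student with three "pc incomplete"-for-"contradictory" confusions and one "correct"-for-"contradictory" confusion would get feature values 0.75 and 0.25, regardless of how many answers were classified correctly.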
Note that if the full set of met- port R2 , the percentage of variance accounted for rics (precision, recall, F score) is used, the model by the models (a typical measure of fit in regres- derives a more complex formula which covers sion modeling), and mean squared error (MSE). about 33% of the variance. Our best models, These were estimated using leave-one-out cross- however, combine the averaged scores with con- validation, since our data set is small. fusion frequencies, resulting in a higher R2 and We used feature ablation to evaluate the contri- a lower MSE (22% relative decrease between the bution of different features. First, we investigated “scores.f” and “conf+scores.f” models in the ta- models using precision, recall or F-score alone. ble). This shows that these features have comple- As can be seen from the table, precision is not pre- mentary information, and that combining them in dictive of learning gain, while F-score and recall an application-specific way may help to predict perform similarly to one another, with R2 = 0.12. how the components will behave in practice. In comparison, the model using only confusion frequencies has substantially higher estimated R2 5 Using prediction models in evaluation and a lower MSE.3 In addition, out of the 3 con- The models from Table 5 can be used to compare fusion features, only one is selected as predictive. different possible implementations of the inter- This supports our hypothesis that different types pretation component, under the assumption that of errors may have different importance within a the component with a higher predicted learning practical system. gain score is more appropriate to use in an ITS. The confusion frequency feature chosen by To show how our predictive models can be used the stepwise model (“predicted-pc incomplete- in making implementation decisions, we compare actual-contradictory”) has a reasonable theoret- three possible choices for an interpretation com- ical justification. 
Previous research shows that ponent: the original B EETLE II interpreter, the students who give more correct or partially cor- baseline classifier described earlier, and a new de- rect answers, either in human-human or human- cision tree classifier trained on our data. computer dialogue, exhibit higher learning gains, We built a decision tree classifier using the and this has been established for different sys- Weka implementation of C4.5 pruned decision tems and tutoring domains (Litman et al., 2009). trees, with default parameters. As features, we Consequently, % of contradictory answers is neg- used lexical similarity scores computed by the atively predictive of learning gain. It is reasonable Text::Similarity package4 . We computed to suppose, as predicted by our model, that sys- 8 features: the similarity between student answer tems that do not identify such answers well, and and either the expected answer text or the question therefore do not remediate them correctly, will do text, using 4 different scores: raw number of over- worse in terms of learning outcomes. lapping words, F1 score, lesk score and cosine Based on this initial finding, we investigated score. Its intrinsic evaluation scores are shown in the models that combined either F scores or the Table 6, estimated using 10-fold cross-validation. 3 The decrease in MSE is not statistically significant, pos- We can compare B EETLE II and baseline clas- sibly because of the small data set. However, since we ob- sifier using the “scores.all” model. The predicted serve the same pattern of results across our models, it is still 4 useful to examine. 
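Lexical-similarity features of this kind can be approximated with simple word-overlap scores. A hedged sketch (function and feature names are ours; the lesk score used by Text::Similarity requires a sense inventory and is omitted here):

```python
import math
from collections import Counter

def overlap_features(answer, reference):
    """Word-overlap similarity features between a student answer and a
    reference text: raw overlap count, F1 over the overlap, and cosine
    over bag-of-words counts."""
    a = Counter(answer.lower().split())
    r = Counter(reference.lower().split())
    raw = sum((a & r).values())
    prec = raw / sum(a.values()) if a else 0.0
    rec = raw / sum(r.values()) if r else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    dot = sum(a[w] * r[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in r.values()))
    return {"raw": raw, "f1": f1, "cosine": dot / norm if norm else 0.0}
```

Computing these three scores against both the expected answer and the question text gives features of the same general shape as those fed to the decision tree, though not the package's exact implementations.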
⁴ http://search.cpan.org/dist/Text-Similarity/

Name              Variables
scores.fm         fmeasure.microaverage, fmeasure.macroaverage, fmeasure.correct, fmeasure.contradictory, fmeasure.pc incomplete, fmeasure.non-content, fmeasure.irrelevant
scores.precision  precision.microaverage, precision.macroaverage, precision.correct, precision.contradictory, precision.pc incomplete, precision.non-content, precision.irrelevant
scores.recall     recall.microaverage, recall.macroaverage, recall.correct, recall.contradictory, recall.pc incomplete, recall.non-content, recall.irrelevant
scores.all        scores.fm + scores.precision + scores.recall
conf              Freq.predicted.contradictory.actual.correct, Freq.predicted.pc incomplete.actual.correct, Freq.predicted.pc incomplete.actual.contradictory

Table 4: Feature sets for regression models

Variables         CV R²         CV MSE             Formula
scores.f          0.12 (0.02)   0.0232 (0.0302)    0.32 + 0.56 * fmeasure.microaverage
scores.precision  0.00 (0.00)   0.0242 (0.0370)    0.61
scores.recall     0.12 (0.02)   0.0232 (0.0310)    0.37 + 0.56 * recall.microaverage
conf              0.25 (0.03)   0.0197 (0.0262)    0.74 − 0.56 * Freq.predicted.pc incomplete.actual.contradictory
scores.all        0.33 (0.03)   0.0218 (0.0264)    0.63 + 4.20 * fmeasure.microaverage − 1.30 * precision.microaverage − 2.79 * recall.microaverage − 0.07 * recall.non-content
conf+scores.f     0.36 (0.03)   0.0179 (0.0281)    0.52 − 0.66 * Freq.predicted.pc incomplete.actual.contradictory + 0.42 * fmeasure.correct − 0.07 * fmeasure.non-content
full (conf+scores.all)  0.49 (0.02)   0.0189 (0.0248)    0.88 − 0.68 * Freq.predicted.pc incomplete.actual.contradictory − 0.06 * precision.non domain + 0.28 * recall.correct − 0.79 * precision.microaverage + 0.65 * fmeasure.microaverage

Table 5: Regression models for learning gain. R² and MSE estimated with leave-one-out cross-validation. Standard deviation in parentheses.
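The quality estimates in Table 5 come from leave-one-out cross-validation. For a single-predictor model (like the scores.f formula above) the procedure fits in a few lines; this is an illustrative stand-in in pure Python, not the R stepwise-regression code the authors used:

```python
def fit_simple(xs, ys):
    """Least-squares fit of y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def loo_mse(xs, ys):
    """Leave-one-out cross-validated mean squared error of y ~ a + b*x:
    refit on all but one dialogue, then score the held-out one."""
    errs = []
    for i in range(len(xs)):
        a, b = fit_simple(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(errs) / len(errs)
```

With 35 dialogues, each fold here would hold out one dialogue's (score, learning gain) pair; the multiple-regression case generalises this by fitting all selected features on the remaining 34.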
We cannot use correct 0.66 0.76 0.71 the models based on confusion scores (“conf”, pc incomplete 0.38 0.34 0.36 “conf+scores.f” or “full”) for evaluating the base- contradictory 0.40 0.35 0.37 line, because the confusions it makes are always irrelevant 0.07 0.04 0.05 to predict that the answer is correct when the non-content 0.62 0.76 0.68 actual label is “incomplete” or “contradictory”. macroaverage 0.43 0.45 0.43 Such situations were too rare in our training data, microaverage 0.51 0.53 0.52 and therefore were not included in the models (as Table 6: Intrinsic evaluation scores for our newly built discussed in Section 4.1). Additional data will classifier. need to be collected before this model can rea- sonably predict baseline behavior. Compared to our new classifier, B EETLE II has swer. However, we could still use a classifier to lower overall accuracy (0.43 vs. 0.53), but per- “double-check” the interpreter’s output. If the forms micro- and macro- averaged scores. B EE - predictions made by the original interpreter and TLE II precision is higher than that of the classi- the classifier differ, and in particular when the fier. This is not unexpected given how the system classifier assigns the “contradictory” label to an was designed: since misunderstandings caused answer, B EETLE II may choose to use a generic dialogue breakdown in pilot tests, the interpreter strategy for contradictory utterances, e.g. telling was built to prefer rejecting utterances as uninter- the student that their answer is incorrect without pretable rather than assigning them to an incorrect specifying the exact problem, or asking them to class, leading to high precision but lower recall. re-read portions of the material. However, we can use all our predictive models 6 Discussion and Future Work to evaluate the classifier. 
We checked the the con- fusion matrix (not shown here due to space lim- In this paper, we proposed an approach for cost- itations), and saw that the classifier made some sensitive evaluation of language interpretation of the same types of confusions that B EETLE II within practical applications. Our approach is interpreter made. On the “scores.all” model, the based on the PARADISE methodology for dia- predicted learning gain score for the classifier is logue system evaluation (Walker et al., 2000). 0.63, also very close to B EETLE II. But with the We followed the typical pattern of a PARADISE “conf+scores.all” model, the predicted score is study, but instead of relying on a variety of fea- 0.89, compared to 0.59 for B EETLE II, indicating tures that characterize the interaction, we used that we should prefer the newly built classifier. scores that reflect only the performance of the Looking at individual class performance, the interpretation component. For B EETLE II we classifier performs better than the B EETLE II in- could build regression models that account for terpreter on identifying “correct” and “contradic- nearly 50% variance in the desired outcomes, on tory” answers, but does not do as well for par- par with models reported in earlier PARADISE tially correct but incomplete, and for irrelevant an- studies (M¨oller et al., 2007; M¨oller et al., 2008; swers. Using our predictive performance metric Walker et al., 2000; Larsen, 2003). More impor- highlights the differences between the classifiers tantly, we demonstrated that combining averaged and effectively helps determine which confusion scores with features based on confusion frequen- types are the most important. cies improves prediction quality and allows us to One limitation of this prediction, however, is see differences between systems which are not ob- that the original system’s output is considerably vious from the scores alone. 
more complex: the B EETLE II interpreter explic- Previous work on task-based evaluation of NLP itly identifies correct, incorrect and missing parts components used RTE or information extraction of the student answer which are then used by the as target tasks (Sammons et al., 2010; Yuret et al., system to formulate adaptive feedback. This is 2010; Miyao et al., 2008), based on standard cor- an important feature of the system because it al- pora. We specifically targeted applications which lows for implementation of strategies such as ac- involve human-computer interaction, where run- knowledging and restating correct parts of the an- ning task-based evaluations is particularly expen- 478 sive, and building a predictive model of system tation variants during the system development, performance can simplify system development. without re-running user evaluations, can provide Our evaluation data limited the set of features important information, as we illustrated with an that we could use in our models. For most con- example of evaluating a new classifier we built for fusion features, there were not enough instances our interpretation task. Moreover, the confusion in the data to build a model that would reliably frequency feature that our models picked is con- predict learning gain for those cases. One way sistent with earlier results from a different tutor- to solve this problem would be to conduct a user ing domain (see Section 4.2). Thus, these models study in which the system simulates random er- could provide a starting point when making sys- rors appearing some of the time. This could pro- tem development choices, which can then be con- vide the data needed for more accurate models. firmed by user evaluations in new domains. The general pattern we observed in our data The models we built do not fully account for is that a model based on F-scores alone predicts the variance in the training data. This is expected, only a small proportion of the variance. 
If a full since interpretation performance is not the only set of metrics (including F-score, precision and factor influencing the objective outcome: other recall) is used, linear regression derives a more factors, such choosing the the appropriate tutor- complex equation, with different weights for pre- ing strategy, are also important. Similar models cision and recall. Instead of the linear model, we could be built for other system components to ac- may consider using a model based on Fβ score, count for their contribution to the variance. Fi- Fβ = (1 + β 2 ) β 2PPR+R , and fitting it to the data to nally, we could consider using different learning derive the β weight rather than using the standard algorithms. M¨oller et al. (2008) examined deci- F1 score. We plan to investigate this in the future. sion trees and neural networks in addition to mul- Our method would apply to a wide range of tiple linear regression for predicting user satisfac- systems. It can be used straightforwardly with tion in spoken dialogue. They found that neural many current spoken dialogue systems which rely networks had the best prediction performance for on classifiers to support language understanding their task. We plan to explore other learning algo- in domains such as call routing and technical sup- rithms for this task as part of our future work. port (Gupta et al., 2006; Acomb et al., 2007). 7 Conclusion We applied it to a system that outputs more com- plex logical forms, but we showed that we could In this paper, we described an evaluation of an simplify its output to a set of labels which still interpretation component of a tutorial dialogue allowed us to make informed decisions. Simi- system using predictive models that link intrin- lar simplifications could be derived for other sys- sic evaluation scores with learning outcomes. We tems based on domain-specific dialogue acts typ- showed that adding features based on confusion ically used in dialogue management. 
For slot- frequencies for individual classes significantly based systems, it may be useful to consider con- improves the prediction. This approach can be cept accuracy for recognizing individual slot val- used to compare different implementations of lan- ues. Finally, for tutoring systems it is possible guage interpretation components, and to decide to annotate the answers on a more fine-grained which option to use, based on the predicted im- level. Nielsen et al. (2008) proposed an annota- provement in a task-specific target outcome met- tion scheme based on the output of a dependency ric trained on previous evaluation data. parser, and trained a classifier to identify individ- ual dependencies as “expressed”, “contradicted” Acknowledgments or “unaddressed”. Their system could be evalu- We thank Natalie Steinhauser, Gwendolyn Camp- ated using the same approach. bell, Charlie Scott, Simon Caine, Leanne Taylor, The specific formulas we derived are not likely Katherine Harrison and Jonathan Kilgour for help to be highly generalizable. It is a well-known with data collection and preparation; and Christo- limitation of PARADISE evaluations that models pher Brew for helpful comments and discussion. built based on one system often do not perform This work has been supported in part by the US well when applied to different systems (M¨oller et ONR award N000141010085. al., 2008). But using them to compare implemen- 479 References Gilbert. 2006. The AT&T spoken language un- derstanding system. IEEE Transactions on Audio, Kate Acomb, Jonathan Bloom, Krishna Dayanidhi, Speech & Language Processing, 14(1):213–222. Phillip Hunter, Peter Krogh, Esther Levin, and Pamela W. Jordan, Maxim Makatchev, and Umarani Roberto Pieraccini. 2007. Technical support dia- Pappuswamy. 2006. Understanding complex nat- log systems: Issues, problems, and solutions. In ural language explanations in tutorial applications. 
Proceedings of the Workshop on Bridging the Gap: In Proceedings of the Third Workshop on Scalable Academic and Industrial Research in Dialog Tech- Natural Language Understanding, ScaNaLU ’06, nologies, pages 25–31, Rochester, NY, April. pages 17–24. Gwendolyn C. Campbell, Natalie B. Steinhauser, Lars Bo Larsen. 2003. Issues in the evaluation of spo- Myroslava O. Dzikovska, Johanna D. Moore, ken dialogue systems using objective and subjective Charles B. Callaway, and Elaine Farrow. 2009. The measures. In Proceedings of the 2003 IEEE Work- DeMAND coding scheme: A “common language” shop on Automatic Speech Recognition and Under- for representing and analyzing student discourse. In standing, pages 209–214. Proceedings of 14th International Conference on David D. Lewis. 1991. Evaluating text categorization. Artificial Intelligence in Education (AIED), poster In Proceedings of the workshop on Speech and Nat- session, Brighton, UK, July. ural Language, HLT ’91, pages 312–318, Strouds- Myroslava O. Dzikovska, Charles B. Callaway, Elaine burg, PA, USA. Farrow, Johanna D. Moore, Natalie B. Steinhauser, Diane Litman, Johanna Moore, Myroslava Dzikovska, and Gwendolyn E. Campbell. 2009. Dealing with and Elaine Farrow. 2009. Using natural lan- interpretation errors in tutorial dialogue. In Pro- guage processing to analyze tutorial dialogue cor- ceedings of the SIGDIAL 2009 Conference, pages pora across domains and modalities. In Proceed- 38–45, London, UK, September. ings of 14th International Conference on Artificial Myroslava Dzikovska, Diana Bental, Johanna D. Intelligence in Education (AIED), Brighton, UK, Moore, Natalie B. Steinhauser, Gwendolyn E. July. Campbell, Elaine Farrow, and Charles B. Callaway. Yusuke Miyao, Rune Sætre, Kenji Sagae, Takuya Mat- 2010a. Intelligent tutoring with natural language suzaki, and Jun’ichi Tsujii. 2008. Task-oriented support in the Beetle II system. 
In Sustaining TEL: evaluation of syntactic parsers and their representa- From Innovation to Learning and Practice - 5th Eu- tions. In Proceedings of ACL-08: HLT, pages 46– ropean Conference on Technology Enhanced Learn- 54, Columbus, Ohio, June. ing, (EC-TEL 2010), Barcelona, Spain, October. Sebastian M¨oller, Paula Smeele, Heleen Boland, and Jan Krebber. 2007. Evaluating spoken dialogue Myroslava O. Dzikovska, Johanna D. Moore, Natalie systems according to de-facto standards: A case Steinhauser, Gwendolyn Campbell, Elaine Farrow, study. Computer Speech & Language, 21(1):26 – and Charles B. Callaway. 2010b. Beetle II: a sys- 53. tem for tutoring and computational linguistics ex- Sebastian M¨oller, Klaus-Peter Engelbrecht, and perimentation. In Proceedings of the 48th Annual Robert Schleicher. 2008. Predicting the quality and Meeting of the Association for Computational Lin- usability of spoken dialogue services. Speech Com- guistics (ACL-2010) demo session, Uppsala, Swe- munication, pages 730–744. den, July. Rodney D. Nielsen, Wayne Ward, and James H. Mar- Kate Forbes-Riley and Diane J. Litman. 2006. Mod- tin. 2008. Learning to assess low-level conceptual elling user satisfaction and student learning in a understanding. In Proceedings 21st International spoken dialogue tutoring system with generic, tu- FLAIRS Conference, Coconut Grove, Florida, May. toring, and user affect parameters. In Proceed- Mihai Rotaru and Diane J. Litman. 2006. Exploit- ings of the Human Language Technology Confer- ing discourse structure for spoken dialogue perfor- ence of the North American Chapter of the Asso- mance analysis. In Proceedings of the 2006 Con- ciation of Computational Linguistics (HLT-NAACL ference on Empirical Methods in Natural Language ’06), pages 264–271, Stroudsburg, PA, USA. Processing, EMNLP ’06, pages 85–93, Strouds- Kate Forbes-Riley, Diane Litman, Amruta Purandare, burg, PA, USA. Mihai Rotaru, and Joel Tetreault. 2007. 
Compar- Mark Sammons, V.G.Vinod Vydiswaran, and Dan ing linguistic features for modeling learning in com- Roth. 2010. “Ask not what textual entailment can puter tutoring. In Proceedings of the 2007 confer- do for you...”. In Proceedings of the 48th Annual ence on Artificial Intelligence in Education: Build- Meeting of the Association for Computational Lin- ing Technology Rich Learning Contexts That Work, guistics, pages 1199–1208, Uppsala, Sweden, July. pages 270–277, Amsterdam, The Netherlands. IOS Marilyn A. Walker, Candace A. Kamm, and Diane J. Press. Litman. 2000. Towards Developing General Mod- Narendra K. Gupta, G¨okhan T¨ur, Dilek Hakkani-T¨ur, els of Usability with PARADISE. Natural Lan- Srinivas Bangalore, Giuseppe Riccardi, and Mazin guage Engineering, 6(3). 480 Deniz Yuret, Aydin Han, and Zehra Turgut. 2010. SemEval-2010 task 12: Parser evaluation using tex- tual entailments. In Proceedings of the 5th Inter- national Workshop on Semantic Evaluation, pages 51–56, Uppsala, Sweden, July. 481 Experimenting with Distant Supervision for Emotion Classification Matthew Purver∗ and Stuart Battersby† ∗ † Interaction Media and Communication Group Chatterbox Analytics School of Electronic Engineering and Computer Science Queen Mary University of London Mile End Road, London E1 4NS, UK

[email protected] [email protected]

Abstract in unconventional style and without accompany- ing metadata, audio/video signals or access to the We describe a set of experiments using au- author for disambiguation, how can we easily pro- tomatically labelled data to train supervised duce a gold-standard labelling for training and/or classifiers for multi-class emotion detection for evaluation and test? One possible solution in Twitter messages with no manual inter- vention. By cross-validating between mod- that is becoming popular is crowd-sourcing the la- els trained on different labellings for the belling task, as the easy access to very large num- same six basic emotion classes, and testing bers of annotators provided by tools such as Ama- on manually labelled data, we conclude that zon’s Mechanical Turk can help with the problem the method is suitable for some emotions of dataset size; however, this has its own attendant (happiness, sadness and anger) but less able problems of annotator reliability (see e.g. (Hsueh to distinguish others; and that different la- et al., 2009)), and cannot directly help with the in- belling conventions are more suitable for herent problem of ambiguity – using many anno- some emotions than others. tators does not guarantee that they can understand or correctly assign the author’s intended interpre- 1 Introduction tation or emotional state. We present a set of experiments into classify- In this paper, we investigate a different ap- ing Twitter messages into the six basic emotion proach via distant supervision (see e.g. (Mintz classes of (Ekman, 1972). The motivation behind et al., 2009)). By using conventional markers of this work is twofold: firstly, to investigate the pos- emotional content within the texts themselves as sibility of detecting emotions of multiple classes a surrogate for explicit labels, we can quickly re- (rather than purely positive or negative sentiment) trieve large subsets of (noisily) labelled data. 
This in such short texts; and secondly, to investigate approach has the advantage of giving us direct the use of distant supervision to quickly bootstrap access to the authors’ own intended interpreta- large datasets and classifiers without the need for tion or emotional state, without relying on third- manual annotation. party annotators. Of course, the labels themselves Text classification according to emotion and may be noisy: ambiguous, vague or not having sentiment is a well-established research area. In a direct correspondence with the desired classi- this and other areas of text analysis and classifica- fication. We therefore experiment with multiple tion, recent years have seen a rise in use of data such conventions with apparently similar mean- from online sources and social media, as these ings – here, emoticons (following (Read, 2005)) provide very large, often freely available datasets and Twitter hashtags – allowing us to examine the (see e.g. (Eisenstein et al., 2010; Go et al., 2009; similarity of classifiers trained on independent la- Pak and Paroubek, 2010) amongst many others). bels but intended to detect the same underlying However, one of the challenges this poses is that class. We also investigate the precision and cor- of data annotation: given very large amounts of respondence of particular labels with the desired data, often consisting of very short texts, written emotion classes by testing on a small set of man- 482 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–491, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics ually labelled data. 74% for anger. However, they achieved signifi- We show that the success of this approach de- cant improvements using acoustic features avail- pends on both the conventional markers chosen able in their speech data, improving accuracies up and the emotion classes themselves. 
Some emo- to a maximum of 81.5%. tions are both reliably marked by different con- 2.2 Conventions ventions and distinguishable from other emotions; this seems particularly true for happiness, sadness As we are using text data, such intonational and and anger, indicating that this approach can pro- prosodic cues are unavailable, as are the other vide not only the basic distinction required for rich sources of emotional cues we obtain from sentiment analysis but some more finer-grained gesture, posture and facial expression in face-to- information. Others are either less distinguishable face communication. However, the prevalence of from short text messages, or less reliably marked. online text-based communication has led to the emergence of textual conventions understood by 2 Related Work the users to perform some of the same functions as these acoustic and non-verbal cues. The most 2.1 Emotion and Sentiment Classification familiar of these is the use of emoticons, either Much research in this area has concentrated on the Western-style (e.g. :), :-( etc.) or Eastern-style related tasks of subjectivity classification (distin- (e.g. (ˆ_ˆ), (>_<) etc.). Other conventions guishing objective from subjective texts – see e.g. have emerged more recently for particular inter- (Wiebe and Riloff, 2005)); and sentiment classifi- faces or domains; in Twitter data, one common cation (classifying subjective texts into those that convention is the use of hashtags to add or em- convey positive, negative and neutral sentiment – phasise emotional content – see (1). see e.g. (Pang and Lee, 2008)). We are interested (1) a. Best day in ages! #Happy :) in emotion detection: classifying subjective texts according to a finer-grained classification of the b. Gets so #angry when tutors don’t email emotions they convey, and thus providing richer back... Do you job idiots! 
and more informative data for social media anal- Linguistic and social research into the use of ysis than simple positive/negative sentiment. In such conventions suggests that their function is this study we confine ourselves to the six basic generally to emphasise or strengthen the emo- emotions identified by Ekman (1972) as being tion or sentiment conveyed by a message, rather common across cultures; other finer-grained clas- than to add emotional content which would not sifications are of course available. otherwise be present. Walther and D’Addario (2001) found that the contribution of emoticons 2.1.1 Emotion Classification towards the sentiment of a message was out- The task of emotion classification is by nature weighed by the verbal content, although nega- a multi-class problem, and classification experi- tive ones tended to shift interpretation towards the ments have therefore achieved lower accuracies negative. Ip (2002) experimented with emoticons than seen in the binary problems of sentiment and in instant messaging, with the results suggesting subjectivity classification. Danisman and Alpko- that emoticons do not add positivity or negativ- cak (2008) used vector space models for the same ity but rather increase valence (making positive six-way emotion classification we examine here, messages more positive and vice versa). Similarly and achieved F-measures around 32%; Seol et al. Derks et al. (2008a; 2008b) found that emoticons (2008) used neural networks for an 8-way clas- are used in strengthening the intensity of a ver- sification (hope, love, thank, neutral, happy, sad, bal message (although they serve other functions fear, anger) and achieved per-class accuracies of such as expressing humour), and hypothesized 45% to 65%. Chuang and Wu (2004) used su- that they serve similar functions to actual non- pervised classifiers (SVMs) and manually defined verbal behavior; Provine et al. 
(2007) also found keyword features over a seven-way classification that emoticons are used to “punctuate” messages consisting of the same six-class taxonomy plus a rather than replace lexical content, appearing in neutral category, and achieved an average accu- similar grammatical locations to verbal laughter racy of 65.5%, varying from 56% for disgust to and preserving phrase structure. 483 2.3 Distant Supervision b. Leftover ToeJams with Kettle Salt and These findings suggest, of course, that emoticons Vinegar chips. #stress #sadness #comfort and related conventional markers are likely to be #letsturnthisfrownupsidedown useful features for sentiment and emotion classifi- 3 Methodology cation. They also suggest, though, that they might be used as surrogates for manual emotion class la- We used a collection of Twitter messages, all bels: if their function is often to complement the marked with emoticons or hashtags correspond- verbal content available in messages, they should ing to one of Ekman (1972)’s six emotion classes. give us a way to automatically label messages ac- For emoticons, we used Ansari (2010)’s taxon- cording to emotional class, while leaving us with omy, taken from the Yahoo messenger classifica- messages with enough verbal content to achieve tion. For hashtags, we used emotion names them- reasonable classification. selves together with the main related adjective – This approach has been exploited in several both are used commonly on Twitter in slightly ways in recent work; Tanaka et al. (2005) used different ways as shown in (3); note that emo- Japanese-style emoticons as classification labels, tion names are often used as marked verbs as well and Go et al. (2009) and Pak and Paroubek (2010) as nouns. Details of the classes and markers are used Western-style emoticons to label and classify given in Table 1. Twitter messages according to positive and nega- tive sentiment, using traditional supervised clas- (3) a. 
Gets so #angry when tutors don’t email sification methods. The highest accuracies ap- back... Do you job idiots! pear to have been achieved by Go et al. (2009), b. I’m going to say it, Paranormal Activity who used various combinations of features (un- 2 scared me and I didn’t sleep well last igrams, bigrams, part-of-speech tags) and clas- night because of it. #fear #demons sifiers (Na¨ıve Bayes, maximum entropy, and SVMs), achieving their best accuracy of 83.0% c. Girls that sleep w guys without even fully with unigram and bigram features and a maxi- getting to know them #disgust me mum entropy; using only unigrams with a SVM classifier achieved only slightly lower accuracy at Messages with multiple conventions (see (4)) 82.2%. Ansari (2010) then provides an initial in- were collected and used in the experiments, ensur- vestigation into applying the same methods to six- ing that the marker being used as a label in a par- way emotion classification, treating each emotion ticular experiment was not available as a feature in independently as a binary classification problem that experiment. Messages with no markers were and showing that accuracy varied with emotion not collected. While this prevents us from exper- class as well as with dataset size. The highest ac- imenting with the classification of neutral or ob- curacies achieved were up to 81%, but these were jective messages, it would require manual anno- on very small datasets (e.g. 81.0% accuracy on tation to distinguish these from emotion-carrying fear, but with only around 200 positive and nega- messages which are not marked. We assume that tive data instances). any implementation of the techniques we investi- We view this approach as having several ad- gate here would be able to use a preliminary stage vantages; apart from the ease of data collection of subjectivity and/or sentiment detection to iden- it allows by avoiding manual annotation, it gives tify these messages, and leave this aside here. 
us access to the author’s own intended interpeta- tions, as the markers are of course added by the (4) a. just because people are celebs they dont authors themselves at time of writing. In some reply to your tweets! NOT FAIR #Angry cases such as the examples of (1) above, the emo- :( I wish They would reply! #Please tion conveyed may be clear to a third-party anno- Data was collected from Twitter’s Streaming tator; but in others it may not be clear at all with- API service.1 This provides a 1-2% random sam- out the marker – see (2): ple of all tweets with no constraints on language (2) a. Still trying to recover from seeing the 1 See http://dev.twitter.com/docs/ #bluewaffle on my TL #disgusted #sick streaming-api. 484 absolute performance for future work. Table 1: Conventional markers used for emotion classes. 4 Experiments happy :-) :) ;-) :D :P 8) 8-| <@o Throughout, the markers (emoticons and/or hash- sad :-( :( ;-( :-< :’( tags) used as labels in any experiment were re- anger :-@ :@ moved before feature extraction in that experi- fear :| :-o :-O ment – labels were not used as features. surprise :s :S 4.1 Experiment 1: Emotion detection disgust :$ +o( happy #happy #happiness To simulate the task of detecting emotion classes sad #sad #sadness from a general stream of messages, we first built anger #angry #anger for each convention type C and each emotion class E a dataset DE C of size N containing (a) fear #scared #fear surprise #surprised #surprise as positive instances, N/2 messages containing disgust #disgusted #disgust markers of the emotion class E and no other markers of type C, and (b) as negative instances, N/2 messages containing markers of type C of or location. These are collected in near real time any other emotion class. For example, the posi- and stored in a local database. 
An English lan- tive instance set for emoticon-marked anger was guage selection filter was applied; scripts collect- based on those tweets which contained :-@ or ing each conventional marker set were alternated :@, but none of the emoticons from the happy, throughout different times of day and days of the sad, surprise, disgust or fear classes; week to avoid any bias associated with e.g. week- any hashtags were allowed, including those as- ends or mornings. The numbers of messages col- sociated with emotion classes. The negative in- lected varied with the popularity of the markers stance set contained a representative sample of themselves: for emoticons, we obtained a max- the same number of instances, with each having imum of 837,849 (for happy) and a minimum at least one of the happy, sad, surprise, of 10,539 for anger; for hashtags, a maximum disgust or fear emoticons but not containing of 10,219 for happy and a minimum of 536 for :-@ or :@. disgust. 2 This of course excludes messages with no emo- Classification in all experiments was using sup- tional markers; for this to act as an approximation port vector machines (SVMs) (Vapnik, 1995) via of the general task therefore requires a assump- the LIBSVM implementation of Chang and Lin tion that unmarked messages reflect the same dis- (2001) with a linear kernel and unigram features. tribution over emotion classes as marked mes- Unigram features included all words and hashtags sages. For emotion-carrying but unmarked mes- (other than those used as labels in relevant exper- sages, this does seem intuitively likely, but re- iments) after removal of URLs and Twitter user- quires investigation. For neutral objective mes- names. Some improvement in performance might sages it is clearly false, but as stated above we as- be available using more advanced features (e.g. sume a preliminary stage of subjectivity detection n-grams), other classification methods (e.g. maxi- in any practical application. 
mum entropy, as lexical features are unlikely to be Performance was evaluated using 10-fold independent) and/or feature weightings (e.g. the cross-validation. Results are shown as the bold variant of TFIDF used for sentiment classification figures in Table 2; despite the small dataset by Martineau (2009)). Here, our interest is more sizes in some cases, a χ2 test shows all to be in the difference between the emotion and con- significantly different from chance. The best- vention marker classes - we leave investigation of performing classes show accuracies very similar to those achieved by Go et al. (2009) for their bi- 2 One possible way to increase dataset sizes for the rarer nary positive/negative classification, as might be markers might be to include synonyms in the hashtag names used; however, people’s use and understanding of hashtags is expected; for emoticon markers, the best classes not straightforwardly predictable from lexical form. Instead, are happy, sad and anger; interestingly the we intend to run a longer-term data gathering exercise. best classes for hashtag markers are not the same 485 but the highest figures (between 63% and 68%) Table 2: Experiment 1: Within-class results. Same- convention (bold) figures are accuracies over 10-fold are achieved for happy, sad and anger; here cross-validation; cross-convention (italic) figures are perhaps we can have some confidence that not accuracies over full sets. only are the markers acting as predictable labels Train themselves, but also seem to be labelling the same Convention Test emoticon hashtag thing (and therefore perhaps are actually labelling emoticon happy 79.8% 63.5% the emotion we are hoping to label). 
Table 2 (continued):

    Convention  Test      Train: emoticon   Train: hashtag
    emoticon    sad       79.9%             65.5%
    emoticon    anger     80.1%             62.9%
    emoticon    fear      76.2%             58.5%
    emoticon    surprise  77.4%             48.2%
    emoticon    disgust   75.2%             54.6%
    hashtag     happy     67.7%             82.5%
    hashtag     sad       67.1%             74.6%
    hashtag     anger     62.8%             74.7%
    hashtag     fear      60.6%             77.2%
    hashtag     surprise  51.9%             67.4%
    hashtag     disgust   64.6%             78.3%

For hashtag markers, happy performs best, but disgust and fear outperform sad and anger, and surprise performs particularly badly. For sad, one reason may be a dual meaning of the tag #sad (one emotional and one expressing ridicule); for anger, one possibility is the popularity on Twitter of the game "Angry Birds"; for surprise, the data seems split between two rather distinct usages, one expressing the author's emotion and one expressing an intended effect on the audience (see (5)). However, deeper analysis is needed to establish the exact causes.

4.2 Experiment 2: Emotion discrimination

To investigate whether these independent classifiers can be used in multi-class classification (distinguishing emotion classes from each other rather than just distinguishing one class from a general "other" set), we next cross-tested the classifiers between emotion classes: training models on one emotion and testing on the others. For each convention type C and each emotion class E1, we train a classifier on dataset D_C^{E1} and test on D_C^{E2}, D_C^{E3}, etc. The datasets in Experiment 1 had an uneven balance of emotion classes (including a high proportion of happy instances) which could bias results; for this experiment, therefore, we created datasets with an even balance of emotions among the negative instances. For each convention type C and each emotion class E1, we built a dataset D_C^{E1} of size N containing (a) as positive instances, N/2 messages containing markers of the emotion class E1 and no other markers of type C, and (b) as negative instances, N/2 messages consisting of N/10 messages containing only markers of class E2, N/10 messages containing only markers of class E3, etc.
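The construction above can be sketched as follows (hypothetical code: `messages_by_class` is an assumed in-memory structure, and simple truncation stands in for the paper's sampling):

```python
# Build a dataset of size n for target class E1: n/2 positive instances
# from E1, and n/2 negatives drawn evenly from the other emotion classes
# (n/10 each when there are five others), as described above.

def build_balanced_dataset(messages_by_class, target, n):
    """Return (message, label) pairs with an even negative-class balance."""
    others = [c for c in messages_by_class if c != target]
    per_other = (n // 2) // len(others)
    data = [(m, 1) for m in messages_by_class[target][: n // 2]]
    for c in others:
        data += [(m, 0) for m in messages_by_class[c][:per_other]]
    return data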
Results were then generated as in Experiment 1.

(5) a. broke 100 followers. #surprised im glad that the HOFF is one of them.
    b. Who's excited for the Big Game? We know we are AND we have a #surprise for you!

To investigate whether the different convention types actually convey similar properties (and hence are used to mark similar messages), we then compared these accuracies to those obtained by training classifiers on the dataset for a different convention: in other words, for each emotion class E, train a classifier on dataset D_E^{C1} and test on D_E^{C2}. As the training and testing sets are different, we now test on the entire dataset rather than using cross-validation. Results are shown as the italic figures in Table 2; a χ² test shows all to be significantly different from the bold same-convention results. Accuracies are lower overall.

Within-class results are shown in Table 3 and are similar to those obtained in Experiment 1; again, differences between bold/italic results are statistically significant. Cross-class results are shown in Table 4. The happy class was well distinguished from other emotion classes for both convention types (i.e. cross-class classification accuracy is low compared to the within-class figures in italics and parentheses). The sad class also seems well distinguished when using hashtags as labels, although less so when using emoticons. However, other emotion classes show a surprisingly high cross-class performance in many cases; in other words, they are producing disappointingly similar classifiers.

This poor discrimination for negative emotion classes may be due to ambiguity or vagueness in the label, similarity of the verbal content associated with the emotions, or genuine frequent co-presence of the emotions.

Table 4: Experiment 2: Cross-class results. Same-class figures from 10-fold cross-validation are shown in (italics) for comparison; all other figures are accuracies over full sets.
    Convention  Test      Train on: happy  sad      anger    fear     surprise  disgust
    emoticon    happy     (78.1%)  17.3%    39.6%    26.7%    28.3%    42.8%
    emoticon    sad       16.5%    (78.9%)  59.1%    71.9%    69.9%    55.5%
    emoticon    anger     29.8%    67.0%    (79.7%)  74.2%    76.4%    67.5%
    emoticon    fear      27.0%    69.9%    64.4%    (75.3%)  74.0%    61.2%
    emoticon    surprise  25.4%    69.9%    67.7%    76.3%    (78.1%)  66.4%
    emoticon    disgust   42.2%    54.4%    61.1%    64.2%    64.1%    (73.9%)
    hashtag     happy     (81.1%)  10.7%    45.3%    47.8%    52.7%    43.4%
    hashtag     sad       13.8%    (77.9%)  47.7%    49.7%    46.5%    54.2%
    hashtag     anger     44.6%    45.2%    (74.3%)  72.0%    63.0%    62.9%
    hashtag     fear      45.0%    50.4%    68.6%    (74.7%)  63.9%    60.7%
    hashtag     surprise  51.5%    45.7%    67.4%    70.7%    (70.2%)  64.2%
    hashtag     disgust   40.4%    53.5%    74.7%    71.8%    70.8%    (74.2%)

In the case of ambiguity or vagueness of emoticons, we would expect emoticon-trained models to fail to discriminate hashtag-labelled test sets, but hashtag-trained models to discriminate emoticon-labelled test sets well; if on the other hand the cause lies in the overlap of verbal content or the emotions themselves, the effect should be similar in either direction. This experiment also helps determine in more detail whether the labels used label similar underlying properties.

Table 5 shows the results. For the three classes happy, sad and perhaps anger, models trained using emoticon labels do a reasonable job of distinguishing classes in hashtag-labelled data, and vice versa. However, for the other classes, discrimination is worse.

Table 3: Experiment 2: Within-class results. Same-convention (bold) figures are accuracies over 10-fold cross-validation; cross-convention (italic) figures are accuracies over full sets.

    Convention  Test      Train: emoticon   Train: hashtag
    emoticon    happy     78.1%             61.2%
    emoticon    sad       78.9%             60.2%
    emoticon    anger     79.7%             63.7%
    emoticon    fear      75.3%             55.9%
    emoticon    surprise  78.1%             53.1%
    emoticon    disgust   73.9%             51.5%
    hashtag     happy     68.7%             81.1%
    hashtag     sad       65.4%             77.9%
    hashtag     anger     63.9%             74.3%
Table 3 (continued):

    hashtag     fear      58.9%             74.7%
    hashtag     surprise  51.8%             70.2%
    hashtag     disgust   55.4%             74.2%

Given the close lexical specification of emotions in hashtag labels, the latter reasons seem more likely; however, with emoticon labels, we suspect that the emoticons themselves are often used in ambiguous or vague ways.

As one way of investigating this directly, we tested classifiers across labelling conventions as well as across emotion classes, to determine whether the (lack of) cross-class discrimination holds across convention marker types.

Emoticon-trained models appear to give (undesirably) higher performance across emotion classes in hashtag-labelled data (for the problematic non-happy classes). Hashtag-trained models perform around the random 50% level on emoticon-labelled data for those classes, even when tested on nominally the same emotion as they are trained on. For both label types, then, the lower within-class and higher cross-class performance with these negative classes (fear, surprise, disgust) suggests that these emotion classes are genuinely hard to tell apart (they are all negative emotions, and may use similar words), or are simply often expressed simultaneously. The higher performance of emoticon-trained classifiers compared to hashtag-trained classifiers, though, also suggests vagueness or ambiguity in emoticons: data labelled with emoticons nominally thought to be

Table 5: Experiment 2: Cross-class, cross-convention results (train on hashtags, test on emoticons and vice versa). All figures are accuracies over full sets. Accuracies over 60% are shown in bold.
    Convention  Test      Train on: happy  sad     anger   fear    surprise  disgust
    emoticon    happy     61.2%   40.4%   44.1%   47.4%   52.0%   45.9%
    emoticon    sad       38.3%   60.2%   55.1%   51.5%   47.1%   53.9%
    emoticon    anger     47.0%   48.0%   63.7%   56.2%   50.9%   56.6%
    emoticon    fear      39.8%   57.7%   57.1%   55.9%   50.8%   56.1%
    emoticon    surprise  43.7%   55.2%   59.2%   58.4%   53.1%   54.0%
    emoticon    disgust   51.5%   48.0%   53.5%   55.1%   53.1%   51.5%
    hashtag     happy     68.7%   32.5%   43.6%   32.1%   35.4%   50.4%
    hashtag     sad       33.8%   65.4%   53.2%   65.0%   61.8%   48.8%
    hashtag     anger     43.9%   55.5%   63.9%   59.6%   60.4%   53.0%
    hashtag     fear      44.3%   54.6%   56.1%   58.9%   61.5%   54.3%
    hashtag     surprise  54.2%   45.3%   49.8%   49.9%   51.8%   52.3%
    hashtag     disgust   41.5%   57.6%   61.6%   62.2%   59.3%   55.4%

associated with surprise produces classifiers which perform well on data labelled with many other hashtag classes, suggesting that those emotions were present in the training data. Conversely, the more specific hashtag labels produce classifiers which perform poorly on data labelled with emoticons and which thus contains a range of actual emotions.

4.3 Experiment 3: Manual labelling

To confirm whether either (or both) set of automatic (distant) labels does in fact label the underlying emotion class intended, we used human annotators via Amazon's Mechanical Turk to label a set of 1,000 instances.

Agreement was worst for the three classes already seen to be problematic: surprise, fear and disgust. To create our dataset for this experiment, we therefore took only instances which were given the same primary label by all labellers, i.e. only those examples which we could take as reliably and unambiguously labelled. This gave an unbalanced dataset, with numbers varying from 266 instances for happy to only 12 instances for each of surprise and fear. Classifiers were trained using the datasets from Experiment 2. Performance is shown in Table 6; given the imbalance between class numbers in the test dataset, evaluation is given as recall, precision and F-score
for the class in question rather than a simple accuracy figure (which is biased by the high proportion of happy examples).

These instances were all labelled with emoticons (we did not use hashtag-labelled data: as hashtags are so lexically close to the names of the emotion classes being labelled, their presence may influence labellers unduly) and were evenly distributed across the 6 classes, in so far as indicated by the emoticons. Labellers were asked to choose the primary emotion class (from the fixed set of six) associated with the message; they were also allowed to specify if any other classes were also present. Each data instance was labelled by three different annotators.

[Footnote 3: Although, of course, one may argue that they do the same for their intended audience of readers – in which case, such an effect is legitimate.]

Agreement between labellers was poor overall. The three annotators unanimously agreed in only 47% of cases overall, although two of three agreed in 83% of cases.

Table 6: Experiment 3: Results on manual labels.

    Train     Class     Precision  Recall  F-score
    emoticon  happy     79.4%      75.6%   77.5%
    emoticon  sad       43.5%      73.2%   54.5%
    emoticon  anger     62.2%      37.3%   46.7%
    emoticon  fear      6.8%       63.6%   12.3%
    emoticon  surprise  15.0%      90.0%   25.7%
    emoticon  disgust   8.3%       25.0%   12.5%
    hashtag   happy     78.9%      51.9%   62.6%
    hashtag   sad       47.9%      81.7%   60.4%
    hashtag   anger     58.2%      76.0%   65.9%
    hashtag   fear      10.1%      81.8%   18.0%
    hashtag   surprise  5.9%       60.0%   10.7%
    hashtag   disgust   6.7%       66.7%   11.8%

Again, results for happy are good, and correspond fairly closely to the levels of accuracy reported by Go et al. (2009) and others for the binary positive/negative sentiment detection task.

To avoid any effect of ordering, the order of the emoticon list and each drop-down menu was randomised every time the survey page was loaded. The survey was distributed via Twitter, Facebook and academic mailing lists.
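The per-class figures in Table 6 are standard precision, recall and F-score computed from per-class decision counts; as a minimal sketch:

```python
# Per-class evaluation for an unbalanced test set (a sketch): precision,
# recall and the harmonic-mean F-score, rather than a single accuracy
# figure that the large happy class would dominate.

def prf(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# e.g. 30 correct detections, 10 false alarms, 10 misses:
p, r, f = prf(30, 10, 10)   # all three equal 0.75 here
```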
Respondents were not given the opportunity to give their own definitions or to provide finer-grained classifications, as we wanted to establish purely whether they would reliably associate labels with the six emotions in our taxonomy.

5.2 Results

The survey was completed by 492 individuals; full results are shown in Table 7. It demonstrated agreement with the predefined emoticons for sad and most of the emoticons for happy (people were unsure what 8-| and <@o meant). For all the emoticons listed as anger, surprise and disgust, the survey showed that people are reliably unsure as to what these mean. For the emoticon :-o there was a direct contrast between the defined meaning and the survey meaning; the definition of this emoticon following Ansari (2010) was fear, but the survey reliably assigned it to surprise.

Emoticons give significantly better performance than hashtags here. Results for sad and anger are reasonable, and provide a baseline for further experiments with more advanced features and classification methods once more manually annotated data is available for these classes. In contrast, hashtags give much better performance with these classes than the (perhaps vague or ambiguous) emoticons.

The remaining emotion classes, however, show poor performance for both labelling conventions. The observed low precision and high recall can be adjusted using classifier parameters, but F-scores are not improved. Note that Experiment 1 shows that both emoticon and hashtag labels are to some extent predictable, even for these classes; however, Experiment 2 shows that they may not be reliably different from each other, and Experiment 3 tells us that they do not appear to coincide well with human annotator judgements of emotions. More reliable labels may therefore be required,
although we do note that the low reliability of the human annotations for these classes, and the correspondingly small amount of annotated data used in this evaluation, means we hesitate to draw strong conclusions about fear, surprise and disgust. An approach which considers multiple classes to be associated with individual messages may also be beneficial: using majority-decision labels rather than unanimous labels improves F-scores for surprise to 23-35% by including many examples also labelled as happy (although this gives no improvements for other classes).

5 Survey

To further determine whether emoticons used as emotion class labels are ambiguous or vague in meaning, we set up a web survey to examine whether people could reliably classify these emoticons.

5.1 Method

Our survey asked people to match up which of the six emotion classes (selected from a drop-down menu) best matched each emoticon.

Given the small scale of the survey, we hesitate to draw strong conclusions about the emoticon meanings themselves (in fact, recent conversations with schoolchildren – see below – have indicated very different interpretations from these adult survey respondents). However, we do conclude that for most emotions outside happy and sad, emoticons may indeed be an unreliable label; as hashtags also appear more reliable in the classification experiments, we expect these to be a more promising approach for fine-grained emotion discrimination in future.

6 Conclusions

The approach shows reasonable performance at individual emotion label prediction, for both emoticons and hashtags. For some emotions (happiness, sadness and anger), performance across label conventions (training on one, and testing on the other) is encouraging; for these classes, performance on those manually labelled examples where annotators agree is also reasonable. This gives us confidence not only that the approach produces reliable classifiers which can predict the
labels, but that these classifiers are actually detecting the desired underlying emotional classes, without requiring manual annotation. We therefore plan to pursue this approach with a view to improving performance by investigating training with combined mixed-convention datasets, and cross-training between classifiers trained on separate conventions.

Each drop-down menu included a 'Not Sure' option.

Table 7: Survey results showing the defined emotion, the most popular emotion from the survey, the percentage of votes this emotion received, and the χ² significance test for the distribution of votes. These are indexed by emoticon.

    Emoticon  Defined Emotion  Survey Emotion  % of votes  Significance of votes distribution
    :-)       Happy            Happy           94.9        χ² = 3051.7 (p < 0.001)
    :)        Happy            Happy           95.5        χ² = 3098.2 (p < 0.001)
    ;-)       Happy            Happy           87.4        χ² = 2541 (p < 0.001)
    :D        Happy            Happy           85.7        χ² = 2427.2 (p < 0.001)
    :P        Happy            Happy           59.1        χ² = 1225.4 (p < 0.001)
    8)        Happy            Happy           61.9        χ² = 1297.4 (p < 0.001)
    8-|       Happy            Not Sure        52.2        χ² = 748.6 (p < 0.001)
    <@o       Happy            Not Sure        84.6        χ² = 2335.1 (p < 0.001)
    :-(       Sad              Sad             91.3        χ² = 2784.2 (p < 0.001)
    :(        Sad              Sad             89.0        χ² = 2632.1 (p < 0.001)
    ;-(       Sad              Sad             67.9        χ² = 1504.9 (p < 0.001)
    :-<       Sad              Sad             56.1        χ² = 972.59 (p < 0.001)
    :'(       Sad              Sad             80.7        χ² = 2116 (p < 0.001)
    :-@       Anger            Not Sure        47.8        χ² = 642.47 (p < 0.001)
    :@        Anger            Not Sure        50.4        χ² = 691.6 (p < 0.001)
    :s        Surprise         Not Sure        52.2        χ² = 757.7 (p < 0.001)
    :$        Disgust          Not Sure        62.8        χ² = 1136 (p < 0.001)
    +o(       Disgust          Not Sure        64.2        χ² = 1298.1 (p < 0.001)
    :|        Fear             Not Sure        55.1        χ² = 803.41 (p < 0.001)
    :-o       Fear             Surprise        89.2        χ² = 2647.8 (p < 0.001)

Acknowledgements

The authors are supported in part by the Engineering and Physical Sciences Research Council (grants EP/J010383/1 and EP/J501360/1) and the Technology Strategy Board (R&D grant 700081). We thank the reviewers for their comments.
However, this cross-convention performance is much better for some emotions (happiness, sadness and anger) than others (fear, surprise and disgust). Indications are that the poor performance on these latter emotion classes is to a large degree an effect of ambiguity or vagueness of the emoticon and hashtag conventions we have used as labels here; we therefore intend to investigate other conventions with more specific and/or less ambiguous meanings, and the combination of multiple conventions to provide more accurately/specifically labelled data. Another possibility might be to investigate approaches to analyse emoticons semantically on the basis of their shape, or use features of such an analysis – see (Ptaszynski et al., 2010; Radulovic and Milikic, 2009) for some interesting recent work in this direction.

References

Saad Ansari. 2010. Automatic emotion tone detection in Twitter. Master's thesis, Queen Mary University of London.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: A library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Ze-Jing Chuang and Chung-Hsien Wu. 2004. Multimodal emotion recognition from speech and text. Computational Linguistics and Chinese Language Processing, 9(2):45–62, August.

Taner Danisman and Adil Alpkocak. 2008. Feeler: Emotion classification of text using vector space model. In AISB 2008 Convention, Communication, Interaction and Social Intelligence, volume 2, pages 53–59, Aberdeen.

Daantje Derks, Arjan Bos, and Jasper von Grumbkow. 2008a. Emoticons and online message interpretation. Social Science Computer Review, 26(3):379–388.

F. Radulovic and N. Milikic. 2009. Smiley ontology. In Proceedings of the 1st International Workshop on Social Networks Interoperability.

Jonathon Read. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics.

Daantje Derks, Arjan Bos, and Jasper von Grumbkow. 2008b.
Emoticons in computer-mediated communication: Social motives and social context. CyberPsychology & Behavior, 11(1):99–101, February.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277–1287, Cambridge, MA, October. Association for Computational Linguistics.

Paul Ekman. 1972. Universals and cultural differences in facial expressions of emotion. In J. Cole, editor, Nebraska Symposium on Motivation 1971, volume 19. University of Nebraska Press.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Master's thesis, Stanford University.

Young-Soo Seol, Dong-Joo Kim, and Han-Woo Kim. 2008. Emotion recognition from text using knowledge based ANN. In Proceedings of ITC-CSCC.

Y. Tanaka, H. Takamura, and M. Okumura. 2005. Extraction and classification of facemarks with kernel methods. In Proceedings of IUI.

Vladimir N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.

Joseph Walther and Kyle D'Addario. 2001. The impacts of emoticons on message interpretation in computer-mediated communication. Social Science Computer Review, 19(3):324–347.

J. Wiebe and E. Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-05), volume 3406 of Springer LNCS. Springer-Verlag.

Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. 2009. Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for
Natural Language Processing, pages 27–35, Boulder, Colorado, June. Association for Computational Linguistics.

Amy Ip. 2002. The impact of emoticons on affect interpretation in instant messaging. Carnegie Mellon University.

Justin Martineau. 2009. Delta TFIDF: An improved feature space for sentiment analysis. Artificial Intelligence, 29:258–261.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the 7th Conference on International Language Resources and Evaluation.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.

Robert Provine, Robert Spencer, and Darcy Mandell. 2007. Emotional expression online: Emoticons punctuate website text messages. Journal of Language and Social Psychology, 26(3):299–307.

M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka, and K. Araki. 2010. CAO: A fully automatic emoticon analysis system based on theory of kinesics. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI-10), pages 1026–1032, Atlanta, GA.

Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian

Georgi Georgiev and Valentin Zhikov
Ontotext AD
135 Tsarigradsko Sh., Sofia, Bulgaria
{georgi.georgiev,valentin.zhikov}@ontotext.com

Petya Osenova and Kiril Simov
IICT, Bulgarian Academy of Sciences
25A Acad. G. Bonchev, Sofia, Bulgaria
{petya,kivs}@bultreebank.org

Preslav Nakov
Qatar Computing Research Institute, Qatar Foundation
Tornado Tower, floor 10, P.O. Box 5825, Doha, Qatar

[email protected]

Abstract

We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian.

1 Introduction

Part-of-speech (POS) tagging is the task of assigning each of the words in a given piece of text a contextually suitable grammatical category. This is not trivial since words can play different syntactic roles in different contexts, e.g., can is a

For example, there are six tags for verbs in the Penn Treebank: VB (verb, base form; e.g., sing), VBD (verb, past tense; e.g., sang), VBG (verb, gerund or present participle; e.g., singing), VBN (verb, past participle; e.g., sung), VBP (verb, non-3rd person singular present; e.g., sing), and VBZ (verb, 3rd person singular present; e.g., sings); these tags are morpho-syntactic in nature. Other corpora have used even larger tagsets, e.g., the Brown corpus (Kučera and Francis, 1967) and the Lancaster-Oslo/Bergen (LOB) corpus (Johansson et al., 1986) use 87 and 135 tags, respectively. POS tagging poses major challenges for morphologically complex languages, whose tagsets encode a lot of additional morpho-syntactic features (for most of the basic POS categories), e.g., gender, number, person, etc. For example, the BulTreeBank (Simov et al., 2004) for Bulgarian uses 680 tags, while the Prague Dependency Treebank (Hajič, 1998) for Czech has over 1,400 tags.
noun in "I opened a can of coke." but a verb in "I can write." Traditionally, linguists have classified English words into the following eight basic POS categories: noun, pronoun, adjective, verb, adverb, preposition, conjunction, and interjection; this list is often extended a bit, e.g., with determiners, particles, participles, etc., but the number of categories considered is rarely more than 15.

Computational linguistics works with a larger inventory of POS tags, e.g., the Penn Treebank (Marcus et al., 1993) uses 48 tags: 36 for part-of-speech, and 12 for punctuation and currency symbols. This increase in the number of tags is partially due to finer granularity, e.g., there are special tags for determiners, particles, modal verbs, cardinal numbers, foreign words, existential there, etc., but also to the desire to encode morphological information as part of the tags.

Below we present experiments with POS tagging for Bulgarian, which is an inflectional language with rich morphology. Unlike most previous work, which has used a reduced set of POS tags, we use all 680 tags in the BulTreeBank. We combine prior linguistic knowledge and statistical learning, achieving accuracy comparable to that reported for state-of-the-art systems for English.

The remainder of the paper is organized as follows: Section 2 provides an overview of related work, Section 3 describes Bulgarian morphology, Section 4 introduces our approach, Section 5 describes the datasets, Section 6 presents our experiments in detail, Section 7 discusses the results, Section 8 offers application-specific error analysis, and Section 9 concludes and points to some promising directions for future work.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 492–502, Avignon, France, April 23-27 2012.
© 2012 Association for Computational Linguistics

2 Related Work

Most research on part-of-speech tagging has focused on English, and has relied on the Penn Treebank (Marcus et al., 1993) and its tagset for training and evaluation. The task is typically addressed as a sequential tagging problem; one notable exception is the work of Brill (1995), who proposed non-sequential transformation-based learning.

First, a coarse POS class is assigned (e.g., noun, verb, adjective); then, additional fine-grained morphological features like case, number and gender are added; and finally, the proposed tags are further reconsidered using non-local features. Similarly, Smith et al. (2005) decomposed the complex tags into factors, where models for predicting part-of-speech, gender, number, case, and lemma are estimated separately, and then composed into a single CRF model; this yielded competitive results for Arabic, Korean, and Czech.

Most previous work on Bulgarian POS tagging has started with large tagsets, which were then reduced. For example, Dojchinova and Mihov (2004) mapped their initial tagset of 946 tags to just 40, which allowed them to achieve 95.5% accuracy using the transformation-based learning of Brill (1995), and 98.4% accuracy using manually crafted linguistic rules. Similarly, Georgiev et al. (2009), who used maximum entropy and

A number of different sequential learning frameworks have been tried, yielding 96-97% accuracy: Lafferty et al. (2001) experimented with conditional random fields (CRFs) (95.7% accuracy), Ratnaparkhi (1996) used a maximum entropy sequence classifier (96.6% accuracy), Brants (2000) employed a hidden Markov model (96.6% accuracy), and Collins (2002) adopted an averaged perceptron discriminative sequence model (97.1% accuracy). All these models fix the order of inference from left to right. Toutanova et al.
(2003) introduced a cyclic dependency network (97.2% accuracy), where the search is bi-directional. Shen et al. (2007) have further shown that better results (97.3% accuracy) can be obtained using guided learning, a framework for bidirectional sequence classification, which integrates token classification and inference order selection into a single learning task and uses a perceptron-like (Collins and Roark, 2004) passive-aggressive classifier to make the easiest decisions first. Recently, Tsuruoka et al. (2011) proposed a simple perceptron-based classifier applied from left to right but augmented with a lookahead mechanism that searches the space of future actions, yielding 97.3% accuracy.

For morphologically complex languages, the problem of POS tagging typically includes morphological disambiguation, which yields a much larger number of tags. For example, for Arabic, Habash and Rambow (2005) used support vector machines (SVMs), achieving 97.6% accuracy with 139 tags from the Arabic Treebank (Maamouri et

the BulTreeBank (Simov et al., 2004), grouped its 680 fine-grained POS tags into 95 coarse-grained ones, and thus improved their accuracy from 90.34% to 94.4%. Simov and Osenova (2001) used a recurrent neural network to predict (a) 160 morpho-syntactic tags (92.9% accuracy) and (b) 15 POS tags (95.2% accuracy).

Some researchers did not reduce the tagset: Savkov et al. (2011) used 680 tags (94.7% accuracy), and Tanev and Mitkov (2002) used 303 tags and the BULMORPH morphological analyzer (Krushkov, 1997), achieving P=R=95%.

3 Bulgarian Morphology

Bulgarian is an Indo-European language from the Slavic language group, written with the Cyrillic alphabet and spoken by about 9-12 million people. It is also a member of the Balkan Sprachbund and thus differs from most other Slavic languages: it has no case declensions, uses a suffixed definite article (which has a short and a long form for singular masculine), and lacks verb infinitive forms.
It further uses special evidential verb forms to express unwitnessed, retold, and doubtful activities. Bulgarian is an inflective language with very rich morphology. For example, Bulgarian verbs have 52 synthetic wordforms on average, while pronouns have altogether more than ten grammatical features (not necessarily shared by all pronouns), including case, gender, person, number, definiteness, etc.

al., 2003). For Czech, Hajič et al. (2001) combined a hidden Markov model (HMM) with linguistic rules, which yielded 95.2% accuracy using an inventory of over 1,400 tags from the Prague Dependency Treebank (Hajič, 1998). For Icelandic, Dredze and Wallenberg (2008) reported 92.1% accuracy with 639 tags developed for the Icelandic frequency lexicon (Pind et al., 1991); they used guided learning and tag decomposition.

This rich morphology inevitably leads to ambiguity proliferation; our analysis of the BulTreeBank shows four major types of ambiguity:

1. Between the wordforms of the same lexeme, i.e., in the paradigm. For example, divana, an inflected form of divan ('sofa', masculine), can mean (a) 'the sofa' (definite, singular, short definite article) or (b) a count form, e.g., as in dva divana ('two sofas').

2. Between two or more lexemes, i.e., conversion. For example, kato can be (a) a subordinator meaning 'as, when', or (b) a preposition meaning 'like, such as'.

In many cases, strong domain preferences exist about how various systematic ambiguities should be resolved. We made a study for the newswire domain, analyzing a corpus of 546,029 words, and we found that ambiguity type 2 (lexeme-lexeme) prevailed for functional parts-of-speech, while the other types were more frequent for inflecting parts-of-speech. Below we show the most frequent types of morpho-syntactic ambiguities and their frequency in our corpus:

• na: preposition ('of') vs. emphatic particle, with a ratio of 28,554 to 38;

• da: auxiliary particle ('to') vs. affirmative
particle, with a ratio of 12,035 to 543; 3. Between a lexeme and an inflected wordform • e: 3rd person present auxiliary verb (‘to be’) of another lexeme, i.e., across-paradigms. vs. particle (‘well’) vs. interjection (‘wow’), For example, politika can mean (a) ‘the with a ratio of 9,136 to 21 to 5; politician’ (masculine, singular, definite, • singular masculine noun with a short definite short definite article) or (b) ‘politics’ (fem- article vs. count form of a masculine noun, inine, singular, indefinite). with a ratio of 6,437 to 1,592; • adverb vs. neuter singular adjective, with a 4. Between the wordforms of two or more ratio of 3,858 to 1,753. lexemes, i.e., across-paradigms and quasi- Overall, the following factors should be taken conversion. For example, vrvi can mean into account when modeling Bulgarian morpho- (a) ‘walks’ (verb, 2nd or 3rd person, present syntax: (1) locality vs. non-locality of grammat- tense) or (b) ‘strings, laces’ (feminine, plu- ical features, (2) interdependence of grammatical ral, indefinite). features, and (3) domain-specific preferences. Some morpho-syntactic ambiguities in Bulgar- ian are occasional, but many are systematic, e.g., 4 Method neuter singular adjectives have the same forms We used the guided learning framework described as adverbs. Overall, most ambiguities are local, in (Shen et al., 2007), which has yielded state-of- and thus arguably resolvable using n-grams, e.g., the-art results for English and has been success- compare hubavo dete (‘beautiful child’), where fully applied to other morphologically complex hubavo is a neuter adjective, and “Pe hubavo.” languages such as Icelandic (Dredze and Wallen- (‘I sing beautifully.’), where it is an adverb of berg, 2008); we found it quite suitable for Bul- manner. Other ambiguities, however, are non- garian as well. 
We used the feature set defined in local and may require discourse-level analysis, (Shen et al., 2007), which includes the following: e.g., “Vidh go.” can mean ‘I saw him.’, where go is a masculine pronoun, or ’I saw it.’, where 1. The feature set of Ratnaparkhi (1996), in- it is a neuter pronoun. Finally, there are ambi- cluding prefix, suffix and lexical, as well as guities that are very hard or even impossible1 to some bigram and trigram context features; resolve, e.g., “Deteto vleze veselo.” can mean 2. Feature templates as in (Ratnaparkhi, 1996), both ‘The child came in happy.’ (veselo is an ad- which have been shown helpful in bidirec- jective) and ‘The child came in happily.’ (it is an tional search; adverb); however, the latter is much more likely. 1 The problem also exists for English, e.g., the annotators 3. More bigram and trigram features and bi- of the Penn Treebank were allowed to use tag combinations lexical features as in (Shen et al., 2007). for inherently ambiguous cases: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present Note that we allowed prefixes and suffixes of participle), JJ|VBN (adjective or past participle), NN|VBG length up to 9, as in (Toutanova et al., 2003) and (noun or gerund), and RB|RP (adverb or particle). (Tsuruoka and Tsujii, 2005). 494 We further extended the set of features with The rules are quite efficient at reducing the POS the tags proposed for the current word token by a ambiguity. On the test dataset, before the rule ap- morphological lexicon, which maps words to pos- plication, 34.2% of the tokens (excluding punctu- sible tags; it is exhaustive, i.e., the correct tag is ation) had more than one tag in our morphological always among the suggested ones for each token. lexicon. This number is reduced to 18.5% after We also used 70 linguistically-motivated, high- the cascaded application of the 70 linguistic rules. 
precision rules in order to further reduce the num- Table 1 illustrates the effect of the rules on a small ber of possible tags suggested by the lexicon. sentence fragment. In this example, the rules have The rules are similar to those proposed by Hin- left only one tag (the correct one) for three of the richs and Trushkina (2004) for German; we im- ambiguous words. Since the rules in essence de- plemented them as constraints in the CLaRK sys- crease the average number of tags per token, we tem (Simov et al., 2003). calculated that the lexicon suggests 1.6 tags per Here is an example of a rule: If a wordform token on average, and after the application of the is ambiguous between a masculine count noun rules this number decreases to 1.44 per token. (Ncmt) and a singular short definite masculine 5 Datasets noun (Ncmsh), the Ncmt tag should be chosen if the previous token is a numeral or a number. 5.1 BulTreeBank The 70 rules were developed by linguists based We used the latest version of the BulTree- on observations over the training dataset only. Bank (Simov and Osenova, 2004), which contains They target primarily the most frequent cases of 20,556 sentences and 321,542 word tokens (four ambiguity, and to a lesser extent some infrequent times less than the English Penn Treebank), anno- but very problematic cases. Some rules operate tated using a total of 680 unique morpho-syntactic over classes of words, while other refer to partic- tags. See (Simov et al., 2004) for a detailed de- ular wordforms. The rules were designed to be scription of the BulTreeBank tagset. 100% accurate on our training dataset; our exper- We split the data into training/development/test iments show that they are also 100% accurate on as shown in Table 2. Note that only 552 of all 680 the test and on the development dataset. 
tag types were used in the training dataset, and Note that some of the rules are dependent on the development and the test datasets combined others, and thus the order of their cascaded appli- contain a total of 128 new tag types that were not cation is important. For example, the wordform seen in the training dataset. Moreover, 32% of the is ambiguous between an accusative feminine sin- word types in the development dataset and 31% gular short form of a personal pronoun (‘her’) and of those in the testing dataset do not occur in the an interjection (‘wow’). To handle this properly, training dataset. Thus, data sparseness is an issue the rule for interjection, which targets sentence at two levels: word-level and tag-level. initial positions, followed by a comma, needs to Dataset Sentences Tokens Types Tags be executed first. The rule for personal pronouns Train 16,532 253,526 38,659 552 is only applied afterwards. Dev 2,007 32,995 9,635 425 Test 2,017 35,021 9,627 435 Word Tags To$i Ppe-os3m Table 2: Statistics about our datasets. obaqe Cc; Dd nma Afsi; Vnitf-o3s; Vnitf-r3s; 5.2 Morphological Lexicon Vpitf-o2s; Vpitf-o3s; Vpitf-r3s vzmonost Ncfsi In order to alleviate the data sparseness issues, da Ta;Tx we further used a large morphological lexicon for sledi Ncfpi; Vpitf-o2s; Vpitf-o3s; Vpitf-r3s; Bulgarian, which is an extended version of the Vpitz–2s dictionary described in (Popov et al., 1998) and ... ... (Popov et al., 2003). It contains over 1.5M in- Table 1: Sample fragment showing the possible tags flected wordforms (for 110K lemmata and 40K suggested by the lexicon. The tags that are further proper names), each mapped to a set of possible filtered by the rules are in italic; the correct tag is bold. morpho-syntactic tags. 
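To make the rule mechanism concrete, here is a minimal Python sketch of the example count-noun rule from Section 4, applied as a filter over lexicon-suggested tag sets. This is illustrative only: the actual rules are implemented as CLaRK constraints, and the toy lexicon entry for dva and the numeral-tag check are assumptions of the sketch.

```python
def looks_like_numeral(token, tags):
    # BulTreeBank numeral tags start with 'M'; tokens written with
    # digits also count as numbers (both assumptions of this sketch).
    return token.isdigit() or any(t.startswith("M") for t in tags)

def apply_count_noun_rule(tokens, lexicon):
    """Filter lexicon suggestions with the example rule: if a word is
    ambiguous between a masculine count noun (Ncmt) and a singular
    short definite masculine noun (Ncmsh), keep only Ncmt when the
    previous token is a numeral or a number."""
    suggestions = [set(lexicon[t]) for t in tokens]
    for i in range(1, len(tokens)):
        if (suggestions[i] == {"Ncmt", "Ncmsh"}
                and looks_like_numeral(tokens[i - 1], suggestions[i - 1])):
            suggestions[i] = {"Ncmt"}
    return suggestions

# Toy lexicon for "dva divana" ('two sofas'); the tag for dva is hypothetical.
lexicon = {"dva": {"Mc-pi"}, "divana": {"Ncmt", "Ncmsh"}}
print(apply_count_noun_rule(["dva", "divana"], lexicon))
# [{'Mc-pi'}, {'Ncmt'}]
```

A full implementation would apply all 70 rules in their fixed cascade order, since, as noted above, some rules presuppose the effect of others.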
6 Experiments and Evaluation

State-of-the-art POS taggers for English typically build a lexicon containing all tags a word type has taken in the training dataset; this lexicon is then used to limit the set of possible tags that an input token can be assigned, i.e., it imposes a hard constraint on the possibilities explored by the POS tagger. For example, if can has only been tagged as a verb and as a noun in the training dataset, it will be only assigned those two tags at test time; other tags such as adjective, adverb and pronoun will not be considered. Out-of-vocabulary words, i.e., those that were not seen in the training dataset, are constrained as well, e.g., to a small set of frequent open-class tags.

In our experiments, we used a morphological lexicon that is much larger than what could be built from the training corpus only: building a lexicon from the training corpus only is of limited utility since one can hardly expect to see in the training corpus all 52 synthetic forms a verb can possibly have. Moreover, we did not use the tags listed in the lexicon as hard constraints (except in one of our baselines); instead, we experimented with a different, non-restrictive approach: we used the lexicon's predictions as features or soft constraints, i.e., as suggestions only, thus allowing each token to take any possible tag. Note that for both known and out-of-vocabulary words we used all 680 tags rather than the 552 tags observed in the training dataset; we could afford to explore this huge search space thanks to the efficiency of the guided learning framework. Allowing all 680 tags on training helped the model by exposing it to a larger set of negative examples.

We combined these lexicon features with standard features extracted from the training corpus. We further experimented with the 70 contextual linguistic rules, using them (a) as soft and (b) as hard constraints. Finally, we set four baselines: three that do not use the lexicon and one that does.

6.1 Baselines

First, we experimented with the most-frequent-tag baseline, which is standard for POS tagging. This baseline ignores context altogether and assigns each word type the POS tag it was most frequently seen with in the training dataset; ties are broken randomly. We coped with word types not seen in the training dataset using three simple strategies: (a) we considered them all wrong, (b) we assigned them Ncmsi, which is the most frequent open-class tag in the training dataset, or (c) we used a very simple guesser, which assigned Ncfsi, Ncnsi, Ncfsi, and Ncmsf if the target word ended in -a, -o, -i, and -t, respectively; otherwise, it assigned Ncmsi. The results are shown in lines 1-3 of Table 3: we can see that the token-level accuracy ranges between 78% and 80% for (a)-(c), which is relatively high, given that we use a large inventory of 680 morpho-syntactic tags.

We further tried a baseline that uses the above-described morphological lexicon, in addition to the training dataset. We first built two frequency lists, containing respectively (1) the most frequent tag in the training dataset for each word type, as before, and (2) the most frequent tag in the training dataset for each class of tags that can be assigned to some word type, according to the lexicon. For example, the most frequent tag for politika is Ncfsi, and the most frequent tag for the tag-class {Ncmt;Ncmsi} is Ncmt. Given a target word type, this new baseline first tries to assign it the most frequent tag from the first list. If this is not possible, which happens (i) in case of ties or (ii) when the word type was not seen on training, it extracts the tag-class from the lexicon and consults the second list. If there is a single most frequent tag in the corpus for this tag-class, it is assigned; otherwise a random tag from this tag-class is selected.

Line 4 of Table 3 shows that this latter baseline achieves a very high accuracy of 94.40%. Note, however, that this is over-optimistic: the lexicon contains a tag-class for each word type in our testing dataset, i.e., while there can be word types not seen in the training dataset, there are no word types that are not listed in the lexicon. Thus, this high accuracy is probably due to a large extent to the scale and quality of our morphological lexicon, and it might not be as strong with smaller lexicons; we plan to investigate this in future work.

                                    Accuracy (%)
#  Baselines                        (token-level)
1  MFT + unknowns are wrong         78.10
2  MFT + unknowns are Ncmsi         78.52
3  MFT + guesser for unknowns       79.49
4  MFT + lexicon tag-classes        94.40

Table 3: Most-frequent-tag (MFT) baselines.

6.2 Lexicon Tags as Soft Constraints

We experimented with three types of features:

1. Word-related features only;

2. Word-related features + the tags suggested by the lexicon;

3. Word-related features + the tags suggested by the lexicon but then further filtered using the 70 contextual linguistic rules.

Table 4 shows the sentence-level and the token-level accuracy on the test dataset for the three kinds of features, shown on lines 1, 3 and 4, respectively. We can see that using the tags proposed by the lexicon as features (lines 3 and 4) has a major positive impact, yielding up to 49% error reduction at the token-level and up to 37% at the sentence-level, as compared to using word-related features alone (line 1).

Interestingly, filtering the tags proposed by the lexicon using the 70 contextual linguistic rules yields a minor decrease in accuracy both at the token-level and at the sentence-level (compare line 4 to line 3). This is surprising since the linguistic rules are extremely reliable: they were designed to be 100% accurate on the training dataset, and we found them experimentally to be 100% correct on the development and on the testing dataset as well.

One possible explanation is that by limiting the set of available tags for a given token at training time, we prevent the model from observing some potentially useful negative examples. We tested this hypothesis by using the unfiltered lexicon predictions at training time but then making use of the filtered ones at testing time; the results are shown on line 5. We can observe a small increase in accuracy compared to line 4: from 97.80% to 97.84% at the token-level, and from 70.30% to 70.40% at the sentence-level. Although these differences are tiny, they suggest that having more negative examples at training is helpful.

We can conclude that using the lexicon as a source of soft constraints has a major positive impact, e.g., because it provides access to important external knowledge that is complementary to what can be learned from the training corpus alone; the improvements when using the linguistic rules as soft constraints are more limited.

6.3 Linguistic Rules as Hard Constraints

Next, we experimented with using the suggestions of the linguistic rules as hard constraints. Table 4 shows that this is a very good idea. Comparing line 1 to line 2, which do not use the morphological lexicon, we can see very significant improvements: from 95.72% to 97.20% at the token-level and from 52.95% to 64.50% at the sentence-level. The improvements are smaller but still consistent when the morphological lexicon is used: comparing lines 3 and 4 to lines 6 and 7, respectively, we see an improvement from 97.83% to 97.91% and from 97.80% to 97.93% at the token-level, and about 1% absolute at the sentence-level.

6.4 Increasing the Beam Size

Finally, we increased the beam size of guided learning from 1 to 3 as in (Shen et al., 2007). Comparing line 7 to line 8 in Table 4, we can see that this yields a further token-level improvement: from 97.93% to 97.98%.

7 Discussion

Table 5 compares our results to previously reported evaluation results for Bulgarian. The first four lines show the token-level accuracy for standard POS tagging tools trained and evaluated on the BulTreeBank:2 TreeTagger (Schmid, 1994), which uses decision trees, TnT (Brants, 2000), which uses a hidden Markov model, SVMtool (Gimenez and Marquez, 2004), which is based on support vector machines, and ACOPOST (Schroder, 2002), implementing the memory-based model of Daelemans et al. (1996). The following lines report the token-level accuracy reported in previous work, as compared to our own experiments using guided learning.

We can see that we outperform by a very large margin (92.53% vs. 97.98%, which represents 73% error reduction) the systems from the first four lines, which are directly comparable to our experiments: they are trained and evaluated on the BulTreeBank using the full inventory of 680 tags. We further achieved a statistically significant improvement (p < 0.0001; Pearson's chi-squared test (Plackett, 1983)) over the best previous result on 680 tags: from 94.65% to 97.98%, which represents 62.24% error reduction at the token-level.

2 We used the pre-trained TreeTagger; for the rest, we report the accuracy given on the Webpage of the BulTreeBank: www.bultreebank.org/taggers/taggers.html

   Lexicon      Linguistic Rules (applied to filter):           Beam  Accuracy (%)
#  (source of)  (a) the lexicon features  (b) the output tags   size  Sentence-level  Token-level
1  –            –                         –                     1     52.95           95.72
2  –            –                         yes                   1     64.50           97.20
3  features     –                         –                     1     70.40           97.83
4  features     yes                       –                     1     70.30           97.80
5  features     yes, for test only        –                     1     70.40           97.84
6  features     –                         yes                   1     71.34           97.91
7  features     yes                       yes                   1     71.69           97.93
8  features     yes                       yes                   3     71.94           97.98

Table 4: Evaluation results on the test dataset. Line 1 shows the evaluation results when using features derived from the text corpus only; these features are used by all systems in the table.
Line 2 further uses the contextual linguistic rules to limit the set of possible POS tags that can be predicted. Note that these rules (1) consult the lexicon, and (2) always predict a single POS tag. Line 3 uses the POS tags listed in the lexicon as features, i.e., as soft suggestions only. Line 4 is like line 3, but the list of feature-tags proposed by the lexicon is filtered by the contextual linguistic rules. Line 5 is like line 4, but the linguistic rules filtering is only applied at test time; it is not done on training. Lines 6 and 7 are similar to lines 3 and 4, respectively, but here the linguistic rules are further applied to limit the set of possible POS tags that can be predicted, i.e., the rules are used as hard constraints. Finally, line 8 is like line 7, but here the beam size is increased to 3.

Overall, we improved over almost all previously published results. Our accuracy is second only to the manual rules approach of Dojchinova and Mihov (2004). Note, however, that they used 40 tags only, i.e., their inventory is 17 times smaller than ours. Moreover, they have optimized their tagset specifically to achieve very high POS tagging accuracy by choosing not to attempt to resolve some inherently hard systematic ambiguities, e.g., they do not try to choose between second and third person past singular verbs, whose inflected forms are identical in Bulgarian and hard to distinguish when the subject is not present (Bulgarian is a pro-drop language).

In order to compare our results more closely to the smaller tagsets in Table 5, we evaluated our best model with respect to (a) the first letter of the tag only (which is part-of-speech only, with no morphological information; 13 tags), e.g., Ncmsf becomes N, and (b) the first two letters of the tag (POS + limited morphological information; 49 tags), e.g., Ncmsf becomes Nc. This yielded 99.30% accuracy for (a) and 98.85% for (b). The latter improves over (Dojchinova and Mihov, 2004), while using a slightly larger number of tags.

Our best token-level accuracy of 97.98% is comparable to and even slightly better than the state-of-the-art results for English: 97.33% when using Penn Treebank data only (Shen et al., 2007), and 97.50% for the Penn Treebank plus some additional unlabeled data (Søgaard, 2011). Of course, our results are only indirectly comparable to English. Still, our performance is impressive because (1) our model is trained on 253,526 tokens only, while the standard training sections 0-18 of the Penn Treebank contain a total of 912,344 tokens, i.e., almost four times more, and (2) we predict 680 rather than just 48 tags as for the Penn Treebank, which is 14 times more.

Note, however, that (1) we used a large external morphological lexicon for Bulgarian, which yielded about 50% error reduction (without it, our accuracy was 95.72% only), and (2) our train/dev/test sentences are generally shorter, and thus arguably simpler for a POS tagger to analyze: we have 17.4 words per test sentence in the BulTreeBank vs. 23.7 in the Penn Treebank.

Our results also compare favorably to the state-of-the-art results for other morphologically complex languages that use large tagsets, e.g., 95.2% for Czech with 1,400+ tags (Hajic et al., 2001), 92.1% for Icelandic with 639 tags (Dredze and Wallenberg, 2008), and 97.6% for Arabic with 139 tags (Habash and Rambow, 2005).

Tool/Authors                  Method                             # Tags  Accuracy (token-level, %)
*TreeTagger                   Decision Trees                     680     89.21
*ACOPOST                      Memory-based Learning              680     89.91
*SVMtool                      Support Vector Machines            680     92.22
*TnT                          Hidden Markov Model                680     92.53
(Georgiev et al., 2009)       Maximum Entropy                    680     90.34
(Simov and Osenova, 2001)     Recurrent Neural Network           160     92.87
(Georgiev et al., 2009)       Maximum Entropy                    95      94.43
(Savkov et al., 2011)         SVM + Lexicon + Rules              680     94.65
(Tanev and Mitkov, 2002)      Manual Rules                       303     95.00 (=P=R)
(Simov and Osenova, 2001)     Recurrent Neural Network           15      95.17
(Dojchinova and Mihov, 2004)  Transformation-based Learning      40      95.50
(Dojchinova and Mihov, 2004)  Manual Rules + Lexicon             40      98.40
This work                     Guided Learning                    680     95.72
This work                     Guided Learning + Lexicon          680     97.83
This work                     Guided Learning + Lexicon + Rules  680     97.98
This work                     Guided Learning + Lexicon + Rules  49      98.85
This work                     Guided Learning + Lexicon + Rules  13      99.30

Table 5: Comparison to previous work for Bulgarian. The first four lines report evaluation results for various standard POS tagging tools, which were retrained and evaluated on the BulTreeBank. The following lines report token-level accuracy for previously published work, as compared to our own experiments using guided learning.

8 Error Analysis

In this section, we present an error analysis with respect to the impact of the POS tagger's performance on other processing steps in a natural language processing pipeline, such as lemmatization and syntactic dependency parsing. First, we explore the most frequently confused pairs of tags for our best-performing POS tagging system; these are shown in Table 6.
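Confusion counts of this kind can be produced by a straightforward tally over the tagger's errors. A minimal Python sketch, using made-up gold and predicted tag sequences rather than our actual tagger output:

```python
from collections import Counter

def most_confused_pairs(gold, pred, top_n=3):
    """Count (gold tag, predicted tag) pairs over the tagging errors,
    the statistic behind a most-frequently-confused-tags table."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(top_n)

# Toy sequences (illustrative only).
gold = ["Ansi", "Dm", "Ansi", "Ncfsi", "Vpitf-r3s"]
pred = ["Dm", "Dm", "Dm", "Ncfsi", "Vnitf-r3s"]
print(most_confused_pairs(gold, pred, 2))
# [(('Ansi', 'Dm'), 2), (('Vpitf-r3s', 'Vnitf-r3s'), 1)]
```

The same tally, restricted to pairs whose tags share their first letter, separates within-POS feature errors from genuine POS-label errors, the distinction drawn in the analysis below.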
Freq.  Gold Tag      Proposed Tag
43     Ansi          Dm
23     Vpitf-r3s     Vnitf-r3s
16     Npmsh         Npmsi
14     Vpiif-r3s     Vniif-r3s
13     Npfsd         Npfsi
12     Dm            Ansi
12     Vpitcam-smi   Vpitcao-smi
12     Vpptf-r3p     Vpitf-r3p
11     Vpptf-r3s     Vpptf-o3s
10     Mcmsi         Pfe-os-mi
10     Ppetas3n      Ppetas3m
10     Ppetds3f      Psot–3–f
9      Npnsi         Npnsd
9      Vpptf-o3s     Vpptf-r3s
8      Dm            A-pi
8      Ppxts         Ppxtd
7      Mcfsi         Pfe-os-fi
7      Npfsi         Npfsd
7      Ppetas3m      Ppetas3n
7      Vnitf-r3s     Vpitf-r3s
7      Vpitcam-p-i   Vpitcao-p-i

Table 6: Most frequently confused pairs of tags.

We can see that most of the wrong tags share the same part-of-speech (indicated by the initial uppercase letter), such as V for verb, N for noun, etc. This means that most errors concern the morpho-syntactic features, e.g., personal vs. impersonal verb, definite vs. indefinite feminine noun, or singular vs. plural masculine adjective. At the same time, there are also cases where the error has to do with the part-of-speech label itself, for example, between an adjective and an adverb, or between a numeral and an indefinite pronoun.

We want to use the above tagger to develop (1) a rule-based lemmatizer, using the morphological lexicon, e.g., as in (Plisson et al., 2004), and (2) a dependency parser like MaltParser (Nivre et al., 2007), trained on the dependency part of the BulTreeBank. We thus study the potential impact of wrong tags on the performance of these tools.

The lemmatizer relies on the lexicon and uses string transformation functions defined via two operations – remove and concatenate:

if tag = Tag then {remove OldEnd; concatenate NewEnd}

where Tag is the tag of the wordform, OldEnd is the string that has to be removed from the end of the wordform, and NewEnd is the string that has to be concatenated to the end of the wordform in order to produce the lemma.

Here is an example of such a rule:

if tag = Vpitf-o1s then {remove oh; concatenate a}

The application of the above rule to the past simple verb form chetoh ('I read') would remove oh and then concatenate a. The result would be the correct lemma cheta ('to read').

Such rules are generated for each wordform in the morphological lexicon; the above functional representation allows for a compact representation in a finite state automaton. Similar rules are applied to the unknown words, where the lemmatizer tries to guess the correct lemma.

Obviously, the applicability of each rule crucially depends on the output of the POS tagger. If the tagger suggests the correct tag, then the wordform would be lemmatized correctly. Note that, in some cases of wrongly assigned POS tags in a given context, we might still get the correct lemma. This is possible in the majority of the erroneous cases in which the part-of-speech has been assigned correctly, but the wrong grammatical alternative has been selected. In such cases, the error does not influence lemmatization.

In order to calculate the proportion of such cases, we divided each tag into two parts: (a) grammatical features that are common for all wordforms of a given lemma, and (b) features that are specific to the wordform. The part-of-speech features are always determined by the lemma. For example, Bulgarian verbs have the lemma features aspect and transitivity. If they are correct, then the lemma is also predicted correctly, regardless of whether the wordform-specific grammatical features are correct or wrong. For example, if the verb participle form (aorist or imperfect) has its correct aspect and transitivity, then it is lemmatized correctly as well, regardless of whether the imperfect or aorist features were guessed correctly; similarly for other error types. We evaluated these cases for the 711 errors in our experiment, and we found that 206 of them (about 29%) were non-problematic for lemmatization.

For the MaltParser, we encode most of the grammatical features of the wordforms as specific features for the parser. Hence, it is much harder to evaluate the problematic cases due to the tagger. Still, we were able to make an estimation for some cases. Our strategy was to ignore the grammatical features that do not always contribute to the syntactic behavior of the wordforms. Such grammatical features for the verbs are aspect and tense. Thus, proposing perfective instead of imperfective for a verb, or present instead of past tense, would not cause problems for the MaltParser. Among our 711 errors, 190 cases (or about 27%) were not problematic for parsing.

Finally, we should note that there are two special classes of tokens for which it is generally hard to predict some of the grammatical features: (1) abbreviations and (2) numerals written with digits. In sentences, they participate in agreement relations only if they are pronounced as whole phrases; unfortunately, it is very hard for the tagger to guess such relations since it does not have at its disposal enough features, such as the inflection of the numeral form, that might help detect and use the agreement pattern.

9 Conclusion and Future Work

We have presented experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work for this language, which has limited the number of possible tags, we used a very rich tagset of 680 morpho-syntactic tags as defined in the BulTreeBank. By combining a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, we achieved an accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian. Our token-level accuracy is also comparable to the best results reported for English.

In future work, we want to experiment with a richer set of features, e.g., derived from unlabeled data (Søgaard, 2011) or from the Web (Umansky-Pesin et al., 2010; Bansal and Klein, 2011). We further plan to explore ways to decompose the complex Bulgarian morpho-syntactic tags, e.g., as proposed in (Simov and Osenova, 2001) and (Smith et al., 2005). Modeling long-distance syntactic dependencies (Dredze and Wallenberg, 2008) is another promising direction; we believe this can be implemented efficiently using posterior regularization (Graca et al., 2009) or expectation constraints (Bellare et al., 2009).

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments, which have helped us improve the paper. The research presented above has been partially supported by the EU FP7 project 231720 EuroMatrixPlus, and by the SmartBook project, funded by the Bulgarian National Science Fund under grant D002-111/15.12.2008.

References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT '11, pages 693–702, Portland, Oregon, USA.

Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 43–50, Montreal, Quebec, Canada.

Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference, ANLP '00, pages 224–231, Seattle, Washington, USA.

Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21:543–565.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics, Main Volume, ACL '04, pages 111–118, Barcelona, Spain.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '02, pages 1–8, Philadelphia, PA, USA.

Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. 1996. MBT: A memory-based part of speech tagger generator. In Eva Ejerhed and Ido Dagan, editors, Fourth Workshop on Very Large Corpora, pages 14–27, Copenhagen, Denmark.

Veselka Dojchinova and Stoyan Mihov. 2004. High performance part-of-speech tagging of Bulgarian. In Christoph Bussler and Dieter Fensel, editors, AIMSA, volume 3192 of Lecture Notes in Computer Science, pages 246–255. Springer.

Mark Dredze and Joel Wallenberg. 2008. Icelandic data driven part of speech tagging. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Short Papers, ACL '08, pages 33–36, Columbus, Ohio, USA.

Georgi Georgiev, Preslav Nakov, Petya Osenova, and Kiril Simov. 2009. Cross-lingual adaptation as a baseline: adapting maximum entropy models to Bulgarian. In Proceedings of the RANLP'09 Workshop on Adaptation of Language Resources and Technology to New Domains, AdaptLRTtoND '09, pages 35–38, Borovets, Bulgaria.

Jesus Gimenez and Lluis Marquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC '04, Lisbon, Portugal.

Joao Graca, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs parameter sparsity in latent variable models. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, Advances in Neural Information Processing Systems 22, NIPS '09, pages 664–672. Curran Associates, Inc., Vancouver, British Columbia, Canada.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 573–580, Ann Arbor, Michigan.

Jan Hajic, Pavel Krbec, Pavel Kveton, Karel Oliva, and Vladimir Petkevic. 2001. Serial combination of rules and statistics: A case study in Czech tagging. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL '01, pages 268–275, Toulouse, France.

Jan Hajic. 1998. Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In Eva Hajicova, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevova, pages 12–19. Prague Karolinum, Charles University Press.

Erhard W. Hinrichs and Julia S. Trushkina. 2004. Forging agreement: Morphological disambiguation of noun phrases. Research on Language & Computation, 2:621–648.

Stig Johansson, Eric Atwell, Roger Garside, and Geoffrey Leech. 1986. The Tagged LOB Corpus: Users' manual. ICAME, The Norwegian Computing Centre for the Humanities, Bergen University, Norway.

Hristo Krushkov. 1997. Modelling and building machine dictionaries and morphological processors (in Bulgarian). Ph.D. thesis, University of Plovdiv, Faculty of Mathematics and Informatics, Plovdiv, Bulgaria.

Henry Kucera and Winthrop Nelson Francis. 1967. Computational analysis of present-day American English. Brown University Press, Providence, RI.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v 2.0. LDC2003T06.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gulsen Eryigit, Sandra Kubler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95–135.

Jorgen Pind, Fridrik Magnusson, and Stefan Briem. 1991. The Icelandic frequency dictionary. Technical report, The Institute of Lexicography, University of Iceland, Reykjavik, Iceland.

Robin L. Plackett. 1983. Karl Pearson and the Chi-Squared Test. International Statistical Review / Revue Internationale de Statistique, 51(1):59–72.

Joel Plisson, Nada Lavrac, and Dunja Mladenic. 2004. A rule based approach to word lemmatization. In Proceedings of the 7th International Multiconference: Information Society, IS '2004, pages 83–86, Ljubljana, Slovenia.

Dimitar Popov, Kiril Simov, and Svetlomira Vidinska. 1998. Dictionary of Writing, Pronunciation and Punctuation of Bulgarian Language (in Bulgarian).

Kiril Simov and Petya Osenova. 2004. BTB-TR04: BulTreeBank morphosyntactic annotation of Bulgarian texts. Technical Report BTB-TR04, Bulgarian Academy of Sciences.

Kiril Ivanov Simov, Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Ilko Grigorov, and Hristo Ganev. 2003. Development of corpora within the CLaRK system: The BulTreeBank project experience. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, EACL '03, pages 243–246, Budapest, Hungary.

Kiril Simov, Petya Osenova, and Milena Slavcheva. 2004. BTB-TR03: BulTreeBank morphosyntactic tagset. Technical Report BTB-TR03, Bulgarian Academy of Sciences.

Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-based morphological disambiguation with random fields. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 475–482, Vancouver, British Columbia, Canada.
Atlantis KL, Sofia, Bulgaria. Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Pro- Dimityr Popov, Kiril Simov, Svetlomira Vidinska, and ceedings of the 49th Annual Meeting of the Associa- Petya Osenova. 2003. Spelling Dictionary of Bul- tion for Computational Linguistics, ACL-HLT ’10, garian. Nauka i izkustvo, Sofia, Bulgaria. pages 48–52, Portland, Oregon, USA. Adwait Ratnaparkhi. 1996. A maximum entropy Hristo Tanev and Ruslan Mitkov. 2002. Shallow model for part-of-speech tagging. In Eva Ejerhed language processing architecture for Bulgarian. In and Ido Dagan, editors, Fourth Workshop on Very Proceedings of the 19th International Conference Large Corpora, pages 133–142, Copenhagen, Den- on Computational Linguistics, COLING ’02, pages mark. 1–7, Taipei, Taiwan. Aleksandar Savkov, Laska Laskova, Petya Osenova, Kristina Toutanova, Dan Klein, Christopher D. Man- Kiril Simov, and Stanislava Kancheva. 2011. ning, and Yoram Singer. 2003. Feature-rich A web-based morphological tagger for Bulgarian. part-of-speech tagging with a cyclic dependency In Daniela Majchr´akov´a and Radovan Garab´ık, network. In Proceedings of the Conference of editors, Slovko 2011. Sixth International Confer- the North American Chapter of the Association ence. Natural Language Processing, Multilingual- for Computational Linguistics, NAACL ’03, pages ity, pages 126–137, Modra/Bratislava, Slovakia. 173–180, Edmonton, Canada. Helmut Schmid. 1994. Probabilistic part-of-speech Yoshimasa Tsuruoka and Jun’ichi Tsujii. 2005. Bidi- tagging using decision trees. In International Con- rectional inference with the easiest-first strategy ference on New Methods in Language Processing, for tagging sequence data. In Proceedings of the pages 44–49, Manchester, UK. Conference on Human Language Technology and Ingo Schr¨oder. 2002. A case study in part-of-speech- Empirical Methods in Natural Language Process- tagging using the ICOPOST toolkit. 
Instance-Driven Attachment of Semantic Annotations over Conceptual Hierarchies

Janara Christensen*                         Marius Paşca
University of Washington                    Google Inc.
Seattle, Washington 98195                   Mountain View, California 94043

[email protected] [email protected]

Abstract

Whether automatically extracted or human generated, open-domain factual knowledge is often available in the form of semantic annotations (e.g., composed-by) that take one or more specific instances (e.g., rhapsody in blue, george gershwin) as their arguments. This paper introduces a method for converting flat sets of instance-level annotations into hierarchically organized, concept-level annotations, which capture not only the broad semantics of the desired arguments (e.g., 'People' rather than 'Locations'), but also the correct level of generality (e.g., 'Composers' rather than 'People' or 'Jazz Composers'). The method refrains from encoding features specific to a particular domain or annotation, to ensure immediate applicability to new, previously unseen annotations. Over a gold standard of semantic annotations and concepts that best capture their arguments, the method substantially outperforms three baselines, on average computing concepts that are less than one step in the hierarchy away from the corresponding gold standard concepts.

1 Introduction

Background: Knowledge about the world can be thought of as semantic assertions or annotations, at two levels of granularity: instance level (e.g., rhapsody in blue, tristan und isolde, george gershwin, richard wagner) and concept level (e.g., 'Musical Compositions', 'Works of Art', 'Composers'). Instance-level annotations correspond to factual knowledge that can be found in repositories extracted automatically from text (Banko et al., 2007; Wu and Weld, 2010) or manually created within encyclopedic resources (Remy, 2002). Such facts could state, for instance, that rhapsody in blue was composed-by george gershwin, or that tristan und isolde was composed-by richard wagner. In comparison, concept-level annotations more concisely and effectively capture the underlying semantics of the annotations by identifying the concepts corresponding to the arguments, e.g., 'Musical Compositions' are composed-by 'Composers'.

The frequent occurrence of instances, relative to more abstract concepts, in Web documents and popular Web search queries (Barr et al., 2008; Li, 2010), is both an asset and a liability from the point of view of knowledge acquisition. On one hand, it makes instance-level annotations relatively easy to find, either from manually created resources (Remy, 2002; Bollacker et al., 2008), or extracted automatically from text (Banko et al., 2007). On the other hand, it makes concept-level annotations more difficult to acquire directly. While "Rhapsody in Blue was composed by George Gershwin [..]" may occur in some form within Web documents, the more abstract "Musical compositions are composed by musicians [..]" is unlikely to occur. A more practical approach to collecting concept-level annotations is to indirectly derive them from already plentiful instance-level annotations, effectively distilling factual knowledge into more abstract, concise and generalizable knowledge.

Contributions: This paper introduces a method for converting flat sets of specific, instance-level annotations into hierarchically organized, concept-level annotations. As illustrated in Figure 1, the resulting annotations must capture not just the broad semantics of the desired arguments (e.g., 'People' rather than 'Locations' or 'Products', as the right argument of the annotation composed-by), but actually identify the concepts at the correct level of generality/specificity (e.g., 'Composers' rather than 'Artists' or 'Jazz Composers') in the underlying conceptual hierarchy.

* Contributions made during an internship at Google.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 503-513, Avignon, France, April 23 - 27 2012.
To ensure portability to new, previously unseen annotations, the proposed method avoids encoding features specific to a particular domain or annotation. In particular, the use of annotations' labels (composed-by) as lexical features might be tempting, but would anchor the annotation model to that particular annotation. Instead, the method relies only on features that generalize across annotations. Over a gold standard of semantic annotations and concepts that best capture their arguments, the method substantially outperforms three baseline methods. On average, the method computes concepts that are less than one step in the hierarchy away from the corresponding gold standard concepts of the various annotations.

[Figure 1 omitted: diagram of a portion of a conceptual hierarchy ('People' above 'Composers' and 'Musicians'; 'Composers by genre', 'Baroque Composers' and 'Jazz Composers' below 'Composers'; 'Cellists' and 'Singers' below 'Musicians'), with the annotations composed-by, lives-in, instrument-played and sung-by attached to concepts.]
Figure 1: Hierarchical Semantic Annotations: The attachment of semantic annotations (e.g., composed-by) into a conceptual hierarchy, a portion of which is shown in the diagram, requires the identification of the correct concept at the correct level of generality (e.g., 'Composers' rather than 'Jazz Composers' or 'People', for the right argument of composed-by).

2 Hierarchical Semantic Annotations

2.1 Task Description

Data Sources: The computation of hierarchical semantic annotations relies on the following data sources:
• a target annotation r (e.g., acted-in) that takes M arguments;
• N annotations I = {<i^1_j, ..., i^M_j>} (j = 1, ..., N) of r at instance level, e.g., {<leonardo dicaprio, inception>, <milla jovovich, fifth element>} (in this example, M = 2);
• mappings {i→c} from instances to concepts to which they belong, e.g., milla jovovich → 'American Actors', milla jovovich → 'People from Kiev', milla jovovich → 'Models';
• mappings {c_s→c_g} from more specific concepts to more general concepts, as encoded in a hierarchy H, e.g., 'American Actors'→'Actors', 'People from Kiev'→'People from Ukraine', 'Actors'→'Entertainers'.

Thus, the main inputs are the conceptual hierarchy H, and the instance-level annotations I. The hierarchy contains instance-to-concept mappings, as well as specific-to-general concept mappings. Via transitivity, instances (milla jovovich) and concepts ('American Actors') may be immediate children of more general concepts ('Actors'), or transitive descendants of more general concepts ('Entertainers'). The hierarchy is not required to be a tree; in particular, a concept may have multiple parent concepts. The instance-level annotations may be created collaboratively by human contributors, or extracted automatically from Web documents or some other data source.

Goal: Given the data sources, the goal is to determine to which concept c in the hierarchy H the arguments of the target concept-level annotation r should be attached. While the left argument of acted-in could attach to 'American Actors', 'People from Kiev', 'Entertainers' or 'People', it is best attached to the concept 'Actors'. The goal is to select the concept c that most appropriately generalizes across the instances. Over the set I of instance-level annotations, selecting a method for this goal can be thought of as a minimization problem. The metric to be minimized is the sum of the distances between each predicted concept c and the correct concept c_gold, where the distance is the number of edges between c and c_gold in H.

Intuitions and Challenges: Given instances such as milla jovovich that instantiate an argument of an annotation like acted-in, the conceptual hierarchy can be used to propagate the annotation upwards, from instances to their concepts, then in turn further upwards to more general concepts. The best concept would be one of the many candidate concepts reached during propagation. Intuitively, when compared to other candidate concepts, a higher proportion of the descendant instances of the best concept should instantiate (or match) the annotation. At the same time, relative to other candidate concepts, the best concept should have more descendant instances.

While the intuitions seem clear, their inclusion in a working method faces a series of practical challenges. First, the data sources may be noisy. One form of noise is missing or erroneous instance-level annotations, which may artificially skew the distribution of matching instances towards a less than optimal region in the hierarchy. If the input annotations for acted-in are available almost exhaustively for all descendant instances of 'American Actors', and are available for only a few of the descendant instances of 'Belgian Actors', 'Italian Actors' etc., then the distribution over the hierarchy may incorrectly suggest that the left argument of acted-in is 'American Actors' rather than the more general 'Actors'. In another example, if virtually all instances that instantiate the left argument of the annotation won-award are mapped to the concept 'Award Winning Actors', then it would be difficult to distinguish 'Award Winning Actors' from the more general 'Actors' or 'People', as best concept to be computed for the annotation. Another type of noise is missing or erroneous edges in the hierarchy, which could artificially direct propagation towards irrelevant regions of the hierarchy, or prevent propagation from even reaching relevant regions of the hierarchy. For example, if the hierarchy incorrectly maps 'Actors' to 'Entertainment', then 'Entertainment' and its ancestor concepts incorrectly become candidate concepts during propagation for the left argument of acted-in. Conversely, if missing edges caused 'Actors' to not have any children in the hierarchy, then 'Actors' would not even be reached and considered as a candidate concept during propagation.

Second, to apply evidence collected from some annotations to a new annotation, the evidence must generalize across annotations. However, collected evidence or statistics may vary widely across annotations. Observing that 90% of all descendant instances of the concept 'Actors' match an annotation acted-in constitutes strong evidence that 'Actors' is a good concept for acted-in. In contrast, observing that only 0.09% of all descendant instances of the concept 'Football Teams' match won-super-bowl should not be as strong negative evidence as the percentage suggests.

[Figure 2 omitted: diagram of the pipeline, from a conceptual hierarchy ('Entities' above 'Locations' and 'People'; 'Actors' above 'American Actors' and 'English Actors'), instance-level annotations (acted-in(leonardo dicaprio, inception), acted-in(milla jovovich, fifth element), acted-in(judy dench, casino royale), acted-in(colin firth, the king's speech)), query logs (e.g., "fifth element actors", "inception quotes") and instance-to-concept mappings, through candidate concepts, raw statistics, pairwise training/testing data and classified data, to ranked data and the concept-level annotation acted-in(Actors, ?).]
Figure 2: Method Overview: Inferring concept-level annotations from instance-level annotations.

2.2 Inferring Concept-Level Annotations

Determining Candidate Concepts: As illustrated in the left part of Figure 2, the first step towards inferring concept-level from instance-level annotations is to propagate the instances that instantiate a particular argument of the annotation upwards in the hierarchy. Starting from the left arguments of the annotation acted-in, namely leonardo dicaprio, milla jovovich etc., the propagation reaches their parent concepts 'American Actors', 'English Actors', then their parent and ancestor concepts 'Actors', 'People', 'Entities' etc. The concepts reached during upward propagation become candidate concepts. In subsequent steps, the candidates are modeled, scored and ranked such that ideally the best concept is ranked at the top.
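The upward propagation amounts to a breadth-first traversal of the parent mappings, starting from the instances that instantiate the argument. A minimal sketch, over a hypothetical toy fragment of a hierarchy (the mappings below are illustrative, not the Wikipedia-derived hierarchy used in the experiments):

```python
from collections import deque

# Toy parent mappings, merging the {i -> c} (instance-to-concept) and
# {c_s -> c_g} (specific-to-general concept) mappings into one dictionary.
PARENTS = {
    "leonardo dicaprio": ["American Actors"],
    "milla jovovich": ["American Actors", "People from Kiev", "Models"],
    "American Actors": ["Actors"],
    "People from Kiev": ["People from Ukraine"],
    "Models": ["People"],
    "Actors": ["Entertainers"],
    "Entertainers": ["People"],
    "People from Ukraine": ["People"],
    "People": ["Entities"],
}

def candidate_concepts(instances, parents):
    """Propagate instances upward through the hierarchy; every concept
    reached, directly or transitively, becomes a candidate concept."""
    seen, queue = set(), deque(instances)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

cands = candidate_concepts(["leonardo dicaprio", "milla jovovich"], PARENTS)
```

Because the hierarchy need not be a tree, the `seen` set guards against visiting a concept twice when it is reachable through multiple parents.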
Given the cor- Also in this category are features that relay in- rect (gold) concept of an annotation, it would be formation about the candidate concept’s children tempting to employ binary classification directly, concepts. These features include (1) M ATCHED by marking the correct concept as a positive ex- C HILDREN the number of child concepts con- ample, and all other candidate concepts as nega- taining at least one matching instance, (2) C HIL - tive examples. Unfortunately, this would produce DREN P ERCENT the percentage of child concepts a highly imbalanced training set, with thousands with at least one matching instance, (3) AVG I N - of negative examples and, more importantly, with STANCE P ERCENT C HILDREN the average per- only one positive example. Another disadvan- centage of matching descendant instances of the tage of using binary classification directly is that child concepts, and (4) I NSTANCE P ERCENT TO it is difficult to capture the preference for concepts I NSTANCE P ERCENT C HILDREN the ratio be- closer in the hierarchy to the correct concept, over tween I NSTANCE P ERCENT and AVERAGE I N - concepts many edges away. Finally, the absolute STANCE P ERCENT OF C HILDREN . The last fea- values of the features that might be employed may ture is meant to capture dramatic changes in per- be comparable within an annotation, but incompa- centages when moving in the hierarchy from child rable across annotations, which reduces the porta- concepts to the candidate concept in question. bility of the resulting model to new annotations. (B) Concept Features: Concept features ap- To address the above issues, the ranking func- proximate the generality of the concepts: (1) tion proposed does not construct training exam- N UM I NSTANCES the number of descendant in- ples from raw features collected for each indi- stances of the concept, (2) N UM C HILDREN the vidual candidate concept. 
Instead, it constructs number of child concepts, and (3) D EPTH the dis- training examples from pairwise comparisons of tance to the concept’s farthest descendant. a candidate concept with another candidate con- (C) Argument Co-occurrence Features: The ar- cept. Concretely, a pairwise comparison is la- gument co-occurrence features model the likeli- beled as a positive example if the first concept is hood that an annotation applies to a concept by closer to the correct concept than the second, or as looking at co-occurrences with another argument negative otherwise. The pairwise formulation has of the same annotation. Intuitively, if a con- three immediate advantages. First, it accomodates cept representing one argument has a high co- the preference for concepts closer to the gold con- occurrence with an instance that is some other ar- cept. Second, the pairwise formulation produces gument, a relationship more likely exists between a larger, more balanced training set. Third, deci- members of the concept and the instance. For ex- sions of whether the first concept being compared ample, given acted-in, ‘Actors’ is likely to have a is more relevant than the second are more likely to higher co-occurrence with casablanca than ‘Peo- generalize across annotations, than absolute deci- ple’ is. These features are generated from a set of sions of whether (and how much) a particular con- Web queries. Therefore, the collected values are cept is relevant for a given annotation. likely to be affected by different noise than that Compiling Ranking Features: The features are present in the original dataset. For every concept grouped into four categories: (A) annotation co- and instance pair from the arguments of a given occurrence features, (B) concept features, (C) ar- annotation, they feature the number of times each gument co-occurrence features, and (D) combina- of the tokens in the concept appears in the same tion features, as described below. 
query with each of the tokens in the instance, (A) Annotation Co-occurrence Features: The normalizing to the respective number of tokens. annotation co-occurrence features emphasize how The procedure generates, for each candidate con- well an annotation applies to a concept. These cept, an average co-occurrence score (AVG C O - features include (1) M ATCHED I NSTANCES the OCCURRENCE ) and a total co-occurrence score number of descendant instances of the concept (T OTAL C O - OCCURRENCE) over all instances the that appear with the annotation, (2) I NSTANCE concept is paired with. P ERCENT the percentage of matched instances in (D) Combination Features: The last group the concept, (3) M ORE THAN T HREE M ATCHING of features are combinations of the above fea- I NSTANCES and (4) M ORE THAN T EN M ATCH - tures: (1) D EPTH , I NSTANCE P ERCENT which is ING I NSTANCES , which indicate when the match- D EPTH multiplied by I NSTANCE P ERCENT, and 506 Concept Distance Match Total Match Total AvgInst Depth Avg Total ToCorrect Inst Inst Child Child PercOfChild Cooccur Cooccur People 4 36512 879423 22 29 4% 14 0.67 33506 Actors 0 29101 54420 6 10 32% 6 2.08 99971 English Actors 2 3091 5922 3 4 37% 3 2.75 28378 Labeled Concept Pair Annotation Co-occurrence Concept Arg Co-occurrence Combination Features Features Features Features Concept Label Match Inst Match Child AvgInst Num Num Depth Avg Total Depth DepthInst Pair Inst Perc Child Perc PercChild Inst Child Cooccur Cooccur InstPerc PercChild People-Actors 0 1.25 0.08 3.67 1.26 0.13 1.25 3.67 2.33 0.32 0.34 0.18 0.66 Actors-People 1 0.8 12.88 0.27 0.79 7.65 0.8 0.27 0.43 3.11 2.98 5.52 1.51 Actors-English Actors 1 9.41 1.02 2.0 0.8 0.87 9.41 2.0 2.0 0.76 3.52 2.05 4.1 English Actors-Actors 0 0.11 0.98 0.5 1.25 1.15 0.11 0.5 0.5 1.32 0.28 0.49 0.24 English Actors-People 1 0.08 12.57 0.14 0.99 8.82 0.08 0.14 0.21 4.12 0.85 2.69 0.37 People-English Actors 0 11.81 0.08 7.33 1.01 0.11 11.81 7.33 4.67 0.24 1.18 0.37 2.72 Table 1: 
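As a minimal sketch of how the basic annotation co-occurrence statistics for one candidate concept might be computed (the function and its arguments are illustrative assumptions, not code from the paper):

```python
def concept_statistics(concept_instances, annotated_instances):
    """Compute three per-concept statistics for a candidate concept:
    MatchedInstances, InstancePercent and NumInstances.
    concept_instances: all descendant instances of the concept;
    annotated_instances: instances observed with the annotation argument."""
    descendants = set(concept_instances)
    matched = descendants & set(annotated_instances)
    return {
        "MatchedInstances": len(matched),
        "InstancePercent": (100.0 * len(matched) / len(descendants)
                            if descendants else 0.0),
        "NumInstances": len(descendants),
    }

# Toy example: 2 of the 4 descendant instances match the annotation.
stats = concept_statistics(
    ["colin firth", "judi dench", "hugh grant", "kate winslet"],
    ["colin firth", "judi dench"],
)
```

The remaining features of categories (A) and (B) follow the same pattern, aggregating counts over the concept's children instead of its instances.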
Generating Learning Examples: For a given annotation, the ranking features described so far are computed for each candidate concept (e.g., 'Movie Actors', 'Models', 'Actors'). However, the actual training and testing examples are generated for pairs of candidate concepts (e.g., <'Film Actors', 'Models'>, <'Film Actors', 'Actors'>, <'Models', 'Actors'>). A training example represents a comparison between two candidate concepts, and specifies which of the two is more relevant. To create training and testing examples, the values of the features of the first concept in the pair are respectively combined with the values of the features of the second concept in the pair, to produce values corresponding to the entire pair. Following classification of testing examples, concepts are ranked according to the number of other concepts which they are classified as more relevant than. Table 1 shows examples of training/testing data.

Raw statistics:
Concept        | DistToCorrect | MatchInst | TotalInst | MatchChild | TotalChild | AvgInstPercOfChild | Depth | AvgCooccur | TotalCooccur
People         | 4             | 36512     | 879423    | 22         | 29         | 4%                 | 14    | 0.67       | 33506
Actors         | 0             | 29101     | 54420     | 6          | 10         | 32%                | 6     | 2.08       | 99971
English Actors | 2             | 3091      | 5922      | 3          | 4          | 37%                | 3     | 2.75       | 28378

Labeled concept pairs:
Concept Pair          | Label | MatchInst | InstPerc | MatchChild | ChildPerc | AvgInstPercChild | NumInst | NumChild | Depth | AvgCooccur | TotalCooccur | DepthInstPerc | DepthInstPercChild
People-Actors         | 0     | 1.25      | 0.08     | 3.67       | 1.26      | 0.13             | 1.25    | 3.67     | 2.33  | 0.32       | 0.34         | 0.18          | 0.66
Actors-People         | 1     | 0.8       | 12.88    | 0.27       | 0.79      | 7.65             | 0.8     | 0.27     | 0.43  | 3.11       | 2.98         | 5.52          | 1.51
Actors-English Actors | 1     | 9.41      | 1.02     | 2.0        | 0.8       | 0.87             | 9.41    | 2.0      | 2.0   | 0.76       | 3.52         | 2.05          | 4.1
English Actors-Actors | 0     | 0.11      | 0.98     | 0.5        | 1.25      | 1.15             | 0.11    | 0.5      | 0.5   | 1.32       | 0.28         | 0.49          | 0.24
English Actors-People | 1     | 0.08      | 12.57    | 0.14       | 0.99      | 8.82             | 0.08    | 0.14     | 0.21  | 4.12       | 0.85         | 2.69          | 0.37
People-English Actors | 0     | 11.81     | 0.08     | 7.33       | 1.01      | 0.11             | 11.81   | 7.33     | 4.67  | 0.24       | 1.18         | 0.37          | 2.72

Table 1: Training/Testing Examples: The top table shows examples of raw statistics gathered for three candidate concepts for the left argument of the annotation acted-in. The second table shows the training/testing examples generated from these concepts and statistics. Each example represents a pair of concepts, which is labeled positive if the first concept is closer to the correct concept than the second concept. Features shown here are the ratio between a statistic for the first concept and a statistic for the second (e.g., Depth for Actors-English Actors is 2, as 'Actors' has a depth of 6 and 'English Actors' a depth of 3). Some features are omitted due to space constraints.

3 Experimental Setting

3.1 Data Sources

Conceptual Hierarchy: The experiments compute concept-level annotations relative to a conceptual hierarchy derived automatically from the Wikipedia (Remy, 2002) category network, as described in (Ponzetto and Navigli, 2009). The hierarchy filters out edges (e.g., from 'British Film Actors' to 'Cinema of the United Kingdom') from the Wikipedia category network that do not correspond to IsA relations. A concept in the hierarchy is a Wikipedia category (e.g., 'English Film Actors') that has zero or more Wikipedia categories as child concepts, and zero or more Wikipedia categories (e.g., 'English People by Occupation', 'British Film Actors') as parent concepts. Each concept in the hierarchy has zero or more instances, which are the Wikipedia articles listed (in Wikipedia) under the respective categories (e.g., colin firth is an instance of 'English Actors').

Instance-Level Annotations: The experiments exploit a set of binary instance-level annotations (e.g., acted-in, composed) among Wikipedia instances, as available in Freebase (Bollacker et al., 2008). The annotation is a Freebase property (e.g., /music/composition/composer). Internally, the left and right arguments are Freebase topic identifiers mapped to their corresponding Wikipedia articles (e.g., /m/03f4k mapped to the Wikipedia article on george gershwin). In this paper, the derived annotations and instances are displayed in a shorter, more readable form for conciseness and clarity. As features do not use the label of the annotation, labels are never used in the experiments and evaluation.

Web Search Queries: The argument co-occurrence features described above are computed over a set of around 100 million anonymized Web search queries from 2010.
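The pairwise example generation of Section 2.2, with the raw-ratio feature combination (0 when the denominator is 0), can be sketched as below; the statistics dictionaries are illustrative, with Depth values matching the worked example in Table 1 ('Actors' has depth 6, 'English Actors' depth 3):

```python
from itertools import permutations

def ratio(a, b):
    # Raw ratio combination: 0 when the denominator is 0.
    return a / b if b else 0.0

def pairwise_examples(stats, dist_to_gold):
    """Turn per-concept statistics into pairwise examples. The label is 1
    when the first concept is closer to the gold concept than the second;
    each feature is the ratio of the two concepts' statistics."""
    examples = []
    for c1, c2 in permutations(stats, 2):
        label = 1 if dist_to_gold[c1] < dist_to_gold[c2] else 0
        feats = {k: ratio(stats[c1][k], stats[c2][k]) for k in stats[c1]}
        examples.append((label, (c1, c2), feats))
    return examples

stats = {"Actors": {"Depth": 6}, "English Actors": {"Depth": 3}}
dist = {"Actors": 0, "English Actors": 2}
examples = pairwise_examples(stats, dist)
```

Each unordered pair yields two examples, one per direction, which is what produces the larger and more balanced training set described in Section 2.2.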
For example, the gold concept of the left argument of composed-by is annotated to be the 3.2 Experimental Runs Wikipedia category ‘Musical Compositions’. In The experimental runs exploit ranking features the process, some annotation labels are discarded, described in the previous section, employing: when (a) it is not clear what concept captures an • one of three learning algorithms: naive Bayes argument (e.g., for the right argument of function- (NAIVE BAYES), maximum entropy (M AXENT), of-building), or (b) more than 5000 candidate con- or perceptron (P ERCEPTRON) (Mitchell, 1997), cepts are available via propagation for one of the chosen for their scalability to larger datasets via arguments, which would cause too many train- distributed implementations. ing or testing examples to be generated via con- • one of three ways of combining the values cept pairs, and slow down the experiments. The of features collected for individual candidate con- retained 139 annotation labels, whose arguments cepts into values of features for pairs of candidate have been labeled with their respective gold con- concepts: the raw ratio of the values of the re- cepts, form the gold standard for the experiments. spective features of the two concepts (0 when the More precisely, an entry in the resulting gold stan- denominator is 0); the ratio scaled to the interval dard consists of an annotation label, one of its [0, 1]; or a binary value indicating which of the arguments being considered (left or right), and values is larger. a gold concept that best captures that argument. For completeness, the experiments include The set of annotation labels from the gold stan- three additional, baseline runs. 
Each baseline dard is quite diverse and covers many domains of computes scores for all candidate concepts based potential interest, e.g., has-company(‘Industries’, on the respective metric; then candidate concepts ‘Companies’), written-by(‘Films’, ‘Screenwrit- are ranked in decreasing order of their scores. The ers’), member-of (‘Politicians’,‘Political Parties’), baselines metrics are: or part-of-movement(‘Artists’, ‘Art Movements’). • I NST P ERCENT ranks candidate concepts by Evaluation Metric: Following previous work the percentage of matched instances that are de- on selectional preferences (Kozareva and Hovy, scendants of the concept. It emphasizes concepts 2010; Ritter et al., 2010), each entry in the gold which are “proven” to belong to the annotation; standard, (i.e., each argument for a given annota- tion) is evaluated separately. Experimental runs • E NTROPY ranks candidate concepts by the compute a ranked list of candidate concepts for entropy (Shannon, 1948) of the proportion of each entry in the gold standard. In theory, a com- matched descendant instances of the concept; puted candidate concept is better if it is closer • AVG D EPTH ranks candidate concepts by semantically to the gold concept. In practice, their distances to half of the maximum hierarchy the accuracy of a ranked list of candidate con- height, emphasizing a balance of generality and cepts, relative to the gold concept of the anno- specificity. 
tation label, is measured by two scoring metrics 3.3 Evaluation Procedure that correspond to the mean reciprocal rank score (MRR) (Voorhees and Tice, 2000) and a modifi- Gold Standard of Concept-Level Annotations: cation of it (DRR) (Pas¸ca and Alfonseca, 2009): A random, weighted sample of 200 annotation la- 1 X N 1 M RR = max bels (e.g., corresponding to composed-by, play- N i=1 rank ranki instrument) is selected, out of the set of labels N is the number of annotations and ranki is the of all instance-level annotations collected from rank of the gold concept in the returned list for Freebase. During sampling, the weights are the MRR. An annotation ai receives no credit for counts of distinct instance-level annotations (e.g., MRR if the gold concept does not appear in the <rhapsody in blue, george gershwin>) avail- corresponding ranked list. N able for the label. The arguments of the anno- 1 X 1 DRR = max tation labels are then manually annotated with N i=1 rank ranki × (1 + Len) a gold concept, which is the category from the For DRR, ranki is the rank of a candidate con- Wikipedia hierarchy that best captures their se- cept in the returned list and Len is the length of 508 Annotation (Number of Candidate Concepts) Examples of Instances Top Ranked Concepts Composers compose Musical Compositions (3038) aaron copland; black sabbath Music by Nationality; Composers; Classical Composers Musical Compositions composed-by Composers (1734) we are the champions; yor- Musical Compositions; Compositions by ckscher marsch Composer; Classical Music Foods contain Nutrients (1112) acca sellowiana; lasagna Foods; Edible Plants; Food Ingredients Organizations has-boardmember People (3401) conocophillips; spence school Companies by Stock Exchange; Companies Listed on the NYSE; Companies Educational Organizations has-graduate Alumni (4072) air force institute of technology; Education by Country; Schools by Country; deering high school Universities and Colleges by Country Television 
Actors guest-role Fictional Characters (4823) melanie griffith; patti laBelle Television Actors by Nationality; Actors; American Actors Musical Groups has-member Musicians (2287) steroid maximus; u2 Musical Groups; Musical Groups by Genre; Musical Groups by Nationality Record Labels represent Musician (920) columbia records; vandit Record Labels; Record Labels by Country; Record Labels by Genre Awards awarded-to People (458) academy award for best original Film Awards; Awards; Grammy Awards song; erasmus prize Foods contain Nutrients (177) lycopene; glutamic acid Carboxylic Acids ; Acids; Essential Nutrients Architects design Buildings and Structures (4811) 20 times square; berkeley build- Buildings and Structures; Buildings and Struc- ing tures by Architect; Houses by Country People died-from Causes of Death (577) malaria; skiing Diseases; Infectious Diseases; Causes of Death Art Directors direct Films (1265) batman begins; the lion king Films; Films by Director; Film Episodes guest-star Television Actors (1067) amy poehler; david caruso Television Actors by Nationality; Actors; American Actors Television Network has-tv-show Television Series (2492) george of the jungle; great expec- Television Series by Network; Television Se- tations ries; Television Series by Genre Musicians play Musical Instruments (423) accordion; tubular bell Musical Instruments; Musical Instruments by Nationality; Percussion Instruments Politicians member-of Political Parties (938) independent moralizing front; Political Parties; Political Parties by Country; national coalition party Political Parties by Ideology Table 2: Concepts Computed for Gold-Standard Annotations: Examples of entries from the gold standard and counts of candidate concepts (Wikipedia categories) reached from upward propagation of instances (Wikipedia instances). The target gold concept is shown in bold. 
Also shown are examples of Wikipedia instances, and the top concepts computed by the best-performing learning algorithm for the respective gold concepts. the minimum path in the hierarchy between the the annotation labels in testing appears in train- concept and the gold concept. Len is minimum ing. This restriction makes the evaluation more (0) if the candidate concept is the same as the gold rigurous and conservative as it actually assesses standard concept. A given annotation ai receives the extent the models learned are applicable to no credit for DRR if no path is found between the new, previously unseen annotation labels. If returned concepts and the gold concept. this restriction were relaxed, the baselines would As an illustration, for a single annotation, the preform equivalently as they do not depend on right argument of composed-by, the ranked list the training data, but the learned methods would of concepts returned by an experimental may likely do better. be [‘Symphonies by Anton Bruckner’, ‘Sym- 4 Evaluation Results phonies by Joseph Haydn’, ‘Symphonies by Gus- tav Mahler’, ‘Musical Compositions’, ..], with the 4.1 Quantitative Results gold concept being ‘Musical Compositions’. The Conceptual Hierarchy: The conceptual hierar- length of the path between ‘Symphonies by An- chy contains 108,810 Wikipedia categories, and ton Bruckner’ etc. and ‘Musical Compositions’ is its maximum depth, measured as the distance 2 (via ‘Symphonies’). Therefore, the MRR score from a concept to its farthest descendant, is 16. would be 0.25 (given by the fourth element of Candidate Concepts: On average, for the gold the ranked list), whereas the DRR score would be standard, the method propagates a given annota- 0.33 (given by the first element of the ranked list). tion from instances to 1,525 candidate concepts, MRR and DRR are computed in five-fold cross from which the single best concept must be deter- validation. Concretely, the gold standard is split mined. 
The left part of Table 2 illustrates the num- into five folds such that the sets of annotation la- ber of candidate concepts reached during propa- bels in each fold are disjoint. Thus, none of gation for a sample of annotations. 509 Experimental Run Accuracy 0.513 (DRR) over the top 20 computed concepts, N=1 N=20 MRR DRR MRR DRR and 0.245 (MRR) and 0.456 (DRR) when consid- → With raw-ratio features: ering only the first concept. These scores corre- NAIVE BAYES 0.021 0.180 0.054 0.222 spond to the ranked list being less than one step M AXENT 0.029 0.168 0.045 0.208 away in the hierarchy. The very first computed P ERCEPTRON 0.029 0.176 0.045 0.216 concept exactly matches the gold concept in about → With scaled-ratio features: one in four cases, and is slightly more than one NAIVE BAYES 0.050 0.170 0.112 0.243 step away from it. In comparison, the very first M AXENT 0.245 0.456 0.430 0.513 concept computed by the best baseline matches P ERCEPTRON 0.245 0.391 0.367 0.461 the gold concept in about one in 35 cases (0.029 → With binary features: NAIVE BAYES 0.115 0.297 0.224 0.361 MRR), and is about 6 steps away (0.173 DRR). M AXENT 0.165 0.390 0.293 0.441 The accuracies of the various learning algorithms P ERCEPTRON 0.180 0.332 0.330 0.429 (not shown) were also measured and correlated → For baselines: roughly with the MRR and DRR scores. I NST P ERCENT 0.029 0.173 0.045 0.224 Discussion: The baseline runs I NST P ERCENT E NTROPY 0.000 0.110 0.007 0.136 and E NTROPY produce categories that are far AVG D EPTH 0.007 0.018 0.028 0.045 too specific. For the gold annotation composed- Table 3: Precision Results: Accuracy of ranked lists by(‘Composers’, ‘Musical Compositions’), I NST- of concepts (Wikipedia categories) computed by var- P ERCENT produces ‘Scottish Flautists’ for the left ious runs, as an average over the gold standard of argument and ‘Operas by Ernest Reyer’ for the concept-level annotations, considering the top N can- right. 
AVG D EPTH does not suffer from over- didate concepts computed for each gold standard entry. specification, but often produces concepts that 4.2 Qualitative Results have been reached via propagation, yet are not close to the gold concept. For composed-by, Precision: Table 3 compares the precision of the AVG D EPTH produces ‘Film’ for the left argument ranked lists of candidate concepts produced by the and ‘History by Region’ for the right. experimental runs. The MRR and DRR scores in the table consider either at most 20 of the concepts 4.3 Error Analysis in the ranked list computed by a given experimen- tal run, or only the first, top ranked computed con- The right part of Table 2 provides a more de- cept. Note that, in the latter case, the MRR and tailed view into the best performing experimental DRR scores are equivalent to precision@1 scores. run, showing actual ranked lists of concepts pro- Several conclusions can be drawn from the re- duced for a sample of the gold standard entries sults. First, as expected by definition of the by M AXENT with scaled-ratio. A separate analy- scoring metrics, DRR scores are higher than the sis of the results indicates that the most common stricter MRR scores, as they give partial credit cause of errors is noise in the conceptual hier- to concepts that, while not identical to the gold archy, in the form of unbalanced instance-level concepts, are still close approximations. This is annotations and missing hierarchy edges. Un- particularly noticeable for the runs M AXENT and balanced annotations are annotations where cer- P ERCEPTRON with raw-ratio features (4.6 and tain subtrees of the hierarchy are artificially more 4.8 times higher respectively). Second, among populated than other subtrees. 
For the left argu- the baselines, I NST P ERCENT is the most accu- ment of the annotation has-profession, 0.05% of rate, with the computed concepts identifying the ‘New York Politicians’ are matched but 70% of gold concept strictly at rank 22 on average (for ‘Bushrangers’ are matched. Such imbalances may an MRR score 0.045), and loosely at an aver- be inherent to how annotations are added to Free- age of 4 steps away from the gold concept (for base: different human contributors may add new a DRR score of 0.224). Third, the accuracy of annotations to particular portions of Freebase, but the learning algorithms varies with how the pair- miss other relevant portions. wise feature values are combined. Overall, raw- The results are also affected by missing edges ratio feature values perform the worst, and scaled- in the hierarchy. Of the more than 100K con- ratio the best, with binary in-between. Fourth, cepts in the hierarchy, 3479 are roots of subhier- the scores of the best experimental run, M AXENT archies that are mutually disconnected. Exam- with scaled-ratio features, are 0.430 (MRR) and ples are ‘People by Region’, ‘Shades of Red’, and 510 ‘Members of the Parliament of Northern Ireland’, and use this semantic information to construct a all of which should have parents in the hierarchy. taxonomy. The resulting taxonomy is the concep- If a few edges are missing in a particular region tual hierarchy used in the evaluation. of the hierarchy, the method can recover, but if so Another related area of work is the discovery of many edges are missing that a gold concept has relations between concepts. Nastase and Strube very few descendants, then propagation can be (2008) use Wikipedia category names and cate- substantially affected. In the worst case, the gold gory structure to generate a set of relations be- concept becomes disconnected, and thus will be tween concepts. Yan et al. 
(2009) discover re- missing from the set of candidate concepts com- lations between Wikipedia concepts via deep lin- piled during propagation. For example, for the guistic information and Web frequency informa- annotation team-color(‘Sports Clubs’, ‘Colors’), tion. Mohamed et al. (2011) generate candi- the only descendant concept of ‘Colors’ in the hi- date relations by coclustering text contexts for ev- erarchy is ‘Horse Coat Colors’, meaning that the ery pair of concepts in a hierarchy. In a sense, gold concept ‘Colors’ is not reached during prop- this area of research is complementary to that dis- agation from instances upwards in the hierarchy. cussed in this paper. These methods induce new relations, and the proposed method can be used 5 Related Work to find appropriate levels of generalization for the arguments of any given relation. Similar to the task of attaching a semantic anno- tation to the concept in a hierarchy that has the best level of generality is the task of finding se- 6 Conclusions lectional preferences for relations. Most relevant This paper introduces a method to convert flat sets to this paper is work that seeks to find the appro- of instance-level annotations to hierarchically or- priate concept in a hierarchy for an argument of ganized, concept-level annotations. The method a specific relation (Ribas, 1995; McCarthy, 1997; determines the appropriate concept for a given se- Li and Abe, 1998). Li and Abe (1998) address mantic annotation in three stages. First, it propa- this problem by attempting to identify the best tree gates annotations upwards in the hierarchy, form- cut in a hierarchy for an argument of a given verb. ing a set of candidate concepts. Second, it classi- They use the minimum description length princi- fies each candidate concept as more or less appro- ple to select a set of concepts from a hierarchy to priate than each other candidate concept within an represent the selectional preferences. This work annotation. 
Third, it ranks candidate concepts by makes several limiting assumptions including that the number of other concepts relative to which it the hierarchy is a tree, and every instance belongs is classified as more appropriate. Because the fea- to just one concept. Clark and Weir (2002) inves- tures are comparisons between concepts within a tigate the task of generalizing a single relation- single semantic annotation, rather than consider- concept pair. A relation is propagated up a hier- ations of individual concepts, the method is able archy until a chi-square test determines the differ- to generalize across annotations, and can thus be ence between the probability of the child and par- applied to new, previously unseen annotations. ent concepts to be significant where the probabili- Experiments demonstrate that, on average, the ties are relation-concept frequencies. This method method is able to identify the concept of a given has no direct translation to the task discussed here; annotation’s argument within one hierarchy edge it is unclear how to choose the correct concept if of the gold concept. instances generalize to different concepts. The proposed method can take advantage of In other research on selectional preferences, existing work on open-domain information ex- Pantel et al. (2007), Kozareva and Hovy (2010) traction. The output of such work is usually and Ritter et al. (2010) focus on generating ad- instance-level annotations, although often at sur- missible arguments for relations, and Erk (2007) face level (non-disambiguated arguments) rather and Bergsma et al. (2008) investigate classifying than semantic level (disambiguated arguments). a relation-instance pair as plausible or not. After argument disambiguation (e.g., (Dredze et al., 2010)), the annotations can be used as input Important to this paper is the Wikipedia cate- to determining concept-level annotations. 
Thus, gory network (Remy, 2002) and work on refin- the method has the potential to generalize any ing it. Ponzetto and Navigli (2009) disambiguate existing database of instance-level annotations to Wikipedia categories by using WordNet synsets concept-level annotations. 511 References Diana McCarthy. 1997. Word sense disambiguation for acquisition of selectional preferences. In Pro- Michele Banko, Michael Cafarella, Stephen Soder- ceedings of the ACL/EACL Workshop on Automatic land, Matt Broadhead, and Oren Etzioni. 2007. Information Extraction and Building of Lexical Se- Open information extraction from the Web. In Pro- mantic Resources for NLP Applications, pages 52– ceedings of the 20th International Joint Conference 60, Madrid, Spain. on Artificial Intelligence (IJCAI-07), pages 2670– Tom Mitchell. 1997. Machine Learing. McGraw Hill. 2676, Hyderabad, India. Thahir Mohamed, Estevam Hruschka, and Tom Cory Barr, Rosie Jones, and Moira Regelson. 2008. Mitchell. 2011. Discovering relations between The linguistic structure of English Web-search noun categories. In Proceedings of the 2011 Con- queries. In Proceedings of the 2008 Conference ference on Empirical Methods in Natural Language on Empirical Methods in Natural Language Pro- Processing (EMNLP-11), pages 1447–1455, Edin- cessing (EMNLP-08), pages 1021–1030, Honolulu, burgh, United Kingdom. Hawaii. Vivi Nastase and Michael Strube. 2008. Decoding Shane Bergsma, Dekang Lin, and Randy Goebel. Wikipedia categories for knowledge acquisition. In 2008. Discriminative learning of selectional pref- Proceedings of the 23rd National Conference on erence from unlabeled text. In Proceedings of the Artificial Intelligence (AAAI-08), pages 1219–1224, 2008 Conference on Empirical Methods in Natural Chicago, Illinois. Language Processing (EMNLP-08), pages 59–68, M. Pas¸ca and E. Alfonseca. 2009. Web-derived Honolulu, Hawaii. 
resources for Web Information Retrieval: From Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim conceptual hierarchies to attribute hierarchies. In Sturge, and Jamie Taylor. 2008. Freebase: A Proceedings of the 32nd International Conference collaboratively created graph database for struc- on Research and Development in Information Re- turing human knowledge. In Proceedings of the trieval (SIGIR-09), pages 596–603, Boston, Mas- 2008 International Conference on Management of sachusetts. Data (SIGMOD-08), pages 1247–1250, Vancouver, Patrick Pantel, Rahul Bhagat, Timothy Chklovski, and Canada. Eduard Hovy. 2007. ISP: Learning inferential se- Stephen Clark and David Weir. 2002. Class-based lectional preferences. In Proceedings of the Annual probability estimation using a semantic hierarchy. Meeting of the North American Chapter of the Asso- Computational Linguistics, 28(2):187–206. ciation for Computational Linguistics (NAACL-07), pages 564–571, Rochester, New York. Mark Dredze, Paul McNamee, Delip Rao, Adam Ger- Simone Paolo Ponzetto and Roberto Navigli. 2009. ber, and Tim Finin. 2010. Entity disambiguation Large-scale taxonomy mapping for restructuring for knowledge base population. In Proceedings and integrating Wikipedia. In Proceedings of of the 23rd International Conference on Compu- the 21st International Joint Conference on Ar- tational Linguistics (COLING-10), pages 277–285, tifical Intelligence (IJCAI-09), pages 2083–2088, Beijing, China. Barcelona, Spain. Katrin Erk. 2007. A simple, similarity-based model Melanie Remy. 2002. Wikipedia: The free encyclope- for selectional preferences. In Proceedings of the dia. Online Information Review, 26(6):434. 45th Annual Meeting of the Association for Com- Francesc Ribas. 1995. On learning more appropriate putational Linguistics (ACL-07), pages 216–223, selectional restrictions. In Proceedings of the 7th Prague, Czech Republic. Conference of the European Chapter of the Asso- Zornitsa Kozareva and Eduard Hovy. 2010. 
Learning ciation for Computational Linguistics (EACL-97), arguments and supertypes of semantic relations us- pages 112–118, Madrid, Spain. ing recursive patterns. In Proceedings of the 48th Alan Ritter, Mausam, and Oren Etzioni. 2010. A la- Annual Meeting of the Association for Computa- tent dirichlet allocation method for selectional pref- tional Linguistics (ACL-10), pages 1482–1491, Up- erences. In Proceedings of the 48th Annual Meet- psala, Sweden. ing of the Association for Computational Linguis- Hang Li and Naoki Abe. 1998. Generalizing case tics (ACL-10), pages 424–434, Uppsala, Sweden. frames using a thesaurus and the mdl principle. In Claude Shannon. 1948. A mathematical theory of Proceedings of the ECAI-2000 Workshop on Ontol- communication. Bell System Technical Journal, ogy Learning, pages 217–244, Berlin, Germany. 27:379–423,623–656. Xiao Li. 2010. Understanding the semantic struc- Ellen Voorhees and Dawn Tice. 2000. Building a ture of noun phrase queries. In Proceedings of the question-answering test collection. In Proceedings 48th Annual Meeting of the Association for Com- of the 23rd International Conference on Research putational Linguistics (ACL-10), pages 1337–1345, and Development in Information Retrieval (SIGIR- Uppsala, Sweden. 00), pages 200–207, Athens, Greece. 512 Fei Wu and Daniel S. Weld. 2010. Open information extraction using wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Compu- tational Linguistics (ACL-10), pages 118–127, Up- psala, Sweden. Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka. 2009. Unsupervised relation extraction by mining Wikipedia texts using information from the Web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL- IJCNLP-09), pages 1021–1029, Suntec, Singapore. 
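As a reading aid for the MRR and DRR metrics defined in Section 3.3 above, the per-entry credits can be sketched as below. This is an illustrative sketch only: the function names and the path_len helper are ours, not from the paper's implementation, and the corpus-level score would simply be the mean of the per-entry credits.

```python
def mrr_credit(ranked, gold):
    """Reciprocal rank of the gold concept; no credit (0) if it is absent."""
    for i, concept in enumerate(ranked, start=1):
        if concept == gold:
            return 1.0 / i
    return 0.0

def drr_credit(ranked, gold, path_len):
    """Best value of 1 / (rank * (1 + Len)) over the ranked candidates.
    path_len(concept, gold) returns the minimum hierarchy path length
    between a candidate and the gold concept, or None if no path exists
    (such candidates earn no credit)."""
    best = 0.0
    for i, concept in enumerate(ranked, start=1):
        length = path_len(concept, gold)
        if length is not None:
            best = max(best, 1.0 / (i * (1 + length)))
    return best
```

On the paper's own composed-by example (gold concept 'Musical Compositions' at rank 4, the three symphony categories at path length 2), mrr_credit yields 0.25 and drr_credit yields 1/3 ≈ 0.33, matching the scores reported in the text.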
Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding

Andreas Peldszus, University of Potsdam, Department for Linguistics, [email protected]
Okko Buß, University of Potsdam, Department for Linguistics, [email protected]
Timo Baumann, University of Hamburg, Department for Informatics, [email protected]
David Schlangen, University of Bielefeld, Department for Linguistics, [email protected]

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 514-523, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

We present a model of semantic processing of spoken language that (a) is robust against ill-formed input, such as can be expected from automatic speech recognisers, (b) respects both syntactic and pragmatic constraints in the computation of most likely interpretations, (c) uses a principled, expressive semantic representation formalism (RMRS) with a well-defined model theory, and (d) works continuously (producing meaning representations on a word-by-word basis, rather than only for full utterances) and incrementally (computing only the additional contribution by the new word, rather than re-computing for the whole utterance-so-far). We show that the joint satisfaction of syntactic and pragmatic constraints improves the performance of the NLU component (around 10% absolute, over a syntax-only baseline).

1 Introduction

Incremental processing for spoken dialogue systems (i.e., the processing of user input even while it still may be extended) has received renewed attention recently (Aist et al., 2007; Baumann et al., 2009; Buß and Schlangen, 2010; Skantze and Hjalmarsson, 2010; DeVault et al., 2011; Purver et al., 2011). Most of the practical work, however, has so far focussed on realising the potential for generating more responsive system behaviour through making available processing results earlier (e.g. (Skantze and Schlangen, 2009)), but has otherwise followed a typical pipeline architecture where processing results are passed only in one direction towards the next module.

In this paper, we investigate whether the other potential advantage of incremental processing, namely providing "higher-level" feedback to lower-level modules in order to improve subsequent processing of the lower-level module, can be realised as well. Specifically, we experimented with giving a syntactic parser feedback about whether semantic readings of nominal phrases it is in the process of constructing have a denotation in the given context or not. Based on the assumption that speakers do plan their referring expressions so that they can successfully refer, we use this information to re-rank derivations; this in turn has an influence on how the derivations are expanded, given continued input. As we show in our experiments, for a corpus of realistic dialogue utterances collected in a Wizard-of-Oz setting, this strategy led to an absolute improvement in computing the intended denotation of around 10% over a baseline (even more using a more permissive metric), both for manually transcribed test data as well as for the output of automatic speech recognition.

The remainder of this paper is structured as follows: We discuss related work in the next section, and then describe in general terms our model and its components. In Section 4 we then describe the data resources we used for the experiments and the actual implementation of the model, the baselines for comparison, and the results of our experiments. We close with a discussion and an outlook on future work.

2 Related Work

The idea of using real-world reference to inform syntactic structure building has been previously explored by a number of authors. Stoness et al. (2004, 2005) describe a proof-of-concept implementation of a "continuous understanding" module that uses reference information in guiding a bottom-up chart-parser, which is evaluated on a single dialogue transcript. In contrast, our model uses a probabilistic top-down parser with beam search (following Roark (2001)) and is evaluated on a large number of real-world utterances as processed by an automatic speech recogniser. Similarly, DeVault and Stone (2003) describe a system that implements interaction between a parser and higher-level modules (in this case, even more principled, trying to prove presuppositions), which however is also only tested on a small, constructed data-set.

Schuler (2003) and Schuler et al. (2009) present a model where information about reference is used directly within the speech recogniser, and hence informs not only syntactic processing but also word recognition. To this end, the processing is folded into the decoding step of the ASR, and is realised as a hierarchical HMM. While technically interesting, this approach is by design non-modular and restricted in its syntactic expressivity.

The work presented here also has connections to work in psycholinguistics. Padó et al. (2009) present a model that combines syntactic and semantic models into one plausibility judgement that is computed incrementally. However, that work is evaluated for its ability to predict reading time data and not for its accuracy in computing meaning.

3 The Model

3.1 Overview

Described abstractly, the model computes the probability of a syntactic derivation (and its accompanying logical form) as a combination of a syntactic probability (as in a typical PCFG) and a semantic or pragmatic plausibility.[1] The pragmatic plausibility here comes from the presupposition that the speaker intended her utterance to successfully refer, i.e. to have a denotation in the current situation (a unique one, in the case of definite reference). Hence, readings that do have a denotation are preferred over those that do not.

[1] Note that, as described below, in the actual implementation the weights given to particular derivations are not real probabilities anymore, as derivations fall out of the beam and normalisation is not performed after re-weighting.

The components of our model are described in the following sections: first the parser, which computes the syntactic probability in an incremental, top-down manner; the semantic construction algorithm, which associates (underspecified) logical forms to derivations; the reference resolution component, which computes the pragmatic plausibility; and the combination that incorporates the feedback from this pragmatic signal.

3.2 Parser

Roark (2001) introduces a strategy for incremental probabilistic top-down parsing and shows that it can compete with high-coverage bottom-up parsers. One of the reasons he gives for choosing a top-down approach is that it enables fully left-connected derivations, where at every processing step new increments directly find their place in the existing structure. This monotonically enriched structure can then serve as a context for incremental language understanding, as the author claims, although this part is not further developed by Roark (2001). He discusses a battery of different techniques for refining his results, mostly based on grammar transformations and on conditioning functions that manipulate a derivation probability on the basis of local linguistic and lexical information.

We implemented a basic version of his parser without considering additional conditioning or lexicalizations. However, we applied left-factorization to parts of the grammar to delay certain structural decisions as long as possible. The search space is reduced by using beam search. To match the next token, the parser tries to expand the existing derivations. These derivations are stored in a prioritized queue, which means that the most probable derivation will always be served first. Derivations resulting from rule expansions are kept in the current queue; derivations resulting from a successful lexical match are pushed onto a new queue. The parser proceeds with the next most probable derivation until the current queue is empty or until a threshold is reached at which remaining analyses are pruned. This threshold is determined dynamically: If the probability of the current derivation is lower than the product of the best derivation's probability on the new queue, the number of derivations in the new queue, and a base beam factor (an initial parameter for the size of the search beam), then all further old derivations are pruned. Due to probabilistic weighting and the left factorization of the rules, left recursion poses no direct threat in such an approach.

Additionally, we implemented three robust lexical operations: insertions consume the current token without matching it to the top stack item; deletions can "consume" a requested but actually non-existent token; repairs adjust unknown tokens to the requested token. These robust operations have strong penalties on the probability, to make sure they will survive in the derivation only in critical situations. Additionally, only a single one of them is allowed to occur between the recognition of two adjacent input tokens.

Figure 1 illustrates this process for the first few words of the example sentence "nimm den winkel in der dritten reihe" (take the bracket in the third row), using the incremental unit (IU) model to represent increments and how they are linked; see (Schlangen and Skantze, 2009).[2] Here, syntactic derivations ("CandidateAnalysisIUs") are represented by three features: a list of the last parser actions of the derivation (LD), with rule expansions or (robust) lexical matches; the derivation probability (P); and the remaining stack (S), where S* is the grammar's start symbol and S! an explicit end-of-input marker. (To keep the Figure small, we artificially reduced the beam size and cut off alternative paths, shown in grey.)

[Figure 1: An example network of incremental units, including the levels of words, POS-tags, syntactic derivations and logical forms. See section 3 for a more detailed description.]

[2] Very briefly: rounded boxes in the Figures represent IUs, and dashed arrows link an IU to its predecessor on the same level, where the levels correspond to processing stages. The Figure shows the levels of input words, POS-tags, syntactic derivations and logical forms. Multiple IUs sharing the same predecessor can be regarded as alternatives. Solid arrows indicate which information from a previous level an IU is grounded in (based on); here, every semantic IU is grounded in a syntactic IU, every syntactic IU in a POS-tag IU, and so on.

3.3 Semantic Construction Using RMRS

As a novel feature, we use for the representation of meaning increments (that is, the contributions of new words and syntactic constructions), as well as for the resulting logical forms, the formalism Robust Minimal Recursion Semantics (Copestake, 2006). This is a representation formalism that was originally constructed for semantic underspecification (of scope and other phenomena) and then adapted to serve the purposes of semantics representations in heterogeneous situations where information from deep and shallow parsers must be combined. In RMRS, meaning representations of a first order logic are underspecified in two ways: First, the scope relationships can be underspecified by splitting the formula into a list of elementary predications (EP) which receive a label l and are explicitly related by stating scope constraints to hold between them (e.g. qeq-constraints). This way, all scope readings can be compactly represented. Second, RMRS allows underspecification of the predicate-argument structure of EPs. Arguments are bound to a predicate by anchor variables a, expressed in the form of an argument relation ARGREL(a,x). This way, predicates can be introduced without fixed arity and arguments can be introduced without knowing which predicates they are arguments of. We will make use of this second form of underspecification and enrich lexical predicates with arguments incrementally.

Combining two RMRS structures involves at least joining their lists of EPs and ARGRELs and of scope constraints. Additionally, equations between the variables can connect two structures, which is an essential requirement for semantic construction. A semantic algebra for the combination of RMRSs in a non-lexicalist setting is defined in (Copestake, 2007). Unsaturated semantic increments have open slots that need to be filled by what is called the hook of another structure. Hook and slot are triples [l:a:x] consisting of a label, an anchor and an index variable. Every variable of the hook is equated with the corresponding one in the slot. This way the semantic representation can grow monotonically at each combinatory step by simply adding predicates, constraints and equations.

Our approach differs from (Copestake, 2007) only in the organisation of the slots: In an incremental setting, a proper semantic representation is desired for every single state of growth of the syntactic tree. Typically, RMRS composition assumes that the order of semantic combination is parallel to a bottom-up traversal of the syntactic tree. Yet, this would require for every incremental step first to calculate an adequate underspecified semantic representation for the projected nodes of the tree. Instead, we perform semantic combination in synchronisation with the syntactic expansion of the tree, i.e. in a top-down left-to-right fashion. This way, no underspecification of projected nodes and no re-interpretation of already existing parts of the tree is required. This, however, requires adjustments to the slot structure of RMRS. Left-recursive rules can introduce multiple slots of the same sort before they are filled, which is not allowed in the classic (R)MRS semantic algebra, where only one named slot of each sort can be open at a time. We thus organize the slots as a stack of unnamed slots, where multiple slots of the same sort can be stored, but only the one on top can be accessed. We then define a basic combination operation equivalent to forward function composition (as in standard lambda calculus, or in CCG (Steedman, 2000)) and combine substructures in a principled way across multiple syntactic rules without the need to represent slot names.

Each lexical item receives a generic representation derived from its lemma and the basic semantic type (individual, event, or underspecified denotations), determined by its POS tag. This makes the grammar independent of knowledge about what later (semantic) components will actually be able to process ("understand").[3] Parallel to the production of syntactic derivations, as the tree is expanded top-down left-to-right, semantic macros are activated for each syntactic rule, composing the contribution of the new increment. This allows for a monotonic semantics construction process that proceeds in lockstep with the syntactic analysis.

Figure 1 (in the "FormulaIU" box) illustrates the results of this process for our example derivation. Again, alternative paths have been cut to keep the size of the illustration small. Notice that, apart from the end-of-input marker, the stack of semantic slots (in curly brackets) is always synchronized with the parser's stack.

3.4 Computing Noun Phrase Denotations

Formally, the task of this module is, given a model M of the current context, to compute the set of all variable assignments such that M satisfies φ: G = {g | M |=g φ}.
If |G| > 1, we say that φ on the lower right border of the tree and then to refers ambiguously; if |G| = 1, it refers uniquely; proceed with the combination not only of the new 3 This feature is not used in the work presented here, but semantic increments but of the complete tree. For it could be used for enabling the system to learn the meaning our purposes, it is more elegant to proceed with of unknown words. 517 and if |G| = 0, it fails to refer. This process does not work directly on RMRS formulae, but on ex- tracted and unscoped first-order representations of their nominal content. 3.5 Parse Pruning Using Reference Information After all possible syntactic hypotheses at an in- crement have been derived by the parser and the corresponding semantic representations have been constructed, reference resolution informa- tion can be used to re-rank the derivations. If pragmatic feedback is enabled, the probability of every reprentation that does not resolve in the cur- rent context is degraded by a constant factor (we Figure 2: The game board used in the study, as pre- used 0.001 in our experiments described below, sented to the player: (a) the current state of the game determined by experimentation). The degradation on the left, (b) the goal state to be reached on the right. thus changes the derivation order in the parsing queue for the next input item and increases the chances of degraded derivations to be pruned in our study does not focus on these, we have dis- the following parsing step. regarded another 661 utterances in which pieces are referred to by pronouns, leaving us with 1026 4 Experiments and Results utterances for evaluation. These utterances con- tained on average 5.2 words (median 5 words; 4.1 Data std dev 2 words). 
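On the extracted unary predications of an NP, the check of Section 3.4 reduces to set filtering over the domain objects. The following is a minimal sketch; the model encoding (object name to set of satisfied predicates) and the function names are our illustration, not the paper's actual data structures.

```python
# Sketch of the Section 3.4 reference check on unscoped nominal content.
# A "model" maps each domain object to the set of predicates it satisfies;
# an NP contributes a bag of unary predications over one variable.

def denotation(model, predicates):
    """G: all domain objects satisfying every extracted predicate of the NP."""
    return {obj for obj, props in model.items()
            if all(p in props for p in predicates)}

def refer_status(model, predicates):
    """-1: fails to refer, 0: refers ambiguously, 1: refers uniquely."""
    g = denotation(model, predicates)
    if not g:
        return -1
    return 1 if len(g) == 1 else 0

# Toy scene: "the red cross" resolves uniquely, "red" alone is ambiguous.
scene = {"p1": {"red", "cross"}, "p2": {"red", "bar"}, "p3": {"green", "cross"}}
assert denotation(scene, ["red", "cross"]) == {"p1"}
assert refer_status(scene, ["red"]) == 0
assert refer_status(scene, ["blue"]) == -1
```

The -1/0/1 coding mirrors the resolution status values used in the evaluation below.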
We use data from the Pentomino puzzle piece domain (which has been used before, for example, by Fernández and Schlangen (2007) and Schlangen et al. (2009)), collected in a Wizard-of-Oz study. In this specific setting, users gave instructions to the system (the wizard) in order to manipulate (select, rotate, mirror, delete) puzzle pieces on an upper board and to put them onto a lower board, reaching a pre-specified goal state. Figure 2 shows an example configuration. Each participant took part in several rounds in which the distinguishing characteristics for puzzle pieces (color, shape, proposed name, position on the board) varied widely. In total, 20 participants played 284 games.

We extracted the semantics of an utterance from the wizard's response action. In some cases, such a mapping was not possible (e.g. because the wizard did not perform a next action, mimicking a non-understanding by the system), or potentially unreliable (if the wizard performed several actions at or around the end of the utterance). We discarded utterances without a clear semantics alignment, leaving 1687 semantically annotated user utterances. The wizard of course was able to use her model of the previous discourse for resolving references, including anaphoric ones; as our study does not focus on these, we have disregarded another 661 utterances in which pieces are referred to by pronouns, leaving us with 1026 utterances for evaluation. These utterances contained on average 5.2 words (median 5 words; std dev 2 words).

In order to test the robustness of our method, we generated speech recognition output using an acoustic model trained for spontaneous (German) speech. We used leave-one-out language model training, i.e. we trained a language model for every utterance to be recognized which was based on all the other utterances in the corpus. Unfortunately, the audio recordings of the first recording day were too quiet for successful recognition (with a deletion rate of 14%). We thus decided to limit the analysis for speech recognition output to the remaining 633 utterances from the other recording days. On this part of the corpus, word error rate (WER) was at 18%.

The subset of the full corpus that we used for evaluation, with the utterances selected according to the criteria described above, nevertheless still only consists of natural, spontaneous utterances (with all the syntactic complexity that brings) that are representative for interactions in this type of domain.

4.2 Grammar and Resolution Model

The grammar used in our experiments was hand-constructed, inspired by a cursory inspection of the corpus and aiming to reach good coverage for a core fragment. We created 30 rules, whose weights were also set by hand (as discussed below, this is an obvious area for future improvement), sparingly and according to standard intuitions. When parsing, the first step is the assignment of a POS tag to each word. This is done by a simple lookup tagger that stores the most frequent tag for each word (as determined on a small subset of our corpus).[4]

The situation model used in reference resolution is automatically derived from the internal representation of the current game state. (This was recorded in an XML format for each utterance in our corpus.) Variable assignments were then derived from the relevant nominal predicate structures,[5] consisting of extracted simple predications, e.g. red(x) and cross(x) for the NP in a phrase such as "take the red cross". For each unique predicate argument X in these EP structures (such as x above), the set of domain objects that satisfied all predicates of which X was an argument was determined. For example, for the phrase above, X mapped to all elements that were red and crosses.

Finally, the size of these sets was determined: no elements, one element, or multiple elements, as described above. Emptiness of at least one set denoted that no resolution was possible (for instance, if no red crosses were available, x's set was empty), uniqueness of all sets denoted that an exact resolution was possible, while multiple elements in at least some sets denoted ambiguity. This status was then leveraged for parse pruning, as per Section 3.5.

A more complex example, using the scene depicted in Figure 2 and the sentence "nimm den winkel in der dritten reihe" (take the bracket in the third row), is shown in Table 1. The first column shows the incremental word hypothesis string, the second the set of predicates derived from the most recent RMRS representation, and the third the resolution status (-1 for no resolution, 0 for some resolution and 1 for a unique resolution).

[4] A more sophisticated approach has recently been proposed by Beuck et al. (2011); this could be used in our setup.
[5] The domain model did not allow making a plausibility judgement based on verbal resolution.

Words                                  Predicates                                                  Status
nimm                                   nimm(e)                                                     -1
nimm den                               nimm(e,x) def(x)                                            0
nimm den Winkel                        nimm(e,x) def(x) winkel(x)                                  0
nimm den Winkel in                     nimm(e,x) def(x) winkel(x) in(x,y)                          0
nimm den Winkel in der                 nimm(e,x) def(x) winkel(x) in(x,y) def(y)                   0
nimm den Winkel in der dritten         nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y)          1
nimm den Winkel in der dritten Reihe   nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y) row(y)   1

Table 1: Example of logical forms (flattened into first-order base-language formulae) and reference resolution results for incrementally parsing and resolving "nimm den winkel in der dritten reihe"

4.3 Baselines and Evaluation Metric

4.3.1 Variants / Baselines

To be able to accurately quantify and assess the effect of our reference-feedback strategy, we implemented different variants / baselines. These all differ in how, at each step, the reading is determined that is evaluated against the gold standard, and are described in the following:

In the Just Syntax (JS) variant, we simply take the single-best derivation, as determined by syntax alone, and evaluate this.

The External Filtering (EF) variant adds information from reference resolution, but keeps it separate from the parsing process. Here, we look at the 5 highest-ranking derivations (as determined by syntax alone), and go through them beginning at the highest ranked, picking the first derivation where reference resolution can be performed uniquely; this reading is then put up for evaluation. If there is no such reading, the highest-ranking one will be put forward for evaluation (as in JS).

Syntax/Pragmatics Interaction (SPI) is the variant described in the previous section. Here, all active derivations are sent to the reference resolution module, and are re-weighted as described above; after this has been done, the highest-ranking reading is evaluated.

Finally, the Combined Interaction and Filtering (CIF) variant combines the previous two strategies, by using reference feedback in computing the ranking for the derivations, and then again using reference information to identify the most promising reading within the set of the 5 highest-ranking ones.
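The JS and EF selection strategies can be sketched as simple list operations; the derivation objects and the uniqueness test below are hypothetical stand-ins for the parser's ranked output and the reference resolver.

```python
# Sketch of the JS and EF baseline selection strategies.

def just_syntax(derivations):
    """JS: evaluate the single-best derivation by syntactic score."""
    return derivations[0]

def external_filtering(derivations, resolves_uniquely, k=5):
    """EF: among the k best syntactic derivations, pick the first one whose
    reference resolution is unique; otherwise fall back to the JS choice."""
    for d in derivations[:k]:
        if resolves_uniquely(d):
            return d
    return just_syntax(derivations)

ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]  # sorted by syntactic score
assert external_filtering(ranked, lambda d: d == "d3") == "d3"
assert external_filtering(ranked, lambda d: d == "d6") == "d1"  # d6 outside top 5
```

SPI and CIF differ in that the resolver's verdict feeds back into the ranking itself rather than only filtering its output.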
4.3.2 Metric

When a reading has been identified according to one of these methods, a score s is computed as follows: s = 1 if the correct referent (according to the gold standard) is computed as the denotation for this reading; s = 0 if no unique referent can be computed, but the correct one is part of the set of possible referents; s = -1 if no referent can be computed at all, or the correct one is not part of the set of those that are computed.

As this is done incrementally for each word (adding the new word to the parser chart), for an utterance of length m we get a sequence of m such numbers. (In our experiments we treat the "end of utterance" signal as a pseudo-word, since knowing that an utterance has concluded allows the parser to close off derivations and remove those that are still requiring elements. Hence, we in fact have sequences of m+1 numbers.) A combined score for the whole utterance is computed according to the following formula:

s_u = Σ_{n=1}^{m} (s_n · n/m)

(where s_n is the score at position n). The factor n/m causes "later" decisions to count more towards the final score, reflecting the idea that it is more to be expected (and less harmful) to be wrong early on in the utterance, whereas the longer the utterance goes on, the more pressing it becomes to get a correct result (and the more damaging if mistakes are made).[6]

Note that this score is not normalised by utterance length m; the maximally achievable score is (m+1)/2. This has the additional effect of increasing the weight of long utterances when averaging over the scores of all utterances; we see this as desirable, as the analysis task becomes harder the longer the utterance is.

We use success in resolving reference to evaluate the performance of our parsing and semantic construction component, where, more traditionally, metrics like parse bracketing accuracy might be used. But as we are building this module for an interactive system, ultimately, accuracy in recovering meaning is what we are interested in, and so we see this not just as a proxy, but actually as a more valuable metric. Moreover, this metric can be applied at each incremental step, something it is not clear how to do with more traditional metrics.

[6] This metric compresses into a single number some of the concerns of the incremental metrics developed in (Baumann et al., 2011), which can express more fine-grainedly the temporal development of hypotheses.

4.4 Experiments

Our parser, semantic construction and reference resolution modules are implemented within the InproTK toolkit for incremental spoken dialogue systems development (Schlangen et al., 2010). In this toolkit, incremental hypotheses are modified as more information becomes available over time. Our modules support all such modifications (i.e. they also allow to revert their states and output if word input is revoked).

As explained in Section 4.1, we used offline recognition results in our evaluation. However, the results would be identical if we were to use the incremental speech recognition output of InproTK directly.

The system performs several times faster than real-time on a standard workstation computer. We thus consider it ready to improve practical end-to-end incremental systems which perform within-turn actions such as those outlined in (Buß and Schlangen, 2010).

The parser was run with a base-beam factor of 0.01; this parameter may need to be adjusted if a larger grammar is used.

4.5 Results

Table 2 shows an overview of the experiment results. The table lists, separately for the manual transcriptions and the ASR transcripts, first the number of times that the final reading did not resolve at all, or to a wrong entity; did not uniquely resolve, but included the correct entity in its denotation; or did uniquely resolve to the correct entity (-1, 0, and 1, respectively). The next lines show "strict accuracy" (proportion of "1" among all results) at the end of utterance, and "relaxed accuracy" (which allows ambiguity, i.e., is the set {0, 1}). incr.scr is the incremental score as described above, which includes in the evaluation the development of references and not just the final state. (And in that sense, it is the most appropriate metric here, as it captures the incremental behaviour.) This score is shown both as an absolute number and averaged for each utterance.

                  JS       EF       SPI      CIF
transcript
  -1              563      518      364      363
  0               197      198      267      268
  1               264      308      392      392
  str.acc.        25.7%    30.0%    38.2%    38.2%
  rel.acc.        44.9%    49.3%    64.2%    64.3%
  incr.scr        -1568    -1248    -536     -504
  avg.incr.scr    -1.52    -1.22    -0.52    -0.49
recognition
  -1              362      348      254      255
  0               122      121      173      173
  1               143      158      196      195
  str.acc.        22.6%    25.0%    31.0%    30.8%
  rel.acc.        41.2%    44.1%    58.3%    58.1%
  incr.scr        -1906    -1730    -1105    -1076
  avg.incr.scr    -1.86    -1.69    -1.01    -1.05

Table 2: Results of the Experiments. See text for explanation of metrics.
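The per-utterance combined score that underlies the incr.scr rows of Table 2 follows directly from the formula in Section 4.3.2; a minimal sketch (the function name is ours):

```python
# Sketch of the combined utterance score s_u = sum_{n=1..m} s_n * n/m,
# over the per-increment scores s_n in {-1, 0, 1}.

def utterance_score(scores):
    """Weight each incremental decision s_n by its relative position n/m."""
    m = len(scores)
    return sum(s * n / m for n, s in enumerate(scores, start=1))

# A late correct decision outweighs an early error:
assert utterance_score([-1, 0, 1, 1]) == 1.5
# An all-correct sequence of length m reaches the maximum (m + 1) / 2:
assert utterance_score([1, 1, 1, 1]) == 2.5
```

Averaging these values over all evaluated utterances yields the avg.incr.scr rows.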
As these results show, the strategy of providing the parser with feedback about the real-world utility of constructed phrases (in the form of reference decisions) improves the parser, in the sense that it helps the parser to successfully retrieve the intended meaning more often compared to an approach that only uses syntactic information (JS) or that uses pragmatic information only outside of the main programme: 38.2% strict or 64.2% relaxed for SPI over 25.7% / 44.9% for JS, an absolute improvement of 12.5% for strict or, even more, 19.3% for the relaxed metric; the incremental metric shows that this advantage holds not only at the final word, but also consistently within the utterance, the average incremental score for an utterance being -0.52 for SPI and -1.52 for JS. The improvement is somewhat smaller against the variant that uses some reference information but does not integrate it into the parsing process (EF), but it is still consistently present. Adding such n-best-list processing to the output of the parser+reference combination (as variant CIF does), finally, does not further improve the performance noticeably. When processing partially defective material (the output of the speech recogniser), the difference between the variants is maintained, showing a clear advantage of SPI, although performance of all variants is degraded somewhat.

Clearly, accuracy is rather low for the baseline condition (JS); this is due to the large number of non-standard constructions in our spontaneous material (e.g., utterances like "löschen, unten" (delete, bottom)) which we did not try to cover with syntactic rules, and which may not even contain NPs. The SPI condition can promote derivations resulting from robust rules (here, deletion) which then can refer. In general, though, state-of-the-art grammar engineering may narrow the gap between JS and SPI (this remains to be tested), but we see it as an advantage of our approach that it can improve over the (easy-to-engineer) set of core grammar rules.

5 Conclusions

We have described a model of semantic processing of natural, spontaneous speech that strives to jointly satisfy syntactic and pragmatic constraints (the latter being approximated by the assumption that referring expressions are intended to indeed successfully refer in the given context). The model is robust, accepting also input of the kind that can be expected from automatic speech recognisers, and incremental, that is, it can be fed input on a word-by-word basis, computing at each increment only exactly the contribution of the new word. Lastly, as another novel contribution, the model makes use of a principled formalism for semantic representation, RMRS (Copestake, 2006).

While the results show that our approach of combining syntactic and pragmatic information can work in a real-world setting on realistic data (previous work in this direction has so far only been at the proof-of-concept stage), there is much room for improvement. First, we are now exploring ways of bootstrapping a grammar and derivation weights from hand-corrected parses. Secondly, we are looking at making the variable assignment / model checking function probabilistic, assigning probabilities (degrees of strength of belief) to candidate resolutions (as for example the model of Schlangen et al. (2009) does). Another next step, which will be very easy to take given the modular nature of the implementation framework that we have used, will be to integrate this component into an interactive end-to-end system, testing other domains in the process.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. The work reported here was supported by a DFG grant in the Emmy Noether programme to the last author and a stipend from DFG-CRC (SFB) 632 to the first author.

References

Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, Mary Swift, and Michael K. Tanenhaus. 2007. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In Proceedings of Decalog 2007, the 11th International Workshop on the Semantics and Pragmatics of Dialogue, Trento, Italy.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009. Assessing and improving the performance of speech recognition for incremental systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, Boulder, Colorado, USA, May.

Timo Baumann, Okko Buß, and David Schlangen. 2011. Evaluation and optimization of incremental processors. Dialogue and Discourse, 2(1):113-141.

Niels Beuck, Arne Köhn, and Wolfgang Menzel. 2011. Decision strategies for incremental POS tagging. In Proceedings of the 18th Nordic Conference of Computational Linguistics, NODALIDA-2011, Riga, Latvia.

Okko Buß and David Schlangen. 2010. Modelling sub-utterance phenomena in spoken dialogue systems. In Proceedings of the 14th International Workshop on the Semantics and Pragmatics of Dialogue (Pozdial 2010), pages 33-41, Poznan, Poland, June.

Ann Copestake. 2006. Robust minimal recursion semantics. Technical report, Cambridge Computer Lab. Unpublished draft.

Ann Copestake. 2007. Semantic composition with (robust) minimal recursion semantics. In Proceedings of the Workshop on Deep Linguistic Processing, DeepLP '07, pages 73-80, Stroudsburg, PA, USA. Association for Computational Linguistics.

David DeVault and Matthew Stone. 2003. Domain inference in incremental interpretation. In Proceedings of ICOS 4: Workshop on Inference in Computational Semantics, Nancy, France, September. INRIA Lorraine.

David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue and Discourse, 2(1):143-170.

Raquel Fernández and David Schlangen. 2007. Referring under restricted interactivity conditions. In Simon Keizer, Harry Bunt, and Tim Paek, editors, Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 136-139, Antwerp, Belgium, September.

Ulrike Padó, Matthew W. Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):794-838.

Matthew Purver, Arash Eshghi, and Julian Hough. 2011. Incremental semantic construction in a dialogue system. In J. Bos and S. Pulman, editors, Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 365-369, Oxford, UK, January.

Brian Roark. 2001. Robust Probabilistic Predictive Syntactic Processing: Motivations, Models, and Applications. Ph.D. thesis, Department of Cognitive and Linguistic Sciences, Brown University.

David Schlangen and Gabriel Skantze. 2009. A general, abstract model of incremental dialogue processing. In EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710-718. Association for Computational Linguistics, March.

David Schlangen, Timo Baumann, and Michaela Atterer. 2009. Incremental reference resolution: The task, metrics for evaluation, and a Bayesian filtering model that is sensitive to disfluencies. In Proceedings of SIGdial 2009, the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, London, UK, September.

David Schlangen, Timo Baumann, Hendrik Buschmeier, Okko Buß, Stefan Kopp, Gabriel Skantze, and Ramin Yaghoubzadeh. 2010. Middleware for incremental processing in conversational agents. In Proceedings of SIGdial 2010, Tokyo, Japan, September.

William Schuler. 2003. Using model-theoretic semantic interpretation to guide statistical parsing and word recognition in a spoken language interface. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan. Association for Computational Linguistics.

William Schuler, Stephen Wu, and Lane Schwartz. 2009. A framework for fast incremental interpretation during speech decoding. Computational Linguistics, 35(3).

Gabriel Skantze and Anna Hjalmarsson. 2010. Towards incremental speech generation in dialogue systems. In Proceedings of the SIGdial 2010 Conference, pages 1-8, Tokyo, Japan, September.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 745-753, Athens, Greece, March.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, Massachusetts.

Scott C. Stoness, Joel Tetreault, and James Allen. 2004. Incremental parsing with reference interaction. In Proceedings of the Workshop on Incremental Parsing at the ACL 2004, pages 18-25, Barcelona, Spain, July.

Scott C. Stoness, James Allen, Greg Aist, and Mary Swift. 2005. Using real-world reference to improve spoken language understanding. In AAAI Workshop on Spoken Language Understanding, pages 38-45.

Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs

Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest

Vlad Niculae
Faculty of Mathematics and Computer Science, University of Bucharest

Octavia-Maria Șulea
Faculty of Foreign Languages and Literatures and Faculty of Mathematics and Computer Science, University of Bucharest
[email protected]  [email protected]  [email protected]
Abstract to give other conjugational classifications based on the way the verb actually conjugates. Lom- In this paper we extend our work described bard (1955), looking at a corpus of 667 verbs, in (Dinu et al., 2011) by adding more con- combined the traditional 4 classes with the way in jugational rules to the labelling system in- which the biggest two subgroups conjugate (one troduced there, in an attempt to capture using the suffix ”ez”, the other ”esc”) and ar- the entire dataset of Romanian verbs ex- tracted from (Barbu, 2007), and we em- rived at 6 classes. Ciompec (Ciompec et. al., ploy machine learning techniques to predict 1985 in Costanzo, 2011) proposed 10 conjuga- a verb’s correct label (which says what con- tional classes, while Felix (1964) proposed 12, jugational pattern it follows) when only the both of them looking at the inflection of the verbs infinitive form is given. and number of allomorphs of the stem. Romalo (1968, p. 5-203) produced a list of 38 verb types, which she eventually reduced to 10. 1 Introduction For the purpose of machine translation, Moisil Using only a restricted group of verbs, in (Dinu (1960) proposed 5 regrouped classes of verbs, et al., 2011) we validated the hypothesis that pat- with numerous subgroups, and introduced the terns can be identified in the conjugation of the method of letters with variable values, while Pa- Romanian (partially irregular) verb and that these pastergiou et al. (2007) have recently developed patterns can be learnt automatically so that, given a classification from a (second) language acquisi- the infinitive of a verb, its correct conjugation tion point of view, dividing the 1st and 4th tradi- for the indicative present tense can be produced. tional classes into 3 and respectively 5 subclasses, In this paper, we extend our investigation to the each with a different conjugational pattern, and whole dataset described in (Barbu, 2008) and at- offering rules for alternations in the stem. 
tempt to capture, beside the general ending pat- Of the more extensive classifications, Barbu terns during conjugation, as much of the phono- (2007) distinguished 41 conjugational classes for logical alternations occuring in the stem of verbs all tenses and 30 for the indicative present alone, (apophony) from the dataset as we can. covering a whole corpus of more that 7000 con- Traditionally, Romanian has received a Latin- temporary Romanian verbs, a corpus which was inspired classification of verbs into 4 (or some- also used in the present paper. However, her times 5) conjugational classes based on the ending classes were developed on the basis of the suf- of their infinitival form alone (Costanzo, 2011). fixes each verb receives during conjugation, and However, this infinitive-based classification has the classification system did not take into account proved itself inadequate due to its inability to ac- the alternations occuring in the stem of irregular count for the behavior of partially irregular verbs and partially irregular verbs. The system of rules (whose stems have a smaller number of allo- presented below took into account both the end- morphs than the completely irregular) during their ings pattern and the type of stem alternation for conjugation. each verb. There have been, thus, numerous attempts In what follows we describe our method for la- throughout the history of Romanian Linguistics beling the dataset and finding a model able to pre- 524 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 524–528, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics dict the labels. 
2 Approach

The problem we aim to solve is to determine how to conjugate a verb given its infinitive form. The traditional infinitive-based classification taught in school does not take one all the way to solving this problem: many conjugational patterns exist within each of its four classes.

2.1 Labeling the dataset

Following our own observations, the alternations identified in (Papastergiou et al., 2007), and the classes of suffix patterns given in (Barbu, 2007), we developed a number of conjugational rules, which we narrowed down to the 30 most productive with respect to the dataset. Each of these 30 rules (or patterns) contains 6 regular expressions through which the rule models how a (different) type of Romanian verb conjugates in the indicative present. Each rule consists of 6 regular expressions because there are three persons (first, second, and third) times two numbers (singular and plural).

Rule 10, for example, models how verbs of the type "a cânta" (to sing) conjugate in the indicative present: the first regular expression models the first person singular form "(eu) cânt" (in regular expression format: ^(.+)t$), the second models the second person singular form "(tu) cânți" (^(.+)ți$), the third models the third person singular form "(el) cântă" (^(.+)tă$), and so forth. Thus, rule 10 captures the alternation t→ț for the 2nd person singular while modelling a particular type of verb class with a particular set of suffixes. Note that the dot accepts any letter of the Romanian alphabet and that, for each of the six forms, the value of the capturing groups (those between brackets) remains constant, in this case "cân". These groups correspond to all the parts of the stem that remain unchanged, and they ensure that, given the infinitive and the regular expressions, one can work backwards and produce the correct conjugation.

For a clearer picture of one such rule, Table 1 shows how the verb "a tresălta" is modeled by rule 14.

Person         Regexp           Example
1st singular   ^(.+)a(.+)t$     tresalt
2nd singular   ^(.+)a(.+)ți$    tresalți
3rd singular   ^(.+)a(.+)tă$    tresaltă
1st plural     ^(.+)ă(.+)tăm$   tresăltăm
2nd plural     ^(.+)ă(.+)tați$  tresăltați
3rd plural     ^(.+)a(.+)tă$    tresaltă

Table 1: Rule 14, modelling "a tresălta"

Below we list all the rules used, with the stem alternations they capture and an example of a verb that they model. Note that by (no) alternation we mean (no) alternation in the stem. The difference between rules such as 1, 20, and 22 therefore lies in the suffixes added to the stem for each verb form: they may share some suffixes, but not all of them, and/or not for the same person and number.

1. no alternation; "a spera" (to hope);
2. alternation: ă→e for the 2nd person singular; "a număra" (to count);
3. no alternation; "a intra" (to enter), stem ends in "tr", "pl", "bl" or "fl", which determines the addition of "u" at the end of the 1st person singular form;
4. alternation: it lacks the t→ț for the 2nd person singular which otherwise normally occurs; "a mișca" (to move), stem ends in "șca";
5. no alternation; "a tăia" (to cut), ends in "ia" preceded by a vowel;
6. no alternation; "a speria" (to scare), ends in "ia" preceded by a consonant;
7. no alternation; "a dansa" (to dance), conjugated with the suffix "ez";
8. no alternation; "a copia" (to copy), conjugated with a modified "ez" due to the stem ending in "ia";
9. alternation: c→ch(e) or g→gh(e); "a parca" (to park), conjugated with "ez", ending in "ca" or "ga";
10. alternation: t→ț for the 2nd person singular; "a cânta" (to sing);
11. alternation: s→ș, which replaces the usual t→ț for the 2nd person singular; "a exista" (to exist);
12. alternation: a→ea for the 3rd person singular and plural, t→ț for the 2nd person singular; "a deștepta" (to awake/arouse);
13. alternation: e→ea for the 3rd person singular and plural, t→ț for the 2nd person singular; "a deșerta" (to empty);
14. alternation: ă→a for all the forms except the 1st and 2nd person plural; "a tresălta" (to start, to take fright);
15. alternation: ă→a in the 3rd person singular and plural, ă→e in the 2nd person singular; "a desfăta" (to delight);
16. alternation: ă→a for all the forms except the 1st and 2nd person plural; "a părea" (to seem);
17. alternation: d→z for the 2nd person singular due to palatalization, along with ă→e; "a vedea" (to see), stem ends in "d";
18. alternation: ă→a for all forms except the 1st and 2nd person plural, d→z for the 2nd person singular due to palatalization; "a cădea" (to fall);
19. no alternation; "a veghea" (to watch over), conjugates with another type of "ez" ending pattern;
20. no alternation; "a merge" (to walk), receives the typical ending pattern of the third conjugational class;
21. alternation: t→ț for the 2nd person singular; "a promite" (to promise);
22. no alternation; "a scrie" (to write);
23. alternation: șt→sc for the 1st person singular and 3rd person plural; "a naște" (to give birth), ends in "ște";
24. alternation: "n" is deleted from the stem in the 2nd person singular; "a pune" (to put), ends in "ne";
25. alternation: d→z in the 2nd person singular due to palatalization; "a crede" (to believe), stem ends in "d";
26. no alternation; "a sui" (to climb), ends in "ui", "ăi", or "âi";
27. no alternation; "a citi" (to read), conjugates with the suffix "esc";
28. this type preserves the "i" from the infinitive; "a locui" (to reside), ends in "ăi", "oi", or "ui" and conjugates with "esc";
29. alternation: o→oa in the 3rd person singular and plural; ends in "î"; "a omorî" (to kill);
30. no alternation; "a hotărî" (to decide), ends in "î" and conjugates with "ăsc", a variant of "esc".

2.2 Classifiers and features

Each infinitive in the dataset received a label corresponding to the first rule that correctly produces a conjugation for it. This was done to reduce the ambiguity of the data, which was due to some verbs having alternate conjugation patterns. The unlabeled verbs were discarded, while the labeled ones were used to train and evaluate a classifier.
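The way a rule's six regular expressions share capture groups, and the "working backwards" from the infinitive described above, can be sketched as follows. This is a minimal illustration modelled on rule 14 from Table 1, not the authors' code: the infinitive pattern and the template encoding are assumptions of this sketch.

```python
import re

# Templates corresponding to the six regexes of a Table-1-style rule
# (rule 14, "a tresălta"), with the shared capture groups as placeholders.
RULE_14_FORMS = {
    "1sg": "{g1}a{g2}t",   "2sg": "{g1}a{g2}ți",   "3sg": "{g1}a{g2}tă",
    "1pl": "{g1}ă{g2}tăm", "2pl": "{g1}ă{g2}tați", "3pl": "{g1}a{g2}tă",
}

# Hypothetical infinitive pattern for this rule; the groups capture the
# unchanged parts of the stem (for "tresălta": g1 = "tres", g2 = "l").
INFINITIVE_RE = re.compile(r"^(.+)ă(.+)ta$")

def conjugate(infinitive):
    """Work backwards from the infinitive to the six indicative present forms."""
    m = INFINITIVE_RE.match(infinitive)
    if m is None:
        return None  # the rule does not apply to this verb
    g1, g2 = m.group(1), m.group(2)
    return {k: tpl.format(g1=g1, g2=g2) for k, tpl in RULE_14_FORMS.items()}

print(conjugate("tresălta"))
```

Note how the ă→a alternation surfaces in every form except the 1st and 2nd person plural, exactly as rule 14 specifies.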
The context-sensitive nature of the alternations suggests that character n-gram windows make useful features. In the preprocessing step, the list of infinitives is transformed into a sparse matrix whose rows correspond to samples and whose features are the occurrences or frequencies of specific n-grams. This feature extraction step has three free parameters: the maximum n-gram length, the optional binarization of the features (taking only binary occurrences instead of counts), and the optional appending of a terminator character. The terminator character allows the classifier to identify, and assign a different weight to, the n-grams that overlap with the suffix of the string.

For example, consider the English infinitive "to walk" and assume the following illustrative parameter values: an n-gram size of 3 and an appended terminator character. First, a terminator is appended to the end, yielding the string "walk$". Subsequently, the string is broken into 1-, 2-, and 3-grams: w, a, l, k, $, wa, al, lk, k$, wal, alk, lk$. Next, this list is turned into a vector using a standard process: we first build a dictionary of all the n-grams from the whole dataset, which, in order, encode the features. The verb "(to) walk" is then encoded as a row vector with ones in the columns corresponding to the features w, a, etc., and zeros in the rest. In this particular case, there is no difference between binary and count features, because every n-gram of this short verb occurs only once.
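The extraction step just described can be sketched in a few lines of plain Python. This is an illustration of the procedure, not the actual implementation (which stores the features in a sparse matrix):

```python
from collections import Counter

def char_ngrams(word, max_n=3, terminator=True):
    """Break a word into all 1..max_n character n-grams,
    optionally appending the '$' terminator first."""
    s = word + "$" if terminator else word
    return [s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)]

def vectorize(words, max_n=3, binary=True):
    """Turn a list of words into count or binary vectors over a dictionary
    of all n-grams in the dataset (dense rows, only for illustration)."""
    counted = [Counter(char_ngrams(w, max_n)) for w in words]
    vocab = {}
    for c in counted:
        for g in c:
            vocab.setdefault(g, len(vocab))
    return [[min(c[g], 1) if binary else c[g] for g in vocab] for c in counted], vocab

# The 12 n-grams of "walk$": w, a, l, k, $, wa, al, lk, k$, wal, alk, lk$
print(char_ngrams("walk"))
```

For "(to) tantalize", the count representation gives the 2-gram "ta" a value of 2, while the binary one gives it 1, matching the distinction drawn in the text below.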
But for a verb such as "(to) tantalize", the feature corresponding to the 2-gram "ta" would get a value of 2 in a count representation, but only a value of 1 in a binary one.

The system was put together using the scikit-learn machine learning library for Python (Pedregosa et al., 2011), which provides a fast, scalable implementation of linear support vector machines based on liblinear (Fan et al., 2008), along with n-gram extraction and grid search functionality.

3 Results

Table 2 shows how well the rules fit the dataset. Out of the 7,295 verbs in the dataset, 349 were not captured by our rules. As expected, the rule capturing the most verbs (3,330) is the one modelling verbs from the 1st conjugational class (whose infinitives end in "a") that conjugate with the "ez" suffix and are regular, namely rule 7, created for verbs like "a dansa". The second largest class, also as expected, belongs to verbs from the 4th conjugational group (whose infinitives end in "i") that are regular, meaning no alternation in the stem, and conjugate with the "esc" suffix; this class is modeled by rule 27.

rule  verbs   rule  verbs
1     547     16    13
2     8       17    6
3     18      18    4
4     5       19    14
5     8       20    124
6     16      21    25
7     3330    22    15
8     273     23    7
9     89      24    41
10    4       25    51
11    5       26    185
12    4       27    1554
13    106     28    486
14    13      29    5
15    5       30    27

Table 2: Number of verbs captured by each of our rules

The support vector classifier was evaluated using 10-fold cross-validation. The multi-class problem is treated using the one-versus-all scheme. The parameters chosen by grid search are a maximum n-gram length of 5, with an appended terminator and non-binarized (count) features. The estimated correct classification rate is 90.64%, with a weighted average precision of 80.90%, recall of 90.64%, and F1 score of 89.89%. Appending the artificial terminator character '$' consistently improves accuracy by around 0.7%. Because each word was represented as a bag of character n-grams instead of a continuous string, because an SVM by its nature yields sparse solutions, and because evaluation used cross-validation, we can safely say that the model does not overfit and indeed learns useful decision boundaries.
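Since the paper names scikit-learn's liblinear-backed linear SVM together with its n-gram extraction and grid search utilities, the setup can be approximated with the library's standard components. The toy infinitives and labels below are invented for illustration; the real system tunes the parameters via grid search over the full 7,295-verb dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled infinitives ('$' is the appended terminator character);
# labels 7 and 27 stand in for the "ez"- and "esc"-type rules.
infinitives = ["dansa$", "visa$", "picta$", "citi$", "iubi$", "munci$"]
labels = [7, 7, 7, 27, 27, 27]

clf = make_pipeline(
    # character n-grams up to length 5, count (non-binarized) features
    CountVectorizer(analyzer="char", ngram_range=(1, 5), binary=False),
    LinearSVC(C=1.0),  # liblinear-backed linear SVM, one-vs-all by default
)
clf.fit(infinitives, labels)
print(clf.predict(["lucra$", "goni$"]))
```

In the paper's setting, the same pipeline would be wrapped in `GridSearchCV` with 10-fold cross-validation to pick the maximum n-gram length, the binarization flag, and the regularization parameter C.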
4 Conclusions and Future Work

Our results show that the labelling system based on the verb conjugation model we developed can be learned with reasonable accuracy. In the future, we plan to develop a multi-tiered labelling system that will allow general alternations, such as the ones occurring as a result of palatalization, to be defined only once for all verbs that exhibit them, taking cues from the idea of letters with multiple values. This, we feel, will greatly improve the accuracy of the classifier.

5 Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. All authors contributed equally to this work. The research of Liviu P. Dinu was supported by the CNCS, IDEI-PCE project 311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase in Discourse Representation Theory: the Determiners."

Figure 1: 10-fold cross-validation scores for various combinations of parameters. Only the values corresponding to the best C regularization parameter are shown.

References

Ana-Maria Barbu. Conjugarea verbelor românești. Dicționar: 7500 de verbe românești grupate pe clase de conjugare. Bucharest: Coresi, 2007. 4th edition, revised. (In Romanian.) (263 pp.).

Ana-Maria Barbu. Romanian lexical databases: Inflected and syllabic forms dictionaries. In Sixth International Language Resources and Evaluation (LREC'08), 2008.

Angelo Roth Costanzo. Romance Conjugational Classes: Learning from the Peripheries. PhD thesis, Ohio State University, 2011.

Liviu P. Dinu, Emil Ionescu, Vlad Niculae, and Octavia-Maria Șulea. Can alternations be learned? A machine learning approach to verb alternations. In Recent Advances in Natural Language Processing 2011, September 2011.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, June 2008. ISSN 1532-4435.

Jiři Felix. Classification des verbes roumains, volume VII. Philosophica Pragensia, 1964.

Valeria Guțu Romalo. Morfologie structurală a limbii române. Editura Academiei Republicii Socialiste România, 1968.

Alf Lombard. Le verbe roumain. Étude morphologique, volume 1. Lund: C. W. K. Gleerup, 1955.

Grigore C. Moisil. Probleme puse de traducerea automată. Conjugarea verbelor în limba română. Studii și cercetări lingvistice, XI(1):7-29, 1960.

I. Papastergiou, N. Papastergiou, and L. Mandeki. Verbul românesc - reguli pentru înlesnirea însușirii indicativului prezent. In Romanian National Symposium "Directions in Romanian Philological Research", 7th Edition, May 2007.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, Oct 2011.
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History

Torsten Zesch
Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information, Frankfurt
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Abstract

We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated. Additionally, we show that knowledge-based approaches can be improved by using semantic relatedness measures that make use of knowledge beyond classical taxonomic relations. Finally, we show that statistical and knowledge-based methods can be combined for increased performance.

1 Introduction

Measuring the contextual fitness of a term in its context is a key component of different NLP applications like speech recognition (Inkpen and Désilets, 2005), optical character recognition (Wick et al., 2007), co-reference resolution (Bean and Riloff, 2004), and malapropism detection (Bolshakov and Gelbukh, 2003). The main idea is always to test what fits better into the current context: the actual term, or a possible replacement that is phonetically, structurally, or semantically similar. We focus on malapropism detection, as it allows evaluating measures of contextual fitness more directly than evaluation within a complex application, which always entails influence from other components, e.g. the quality of the optical character recognition module (Walker et al., 2010).

A malapropism or real-word spelling error occurs when a word is replaced with another, correctly spelled word which does not suit the context, e.g. "People with lots of honey usually live in big houses.", where 'money' was replaced with 'honey'. Besides typing mistakes, a major source of such errors is the failed attempt of automatic spelling correctors to correct a misspelled word (Hirst and Budanitsky, 2005). A real-word spelling error is hard to detect, as the erroneous word is not misspelled and fits syntactically into the sentence. Thus, measures of contextual fitness are required to detect words that do not fit their contexts.

Existing measures of contextual fitness can be categorized into knowledge-based (Hirst and Budanitsky, 2005) and statistical methods (Mays et al., 1991; Wilcox-O'Hearn et al., 2008). Both test the lexical cohesion of a word with its context. For that purpose, knowledge-based approaches employ the structural knowledge encoded in lexical-semantic networks like WordNet (Fellbaum, 1998), while statistical approaches rely on co-occurrence counts collected from large corpora, e.g. the Google Web1T corpus (Brants and Franz, 2006).

529 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 529-538, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

So far, the evaluation of contextual fitness measures has relied on artificial datasets (Mays et al., 1991; Hirst and Budanitsky, 2005), created by taking a sentence that is known to be correct and replacing a word with a similar word from the vocabulary. This has a couple of disadvantages: (i) the replacement might be a synonym of the original word and perfectly valid in the given context, (ii) the generated error might be very unlikely to be made by a human, and (iii) inserting artificial errors often leads to unnatural sentences that are quite easy to correct, e.g. if the word class has changed. However, even if the word class is unchanged, the original word and its replacement might still be variants of the same lemma, e.g. a noun in singular and plural, or a verb in present and past form. This usually leads to a sentence where the error can easily be detected using syntactic or statistical methods, but is almost impossible to detect for knowledge-based measures of contextual fitness, as the meaning of the word stays more or less unchanged. To estimate the impact of this issue, we randomly sampled 1,000 artificially created real-word spelling errors (the same artificial data as described in Section 3.2) and found 387 singular/plural pairs and 57 pairs in another direct relation (e.g. adjective/adverb). This means that almost half of the artificially created errors are not suited for an evaluation aimed at finding optimal measures of contextual fitness, as they over-estimate the performance of statistical measures while under-estimating the potential of semantic measures. To investigate this issue, we present a framework for mining naturally occurring errors and their contexts from the Wikipedia revision history. We use the resulting English and German datasets to evaluate statistical and knowledge-based measures.

We make the full experimental framework publicly available (http://code.google.com/p/dkpro-spelling-asl/), which allows reproducing our experiments as well as conducting follow-up experiments. The framework contains (i) methods to extract natural errors from Wikipedia, (ii) reference implementations of the knowledge-based and the statistical methods, and (iii) the evaluation datasets described in this paper.

2 Mining Errors from Wikipedia

Measures of contextual fitness have previously been evaluated using artificially created datasets, as there are very few sources of sentences with naturally occurring errors and their corrections. Recently, the revision history of Wikipedia has been introduced as a valuable knowledge source for NLP (Nelken and Yamangil, 2008; Yatskar et al., 2010). It is also a possible source of natural errors, as Wikipedia editors are likely to make real-word spelling errors at some point, which are then corrected in subsequent revisions of the same article. The challenge lies in discriminating real-word spelling errors from all sorts of other changes, including non-word spelling errors, reformulations, or the correction of wrong facts. For that purpose, we apply a set of precision-oriented heuristics that narrow down the number of possible error candidates. Such an approach is feasible, as the high number of revisions in Wikipedia allows us to be extremely selective.

2.1 Accessing the Revision Data

We access the Wikipedia revision data using the freely available Wikipedia Revision Toolkit (Ferschke et al., 2011) together with the JWPL Wikipedia API (Zesch et al., 2008a; http://code.google.com/p/jwpl/). The API outputs plain text converted from wiki markup, but the text still contains a small portion of left-over markup and other artifacts. Thus, we perform additional cleaning steps, removing (i) tokens with more than 30 characters (often URLs), (ii) sentences with fewer than 5 or more than 200 tokens, and (iii) sentences containing a high fraction of special characters like ':', which usually indicate Wikipedia-specific artifacts such as lists of language links. The remaining sentences are part-of-speech tagged and lemmatized using TreeTagger (Schmid, 2004).
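The three cleaning heuristics can be sketched as a simple token-level filter. This is illustrative only: the thresholds for (i) and (ii) are the paper's, while the special-character ratio of 0.2 is an assumed value, since the paper does not state one.

```python
def clean_sentence(tokens, max_tok_len=30, min_len=5, max_len=200, max_special=0.2):
    """Return the cleaned token list, or None if the sentence is discarded."""
    # (i) drop over-long tokens, which are often URLs or markup residue
    tokens = [t for t in tokens if len(t) <= max_tok_len]
    # (ii) discard sentences that are too short or too long
    if not (min_len <= len(tokens) <= max_len):
        return None
    # (iii) discard sentences dominated by special characters like ':'
    # (typical of leftover Wikipedia artifacts such as language-link lists)
    chars = "".join(tokens)
    special = sum(1 for c in chars if not c.isalnum())
    if chars and special / len(chars) > max_special:
        return None
    return tokens

print(clean_sentence("The API outputs plain text converted from markup .".split()))
```

Sentences surviving this filter would then be passed on to POS tagging and lemmatization.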
Using these cleaned and annotated articles, we form pairs of adjacent article revisions (ri and ri+1).

2.2 Sentence Alignment

Fully aligning all sentences of the adjacent revisions is a quite costly operation, as sentences can be split, joined, replaced, or moved in the article. However, we are only looking for sentence pairs which are almost identical except for the real-word spelling error and its correction. Thus, we form all sentence pairs and then apply an aggressive but cheap filter that rules out all pairs which (i) are equal, or (ii) whose lengths differ by more than a small number of characters. For the resulting, much smaller subset of sentence pairs, we compute the Jaro distance (Jaro, 1995) between each pair. If the distance exceeds a certain threshold tsim (0.05 in this case), we do not consider the pair further. The small number of remaining sentence pairs is passed to the sentence pair filter for in-depth inspection.
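A self-contained version of this pairing step might look as follows. This is a sketch assuming the textbook Jaro similarity; the paper's exact implementation is not specified.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]; distance = 1 - similarity."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                 # count matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0                   # count transposed matches
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def keep_pair(s1, s2, t_sim=0.05):
    """Keep a revision sentence pair only if it changed, but changed slightly."""
    distance = 1 - jaro(s1, s2)
    return 0 < distance <= t_sim
```

For the running money/honey example, `keep_pair` accepts the pair, while an unchanged sentence (distance 0) or a full rewrite (large distance) is filtered out.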
2.3 Sentence Pair Filtering

The sentence pair filter further reduces the number of remaining sentence pairs by applying a set of heuristics, including surface-level and semantic-level filters. The surface-level filters include:

Replaced Token: Sentences need to consist of identical tokens, except for one replaced token.
No Numbers: The replaced token may not be a number.
UPPER CASE: The replaced token may not be in upper case.
Case Change: The change should not only involve case changes, e.g. changing 'english' into 'English'.
Edit Distance: The edit distance between the replaced token and its correction needs to be below a certain threshold.

After applying the surface-level filters, the remaining sentence pairs are well-formed and contain exactly one changed token at the same position in the sentence. However, the change need not be a real-word spelling error; it could also be a normal spelling error or a semantically motivated change. Thus, we apply a set of semantic filters:

Vocabulary: The replaced token needs to occur in the vocabulary. We found that even quite comprehensive word lists discarded too many valid errors, as Wikipedia contains articles from a very wide range of domains. Thus, we use a frequency filter based on the Google Web1T n-gram counts (Brants and Franz, 2006) and filter out all sentences where the replaced token has a very low unigram count. We experimented with different values and found 25,000 for English and 10,000 for German to yield good results.
Same Lemma: The original token and the replaced token may not have the same lemma; e.g. 'car' and 'cars' would not pass this filter.
Stopwords: The replaced token should not be in a short list of stopwords (mostly function words).
Named Entity: The replaced token should not be part of a named entity. For this purpose, we applied the Stanford NER (Finkel et al., 2005).
Normal Spelling Error: We apply the Jazzy spelling detector (http://jazzy.sourceforge.net/) and rule out all cases in which it is able to detect the error.
Semantic Relation: If the original token and the replaced token are in a close lexical-semantic relation, the change is likely to be semantically motivated, e.g. if 'house' was replaced with 'hut'. Thus, we do not consider cases where we detect a direct semantic relation between the original and the replaced term. For this purpose, we use WordNet (Fellbaum, 1998) for English and GermaNet (Lemnitzer and Kunze, 2002) for German.

3 Resulting Datasets

3.1 Natural Error Datasets

Using our framework for mining real-word spelling errors in context, we extracted an English dataset (using a revision dump from April 5, 2011) and a German dataset (using a revision dump from August 13, 2010). Although the output generally was of high quality, manual post-processing was necessary, as (i) for some pairs the available context did not provide enough information to decide which form was correct, and (ii) of a problem that might be specific to Wikipedia: vandalism. (The most efficient and precise way of finding real-word spelling errors would of course be to apply measures of contextual fitness. However, the resulting dataset would then only contain errors that are detectable by the very measures we want to evaluate - a clearly unacceptable bias. Thus, a certain amount of manual validation is inevitable.) The revisions are full of cases where words are replaced with similar-sounding but greasy alternatives. A relatively mild example is "In romantic comedies, there is a love story about a man and a woman who fall in love, along with silly or funny comedy farts.", where 'parts' was replaced with 'farts', only to be changed back shortly afterwards by a Wikipedia vandalism hunter. We removed all cases that resulted from obvious vandalism. For further experiments, a small list of offensive terms could be added to the stopword list to facilitate this process.

A connected problem is correct words that get falsely 'corrected' by Wikipedia editors (without the malicious intent of the previous examples, but with similar consequences). For example, the initially correct sentence "Dung beetles roll it into a ball, sometimes being up to 50 times their own weight." was 'corrected' by exchanging 'weight' with 'wait'. We manually removed such obvious mistakes, but are still left with some borderline cases.
In the sentence "By the 1780s the goals of England were so full that convicts were often chained up in rotting old ships.", the obvious error 'goal' was changed by a Wikipedia editor to 'jail'. Actually, however, it should have been 'gaol', the old English form of 'jail', which can be deduced from the full context and from later versions of the article. We decided not to remove these rare cases, because 'jail' is a valid correction in this context.

After manual inspection, we are left with 466 English and 200 German errors. Given that we restricted our experiment to 5 million English and German revisions, much larger datasets can be extracted if the whole revision history is taken into account. Our snapshot of the English Wikipedia contains 305·10^6 revisions. Even if not all of them correspond to article revisions, it is safe to assume that more than 10,000 real-word spelling errors can be extracted from this version of Wikipedia. Using the same number of source revisions, we found significantly more English than German errors. This might be due to (i) English having more short nouns and verbs than German, which are more likely to be confused with each other, and (ii) the English Wikipedia being known to attract a larger number of non-native editors, which might lead to higher rates of real-word spelling errors. However, this issue needs to be investigated further, e.g. on the basis of comparable corpora built from different language editions of Wikipedia. Further refining the identification of real-word errors in Wikipedia would also allow evaluating how frequently such errors actually occur, and how long it takes the Wikipedia editors to detect them. If errors persist over a long time, using measures of contextual fitness for detection would be even more important.

Another interesting observation is that the average edit distance is around 1.4 for both datasets. This means that a substantial proportion of errors involves more than one edit operation. Given that many measures of contextual fitness allow at most one edit, many naturally occurring errors will not be detected. However, allowing a larger edit distance enormously increases the search space, resulting in increased run-time and possibly decreased detection precision due to more false positives.

3.2 Artificial Error Datasets

In contrast to the quite challenging process of mining naturally occurring errors, creating artificial errors is relatively straightforward. From a corpus that is known to be free of spelling errors, sentences are randomly sampled. For each sentence, a random word is selected, and all strings with an edit distance smaller than a given threshold (2 in our case) are generated. If one of those generated strings is a known word from the vocabulary, it is picked as the artificial error.

Previous work on evaluating real-word spelling correction (Hirst and Budanitsky, 2005; Wilcox-O'Hearn et al., 2008; Islam and Inkpen, 2009) used a dataset sampled from the Wall Street Journal corpus, which is not freely available. Thus, we created a comparable English dataset of 1,000 artificial errors based on the easily available Brown corpus (Francis and Kučera, 1964; http://www.archive.org/details/BrownCorpus, CC-by-na). Additionally, we created a German dataset with 1,000 artificial errors based on the TIGER corpus (http://www.ims.uni-stuttgart.de/projekte/TIGER/; the corpus contains 50,000 sentences of German newspaper text and is freely available under a non-commercial license).
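The generation procedure can be sketched as follows. Illustrative only: the candidate generator below produces the distance-1 edits over a lowercase alphabet, i.e. the strings with edit distance smaller than the paper's threshold of 2; the sample vocabulary is invented.

```python
import random
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1 (deletions, substitutions, insertions)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in alphabet if c != b[0]]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + substitutions + inserts)

def make_artificial_error(tokens, vocabulary, rng=random):
    """Pick a random word and replace it with a random real-word neighbour."""
    word = rng.choice(tokens)
    candidates = sorted((edits1(word) & vocabulary) - {word})
    return (word, rng.choice(candidates)) if candidates else None

vocabulary = {"money", "honey", "house", "mouse", "big"}
print(sorted(edits1("money") & vocabulary))  # → ['honey']
```

If the chosen word has no in-vocabulary neighbour, no artificial error can be created for that sentence and another word (or sentence) is sampled.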
4 Measuring Contextual Fitness

There are two main approaches to measuring the contextual fitness of a word in its context: the statistical (Mays et al., 1991) and the knowledge-based approach (Hirst and Budanitsky, 2005).

4.1 Statistical Approach

Mays et al. (1991) introduced an approach based on the noisy-channel model. The model assumes that the correct sentence s is transmitted through a noisy channel which adds 'noise', resulting in a word w being replaced by an error e and leading to the wrong sentence s' that we observe. The probability of the correct word w given that we observe the error e is then proportional to P(w) · P(e|w).

The channel model P(e|w) describes how likely the typist is to make an error. This is modeled by the parameter α, which we optimize on a held-out development set of errors. The remaining probability mass (1 − α) is distributed equally among all words in the vocabulary within an edit distance of 1, denoted edits(w):

    P(e|w) = α                      if e = w
    P(e|w) = (1 − α) / |edits(w)|   if e ≠ w
We could use placing one word with a word from edits(w), multi-words from WordNet, but coverage would while all other words in the sentence remain be rather limited. We decided not to use both fil- unchanged. The correct sentence s is those ters in order to better assess the influence of the sentence from Sc that maximizes P (s|s0 ) = underlying semantic relatedness measure on the arg maxs∈Sc P (s) · P (s0 |s). overall performance. The knowledge based approach uses semantic 4.2 Knowledge Based Approach relatedness measures to determine the cohesion Hirst and Budanitsky (2005) introduced a between a candidate and its context. In the exper- knowledge-based approach that detects real-word iments by Budanitsky and Hirst (2006), the mea- spelling errors by checking the semantic relations sure by (Jiang and Conrath, 1997) yields the best of a target word with its context. For this pur- results. However, a wide range of other measures pose, they apply WordNet as the source of lexical- have been proposed, cf. (Zesch and Gurevych, semantic knowledge. 2010). Some measures using a wider defini- The algorithm flags all words as error can- tion of semantic relatedness (Gabrilovich and didates and then applies filters to remove those Markovitch, 2007; Zesch et al., 2008b) instead words from further consideration that are unlikely of only using taxonomic relations in a knowledge to be errors. First, the algorithm removes all source. closed-class word candidates as well as candi- As semantic relatedness measures usually re- dates which cannot be found in the vocabulary. turn a numeric value, we need to determine a Candidates are then tested for having lexical co- threshold θ in order to come up with a binary hesion with their context, by (i) checking whether related/unrelated decision. 
Budanitsky and Hirst the same surface form or lemma appears again in (2006) used a characteristic gap in the stan- the context, or (ii) a semantically related concept dard evaluation dataset by Rubenstein and Good- is found in the context. In both cases, the candi- enough (1965) that separates unrelated from re- date is removed from the list of candidates. For lated word pairs. We do not follow this approach, each remaining possible real-word spelling error, but optimize the threshold on a held-out develop- edits are generated by inserting, deleting, or re- ment set of real-word spelling errors. placing characters up to a certain edit distance (usually 1). Each edit is then tested for lexical 5 Results & Discussion cohesion with the context. If at least one of it fits into the context, the candidate is selected as a real- In this section, we report on the results obtained word error. in our evaluation of contextual fitness measures Hirst and Budanitsky (2005) use two additional using artificial and natural errors in English and filters: First, they remove candidates that are German. “common non-topical words”. It is unclear how the list of such words was compiled. Their list 5.1 Statistical Approach of examples contains words like ‘find’ or ‘world’ Table 1 summarizes the results obtained by the which we consider to be perfectly valid candi- statistical approach using a trigram model based dates. Second, they also applied a filter using a on the Google Web1T data (Brants and Franz, list of known multi-words, as the probability for 2006). On the English artificial errors, we ob- words to accidentally form multi-words is low. 
5 Results & Discussion

In this section, we report on the results obtained in our evaluation of contextual fitness measures using artificial and natural errors in English and German.

5.1 Statistical Approach

Table 1 summarizes the results obtained by the statistical approach using a trigram model based on the Google Web1T data (Brants and Franz, 2006). On the English artificial errors, we observe a quite high F-measure of .60 that drops to .35 when switching to the naturally occurring errors which we extracted from Wikipedia. On the German dataset, we observe almost the same performance drop (from .63 to .32).

    Dataset  N-gram model  Size      P    R    F
    Art-En   Google Web    7·10^11   .77  .50  .60
             Google Web    7·10^10   .78  .48  .59
             Google Web    7·10^9    .76  .42  .54
             Wikipedia     2·10^9    .72  .37  .49
    Nat-En   Google Web    7·10^11   .54  .26  .35
             Google Web    7·10^10   .51  .23  .31
             Google Web    7·10^9    .46  .19  .27
             Wikipedia     2·10^9    .49  .19  .27
    Art-De   Google Web    8·10^10   .90  .49  .63
             Google Web    8·10^9    .90  .47  .61
             Google Web    8·10^8    .88  .36  .51
             Wikipedia     7·10^8    .90  .37  .52
    Nat-De   Google Web    8·10^10   .77  .20  .32
             Google Web    8·10^9    .68  .14  .23
             Google Web    8·10^8    .65  .10  .17
             Wikipedia     7·10^8    .70  .13  .22

Table 2: Influence of the n-gram model on the performance of the statistical approach.

    Dataset             P    R    F
    Artificial-English  .26  .15  .19
    Natural-English     .29  .18  .23
    Artificial-German   .47  .16  .24
    Natural-German      .40  .13  .19

Table 3: Performance of the knowledge-based approach using the JiangConrath semantic relatedness measure.
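The statistical approach scores each candidate sentence (the input with one word replaced by an edit) against the n-gram model and flags a word when some replacement makes the sentence substantially more probable. The following is a toy sketch: the trigram log-probability table, the floor value, and the decision margin are invented for illustration, standing in for a Web1T- or Wikipedia-based model.

```python
# Toy stand-in for a trigram model: log2-probabilities for known
# trigrams, a floor value for unseen ones. All numbers are invented.
TRIGRAM_LOGPROB = {
    ("i", "bought", "a"): -1.0,
    ("bought", "a", "car"): -2.0,
    ("bought", "a", "cars"): -9.0,
}
FLOOR = -12.0

def sentence_logprob(tokens):
    """Sum of trigram log-probabilities (padded at the sentence start)."""
    padded = ["<s>", "<s>"] + tokens
    return sum(TRIGRAM_LOGPROB.get(tuple(padded[i:i + 3]), FLOOR)
               for i in range(len(tokens)))

def best_replacement(tokens, i, candidates, margin=3.0):
    """Return the candidate at position i that beats the original
    sentence by at least `margin` log-probability, if any."""
    best, best_lp = None, sentence_logprob(tokens) + margin
    for cand in candidates:
        lp = sentence_logprob(tokens[:i] + [cand] + tokens[i + 1:])
        if lp > best_lp:
            best, best_lp = cand, lp
    return best
```

With these toy counts, "I bought a cars." triggers the correction to 'car', while the already-correct sentence is left alone, mirroring the easy artificial singular/plural cases discussed below.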
These observations correspond to our earlier analysis where we showed that the artificial data contains many cases that are quite easy to correct using a statistical model, e.g. where a plural form of a noun is replaced with its singular form (or vice versa) as in "I bought a car." vs. "I bought a cars.". The naturally occurring errors often contain much harder contexts, as shown in the following example: "Through the open window they heard sounds below in the street: cartwheels, a tired horse's plodding step, vices." where 'vices' should be corrected to 'voices'. While the lemma 'voice' is clearly semantically related to other words in the context like 'hear' or 'sound', the position at the end of the sentence is especially difficult for the trigram-based statistical approach. The only trigram that connects the error to the context is ('step', ',', vices/voices) which will probably yield a low frequency count even for very large trigram models. Higher order n-gram models would help, but suffer from the usual data-sparseness problems.

Influence of the N-gram Model  For building the trigram model, we used the Google Web1T data, which has some known quality issues and is not targeted towards the Wikipedia articles from which we sampled the natural errors. Thus, we also tested a trigram model based on Wikipedia. However, it is much smaller than the Web model, which leads us to additionally testing smaller Web models. Table 2 summarizes the results.

We observe that "more data is better data" still holds, as the largest Web model always outperforms the Wikipedia model in terms of recall. If we reduce the size of the Web model to the same order of magnitude as the Wikipedia model, the performance of the two models is comparable. We would have expected to see better results for the Wikipedia model in this setting, but its higher quality does not lead to a significant difference.
Even if statistical approaches quite reliably detect real-word spelling errors, the size of the required n-gram models remains a serious obstacle for use in real-world applications. The English Web1T trigram model is about 25GB, which currently is not suited for being applied in settings with limited storage capacities, e.g. for intelligent input assistance in mobile devices. As we have seen above, using smaller models will decrease recall to a point where hardly any error will be detected anymore. Thus, we will now have a look at knowledge-based approaches which are less demanding in terms of the required resources.

5.2 Knowledge-based Approach

Table 3 shows the results for the knowledge-based measure. In contrast to the statistical approach, the results on the artificial errors are not higher than on the natural errors, but almost equal for German and even lower for English; another piece of evidence supporting our view that the properties of artificial datasets over-estimate the performance of statistical measures.

Influence of the Relatedness Measure  As was pointed out before, Budanitsky and Hirst (2006) show that the measure by Jiang and Conrath (1997) yields the best results in their experiments on malapropism detection.
In addition, we test another path-based measure by Lin (1998), the gloss-based measure by Lesk (1986), and the ESA measure (Gabrilovich and Markovitch, 2007) based on concept vectors from Wikipedia, Wiktionary, and WordNet. Table 4 summarizes the results. In contrast to the findings of Budanitsky and Hirst (2006), JiangConrath is not the best path-based measure, as Lin provides equal or better performance. Even more importantly, other (non path-based) measures yield better performance than both path-based measures. Especially ESA based on Wiktionary provides a good overall performance, while ESA based on Wikipedia provides excellent precision. The advantage of ESA over the other measure types can be explained with its ability to incorporate semantic relationships beyond classical taxonomic relations (as used by path-based measures).

    Dataset  Measure         θ     P    R    F
    Art-En   JiangConrath    0.5   .26  .15  .19
             Lin             0.5   .22  .17  .19
             Lesk            0.5   .19  .16  .17
             ESA-Wikipedia   0.05  .43  .13  .20
             ESA-Wiktionary  0.05  .35  .20  .25
             ESA-Wordnet     0.05  .33  .15  .21
    Nat-En   JiangConrath    0.5   .29  .18  .23
             Lin             0.5   .26  .21  .23
             Lesk            0.5   .19  .19  .19
             ESA-Wikipedia   0.05  .48  .14  .22
             ESA-Wiktionary  0.05  .39  .21  .27
             ESA-Wordnet     0.05  .36  .15  .21

Table 4: Performance of the knowledge-based approach using different relatedness measures.

5.3 Combining the Approaches

The statistical and the knowledge-based approach use quite different methods to assess the contextual fitness of a word in its context. This makes it worthwhile trying to combine both approaches. We ran the statistical method (using the full Wikipedia trigram model) and the knowledge-based method (using the ESA-Wiktionary relatedness measure) in parallel and then combined the resulting detections using two strategies: (i) we merge the detections of both approaches in order to obtain higher recall ('Union'), and (ii) we only count an error as detected if both methods agree on a detection ('Intersection').

    Dataset             Comb.-Strategy  P    R    F
    Artificial-English  Best-Single     .77  .50  .60
                        Union           .52  .55  .54
                        Intersection    .91  .15  .25
    Natural-English     Best-Single     .54  .26  .35
                        Union           .40  .36  .38
                        Intersection    .82  .11  .19

Table 5: Results obtained by a combination of the best statistical and knowledge-based configuration. 'Best-Single' is the best precision or recall obtained by a single measure. 'Union' merges the detections of both approaches. 'Intersection' only detects an error if both methods agree on a detection.

When comparing the combined results in Table 5 with the best precision or recall obtained by a single measure ('Best-Single'), we observe that precision can be significantly improved using the 'Intersection' strategy, while recall is only moderately improved using the 'Union' strategy. This means that (i) a large subset of errors is detected by both approaches that due to their different sources of knowledge mutually reinforce the detection, leading to increased precision, and (ii) a small but otherwise undetectable subset of errors requires considering detections made by one approach only.
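With detections represented as sets of (sentence id, token position) pairs, the two combination strategies reduce to plain set operations, and precision and recall follow from a gold-standard set. A minimal sketch with invented toy data:

```python
def combine(stat_detections, kb_detections, strategy):
    """'union' favours recall, 'intersection' favours precision."""
    if strategy == "union":
        return stat_detections | kb_detections
    if strategy == "intersection":
        return stat_detections & kb_detections
    raise ValueError(strategy)

def precision_recall(detections, gold):
    """Precision/recall of a detection set against gold annotations."""
    tp = len(detections & gold)
    p = tp / len(detections) if detections else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r
```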
6 Related Work

To our knowledge, we are the first to create a dataset of naturally occurring errors based on the revision history of Wikipedia. Max and Wisniewski (2010) used similar techniques to create a dataset of errors from the French Wikipedia. However, they target a wider class of errors including non-word spelling errors, and their class of real-word errors conflates malapropisms as well as other types of changes like reformulations. Thus, their dataset cannot be easily used for our purposes and is only available in French, while our framework allows creating datasets for all major languages with minimal manual effort.

Another possible source of real-word spelling errors are learner corpora (Granger, 2002), e.g. the Cambridge Learner Corpus (Nicholls, 1999). However, annotation of errors is difficult and costly (Rozovskaya and Roth, 2010), only a small fraction of observed errors will be real-word spelling errors, and learners are likely to make different mistakes than proficient language users.

Islam and Inkpen (2009) presented another statistical approach using the Google Web1T data (Brants and Franz, 2006) to create the n-gram model. It slightly outperformed the approach by Mays et al. (1991) when evaluated on a corpus of artificial errors based on the WSJ corpus. However, the results are not directly comparable, as Mays et al. (1991) used a much smaller n-gram model, and our results in Section 5.1 show that the size of the n-gram model has a large influence on the results. Eventually, we decided to use the Mays et al. (1991) approach in our study, as it is easier to adapt and augment.
In a re-evaluation of the statistical model by Mays et al. (1991), Wilcox-OHearn et al. (2008) found that it outperformed the knowledge-based method by Hirst and Budanitsky (2005) when evaluated on a corpus of artificial errors based on the WSJ corpus. This is consistent with our findings on the artificial errors based on the Brown corpus, but - as we have seen in the previous section - evaluation on the naturally occurring errors shows a different picture. They also tried to improve the model by permitting multiple corrections and using fixed-length context windows instead of sentences, but obtained discouraging results.

All previously discussed methods are unsupervised in a way that they do not rely on any training data with annotated errors. However, real-word spelling correction has also been tackled by supervised approaches (Golding and Schabes, 1996; Jones and Martin, 1997; Carlson et al., 2001). Those methods rely on predefined confusion-sets, i.e. sets of words that are often confounded, e.g. {peace, piece} or {weather, whether}. For each set, the methods learn a model of the context in which one or the other alternative is more probable. This yields very high precision, but only for the limited number of previously defined confusion sets. Our framework for extracting natural errors could be used to increase the number of known confusion sets.
7 Conclusions and Future Work

In this paper, we evaluated two main approaches for measuring the contextual fitness of terms: the statistical approach by Mays et al. (1991) and the knowledge-based approach by Hirst and Budanitsky (2005) on the task of detecting real-word spelling errors. For that purpose, we extracted a dataset with naturally occurring errors and their contexts from the Wikipedia revision history. We show that evaluating measures of contextual fitness on this dataset provides a more realistic picture of task performance. In particular, using artificial datasets over-estimates the performance of the statistical approach, while it under-estimates the performance of the knowledge-based approach.

We show that n-gram models targeted towards the domain from which the errors are sampled do not improve the performance of the statistical approach if larger n-gram models are available. We further show that the performance of the knowledge-based approach can be improved by using semantic relatedness measures that incorporate knowledge beyond the taxonomic relations in a classical lexical-semantic resource like WordNet. Finally, by combining both approaches, significant increases in precision or recall can be achieved.

In future work, we want to evaluate a wider range of contextual fitness measures, and learn how to combine them using more sophisticated combination strategies. Both the statistical as well as the knowledge-based approach will benefit from a better model of the typist, as not all edit operations are equally likely (Kernighan et al., 1990). On the side of the error extraction, we are going to further improve the extraction process by incorporating more knowledge about the revisions. For example, vandalism is often reverted very quickly, which can be detected when looking at the full set of revisions of an article. We hope that making the experimental framework publicly available will foster future research in this field, as our results on the natural errors show that the problem is still quite challenging.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Andreas Kellner and Tristan Miller for checking the datasets, and the anonymous reviewers for their helpful feedback.

References

David Bean and Ellen Riloff. 2004. Unsupervised learning of contextual role knowledge for coreference resolution. In Proceedings of HLT/NAACL, pages 297-304.

Igor A. Bolshakov and Alexander Gelbukh. 2003. On Detection of Malapropisms by Multistage Collocation Testing. In Proceedings of NLDB-2003, 8th International Workshop on Applications of Natural Language to Information Systems.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13-47.

Andrew J. Carlson, Jeffrey Rosen, and Dan Roth. 2001. Scaling Up Context-Sensitive Text Correction. In Proceedings of IAAI.
C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97-102, Portland, OR, USA.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 363-370, Morristown, NJ, USA. Association for Computational Linguistics.

Francis W. Nelson and Henry Kučera. 1964. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606-1611.

Andrew R. Golding and Yves Schabes. 1996. Combining Trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 71-78, Morristown, NJ, USA. Association for Computational Linguistics.

Sylviane Granger. 2002. A bird's-eye view of learner corpus research, pages 3-33. John Benjamins Publishing Company.

Graeme Hirst and Alexander Budanitsky. 2005. Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering, 11(1):87-111, March.

Diana Inkpen and Alain Désilets. 2005. Semantic similarity for detecting recognition errors in automatic speech transcripts. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pages 49-56, Morristown, NJ, USA. Association for Computational Linguistics.

Aminul Islam and Diana Inkpen. 2009. Real-word spelling correction using Google Web 1T 3-grams. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 3 (EMNLP '09), Morristown, NJ, USA. Association for Computational Linguistics.
M. A. Jaro. 1995. Probabilistic linkage of large public health data file. Statistics in Medicine, 14:491-498.

Jay J. Jiang and David W. Conrath. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics, Taipei, Taiwan.

Michael P. Jones and James H. Martin. 1997. Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 166-173, Morristown, NJ, USA. Association for Computational Linguistics.

Mark D. Kernighan, Kenneth W. Church, and William A. Gale. 1990. A Spelling Correction Program Based on a Noisy Channel Model. In Proceedings of the 13th International Conference on Computational Linguistics, pages 205-210, Helsinki, Finland.

Lothar Lemnitzer and Claudia Kunze. 2002. GermaNet - Representation, Visualization, Application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1485-1491.

M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference, pages 24-26.

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the International Conference on Machine Learning, pages 296-304, Madison, Wisconsin.
Aurelien Max and Guillaume Wisniewski. 2010. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 3143-3148.

Eric Mays, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing & Management, 27(5):517-522.

Rani Nelken and Elif Yamangil. 2008. Mining Wikipedia's Article Revision History for Training Computational Linguistics Algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy (WikiAI08).

Diane Nicholls. 1999. The Cambridge Learner Corpus - Error Coding and Analysis for Lexicography and ELT. In Summer Workshop on Learner Corpora, Tokyo, Japan.

Alla Rozovskaya and Dan Roth. 2010. Annotating ESL Errors: Challenges and Rewards. In The 5th Workshop on Innovative Use of NLP for Building Educational Applications (NAACL-HLT).

H. Rubenstein and J. B. Goodenough. 1965. Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627-633.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland.
Proceed- ings of the 2010 Conference on Empirical Methods in Natural Language Processing, (October):240– 250. M. Wick, M. Ross, and E. Learned-Miller. 2007. Context-sensitive error correction: Using topic models to improve OCR. In Ninth International Conference on Document Analysis and Recogni- tion (ICDAR 2007) Vol 2, pages 1168–1172. Ieee, September. Amber Wilcox-OHearn, Graeme Hirst, and Alexander Budanitsky. 2008. Real-word spelling correction with trigrams: A reconsideration of the Mays, Dam- erau, and Mercer model. In Proceedings of the 9th international conference on Computational linguis- tics and intelligent text processing (CICLing). Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu- Mizil, and Lillian Lee. 2010. For the sake of sim- plicity: unsupervised extraction of lexical simplifi- cations from Wikipedia. In Human Language Tech- nologies: The 2010 Annual Conference of the North American Chapter of the Association for Computa- tional Linguistics, HLT ’10, pages 365–368. Torsten Zesch and Iryna Gurevych. 2010. Wisdom of Crowds versus Wisdom of Linguists - Measur- ing the Semantic Relatedness of Words. Journal of Natural Language Engineering, 16(1):25–59. Torsten Zesch, Christof M¨uller, and Iryna Gurevych. 2008a. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evalu- ation (LREC). 538 Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation Rico Sennrich Institute of Computational Linguistics University of Zurich Binzmühlestr. 14 CH-8050 Zürich

[email protected]

Abstract

We investigate the problem of domain adaptation for parallel data in Statistical Machine Translation (SMT). While techniques for domain adaptation of monolingual data can be borrowed for parallel data, we explore conceptual differences between translation model and language model domain adaptation and their effect on performance, such as the fact that translation models typically consist of several features that have different characteristics and can be optimized separately. We also explore adapting multiple (4-10) data sets with no a priori distinction between in-domain and out-of-domain data except for an in-domain development set.

1 Introduction

The increasing availability of parallel corpora from various sources, welcome as it may be, leads to new challenges when building a statistical machine translation system for a specific domain.
The task of determining which parallel texts should be included for training, and which ones hurt translation performance, is tedious when performed through trial-and-error. Alternatively, methods for a weighted combination exist, but there is conflicting evidence as to which approach works best, and the issue of determining weights is not adequately resolved.

The picture looks better in language modelling, where model interpolation through perplexity minimization has become a widespread method of domain adaptation. We investigate the applicability of this method for translation models, and discuss possible applications.

We move the focus away from a binary combination of in-domain and out-of-domain data. If we can scale up the number of models whose contributions we weight, this reduces the need for a priori knowledge about the fitness[1] of each potential training text, and opens new research opportunities, for instance experiments with clustered training data.

[1] We borrow this term from early evolutionary biology to emphasize that the question in domain adaptation is not how "good" or "bad" the data is, but how well-adapted it is to the task at hand.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539-549, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Domain Adaptation for Translation Models

To motivate efforts in domain adaptation, let us review why additional training data can improve, but also decrease translation quality.

Adding more training data to a translation system is easy to motivate through the data sparseness problem. Koehn and Knight (2001) show that translation quality correlates strongly with how often a word occurs in the training corpus. Rare words or phrases pose a problem in several stages of MT modelling, from word alignment to the computation of translation probabilities through Maximum Likelihood Estimation. Unknown words are typically copied verbatim to the target text, which may be a good strategy for named entities, but is often wrong otherwise. In general, more data allows for a better word alignment, a better estimation of translation probabilities, and for the consideration of more context (in phrase-based or syntactic SMT).

A second effect of additional data is not necessarily positive. Translations are inherently ambiguous, and a strong source of ambiguity is the domain of a text. The German word "Wort" (engl. word) is typically translated as floor in Europarl, a corpus of Parliamentary Proceedings (Koehn, 2005), owing to the high frequency of phrases such as you have the floor, which is translated into German as Sie haben das Wort.
This translation is highly idiomatic and unlikely to occur in other contexts. Still, adding Europarl as out-of-domain training data shifts the probability distribution of p(t|"Wort") in favour of p("floor"|"Wort"), and may thus lead to improper translations.

We will refer to the two problems as the data sparseness problem and the ambiguity problem. Adding out-of-domain data typically mitigates the data sparseness problem, but exacerbates the ambiguity problem. The net gain (or loss) of adding more data changes from case to case. Because there are (to our knowledge) no tools that predict this net effect, it is a matter of empirical investigation (or, in less suave terms, trial-and-error) to determine which corpora to use.[2]

[2] A frustrating side-effect is that these findings rarely generalize. For instance, we were unable to reproduce the finding by Ceauşu et al. (2011) that patent translation systems are highly domain-sensitive and suffer from the inclusion of parallel training data from other patent subdomains.

From this understanding of the reasons for and against out-of-domain data, we formulate the following hypotheses:

1. A weighted combination can control the contribution of the out-of-domain corpus on the probability distribution, and thus limit the ambiguity problem.

2. A weighted combination eliminates the need for data selection, offering a robust baseline for domain-specific machine translation.
We will discuss three mixture modelling techniques for translation models. Our aim is to adapt all four features of the standard Moses SMT translation model: the phrase translation probabilities p(t|s) and p(s|t), and the lexical weights lex(t|s) and lex(s|t).[3]

[3] We can ignore the fifth feature, the phrase penalty, which is a constant.

2.1 Linear Interpolation

A well-established approach in language modelling is the linear interpolation of several models, i.e. computing the weighted average of the individual model probabilities. It is defined as follows:

    p(x|y; \lambda) = \sum_{i=1}^{n} \lambda_i \, p_i(x|y)                (1)

with \lambda_i being the interpolation weight of each model i, and with \sum_i \lambda_i = 1.

For SMT, linear interpolation of translation models has been used in numerous systems. The approaches diverge in how they set the interpolation weights. Some authors use uniform weights (Cohn and Lapata, 2007), others empirically test different interpolation coefficients (Finch and Sumita, 2008; Yasuda et al., 2008; Nakov and Ng, 2009; Axelrod et al., 2011), others apply monolingual metrics to set the weights for TM interpolation (Foster and Kuhn, 2007; Koehn et al., 2010).

There are reasons against all these approaches. Uniform weights are easy to implement, but give little control. Empirically, it has been shown that they often do not perform optimally (Finch and Sumita, 2008; Yasuda et al., 2008). An optimization of BLEU scores on a development set is promising, but slow and impractical. There is no easy way to integrate linear interpolation into log-linear SMT frameworks and perform optimization through MERT. Monolingual optimization objectives such as language model perplexity have the advantage of being well-known and readily available, but their relation to the ambiguity problem is indirect at best.

Linear interpolation is seemingly well-defined in equation 1. Still, there are a few implementation details worth pointing out. If we directly interpolate each feature in the translation model, and define the feature values of non-occurring phrase pairs as 0, this disregards the meaning of each feature. If we estimate p(x|y) via MLE as in equation 2, and c(y) = 0, then p(x|y) is strictly speaking undefined. As an alternative to a naive algorithm, which treats unknown phrase pairs as having a probability of 0 and thus results in a deficient probability distribution, we propose and implement the following algorithm. For each value pair (x, y) for which we compute p(x|y), we replace \lambda_i with 0 for all models i with p_i(y) = 0, then renormalize the weight vector \lambda to 1.
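The interpolation of equation 1, together with the renormalization just described, can be sketched as follows. The data layout (a dictionary per model, mapping source side y to a distribution over x) is illustrative, not the paper's actual implementation:

```python
def interpolate(models, weights, x, y):
    """Linear interpolation (equation 1), with the modification that
    weights of models which do not know the source side y are set to 0
    and the remaining weight vector is renormalized to sum to 1.

    models: list of dicts mapping y -> {x: p(x|y)}.
    """
    # keep only models that have seen y at all
    active = [(m, w) for m, w in zip(models, weights) if y in m]
    if not active:
        return 0.0          # y unknown to every model
    total = sum(w for _, w in active)
    # a model that knows y but not x contributes probability 0,
    # i.e. negative evidence is still penalized
    return sum(w / total * m[y].get(x, 0.0) for m, w in active)
```

Note how a phrase known only to one model receives that model's full (renormalized) probability rather than being dragged down by models that never saw its source side.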
We do this for p(t|s) and lex(t|s), but not for p(s|t) and lex(s|t), the reasoning being the consequences for perplexity minimization (see section 2.4). Namely, we do not want to penalize a small in-domain model for having a high out-of-vocabulary rate on the source side, but we do want to penalize models that know the source phrase, but not its correct translation. A second modification pertains to the lexical weights lex(s|t) and lex(t|s), which form no true probability distribution, but are derived from the individual word translation probabilities of a phrase pair (see Koehn et al. (2003)). We propose to not interpolate the features directly, but the word translation probabilities which are the basis of the lexical weight computation. The reason for this is that word pairs are less sparse than phrase pairs, so that we can even compute lexical weights for phrase pairs which are unknown in a model.[4]

[4] For instance if the word pairs (the, der) and (man, Mann) are known, but the phrase pair (the man, der Mann) is not.

2.2 Weighted Counts

Weighting of different corpora can also be implemented through a modified Maximum Likelihood Estimation. The traditional equation for MLE is:

    p(x|y) = \frac{c(x,y)}{c(y)} = \frac{c(x,y)}{\sum_{x'} c(x',y)}       (2)

where c denotes the count of an observation, and p the model probability. If we generalize the formula to compute a probability from n corpora, and assign a weight \lambda_i to each, we get[5]:

    p(x|y; \lambda) = \frac{\sum_{i=1}^{n} \lambda_i c_i(x,y)}{\sum_{i=1}^{n} \sum_{x'} \lambda_i c_i(x',y)}    (3)

[5] Unlike equation 1, equation 3 does not require that \sum_i \lambda_i = 1.
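Equation 3 can be sketched directly from per-corpus phrase-pair counts; the Counter-based layout and the toy counts below are invented for illustration:

```python
from collections import Counter

def weighted_mle(corpora_counts, weights, x, y):
    """p(x|y; lambda) from weighted counts (equation 3).

    corpora_counts: one Counter of (x, y) phrase-pair counts per corpus.
    """
    num = sum(w * c[(x, y)] for c, w in zip(corpora_counts, weights))
    den = sum(w * cnt for c, w in zip(corpora_counts, weights)
              for (x2, y2), cnt in c.items() if y2 == y)
    return num / den if den else 0.0
```

Because counts rather than probabilities are mixed, a phrase pair backed by many weighted observations outweighs one backed by few, which is exactly the "well-evidenced" distinction discussed next.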
The main difference to linear interpolation is that this equation takes into account how well-evidenced a phrase pair is. This includes the distinction between lack of evidence and negative evidence, which is missing in a naive implementation of linear interpolation.

Translation models trained with weighted counts have been discussed before, and have been shown to outperform uniform ones in some settings. However, researchers who demonstrated this fact did so with arbitrary weights (e.g. Koehn (2002)), or by empirically testing different weights (e.g. Nakov and Ng (2009)). We do not know of any research on automatically determining weights for this method, or which is not limited to two corpora.

2.3 Alternative Paths

A third method is using multiple translation models as alternative decoding paths (Birch et al., 2007), an idea which Koehn and Schroeder (2007) first used for domain adaptation. This approach has the attractive theoretical property that adding new models is guaranteed to lead to equal or better performance, given the right weights. At best, a model is beneficial with appropriate weights. At worst, we can set the feature weights so that the decoding paths of one model are never picked for the final translation. In practice, each translation model adds 5 features and thus 5 more dimensions to the weight space, which leads to longer search, search errors, and/or overfitting. The expectation is that, at least with MERT, using alternative decoding paths does not scale well to a high number of models.

A suboptimal choice of weights is not the only weakness of alternative paths, however. Let us assume that all models have the same weights. Note that, if a phrase pair occurs in several models, combining models through alternative paths means that the decoder selects the path with the highest probability, whereas with linear interpolation, the probability of the phrase pair would be the (weighted) average of all models. Selecting the highest-scoring phrase pair favours statistical outliers and hence is the less robust decision, prone to data noise and data sparseness.
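The max-versus-average contrast can be made concrete in two lines. With equal weights, alternative decoding paths effectively score a shared phrase pair by the maximum over the models' probabilities, while interpolation averages them; the toy probabilities are invented:

```python
def alternative_paths_score(probs):
    """With equal feature weights, the decoder picks the highest-scoring
    path, i.e. the maximum over the models' probabilities."""
    return max(probs)

def interpolation_score(probs, weights):
    """Linear interpolation: the weighted average over all models."""
    return sum(w * p for w, p in zip(weights, probs))
```

A single outlier model that assigns 0.9 where two others assign 0.05 dominates the max, whereas the average stays close to 1/3, which is why the max is the less robust decision.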
2.4 Perplexity Minimization

In language modelling, perplexity is frequently used as a quality measure for language models (Chen and Goodman, 1998). Among other applications, language model perplexity has been used for domain adaptation (Foster and Kuhn, 2007). For translation models, perplexity is most closely associated with EM word alignment (Brown et al., 1993) and has been used to evaluate different alignment algorithms (Al-Onaizan et al., 1999).

We investigate translation model perplexity minimization as a method to set model weights in mixture modelling. For the purpose of optimization, the cross-entropy H(p), the perplexity 2^{H(p)}, and other derived measures are equivalent. The cross-entropy H(p) is defined as:[6]
To obtain the phrase pairs, we process the development set with the same word alignment and phrase extraction tools that we use for training, i.e. GIZA++ and heuristics for phrase extraction (Och and Ney, 2003). The objective function is the minimization of the cross-entropy, with the weight vector λ as argument:

    λ̂ = argmin_λ − Σ_{x,y} p̃(x,y) log2 p(x|y; λ)    (5)

We can fill in equations 1 or 3 for p(x|y; λ). The optimization itself is convex and can be done with off-the-shelf software.7 We use L-BFGS with numerically approximated gradients (Byrd et al., 1995).

Perplexity minimization has the advantage that it is well-defined for both weighted counts and linear interpolation, and can be quickly computed. Unlike in language modelling, where p(x|y) is the probability of a word given an n-gram history, conditional probabilities in translation models express the probability of a target phrase given a source phrase (or vice versa), which connects the perplexity to the ambiguity problem. The higher the probability of "correct" phrase pairs, the lower the perplexity, and the more likely the model is to successfully resolve the ambiguity. The question is in how far perplexity minimization coincides with empirically good mixture weights.8 This depends, among others, on the other model components in the SMT framework, for instance the language model. We will not evaluate perplexity minimization against empirically optimized mixture weights, but apply it in situations where the latter is infeasible, e.g. because of the number of models.

Also, we independently perform perplexity minimization for all four features of the standard SMT translation model: the phrase translation probabilities p(t|s) and p(s|t), and the lexical weights lex(t|s) and lex(s|t).

7 A quick demonstration of convexity: equation 1 is affine; equation 3 linear-fractional. Both are convex in the domain R>0. Consequently, equation 4 is also convex because it is the weighted sum of convex functions.
8 There are tasks for which perplexity is known to be unreliable, e.g. for comparing models with different vocabularies. However, such confounding factors do not affect the optimization algorithm, which works with a fixed set of phrase pairs, and merely varies λ.

3 Other Domain Adaptation Techniques

So far, we discussed mixture modelling for translation models, which is only a subset of domain adaptation techniques in SMT.

Mixture modelling for language models is well established (Foster and Kuhn, 2007). Language model adaptation serves the same purpose as translation model adaptation, i.e. skewing the probability distribution in favour of in-domain translations. This means that LM adaptation may have similar effects as TM adaptation, and that the two are to some extent redundant. Foster and Kuhn (2007) find that "both TM and LM adaptation are effective", but that "combined LM and TM adaptation is not better than LM adaptation on its own".

A second strand of research in domain adaptation is data selection, i.e. choosing a subset of the training data that is considered more relevant for the task at hand. This has been done for language models using techniques from information retrieval (Zhao et al., 2004), or perplexity (Lin et al., 1997; Moore and Lewis, 2010). Data selection has also been proposed for translation models (Axelrod et al., 2011). Note that for translation models, data selection offers an unattractive trade-off between the data sparseness and the ambiguity problem, and that the optimal amount of data to select is hard to determine.

Our discussion of mixture modelling is relatively coarse-grained, with 2-10 models being combined. Matsoukas et al. (2009) propose an approach where each sentence is weighted according to a classifier, and Foster et al. (2010) extend this approach by weighting individual phrase pairs. These more fine-grained methods need not be seen as alternatives to coarse-grained ones. Foster et al. (2010) combine the two, applying linear interpolation to combine the instance-weighted out-of-domain model with an in-domain model.

4 Evaluation

Apart from measuring the performance of the approaches introduced in section 2, we want to investigate the following open research questions.

1. Does an implementation of linear interpolation that is more closely tailored to translation modelling outperform a naive implementation?

2. How do the approaches perform outside a binary setting, i.e. when we do not work with one in-domain and one out-of-domain model, but with a higher number of models?

3. Can we apply perplexity minimization to other translation model features such as the lexical weights, and if yes, does a separate optimization of each translation model feature improve performance?

4.1 Data and Methods

In terms of tools and techniques used, we mostly adhere to the work flow described for the WMT 2011 baseline system.9 The main tools are Moses (Koehn et al., 2007), SRILM (Stolcke, 2002), and GIZA++ (Och and Ney, 2003), with settings as described in the WMT 2011 guide. We report two translation measures: BLEU (Papineni et al., 2002) and METEOR 1.3 (Denkowski and Lavie, 2011). All results are lowercased and tokenized, measured with five independent runs of MERT (Och and Ney, 2003) and MultEval (Clark et al., 2011) for resampling and significance testing.

Since we want to concentrate on translation model domain adaptation, we keep other model components, namely word alignment and the lexical reordering model, constant throughout the experiments. We contrast two language models: an unadapted, out-of-domain language model trained on data sets provided for the WMT 2011 translation task, and an adapted language model which is the linear interpolation of all data sets, optimized for minimal perplexity on the in-domain development set. While unadapted language models are becoming more rare in domain adaptation research, they allow us to contrast different TM mixtures without the effect on performance being (partially) hidden by language model adaptation with the same effect.

We compare three baselines and four translation model mixture techniques. The three baselines are a purely in-domain model, a purely out-of-domain model, and a model trained on the concatenation of the two, which corresponds to equation 3 with uniform weights. Additionally, we evaluate perplexity optimization with weighted counts and the two implementations of linear interpolation contrasted in section 2.1. The two linear interpolations that are contrasted are a naive one, i.e. a direct, unnormalized interpolation of all translation model features, and a modified one that normalizes λ for each phrase pair (s,t) for p(t|s) and recomputes the lexical weights based on interpolated word translation probabilities. The fourth weighted combination uses alternative decoding paths with weights set through MERT. The four weighted combinations are evaluated twice: once applied to the original four or ten parallel data sets, once in a binary setting in which all out-of-domain data sets are first concatenated.

The first data set is a DE–FR translation scenario in the domain of mountaineering. The in-domain corpus is a collection of Alpine Club publications (Volk et al., 2010). As parallel out-of-domain data sets, we use Europarl, a collection of parliamentary proceedings (Koehn, 2005), JRC-Acquis, a collection of legislative texts (Steinberger et al., 2006), and OpenSubtitles v2, a parallel corpus extracted from film subtitles10 (Tiedemann, 2009). For language modelling, we use in-domain data and data from the 2011 Workshop on Statistical Machine Translation. The respective sizes of the data sets are listed in tables 1 and 2.

    Data set             sentences   words (fr)
    Alpine (in-domain)   220k        4 700k
    Europarl             1 500k      44 000k
    JRC Acquis           1 100k      24 000k
    OpenSubtitles v2     2 300k      18 000k
    Total train          5 200k      91 000k
    Dev                  1424        33 000
    Test                 991         21 000

Table 1: Parallel data sets for German – French translation task.

    Data set             sentences   words
    Alpine (in-domain)   650k        13 000k
    News-commentary      150k        4 000k
    Europarl             2 000k      60 000k
    News                 25 000k     610 000k
    Total                28 000k     690 000k

Table 2: Monolingual French data sets for German – French translation task.

As the second data set, we use the Haitian Creole – English data from the WMT 2011 featured translation task. It consists of emergency SMS sent in the wake of the 2010 Haiti earthquake. Originally, Microsoft Research and CMU operated under severe time constraints to build a translation system for this language pair. This limits the ability to empirically verify how much each data set contributes to translation quality, and increases the importance of automated and quick domain adaptation methods.

    Data set             units     words (en)
    SMS (in-domain)      16 500    380 000
    Medical              1 600     10 000
    Newswire             13 500    330 000
    Glossary             35 700    90 000
    Wikipedia            8 500     110 000
    Wikipedia NE         10 500    34 000
    Bible                30 000    920 000
    Haitisurf dict       3 700     4 000
    Krengle dict         1 600     2 600
    Krengle              650       4 200
    Total train          120 000   1 900 000
    Dev                  900       22 000
    Test                 1274      25 000

Table 3: Parallel data sets for Haitian Creole – English translation task.

    Data set             sentences   words
    SMS (in-domain)      16k         380k
    News                 113 000k    2 650 000k

Table 4: Monolingual English data sets for Haitian Creole – English translation task.

Note that both data sets have a relatively high ratio of in-domain to out-of-domain parallel training data (1:20 for DE–FR and 1:5 for HT–EN). Previous research has been performed with ratios of 1:100 (Foster et al., 2010) or 1:400 (Axelrod et al., 2011).

9 http://www.statmt.org/wmt11/baseline.html
Since domain adaptation becomes more important when the ratio of IN to OUT is low, and since such low ratios are also realistic,11 we also include results for which the amount of in-domain parallel data has been restricted to 10% of the available data set.

We used the same development set for language/translation model adaptation and for setting the global model weights with MERT. While it is theoretically possible that MERT will give too high weights to models that are optimized on the same development set, we found no empirical evidence for this in experiments with separate development sets.

4.2 Results

The results are shown in tables 5 and 6. In the DE–FR translation task, results vary between 13.5 and 18.9 BLEU points; in the HT–EN task, between 24.3 and 33.8. Unsurprisingly, an adapted LM performs better than an out-of-domain one, and using all available in-domain parallel data is better than using only part of it. The same is not true for out-of-domain data, which highlights the problem discussed in the introduction. For the DE–FR task, adding 86 million words of out-of-domain parallel data to the 5 million word in-domain data set does not lead to consistent performance gains. We observe a decrease of 0.3 BLEU points with an out-of-domain LM, and an increase of 0.4 BLEU points with an adapted LM. The out-of-domain training data has a larger positive effect if less in-domain data is available, with a gain of 1.4 BLEU points. The results in the HT–EN translation task (table 6) paint a similar picture. An interesting side note is that even tiny amounts of in-domain parallel data can have strong effects on performance. A training set of 1600 emergency SMS (38 000 tokens) yields a comparable performance to an out-of-domain data set of 1.5 million tokens.

As to the domain adaptation experiments, weights optimized through perplexity minimization are significantly better in the majority of cases, and never significantly worse, than uniform weights.12

10 http://www.opensubtitles.org
11 We predict that the availability of parallel data will steadily increase, most data being out-of-domain for any given task.

                                       out-of-domain LM            adapted LM
    System                             full IN TM           full IN TM      small IN TM
                                       BLEU   METEOR        BLEU   METEOR   BLEU   METEOR
    in-domain                          16.8   35.9          17.9   37.0     15.7   33.5
    out-of-domain                      13.5   31.3          14.8   32.3     14.8   32.3
    counts (concatenation)             16.5   35.7          18.3   37.3     17.1   35.4
    binary in/out
      weighted counts                  17.4   36.6          18.7   37.9     17.6   36.2
      linear interpolation (naive)     17.4   36.7          18.8   37.9     17.6   36.1
      linear interpolation (modified)  17.2   36.5          18.9   38.0     17.6   36.2
      alternative paths                17.2   36.5          18.6   37.8     17.4   36.0
    4 models
      weighted counts                  17.3   36.6          18.8   37.8     17.4   36.0
      linear interpolation (naive)     17.1   36.5          18.5   37.7     17.3   35.9
      linear interpolation (modified)  17.2   36.5          18.7   37.9     17.3   36.0
      alternative paths                17.0   36.2          18.3   37.4     16.3   35.1

Table 5: Domain adaptation results DE–FR. Domain: Alpine texts. Full IN TM: using the full in-domain parallel corpus; small IN TM: using 10% of available in-domain parallel data.
However, the difference is smaller for the experiments with an adapted language model than for those with an out-of-domain one, which confirms that the benefits of language model adaptation and translation model adaptation are not fully cumulative. Performance-wise, there seems to be no clear winner between weighted counts and the two alternative implementations of linear interpolation. We can still argue for weighted counts on theoretical grounds. A weighted MLE (equation 3) returns a true probability distribution, whereas a naive implementation of linear interpolation results in a deficient model. Consequently, probabilities are typically lower in the naively interpolated model, which results in higher (worse) perplexities. While the deficiency did not affect MERT or decoding negatively, it might become problematic in other applications, for instance if we want to use an interpolated model as a component in a second perplexity-based combination of models.13

When moving from a binary setting with one in-domain and one out-of-domain translation model (trained on all available out-of-domain data) to 4–10 translation models, we observe a serious performance degradation for alternative paths, while the performance of the perplexity optimization methods does not change significantly. This is positive for perplexity optimization because it demonstrates that it requires less a priori information, and it opens up new research possibilities, i.e. experiments with different clusterings of parallel data. The performance degradation for alternative paths is partially due to optimization problems in MERT, but also due to a higher susceptibility to statistical outliers, as discussed in section 2.3.14

A pessimistic interpretation of the results would point out that performance gains compared to the best baseline system are modest or even inexistent in some settings. However, we want to stress two important points. First, we often do not know a priori whether adding an out-of-domain data set boosts or weakens translation performance. An automatic weighting of data sets reduces the need for trial-and-error experimentation and is worthwhile even if a performance increase is not guaranteed. Second, the potential impact of a weighted combination depends on the translation scenario and the available data sets. Generally, we expect non-uniform weighting to have a bigger impact when the models that are combined are more dissimilar (in terms of fitness for the task), and if the ratio of in-domain to out-of-domain data is low. Conversely, there are situations where we actually expect a simple concatenation to be optimal, e.g. when the data sets have very similar probability distributions.

12 This also applies to linear interpolation with uniform weights, which is not shown in the tables.
13 Specifically, a deficient model would be dispreferred by the perplexity minimization algorithm.
14 We empirically verified this weakness in a synthetic experiment with a randomly split training corpus and identical weights for each path.

                                       out-of-domain LM            adapted LM
    System                             full IN TM           full IN TM      small IN TM
                                       BLEU   METEOR        BLEU   METEOR   BLEU   METEOR
    in-domain                          30.4   30.7          33.4   31.7     29.7   28.6
    out-of-domain                      24.3   28.0          28.9   30.2     28.9   30.2
    counts (concatenation)             30.3   31.2          33.6   32.4     31.3   31.3
    binary in/out
      weighted counts                  31.0   31.6          33.8   32.4     31.5   31.3
      linear interpolation (naive)     30.8   31.4          33.7   32.4     31.9   31.3
      linear interpolation (modified)  30.8   31.5          33.7   32.4     31.7   31.2
      alternative paths                30.8   31.3          33.2   32.4     29.8   30.7
    10 models
      weighted counts                  31.0   31.5          33.5   32.3     31.8   31.5
      linear interpolation (naive)     30.9   31.4          33.8   32.4     31.9   31.3
      linear interpolation (modified)  31.0   31.6          33.8   32.5     32.1   31.5
      alternative paths                25.9   29.2          24.3   29.1     29.8   30.9

Table 6: Domain adaptation results HT–EN. Domain: emergency SMS. Full IN TM: using the full in-domain parallel corpus; small IN TM: using 10% of available in-domain parallel data.
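The deficiency of the naive implementation can be illustrated with a few lines of Python. This is our own toy sketch, not code from the experiments: a naive interpolation assigns probability zero to phrase pairs that are absent from a model, so for source phrases unknown to one of the models the combined conditional "distribution" sums to less than one. Renormalizing the interpolation weights per source phrase over the models that know it, in the spirit of the modified implementation, restores a true distribution.

```python
def naive_interp(models, lams, x, y):
    # Naive linear interpolation: absent phrase pairs contribute probability 0,
    # which makes the combined model deficient for phrases missing from a model.
    return sum(l * m.get(y, {}).get(x, 0.0) for l, m in zip(lams, models))

def normalized_interp(models, lams, x, y):
    # Renormalize the interpolation weights over the models that actually
    # know the source phrase y, so each conditional distribution sums to 1.
    z = sum(l for l, m in zip(lams, models) if y in m)
    if z == 0.0:
        return 0.0
    return sum(l * m[y].get(x, 0.0) for l, m in zip(lams, models) if y in m) / z

# Model 2 has never seen the source phrase 'house'.
m1 = {'house': {'maison': 0.6, 'foyer': 0.4}}
m2 = {'cat': {'chat': 1.0}}
lams = (0.5, 0.5)

naive = [naive_interp([m1, m2], lams, x, 'house') for x in ('maison', 'foyer')]
norm = [normalized_interp([m1, m2], lams, x, 'house') for x in ('maison', 'foyer')]
print(sum(naive), sum(norm))  # deficient vs. properly normalized: 0.5 1.0
```

The deficient column sum is exactly what drives the naively interpolated probabilities down and the measured perplexities up, as described above.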
4.2.1 Individually Optimizing Each TM Feature

It is hard to empirically show how translation model perplexity optimization compares to using monolingual perplexity measures for the purpose of weighting translation models, as e.g. done by Foster and Kuhn (2007) and Koehn et al. (2010). One problem is that there are many different possible configurations for the latter: we can use source side or target side language models, and operate with different vocabularies, smoothing techniques, and n-gram orders.

One of the theoretical considerations that favour measuring perplexity on the translation model rather than using monolingual measures is that we can optimize each translation model feature separately. In the default Moses translation model, the four features are p(s|t), lex(s|t), p(t|s) and lex(t|s).

We empirically test different optimization schemes as follows. We optimize perplexity on each feature independently, obtaining 4 weight vectors. We then compute one model with one weight vector per feature (namely the feature that the vector was optimized on), and four models that use one of the weight vectors for all features. A further model uses a weight vector that is the average of the other four. For linear interpolation, we also include a model whose weights have been optimized through language model perplexity optimization, with a 3-gram language model (modified Kneser-Ney smoothing) trained on the target side of each parallel data set.

                                    perplexity
    weights              1        2        3        4        BLEU
    weighted counts
      uniform            5.12     7.68     4.84     13.67    30.3
      separate           4.68     6.62     4.24     8.57     31.0
      1                  4.68     6.84     4.50     10.86    30.3
      2                  4.78     6.62     4.48     10.54    30.3
      3                  4.86     7.31     4.24     9.15     30.8
      4                  5.33     7.87     4.52     8.57     30.9
      average            4.72     6.71     4.38     9.95     30.4
    linear interpolation (modified)
      uniform            19.89    82.78    4.80     10.78    30.6
      separate           5.45     8.56     4.28     8.85     31.0
      1                  5.45     8.79     4.40     8.89     30.8
      2                  5.71     8.56     4.54     8.91     30.9
      3                  6.46     11.88    4.28     9.07     31.0
      4                  6.12     10.86    4.47     8.85     30.9
      average            5.73     9.72     4.34     8.89     30.9
      LM                 6.01     9.83     4.56     8.96     30.8

Table 7: Contrast between a separate optimization of each feature and applying the weight vector optimized on one feature to the whole model. HT–EN with out-of-domain LM.

Table 7 shows the results. In terms of BLEU score, a separate optimization of each feature is a winner in our experiment in that no other scheme is better, with 8 of the 11 alternative weighting schemes (excluding uniform weights) being significantly worse than a separate optimization. The differences in BLEU score are small, however, since the alternative weighting schemes are generally felicitous in that they yield both a lower perplexity and better BLEU scores than uniform weighting. While our general expectation is that lower perplexities correlate with higher translation performance, this relation is complicated by several facts. Since the interpolated models are deficient (i.e. their probabilities do not sum to 1), perplexities for weighted counts and our implementation of linear interpolation cannot be compared. Also, note that not all features are equally important for decoding: their weights in the log-linear model are set through MERT and vary between optimization runs.

5 Conclusion

This paper contributes to SMT domain adaptation research in several ways. We expand on work by Foster et al. (2010) in establishing translation model perplexity minimization as a robust baseline for a weighted combination of translation models.15 We demonstrate perplexity optimization for weighted counts, which are a natural extension of unadapted MLE training, but are of little prominence in domain adaptation research. We also show that we can separately optimize the four variable features in the Moses translation model through perplexity optimization.

We break with prior domain adaptation research in that we do not rely on a binary clustering of in-domain and out-of-domain training data. We demonstrate that perplexity minimization scales well to a higher number of translation models. This is not only useful for domain adaptation, but for various tasks that profit from mixture modelling. We envision that a weighted combination could be useful to deal with noisy data sets, or applied after a clustering of training data.

15 The source code is available in the Moses repository: http://github.com/moses-smt/mosesdecoder

Acknowledgements

This research was funded by the Swiss National Science Foundation under grant 105215_126999.

References

Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Technical report, Final Report, JHU Summer Workshop.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput., 16:1190–1208, September.

Alexandru Ceauşu, John Tinsley, Jian Zhang, and Andy Way. 2011. Experiments on domain adaptation for patent machine translation in the PLuTO project. In Proceedings of the 15th Conference of the European Association for Machine Translation, Leuven, Belgium.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13:359–393.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181, Portland, Oregon, USA, June. Association for Computational Linguistics.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728–735, Prague, Czech Republic, June. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation.

Andrew Finch and Eiichiro Sumita. 2008. Dynamic model interpolation for statistical machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT '08, pages 208–215, Stroudsburg, PA, USA. Association for Computational Linguistics.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 128–135, Stroudsburg, PA, USA. Association for Computational Linguistics.

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2001. Knowledge sources for word-level translation models. In Lillian Lee and Donna Harman, editors, Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 27–35.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 224–227, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Barry Haddow, Philip Williams, and Hieu Hoang. 2010. More linguistic annotation for statistical machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 115–120, Uppsala, Sweden, July. Association for Computational Linguistics.

Philipp Koehn. 2002. Europarl: A multilingual corpus for evaluation of machine translation.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, pages 79–86, Phuket, Thailand.

Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien, Keh-Jiann Chen, and Lin-Shan Lee. 1997. Chinese language model adaptation based on document classification and multiple domain-specific language models. In George Kokkinakis, Nikos Fakotakis, and Evangelos Dermatas, editors, EUROSPEECH. ISCA.

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 708–717, Stroudsburg, PA, USA. Association for Computational Linguistics.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10, pages 220–224, Stroudsburg, PA, USA. Association for Computational Linguistics.

Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, EMNLP '09, pages 1358–1367, Stroudsburg, PA, USA. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA. Association for Computational Linguistics.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006).

A. Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit.
In Seventh International Conference on Spoken Language Processing, pages 901–904, Denver, CO, USA.

Jörg Tiedemann. 2009. News from OPUS – a collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.

Martin Volk, Noah Bubenhofer, Adrian Althaus, Maya Bangerter, Lenz Furrer, and Beni Ruef. 2010. Challenges in building a multilingual alpine heritage corpus. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Keiji Yasuda, Ruiqiang Zhang, Hirofumi Yamamoto, and Eiichiro Sumita. 2008. Method of selecting training data to build a compact and efficient translation model. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP).

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics.


Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability

Judith Eckle-Kohler‡ and Iryna Gurevych†‡
† Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information
‡ Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Abstract

This paper describes Subcat-LMF, an ISO-LMF compliant lexicon representation format featuring a uniform representation of subcategorization frames (SCFs) for the two languages English and German. Subcat-LMF is able to represent SCFs at a very fine-grained level. We utilized Subcat-LMF to standardize lexicons with large-scale SCF information: the English VerbNet and two German lexicons, i.e., a subset of IMSlex and GermaNet verbs. To evaluate our LMF model, we performed a cross-lingual comparison of SCF coverage and overlap for the standardized versions of the English and German lexicons. The Subcat-LMF DTD, the conversion tools and the standardized versions of VerbNet and the IMSlex subset are publicly available.1

1 http://www.ukp.tu-darmstadt.de/data/uby
2 http://verbs.colorado.edu/semlink/

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 550–560, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

1 Introduction

Computational lexicons providing accurate lexical-syntactic information, such as subcategorization frames (SCFs), are vital for many NLP applications involving parsing and word sense disambiguation. In parsing, SCFs have been successfully used to improve the output of statistical parsers (Klenner (2007), Deoskar (2008), Sigogne et al. (2011)), which is particularly significant in high-precision domain-independent parsing. In word sense disambiguation, SCFs have been identified as important features for verb sense disambiguation (Brown et al., 2011), which is due to the correlation of verb senses and SCFs (Andrew et al., 2004).

SCFs specify syntactic arguments of verbs and other predicate-like lexemes; e.g., the verb say takes two arguments that can be realized, for instance, as noun phrase and that-clause, as in He says that the window is open.

Although a number of freely available, large-scale and accurate SCF lexicons exist, e.g. COMLEX (Grishman et al., 1994) and VerbNet (Kipper et al., 2008) for English, availability and limitations in size and coverage remain an inherent issue. This applies even more to languages other than English.

One particular approach to address this issue is the combination and integration of existing manually built SCF lexicons. Lexicon integration has widely been adopted for increasing the coverage of lexicons regarding lexical-semantic information types, such as semantic roles, selectional restrictions, and word senses (e.g., Shi and Mihalcea (2005), the Semlink project2, Navigli and Ponzetto (2010), Niemann and Gurevych (2011), Meyer and Gurevych (2011)).

Currently, SCFs are represented idiosyncratically in existing SCF lexicons. However, integration of SCFs requires a common, interoperable representation format. Monolingual SCF integration based on a common representation format has already been addressed by King and Crouch (2005) and just recently by Necsulescu et al. (2011) and Padró et al. (2011). However, neither King and Crouch (2005) nor Necsulescu et al. (2011) or Padró et al. (2011) make use of existing standards in order to create a uniform SCF representation for lexicon merging. The definition of an interoperable representation format according to an existing standard, such as the ISO standard Lexical Markup Framework (LMF, ISO 24613:2008, see Francopoulo et al. (2006)), is the prerequisite for re-using this format in different contexts, thus contributing to the standardization and interoperability of language resources.

While LMF models exist that cover the representation of SCFs (see Quochi et al. (2008), Buitelaar et al. (2009)), their suitability for representing SCFs at a large scale remains unclear: neither of these LMF models has been used for standardizing lexicons with a large number of SCFs, such as VerbNet. Furthermore, the question of their applicability to different languages has not been investigated yet, a situation that is complicated by the fact that SCFs are highly language-specific.

The goal of this paper is to address these gaps for the two languages English and German by presenting a uniform LMF representation of SCFs for English and German which is utilized for the standardization of large-scale English and German SCF lexicons. The contributions of this paper are threefold: (1) We present the LMF model Subcat-LMF, an LMF-compliant lexicon representation format featuring a uniform and very fine-grained representation of SCFs for English and German. Subcat-LMF is a subset of Uby-LMF (Eckle-Kohler et al., 2012), the LMF

...fies a core package and a number of extensions for modeling different types of lexicons, including subcategorization lexicons. The development of an LMF-compliant lexicon model requires two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). While the LMF core package models a lexicon in terms of lexical entries, each of which is defined as the pairing of one to many forms and zero to many senses, the LMF extensions provide UML classes for different types of lexicon organization, e.g., covering the synset-based organization of WordNet and the class-based organization of VerbNet. The first step results in a set of UML classes that are associated according to the UML diagrams given in ISO LMF.

In the second step, these UML classes may be enriched by attributes. While neither attributes nor their values are given by the standard, the standard states that both are to be linked to Data Categories (DCs) defined in a Data Category Registry (DCR) such as ISOCat.4 DCs that are not available in ISOCat may be defined and submitted for standardization.
The second step results in model of the large integrated lexical resource Uby a so-called Data Category Selection (DCS). (Gurevych et al., 2012). (2) We convert lexicons DCs specify the linguistic vocabulary used in with large-scale SCF information to Subcat-LMF: an LMF model. Consider as an example the the English VerbNet and two German lexicons, linguistic term direct object that often occurs in i.e., GermaNet (Kunze and Lemnitzer, 2002) and SCFs of verbs taking an accusative NP as argu- a subset of IMSlex3 (Eckle-Kohler, 1999). (3) We ment. In ISOCat, there are two different specifi- perform a comparison of these three lexicons re- cations of this term, one explicitly referring to the garding SCF coverage and SCF overlap, based on capability of becoming the clause subject in pas- the standardized representation. sivization5 , the other not mentioning passivization The remainder of this paper is structured as fol- at all.6 Consequently, the use of a DCR plays a lows: Section 2 gives a detailed description of major role regarding the semantic interoperability Subcat-LMF and section 3 demonstrates its use- of lexicons (Ide and Pustejovsky, 2010). Different fulness for representing and cross-lingually com- resources that share a common definition of their paring large-scale English and German lexicons. linguistic vocabulary are said to be semantically Section 4 provides a discussion including related interoperable. work and section 5 concludes. 2.2 Fleshing out ISO-LMF 2 Subcat-LMF Approach: We started our development of 2.1 ISO-LMF: a meta-model Subcat-LMF with a thorough inspection of large- scale English and German resources providing LMF defines a meta-model of lexical resources, SCFs for verbs, nouns, and adjectives. For covering NLP lexicons and Machine Readable 4 Dictionaries. This meta-model is based on the http://www.isocat.org/, the implementation of the ISO Unified Modeling Language (UML) and speci- 12620 DCR (Broeder et al., 2010). 
For English, our analysis included VerbNet (SCFs in VerbNet also cover SCFs in VALEX, a lexicon automatically extracted from corpora) and FrameNet syntactically annotated example sentences from Ruppenhofer et al. (2010). For German, we inspected GermaNet, the SALSA annotation guidelines (Burchardt et al., 2006) and the IMSlex documentation (Eckle-Kohler, 1999). In addition, the EAGLES synopsis on morphosyntactic phenomena (Calzolari and Monachini, 1996; http://www.ilc.cnr.it/EAGLES96/morphsyn/), as well as the EAGLES recommendations on subcategorization (http://www.ilc.cnr.it/EAGLES96/synlex/), have been used to identify DCs relevant for SCFs.

We specified Subcat-LMF by a DTD yielding an XML serialization of ISO-LMF. Thus, existing lexicons can be standardized, i.e. converted into Subcat-LMF format, based on the DTD (available at http://www.ukp.tu-darmstadt.de/data/uby).

Lexicon structure: Next, we defined the lexicon structure of Subcat-LMF. In addition to the core package, Subcat-LMF primarily makes use of the LMF Syntax and Semantics extension. Figure 1 shows the most important classes of Subcat-LMF, including SynsemCorrespondence, where the linking of syntactic and semantic arguments is encoded. It might be worth noting that both synsets from GermaNet and verb classes from VerbNet can be represented in Subcat-LMF by using the Synset and SubcategorizationFrameSet classes.

Diverging linguistic properties of SCFs in English and German: For verbs (and also for predicate-like nouns and adjectives), SCFs specify the syntactic and morphosyntactic properties of their arguments that have to be present in concrete realizations of these arguments within a sentence. While some properties of syntactic arguments in English and German correspond (both English and German are Germanic languages and hence closely related), there are other properties, mainly morphosyntactic ones, that diverge. By way of examples, we illustrate some of these divergences in the following (we contrast English examples with their German equivalents):

- overt case marking in German: He helps him. vs. Er hilft ihm. (dative)
- specific verb form in verb phrase arguments: He suggested cleaning the house. (ing-form) vs. Er schlug vor, das Haus zu putzen. (to-infinitive)
- morphosyntactic marking of verb phrase arguments in the main clause: He managed to win. (no marking) vs. Er hat es geschafft zu gewinnen. (obligatory es)
- morphosyntactic marking of clausal arguments in the main clause: That depends on who did it. (preposition) vs. Das hängt davon ab, wer es getan hat. (pronominal adverb)

Uniform Data Categories for English and German: Thus, the main challenge in developing Subcat-LMF has been the specification of DCs (attributes and attribute values) in such a way that a uniform specification of SCFs in the two languages English and German can be achieved. The specification of DCs for Subcat-LMF involved fleshing out ISO-LMF, because it is a meta-standard in the sense that it provides only few linguistic terms, i.e. DCs, and these DCs are not linked to any DCR: in the Syntax extension, the standard only provides 7 class names (see Figure 1), complemented by 17 example attributes given in an informative, non-binding Annex F. These are by far not sufficient to represent the fine-grained SCFs available in such large-scale lexicons as VerbNet.

In contrast, the Syntax part of Subcat-LMF comprises 58 DCs that are properly linked to ISOCat DCs; a number of DCs were missing in ISOCat, so we entered them ourselves (the Subcat-LMF DCS is publicly available on the ISOCat website). The majority of the attributes in Subcat-LMF are attached to the SyntacticArgument class. The corresponding DCs can be divided into two main groups: cross-lingually valid DCs for the specification of grammatical functions (e.g. subject, prepositionalComplement) and syntactic categories (e.g. nounPhrase, prepositionalPhrase), see Table 1; and partly language-specific morphosyntactic DCs that further specify the syntactic arguments (e.g. the attributes case and verbForm, and the values toInfinitive, bareInfinitive, ingForm, participle), see Table 2.

Figure 1: Selected classes of Subcat-LMF.

Values of grammaticalFunction, with examples:
  subject: They arrived in time.
  subjectComplement: He becomes a teacher.
  directObject: He saw a rainbow.
  objectComplement: They elected him governor.
  complement: He told him a story.
  prepositionalComplement: It depends on several factors.
  adverbialComplement: They moved far away.

Values of syntacticCategory, with examples:
  nounPhrase: The train stopped.
  reflexive: He drank himself sick.
  expletive: It is raining.
  prepositionalPhrase: It depends on several factors.
  adverbPhrase: They moved far away.
  adjectivePhrase: The light turned red.
  verbPhrase: She tried to exercise.
  declarativeClause: He says he agrees.
  subordinateClause: He believes that it works.

Table 1: Cross-lingually valid (English-German) attributes and values of the SyntacticArgument class.

In the class LexemeProperty, we introduced an attribute syntacticProperty to encode control and raising properties of verbs taking infinitival verb phrase arguments. (Control or raising specify the co-reference between the implicit subject of the infinitival argument and syntactic arguments in the main clause, either the subject (subject control or raising) or the direct object (object control or raising).)

In Subcat-LMF, syntactic arguments can be specified by a selection of appropriate attribute-value pairs.
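To make the attribute-value view concrete, the transitive SCF underlying He saw a rainbow could be serialized roughly as follows. This is a sketch only: the element and attribute names echo the class and DC names above, but the actual Subcat-LMF DTD may arrange them differently.

```python
import xml.etree.ElementTree as ET

# A transitive-verb SCF as attribute-value pairs, using values from Table 1.
# Element and attribute names are illustrative, not the normative DTD.
scf = [
    {"grammaticalFunction": "subject", "syntacticCategory": "nounPhrase"},
    {"grammaticalFunction": "directObject", "syntacticCategory": "nounPhrase"},
]

frame = ET.Element("SubcategorizationFrame")
for arg in scf:
    # The list order encodes the argument order constraint discussed later.
    ET.SubElement(frame, "SyntacticArgument", attrib=arg)

print(ET.tostring(frame, encoding="unicode"))
```

The point of the uniform attribute vocabulary is that the same two-argument structure, with German-specific attributes such as case added where needed, serves both languages.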
While all syntactic arguments are uniformly specified by a grammatical function and a syntactic category, the use of the morphosyntactic attributes depends on the particular type of syntactic argument. Different phrase types are specified by different subsets of morphosyntactic attributes, see Table 2. The following examples illustrate some of these attributes:

- number: the number of a noun phrase argument can be lexically governed by the verb, as in These types of fish mix well together.
- verbForm: the verb form of a clausal complement can be required to be a bare infinitive, as in They demanded that he be there.
- tense: not only the verb form, but also the tense of a verb phrase complement can be lexically governed, e.g., to be a participle in the past tense, as in They had it removed.

Morphosyntactic attributes and values, with the phrase types for which they are appropriate:
  case: nominative, genitive, dative, accusative (NP, PP)
  determiner: possessive, indefinite (NP, PP)
  number: singular, plural (NP)
  verbForm: toInfinitive, bareInfinitive, ingForm (!), participle (VP, C)
  tense: present, past (VP)
  complementizer: thatType, whType, yesNoType (C)
  prepositionType: external ontological type, e.g. locative (PP, VP, C)
  preposition: (string) (!) (PP, VP, C)
  lexeme: (string) (!) (VP, C)

Table 2: Morphosyntactic attributes of SyntacticArgument and the phrase types for which the attributes are appropriate (NP: noun phrase, PP: prepositional phrase, VP: verb phrase, C: clause). Language-specific attributes are marked by (!).

3 Utilizing Subcat-LMF

3.1 Standardizing large-scale lexicons

Lexicon Data: We converted VerbNet (VN) and two German lexicons, i.e., GermaNet (GN) and a subset of IMSlex (ILS), to Subcat-LMF format.

VN is organized in verb classes based on Levin-style syntactic alternations (Levin, 1993): verbs with common SCFs and syntactic alternation behavior that also share common semantic roles are grouped into classes. VN (version 3.1) lists 568 frames that are encoded as phrase structure rules (XML element SYNTAX), specifying phrase types and semantic roles of the arguments, as well as selectional, syntactic and morphosyntactic restrictions on the arguments. Additionally, a descriptive specification of each frame is given (XML element DESCRIPTION). The verb learn, for instance, has the following VN frame:

DESCRIPTION (primary): NP V NP
SYNTAX: Agent V Topic

We extracted both the descriptive specifications and the phrase structure rules, using the API available for VN (http://verbs.colorado.edu/verb-index/inspector/; the VN API was used with the view options wrexyzsq for verb-frame pairs and ctuqw for verb class information), resulting in 682 unique VN frames.

GN provides detailed SCFs for verbs, in contrast to the Princeton WordNet: GN version 6.0 from April 2011, accessed via the GN API (GermaNet Java API 2.0.2), lists 202 frames. GN SCFs are represented as a dot-separated sequence of letter pairs. Each letter pair specifies a syntactic argument: the first letter encodes the grammatical function and the second letter the syntactic category (see http://www.sfs.uni-tuebingen.de/GermaNet/verb_frames.shtml). For instance, the following shows the GN code for transitive verbs: NN.AN.

ILS has been developed independently from GN and the lexicon data were published in Eckle-Kohler (1999). ILS is represented in delimiter-separated values format and contains 784 verbs in total. Of these 784 verbs, 740 are also present in GN, and 44 are listed in ILS only. Although ILS contains only verbs that take clausal arguments and verb phrase arguments, a total of 220 SCFs is present in ILS, also including SCFs without clausal and verb phrase arguments. ILS lists for each verb lemma a number of SCFs, thus specifying coarse-grained verb senses given by a lemma-SCF pair. (In addition, ILS provides a semantic class label for each verb; however, these semantic labels are attached at lemma level, i.e. they need to be disambiguated.) The SCFs are represented as parenthesized lists. For instance, the ILS SCF for transitive verbs is: (subj(NPnom),obj(NPacc)).

Automatic Conversion: We implemented Java tools for the conversion of VN, GN and ILS to Subcat-LMF. These tools convert the source lexicons based on a manual mapping of lexicon units and terms (e.g., VN verb class, GN synset) to Subcat-LMF. For the majority of SCFs, this mapping is defined on argument level. Lexical data is extracted from the source lexicons by using the native APIs (VN, GN) and additional Perl scripts.

Numbers of Subcat-LMF instances in the converted lexicons compared to numbers of corresponding units in the original lexicons:
  LMF-VN: 3962 LexicalEntry, 31891 Sense, 284 Subcat.Frame, 617 SemanticPred.
  orig. VN: 3962 verbs, 31891 groups of verb, frame, sem. pred., 568 frames, 572 sem. pred.
  LMF-GN: 8626 LexicalEntry, 12981 Sense, 147 Subcat.Frame, 84 SemanticPred.
  orig. GN: 8626 verbs, 12981 verb-synset pairs, 202 GN frames, no sem. pred.
  LMF-ILS: 784 LexicalEntry, 3675 Sense, 217 Subcat.Frame, 10 SemanticPred.
  orig. ILS: 784 verbs, 3675 verb-frame pairs, 220 SCFs, no sem. pred.

Table 3: Evaluation of the automatic conversion.

Evaluation of Automatic Conversion: Table 3 shows the mapping of the major source lexicon units (such as verb-synset pairs) to Subcat-LMF and lists the corresponding numbers of units. For VN, groups of VN verb, frame and semantic predicate have been mapped to LMF senses. VN classes have been mapped to SubcategorizationFrameSet. Thus, the original VN sense, a pairing of verb lemma and class, can be recovered by grouping LMF senses that share the same verb class.
There is a significant difference between the original VN frames and their Subcat-LMF representation: the semantic information present in VN frames (semantic roles and selectional restrictions) is mapped to semantic arguments in Subcat-LMF, i.e. the mapping splits VN frames into a purely syntactic and a purely semantic part. Consequently, the number of unique SCFs in the Subcat-LMF version of VN is much smaller than the number of frames in the original VN. The conversion tool creates for each sense (specifying a unique verb, frame, semantic predicate combination) a SynSemCorrespondence. On the other hand, the Subcat-LMF version of VN contains more semantic predicates than VN. This is due to selectional restrictions for semantic arguments, which are specified within semantic predicates in Subcat-LMF, in contrast to VN.

For GN, verb-synset pairs (i.e., GN lexical units) have been mapped to LMF senses. A few GN frame codes also specify semantic role information, e.g. manner, location. These were mapped to the semantics part of Subcat-LMF, resulting in 84 semantic predicates that encode the semantic role information in their semantic arguments.

ILS specifies similar semantic role information as GN; these few cases were mapped in the same way as for GN. Therefore, the LMF version of ILS, too, specifies fewer SCFs, but additional semantic predicates not present in the original.

Discussion: Grammatical functions of arguments are specified distinctly in the three lexicons. While both GN and ILS specify grammatical functions, they are not explicitly encoded in VN. They have to be inferred on the basis of the phrase structure rules given in the SYNTAX element. We assigned subject to the noun phrase which directly precedes the verb, and directObject to the noun phrase directly following the verb and having the semantic role Patient. The semantic role information has to be considered at this point, because not all noun phrase arguments are able to become the subject in a corresponding passive sentence. An example is the verb learn, which has the VN frame NP(Agent) V NP(Topic); here, the Topic-NP is not able to become the subject of a corresponding passive sentence. We assigned the grammatical function complement to all other phrase types.

Argument order constraints in SCFs are represented in LMF by a list implementation of syntactic arguments. Most SCFs from VN require the subject to be the first argument, reflecting the basic word order in English sentences. VN lists one exception to this rule for the verb appear, illustrated by the example On the horizon appears a ship.

Argument optionality in VN is expressed at the semantic level and at the syntactic level in parallel: it is explicitly specified at the semantic level and implicitly specified at the syntactic level. At the syntactic level, two SCF versions exist in VN, one with the optional argument, the other without it. In addition, the semantic predicate attached to these SCFs marks optional (semantic) arguments by a ?-sign. GN, on the other hand, expresses argument optionality at the level of syntactic arguments, i.e., within the frame code. In Subcat-LMF, optionality is represented at the syntactic level by an (optional) attribute optional for syntactic arguments, thus reflecting the explicit representation used in GN and the implicit representation present in VN. (As a consequence, all semantic arguments specified in the Subcat-LMF version of VN have a corresponding syntactic argument.)

GN frames specify syntactic alternations of argument realizations, e.g. adverbial complements that can alternatively be realized as adverb phrase, prepositional phrase or noun phrase. We encoded this generalization in Subcat-LMF by introducing attribute values for these aggregated syntactic categories.

3.2 Cross-lingual comparison of lexicons

Lexicons that are standardized according to Subcat-LMF can be quantitatively compared regarding SCFs. For two lexicons, such a comparison gives answers to questions such as: how many SCFs are present in both lexicons (overlapping SCFs), and how many SCFs are only listed in one of the lexicons (complementary SCFs)? Answers to these questions are important, for instance, for assessing the potential gain in SCF coverage that can be achieved by lexicon merging.

In order to validate our claim that Subcat-LMF yields a cross-lingually uniform SCF representation, we contrast the monolingual comparison of GN and ILS with the cross-lingual comparisons of VN and GN and of VN and ILS. Assuming that our claim is valid, the cross-lingual comparisons can be expected to yield similar results regarding overlapping and complementary SCFs as the monolingual comparison.

Comparison: The comparison of SCFs from two lexicons that are in Subcat-LMF format can be performed on the basis of the uniform DCs. As Subcat-LMF is implemented in XML, we compared string representations of SCFs. SCFs from VN, GN and ILS were converted to strings by concatenating attribute values of syntactic arguments and lexemeProperty. We created string representations of different granularities: First, fine-grained, language-specific string SCFs have been generated by concatenating all attribute values apart from the attribute optional, which is specific to GN (resulting in a considerably smaller number of SCFs in GN). Second, fine-grained, but cross-lingual string SCFs were considered; these omit the attributes case, lexeme, preposition and the attribute value ingForm. Finally, coarse-grained cross-lingual string SCFs were compared. These only contain the values of the attributes syntacticCategory, complementizer and verbForm (without the attribute value ingForm). For instance, a coarse cross-lingual string SCF for transitive verbs is nounPhrasenounPhrase.

Table 4 lists the results of our quantitative comparison. For each lexicon pair, the number of overlapping SCFs and the numbers of complementary SCFs are given. Regarding VN and the German lexicons, the overlap at the language-specific level is (close to) zero, which is due to the specification of case, e.g. dative, for German arguments. However, the numbers for cross-lingual SCFs clearly validate our claim: the numbers of overlapping SCFs for the German lexicon pair and for the two German-English pairs are comparable, ranging from 15 to 23 for the fine-grained SCFs and from 22 to 24 for the coarse SCFs.

Based on the sets of cross-lingually overlapping SCFs, we estimated how many high-frequency verbs actually have SCFs that are in the cross-lingual SCF overlap of an English-German lexicon pair. For this, we used the lemma frequency lists of the English and German WaCky corpora (Baroni et al., 2009) and extracted verbs from VN, GN and ILS that are on 100 top-ranked positions of these lists, starting from rank 100. (Since the WaCky frequency lists do not contain POS information, our lists of extracted verbs contain some noise, which we tolerated because we aimed at an approximate estimate.) Table 5 shows the results for the cross-lingual SCF overlap between VN – GN and between VN – ILS. While only around 40% of the high-frequency verbs have an SCF in the fine-grained SCF overlap, more than 70% are in the coarse overlap between VN – GN, and even more than 80% in the coarse overlap between VN – ILS.
Comparison of lexicon pairs regarding SCF overlap and complementary SCFs:
  GN vs. ILS: language-specific (fine-grained): 72 GN, 21 both, 196 ILS; cross-lingual (fine-grained): 61 GN, 23 both, 69 ILS; cross-lingual (coarse): 40 GN, 24 both, 23 ILS
  VN vs. GN: language-specific (fine-grained): 284 VN, 0 both, 93 GN; cross-lingual (fine-grained): 96 VN, 15 both, 69 GN; cross-lingual (coarse): 29 VN, 24 both, 40 GN
  VN vs. ILS: language-specific (fine-grained): 283 VN, 1 both, 216 ILS; cross-lingual (fine-grained): 93 VN, 18 both, 74 ILS; cross-lingual (coarse): 31 VN, 22 both, 25 ILS

Table 4: Comparison of lexicon pairs regarding SCF overlap and complementary SCFs.

  VN-GN overlap, fine-grained (15 SCFs): 43% VN verbs, 41% GN verbs
  VN-GN overlap, coarse (24 SCFs): 85% VN verbs, 71% GN verbs
  VN-ILS overlap, fine-grained (18 SCFs): 41% VN verbs, 43% ILS verbs
  VN-ILS overlap, coarse (22 SCFs): 84% VN verbs, 87% ILS verbs

Table 5: Percentage of 100 high-frequency verbs from VN, GN, ILS with an SCF in the cross-lingual SCF overlap (fine-grained vs. coarse) between VN – GN and VN – ILS.

Analysis of results: The small numbers of overlapping cross-lingual SCFs (relative to the total number of SCFs), at both levels of granularity, indicate that the three lexicons each encode substantially different lexical-syntactic properties of verbs. This can at least partly be explained by the historic development of these lexicons in different contexts, e.g., Levin's work on verb classes (VN) and Lexical Functional Grammar (ILS), as well as their use for different purposes and applications.

Another reason for the small SCF overlap is the comparison of strings derived from the XML format. A more sophisticated representation format, notably one that provides semantic typing and type hierarchies, e.g., OWL, could be employed to define hierarchies of grammatical functions (e.g. direct object would be a sub-type of complement) and other attributes. These would presumably support the identification of further overlapping SCFs.

During a subsequent qualitative analysis of the overlapping and complementary SCFs, we collected some enlightening background information. Overlapping SCFs in the cross-lingual comparison (both fine-grained and coarse) include prominent SCFs corresponding to transitive and intransitive verbs, as well as verbs with that-clause and verbs with to-infinitive.

GN and ILS are highly complementary regarding SCFs: for instance, while many SCFs with adverbial arguments are unique to GN, only ILS provides a fine-grained specification of prepositional complements including the preposition, as well as the case the preposition requires (in German, prepositions govern the case of their noun phrase). VN, too, contains a large number of SCFs with a detailed specification of possible prepositions, partly specified as language-independent preposition types.

A large number of complementary SCFs in VN vs. GN and GN vs. ILS are due to a diverging linguistic analysis of extraposed subject clauses with an es (it) in the main clause (e.g., It annoys him that the train is late.). In GN, such clauses are not specified as subject, whereas in VN and ILS they are.

Regarding VN and ILS, only VN lists subject control for verbs, while both VN and ILS list object control and subject raising. GN, on the other hand, does not specify control or raising at all.

4 Discussion

4.1 Previous Work

Merging SCFs: Previous work on merging SCF lexicons has only been performed in a monolingual setting and lacks the use of standards. King and Crouch (2005) describe the process of unifying several large-scale verb lexicons for English, including VN and WordNet. They perform a conversion of these lexicons into a uniform, but non-standard representation format, resulting in a lexicon which is integrated at the level of verb senses, SCFs and lexical semantics. Thus, the result of their work is not applicable to cross-lingual settings.

Necsulescu et al. (2011) and Padró et al. (2011) report on approaches to automatic merging of two Spanish SCF lexicons. As these lexicons lack sense information apart from the SCFs, their merging approach only works on a very coarse-grained sense level given by lemma-SCF pairs.
given two lexi- wide range of SCFs and other lexical-syntactic in- cons, they map one lexicon to the format of the formation types in English and German. other. Moreover, their approach requires a signif- As our cross-lingual comparison of lexicons icant overlap of SCFs and verbs in any two lex- has revealed many complementary SCFs in VN, icons to be merged. The authors state that it is GN and ILS, mono- and cross-lingual alignments presently unclear, how much overlap is required of these lexicons at sense level would lead to a to obtain sufficiently precise merging results. major increase in SCF coverage. Moreover, the Standardizing SCFs: Much previous work on cross-lingually uniform representation of SCFs standardizing NLP lexicons in LMF has focused can be exploited for an additional alignment of on WordNet-like resources. Soria et al. (2009) de- the lexicons at the level of SCF arguments. Such scribe WordNet-LMF, an LMF model for repre- a fine-grained alignment of SCFs can be used, for senting wordnets which has been used in the KY- instance, to project VN semantic roles to GN, thus OTO project.21 Later, WordNet-LMF has been yielding a German resource for semantic role la- adapted by Henrich and Hinrichs (2010) to Ger- beling (see Gildea and Jurafsky (2002), Swier and maNet and by Toral et al. (2010) to the Ital- Stevenson (2005)). ian WordNet. WordNet-LMF does not provide Subcat-LMF could be used for standardizing the possibility to represent subcategorization at further English and German lexicons. The auto- all. The adaption of WordNet-LMF to GN (Hen- matic conversion of lexicons to Subcat-LMF re- rich and Hinrichs, 2010) allows SCFs to be re- quires the manual definition of a mapping, at least spresented as string values. However, this ex- for syntactic arguments. Furthermore, the auto- tension is not sufficient, because it provides no matic merging approach by Padr´o et al. 
(2011) means to model the syntax-semantics interface, could be tested for English: given our standard- which specifies correspondences between syntac- ized version of VN, other English SCF lexicons tic and semantic arguments of verbs and other could be merged fully automatically with the predicates. Quochi et al. (2008) report on an LMF Subcat-LMF version of VN. model that covers the syntax-semantics mapping just mentioned; it has been used for standardizing 5 Conclusion an Italian domain-specific lexicon. Buitelaar et al. Subcat-LMF contributes to fostering the standard- (2009) describe LexInfo, an LMF-model that is ization of language resources and their interop- used for lexicalizing ontologies. LexInfo is imple- erability at the lexical-syntactic level across En- mented in OWL and specifies a linking of syntac- glish and German. The Subcat-LMF DTD in- tic and semantic arguments. For SCFs and argu- cluding links to ISOCat, all conversion tools, ments, a type hierarchy is defined. In their paper, and the standardized versions of VN and Buitelaar et al. (2009) show only few SCFs and ILS23 are publicly available at http://www.ukp.tu- do not indicate what kinds of SCFs can be repre- darmstadt.de/data/uby. sented with LexInfo in principle. On the LexInfo website22 , the current LexInfo version 2.0 can be Acknowledgments viewed, but no further documentation is given. We inspected LexInfo version 2.0 and found that This work has been supported by the Volks- it specifies a large number of fine-grained SCFs. wagen Foundation as part of the Lichtenberg- However, LexInfo has not been evaluated so far Professorship Program under grant No. I/82806. on large-scale SCF lexicons, such as VerbNet. We thank the anonymous reviewers for their valu- able comments. We also thank Dr. Jungi Kim 4.2 Subcat-LMF and Christian M. 
Meyer for their contributions to Subcat-LMF enables the uniform representation this paper, and Yevgen Chebotar and Zijad Mak- of fine-grained SCFs across the two languages suti for their contributions to the conversion soft- English and German. By mapping large-scale ware. 21 23 http://www.kyoto-project.eu/ The converted version of GN can not be made available 22 See http://lexinfo.net/ due to licensing. 558 References and Evaluation (LREC 2012), page (to appear), Is- tanbul, Turkey. Galen Andrew, Trond Grenager, and Christopher D. Judith Eckle-Kohler. 1999. Linguistisches Wissen zur Manning. 2004. Verb sense and subcategoriza- automatischen Lexikon-Akquisition aus deutschen tion: using joint inference to improve performance Textcorpora. Logos-Verlag, Berlin, Germany. on complementary tasks. In Proceedings of the PhDThesis. 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 150–157, Gil Francopoulo, Nuria Bel, Monte George, Nico- Barcelona, Spain. letta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework Marco Baroni, Silvia Bernardini, Adriano Ferraresi, (LMF). In Proceedings of the Fifth International and Eros Zanchetta. 2009. The WaCky wide web: Conference on Language Resources and Evaluation a collection of very large linguistically processed (LREC), pages 233–236, Genoa, Italy. web-crawled corpora. Language Resources and Evaluation, 43(3):209–226. Daniel Gildea and Daniel Jurafsky. 2002. Automatic Daan Broeder, Marc Kemps-Snijders, Dieter Van Uyt- labeling of semantic roles. Computational Linguis- vanck, Menzo Windhouwer, Peter Withers, Peter tics, 28:245–288, September. Wittenburg, and Claus Zinn. 2010. A Data Cat- Ralph Grishman, Catherine Macleod, and Adam Mey- egory Registry- and Component-based Metadata ers. 1994. Comlex Syntax: Building a Computa- Framework. In Proceedings of the Seventh Inter- tional Lexicon. 
In Proceedings of the 15th Inter- national Conference on Language Resources and national Conference on Computational Linguistics Evaluation (LREC), pages 43–47, Valletta, Malta. (COLING), pages 268–272, Kyoto, Japan. Susan Windisch Brown, Dmitriy Dligach, and Martha Iryna Gurevych, Judith Eckle-Kohler, Silvana Hart- Palmer. 2011. VerbNet Class Assignment as a mann, Michael Matuschek, Christian M. Meyer, WSD Task. In Proceedings of the 9th International and Christian Wirth. 2012. Uby - A Large-Scale Conference on Computational Semantics (IWCS), Unified Lexical-Semantic Resource. In Proceed- pages 85–94, Oxford, UK. ings of the 13th Conference of the European Chap- Paul Buitelaar, Philipp Cimiano, Peter Haase, and ter of the Association for Computational Linguistics Michael Sintek. 2009. Towards Linguistically (EACL 2012), page (to appear), Avignon, France. Grounded Ontologies. In Lora Aroyo, Paolo Verena Henrich and Erhard Hinrichs. 2010. Standard- Traverso, Fabio Ciravegna, Philipp Cimiano, Tom izing wordnets in the ISO standard LMF: Wordnet- Heath, Eero Hyv¨onen, Riichiro Mizoguchi, Eyal LMF for GermaNet. In Proceedings of the 23rd In- Oren, Marta Sabou, and Elena Simperl, editors, The ternational Conference on Computational Linguis- Semantic Web: Research and Applications, pages tics (COLING), pages 456–464, Beijing, China. 111–125, Berlin Heidelberg. Springer-Verlag. Nancy Ide and James Pustejovsky. 2010. What Does Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Interoperability Mean, anyway? Toward an Op- Kowalski, Sebastian Pad´o, and Manfred Pinkal. erational Definition of Interoperability. In Pro- 2006. The SALSA Corpus: a German Corpus Re- ceedings of the Second International Conference source for Lexical Semantics. In Proceedings of on Global Interoperability for Language Resources, the Fifth International Conference on Language Re- Hong Kong. sources and Evaluation (LREC), pages 969–974, Tracy Holloway King and Dick Crouch. 2005. Uni- Genoa, Italy. 
fying lexical resources. In Proceedings of the In- Nicoletta Calzolari and Monica Monachini. 1996. terdisciplinary Workshop on the Identification and EAGLES Proposal for Morphosyntactic Stan- Representation of Verb Features and Verb Classes, dards: in view of a ready-to-use package. In Saarbruecken, Germany. G. Perissinotto, editor, Research in Humanities Karin Kipper, Anna Korhonen, Neville Ryant, and Computing, volume 5, pages 48–64. Oxford Uni- Martha Palmer. 2008. A Large-scale Classification versity Press, Oxford, UK. of English Verbs. Language Resources and Evalu- Tejaswini Deoskar. 2008. Re-estimation of lexi- ation, 42:21–40. cal parameters for treebank PCFGs. In Proceed- Manfred Klenner. 2007. Shallow dependency la- ings of the 22nd International Conference on Com- beling. In Proceedings of the 45th Annual Meet- putational Linguistics (COLING), pages 193–200, ing of the Association for Computational Linguis- Manchester, United Kingdom. tics (ACL), Companion Volume Proceedings of the Judith Eckle-Kohler, Iryna Gurevych, Silvana Hart- Demo and Poster Sessions, pages 201–204, Prague, mann, Michael Matuschek, and Christian M. Czech Republic. Meyer. 2012. UBY-LMF – A Uniform Format Claudia Kunze and Lothar Lemnitzer. 2002. Ger- for Standardizing Heterogeneous Lexical-Semantic maNet — representation, visualization, applica- Resources in ISO-LMF. In Proceedings of the 8th tion. In Proceedings of the Third International International Conference on Language Resources Conference on Language Resources and Evaluation 559 (LREC), pages 1485–1491, Las Palmas, Canary Is- Claudia Soria, Monica Monachini, and Piek Vossen. lands, Spain. 2009. Wordnet-LMF: fleshing out a standardized Beth Levin. 1993. English Verb Classes and Alterna- format for Wordnet interoperability. In Proceedings tions. The University of Chicago Press, Chicago, of the 2009 International Workshop on Intercultural USA. Collaboration, pages 139–146, Palo Alto, Califor- Christian M. 
Meyer and Iryna Gurevych. 2011. What nia, USA. Psycholinguists Know About Chemistry: Align- Robert S. Swier and Suzanne Stevenson. 2005. Ex- ing Wiktionary and WordNet for Increased Domain ploiting a verb lexicon in automatic semantic role Coverage. In Proceedings of the 5th International labelling. In Proceedings of the conference on Hu- Joint Conference on Natural Language Processing man Language Technology and Empirical Methods (IJCNLP), pages 883–892, Chiang Mai, Thailand. in Natural Language Processing (HLT’05), pages 883–890, Vancouver, British Columbia, Canada. Roberto Navigli and Simone Paolo Ponzetto. 2010. Antonio Toral, Stefania Bracale, Monica Monachini, BabelNet: Building a very large multilingual se- and Claudia Soria. 2010. Rejuvenating the Italian mantic network. In Proceedings of the 48th Annual WordNet: upgrading, standarising, extending. In Meeting of the Association for Computational Lin- Proceedings of the 5th Global WordNet Conference, guistics (ACL), pages 216–225, Uppsala, Sweden. Bombay, India. Silvia Necsulescu, N´uria Bel, Munsta Padr´o, Montser- rat Marimon, and Eva Revilla. 2011. Towards the Automatic Merging of Language Resources. In Proceedings of the 2011 ESSLI Workshop on Lexi- cal Resources (WoLeR 2011), Ljubljana, Slovenia. Elisabeth Niemann and Iryna Gurevych. 2011. The People’s Web meets Linguistic Knowledge: Auto- matic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205– 214, Oxford, UK. Muntsa Padr´o, N´uria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 296–301, Hissar, Bulgaria. Valeria Quochi, Monica Monachini, Riccardo Del Gratta, and Nicoletta Calzolari. 2008. A lexicon for biology and bioinformatics: the bootstrep expe- rience. 
In Proceedings of the Sixth International Conference on Language Resources and Evalua- tion (LREC’08), pages 2285–2292, Marrakech, Mo- rocco, may. Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Schef- fczyk. 2010. FrameNet II: Extended Theory and Practice, September. Lei Shi and Rada Mihalcea. 2005. Putting pieces to- gether: Combining FrameNet, VerbNet and Word- Net for robust semantic parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CI- CLing), pages 100–111, Mexico City, Mexico. Anthony Sigogne, Matthieu Constant, and Eric ´ La- porte. 2011. Integration of data from a syntac- tic lexicon into generative and discriminative proba- bilistic parsers. In Proceedings of the International Conference on Recent Advances in Natural Lan- guage Processing, pages 363–370, Hissar, Bulgaria. 560 The effect of domain and text type on text prediction quality Suzan Verberne, Antal van den Bosch, Helmer Strik, Lou Boves Centre for Language Studies Radboud University Nijmegen

[email protected]

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 561-569, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes that are needed to type a text. In this paper, we address the influence of text type and domain differences on text prediction quality. By training and testing our text prediction algorithm on four different text types (Wikipedia, Twitter, transcriptions of conversational speech and FAQ) with equal corpus sizes, we found that there is a clear effect of text type on text prediction quality: training and testing on the same text type gave percentages of saved keystrokes between 27 and 34%; training on a different text type caused the scores to drop to percentages between 16 and 28%.

In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found that both text type and topic domain play a role in text prediction quality. The best performing training corpus was a set of medical pages from Wikipedia. The second-best result was obtained by leave-one-out experiments on the test questions, even though this training corpus was much smaller (2,672 words) than the other corpora (1.5 Million words).

1 Introduction

Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes that are needed to type a text, thereby saving time. Text prediction algorithms have been implemented for mobile devices, office software (Open Office Writer), search engines (Google query completion), and in special-needs software for writers who have difficulties typing (Garay-Vitoria and Abascal, 2006). In most applications, the scope of the prediction is the completion of the current word; hence the often-used term 'word completion'.

The most basic method for word completion is checking after each typed character whether the prefix typed since the last whitespace is unique according to a lexicon. If it is, the algorithm suggests to complete the prefix with the lexicon entry. The algorithm may also suggest to complete a prefix even before the word's uniqueness point is reached, using statistical information on the previous context. Moreover, it has been shown that significantly better prediction results can be obtained if not only the prefix of the current word is included as previous context, but also previous words (Fazly and Hirst, 2003) or characters (Van den Bosch and Bogers, 2008).

In the current paper, we follow up on this work by addressing the influence of text type and domain differences on text prediction quality. Brief messages on mobile devices (such as text messages, Twitter and Facebook updates) are of a different style and lexicon than documents typed in office software (Westman and Freund, 2010). In addition, the topic domain of the text also influences its content. These differences may cause an algorithm trained on one text type or domain to perform poorly on another.

The questions that we aim to answer in this paper are (1) "What is the effect of text type differences on the quality of a text prediction algorithm?" and (2) "What is the best choice of training data if domain- and text type-specific data is sparse?". To answer these questions, we perform three experiments:

1. A series of within-text type experiments on four different types of Dutch text: Wikipedia articles, Twitter data, transcriptions of conversational speech and web pages of Frequently Asked Questions (FAQ).
2. A series of across-text type experiments in which we train and test on different text types;

3. A case study using texts from a specific domain and text type: questions about neurological issues. Training data for this combination of language (Dutch), text type (FAQ) and domain (medical/neurological) is sparse. Therefore, we search for the type of training data that gives the best prediction results for this corpus. We compare the following training corpora:

   • The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and FAQ, 1.5 Million words per corpus;
   • A 1.5 Million words training corpus that is of the same domain as the target data: medical pages from Wikipedia;
   • The 359 questions from the neuro-QA data themselves, evaluated in a leave-one-out setting (359 times training on 358 questions and evaluating on the remaining question).

The prospective application of the third series of experiments is the development of a text prediction algorithm in an online care platform: an online community for patients seeking information about their illness. In this specific case the target group is patients with language disabilities due to neurological disorders.

The remainder of this paper is organized as follows: In Section 2 we give a brief overview of text prediction methods discussed in the literature. In Section 3 we present our approach to text prediction. Sections 4 and 5 describe the experiments that we carried out and the results we obtained. We phrase our conclusions in Section 6.

2 Text prediction methods

Text prediction methods have been developed for several different purposes. The older algorithms were built as communicative devices for people with disabilities, such as motor and speech impairments. More recently, text prediction is developed for writing with reduced keyboards, specifically for writing (composing messages) on mobile devices (Garay-Vitoria and Abascal, 2006).

All modern methods share the general idea that previous context (which we will call the 'buffer') can be used to predict the next block of characters (the 'predictive unit'). If the user gets correct suggestions for continuation of the text, then the number of keystrokes needed to type the text is reduced. The unit to be predicted by a text prediction algorithm can be anything ranging from a single character (which actually does not save any keystrokes) to multiple words. Single words are the most widely used as prediction units because they are recognizable at a low cognitive load for the user, and word prediction gives good results in terms of keystroke savings (Garay-Vitoria and Abascal, 2006).

There is some variation among methods in the size and type of buffer used. Most methods use character n-grams as buffer, because they are powerful and can be implemented independently of the target language (Carlberger, 1997). In many algorithms the buffer is cleared at the start of each new word (making the buffer never larger than the length of the current word). In the paper by Van den Bosch and Bogers (2008), two extensions to the basic prefix model are compared. They found that an algorithm that uses the previous n characters as buffer, crossing word borders without clearing the buffer, performs better than both a prefix character model and an algorithm that includes the full previous word as feature. In addition to using the previously typed characters and/or words in the buffer, word characteristics such as frequency and recency could also be taken into account (Garay-Vitoria and Abascal, 2006).

Possible evaluation measures for text prediction are the proportion of words that are correctly predicted, the percentage of keystrokes that could maximally be saved (if the user would always make the correct decision), and the time saved by the use of the algorithm (Garay-Vitoria and Abascal, 2006). The performance that can be obtained by text prediction algorithms depends on the language they are evaluated on. Lower results are obtained for higher-inflected languages such as German than for low-inflected languages such as English (Matiasek et al., 2002). In their overview of text prediction systems, Garay-Vitoria and Abascal (2006) report performance scores ranging from 29% to 56% of keystrokes saved.

An important factor that is known to influence the quality of text prediction systems is training set size (Lesher et al., 1999; Van den Bosch, 2011). Van den Bosch (2011) shows log-linear learning curves for word prediction (a constant improvement each time the training corpus size is doubled) when the training set size is increased incrementally from 10^2 to 3·10^7 words.

3 Our approach to text prediction

We implement a text prediction algorithm for Dutch, which is a productive compounding language like German, but has a somewhat simpler inflectional system. We do not focus on the effect of training set size, but on the effect of text type and topic domain differences.

Our approach to text prediction is largely inspired by Van den Bosch and Bogers (2008). We experiment with two different buffer types that are based on character n-grams:

   • 'Prefix of current word' contains all characters of only the word currently keyed in, where the buffer shifts by one character position with every new character.
   • 'Buffer15' also includes any other characters keyed in that belong to previously keyed-in words.

Modeling character history beyond the current word can naturally be done with a buffer model in which the buffer shifts by one position per character, while a typical left-aligned prefix model (that never shifts and fixes letters to their positional feature) would not be able to do this.

In the buffer, all characters from the text are kept, including whitespace and punctuation. The predictive unit is one token (word or punctuation symbol). In both the buffer and the prediction label, any capitalization is kept. At each point in the typing process, our algorithm gives one suggestion: the word that is the most likely continuation of the current buffer.

We save the training data as a classification data set: each character in the buffer fills a feature slot and the word that is to be predicted is the classification label. Figures 1 and 2 give examples of each of the buffer types Prefix and Buffer15 that we created for the text fragment "tot een niveau" in the context "stelselmatig bij elke verkiezing tot een niveau van" ('structurally with each election to a level of'). We use the implementation of the IGTree decision tree algorithm in TiMBL (Daelemans et al., 1997) to train our models.
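The instance construction just described (and shown in Figure 2 for the Buffer15 type) can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions: the left-padding with underscores and the exact set of instance positions are our guesses, not the authors' TiMBL preprocessing code.

```python
import re

def buffer15_instances(text, n=15):
    """Generate (buffer, label) classification instances.

    For every keystroke position inside a token, the feature buffer is
    the n previously typed characters, crossing word borders, with
    whitespace written as '_'; the class label is the token being
    completed at that position.
    """
    instances = []
    for m in re.finditer(r"\S+", text):          # one token at a time
        for pos in range(m.start(), m.end()):    # each keystroke in it
            buf = text[max(0, pos - n):pos].replace(" ", "_")
            instances.append((buf.rjust(n, "_"), m.group()))
    return instances

# First instance generated for the word 'tot' in the example fragment:
for buf, label in buffer15_instances("elke verkiezing tot een niveau"):
    if label == "tot":
        print(buf, "->", label)  # lke_verkiezing_ -> tot
        break
```

Training IGTree on such instances then amounts to feeding it fixed-width character windows; a prefix-only model would instead clear the buffer at every word boundary.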
[Figure 1: Example of buffer type 'Prefix' for the text fragment "(elke verkiezing) tot een niveau". Underscores represent whitespaces.]

[Figure 2: Example of buffer type 'Buffer15' for the text fragment "(elke verkiezing) tot een niveau". Underscores represent whitespaces.]

3.1 Evaluation

We evaluate our algorithms on corpus data. This means that we have to make assumptions about user behaviour. We assume that the user confirms a suggested word as soon as it is suggested correctly, not typing any additional characters before confirming. We evaluate our text prediction algorithms in terms of the percentage of keystrokes saved, K:

   K = (Σ_{i=0}^{n} F_i − Σ_{i=0}^{n} W_i) / (Σ_{i=0}^{n} F_i) × 100    (1)

in which n is the number of words in the test set, W_i is the number of keystrokes that have been typed before the word i is correctly suggested, and F_i is the number of keystrokes that would be needed to type the complete word i. For example, our algorithm correctly predicts the word niveau after the context "i n g _ t o t _ e e n _ n i v" in the test set. Assuming that the user confirms the word niveau at this point, three keystrokes were needed for the prefix niv. So, W_i = 3 and F_i = 6. The number of keystrokes needed for whitespace and punctuation is unchanged: these have to be typed anyway, independently of the support by a text prediction algorithm.
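Equation (1) translates directly into code. The sketch below is an illustration of the metric, not the authors' evaluation script:

```python
def keystrokes_saved(words):
    """Percentage of saved keystrokes K, following Equation (1).

    `words` is a list of (W_i, F_i) pairs: W_i keystrokes were typed
    before word i was suggested correctly; F_i keystrokes would have
    been needed to type word i in full.
    """
    total_full = sum(f for _, f in words)
    total_typed = sum(w for w, _ in words)
    return (total_full - total_typed) / total_full * 100

# The 'niveau' example above: confirmed after the prefix 'niv',
# so W_i = 3 and F_i = 6.
print(keystrokes_saved([(3, 6)]))  # 50.0
```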
4 Text type experiments

In this section, we describe the first and second series of experiments. The case study on questions from the neurological domain is described in Section 5.

4.1 Data

In the text type experiments, we evaluate our text prediction algorithm on four different types of Dutch text: Wikipedia, Twitter data, transcriptions of conversational speech, and web pages of Frequently Asked Questions (FAQ). The Wikipedia corpus that we use is part of the Lassy corpus (Van Noord, 2009); we obtained a version from the summer of 2010.[1] The Twitter data are collected continuously and automatically filtered for language by Erik Tjong Kim Sang (Tjong Kim Sang, 2011). We used the tweets from all users that posted at least 19 tweets (excluding retweets) during one day in June 2011. This is a set of 1 Million Twitter messages from 30,000 different users. The transcriptions of conversational speech are from the Spoken Dutch Corpus (CGN) (Oostdijk, 2000); for our experiments, we only use the category 'spontaneous speech'. We obtained the FAQ data by downloading the first 1,000 pages that Google returns for the query 'faq' with the language restriction Dutch. After cleaning the pages from HTML and other coding, the resulting corpus contained approximately 1.7 Million words of questions and answers.

[1] http://www.let.rug.nl/vannoord/trees/Treebank/Machine/NLWIKI20100826/COMPACT/

4.2 Within-text type experiments

For each of the four text types, we compare the buffer types 'Prefix' and 'Buffer15'. In each experiment, we use 1.5 Million words from the corpus to train the algorithm and 100,000 words to test it. The results are in Table 1.

4.3 Across-text type experiments

We investigate the importance of text type differences for text prediction with a series of experiments in which we train and test our algorithm on texts of different text types. We keep the sizes of the train and test sets the same: 1.5 Million words and 100,000 words respectively. The results are in Table 2.

4.4 Discussion of the results

Table 1 shows that for all text types, the buffer of 15 characters that crosses word borders gives better results than the prefix of the current word only. We get a relative improvement of 35% (for FAQ) to 62% (for Speech) of Buffer15 compared to Prefix-only.

Table 2 shows that text type differences have an influence on text prediction quality: all across-text type experiments lead to lower results than the within-text type experiments. From the results in Table 2, we can deduce that of the four text types, speech and Twitter language resemble each other more than they resemble the other two, and Wikipedia and FAQ resemble each other more. Twitter and Wikipedia data are the least similar: training on Wikipedia data makes the text prediction score for Twitter data drop from 29.2 to 16.5%.[2]

[2] Note that the results are not symmetric. For example, training on Wikipedia and testing on Twitter gives a different result from training on Twitter and testing on Wikipedia. This is due to the size and domain of the vocabularies in both data sets and the richness of the contexts (in order for the algorithm to predict a word, it has to have seen it in the train set). If the test set has a larger vocabulary than the train set, a lower proportion of words can be predicted than when it is the other way around.
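The relative improvements quoted in Section 4.4 can be recomputed from the rounded scores in Table 1; this is a sanity check rather than a new result, and the rounded values give 61% rather than the quoted 62% for Speech, presumably because the paper used unrounded scores:

```python
# (Prefix, Buffer15) percentages of saved keystrokes from Table 1.
table1 = {
    "Wikipedia": (22.2, 30.5),
    "Twitter":   (21.3, 29.2),
    "Speech":    (20.7, 33.4),
    "FAQ":       (20.2, 27.2),
}
for text_type, (prefix, buffer15) in table1.items():
    relative = (buffer15 - prefix) / prefix * 100
    print(f"{text_type}: +{relative:.0f}% relative improvement")
```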
Table 1: Results from the within-text type experiments in terms of percentages of saved keystrokes. Prefix means: 'use the previous characters of the current word as features'. Buffer15 means: 'use a buffer of the previous 15 characters as features'.

               Prefix    Buffer15
   Wikipedia   22.2%     30.5%
   Twitter     21.3%     29.2%
   Speech      20.7%     33.4%
   FAQ         20.2%     27.2%

Table 2: Results from the across-text type experiments in terms of percentages of saved keystrokes, using the best-scoring configuration from the within-text type experiments: a buffer of 15 characters.

   Trained on   Tested on Wikipedia   Tested on Twitter   Tested on Speech   Tested on FAQ
   Wikipedia    30.5%                 16.5%               22.3%              24.9%
   Twitter      17.9%                 29.2%               27.9%              20.7%
   Speech       19.7%                 22.5%               33.4%              21.0%
   FAQ          22.6%                 18.2%               22.9%              27.2%

5 Case study: questions about neurological issues

Online care platforms aim to bring together patients and experts. Through this medium, patients can find information about their illness, and get in contact with fellow-sufferers. Patients who suffer from neurological damage may have communicative disabilities because their speaking and writing skills are impaired. For these patients, existing online care platforms are often not easily accessible. Aphasia, for example, hampers the exchange of information because the patient has problems with word finding.

In the project 'Communicatie en revalidatie DigiPoli' (ComPoli), language and speech technologies are implemented in the infrastructure of an existing online care platform in order to facilitate communication for patients suffering from neurological damage. Part of the online care platform is a list of frequently asked questions about neurological diseases with answers. A user can browse through the questions using a chat-by-click interface (Geuze et al., 2008). Besides reading the listed questions and answers, the user has the option to submit a question that is not yet included in the list. The newly submitted questions are sent to an expert who answers them and adds both question and answer to the chat-by-click database. In typing the question to be submitted, the user will be supported by a text prediction application.

The aim of this section is to find the best training corpus for newly formulated questions in the neurological domain. We realize that questions formulated by users of a web interface are different from questions formulated by experts for the purpose of a FAQ list. Therefore, we plan to gather real user data once we have a first version of the user interface running online. For developing the text prediction algorithm that is behind the initial version of the application, we aim to find the best training corpus using the questions from the chat-by-click data as training set.
ing the text prediction algorithm that is behind the In the project ‘Communicatie en revalidatie initial version of the application, we aim to find DigiPoli’ (ComPoli), language and speech tech- the best training corpus using the questions from nologies are implemented in the infrastructure of the chat-by-click data as training set. an existing online care platform in order to fa- cilitate communication for patients suffering from 5.1 Data neurological damage. Part of the online care plat- The chat-by-click data set on neurological issues form is a list of frequently asked questions about consists of 639 questions with corresponding an- neurological diseases with answers. A user can swers. A small sample of the data (translated to browse through the questions using a chat-by-click English) is shown in Table 3. In order to create the interface (Geuze et al., 2008). Besides reading the test data for our experiments, we removed dupli- listed questions and answers, the user has the op- cate questions from the chat-by-click data, leaving tion to submit a question that is not yet included in a set of 359 questions.3 training on Wikipedia, testing on Twitter gives a different re- In the previous sections, we used corpora of sult from training on Twitter, testing on Wikipedia. This is 100,000 words as test collections and we calcu- due to the size and domain of the vocabularies in both data sets and the richness of the contexts (in order for the algo- lated the percentage of saved keystrokes over the rithm to predict a word, it has to have seen it in the train set). 3 If the test set has a larger vocabulary than the train set, a lower Some questions and answers are repeated several times proportion of words can be predicted than when it is the other in the chat-by-click data because they are located at different way around. places in the chat-by-click hierarchy. 565 Table 3: A sample of the neuro-QA data, translated to English. question 0 505 Can (P)LS be cured? 
answer 0 505 Unfortunately, a real cure is not possible. However, things can be done to combat the effects of the diseases, mainly relieving symptoms such as stiffness and spasticity. The phisical therapist and reha- bilitation specialist can play a major role in symptom relief. Moreover, there are medications that can reduce spasticity. question 0 508 How is (P)LS diagnosed? answer 0 508 The diagnosis PLS is difficult to establish, especially because the symptoms strongly resemble HSP symptoms (Strumpell’s disease). Apart from blood and muscle research, several neurological examina- tions will be carried out. Table 4: Results for the neuro-QA questions only in terms of percentages of saved keystrokes, using different training sets. The text prediction configuration used in all settings is Buffer15. The test samples are 359 questions with an average length of 7.5 words. The percentages of saved keystrokes are means over the 359 questions. Training corpus # words Mean % of saved keystrokes in OOV-rate neuro-QA questions (stdev) Twitter 1.5 Million 13.3% (12.5) 28.5% Speech 1.5 Million 14.1% (13.2) 26.6% Wikipedia 1.5 Million 16.1% (13.1) 19.4% FAQ 1.5 Million 19.4% (15.6) 20.0% Medical Wikipedia 1.5 Million 28.1% (16.5) 7.0% Neuro-QA questions (leave-one-out) 2,672 26.5% (19.9) 17.8% complete test corpus. In the reality of our case evaluating on the remaining questions). study however, users will type only brief frag- ments of text: the length of the question they want In order to create the ‘medical Wikipedia’ cor- to submit. This means that there is potentially a pus, we consulted the category structure of the large deviation in the effectiveness of the text pre- Wikipedia corpus. The Wikipedia category ‘Ge- diction algorithm per user, depending on the con- neeskunde’ (Medicine) contains 69,898 pages and tent of the small text they are typing. 
Therefore, in the deeper nodes of the hierarchy we see many we decided to evaluate our training corpora sepa- non-medical pages, such as trappist beers (or- rately on each of the 359 unique questions, so that dered under beer, booze, alcohol, Psychoactive we can report both mean and standard deviation drug, drug, and then medicine). If we remove all of the text prediction scores on small (realistically pages that are more than five levels under the ‘Ge- sized) samples. The average number of words per neeskunde’ category root, 21,071 pages are left, question is 7.5; the total size of the neuro-QA cor- which contain fairly over the 1.5 Million words pus is 2,672 words. that we need. We used the first 1.5 Million words of the corpus in our experiments. 5.2 Experiments The text prediction results for the different cor- We aim to find the training set that gives the best pora are in Table 4. For each corpus, the out-of- text prediction result for the neuro-QA questions. vocabulary rate is given: the percentage of words We compare the following training corpora: in the Neuro-QA questions that do not occur in the corpus.4 • The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and 5.3 Discussion of the results FAQ, 1.5 Million words per corpus. We measured the statistical significance of the • A 1.5 Million words training corpus that is mean differences between all text prediction of the same topic domain as the target data: scores using a Wilcoxon Signed Rank test on Wikipedia articles from the medical domain; paired results for the 359 questions. We found that • The 359 questions from the neuro-QA data 4 The OOV-rate for the Neuro-QA corpus itself is the av- themselves, evaluated in a leave-one-out set- erage of the OOV-rate of each leave-one-out experiment: the ting (359 times training on 358 questions and proportion of words that only occur in one question. 
We found that the difference between the Twitter and Speech corpora on the task is not significant (P = 0.18). The difference between Neuro-QA and Medical Wikipedia is significant with P = 0.02; all other differences are significant with P < 0.01.

The Medical Wikipedia corpus and the leave-one-out experiments on the Neuro-QA data give better text prediction scores than the other corpora. The Medical Wikipedia even scores slightly better than the Neuro-QA data itself. Twitter and Speech are the least-suited training corpora for the Neuro-QA questions, and FAQ data gives somewhat better results than a general Wikipedia corpus.

These results suggest that both text type and topic domain play a role in text prediction quality, but the high scores for the Medical Wikipedia corpus show that topic domain is even more important than text type [5]. The column 'OOV-rate' shows that this is probably due to the high coverage of terms in the Neuro-QA data by the Medical Wikipedia corpus.

[5] We should note here that we did not control for domain differences between the four different text types. They are intended to be 'general domain', but Wikipedia articles will naturally be of different topics than conversational speech.

Table 4 also shows that the standard deviation among the 359 samples is relatively large. For some questions, 0% of the keystrokes are saved, while for others, scores of over 80% are obtained (by the Neuro-QA and Medical Wikipedia training corpora). We further analyzed the differences between the training sets by plotting the Empirical Cumulative Distribution Function (ECDF) for each experiment. An ECDF shows the development of text prediction scores (shown on the X-axis) by walking through the test set in 359 steps (shown on the Y-axis).

The ECDFs for our training corpora are in Figure 3.

Figure 3: Empirical CDFs for text prediction scores on Neuro-QA data for the six training corpora (Twitter, Speech, Wikipedia, FAQ, Neuro-QA leave-one-out, Medical Wikipedia); X-axis: text prediction scores, Y-axis: cumulative percent of test corpus. Note that the curves that are at the bottom-right side represent the better-performing settings.

Note that the curves that are at the bottom-right side of Figure 3 represent the better-performing settings (they get to a higher maximum after having seen a smaller portion of the samples). From Figure 3, it is again clear that the Neuro-QA and Medical Wikipedia corpora outperform the other training corpora, and that of the other four, FAQ is the best-performing corpus. Figure 3 also shows a large difference in the sizes of the starting percentiles: the proportion of samples with a text prediction score of 0% ranges from less than 10% for the Medical Wikipedia up to more than 30% for Speech.

Figure 4: Histogram of text prediction scores for the Neuro-QA questions trained on Medical Wikipedia (X-axis: percentage of keystrokes saved, Y-axis: frequency). Each bin represents 36 questions.

Figure 5: Histogram of text prediction scores for leave-one-out experiments on Neuro-QA questions (X-axis: percentage of keystrokes saved, Y-axis: frequency). Each bin represents 36 questions.
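An ECDF of the kind plotted in Figure 3 can be computed directly from the 359 per-question scores; a minimal sketch (function and variable names are ours):

```python
def ecdf(scores):
    """Return (x, y) pairs: a score value and the cumulative fraction
    of samples with a score less than or equal to it."""
    xs = sorted(scores)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# five invented per-question scores (percentage of keystrokes saved)
points = ecdf([0.0, 12.5, 28.1, 40.0, 80.0])
print(points[0])   # (0.0, 0.2): 20% of samples have a score of 0
print(points[-1])  # (80.0, 1.0)
```

A corpus whose curve stays low until high score values (bottom-right in the plot) has more of its probability mass at high savings, which is exactly the reading of Figure 3 given above.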
We inspected the questions that get a text prediction score of 0%. We see many medical terms in these questions, and many of the utterances are not even questions, but multi-word terms representing topical headers in the chat-by-click data. Seven samples get a zero score in the output of all six training corpora, e.g.:

• glycogenose III.
• potassium-aggrevated myotonias.

26 samples get a zero score in the output of all training corpora except for Medical Wikipedia and Neuro-QA itself. These are mainly short headings with domain-specific terms such as:

• idiopatische neuralgische amyotrofie.
• Markesbery-Griggs distale myopathie.
• oculopharyngeale spierdystrofie.

Interestingly, the ECDFs show that the Medical Wikipedia and Neuro-QA corpora cross at around percentile 70 (around the point of 40% saved keystrokes). This indicates that although the means of the two result samples are close to each other, the distribution of the scores for the individual questions is different. The histograms of both distributions (Figures 4 and 5) confirm this: the algorithm trained on the Medical Wikipedia corpus leads to a larger number of samples with scores around the mean, while the leave-one-out experiments lead to a larger number of samples with low prediction scores and a larger number of samples with high prediction scores. This is also reflected by the higher standard deviation for Neuro-QA than for Medical Wikipedia.

Since both the leave-one-out training on the Neuro-QA questions and the Medical Wikipedia corpus led to good results but behave differently for different portions of the test data, we also evaluated a combination of both corpora on our test set: we created training corpora consisting of the Medical Wikipedia corpus, complemented by 90% of the Neuro-QA questions, testing on the remaining 10% of the Neuro-QA questions. This led to a mean percentage of saved keystrokes of 28.6%, not significantly higher than that of the Medical Wikipedia corpus alone.

6 Conclusions

In Section 1, we asked two questions: (1) "What is the effect of text type differences on the quality of a text prediction algorithm?" and (2) "What is the best choice of training data if domain- and text-type-specific data is sparse?"

By training and testing our text prediction algorithm on four different text types (Wikipedia, Twitter, transcriptions of conversational speech and FAQ) with equal corpus sizes, we found that there is a clear effect of text type on text prediction quality: training and testing on the same text type gave percentages of saved keystrokes between 27 and 34%; training on a different text type caused the scores to drop to percentages between 16 and 28%.

In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found significant differences between the text prediction scores obtained with the six training corpora: the Twitter and Speech corpora were the least suited, followed by the Wikipedia and FAQ corpora. The highest scores were obtained by training the algorithm on the medical pages from Wikipedia, immediately followed by the leave-one-out experiments on the 359 neurological questions. The large differences in the lexical coverage of the medical domain played a central role in the scores for the different training corpora.

Because we obtained good results with both the Medical Wikipedia corpus and the neuro-QA questions themselves, we opted for a combination of both data types as training corpus in the initial version of the online text prediction application. Currently, a demonstration version of the application is running for ComPoli users. We hope to collect questions from these users to re-train our algorithm with more representative examples.

Acknowledgments

This work is part of the research programme 'Communicatie en revalidatie digiPoli' (ComPoli [6]), which is funded by ZonMW, the Netherlands organisation for health research and development.

References

J. Carlberger. 1997. Design and Implementation of a Probabilistic Word Prediction Program. Master thesis, Royal Institute of Technology (KTH), Sweden.

N. Garay-Vitoria and J. Abascal. 2006. Text prediction systems: a survey. Universal Access in the Information Society, 4(3):188–203.

J. Geuze, P. Desain, and J. Ringelberg. 2008. Rephrase: chat-by-click: a fundamental new mode of human communication over the internet. In CHI'08 Extended Abstracts on Human Factors in Computing Systems, pages 3345–3350. ACM.

G.W. Lesher, B.J. Moulton, D.J. Higginbotham, et al. 1999. Effects of ngram order and training text size on word prediction. In Proceedings of the RESNA '99 Annual Conference, pages 52–54.

Johannes Matiasek, Marco Baroni, and Harald Trost. 2002. FASTY - A Multi-lingual Approach to Text Prediction. In Klaus Miesenberger, Joachim Klaus, and Wolfgang Zagler, editors, Computers Helping People with Special Needs, volume 2398 of Lecture Notes in Computer Science, pages 165–176. Springer Berlin / Heidelberg.

N. Oostdijk. 2000. The spoken Dutch corpus: overview and first evaluation. In Proceedings of LREC-2000, Athens, volume 2, pages 887–894.

Erik Tjong Kim Sang. 2011. Het gebruik van Twitter voor Taalkundig Onderzoek [The use of Twitter for linguistic research]. In TABU: Bulletin voor Taalwetenschap, volume 39, pages 62–72. In Dutch.

A. Van den Bosch and T. Bogers. 2008. Efficient context-sensitive word completion for mobile devices. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, pages 465–470. ACM.

A. Van den Bosch. 2011. Effects of context and recency in scaled word completion. Computational Linguistics in the Netherlands Journal, 1:79–94.

G. Van Noord. 2009. Huge parsed corpora in LASSY. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT7).

S. Westman and L. Freund. 2010. Information Interaction in 140 Characters or Less: Genres on Twitter. In Proceedings of the Third Symposium on Information Interaction in Context (IIiX), pages 323–328. ACM.
W. Daelemans, A. Van den Bosch, and T. Weijters. 1997. IGTree: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11(1):407–423.

A. Fazly and G. Hirst. 2003. Testing the efficacy of part-of-speech information in word completion. In Proceedings of the 2003 EACL Workshop on Language Modeling for Text Entry Methods, pages 9–16.

[6] http://lands.let.ru.nl/~strik/research/ComPoli/


The Impact of Spelling Errors on Patent Search

Benno Stein and Dennis Hoppe and Tim Gollub
Bauhaus-Universität Weimar
99421 Weimar, Germany
<first name>.<last name>@uni-weimar.de

Abstract

The search in patent databases is a risky business compared to the search in other domains. A single document that is relevant but overlooked during a patent search can turn into an expensive proposition. While recent research engages in specialized models and algorithms to improve the effectiveness of patent retrieval, we bring another aspect into focus: the detection and exploitation of patent inconsistencies. In particular, we analyze spelling errors in the assignee field of patents granted by the United States Patent & Trademark Office. We introduce technology in order to improve retrieval effectiveness despite the presence of typographical ambiguities.
In this regard, we (1) quantify spelling errors in terms of edit distance and phonological dissimilarity and (2) render error detection as a learning problem that combines word dissimilarities with patent meta-features. For the task of finding all patents of a company, our approach improves recall from 96.7% (when using a state-of-the-art patent search engine) to 99.5%, while precision is compromised by only 3.7%.

1 Introduction

Patent search forms the heart of most retrieval tasks in the intellectual property domain; cf. Table 1, which provides an overview of various user groups along with their typical (•) and related (◦) tasks. The due diligence task, for example, is concerned with legal issues that arise while investigating another company. Part of an investigation is a patent portfolio comparison between one or more competitors (Lupu et al., 2011). Within all tasks recall is preferred over precision, a fact which distinguishes patent search from general web search. This retrieval constraint has produced a variety of sophisticated approaches tailored to the patent domain: citation analysis (Magdy and Jones, 2010), the learning of section-specific retrieval models (Lopez and Romary, 2010), and automated query generation (Xue and Croft, 2009). Each approach improves retrieval performance, but what keeps them from attaining maximum effectiveness in terms of recall are the inconsistencies found in patents: incomplete citation sets, incorrectly assigned classification codes, and, not least, spelling errors.

Our paper deals with spelling errors in an obligatory and important field of each patent, namely, the patent assignee name. Bibliographic fields are widely used among professional patent searchers in order to constrain keyword-based search sessions (Joho et al., 2010). The assignee name is particularly helpful for patentability searches and portfolio analyses since it determines the company holding the patent. Patent experts address these search tasks by formulating queries containing the company name in question, in the hope of finding all patents owned by that company. A formal and more precise description of this search task is as follows: given a query q which specifies a company, and a set D of patents, determine the set Dq ⊂ D comprised of all patents held by the respective company.

For this purpose, all assignee names in the patents in D should be analyzed. Let A denote the set of all assignee names in D, and let a ∼ q denote the fact that an assignee name a ∈ A refers to company q. Then in the portfolio search task, all patents filed under a are relevant.
The retrieval of Dq can thus be rendered as a query expansion task, where q is expanded by the disjunction of assignee names Aq with Aq = {a ∈ A | a ∼ q}.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 570–579, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Table 1: User groups and patent-search-related retrieval tasks in the patent domain (Hunt et al., 2007). User groups: Analyst, Attorney, Manager, Inventor, Investor, Researcher. Patent search tasks, with typical (•) and related (◦) assignments per user group: Patentability (• ◦ • ◦), State of the art (◦ •), Infringement (•), Opposition (• •), Due diligence (• •), Portfolio (• ◦ • •).

While the trivial expansion of q by the entire set A ensures maximum recall, it entails an unacceptable precision; the expansion of q by the empty set yields a reasonable baseline. The latter approach is implemented in patent search engines such as PatBase [1] or FreePatentsOnline [2], which return all patents where the company name q occurs as a substring of the assignee name a. This baseline is simple but reasonable; due to trademark law, a company name q must be a unique identifier (i.e. a key), and an assignee name a that contains q can be considered as relevant.
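The baseline (empty-set) and trivial expansion strategies just described can be sketched as simple set constructions; the assignee names below are invented for illustration:

```python
def baseline(q, A):
    """Empty-set expansion: keep assignee names containing q as a substring."""
    return {a for a in A if q in a}

def trivial(q, A):
    """Expansion by the entire set A: maximum recall, near-zero precision."""
    return set(A)

A = {"Whirlpool Corporation", "Whirpool Corporation",
     "Whirlpool Inc", "Emulex Corporation"}

# the misspelled 'Whirpool Corporation' is missed by the baseline
print(sorted(baseline("Whirlpool", A)))  # → ['Whirlpool Corporation', 'Whirlpool Inc']
```

The misspelled variant that the baseline misses is exactly the gap the paper's learned expansion set A*q is designed to close.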
It should be noted in this regard that |q| < |a| holds for most elements in Aq, since the assignee names often contain company suffixes such as "Ltd" or "Inc".

Our hypothesis is that due to misspelled assignee names a substantial fraction of relevant patents cannot be found by the baseline approach. In this regard, the types of spelling errors in assignee names given in Table 2 should be considered.

Table 2: Types of spelling errors with increasing problem complexity according to Stein and Curatolo (2006). The first row refers to lexical errors, whereas the last two rows refer to phonological errors. For each type, an example is given, where a misspelled company name is followed by the correctly spelled variant.

Spelling error type                | Example
Permutations or dropped letters    | Whirpool Corporation → Whirlpool Corporation
Misremembering spelling details    | Whetherford International → Weatherford International
Spelling out the pronunciation     | Emulecks Corporation → Emulex Corporation

In order to raise the recall for portfolio search without significantly impairing precision, an approach more sophisticated than the standard retrieval approach, which is the expansion of q by the empty set, is needed. Such an approach must strive for an expansion of q by a subset of Aq, whereby this subset should be as large as possible.

1.1 Contributions

The paper provides a new solution to the problem outlined. This solution employs machine learning on orthographic features, as well as on patent meta-features, to reliably detect spelling errors. It consists of two steps: (1) the computation of A+q, the set of assignee names that are in a certain edit distance neighborhood of q; and (2) the filtering of A+q, yielding the set A*q, which contains those assignee names from A+q that are classified as misspellings of q. The power of our approach can be seen from Table 3, which also shows a key result of our research: a retrieval system that exploits our classifier will miss only 0.5% of the relevant patents, while retrieval precision is compromised by only 3.7%.

Table 3: Mean average Precision, Recall, and F-Measure (β = 2) for different expansion sets for q in a portfolio search task, which is conducted on our test corpus (cf. Section 3).

Expansion set for q        | Precision | Recall | F2
∅ (baseline)               | 0.993     | 0.967  | 0.968
A*q (machine learning)     | 0.956     | 0.995  | 0.980
A (trivial)                | 0.001     | 1.0    | 0.005
A+q (edit distance)        | 0.274     | 1.0    | 0.672

[1] www.patbase.com
[2] www.freepatentsonline.com

Another contribution relates to a new, manually-labeled corpus comprising spelling errors in the assignee field of patents (cf. Section 3). In this regard, we consider the over 2 million patents granted by the USPTO between 2001 and 2010. Last, we analyze indications of deliberately inserted spelling errors (cf. Section 4).

1.2 Causes for Inconsistencies in Patents

We identify the following six factors for inconsistencies in the bibliographic fields of patents, in particular for assignee names: (1) Misspellings are introduced due to the lack of knowledge, the lack of attention, and due to spelling disabilities. Intellevate Inc. (2006) reports that 98% of a sample of patents taken from the USPTO database contain errors, most of which are spelling errors. (2) Spelling errors are only removed by the USPTO upon request (U.S. Patent & Trademark Office, 2010).
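The F-Measure with β = 2 reported in Table 3 weights recall four times as heavily as precision. A small sketch of the formula (Table 3 reports macro-averaged values over many queries, so its rows need not follow pointwise from their P and R entries):

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# pointwise F2 for P = 0.274, R = 1.0 (cf. the edit-distance row of Table 3)
print(round(f_beta(0.274, 1.0), 3))  # → 0.654
```

With β = 2, a strategy with perfect recall and modest precision still scores respectably, which matches the recall-over-precision preference of patent search stated in the introduction.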
(3) Spelling variations of inventor names are permitted by the USPTO. The Manual of Patent Examining Procedure (MPEP) states in paragraph 605.04(b) that "if the applicant's full name is 'John Paul Doe,' either 'John P. Doe' or 'J. Paul Doe' is acceptable." Thus, it is valid to introduce many different variations: with and without initials, with and without a middle name, or with and without suffixes. This convention applies to assignee names, too. (4) Companies often have branches in different countries, where each branch has its own company suffix, e.g., "Limited" (United States), "GmbH" (Germany), or "Kabushiki Kaisha" (Japan). Moreover, the usage of punctuation varies along company suffix abbreviations: "L.L.C." in contrast to "LLC", for example. (5) Indexing errors emerge from OCR processing of patent applications, because similar-looking letters such as "e" versus "c" or "l" versus "I" are likely to be misinterpreted. (6) With the advent of electronic patent application filing, the number of patent reexamination steps was reduced. As a consequence, the chance of undetected spelling errors increases (Adams, 2010). All of the mentioned factors add to a highly inconsistent USPTO corpus.

2 Related Work

Information within a corpus can only be retrieved effectively if the data is both accurate and unique (Müller and Freytag, 2003). In order to yield data that is accurate and unique, approaches to data cleansing can be utilized to identify and remove inconsistencies. Müller and Freytag (2003) classify inconsistencies, where duplicates of entities in a corpus are part of a semantic anomaly. These duplicates exist in a database if two or more different tuples refer to the same entity. With respect to the bibliographic fields of patents, the assignee names "Howlett-Packard" and "Hewett-Packard" are distinct but refer to the same company. These kinds of near-duplicates impede the identification of duplicates (Naumann and Herschel, 2010).

Near-duplicate Detection. The problem of identifying near-duplicates is also known as record linkage, or name matching; it is the subject of active research (Elmagarmid et al., 2007). With respect to text documents, slightly modified passages in these documents can be identified using fingerprints (Potthast and Stein, 2008). On the other hand, for data fields which contain natural language, such as the assignee name field, string similarity metrics (Cohen et al., 2003) as well as spelling correction technology are exploited (Damerau, 1964; Monge and Elkan, 1997). String similarity metrics compute a numeric value to capture the similarity of two strings. Spelling correction algorithms, by contrast, capture the likelihood of a given word being a misspelling of another word. In our analysis, the similarity metric SoftTfIdf is applied, which performs best in name matching tasks (Cohen et al., 2003), as well as the complete range of spelling correction algorithms shown in Figure 1: Soundex, which relies on similarity hashing (Knuth, 1997), the Levenshtein distance, which gives the minimum number of edits needed to transform a word into another word (Levenshtein, 1966), and SmartSpell, a phonetic production approach that computes the likelihood of a misspelling (Stein and Curatolo, 2006). In order to combine the strengths of multiple metrics within a near-duplicate detection task, several authors resort to machine learning (Bilenko and Mooney, 2002; Cohen et al., 2003). Christen (2006) concludes that it is important to exploit all kinds of knowledge about the type of data in question, and that inconsistencies are domain-specific. Hence, an effective near-duplicate detection approach should employ domain-specific heuristics and algorithms (Müller and Freytag, 2003). Following this argumentation, we augment various word similarity assessments with patent-specific meta-features.
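The Levenshtein distance named above, the minimum number of single-character insertions, deletions, and substitutions turning one word into another, has a standard dynamic-programming implementation; a compact sketch:

```python
def levenshtein(s, t):
    """Minimum number of single-character edits transforming s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Whirpool", "Whirlpool"))  # → 1 (one dropped letter)
```

A dropped letter, the first error type of Table 2, costs exactly one edit, which is why edit-distance neighborhoods are a natural candidate set for misspelled assignee names.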
query expansion sets are not suitable for portfolio a set of additional companies (near-duplicates), search. Our approach, on the other hand, excels which can be considered alongside the company in both precision and recall. name in question. These suggestions are solely Query Spelling Correction Queries which are retrieved based on a trailing wildcard query. Each submitted to standard web search engines differ additional company name can then be marked in- from queries which are posed to patent search en- dividually by a user to expand the original query. gines with respect to both length and language In case the entire set of suggestions is consid- diversity. Hence, research in the field of web ered, this strategy conforms to the expansion of search is concerned with suggesting reasonable a query by the empty set, which equals a rea- alternatives to misspelled queries rather than cor- sonable baseline approach. This query expansion recting single words (Li et al., 2011). Since stan- strategy, however, has the following drawbacks: dard spelling correction dictionaries (e.g. ASpell) (1) The strategy captures only inconsistencies that are not able to capture the rich language used in succeed the given company name in the origi- web queries, large-scale knowledge sources such nal query. Thus, near-duplicates which contain as Wikipedia (Li et al., 2011), query logs (Chen spelling errors in the company name itself are not et al., 2007), and large n-gram corpora (Brants et found. Even if PatBase would support left trailing al., 2007) are employed. It should be noted that wildcards, then only the full combination of wild- the set of correctly written assignee names is un- card expressions would cover all possible cases of known for the USPTO patent corpus. misspellings. (2) Given an acronym of a company Moreover, spelling errors are modeled on the such as IBM, it is infeasible to expand the ab- basis of language models (Li et al., 2011). 
Okuno breviation to “International Business Machines” (2011) proposes a generative model to encounter without considering domain knowledge. spelling errors, where the original query is ex- panded based on alternatives produced by a small Query Expansion Methods for Patent Search edit distance to the original query. This strategy To date, various studies have investigated query correlates to the trivial query expansion set (cf. expansion techniques in the patent domain that Section 1). Unlike using a small edit distance, we focus on prior-art search and invalidity search allow a reasonable high edit distance to maximize (Magdy and Jones, 2011). Since we are dealing the recall. with queries that comprise only a company name, existing methods cannot be applied. Instead, the Trademark Search The trademark search is near-duplicate task in question is more related to a about identifying registered trademarks which are text reuse detection task discussed by Hagen and similar to a new trademark application. Sim- Stein (2011); given a document, passages which ilarities between trademarks are assessed based also appear identical or slightly modified in other on figurative and verbal criteria. In the former documents, have to be retrieved by using standard case, the focus is on image-based retrieval tech- keyword-based search engines. Their approach is niques. Trademarks are considered verbally simi- guided by the user-over-ranking hypothesis intro- lar for a variety of reasons, such as pronunciation, duced by Stein and Hagen (2011). It states that spelling, and conceptual closeness, e.g., swapping “the best retrieval performance can be achieved letters or using numbers for words. 
The verbal with queries returning about as many results as similarity of trademarks, on the other hand, can can be considered at user site.” If we make use be determined by using techniques comparable of their terminology, then we can distinguish the to near-duplicate detection: phonological parsing, 573 fuzzy search, and edit distance computation (Fall with a misspelled company name has a low and Giraud-Carrier, 2005). frequency. 2. IPC Overlap. The IPC codes of a patent 3 Detection of Spelling Errors specify the technological areas it applies This section presents our machine learning ap- to. We assume that patents filed under the proach to expand a company query q; the classi- same company name are likely to share the fier c delivers the set A∗q = {a ∈ A | c(q, a) = 1}, same set of IPC codes, regardless whether an approximation of the ideal set of relevant as- the company name is misspelled or not. signee names Aq . As a classification technol- Hence, if we determine the IPC codes of ogy a support vector machine with linear kernel patents which contain q in the assignee is used, which receives each pair (q, a) as a six- name, IPC(q), and the IPC codes of patents dimensional feature vector. For training and test filed under assignee name a, IPC(a), then purposes we identified misspellings for 100 dif- the intersection size of the two sets serves as ferent company names. A detailed description of an indicator for a misspelled company name the constructed test corpus and a report on the in a: classifiers performance is given in the remainder IPC(q) ∩ IPC(a) of this section. FIPC (q, a) = IPC(q) ∪ IPC(a) 3.1 Feature Set The feature set comprises six features, three of 3. Company Suffix Match. The suffix match them being orthographic similarity metrics, which relies on the company suffixes Suffixes(q) are computed for every pair (q, a). Each metric that occur in the assignee names of A con- compares a given company name q with the first taining q. 
Similar to the IPC overlap fea- |q| words of the assignee name a: ture, we argue that if the company suffix of a exists in the set Suffixes(q), a mis- 1. SoftTfIdf. The SoftTfIdf metric is consid- spelling in a is likely: FSuffixes (q, a) = 1 ered, since the metric is suitable for the com- iff Suffixes(a) ∈ Suffixes(q). parison of names (Cohen et al., 2003). The metric incorporates the Jaro-Winkler met- 3.2 Webis Patent Retrieval Assignee Corpus ric (Winkler, 1999) with a distance threshold A key contribution of our work is a new cor- of 0.9. The frequency values for the similar- pus called Webis Patent Retrieval Assignee Cor- ity computation are trained on A. pus 2012 (Webis-PRA-12). We compiled the cor- 2. Soundex. The Soundex spelling correction pus in order to assess the impact of misspelled algorithm captures phonetic errors. Since the companies on patent retrieval and the effective- algorithm computes hash values for both q ness of our classifier to detect them.3 The corpus and a, the feature is 1 if these hash values is built on the basis of 2 132 825 patents D granted are equal, 0 otherwise. by the USPTO between 2001 and 2010; the patent 3. Levenshtein distance. The Levenshtein dis- corpus is provided publicly by the USPTO in tance for (q, a) is normalized by the charac- XML format. Each patent contains bibliographic ter length of q. fields as well as textual information such as the abstract and the claims section. Since we are in- To obtain further evidence for a misspelling terested in the assignee name a associated with in an assignee name, meta information about the each patent d ∈ D, we parse each patent and ex- patents in D, to which the assignee name refers tract the assignee name. This yields the set A of to, is exploited. In this regard, the following three 202 846 different assignee names. Each assignee features are derived: name refers to a set of patents, which size varies 1. Assignee Name Frequency. 
Each assignee name refers to a set of patents, whose size varies from 1 to 37,202 (the number of patents filed under "International Business Machines Corporation").

[3] The Webis-PRA-12 corpus is freely available via www.webis.de/research/corpora
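Several of the Section 3.1 features can be sketched in a few lines. The Soundex shown here is the textbook variant and may differ in detail from the implementation used by the authors; SoftTfIdf is omitted, and the helper names are ours:

```python
def soundex(word):
    """Textbook Soundex hash (simplified): first letter plus consonant codes."""
    codes = {c: d for d, cs in enumerate(
        ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1) for c in cs}
    word = word.lower()
    out, last = word[0].upper(), codes.get(word[0])
    for c in word[1:]:
        code = codes.get(c)
        if code and code != last:   # skip vowels and repeated codes
            out += str(code)
        last = code
    return (out + "000")[:4]

def ipc_overlap(ipc_q, ipc_a):
    """Jaccard overlap of IPC code sets, as in the F_IPC feature."""
    return len(ipc_q & ipc_a) / len(ipc_q | ipc_a)

# the phonological error of Table 2 hashes to the same code
print(soundex("Emulecks") == soundex("Emulex"))  # → True
print(ipc_overlap({"A61", "B65"}, {"A61"}))      # → 0.5
```

A pair like Emulecks/Emulex is invisible to substring matching but caught by the phonetic feature, which is the motivation for combining several metrics in one feature vector.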
For each company name q ∈ Q, we apply a semi-automated procedure to derive the set of relevant assignee names Aq. In a first step, all assignee names in A which do not refer to the company name q are filtered out automatically. From a preliminary evaluation we concluded that the Levenshtein distance d(q, a) with a relative threshold of |q|/2 is a reasonable choice for this filtering step. The resulting sets A+q = {a ∈ A | d(q, a) ≤ |q|/2} contain, in total over Q, 14 189 assignee names. These assignee names are annotated by human assessors within a second step to derive the final set Aq for each q ∈ Q. Altogether we identify 1 538 assignee names that refer to the 100 companies in Q. With respect to our classification task, the assignee names in each Aq are positive examples; the remaining assignee names A+q \ Aq form the set of negative examples (12 651 in total).

During the manual assessment, names of assignees which include the correct company name q were distinguished from misspelled ones. The latter holds true for 379 of the 1 538 assignee names. These names are not retrievable by the baseline system, and thus form the main target for our classifier. The second row of Table 4 reports on the distribution of the 379 misspelled assignee names. As expected, the longer the company name, the more spelling errors occur. Companies which file patents under many different assignee names are also more likely to have patents with misspellings in the company name.
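The first, automatic filtering step amounts to keeping every assignee name within a relative edit-distance threshold of the company name. A minimal sketch, assuming d is the plain character-level Levenshtein distance and lower-case normalization (the paper does not state its normalization; helper names are hypothetical):

```python
# Sketch of the automatic candidate filtering: keep an assignee name a
# for company q iff d(q, a) <= |q| / 2 (relative Levenshtein threshold).

def levenshtein(s: str, t: str) -> int:
    # Standard dynamic-programming edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def candidate_set(q: str, assignee_names: set[str]) -> set[str]:
    """A+_q: assignee names within relative edit distance |q|/2 of q."""
    threshold = len(q) / 2
    return {a for a in assignee_names
            if levenshtein(q.lower(), a.lower()) <= threshold}
```

The generous |q|/2 threshold keeps recall high for the subsequent manual annotation step; precision is restored later by the assessors and the classifier.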
3.3 Classifier Performance

For the evaluation with the Webis-PRA-12 corpus, we train a support vector machine,4 which considers the six outlined features, and compare it to the other expansion techniques. For the training phase, we use 2/3 of the positive examples to form a balanced training set of 1 025 positive and 1 025 negative examples. After 10-fold cross-validation, the achieved classification accuracy is 95.97%.

For a comparison of the expansion techniques on the test set, which contains the examples not considered in the training phase, two tasks are distinguished: finding near-duplicates of assignee names (cf. Table 5, Columns 3–5), and finding all patents of a company (cf. Table 5, Columns 6–8). The latter refers to the actual task of portfolio search. It can be observed that the performance improvements on both tasks are quite similar. The baseline expansion ∅ yields a recall of 0.83 in the first task. The difference of 0.17 to a perfect recall can be addressed by considering query expansion techniques. If the trivial expansion A is applied to the task, the maximum recall can be achieved, which, however, is bought with a precision close to zero. Using the edit distance expansion A+q yields a precision of 0.274 while keeping the recall at maximum. Finally, the machine learning expansion A*q leads to a dramatic improvement (cf. Table 5, bottom lines).

4 We use the implementation of the WEKA toolkit with default parameters.

Table 5: The search results (macro-averaged) for two retrieval tasks and various expansion techniques. Besides precision and recall, the F-measure with β = 2 is stated.

Misspelling detection           Task: assignee names     Task: patents
                                P     R     F2           P     R     F2
Baseline (∅)                    .975  .829  .838         .993  .967  .968
Trivial (A)                     .000  1.0   .001         .001  1.0   .005
Edit distance (A+q)             .274  1.0   .499         .412  1.0   .672
SVM (Levenshtein)               .752  .981  .853         .851  .991  .911
SVM (SoftTfIdf)                 .702  .980  .796         .826  .993  .886
SVM (Soundex)                   .433  .931  .624         .629  .984  .759
SVM (orthographic features)     .856  .975  .922         .942  .990  .967
SVM (A*q, all features)         .884  .975  .938         .956  .995  .980
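Table 5 reports the F-measure with β = 2, which weights recall higher than precision, matching the paper's emphasis on not losing relevant assignee names. A minimal sketch of the standard formula:

```python
# F-measure with beta > 1 emphasizing recall over precision
# (standard definition; not tied to the authors' evaluation code).

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Note that the table's values are macro-averaged over companies, so the printed F2 columns need not equal f_beta applied to the printed, averaged P and R columns.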
The exploitation of patent meta-features significantly outperforms the exclusive use of orthography-related features: the increase in recall achieved by A*q is statistically significant (matched-pair t-test) for both tasks (assignee names task: t = −7.6856, df = 99, p = 0.00; patents task: t = −2.1113, df = 99, p = 0.037). Note that, when applied as a single feature, none of the spelling metrics (Levenshtein, SoftTfIdf, Soundex) is able to achieve a recall close to 1 without significantly impairing the precision.

4 Distribution of Spelling Errors

Encouraged by the promising retrieval results achieved on the Webis-PRA-12 corpus, we extend the analysis of spelling errors in patents to the entire USPTO corpus of granted patents between 2001 and 2010. The analysis focuses on the following two research questions:

1. Are spelling errors an increasing issue in patents? According to Adams (2010), the amount of spelling errors should have increased in the last years due to the electronic patent filing process (cf. Section 1.2). We address this hypothesis by analyzing the distribution of spelling errors in company names that occur in patents granted between 2001 and 2010.

2. Are misspellings introduced deliberately in patents? We address this question by analyzing the patents with respect to the eight technological areas of the International Patent Classification scheme IPC: A (Human necessities), B (Performing operations; transporting), C (Chemistry; metallurgy), D (Textiles; paper), E (Fixed constructions), F (Mechanical engineering; lighting; heating; weapons; blasting), G (Physics), and H (Electricity). If spelling errors are introduced accidentally, then we expect them to be uniformly distributed across all areas. A biased distribution, on the other hand, indicates that errors might be inserted deliberately.

In the following, we compile a second corpus on the basis of the entire set A of assignee names. In order to yield a uniform distribution of the companies across years, technological areas, and countries, a set of 120 assignee names is extracted for each dimension. After the removal of duplicates, we revised these assignee names manually in order to check (and correct) their spelling. Finally, trailing business suffixes are removed, which results in a set of 3 110 company names. For each company name q, we generate the set A*q as described in Section 3.

The results of our analysis are shown in Table 6. Table 6(a) refers to the first research question and shows that the amount of misspellings in companies decreased over the years, from 6.67% in 2001 to 4.74% in 2010 (cf. Row 3). These results let us reject the hypothesis of Adams (2010). Nevertheless, the analysis provides evidence that spelling errors are still an issue. For example, the companies identified with the most spelling errors are "Koninklijke Philips Electronics" with 45 misspellings in 2008, and "Centre National de la Recherche Scientifique" with 28 misspellings in 2009. The results are consistent with our findings with respect to the Fortune 500 sample (cf. Table 4), where company names that are longer and presumably more difficult to write contain more spelling errors.
Table 6: Distribution of spelling errors for 3 110 company identifiers in the USPTO patents. The mean of spelling errors per company identifier and the standard deviation σ refer to companies with misspellings. The last row in each table shows the number of patents that are additionally found if the original query q is expanded by A*q.

(a) Distribution of spelling errors between the years 2001 and 2010.

Year                                    2001   2002   2003   2004   2005   2006   2007   2008   2009   2010
Number of companies                    1 028  1 066  1 115  1 151  1 219  1 261  1 274  1 210  1 224  1 268
Number of companies with misspellings     67     63     53     65     65     60     65     64     53     60
Companies with misspellings (%)         6.52   5.91   4.75   5.65   5.33   4.76   5.1    5.29   4.33   4.73
Mean                                    2.78   2.35   2.23   2.28   2.18   2.48   2.23   3.0    2.64   2.8
Standard deviation σ                    4.62   3.3    3.63   3.13   2.8    3.55   2.87   6.37   4.71   4.6
Maximum misspellings per company          24     12     16     12     10     18     12     45     28     22
Additional number of patents            7.1    7.21   7.43   7.68   7.91   8.48   7.83   8.84   8.92   8.92

(b) Distribution of spelling errors based on the IPC scheme.

IPC code                                   A      B      C      D      E      F      G      H
Number of companies                      954  1 231    811    277    412    771  1 232    949
Number of companies with misspellings     59     70     51      7     10     33     83     63
Companies with misspellings (%)         6.18   5.69   6.29   2.53   2.43   4.28   6.74   6.64
Mean                                    3.0    2.49   3.57   1.86   2.8    1.88   3.29   4.05
Standard deviation σ                    5.28   3.65   7.03   1.99   4.22   2.31   5.72   7.13
Maximum misspellings per company          32     14     40      3     12      6     24     35
Additional number of patents            9.25   9.67  11.12   4.71   4.6    4.79   8.92  12.84

In contrast to the relatively uniform distribution of misspellings over the years, the situation with regard to the technological areas is different (cf. Table 6(b)). Most companies are associated with the IPC sections G and B, which both refer to technical domains (cf. Table 6(b), Row 1). The percentage of misspellings in these sections is increased compared to the spelling errors grouped by year. A significant difference can be seen for the sections D and E. Here, the number of assigned companies drops below 450, and the percentage of misspellings decreases significantly, from about 6% to 2.5%. These findings might support the hypothesis that spelling errors are inserted deliberately in technical domains.
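The per-year statistics in Table 6(a) amount to a simple aggregation over (year, company) observations. A sketch under an assumed, hypothetical input format (records of year, company, and number of detected misspelled assignee names):

```python
from collections import defaultdict

# Sketch of the Table 6(a)-style aggregation: for each year, the share of
# companies with at least one detected misspelling. Input format assumed.
def misspelling_stats(records):
    companies = defaultdict(set)   # year -> companies observed
    misspelled = defaultdict(set)  # year -> companies with errors
    for year, company, n_misspellings in records:
        companies[year].add(company)
        if n_misspellings > 0:
            misspelled[year].add(company)
    return {year: 100.0 * len(misspelled[year]) / len(companies[year])
            for year in companies}

records = [(2001, "Acme Corp", 2), (2001, "Foo Inc", 0),
           (2002, "Acme Corp", 0), (2002, "Foo Inc", 0)]
```

The same grouping, keyed on IPC section instead of year, yields the figures of Table 6(b).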
5 Conclusions

While researchers in the patent domain concentrate on retrieval models and algorithms to improve the search performance, the original aspect of our paper is that it points to a different (and orthogonal) research avenue: the analysis of patent inconsistencies. With the analysis of spelling errors in assignee names we made a first yet considerable contribution in this respect; searches with assignee constraints become a more sensible operation. We showed how a special treatment of spelling errors can significantly raise the effectiveness of patent search. The identification of this untapped potential, but also the utilization of machine learning to combine patent features with typography, form our main contributions.

Our current research broadens the application of patent spelling analysis. In order to identify errors that are introduced deliberately, we investigate different types of misspellings (edit distance versus phonological). Finally, we consider the analysis of the acquisition histories of companies as a promising research direction: since acquired companies often own granted patents, these patents should be considered while searching for the company in question in order to further increase the recall.

Acknowledgements

This work is supported in part by the German Science Foundation under grants STE1019/2-1 and FU205/22-1.
References

Stephen Adams. 2010. The Text, the Full Text and Nothing but the Text: Part 1 – Standards for Creating Textual Information in Patent Documents and General Search Implications. World Patent Information, 32(1):22–29, March.

Mikhail Bilenko and Raymond J. Mooney. 2002. Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, USA, February.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In EMNLP-CoNLL '07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867. ACL, June.

Qing Chen, Mu Li, and Ming Zhou. 2007. Improving Query Spelling Correction Using Web Search Results. In EMNLP-CoNLL '07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 181–189. ACL, June.

Peter Christen. 2006. A Comparison of Personal Name Matching: Techniques and Practical Issues. In ICDM '06: Workshops Proceedings of the Sixth IEEE International Conference on Data Mining, pages 290–294. IEEE Computer Society, December.

William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Subbarao Kambhampati and Craig A. Knoblock, editors, IIWeb '03: Proceedings of the IJCAI Workshop on Information Integration on the Web, pages 73–78, August.

Fred J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7(3):171–176.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16.

Caspar J. Fall and Christophe Giraud-Carrier. 2005. Searching Trademark Databases for Verbal Similarities. World Patent Information, 27(2):135–143.

Matthias Hagen and Benno Stein. 2011. Candidate Document Retrieval for Web-Scale Text Reuse Detection. In 18th International Symposium on String Processing and Information Retrieval (SPIRE 11), volume 7024 of Lecture Notes in Computer Science, pages 356–367. Springer.

David Hunt, Long Nguyen, and Matthew Rodgers, editors. 2007. Patent Searching: Tools & Techniques. Wiley.

Intellevate Inc. 2006. Patent Quality, a blog entry. http://www.patenthawk.com/blog/2006/01/patent_quality.html, January.

Hideo Joho, Leif A. Azzopardi, and Wim Vanderbauwhede. 2010. A Survey of Patent Users: An Analysis of Tasks, Behavior, Search Functionality and System Requirements. In IIiX '10: Proceedings of the Third Symposium on Information Interaction in Context, pages 13–24, New York, NY, USA. ACM.

Donald E. Knuth. 1997. The Art of Computer Programming, Volume I: Fundamental Algorithms, 3rd Edition. Addison-Wesley.

Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8):707–710. Original in Doklady Akademii Nauk SSSR, 163(4):845–848.

Yanen Li, Huizhong Duan, and ChengXiang Zhai. 2011. CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources. In Spelling Alteration for Web Search Workshop, pages 10–14, July.

Patrice Lopez and Laurent Romary. 2010. Experiments with Citation Mining and Key-Term Extraction for Prior Art Search. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors. 2011. Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series. Springer.

Walid Magdy and Gareth J. F. Jones. 2010. Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Walid Magdy and Gareth J. F. Jones. 2011. A Study on Query Expansion Methods for Patent Retrieval. In PAIR '11: Proceedings of the 4th Workshop on Patent Information Retrieval, pages 19–24, New York, NY, USA. ACM.

Alvaro E. Monge and Charles Elkan. 1997. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In DMKD '97: Proceedings of the 2nd Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23–29, New York, NY, USA. ACM.

Heiko Müller and Johann-C. Freytag. 2003. Problems, Methods and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt-Universität zu Berlin, Institut für Informatik, Germany.

Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.

Yoh Okuno. 2011. Spell Generation Based on Edit Distance. In Spelling Alteration for Web Search Workshop, pages 25–26, July.

Martin Potthast and Benno Stein. 2008. New Issues in Near-duplicate Detection. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme, and Reinhold Decker, editors, Data Analysis, Machine Learning and Applications. Selected Papers from the 31st Annual Conference of the German Classification Society (GfKl 07), Studies in Classification, Data Analysis, and Knowledge Organization, pages 601–609, Berlin Heidelberg New York. Springer.

Benno Stein and Daniel Curatolo. 2006. Phonetic Spelling and Heuristic Search. In Gerhard Brewka, Silvia Coradeschi, Anna Perini, and Paolo Traverso, editors, 17th European Conference on Artificial Intelligence (ECAI 06), pages 829–830, Amsterdam, August. IOS Press.

Benno Stein and Matthias Hagen. 2011. Introducing the User-over-Ranking Hypothesis. In Advances in Information Retrieval. 33rd European Conference on IR Research (ECIR 11), volume 6611 of Lecture Notes in Computer Science, pages 503–509, Berlin Heidelberg New York, April. Springer.

U.S. Patent & Trademark Office. 2010. Manual of Patent Examining Procedure (MPEP), Eighth Edition, July.

William W. Winkler. 1999. The State of Record Linkage and Current Research Problems. Technical report, Statistical Research Division, U.S. Bureau of the Census.

Xiaobing Xue and Bruce W. Croft. 2009. Automatic Query Generation for Patent Search. In CIKM '09: Proceedings of the Eighteenth ACM Conference on Information and Knowledge Management, pages 2037–2040, New York, NY, USA. ACM.
UBY – A Large-Scale Unified Lexical-Semantic Resource Based on LMF

Iryna Gurevych†‡, Judith Eckle-Kohler‡, Silvana Hartmann‡, Michael Matuschek‡, Christian M. Meyer‡ and Christian Wirth‡

† Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information
‡ Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 580–590, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

Abstract

We present UBY, a large-scale lexical-semantic resource combining a wide range of information from expert-constructed and collaboratively constructed resources for English and German. It currently contains nine resources in two languages: English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet, German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki, modeled according to the LMF standard. For FrameNet, VerbNet and all collaboratively constructed resources, this is done for the first time. Our LMF model captures lexical information at a fine-grained level by employing a large number of Data Categories from ISOCat and is designed to be directly extensible by new languages and resources. All resources in UBY can be accessed with an easy-to-use, publicly available API.

1 Introduction

Lexical-semantic resources (LSRs) are the foundation of many NLP tasks such as word sense disambiguation, semantic role labeling, question answering and information extraction. They are needed on a large scale in different languages. The growing demand for resources is met neither by the largest single expert-constructed resources (ECRs), such as WordNet and FrameNet, whose coverage is limited, nor by collaboratively constructed resources (CCRs), such as Wikipedia and Wiktionary, which encode lexical-semantic knowledge in a less systematic form than ECRs, because they are lacking expert supervision.

Previously, there have been several independent efforts of combining existing LSRs to enhance their coverage w.r.t. their breadth and depth, i.e. (i) the number of lexical items, and (ii) the types of lexical-semantic information contained (Shi and Mihalcea, 2005; Johansson and Nugues, 2007; Navigli and Ponzetto, 2010b; Meyer and Gurevych, 2011). As these efforts often targeted particular applications, they focused on aligning selected, specialized information types. To our knowledge, no single work has focused on modeling a wide range of ECRs and CCRs in multiple languages and a large variety of information types in a standardized format. Frequently, the presented model is not easily scalable to accommodate an open set of LSRs in multiple languages and the information mined automatically from corpora. The previous work also lacked the aspects of lexicon format standardization and API access. We believe that easy access to information in LSRs is crucial in terms of their acceptance and broad applicability in NLP.

In this paper, we propose a solution to this. We define a standardized format for modeling LSRs. This is a prerequisite for resource interoperability and the smooth integration of resources. We employ the ISO standard Lexical Markup Framework (LMF: ISO 24613:2008), a metamodel for LSRs (Francopoulo et al., 2006), and Data Categories (DCs) selected from ISOCat.1 One of the main challenges of our work is to develop a model that is standard-compliant, yet able to express the information contained in diverse LSRs, and that in the long term supports the integration of the various resources.

1 http://www.isocat.org/
The main contributions of this paper can be summarized as follows: (1) We present an LMF-based model for large-scale multilingual LSRs called UBY-LMF. We model the lexical-semantic information down to a fine-grained level (e.g. syntactic frames) and employ standardized definitions of linguistic information types from ISOCat. (2) We present UBY, a large-scale LSR implementing the UBY-LMF model. UBY currently contains nine resources in two languages: English WordNet (WN, Fellbaum (1998)), Wiktionary2 (WKT-en), Wikipedia3 (WP-en), FrameNet (FN, Baker et al. (1998)), and VerbNet (VN, Kipper et al. (2008)); German Wiktionary (WKT-de), Wikipedia (WP-de), and GermaNet (GN, Kunze and Lemnitzer (2002)); and the English and German entries of OmegaWiki4 (OW), referred to as OW-en and OW-de. OW, a novel CCR, is inherently multilingual – its basic structure are multilingual synsets, which are a valuable addition to our multilingual UBY. Essential to UBY are the nine pairwise sense alignments between resources, which we provide to enable resource interoperability on the sense level, e.g. by providing access to the often complementary information for a sense in different resources. (3) We present a Java API which offers unified access to the information contained in UBY.

We will make the UBY-LMF model, the resource UBY and the API freely available to the research community.5 This will make it easy for the NLP community to utilize UBY in a variety of tasks in the future.

2 http://www.wiktionary.org/
3 http://www.wikipedia.org/
4 http://www.omegawiki.org/
5 http://www.ukp.tu-darmstadt.de/data/uby

2 Related Work

The work presented in this paper concerns the standardization of LSRs, large-scale integration thereof at the representational level, and unified access to lexical-semantic information in the integrated resources.

Standardization of resources. Previous work includes models for representing lexical information relative to ontologies (Buitelaar et al., 2009; McCrae et al., 2011), and standardized single wordnets (English, German and Italian wordnets) in the ISO standard LMF (Soria et al., 2009; Henrich and Hinrichs, 2010; Toral et al., 2010).

McCrae et al. (2011) propose LEMON, a conceptual model for lexicalizing ontologies as an extension of the LexInfo model (Buitelaar et al., 2009). LEMON provides an LMF implementation in the Web Ontology Language (OWL), which is similar to UBY-LMF, as it also uses DCs from ISOCat, but it diverges further from the standard (e.g. by removing structural elements such as the predicative representation class). While we focus on modeling lexical-semantic information comprehensively and at a fine-grained level, the goal of LEMON is to support the linking between ontologies and lexicons. This goal entails a task-targeted application: domain-specific lexicons are extracted from ontology specifications and merged with existing LSRs on demand. As a consequence, there is no available large-scale instance of the LEMON model.

Soria et al. (2009) define WordNet-LMF, an LMF model for representing wordnets used in the KYOTO project, and Henrich and Hinrichs (2010) do this for GN, the German wordnet. These models are similar, but they still present different implementations of the LMF metamodel, which hampers interoperability between the resources. We build upon this work, but extend it significantly: UBY goes beyond modeling a single ECR and represents a large number of both ECRs and CCRs with very heterogeneous content in the same format. Also, UBY-LMF features deeper modeling of lexical-semantic information. Henrich and Hinrichs (2010), for instance, do not explicitly model the argument structure of subcategorization frames, since each frame is represented as a string. In UBY-LMF, we represent them at the fine-grained level necessary for the transparent modeling of the syntax-semantics interface.

Large-scale integration of resources. Most previous research efforts on the integration of resources targeted world knowledge rather than lexical-semantic knowledge. Well-known examples are YAGO (Suchanek et al., 2007) or DBpedia (Bizer et al., 2009).

Atserias et al. (2004) present the Meaning Multilingual Central Repository (MCR). MCR integrates five local wordnets based on the Interlingual Index of EuroWordNet (Vossen, 1998). The overall goal of the work is to improve word sense disambiguation. This work is similar to ours, as it aims at a large-scale multilingual resource and includes several resources. It is, however, restricted to a single type of resource (wordnets) and features a single type of lexical information (semantic relations) specified upon synsets. Similarly, de Melo and Weikum (2009) create a multilingual wordnet by integrating wordnets, bilingual dictionaries and information from parallel corpora. None of these resources integrate lexical-semantic information such as syntactic subcategorization or semantic roles.

McFate and Forbus (2011) present NULEX, a syntactic lexicon automatically compiled from WN, WKT-en and VN. As their goal is to create an open-license resource to enhance syntactic parsing, they enrich verbs and nouns in WN with inflection information from WKT-en and syntactic frames from VN. Thus, they only use a small part of the lexical information present in WKT-en.

Padró et al. (2011) present their work on lexicon merging within the Panacea Project. One goal of Panacea is to create a lexical resource development platform that supports large-scale lexical acquisition and can be used to combine existing lexicons with automatically acquired ones. To this end, Padró et al. (2011) explore the automatic integration of subcategorization lexicons. Their current work only covers Spanish, and though they mention the LMF standard as a potential data model, they do not make use of it.

Shi and Mihalcea (2005) integrate FN, VN and WN, and Palmer (2009) presents a combination of PropBank, VN and FN in a resource called SEMLINK in order to enhance semantic role labeling. Similar to our work, multiple resources are integrated, but their work is restricted to a single language and does not cover CCRs, whose popularity and importance have grown tremendously over the past years. In fact, with the exception of NULEX, CCRs have only been considered in the sense alignment of individual resource pairs (Navigli and Ponzetto, 2010a; Meyer and Gurevych, 2011).

API access for resources. An important factor in the success of a large, integrated resource is a single public API, which facilitates access to the information contained in the resource. The most important LSRs so far can be accessed using various APIs, for instance the Java WordNet API6 or the Java-based Wikipedia API.7

With a stronger focus of the NLP community on sharing data and reproducing experimental results, these tools are becoming important as never before. Therefore, a major design objective of UBY is a single API. This is similar in spirit to the motivation of Pradhan et al. (2007), who present integrated access to corpus annotations as a main goal of their work on standardizing and integrating corpus annotations in the OntoNotes project.

To summarize, related work focuses either on the standardization of single resources (or of a single type of resource), which leads to several slightly different formats constrained to these resources, or on the integration of several resources in an idiosyncratic format. CCRs have not been considered at all in previous work on resource standardization, and the level of detail of the modeling is insufficient to fully accommodate different types of lexical-semantic information. API access is rarely provided. This makes it hard for the community to exploit these results on a large scale, and it diminishes the impact that these projects might achieve upon NLP beyond their original specific purpose, which would be greater if their results were represented in a unified resource and could easily be accessed by the community through a single public API.

6 http://sourceforge.net/projects/jwordnet/
7 http://code.google.com/p/jwpl/
3 UBY – Data Model

LMF defines a metamodel of LSRs in the Unified Modeling Language (UML). It provides a number of UML packages and classes for modeling many different types of resources, e.g. wordnets and multilingual lexicons. The design of a standard-compliant lexicon model in LMF involves two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). In the second step, these UML classes are enriched by attributes. To contribute to semantic interoperability, it is essential for the lexicon model that the attributes and their values refer to Data Categories (DCs) taken from a reference repository. DCs are standardized specifications of the terms that are used for attributes and their values, or in other words, of the linguistic vocabulary occurring in a lexicon model. Consider, for instance, the term lexeme, which is defined differently in WN and FN: in FN, a lexeme refers to a word form, not including the sense aspect. In WN, on the contrary, a lexeme is an abstract pairing of meaning and form. According to LMF, the DCs are to be selected from ISOCat, the implementation of the ISO 12620 Data Category Registry (DCR, Broeder et al. (2010)), resulting in a Data Category Selection (DCS).

Design of UBY-LMF. We have designed UBY-LMF8 as a model of the union of various heterogeneous resources, namely WN, GN, FN, and VN on the one hand and CCRs on the other hand. Two design principles guided our development of UBY-LMF: first, to preserve the information available in the original resources and to uniformly represent it in UBY-LMF; second, to be able to extend UBY in the future by further languages, resources, and types of linguistic information, in particular alignments between different LSRs.

Wordnets, FN and VN are largely complementary regarding the information types they provide, see, e.g., Baker and Fellbaum (2009). Accordingly, they use different organizational units to represent this information. Wordnets, such as WN and GN, primarily contain information on lexical-semantic relations, such as synonymy, and use synsets (groups of lexemes that are synonymous) as organizational units. FN focuses on groups of lexemes that evoke the same prototypical situation (so-called semantic frames, Fillmore (1982)) involving semantic roles (so-called frame elements). VN, a large-scale verb lexicon, is organized in Levin-style verb classes (Levin, 1993) (groups of verbs that share the same syntactic alternations and semantic roles) and provides rich subcategorization frames including semantic roles and a specification of semantic predicates.

UBY-LMF employs several direct subclasses of Lexicon in order to account for the various organization types found in the different LSRs considered. While the LexicalEntry class reflects the traditional headword-based lexicon organization, Synset represents synsets from wordnets, SemanticPredicate models FN semantic frames, and SubcategorizationFrameSet corresponds to VN alternation classes. SubcategorizationFrame is composed of syntactic arguments, while SemanticPredicate is composed of semantic arguments. The linking between syntactic and semantic arguments is represented by the SynSemCorrespondence class.

The SenseAxis class is very important in UBY-LMF, as it connects the different source LSRs. Its role is twofold: first, it links corresponding word senses from different languages, e.g. English and German. Second, it represents monolingual sense alignments, i.e. sense alignments between different lexicons in the same language. The latter is a novel interpretation of SenseAxis introduced by UBY-LMF.

The organization of lexical-semantic knowledge found in WP, WKT, and OW can be modeled with the classes in UBY-LMF as well. WP primarily provides encyclopedic information on nouns. It mainly consists of article pages, which are modeled as Senses in UBY-LMF. WKT is in many ways similar to traditional dictionaries, because it enumerates senses under a given headword on an entry page. Thus, WKT entry pages can be represented by LexicalEntries and WKT senses by Senses. OW is different from WKT and WP, as it is organized in multilingual synsets. To model OW in UBY-LMF, we split the synsets per language and included them as monolingual Synsets in the corresponding Lexicon (e.g., OW-en or OW-de). The original multilingual information is preserved by adding a SenseAxis between corresponding synsets in OW-en and OW-de.

The LMF standard itself contains only few linguistic terms and specifies neither attributes nor their values. Therefore, an important task in developing UBY-LMF has been the specification of attributes and their values, along with the proper attachment of attributes to LMF classes. In particular, this task involved selecting DCs from ISOCat and, if necessary, adding new DCs to ISOCat.

Extensions in UBY-LMF. Although UBY-LMF is largely compliant with LMF, the task of building a homogeneous lexicon model for many highly heterogeneous LSRs led us to extend LMF in several ways: we added two new classes and several new relationships between classes.

First, we were facing a huge variety of lexical-semantic labels for many different dimensions of semantic classification. Examples of such dimensions include ontological type (e.g. selectional restrictions in VN and FN), domain (e.g. Biology in WN), style and register (e.g. labels in WKT, OW), or sentiment (e.g. sentiment of lexical units in FN). Since we aim at an extensible LMF model, capable of representing further dimensions of semantic classification, we did not squeeze the information on semantic classes present in the considered LSRs into existing LMF classes. Instead, we addressed this issue by introducing a more general class, SemanticLabel, which is an optional subclass of Sense, SemanticPredicate, and SemanticArgument.

8 See www.ukp.tu-darmstadt.de/data/uby
This new class has morphological relation would lead to information three attributes, encoding the name of the label, loss in some cases. Consider as an example the its type (e.g. ontological, register, sentiment), and WN verb buy (purchase) which is derivationally a numeric quantification (e.g. sentiment strength). related to the noun buy, while on the other hand Second, we attached the subclass Frequency buy (accept as true, e.g. I can’t buy this story) is to most of the classes in U BY-LMF, in order to not derivationally related to the noun buy. We ad- encode frequency information. This is of partic- dressed this issue by adding a sense attribute to ular importance when using the resource in ma- the RelatedForm class. Thus, in extension of chine learning applications. This extension of the LMF, U BY-LMF allows sense relations to refer to standard has already been made in WordNet-LMF a form relation target and morphological relations (Soria et al., 2009). Currently, the Frequency to refer to a sense relation target. class is used to keep corpus frequencies for lex- Data Categories in U BY-LMF. We encoun- ical units in FN, but we plan to use it for en- tered large differences in the availability of DCs riching many other classes with frequency in- in ISOCat for the morpho-syntactic, lexical- formation in future work, such as Senses or syntactic, and lexical-semantic parts of U BY- SubcategorizationFrames. LMF. Many DCs were missing in ISOCat and we Third, the representation of FN in LMF re- had to enter them ourselves. While this was feasi- quired adding two new relationships between ble at the morpho-syntactic and lexical-syntactic LMF classes: we added a relationship between level, due to a large body of standardization re- SemanticArgument and Definition, in or- sults available, it was much harder at the lexical- der to represent the definitions available for frame semantic level where standardization is still on- elements in FN. 
In addition, we added a re- going. At the lexical-semantic level, U BY-LMF lationship between the Context class and the currently allows string values for a number of at- MonoLingualExternalRef, to represent the tribute values, e.g. for semantic roles. We can eas- links to annotated corpus sentences in FN. ily integrate the results of the ongoing standard- Finally, WKT turned out to be hard to tackle, ization efforts into U BY-LMF in the future. because it contains a special kind of ambiguity in the semantic relations and translation links listed 4 U BY – Population with information for senses: the targets of both relations and trans- 4.1 Representing LSRs in U BY-LMF lation links are ambiguous, as they refer to lem- mas (word forms), rather than to senses (Meyer U BY-LMF is represented by a DTD (as suggested and Gurevych, 2010). These ambiguous rela- by the standard) which can be used to automat- tion targets could not directly be represented in ically convert any given resource into the corre- LMF, since sense and translation relations are sponding XML format.9 This conversion requires defined between senses. To resolve this, we a detailed analysis of the resource to be converted, added a relationship between SenseRelation followed by the definition of a mapping of the and FormRepresentation, in order to encode 9 Therefore, U BY-LMF can be considered as a serializa- the ambiguous WKT relation target as a word tion of LMF. 584 concepts and terms used in the original resource tries provide links to the corresponding WP page. to the U BY-LMF model. There are two major Also, the German and English language editions tasks involved in the development of an automatic of WP and OW are connected by inter-language conversion routine: first, the basic organizational links between articles (Senses in U BY). We can unit in the source LSR has to be identified and expect that these links have high quality, as they mapped, e.g. 
synset in WN or semantic frame in were entered manually by users and are subject FN, and second, it has to be determined, how a to community control. Therefore, we straightfor- (LMF) sense is defined in the source LSR. wardly imported them into U BY. A notable aspect of converting resources into U BY-LMF is the harmonization of linguistic ter- Alignment Framework. Automatically creat- minology used in the LSRs. For instance, a ing new alignments is difficult because of word WN Word and a GN Lexical Unit are mapped to ambiguities, different granularities of senses, Sense in U BY-LMF. or language specific conceptualizations (Navigli, We developed reusable conversion routines for 2006). To support this task for a large number the future import of updated versions of the source of resources across languages, we have designed LSRs into U BY, provided the structure of the a flexible alignment framework based on the source LSR remains stable. These conversion state-of-the-art method of Niemann and Gurevych routines extract lexical data from the source LSRs (2011). The framework is generic in order to al- by calling their native APIs (rather than process- low alignments between different kinds of entities ing the underlying XML data). Thus, all lexical as found in different resources, e.g. WN synsets, information which can be accessed via the APIs FN frames or WP articles. The only requirement is converted into U BY-LMF. is that the individual entities are distinguishable by a unique identifier in each resource. Converting the LSRs introduced in the previ- ous section yielded an instantiation of U BY-LMF The alignment consists of the following steps: named U BY. The LexicalResource instance First, we extract the alignment candidates for a U BY currently comprises 10 Lexicon instances, given resource pair, e.g. WN sense candidates for one each for OW-de and OW-en, and one lexicon a WKT-en entry. Second, we create a gold stan- each for the remaining eight LSRs. 
dard by manually annotating a subset of candi- date pairs as “valid“ or “non-valid“. Then, we 4.2 Adding Sense Alignments extract the sense representations (e.g. lemmatized bag-of-words based on glosses) to compute the Besides the uniform and standardized representa- similarity of word senses (e.g. by cosine similar- tion of the single LSRs, one major asset of U BY ity). The gold standard with corresponding sim- is the semantic interoperability of resources at the ilarity values is fed into Weka (Hall et al., 2009) sense level. In the following, we (i) describe how to train a machine learning classifier, and in the we converted already existing sense alignments of final step this classifier is used to automatically resources into LMF, and (ii) present a framework classify the candidate sense pairs as (non-)valid to infer alignments automatically for any pair of alignment. Our framework also allows us to train resources. on a combination of different similarity measures. Existing Alignments. Previous work on sense Using our framework, we were able to re- alignment yielded several alignments, such as produce the results reported by Niemann and WN–WP-en (Niemann and Gurevych, 2011), Gurevych (2011) and Meyer and Gurevych WN–WKT-en (Meyer and Gurevych, 2011) and (2011) based on the publicly available evaluation VN–FN (Palmer, 2009). datasets10 and the configuration details reported We converted these alignments into U BY-LMF in the corresponding papers. by creating a SenseAxis instance for each pair of Cross-Lingual Alignment. In order to align aligned senses. This involved mapping the sense word senses across languages, we extended the IDs from the proprietary alignment files to the monolingual sense alignment described above to corresponding sense IDs in U BY. the cross-lingual setting. Our approach utilizes In addition, we integrated the sense alignments 10 already present in OW and WP. 
Some OW en- http://www.ukp.tu-darmstadt.de/data/sense-alignment/ 585 Moses,11 trained on the Europarl corpus. The Translation Similarity lemma of one of the two senses to be aligned direction measure P R F1 as well as its representations (e.g. the gloss) is EN > DE Cosine (Cos) 0.666 0.575 0.594 translated into the language of the other resource, DE > EN Cos 0.674 0.658 0.665 yielding a monolingual setting. E.g., the WN DE > EN PPR 0.721 0.712 0.716 synset {vessel, watercraft} with its gloss ’a craft DE > EN PPR + Cos 0.723 0.712 0.717 designed for water transportation’ is translated Table 1: Cross-lingual alignment results into {Schiff, Wasserfahrzeug} and ’Ein Fahrzeug f¨ur Wassertransport’, and then the candidate ex- traction and all downstream steps can take place into English works significantly better than into in German. An inherent problem with this ap- German. Also, the more elaborate similarity mea- proach is that incorrect translations also lead to sure PPR yields better results than cosine similar- invalid alignment candidates. However, these are ity, while the best result is achieved by a combina- most probably filtered out by the machine learn- tion of both. Niemann and Gurevych (2011) make ing classifier as the calculated similarity between a similar observation for the monolingual setting. the sense representations (e.g. glosses) should be Our F-measure of 0.717 in the best configuration low if the candidates do not match. lies between the results of Meyer and Gurevych We evaluated our approach by creating a cross- (2011) (0.66) and Niemann and Gurevych (2011) lingual alignment between WN and OW-de, i.e. (0.78), and thus verifies the validity of the ma- the concepts in OW with a German lexicaliza- chine translation approach. Therefore, the best tion.12 To our knowledge, this is the first study on alignment was subsequently integrated into U BY. aligning OW with another LSR. 
OW is especially interesting for this task due to its multilingual con- 5 Evaluating U BY cepts, as described by Matuschek and Gurevych We performed an intrinsic evaluation of U BY by (2011). The created gold standard could, for in- computing a number of resource statistics. Our stance, be re-used to evaluate alignments for other evaluation covers two aspects: first, it addresses languages in OW. the question if our automatic conversion routines To compute the similarity of word senses, we work correctly. Second, it provides indicators for followed the approach by Niemann and Gurevych assessing U BY in terms of the gain in coverage (2011) while covering both translation directions. compared to the single LSRs. We used the cosine similarity for comparing the German OW glosses with the German translations Correctness of conversion. Since we aim to of WN glosses and cosine and personalized page preserve the maximal amount of information from rank (PPR) similarity for comparison of the Ger- the original LSRs, we should be able to replace man OW glosses translated into English with the any of the original LSRs and APIs by U BY and original English WN glosses. Note that PPR sim- the U BY-API without losing information. As ilarity is not available for German as it is based the conversion is largely performed automatically, on WN. Thereby, we filtered out the OW con- systematic errors and information loss could be cepts without a German gloss which left us with introduced by a faulty conversion routine. In or- 11,806 unique candidate pairs. We randomly se- der to detect such errors and to prove the correct- lected 500 WN synsets for analysis yielding 703 ness of the automatic conversion and the result- candidate pairs. These were manually annotated ing representation, we have compared the orig- as being (non-)alignments. 
For the subsequent inal resource statistics of the classes and infor- machine learning task we used a simple threshold- mation types in the source LSRs to the cor- based classifier and ten-fold cross validation. responding classes in their U BY counterparts. Table 1 summarizes the results of different sys- For instance, the number of lexical relations in tem configurations. We observe that translation WordNet has been compared to the number of 11 SenseRelations in the U BY WordNet lexi- http://www.statmt.org/moses/ 12 con.13 OmegaWiki consists of interlinked language- independent concepts to which lexicalizations in several 13 languages are attached. For detailed analysis results see the U BY website. 586 Lexical Sense shows the number of lemmas with entries in one Lexicon Entry Sense Relation or more than one lexicon, additionally split by FN 9,704 11,942 – POS and language. Lemmas occurring only once GN 83,091 93,407 329,213 in U BY increase the coverage at lemma level. For OW-de 30,967 34,691 60,054 lemmas with parallel entries in several U BY lex- OW-en 51,715 57,921 85,952 icons, new information becomes available in the WP-de 790,430 838,428 571,286 form of additional sense definitions and comple- WP-en 2,712,117 2,921,455 3,364,083 mentary information types attached to lemmas. WKT-de 85,575 72,752 434,358 WKT-en 335,749 421,848 716,595 Finally, the increase in coverage at sense level WN 156,584 206,978 8,559 can be estimated for senses that are aligned across VN 3,962 31,891 – at least two U BY-lexicons. We gain access to U BY 4,259,894 4,691,313 5,300,941 all available, partly complementary information types attached to these aligned senses, e.g. seman- Table 2: U BY resource statistics (selected classes). tic relations, subcategorization frames, encyclo- pedic or multilingual information. The number Lexicon pair Languages SenseAxis of pairwise sense alignments provided by U BY is WN–WP-en EN–EN 50,351 given in Table 3. 
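The similarity scoring and threshold-based decision used in the alignment experiments can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (which trains classifiers over lemmatized bag-of-words representations via Weka); the function names `cosine_similarity` and `align`, and the example threshold, are our own assumptions:

```python
from collections import Counter
import math

def cosine_similarity(gloss_a, gloss_b):
    """Cosine similarity between two glosses, each treated as a
    bag of words (word frequency kept, word order ignored)."""
    a, b = Counter(gloss_a.split()), Counter(gloss_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def align(candidate_pairs, threshold=0.5):
    """Accept a candidate sense pair as a valid alignment iff its
    gloss similarity reaches the threshold. The fixed threshold is a
    stand-in for the classifier tuned on the manually annotated gold
    standard via ten-fold cross-validation."""
    return [(g1, g2) for g1, g2 in candidate_pairs
            if cosine_similarity(g1, g2) >= threshold]
```

For example, `align([("a craft designed for water transportation", "a craft that travels on water")], 0.3)` would keep the pair, since the two glosses share several content words.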
In addition, we computed how WN–WKT-en EN–EN 99,662 many senses simultaneously take part in at least WN–VN EN–EN 40,716 two pairwise sense alignments. For English, this FN–VN EN–EN 17,529 applies to 31,786 senses, for which information WP-en–OW-en EN–EN 3,960 from 3 U BY lexicons is available. WP-de–OW-de DE–DE 1,097 WN–OW-de EN–DE 23,024 EN Lexicons noun verb adjective WP-en–WP-de EN–DE 463,311 OW-en–OW-de EN–DE 58,785 5 1 699 - 4 1,630 1,888 430 U BY All 758,435 3 8,439 1,948 2,271 2 53,856 4,727 12,290 Table 3: U BY alignment statistics. 1 2,900,652 50,209 41,731 Σ (unique EN) 3,080,771 DE Lexicons noun verb adjective Gain in coverage. U BY offers an increased coverage compared to the single LSRs as reflected 4 1,546 - - in the resource statistics. Tables 2 and 3 show the 3 10,374 372 342 2 26,813 3,174 2,643 statistics on central classes in U BY. As U BY is 1 803,770 6,108 7,737 organized in several Lexicons, the number of Σ (unique DE) 862,879 U BY lexical entries is the sum of the lexical en- tries in all 10 Lexicons. Thus, U BY contains Table 4: Number of lemmas (split by POS and lan- more than 4.2 million lexical entries, 4.6 million guage) with entries in i U BY lexicons, i = 1, . . . , 5. senses, 5.3 million semantic relations between senses and more than 750,000 alignments. These statistics represent the total numbers of lexical en- 6 Using U BY tries, senses and sense relations in U BY without filtering of identical (i.e. corresponding) lexical U BY API. For convenient access to U BY, we entries, senses and relations. Listing the num- implemented a Java-API which is built around ber of unique senses would require a full align- the Hibernate14 framework. Hibernate allows to ment between all integrated resources, which is easily store the XML data which results from currently not available. converting resources into Uby-LMF into a corre- We can, however, show that U BY contains over sponding SQL database. 
3.08 million unique lemma-POS combinations for Our main design principle was to keep the ac- English and over 860,000 for German, over 3.94 cess to the resource as simple as possible, despite million in total, see Table 4. Therefore, we as- the rich and complex structure of U BY. Another 14 sessed the coverage on lemma level. Table 4 also http://www.hibernate.org/ 587 important design aspect was to ensure that the censing allows,15 already converted resources. If functionality of the individual, resource-specific resources cannot be made available for download, APIs or user interfaces is mirrored in the U BY the conversion tools will still allow users with ac- API. This enables porting legacy applications to cess to these resources to import them into U BY our new resource. To facilitate the transition to easily. In this way, it will be possible for users to U BY, we plan to provide reference tables which build their “custom U BY” containing selected re- list the corresponding U BY-API operations for the sources. As the underlying resources are subject most important operations in the WN API, some to continuous change, updates of the correspond- of which are shown in Table 5. ing components will be made available on a regu- lar basis. WN function U BY function 7 Conclusions Dictionary U BY getIndexWord(pos, getLexicalEntries( We presented U BY, a large-scale, standardized lemma) pos, lemma) LSR containing nine widely used resources in two IndexWord LexicalEntry languages: English WN, WKT-en, WP-en, FN getLemma() getLemmaForm() and VN, German WP-de, WKT-de, and GN, and Synset Synset OW in English and German. As all resources getGloss() getDefinitionText() getWords() getSenses() are modeled in U BY-LMF, U BY enables struc- Pointer SynsetRelation tural interoperability across resources and lan- getType() getRelName() guages down to a fine-grained level of informa- Word Sense tion. 
For FN, VN and all of the CCRs in En- getPointers() getSenseRelations() glish and German, this is done for the first time. Besides, by integrating sense alignments we also Table 5: Some equivalent operations in WN API and enable the lexical-semantic interoperability of re- U BY API. sources. We presented a unified framework for aligning any LSRs pairwise and reported on ex- periments which align OW-de and WN. We will While it is possible to limit access to single re- release the U BY-LMF model, the resource and the sources by a parameter and thus mimic the behav- U BY-API at the time of publication.16 Due to the ior of the legacy APIs (e.g. only retrieve Synsets added value and the large scale of U BY, as well as and their relations from WN), the true power of its ease of use, we believe U BY will boost the per- U BY API becomes visible when no such con- formance of NLP making use of lexical-semantic straints are applied. In this case, all imported re- knowledge. sources are queried to get one combined result, while retaining the source of the respective in- Acknowledgments formation. On top of this, the information about existing sense alignments across resources can be This work has been supported by the Emmy accessed via SenseAxis relations, so that the re- Noether Program of the German Research Foun- turned combined result covers not only the lexi- dation (DFG) under grant No. GU 798/3-1 and cal, but also the sense level. by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Richard Eckart de Community issues. One of the most important Castilho, Yevgen Chebotar, Zijad Maksuti and Tri reasons for U BY is creating an easy-to-use pow- Duc Nghiem for their contributions to this project. erful LSR to advance NLP research and develop- ment. Therefore, community building around the resource is one of our major concerns. 
To this end, References we will offer free downloads of the lexical data Jordi Atserias, Lu´ıs Villarejo, German Rigau, Eneko and software presented in this paper under open li- Agirre, John Carroll, Bernardo Magnini, and Piek censes, namely: The U BY-LMF DTD, mappings 15 Only GermaNet is subject to a restricted license and can- and conversion tools for existing resources and not be redistributed in U BY format. 16 sense alignments, the Java API, and, as far as li- http://www.ukp.tu-darmstadt.de/data/uby 588 Vossen. 2004. The Meaning Multilingual Central Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Repository. In Proceedings of the second interna- Pfahringer, Peter Reutemann, and Ian H. Witten. tional WordNet Conference (GWC 2004), pages 23– 2009. The WEKA Data Mining Software: An 30, Brno, Czech Republic. Update. ACM SIGKDD Explorations Newsletter, Collin F. Baker and Christiane Fellbaum. 2009. Word- 11(1):10–18. Net and FrameNet as complementary resources for Verena Henrich and Erhard Hinrichs. 2010. Standard- annotation. In Proceedings of the Third Linguis- izing wordnets in the ISO standard LMF: Wordnet- tic Annotation Workshop, ACL-IJCNLP ’09, pages LMF for GermaNet. In Proceedings of the 23rd In- 125–129, Suntec, Singapore. ternational Conference on Computational Linguis- Collin F. Baker, Charles J. Fillmore, and John B. tics (COLING), pages 456–464, Beijing, China. Lowe. 1998. The Berkeley FrameNet project. In Richard Johansson and Pierre Nugues. 2007. Us- Proceedings of the 36th Annual Meeting of the As- ing WordNet to extend FrameNet coverage. In sociation for Computational Linguistics and 17th Proceedings of the Workshop on Building Frame- International Conference on Computational Lin- semantic Resources for Scandinavian and Baltic guistics (COLING-ACL’98, pages 86–90, Montreal, Languages, at NODALIDA, pages 27–30, Tartu, Es- Canada. tonia. 
Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, (7):154–165.

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pages 43–47, Valletta, Malta.

Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek. 2009. Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages 111–125, Berlin/Heidelberg, Germany. Springer.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), pages 513–522, New York, NY, USA. ACM.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Charles J. Fillmore. 1982. Frame Semantics. In The Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Company, Seoul, Korea.

Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 233–236, Genoa, Italy.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42:21–40.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet – representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1485–1491, Las Palmas, Canary Islands, Spain.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press, Chicago, IL, USA.

Michael Matuschek and Iryna Gurevych. 2011. Where the journey is headed: Collaboratively constructed multilingual Wiki-based resources. In SFB 538: Mehrsprachigkeit, editor, Hamburger Arbeiten zur Mehrsprachigkeit, Hamburg, Germany.

John McCrae, Dennis Spohr, and Philipp Cimiano. 2011. Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In The Semantic Web: Research and Applications, volume 6643 of Lecture Notes in Computer Science, pages 245–259. Springer, Berlin/Heidelberg, Germany.

Clifton J. McFate and Kenneth D. Forbus. 2011. NULEX: an open-license broad coverage lexicon. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 363–367, Portland, OR, USA.

Christian M. Meyer and Iryna Gurevych. 2010. Worth its Weight in Gold or Yet Another Resource — A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 11th International Conference, volume 6008 of Lecture Notes in Computer Science, pages 38–49, Iaşi, Romania. Springer, Berlin/Heidelberg.

Christian M. Meyer and Iryna Gurevych. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 883–892, Chiang Mai, Thailand.

Roberto Navigli and Simone Paolo Ponzetto. 2010a. BabelNet: Building a Very Large Multilingual Semantic Network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden, July.

Roberto Navigli and Simone Paolo Ponzetto. 2010b. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531, Uppsala, Sweden.

Roberto Navigli. 2006. Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 105–112, Sydney, Australia.

Elisabeth Niemann and Iryna Gurevych. 2011. The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205–214, Oxford, UK.

Muntsa Padró, Núria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages 296–301, Hissar, Bulgaria.

Martha Palmer. 2009. Semlink: Linking PropBank, VerbNet and FrameNet. In Proceedings of the Generative Lexicon Conference (GenLex-09), pages 9–15, Pisa, Italy.

Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2007. OntoNotes: A Unified Relational Semantic Representation. In Proceedings of the International Conference on Semantic Computing, pages 517–526, Washington, DC, USA.

Lei Shi and Rada Mihalcea. 2005. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 100–111, Mexico City, Mexico.

Claudia Soria, Monica Monachini, and Piek Vossen. 2009. Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages 139–146, Palo Alto, CA, USA.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff, Canada.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the Italian WordNet: Upgrading, Standardising, Extending. In Proceedings of the 5th Global WordNet Conference (GWC), Bombay, India.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, Netherlands.

Word Sense Induction for Novel Sense Detection

Jey Han Lau,♠♡ Paul Cook,♡ Diana McCarthy,♣ David Newman,♢ and Timothy Baldwin♠♡

♠ NICTA Victoria Research Laboratory
♡ Dept of Computer Science and Software Engineering, University of Melbourne
♢ Dept of Computer Science, University of California Irvine
♣ Lexical Computing

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

We apply topic modelling to automatically induce word senses of a target word, and demonstrate that our word sense induction method can be used to automatically detect words with emergent novel senses, as well as token occurrences of those senses. We start by exploring the utility of standard topic models for word sense induction (WSI), with a pre-determined number of topics (= senses). We next demonstrate that a non-parametric formulation that learns an appropriate number of senses per word actually performs better at the WSI task. We go on to establish state-of-the-art results over two WSI datasets, and apply the proposed model to a novel sense detection task.

1 Introduction

Word sense induction (WSI) is the task of automatically inducing the different senses of a given word, generally in the form of an unsupervised learning task with senses represented as clusters of token instances. It contrasts with word sense disambiguation (WSD), where a fixed sense inventory is assumed to exist, and token instances of a given word are disambiguated relative to the sense inventory. While WSI is intuitively appealing as a task, there have been no real examples of WSI being successfully deployed in end-user applications, other than work by Schutze (1998) and Navigli and Crisafulli (2010) in an information retrieval context. A key contribution of this paper is the successful application of WSI to the lexicographical task of novel sense detection, i.e. identifying words which have taken on new senses over time.

One of the key challenges in WSI is learning the appropriate sense granularity for a given word, i.e. the number of senses that best captures the token occurrences of that word. Building on the work of Brody and Lapata (2009) and others, we approach WSI via topic modelling — using Latent Dirichlet Allocation (LDA: Blei et al. (2003)) and derivative approaches — and use the topic model to determine the appropriate sense granularity. Topic modelling is an unsupervised approach to jointly learn topics — in the form of multinomial probability distributions over words — and per-document topic assignments — in the form of multinomial probability distributions over topics. LDA is appealing for WSI as it both assigns senses to words (in the form of topic allocation), and outputs a representation of each sense as a weighted list of words. LDA offers a solution to the question of sense granularity determination via non-parametric formulations, such as a Hierarchical Dirichlet Process (HDP: Teh et al. (2006), Yao and Durme (2011)).

Our contributions in this paper are as follows. We first establish the effectiveness of HDP for WSI over both the SemEval-2007 and SemEval-2010 WSI datasets (Agirre and Soroa, 2007; Manandhar et al., 2010), and show that the non-parametric formulation is superior to a standard LDA formulation with oracle determination of sense granularity for a given word. We next demonstrate that our interpretation of HDP-based WSI is superior to other topic model-based approaches to WSI, and indeed, better than the best-published results for both SemEval datasets. Finally, we apply our method to the novel sense detection task based on a dataset developed in this research, and achieve highly encouraging results.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 591–601, Avignon, France, April 23–27 2012. © 2012 Association for Computational Linguistics

2 Methodology

In topic modelling, documents are assumed to exhibit multiple topics, with each document having its own distribution over topics.
Words are generated in each document by first sampling a topic from the document's topic distribution, then sampling a word from that topic. In this work we use the topic model's probabilistic assignment of topics to words for the WSI task.

2.1 Data Representation and Pre-processing

In the context of WSI, topics form our sense representation, and words in a sentence are generated conditioned on a particular sense of the target word. The "document" in the WSI case is a single sentence or a short document fragment containing the target word, as we would not expect to be able to generate a full document from the sense of a single target word.1 In the case of the SemEval datasets, we use the word contexts provided in the dataset, while in our novel sense detection experiments, we use a context window of three sentences, one sentence to either side of the token occurrence of the target word.

As our baseline representation, we use a bag of words, where word frequency is kept but not word order. All words are lemmatised, and stopwords and low-frequency terms are removed.

We also experiment with the addition of positional context word information, as commonly used in WSI. That is, we introduce an additional word feature for each of the three words to the left and right of the target word.

Padó and Lapata (2007) demonstrated the importance of syntactic dependency relations in the construction of semantic space models, e.g. for WSD. Based on these findings, we include dependency relations as additional features in our topic models,2 but only for dependency relations that involve the target word.

2.2 Topic Modelling

Topic models learn a probability distribution over topics for each document, by simply aggregating the distributions over topics for each word in the document. In WSI terms, we take this distribution over topics for each target word ("instance" in WSI parlance) as our distribution over senses for that word.

In our initial experiments, we use LDA topic modelling, which requires us to set T, the number of topics to be learned by the model. The LDA generative process is: (1) draw a latent topic z from a document-specific topic distribution P(t = z|d); then (2) draw a word w from the chosen topic P(w|t = z). Thus, the probability of producing a single copy of word w given a document d is given by:

    P(w|d) = Σ_{z=1}^{T} P(w|t = z) P(t = z|d)

In standard LDA, the user needs to specify the number of topics T. In non-parametric variants of LDA, the model dynamically learns the number of topics as part of the topic modelling. The particular implementation of non-parametric topic model we experiment with is the Hierarchical Dirichlet Process (HDP: Teh et al. (2006)),3 where, for each document, a distribution of mixture components P(t|d) is sampled from a base distribution G0 as follows: (1) choose a base distribution G0 ∼ DP(γ, H); (2) for each document d, generate a distribution P(t|d) ∼ DP(α0, G0); (3) draw a latent topic z from the document's mixture component distribution P(t|d), in the same manner as for LDA; and (4) draw a word w from the chosen topic P(w|t = z).4

For both LDA and HDP, we individually topic model each target word, and determine the sense assignment z for a given instance by aggregating over the topic assignments for each word in the instance and selecting the sense with the highest aggregated probability, arg max_z P(t = z|d).

1 Notwithstanding the one sense per discourse heuristic (Gale et al., 1992).
2 We use the Stanford Parser to do part-of-speech tagging and to extract the dependency relations (Klein and Manning, 2003; De Marneffe et al., 2006).
3 We use the C++ implementation of HDP (http://www.cs.princeton.edu/~blei/topicmodeling.html) in our experiments.
4 The two HDP parameters γ and α0 control the variability of senses in the documents. In particular, γ controls the degree of sharing of topics across documents — a high γ value leads to more topics, as topics for different documents are more dissimilar. α0, on the other hand, controls the degree of mixing of topics within a document — a high α0 generates fewer topics, as topics are less homogeneous within a document.

3 SemEval Experiments

To facilitate comparison of our proposed method for WSI with previous approaches, we use the datasets from the SemEval-2007 and SemEval-2010 word sense induction tasks (Agirre and Soroa, 2007; Manandhar et al., 2010). We first experiment with the SemEval-2010 dataset, as it includes explicit training and test data for each target word and utilises a more robust evaluation methodology. We then return to experiment with the SemEval-2007 dataset, for comparison purposes with other published results for topic modelling approaches to WSI.

3.1 SemEval-2010

3.1.1 Dataset and Methodology

Our primary WSI evaluation is based on the dataset provided by the SemEval-2010 WSI shared task (Manandhar et al., 2010). The dataset contains 100 target words: 50 nouns and 50 verbs.

In our original experiments with LDA, we set the number of topics (T) for each target word to the number of senses represented in the test data for that word (varying T for each target word). This is based on the unreasonable assumption that we will have access to gold-standard information on sense granularity for each target word, and is done to establish an upper-bound score for LDA. We then relax the assumption, and use a fixed T setting for each of the sets of nouns (T = 7) and verbs (T = 3), based on the average number of senses from the test data in each case. Finally, we introduce positional context features for LDA, once again using the fixed T values for nouns and verbs.
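As a concrete toy illustration of the two quantities used above — the word probability P(w|d) as a sum over topics, and the aggregate sense assignment arg max_z P(t = z|d) — the following sketch uses hypothetical two-topic distributions; the numbers and words are invented for illustration and are not from the paper's models:

```python
# Toy illustration of P(w|d) = sum_z P(w|t=z) * P(t=z|d), and of assigning
# a sense to an instance by taking its highest-probability topic.
# All distributions below are made up for illustration.

# P(w | t = z): per-topic word distributions for one target word's model
p_word_given_topic = {
    0: {"exam": 0.4, "student": 0.4, "husband": 0.2},
    1: {"exam": 0.1, "student": 0.1, "husband": 0.8},
}

# P(t = z | d): the instance's ("document's") distribution over topics
p_topic_given_doc = {0: 0.7, 1: 0.3}

def word_prob(word):
    """P(w|d): marginalise the word probability over all topics."""
    return sum(p_word_given_topic[z].get(word, 0.0) * p_topic_given_doc[z]
               for z in p_topic_given_doc)

def sense_assignment():
    """Select the sense (topic) with the highest probability for the instance."""
    return max(p_topic_given_doc, key=p_topic_given_doc.get)

print(round(word_prob("exam"), 2))   # 0.4*0.7 + 0.1*0.3 = 0.31
print(sense_assignment())            # topic 0 dominates this instance
```

In the actual models, P(t = z|d) is itself obtained by aggregating the per-word topic assignments over the whole instance, as described above.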
For each target word, a fixed set of training and test instances are supplied, typically 1 to 3 sentences in length, each containing the target word.

The default approach to evaluation for the SemEval-2010 WSI task is in the form of WSD over the test data, based on the senses that have been automatically induced from the training data. Because the induced senses will likely vary in number and nature between systems, the WSD evaluation has to incorporate a sense alignment step, which it performs by splitting the test instances into two sets: a mapping set and an evaluation set. The optimal mapping from induced senses to gold-standard senses is learned from the mapping set, and the resulting sense alignment is used to map the predictions of the WSI system to pre-defined senses for the evaluation set. The particular split we use to calculate WSD effectiveness in this paper is 80%/20% (mapping/test), averaged across 5 random splits.5

The SemEval-2010 training data consists of approximately 163K training instances for the 100 target words, all taken from the web. The test data is approximately 9K instances taken from a variety of news sources. Following the standard approach used by the participating systems in the SemEval-2010 task, we induce senses only from the training instances, and use the learned model to assign senses to the test instances.

We next apply HDP to the WSI task, using positional features, but learning the number of senses automatically for each target word via the model. Finally, we experiment with adding dependency features to the model.

To summarise, we provide results for the following models:

1. LDA+Variable T: LDA with variable T for each target word based on the number of gold-standard senses.
2. LDA+Fixed T: LDA with fixed T for each of nouns and verbs.
3. LDA+Fixed T+Position: LDA with fixed T and extra positional word features.
4. HDP+Position: HDP (which automatically learns T), with extra positional word features.
5. HDP+Position+Dependency: HDP with both positional word and dependency features.

We compare our models with two baselines from the SemEval-2010 task: (1) Baseline Random — randomly assign each test instance to one of four senses; (2) Baseline MFS — most frequent sense baseline, assigning all test instances to one sense; and also a benchmark system (UoY), in the form of the University of York system (Korkontzelos and Manandhar, 2010), which achieved the best overall WSD results in the original SemEval-2010 task.

5 A 60%/40% split is also provided as part of the task setup, but the results are almost identical to those for the 80%/20% split, and so are omitted from this paper. The original task also made use of V-measure and Paired F-score to evaluate the induced word sense clusters, but these have degenerate behaviour in correlating strongly with the number of senses induced by the method (Manandhar et al., 2010), and are hence omitted from this paper.

3.2 SemEval-2010 Results

The results of our experiments over the SemEval-2010 dataset are summarised in Table 1.

                               WSD (80%/20%)
System                        All    Verbs   Nouns
Baselines
  Baseline Random             0.57   0.66    0.51
  Baseline MFS                0.59   0.67    0.53
LDA
  Variable T                  0.64   0.69    0.60
  Fixed T                     0.63   0.68    0.59
  Fixed T+Position            0.63   0.68    0.60
HDP
  +Position                   0.68   0.72    0.65
  +Position+Dependency        0.68   0.72    0.65
Benchmark
  UoY                         0.62   0.67    0.59

Table 1: WSD F-score over the SemEval-2010 dataset

Looking first at the results for LDA, we see that the first LDA approach (variable T) is very competitive, outperforming the benchmark system. In this approach, however, we assume perfect knowledge of the number of gold senses of each target word, meaning that the method isn't truly unsupervised. When we fix T for each of the nouns and verbs, we see a small drop in F-score, but encouragingly the method still performs above the benchmark. Adding positional word features improves the results very slightly for nouns.

When we relax the assumption on the number of word senses in moving to HDP, we observe a marked improvement in F-score over LDA. This is highly encouraging and somewhat surprising, as in hiding information about sense granularity from the model, we have actually improved our results. We return to discuss this effect below. For the final feature, we add dependency features to the HDP model (in addition to retaining the positional word features), but see no movement in the results.6 While the dependency features didn't reduce F-score, their utility is questionable as the generation of the features from the Stanford parser is computationally expensive.

To better understand these results, we present the top-10 terms for each of the senses induced for the word cheat in Table 2. These senses are learnt using HDP with both positional word features (e.g. husband #-1, indicating the lemma husband to the immediate left of the target word) and dependency features (e.g. cheat#prep on#wife). The first observation to make is that senses 7, 8 and 9 are "junk" senses, in that the top-10 terms do not convey a coherent sense. These topics are an artifact of HDP: they are learnt at a much later stage of the iterative process of Gibbs sampling and are often smaller than other topics (i.e. have more zero-probability terms). We notice that they are assigned as topics to instances very rarely (although they are certainly used to assign topics to non-target words in the instances), and as such, they do not present a real issue when assigning the sense to an instance, as they are likely to be overshadowed by the dominant senses.7 This conclusion is borne out when we experimented with manually filtering out these topics when assigning instances to senses: there was no perceptible change in the results, reinforcing our suggestion that these topics do not impact on target word sense assignment.

Comparing the results for HDP back to those for LDA, HDP tends to learn almost double the number of senses per target word as are in the gold-standard (and hence are used for the "Variable T" version of LDA). Far from hurting our WSD F-score, however, the extra topics are dominated by junk topics, and boost WSD F-score for the "genuine" topics. Based on this insight, we ran LDA once again with variable T (and positional and dependency features), but this time setting T to the value learned by HDP, to give LDA the facility to use junk topics. This resulted in an F-score of 0.66 across all word classes (verbs = 0.71, nouns = 0.62), demonstrating that, surprisingly, even for the same T setting, HDP achieves superior results to LDA. I.e., not only does HDP learn T automatically, but the topic model learned for a given T is superior to that for LDA.

Looking at the other senses discovered for cheat, we notice that the model has induced a myriad of senses: the relationship sense of cheat (senses 1, 3 and 4, e.g. husband cheats); the exam usage of cheat (sense 2); the competition/game usage of cheat (sense 5); and cheating in the political domain (sense 6). Although the senses are possibly "split" a little more than desirable (e.g. senses 1, 3 and 4 arguably describe the same sense), the overall quality of the produced senses is encouraging. Also, we observe a spin-off benefit of topic modelling approaches to WSI: the high-ranking words in each topic can be used to gist the sense, and anecdotally confirm the impact of the different feature types (i.e. the positional word and dependency features).

6 An identical result was observed for LDA.
7 In the WSD evaluation, the alignment of induced senses to the gold senses is learnt automatically based on the mapping instances. E.g. if all instances that are assigned sense a have gold sense x, then sense a is mapped to gold sense x. Therefore, if the proportion of junk senses in the mapping instances is low, their influence on WSD results will be negligible.

Sense Num   Top-10 Terms
1           cheat think want ... love feel tell guy cheat#nsubj#include find
2           cheat student cheating test game school cheat#aux#to teacher exam study
3           husband wife cheat wife #1 tiger husband #-1 cheat#prep on#wife ... woman cheat#nsubj#husband
4           cheat woman relationship cheating partner reason cheat#nsubj#man woman #-1 cheat#aux#to spouse
5           cheat game play player cheating poker cheat#aux#to card cheated money
6           cheat exchange china chinese foreign cheat #-2 cheat #2 china #-1 cheat#aux#to team
7           tina bette kirk walk accuse mon pok symkyn nick star
8           fat jones ashley pen body taste weight expectation parent able
9           euro goal luck fair france irish single 2000 cheat#prep at#point complain

Table 2: The top-10 terms for each of the senses induced for the verb cheat by the HDP model (with positional word and dependency features)

System                             F-score
BL                                 0.855
YVD                                0.857
SemEval Best (I2R)                 0.868
Our method (default parameters)    0.842
Our method (tuned parameters)      0.869

Table 3: F-score for the SemEval-2007 WSI task, for our HDP method with default and tuned parameter settings, as compared to competitor topic modelling and other approaches to WSI

3.3 Comparison with other Topic Modelling Approaches to WSI

The idea of applying topic modelling to WSI is not entirely new.
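The sense-alignment step of the WSD evaluation described in Section 3.1.1 (split the test instances into a mapping set and an evaluation set, learn an induced-to-gold sense mapping on the former, score on the latter) can be sketched as follows. This is an illustrative simplification: the instance list is invented, the mapping here is a simple majority vote rather than the optimal mapping, and the score is plain accuracy rather than F-score:

```python
# Sketch of WSD evaluation with sense alignment: learn a mapping from
# induced senses to gold senses on the mapping set (majority vote), then
# score the mapped predictions on the held-out evaluation set.
from collections import Counter, defaultdict

# (induced_sense, gold_sense) pairs for one target word's test instances;
# the data is invented for illustration
instances = [(0, "bank/river"), (0, "bank/river"), (1, "bank/money"),
             (1, "bank/money"), (0, "bank/river"), (1, "bank/money"),
             (0, "bank/money"), (1, "bank/money"), (0, "bank/river"),
             (1, "bank/money")]

split = int(len(instances) * 0.8)            # 80%/20% mapping/evaluation split
mapping_set, eval_set = instances[:split], instances[split:]

# Majority-vote mapping: induced sense -> most frequent co-occurring gold sense
votes = defaultdict(Counter)
for induced, gold in mapping_set:
    votes[induced][gold] += 1
sense_map = {induced: counts.most_common(1)[0][0]
             for induced, counts in votes.items()}

# Score the mapped predictions on the evaluation set
correct = sum(sense_map.get(induced) == gold for induced, gold in eval_set)
accuracy = correct / len(eval_set)
print(sense_map, accuracy)
```

Rarely-assigned junk senses barely appear in the mapping set, which is why (per footnote 7) they have little influence on the mapped WSD score.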
Brody and Lapata (2009) proposed an LDA-based model which assigns different weights to different feature sets (e.g. unigram tokens vs. dependency relations), using a "layered" feature representation. They carry out extensive parameter optimisation of the (fixed) number of senses, the number of layers, and the size of the context window.

Separately, Yao and Durme (2011) proposed the use of non-parametric topic models in WSI. The authors preprocess the instances slightly differently, opting to remove the target word from each instance and stem the tokens. They also tuned the hyperparameters of the topic model to optimise the WSI effectiveness over the evaluation set, and didn't use positional or dependency features.

Both of these papers were evaluated over only the SemEval-2007 WSI dataset (Agirre and Soroa, 2007), so we similarly apply our HDP method to this dataset for direct comparability. In the remainder of this section, we refer to Brody and Lapata (2009) as BL, and Yao and Durme (2011) as YVD.

The SemEval-2007 dataset consists of roughly 27K instances, for 65 target verbs and 35 target nouns. BL report results only over the noun instances, so we similarly restrict our attention to the nouns in this paper. Training data was not provided as part of the original dataset, so we follow the approach of BL and YVD in constructing our own training dataset for each target word from instances extracted from the British National Corpus (BNC: Burnard (2000)).8 Both BL and YVD separately report slightly higher in-domain results from training on WSJ data (the SemEval-2007 data was taken from the WSJ). For the purposes of model comparison under identical training settings, however, it is appropriate to report results for only the BNC.

We experiment with both our original method (with both positional word and dependency features, and default parameter settings for HDP) without any parameter tuning, and the same method with the tuned parameter settings of YVD, for direct comparability. We present the results in Table 3, including the results for the best-performing system in the original SemEval-2007 task (I2R: Niu et al. (2007)).

The results are enlightening: with default parameter settings, our methodology is slightly below the results of the other three models. Bear in mind, however, that the two topic modelling-based approaches were tuned extensively to the dataset. When we use the tuned hyperparameter settings of YVD, our results rise around 2.5% to surpass both topic modelling approaches, and marginally outperform the I2R system from the original task. Recall that both BL and YVD report higher results again using in-domain training data, so we would expect to see further gains over the I2R system in following this path.

Overall, these results agree with our findings over the SemEval-2010 dataset (Section 3.2), underlining the viability of topic modelling for automated word sense induction.

8 In creating the training dataset, each instance is made up of the sentence the target word occurs in, as well as one sentence to either side of that sentence, i.e. 3 sentences in total per instance.

3.4 Discussion

As part of our preprocessing, we remove all stopwords (other than for the positional word and dependency features), as described in Section 2.1. We separately experimented with not removing stopwords, based on the intuition that prepositions such as to and on can be informative in determining word sense based on local context. The results were markedly worse, however. We also tried appending part-of-speech information to each word lemma, but the resulting data sparseness meant that results dropped marginally.

When determining the sense for an instance, we aggregate the sense assignments for each word in the instance (not just the target word). An alternate strategy is to use only the target word's topic assignment, but again, the results for this strategy were inferior to the aggregate method.

In the SemEval-2007 experiments (Section 3.3), we found that YVD's hyperparameter settings yielded better results than the default settings. We experimented with parameter tuning over the SemEval-2010 dataset (including YVD's optimal setting on the 2007 dataset), but found that the default setting achieved the best overall results: although the WSD F-score improved a little for nouns, it worsened for verbs. This observation is not unexpected: as the hyperparameters were optimised for nouns in their experiments, the settings might not be appropriate for verbs. This also suggests that their results may be due in part to overfitting the SemEval-2007 data.

4 Identifying Novel Senses

Having established the effectiveness of our approach at WSI, we next turn to an application of WSI, in identifying words which have taken on novel senses over time, based on analysis of diachronic data. Our topic modelling approach is particularly attractive for this task as it not only jointly performs type-level WSI and token-level WSD based on the induced senses (in assigning topics to each instance), but also makes it possible to gist the induced senses via the contents of the topic (typically using the topic words with highest marginal probability).

The meanings of words can change over time; in particular, words can take on new senses. Contemporary examples of new word-senses include the meanings of swag and tweet as used below:

1. We all know Frankie is adorable, but does he have swag? [swag = "style"]

2. The alleged victim gave a description of the man on Twitter and tweeted that she thought she could identify him. [tweet = "send a message on Twitter"]

These senses of swag and tweet are not included in many dictionaries or computational lexicons — e.g., neither of these senses is listed in WordNet 3.0 (Fellbaum, 1998) — yet appear to be in regular usage, particularly in text related to pop culture and online media.

The manual identification of such new word-senses is a challenge in lexicography over and above identifying new words themselves, and is essential to keeping dictionaries up-to-date. Moreover, lexicons that better reflect contemporary usage could benefit NLP applications that use sense inventories.

The challenge of identifying changes in word sense has only recently been considered in computational linguistics. For example, Sagi et al. (2009), Cook and Stevenson (2010), and Gulordava and Baroni (2011) propose type-based models of semantic change. Such models do not account for polysemy, and appear best-suited to identifying changes in predominant sense. Bamman and Crane (2011) use a parallel Latin–English corpus to induce word senses and build a WSD system, which they then apply to study diachronic variation in word senses. Crucially, in this token-based approach there is a clear connection between word senses and tokens, making it possible to identify usages of a specific sense.

Based on the findings in Section 3.2, here we apply the HDP method for WSI to the task of identifying new word-senses. In contrast to Bamman and Crane (2011), our token-based approach does not require parallel text to induce senses.

Differences in text genre between corpora are a noted challenge for approaches to identifying lexical semantic differences between corpora (Peirsman et al., 2010), but are difficult to avoid given the corpora that are available. We use TreeTagger (Schmid, 1994) to tokenise and lemmatise both corpora.

4.1 Method

Given two corpora — a reference corpus which
we take to represent standard usage, and a second corpus of newer texts — we identify senses that are novel to the second corpus compared to the reference corpus. For a given word w, we pool all usages of w in the reference corpus and second corpus, and run the HDP WSI method on this super-corpus to induce the senses of w. We then tag all usages of w in both corpora with their single most-likely automatically-induced sense.

Intuitively, if a word w is used in some sense s in the second corpus, and w is never used in that sense in the reference corpus, then w has acquired a new sense, namely s. We capture this intuition in a novelty score ("Nov") that indicates whether a given word w has a new sense in the second corpus, s, compared to the reference corpus, r, as below:

    Nov(w) = max({ (p_s(t_i) − p_r(t_i)) / p_r(t_i) : t_i ∈ T })    (1)

where p_s(t_i) and p_r(t_i) are the probability of sense t_i in the second corpus and reference corpus, respectively, calculated using smoothed maximum likelihood estimates, and T is the set of senses induced for w. Novelty is high if there is some sense t that has a much higher relative frequency in s than in r and that is also relatively infrequent in r.

4.2 Data

Because we are interested in the identification of novel word-senses for applications such as lexicon maintenance, we focus on relatively newly-coined word-senses. In particular, we take the written portion of the BNC — consisting primarily of British English text from the late 20th century — as our reference corpus, and a similarly-sized random sample of documents from the ukWaC (Ferraresi et al., 2008) — a Web corpus built from the .uk domain in 2007 which includes a wide range of text types — as our second corpus.

Evaluating approaches to identifying semantic change is a challenge, particularly due to the lack of appropriate evaluation resources; indeed, most previous approaches have used very small datasets (Sagi et al., 2009; Cook and Stevenson, 2010; Bamman and Crane, 2011). Because this is a preliminary attempt at applying WSI techniques to identifying new word-senses, our evaluation will also be based on a rather small dataset.

We require a set of words that are known to have acquired a new sense between the late 20th and early 21st centuries. The Concise Oxford English Dictionary aims to document contemporary usage, and has been published in numerous editions including Thompson (1995, COD95) and Soanes and Stevenson (2008, COD08). Although some of the entries have been substantially revised between editions, many have not, enabling us to easily identify new senses amongst the entries in COD08 relative to COD95.

A manual linear search through the entries in these dictionaries would be very time-consuming, but by exploiting the observation that new words often correspond to concepts that are culturally salient (Ayto, 2006), we can quickly identify some candidates for words that have taken on a new sense. Between the time periods of our two corpora, computers and the Internet have become much more mainstream in society. We therefore extracted all entries from COD08 containing the word computing (which is often used as a topic label in this dictionary) that have a token frequency of at least 1000 in the BNC. We then read the entries for these 87 lexical items in COD95 and COD08 and identified those which have a clear computing sense in COD08 that was not present in COD95. In total we found 22 such items. This process, along with all the annotation in this section, is carried out by a native English-speaking author of this paper.
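The novelty score of Equation 1 can be sketched in a few lines. The sense-frequency counts below are invented, and add-one smoothing is used as a simple stand-in for the smoothed maximum likelihood estimates referred to above:

```python
# Sketch of Nov(w) from Equation 1: the maximum relative increase in an
# induced sense's probability from the reference corpus (r) to the second
# corpus (s). Counts are invented; add-one smoothing is a simple stand-in
# for the paper's smoothed maximum likelihood estimates.

def smoothed_dist(counts, senses):
    """Add-one smoothed probability distribution over the induced senses."""
    total = sum(counts.get(t, 0) + 1 for t in senses)
    return {t: (counts.get(t, 0) + 1) / total for t in senses}

def novelty(ref_counts, second_counts):
    """Nov(w) = max over senses of (p_s(t) - p_r(t)) / p_r(t)."""
    senses = set(ref_counts) | set(second_counts)
    p_r = smoothed_dist(ref_counts, senses)
    p_s = smoothed_dist(second_counts, senses)
    return max((p_s[t] - p_r[t]) / p_r[t] for t in senses)

# Hypothetical sense frequencies for one lemma: sense 2 is unseen in the
# reference corpus but common in the second corpus, so novelty is high.
ref = {0: 50, 1: 30, 2: 0}
second = {0: 40, 1: 25, 2: 35}
print(novelty(ref, second))
```

The smoothing matters: an unsmoothed estimate would divide by zero for a sense unseen in the reference corpus, which is exactly the case a novel sense produces.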
Text genres are represented to different extents in these corpora with, for example, text types related to the Internet being much more common in the ukWaC.

To ensure that the words identified from the dictionaries do in fact have a new sense in the ukWaC sample compared to the BNC, we examine the usage of these words in the corpora. We extract a random sample of 100 usages of each lemma from the BNC and ukWaC sample and annotate these usages as to whether they correspond to the novel sense or not. This binary distinction is easier than fine-grained sense annotation, and since we do not use these annotations for formal evaluation — only for selecting items for our dataset — we do not carry out an inter-annotator agreement study here.

We eliminate any lemma for which we find evidence of the novel sense in the BNC, or for which we do not find evidence of the novel sense in the ukWaC sample.9 We further check word sketches (Kilgarriff and Tugwell, 2002)10 for each of these lemmas in the BNC and ukWaC for collocates that likely correspond to the novel sense; we exclude any lemma for which we find evidence of the novel sense in the BNC, or fail to find evidence of the novel sense in the ukWaC sample.

Lemma         Novelty   Freq. ratio   Novel sense freq.
domain (n)     116.2       2.60          41
worm (n)        68.4       1.04          30
mirror (n)      38.4       0.53          10
guess (v)       16.5       0.93           –
export (v)      13.8       0.88          28
founder (n)     11.0       1.20           –
cinema (n)       9.7       1.30           –
poster (n)       7.9       1.83           4
racism (n)       2.4       0.98           –
symptom (n)      2.1       1.16           –

Table 4: Novelty score ("Nov"), ratio of frequency in the ukWaC sample and BNC, and frequency of the novel sense in the manually-annotated 100 instances from the ukWaC sample (where applicable), for all lemmas in our dataset. Lemmas shown in boldface have a novel sense in the ukWaC sample compared to the BNC.

At the end of this process we have identified the following
The lemmas with a novel 5 lemmas that have the indicated novel senses in sense have higher novelty scores than the distrac- the ukWaC compared to the BNC: domain (n) “In- tors according to a one-sided Wilcoxon rank sum ternet domain”; export (v) “export data”; mirror test (p < .05). (n) “mirror website”; poster (n) “one who posts When a lemma takes on a new sense, it might online”; and worm (n) “malicious program”. For also increase in frequency. We therefore also con- each of the 5 lemmas with novel senses, a sec- sider a baseline in which we rank the lemmas by ond annotator — also a native English-speaking the ratio of their frequency in the second and ref- author of this paper — annotated the sample of erence corpora. These results are shown in col- 100 usages from the ukWaC. The observed agree- umn “Freq. ratio” in Table 4. The difference be- ment and unweighted Kappa between the two an- tween the frequency ratios for the lemmas with a notators is 97.2% and 0.92, respectively, indicat- novel sense, and the distractors, is not significant ing that this is indeed a relatively easy annotation (p > .05). task. The annotators discussed the small number Examining the frequency of the novel senses — of disagreements to reach consensus. shown in column “Novel sense freq.” in Table 4 For our dataset we also require items that have — we see that the lowest-ranked lemma with a not acquired a novel sense in the ukWaC sample. novel sense, poster, is also the lemma with the For each of the above 5 lemmas we identified a least-frequent novel sense. This result is unsur- distractor lemma of the same part-of-speech that prising as our novelty score will be higher for has a similar frequency in the BNC, and that has higher-frequency novel senses. The identification not undergone sense change between COD95 and of infrequent novel senses remains a challenge. COD08. 
The 5 distractors are: cinema (n); guess The top-ranked topic words for the sense cor- (v); symptom (n); founder (n); and racism (n). responding to the maximum in Equation 1 for the highest-ranked distractor, guess, are the fol- 4.3 Results lowing: @card@, post, ..., n’t, comment, think, We compute novelty (“Nov”, Equation 1) for all subject, forum, view, guess. This sense seems 10 items in our dataset, based on the output of the to correspond to usages of guess in the context of online forums, which are better represented 9 We use the IMS Open Corpus Workbench (http:// in the ukWaC sample than the BNC. Because of cwb.sourceforge.net/) to extract the usages of our the challenges posed by such differences between target lemmas from the corpora. This extraction process fails in some cases, and so we also eliminate such items from our corpora (discussed in Section 4.2) we are unsur- dataset. prised to see such an error, but this could be ad- 10 http://www.sketchengine.co.uk/ dressed in the future by building comparable cor- 598 Topic Selection Methodology Lemma Nov Oracle (single topic) Oracle (multiple topics) Precision Recall F-score Precision Recall F-score Precision Recall F-score domain (n) 1.00 0.29 0.45 1.00 0.56 0.72 0.97 0.88 0.92 export (v) 0.93 0.96 0.95 0.93 0.96 0.95 0.90 1.00 0.95 mirror (n) 0.67 1.00 0.80 0.67 1.00 0.80 0.67 1.00 0.80 poster (n) 0.00 0.00 0.00 0.44 1.00 0.62 0.44 1.00 0.62 worm (n) 0.93 0.90 0.92 0.93 0.90 0.92 0.86 1.00 0.92 Table 5: Results for identifying the gold-standard novel senses based on the three topic selection methodologies of: (1) Nov; (2) oracle selection of a single topic; and (3) oracle selection of multiple topics. pora for use in this application. 
Having demonstrated that our method for identifying novel senses can distinguish lemmas that have a novel sense in one corpus compared to another from those that do not, we now consider whether this method can also automatically identify the usages of the induced novel sense.

For each lemma with a gold-standard novel sense, we define the automatically-induced novel sense to be the single sense corresponding to the maximum in Equation 1. We then compute the precision, recall, and F-score of this novel sense with respect to the gold-standard novel sense, based on the 100 annotated tokens for each of the 5 lemmas with a novel sense. The results are shown in the first three numeric columns of Table 5.

In the case of export and worm the results are remarkably good, with precision and recall both over 0.90. For domain, the low recall is a result of the majority of usages of the gold-standard novel sense ("Internet domain") being split across two induced senses — the top-two highest ranked induced senses according to Equation 1. The poor performance for poster is unsurprising due to the very low frequency of this lemma's gold-standard novel sense.

These results are based on our novelty ranking method ("Nov"), and the assumption that the novel sense will be represented in a single topic. To evaluate the theoretical upper-bound for a topic-ranking method which uses our HDP-based WSI method and selects a single topic to capture the novel sense, we next evaluate an optimal topic selection approach. In the middle three numeric columns of Table 5, we present results for an experimental setup in which the single best induced sense — in terms of F-score — is selected as the novel sense by an oracle. We see big improvements in F-score for domain and poster. This encouraging result suggests refining the sense selection heuristic could theoretically improve our method for identifying novel senses, and that the topic modelling approach proposed in this paper has considerable promise for automatic novel sense detection. Of particular note is the result for poster: although the gold-standard novel sense of poster is rare, all of its usages are grouped into a single topic.

Finally, we consider whether an oracle which can select the best subset of induced senses — in terms of F-score — as the novel sense could offer further improvements. In this case — results shown in the final three columns of Table 5 — we again see an increase in F-score to 0.92 for domain. For this lemma the gold-standard novel sense usages were split across multiple induced topics, and so we are unsurprised to find that a method which is able to select multiple topics as the novel sense performs well. Based on these findings, in future work we plan to consider alternative formulations of novelty.

5 Conclusion

We propose the application of topic modelling to the task of word sense induction (WSI), starting with a simple LDA-based methodology with a fixed number of senses, and culminating in a nonparametric method based on a Hierarchical Dirichlet Process (HDP), which automatically learns the number of senses for a given target word. Our HDP-based method outperforms all methods over the SemEval-2010 WSI dataset, and is also superior to other topic modelling-based approaches to WSI on the SemEval-2007 dataset. We applied the proposed WSI model to the task of identifying words which have taken on new senses, including identifying the token occurrences of the new word sense. Over a small dataset developed in this research, we achieved highly encouraging results.

References

Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 Task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 7–12, Prague, Czech Republic.

John Ayto. 2006. Movers and Shakers: A Chronology of Words that Shaped our Age. Oxford University Press, Oxford.

David Bamman and Gregory Crane. 2011. Measuring historical word sense variation. In Proceedings of the 2011 Joint International Conference on Digital Libraries (JCDL 2011), pages 1–10, Ottawa, Canada.

D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

S. Brody and M. Lapata. 2009. Bayesian word sense induction. In Proceedings of EACL 2009, pages 103–111, Athens, Greece.

Lou Burnard. 2000. The British National Corpus Users Reference Guide. Oxford University Computing Services.

Paul Cook and Suzanne Stevenson. 2010. Automatically identifying changes in the semantic orientation of words. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 28–34, Valletta, Malta.

Marie-Catherine De Marneffe, Bill Maccartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. Genoa, Italy.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop: Can we beat Google, pages 47–54, Marrakech, Morocco.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. Pages 233–237.

Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, Scotland.

Adam Kilgarriff and David Tugwell. 2002. Sketching words. In Marie-Hélène Corréard, editor, Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins, pages 125–137. Euralex, Grenoble, France.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 3–10, Whistler, Canada.

Ioannis Korkontzelos and Suresh Manandhar. 2010. UoY: Graphs of unambiguous vertices for word sense induction and disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 355–358, Uppsala, Sweden.

Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach, and Sameer Pradhan. 2010. SemEval-2010 Task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 63–68, Uppsala, Sweden.

Roberto Navigli and Giuseppe Crisafulli. 2010. Inducing word senses to improve web search result clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 116–126, Cambridge, USA.

Zheng-Yu Niu, Dong-Hong Ji, and Chew-Lim Tan. 2007. I2R: Three systems for word sense discrimination, Chinese word sense disambiguation, and English word sense disambiguation. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 177–182, Prague, Czech Republic.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33:161–199.

Yves Peirsman, Dirk Geeraerts, and Dirk Speelman. 2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering, 16(4):469–491.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and space. In Proceedings of the EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pages 104–111, Athens, Greece.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Catherine Soanes and Angus Stevenson, editors. 2008. The Concise Oxford English Dictionary. Oxford University Press, eleventh (revised) edition. Oxford Reference Online.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.

Della Thompson, editor. 1995. The Concise Oxford Dictionary of Current English. Oxford University Press, Oxford, ninth edition.

Xuchen Yao and Benjamin Van Durme. 2011. Nonparametric Bayesian word sense induction. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 10–14, Portland, Oregon.

Learning Language from Perceptual Context

Raymond Mooney
University of Texas at Austin

[email protected]

Abstract

Machine learning has become the dominant approach to building natural-language processing systems. However, current approaches generally require a great deal of laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we have developed systems that learn to sportscast simulated robot soccer games and to follow navigation instructions in virtual environments by simply observing sample human linguistic behavior in context. This work builds on our earlier work on supervised learning of semantic parsers that map natural language into a formal meaning representation. In order to apply such methods to learning from observation, we have developed methods that estimate the meaning of sentences given just their ambiguous perceptual context.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 602, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter

Micol Marchetti-Bowick, Microsoft Corporation, 475 Brannan Street, San Francisco, CA 94122
Nathanael Chambers, Department of Computer Science, United States Naval Academy, Annapolis, MD 21409

[email protected] [email protected]

Abstract

Microblogging websites such as Twitter offer a wealth of insight into a population's current mood. Automated approaches to identify general sentiment toward a particular topic often perform two steps: Topic Identification and Sentiment Analysis. Topic Identification first identifies tweets that are relevant to a desired topic (e.g., a politician or event), and Sentiment Analysis extracts each tweet's attitude toward the topic. Many techniques for Topic Identification simply involve selecting tweets using a keyword search. Here, we present an approach that instead uses distant supervision to train a classifier on the tweets returned by the search. We show that distant supervision leads to improved performance in the Topic Identification task as well as in the downstream Sentiment Analysis stage. We then use a system that incorporates distant supervision into both stages to analyze the sentiment toward President Obama expressed in a dataset of tweets. Our results better correlate with Gallup's Presidential Job Approval polls than previous work. Finally, we discover a surprising baseline that outperforms previous work without a Topic Identification stage.

1 Introduction

Social networks and blogs contain a wealth of data about how the general public views products, campaigns, events, and people. Automated algorithms can use this data to provide instant feedback on what people are saying about a topic. Two challenges in building such algorithms are (1) identifying topic-relevant posts, and (2) identifying the attitude of each post toward the topic. This paper studies distant supervision (Mintz et al., 2009) as a solution to both challenges. We apply our approach to the problem of predicting Presidential Job Approval polls from Twitter data, and we present results that improve on previous work in this area. We also present a novel baseline that performs remarkably well without using topic identification.

Topic identification is the task of identifying text that discusses a topic of interest. Most previous work on microblogs uses simple keyword searches to find topic-relevant tweets on the assumption that short tweets do not need more sophisticated processing. For instance, searches for the name "Obama" have been assumed to return a representative set of tweets about the U.S. President (O'Connor et al., 2010). One of the main contributions of this paper is to show that keyword search can lead to noisy results, and that the same keywords can instead be used in a distantly supervised framework to yield improved performance.

Distant supervision uses noisy signals in text as positive labels to train classifiers. For instance, the token "Obama" can be used to identify a series of tweets that discuss U.S. President Barack Obama. Although searching for token matches can return false positives, using the resulting tweets as positive training examples provides supervision from a distance. This paper experiments with several diverse sets of keywords to train distantly supervised classifiers for topic identification. We evaluate each classifier on a hand-labeled dataset of political and apolitical tweets, and demonstrate an improvement in F1 score over simple keyword search (.39 to .90 in the best case). We also make available the first labeled dataset for topic identification in politics to encourage future work.

Sentiment analysis encompasses a broad field of research, but most microblog work focuses on two moods: positive and negative sentiment.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 603–612, Avignon, France, April 23-27 2012.
Algorithms to identify these moods range from matching words in a sentiment lexicon to training classifiers with a hand-labeled corpus. Since labeling corpora is expensive, recent work on Twitter uses emoticons (i.e., ASCII smiley faces such as :-( and :-)) as noisy labels in tweets for distant supervision (Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). This paper presents new analysis of the downstream effects of topic identification on sentiment classifiers and their application to political forecasting.

Interest in measuring the political mood of a country has recently grown (O'Connor et al., 2010; Tumasjan et al., 2010; Gonzalez-Bailon et al., 2010; Carvalho et al., 2011; Tan et al., 2011). Here we compare our sentiment results to Presidential Job Approval polls and show that the sentiment scores produced by our system are positively correlated with both the Approval and Disapproval job ratings.

In this paper we present a method for coupling two distantly supervised algorithms for topic identification and sentiment classification on Twitter. In Section 4, we describe our approach to topic identification and present a new annotated corpus of political tweets for future study. In Section 5, we apply distant supervision to sentiment analysis. Finally, Section 6 discusses our system's performance on modeling Presidential Job Approval ratings from Twitter data.

2 Previous Work

The past several years have seen sentiment analysis grow into a diverse research area. The idea of sentiment applied to microblogging domains is relatively new, but there are numerous recent publications on the subject. Since this paper focuses on the microblog setting, we concentrate on these contributions here.

The most straightforward approach to sentiment analysis is using a sentiment lexicon to label tweets based on how many sentiment words appear. This approach tends to be used by applications that measure the general mood of a population. O'Connor et al. (2010) use a ratio of positive and negative word counts on Twitter, Kramer (2010) counts lexicon words on Facebook, and Thelwall (2011) uses the publicly available SentiStrength algorithm to make weighted counts of keywords based on predefined polarity strengths.

In contrast to lexicons, many approaches instead focus on ways to train supervised classifiers. However, labeled data is expensive to create, and examples of Twitter classifiers trained on hand-labeled data are few (Jiang et al., 2011). Instead, distant supervision has grown in popularity. These algorithms use emoticons to serve as semantic indicators for sentiment. For instance, a sad face (e.g., :-() serves as a noisy label for a negative mood. Read (2005) was the first to suggest emoticons for UseNet data, followed by Go et al. (Go et al., 2009) on Twitter, and many others since (Bifet and Frank, 2010; Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). Hashtags (e.g., #cool and #happy) have also been used as noisy sentiment labels (Davidov et al., 2010; Kouloumpis et al., 2011). Finally, multiple models can be blended into a single classifier (Barbosa and Feng, 2010). Here, we adopt the emoticon algorithm for sentiment analysis, and evaluate it on a specific domain (politics).

Topic identification in Twitter has received much less attention than sentiment analysis. The majority of approaches simply select a single keyword (e.g., "Obama") to represent their topic (e.g., "US President") and retrieve all tweets that contain the word (O'Connor et al., 2010; Tumasjan et al., 2010; Tan et al., 2011). The underlying assumption is that the keyword is precise, and due to the vast number of tweets, the search will return a large enough dataset to measure sentiment toward that topic. In this work, we instead use a distantly supervised system similar in spirit to those recently applied to sentiment analysis.

Finally, we evaluate the approaches presented in this paper on the domain of politics. Tumasjan et al. (2010) showed that the results of a recent German election could be predicted through frequency counts with remarkable accuracy. Most similar to this paper is that of O'Connor et al. (2010), in which tweets relating to President Obama are retrieved with a keyword search and a sentiment lexicon is used to measure overall approval. This extracted approval ratio is then compared to Gallup's Presidential Job Approval polling data. We directly compare their results with various distantly supervised approaches.

3 Datasets

The experiments in this paper use seven months of tweets from Twitter (www.twitter.com) collected between June 1, 2009 and December 31, 2009. The corpus contains over 476 million tweets labeled with usernames and timestamps, collected through Twitter's 'spritzer' API without keyword filtering. Tweets are aligned with polling data in Section 6 using their timestamps.

The full system is evaluated against the publicly available daily Presidential Job Approval polling data from Gallup.¹

  ID    Type        Keywords
  PC-1  Obama       obama
  PC-2  General     republican, democrat, senate, congress, government
  PC-3  Topic       health care, economy, tax cuts, tea party, bailout, sotomayor
  PC-4  Politician  obama, biden, mccain, reed, pelosi, clinton, palin
  PC-5  Ideology    liberal, conservative, progressive, socialist, capitalist

Table 1: The keywords used to select positive training sets for each political classifier (a subset of all PC-3 and PC-5 keywords are shown to conserve space).
Every day, Gallup asks 1,500 adults in the United States whether they approve or disapprove of "the job President Obama is doing as president." The results are compiled into two trend lines for Approval and Disapproval ratings, as shown in Figure 1. We compare our positive and negative sentiment scores against these two trends.

4 Topic Identification

This section addresses the task of Topic Identification in the context of microblogs. While the general field of topic identification is broad, its use on microblogs has been somewhat limited. Previous work on the political domain simply uses keywords to identify topic-specific tweets (e.g., O'Connor et al. (2010) use "Obama" to find presidential tweets). This section shows that distant supervision can use the same keywords to build a classifier that is much more robust to noise than approaches that use pure keyword search.

4.1 Distant Supervision

Distant supervision uses noisy signals to identify positive examples of a topic in the face of unlabeled data. As described in Section 2, recent sentiment analysis work has applied distant supervision using emoticons as the signals. The approach extracts tweets with ASCII smiley faces (e.g., :) and ;)) and builds classifiers trained on these positive examples. We apply distant supervision to topic identification and evaluate its effectiveness on this subtask.

As with sentiment analysis, we need to collect positive and negative examples of tweets about the target topic. Instead of emoticons, we extract positive tweets containing one or more predefined keywords. Negative tweets are randomly chosen from the corpus. Examples of positive and negative tweets that can be used to train a classifier based on the keyword "Obama" are given here:

  positive: LOL, obama made a bears reference in green bay. uh oh.
  negative: New blog up! It regards the new iPhone 3G S: <URL>

We then use these automatically extracted datasets to train a multinomial Naive Bayes classifier. Before feature collection, the text is normalized as follows: (a) all links to photos (twitpics) are replaced with a single generic token, (b) all non-twitpic URLs are replaced with a token, (c) all user references (e.g., @MyFriendBob) are collapsed, (d) all numbers are collapsed to INT, (e) tokens containing the same letter twice or more in a row are condensed to a two-letter string (e.g. the word ahhhhh becomes ahh), and (f) the text is lowercased and spaces are inserted between words and punctuation. The text of each tweet is then tokenized, and the tokens are used to collect unigram and bigram features. All features that occur fewer than 10 times in the training corpus are ignored.

Finally, after training a classifier on this dataset, every tweet in the corpus is classified as either positive (i.e., relevant to the topic) or negative (i.e., irrelevant). The positive tweets are then sent to the second sentiment analysis stage.

4.2 Keyword Selection

Keywords are the input to our proposed distantly supervised system, and of course, the input to previous work that relies on keyword search. We evaluate classifiers based on different keywords to measure the effects of keyword selection.

O'Connor et al. (2010) used the keywords "Obama" and "McCain", and Tumasjan et al. (2010) simply extracted tweets containing Germany's political party names. Both approaches extracted matching tweets, considered them relevant (correctly, in many cases), and applied sentiment analysis.

¹ http://gallup.com/poll/113980/gallup-daily-obama-job-approval.aspx
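The normalization steps (a)-(f) of Section 4.1 can be sketched as a chain of regular-expression passes. The placeholder tokens (TWITPIC, URL, USER, INT) and the exact patterns here are our own assumptions, not the paper's implementation:

```python
import re

def normalize(tweet):
    """Sketch of normalization steps (a)-(f); token names are placeholders."""
    t = re.sub(r"http://twitpic\.com/\S+", "TWITPIC", tweet)  # (a) photo links
    t = re.sub(r"https?://\S+", "URL", t)                     # (b) other URLs
    t = re.sub(r"@\w+", "USER", t)                            # (c) user references
    t = re.sub(r"\d+", "INT", t)                              # (d) numbers
    t = re.sub(r"(\w)\1{2,}", r"\1\1", t)                     # (e) ahhhhh -> ahh
    t = t.lower()                                             # (f) lowercase...
    t = re.sub(r"([^\w\s])", r" \1 ", t)                      # ...pad punctuation
    return re.sub(r"\s+", " ", t).strip()
```

A whitespace split of the normalized string then yields the tokens from which unigram and bigram features would be collected.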
vant (correctly, in many cases), and applied sen- domly chosen from the keyword searches of PC- timent analysis. However, different keywords 2, PC-3, PC-4, and PC-5 with 500 tweets from may result in very different extractions. We in- each. This combined dataset enables an evalua- stead attempted to build a generic “political” topic tion of how well each classifier can identify tweets classifier. To do this, we experimented with the from other classifiers. The General Dataset con- five different sets of keywords shown in Table 1. tains 2,000 random tweets from the entire corpus. For each set, we extracted all tweets matching This dataset allows us to evaluate how well clas- one or more keywords, and created a balanced sifiers identify political tweets in the wild. positive/negative training set by then selecting This paper’s authors initially annotated the negative examples randomly from non-matching same 200 tweets in the General Dataset to com- tweets. A couple examples of ideology (PC-5) ex- pute inter-annotator agreement. The Kappa was tractions are shown here: 0.66, which is typically considered good agree- ment. Most disagreements occurred over tweets You often hear of deontologist libertarians and utilitarian liberals but are there any about money and the economy. We then split the Aristotelian socialists? remaining portions of the two datasets between <url> - Then, slather on a liberal amount the two annotators. The Political Dataset con- of plaster, sand down smooth, and paint tains 1,691 political and 309 apolitical tweets, and however you want. I hope this helps! the General Dataset contains 28 political tweets and 1,978 apolitical tweets. These two datasets of The second tweet is an example of the noisy 2000 tweets each are publicly available for future nature of keyword extraction. Most extractions evaluation and comparison to this work2 . are accurate, but different keywords retrieve very different sets of tweets. 
Examples for the political 4.4 Experiments topics (PC-3) are shown here: Our first experiment addresses the question of RT @PoliticalMath: hope the president’s keyword variance. We measure performance on health care predictions <url> are better the Political Dataset, a combination of all of our than his stimulus predictions <url> proposed political keywords. Each keyword set @adamjschmidt You mean we could have contributed to 25% of the dataset, so the eval- chosen health care for every man woman uation measures the extent to which a classifier and child in America or the Iraq war? identifies other keyword tweets. We classified the 2000 tweets with the five distantly supervised Each keyword set builds a classifier using the ap- classifiers and the one “Obama” keyword extrac- proach described in Section 4.1. tor from O’Connor et al. (2010). 4.3 Labeled Datasets Results are shown on the left side of Figure 2. Precision and recall calculate correct identifica- In order to evaluate distant supervision against tion of the political label. The five distantly super- keyword search, we created two new labeled vised approaches perform similarly, and show re- datasets of political and apolitical tweets. markable robustness despite their different train- The Political Dataset is an amalgamation of all ing sets. In contrast, the keyword extractor only four keyword extractions (PC-1 is a subset of PC- 2 4) listed in Table 1. It consists of 2,000 tweets ran- http://www.usna.edu/cs/nchamber/data/twitter 606 Figure 2: Five distantly supervised classifiers and the Obama keyword classifier. Left panel: the Political Dataset of political tweets. Right panel: the General Dataset representative of Twitter as a whole. captures about a quarter of the political tweets. my life I am ashamed of our government. 
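The precision and recall over the political label used throughout these experiments are the standard definitions; a minimal sketch (the label strings are our own):

```python
def precision_recall_f1(predicted, gold):
    """Precision/recall/F1 for the 'political' label, given parallel
    lists of 'political' / 'apolitical' labels per tweet."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == g == "political" for p, g in pairs)
    fp = sum(p == "political" and g == "apolitical" for p, g in pairs)
    fn = sum(p == "apolitical" and g == "political" for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Scoring each classifier's output against the hand-labeled Political and General Datasets with a helper like this yields the F1 comparisons reported in the text (e.g., the .39 vs. .90 contrast for keyword search vs. PC-1).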
PC-1 is the distantly supervised analog to the Obama keyword extractor, and we see that dis- These results also illustrate that distant supervi- tant supervision increases its F1 score dramati- sion allows for flexibility in construction of the cally from 0.39 to 0.90. classifier. Different keywords show little change The second evaluation addresses the question in classifier performance. of classifier performance on Twitter as a whole, The General Dataset experiment evaluates clas- not just on a political dataset. We evaluate on the sifier performance in the wild. The keyword ap- General Dataset just as on the Political Dataset. proach again scores below those trained on noisy Results are shown on the right side of Figure 2. labels. It classifies most tweets as apolitical and Most tweets posted to Twitter are not about pol- thus achieves very low recall for tweets that are itics, so the apolitical label dominates this more actually about politics. On the other hand, distant representative dataset. Again, the five distant supervision creates classifiers that over-extract supervision classifiers have similar results. The political tweets. This is a result of using balanced Obama keyword search has the highest precision, datasets in training; such effects can be mitigated but drastically sacrifices recall. Four of the five by changing the training balance. Even so, four classifiers outperform keyword search in F1 score. of the five distantly trained classifiers score higher than the raw keyword approach. The only under- 4.5 Discussion performer was PC-1, which suggests that when The Political Dataset results show that distant su- building a classifier for a relatively broad topic pervision adds robustness to a keyword search. like politics, a variety of keywords is important. 
The distantly supervised “Obama” classifier (PC- The next section takes the output from our clas- 1) improved the basic “Obama” keyword search sifiers (i.e., our topic-relevant tweets) and eval- by 0.51 absolute F1 points. Furthermore, dis- uates a fully automated sentiment analysis algo- tant supervision doesn’t require additional human rithm against real-world polling data. input, but simply adds a trained classifier. Two example tweets that an Obama keyword search 5 Targeted Sentiment Analysis misses but that its distantly supervised analog The previous section evaluated algorithms that captures are shown here: extract topic-relevant tweets. We now evaluate Why does Congress get to opt out of the methods to distill the overall sentiment that they Obummercare and we can’t. A company express. This section compares two common ap- gets fined if they don’t comply. Kiss free- proaches to sentiment analysis. dom goodbye. We first replicated the technique used in I agree with the lady from california, I am O’Connor et al. (2010), in which a lexicon of pos- sixty six years old and for the first time in itive and negative sentiment words called Opin- 607 ionFinder (Wilson and Hoffmann, 2005) is used tweets that contain at least one positive emoti- to evaluate the sentiment of each tweet (others con and no negative emoticons. We generated a have used similar lexicons (Kramer, 2010; Thel- negative training set using an analogous process. wall et al., 2010)). We evaluate our full distantly The emoticon symbols used for positive sentiment supervised approach to theirs. We also experi- were :) =) :-) :] =] :-] :} :o) :D =D :-D :P =P mented with SentiStrength, a lexicon-based pro- :-P C:. Negative emoticons were :( =( :-( :[ =[ gram built to identify sentiment in online com- :-[ :{ :-c :c} D: D= :S :/ =/ :-/ :’( : (. Using this ments of the social media website, MySpace. 
data, we train a multinomial Naive Bayes classi- Though MySpace is close in genre to Twitter, we fier using the same method used for the political did not observe a performance gain. All reported classifiers described in Section 4.1. This classifier results thus use OpinionFinder to facilitate a more is then used to label topic-specific tweets as ex- accurate comparison with previous work. pressing positive or negative sentiment. Finally, Second, we built a distantly supervised system the three overall sentiment scores Spos , Sneg , and using tweets containing emoticons as done in pre- Sratio are calculated from the results. vious work (Read, 2005; Go et al., 2009; Bifet and Frank, 2010; Pak and Paroubek, 2010; Davidov 6 Predicting Approval Polls et al., 2010; Kouloumpis et al., 2011). Although distant supervision has previously been shown to This section uses the two-stage Targeted Senti- outperform sentiment lexicons, these evaluations ment Analysis system described above in a real- do not consider the extra topic identification step. world setting. We analyze the sentiment of Twit- ter users toward U.S. President Barack Obama. 5.1 Sentiment Lexicon This allows us to both evaluate distant supervision The OpinionFinder lexicon is a list of 2,304 pos- against previous work on the topic, and demon- itive and 4,151 negative sentiment terms (Wilson strate a practical application of the approach. and Hoffmann, 2005). We ignore neutral words 6.1 Experiment Setup in the lexicon and we do not differentiate between weak and strong sentiment words. A tweet is la- The following experiments combine both topic beled positive if it contains any positive terms, and identification and sentiment analysis. The previ- negative if it contains any negative terms. A tweet ous sections described six topic identification ap- can be marked as both positive and negative, and proaches, and two sentiment analysis approaches. 
if a tweet contains words in neither category, it We evaluate all combinations of these systems, is marked neutral. This procedure is the same as and compare their final sentiment scores for each used by O’Connor et al. (2010). The sentiment day in the nearly seven-month period over which scores Spos and Sneg for a given set of N tweets our dataset spans. are calculated as follows: Gallup’s Daily Job Approval reports two num- P 1{xlabel = positive} bers: Approval and Disapproval. We calculate in- Spos = x (1) dividual sentiment scores Spos and Sneg for each N P day, and compare the two sets of trends using 1{xlabel = negative} Pearson’s correlation coefficient. O’Connor et al. Spos = x (2) N do not explicitly evaluate these two, but instead where 1{xlabel = positive} is 1 if the tweet x is use the ratio Sratio . We also calculate this daily labeled positive, and N is the number of tweets in ratio from Gallup for comparison purposes by di- the corpus. For the sake of comparison, we also viding the Approval by the Disapproval. calculate a sentiment ratio as done in O’Connor et al. (2010): 6.2 Results and Discussion The first set of results uses the lexicon-based clas- P x 1{xlabel = positive} Sratio = P (3) sifier for sentiment analysis and compares the dif- x 1{xlabel = negative} ferent topic identification approaches. The first 5.2 Distant Supervision table in Table 2 reports Pearson’s correlation co- To build a trained classifier, we automatically gen- efficient with Gallup’s Approval and Disapproval erated a positive training set by searching for ratings. Regardless of the Topic classifier, all 608 Sentiment Lexicon cient for this approach is 0.71 with Approval and Topic Classifier Approval Disapproval 0.73 with Disapproval. keyword -0.22 0.42 Finally, we compute the ratio Sratio between PC-1 -0.65 0.71 the positive and negative sentiment scores (Equa- PC-2 -0.61 0.71 tion 3) to compare to O’Connor et al. (2010). Ta- PC-3 -0.51 0.65 ble 3 shows the results. 
The distantly supervised PC-4 -0.49 0.60 topic identification algorithms show little change PC-5 -0.65 0.74 between a sentiment lexicon or a classifier. How- ever, O’Connor et al.’s keyword approach im- Distantly Supervised Sentiment proves when used with a distantly supervised sen- Topic Classifier Approval Disapproval timent classifier (.22 to .40). Merging Approval keyword 0.27 0.38 and Disapproval into one ratio appears to mask PC-1 0.71 0.73 the sentiment lexicon’s poor correlation with Ap- PC-2 0.33 0.46 proval. The ratio may not be an ideal evalua- PC-3 0.05 0.31 tion metric for this reason. Real-world interest in PC-4 0.08 0.26 Presidential Approval ratings desire separate Ap- PC-5 0.54 0.62 proval and Disapproval scores, as Gallup reports. Our results (Table 2) show that distant supervi- Table 2: Correlation between Gallup polling data and sion avoids a negative correlation with Approval, the extracted sentiment with a lexicon (trends shown in Figure 3) and distant supervision (Figure 4). but the ratio hides this important advantage. One reason the ratio may mask the negative Sentiment Lexicon Approval correlation is because tweets are often keyword PC-1 PC-2 PC-3 PC-4 PC-5 classified as both positive and negative by a lexi- .22 .63 .46 .33 .27 .61 con (Section 5.1). This could explain the behav- Distantly Supervised Sentiment ior seen in Figure 3 in which both the positive and keyword PC-1 PC-2 PC-3 PC-4 PC-5 negative sentiment scores rise over time. How- .40 .64 .46 .30 .28 .60 ever, further experimentation did not rectify this pattern. We revised Spos and Sneg to make binary Table 3: Correlation between Gallup Approval / Dis- decisions for a lexicon: a tweet is labeled posi- approval ratio and extracted sentiment ratio scores. tive if it strictly contains more positive words than negative (and vice versa). Correlation showed lit- systems inversely correlate with Presidential Ap- tle change. Approval was still negatively corre- proval. 
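The comparison against Gallup throughout Table 2 uses Pearson's correlation coefficient between two daily time series; a self-contained sketch of that computation (the series here are made-up numbers, not the paper's data):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A daily S_pos series would be correlated against Gallup's Approval series, and S_neg against Disapproval, exactly as reported in Table 2.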
However, they correlate well with Dis- lated, Disapproval positive (although less so in approval. Figure 3 graphically shows the trend both), and the ratio scores actually dropped fur- lines for the keyword and the distantly supervised ther. The sentiment ratio continued to hide the system PC-1. The visualization illustrates how poor Approval performance by a lexicon. the keyword-based approach is highly influenced by day-by-day changes, whereas PC-1 displays a 6.3 New Baseline: Topic-Neutral Sentiment much smoother trend. Distant supervision for sentiment analysis outper- The second set of results uses distant supervi- forms that with a sentiment lexicon (Table 2). sion for sentiment analysis and again varies the Distant supervision for topic identification further topic identification approach. The second table improves the results (PC-1 v. keyword). The in Table 2 gives the correlation numbers and Fig- best system uses distant supervision in both stages ure 4 shows the keyword and PC-1 trend lines.The (PC-1 with distantly supervised sentiment), out- results are widely better than when a lexicon is performing the purely keyword-based algorithm used for sentiment analysis. Approval is no longer of O’Connor et al. (2010). However, the question inversely correlated, and two of the distantly su- of how important topic identification is has not yet pervised systems strongly correlate (PC-1, PC-5). been addressed here or in the literature. The best performing system (PC-1) used dis- Both O’Connor et al. (2010) and Tumasjan et tant supervision for both topic identification and al. (2010) created joint systems with two topic sentiment analysis. Pearson’s correlation coeffi- identification and sentiment analysis stages. But 609 Sentiment Lexicon Figure 3: Presidential job approval and disapproval calculated using two different topic identification techniques, and using a sentiment lexicon for sentiment analysis. Gallup polling results are shown in black. 
Distantly Supervised Sentiment Figure 4: Presidential job approval sentiment scores calculated using two different topic identification techniques, and using the emoticon classifier for sentiment analysis. Gallup polling results are shown in black. Topic-Neutral Sentiment Figure 5: Presidential job approval sentiment scores calculated using the entire twitter corpus, with two different techniques for sentiment analysis. Gallup polling results are shown in black for comparison. 610 Topic-Neutral Sentiment build upon what has recently been shown in the Algorithm Approval Disapproval literature: distant supervision with emoticons is Distant Sup. 0.69 0.74 a valuable methodology. We also expand upon Keyword Lexicon -0.63 0.69 prior work by discovering drastic performance differences between positive and negative lexi- Table 4: Pearson’s correlation coefficient of Sentiment con words. The OpinionFinder lexicon failed Analysis without Topic Identification. to correlate (inversely) with Gallup’s Approval polls, whereas a distantly trained classifier cor- what if the topic identification step were removed related strongly with both Approval and Disap- and sentiment analysis instead run on the entire proval (Pearson’s .71 and .73). We only tested Twitter corpus? To answer this question, we OpinionFinder and SentiStrength, so it is possible ran the distantly supervised emoticon classifier to that another lexicon might perform better. How- classify all tweets in the 7 months of Twitter data. ever, our results suggest that lexicons vary in their For each day, we computed the positive and neg- quality across sentiment, and distant supervision ative sentiment scores as above. The evaluation is may provide more robustness. identical, except for the removal of topic identifi- Third, our results outperform previous work on cation. Correlation results are shown in Table 4. Presidential Job Approval prediction (O’Connor This baseline parallels the results seen when et al., 2010). 
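The emoticon heuristic used to build the distantly supervised training set (Section 5.2) — positive tweets contain at least one positive emoticon and no negative ones, and analogously for negative — can be sketched as follows; the emoticon sets here are abbreviated from the paper's lists, and whitespace tokenization is a simplification.

```python
POS_EMOTICONS = {":)", "=)", ":-)", ":D", "=D"}   # abbreviated list
NEG_EMOTICONS = {":(", "=(", ":-(", "D:", ":/"}   # abbreviated list

def distant_label(tweet):
    """Heuristic training label: 'positive' if the tweet has at least one
    positive and no negative emoticon, 'negative' for the analogous case,
    and None when the tweet is unusable for training."""
    tokens = set(tweet.split())
    has_pos = bool(tokens & POS_EMOTICONS)
    has_neg = bool(tokens & NEG_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None
```

Tweets labeled this way would then feed the multinomial Naive Bayes classifier of Section 5.2.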
We presented two novel approaches topic identification is used: the sentiment lexi- to the domain: a coupled distantly supervised sys- con is again inversely correlated with Approval, tem, and a topic-neutral baseline, both of which and distant supervision outperforms the lexicon outperform previous results. In fact, the baseline approach in both ratings. This is not surpris- surprisingly matches or outperforms the more so- ing given previous distantly supervised work on phisticated approaches that use topic identifica- sentiment analysis (Go et al., 2009; Davidov et tion. The baseline correlates .69 with Approval al., 2010; Kouloumpis et al., 2011). However, and .74 with Disapproval. This suggests a new our distant supervision also performs as well as baseline that should be used in all topic-specific the best performing topic-specific system. The sentiment applications. best performing topic classifier, PC-1, correlated Fourth, we described and made available two with Approval with r=0.71 (0.69 here) and Dis- new annotated datasets of political tweets to facil- approval with r=0.73 (0.74 here). Computing itate future work in this area. overall sentiment on Twitter performs as well as Finally, Twitter users are not a representative political-specific sentiment. This unintuitive re- sample of the U.S. population, yet the high corre- sult suggests a new baseline that all topic-based lation between political sentiment on Twitter and systems should compute. Gallup ratings makes these results all the more intriguing for polling methodologies. Our spe- 7 Discussion cific 7-month period of time differs from previous This paper introduces a new methodology for work, and thus we hesitate to draw strong con- gleaning topic-specific sentiment information. clusions from our comparisons or to extend im- We highlight four main contributions here. plications to non-political domains. 
Future work First, this work is one of the first to evaluate should further investigate distant supervision as a distant supervision for topic identification. All tool to assist topic detection in microblogs. five political classifiers outperformed the lexicon- Acknowledgments driven keyword equivalent that has been widely used in the past. Our model achieved .90 F1 com- We thank Jure Leskovec for the Twitter data, pared to the keyword .39 F1 on our political tweet Brendan O’Connor for open and frank correspon- dataset. On twitter as a whole, distant supervision dence, and the reviewers for helpful suggestions. increased F1 by over 100%. The results also sug- gest that performance is relatively insensitive to the specific choice of seed keywords that are used to select the training set for the political classifier. Second, the sentiment analysis experiments 611 References Conference On Language Resources and Evalua- tion (LREC). Luciano Barbosa and Junlan Feng. 2010. Robust sen- Jonathon Read. 2005. Using emoticons to reduce de- timent detection on twitter from biased and noisy pendency in machine learning techniques for senti- data. In Proceedings of the 23rd International ment classification. In Proceedings of the ACL Stu- Conference on Computational Linguistics (COL- dent Research Workshop (ACL-2005). ING 2010). Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Albert Bifet and Eibe Frank. 2010. Sentiment knowl- Zhou, and Ping Li. 2011. User-level sentiment edge discovery in twitter streaming data. In Lecture analysis incorporating social networks. In Pro- Notes in Computer Science, volume 6332, pages 1– ceedings of the 17th ACM SIGKDD Conference on 15. Knowledge Discovery and Data Mining. Paula Carvalho, Luis Sarmento, Jorge Teixeira, and Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Mario J. Silva. 2011. Liars and saviors in a senti- Di Cai, and Arvid Kappas. 2010. 
Sentiment ment annotated corpus of comments to political de- strength detection in short informal text. Journal of bates. In Proceedings of the Association for Com- the American Society for Information Science and putational Linguistics (ACL-2011), pages 564–568. Technology, 61(12):2544–2558. Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Mike Thelwall, Kevan Buckley, and Georgios Pal- Enhanced sentiment learning using twitter hashtags toglou. 2011. Sentiment in twitter events. Jour- and smileys. In Proceedings of the 23rd Inter- nal of the American Society for Information Science national Conference on Computational Linguistics and Technology, 62(2):406–418. (COLING 2010). Andranik Tumasjan, Timm O. Sprenger, Philipp G. Alec Go, Richa Bhayani, and Lei Huang. 2009. Twit- Sandner, and Isabell M. Welpe. 2010. Election ter sentiment classification using distant supervi- forecasts with twitter: How 140 characters reflect sion. Technical report. the political landscape. Social Science Computer Sandra Gonzalez-Bailon, Rafael E. Banchs, and An- Review. dreas Kaltenbrunner. 2010. Emotional reactions J.; Wilson, T.; Wiebe and P. Hoffmann. 2005. Recog- and the pulse of public opinion: Measuring the im- nizing contextual polarity in phrase-level sentiment pact of political events on the sentiment of online analysis. In Proceedings of the Conference on Hu- discussions. Technical report. man Language Technology and Empirical Methods Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and in Natural Language Processing. Tiejun Zhao. 2011. Target-dependent twitter sen- timent classification. In Proceedings of the Associ- ation for Computational Linguistics (ACL-2011). Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and So- cial Media. Adam D. I. Kramer. 2010. An unobtrusive behavioral model of ‘gross national happiness’. 
In Proceed- ings of the 28th International Conference on Human Factors in Computing Systems (CHI 2010). Mike Mintz, Steven Bills, Rion Snow, and Dan Ju- rafsky. 2009. Distant supervision for relation ex- traction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL ’09, pages 1003–1011. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the AAAI Conference on Weblogs and Social Media. Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion min- ing. In Proceedings of the Seventh International 612 Learning from evolving data streams: online triage of bug reports Grzegorz Chrupala Spoken Language Systems Saarland University

[email protected]

Abstract to substantially reduce the time and cost of this task. Open issue trackers are a type of social me- dia that has received relatively little atten- 1.2 Issue trackers as social media tion from the text-mining community. We investigate the problems inherent in learn- In a large software project with a loose, not ing to triage bug reports from time-varying strictly hierarchical organization, standards and data. We demonstrate that concept drift is practices are not exclusively imposed top-down an important consideration. We show the but also tend to spontaneously arise in a bottom- effectiveness of online learning algorithms up fashion, arrived at through interaction of in- by evaluating them on several bug report datasets collected from open issue trackers dividual developers, testers and users. The indi- associated with large open-source projects. viduals involved may negotiate practices explic- We make this collection of data publicly itly, but may also imitate and influence each other available. via implicitly acquired reputation and status. This process has a strong emergent component: an in- formal taxonomy may arise and evolve in an is- 1 Introduction sue tracker via the use of free-form tags or labels. There has been relatively little research to date Developers, testers and users can attach tags to on applying machine learning and Natural Lan- their issue reports in order to informally classify guage Processing techniques to automate soft- them. The issue tracking software may give users ware project workflows. In this paper we address feedback by informing them which tags were fre- the problem of bug report triage. quently used in the past, or suggest tags based on the content of the report or other information. 
1.1 Issue tracking Through this collaborative, feedback driven pro- Large software projects typically track defect re- cess involving both human and machine partici- ports, feature requests and other issue reports us- pants, an evolving consensus on the label inven- ing an issue tracker system. Open source projects tory and semantics typically arises, without much tend to use trackers which are open to both devel- top-down control (Halpin et al. 2007). opers and users. If the product has many users its This kind of emergent taxonomy is known as tracker can receive an overwhelming number of a folksonomy or collaborative tagging and is issue reports: Mozilla was receiving almost 300 very common in the context of social web appli- reports per day in 2006 (Anvik et al. 2006). Some- cations. Large software projects, especially those one has to monitor those reports and triage them, with open policies and little hierarchical struc- that is decide which component they affect and tures, tend to exhibit many of the same emergent which developer or team of developers should be social properties as the more prototypical social responsible for analyzing them and fixing the re- applications. While this is a useful phenomenon, ported defects. An automated agent assisting the it presents a special challenge from the machine- staff responsible for such triage has the potential learning point of view. 613 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 613–622, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics 1.3 Concept drift 1.4 Online learning Many standard supervised approaches in This paucity of research on online learning from machine-learning assume a stationary distribution issue tracker streams is rather surprising, given from which training examples are independently that truly incremental learners have been well- drawn. 
The set of training examples is processed known for many years. In fact one of the first as a batch, and the resulting learned decision learning algorithms proposed was Rosenblatt’s function (such as a classifier) is then used on test perceptron, a simple mistake-driven discrimina- items, which are assumed to be drawn from the tive classification algorithm (Rosenblatt 1958). In same stationary distribution. the current paper we address this situation and If we need an automated agent which uses hu- show that by using simple, standard online learn- man labels to learn to tag objects the batch learn- ing methods we can improve on batch or pseudo- ing approach is inadequate. Examples arrive one- online learning. We also show that when using by-one in a stream, not as a batch. Even more a sophisticated state-of-the-art stochastic gradient importantly, both the output (label) distribution descent technique the performance gains can be and the input distribution from which the exam- quite large. ples come are emphatically not stationary. As a 1.5 Contributions software project progresses and matures, the type of issues reported is going to change. As project Our main contributions are the following: Firstly, members and users come and go, the vocabulary we explicitly show that concept-drift is pervasive they use to describe the issues will vary. As the and serious in real bug report streams. We then consensus tag folksonomy emerges, the label and address this problem by leveraging state-of-the- training example distribution will evolve. This art online learning techniques which automati- phenomenon is sometimes referred to as concept cally track the evolving data stream and incremen- drift (Widmer and Kubat 1996, Tsymbal 2004). tally update the model after each data item. 
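The predict-then-update protocol that such online learners follow (and that Section 4.1 later uses for evaluation) can be sketched generically; the `MajorityBaseline` below is a toy stand-in, not one of the paper's models, and the interface with `predict`/`update` methods is an assumption for illustration.

```python
def progressive_validation(model, stream):
    """Prequential evaluation: predict on each example *before* using its
    true label to update the model; return the mean 0/1 error."""
    errors, n = 0, 0
    for x, y in stream:
        errors += (model.predict(x) != y)  # error recorded first
        model.update(x, y)                 # only then is the label revealed
        n += 1
    return errors / n

class MajorityBaseline:
    """Toy online learner: always predicts the most frequent label so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
```

Because every prediction is made before the corresponding label is seen, no separate test set is needed: each example is effectively held out from the model that scores it.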
We Early research on learning to triage tended to also adopt the continuous evaluation paradigm, ˇ either not notice the problem (Cubrani´ c and Mur- where the learner predicts the output for each ex- phy 2004), or acknowledge but not address it (An- ample before using it to update the model. Sec- vik et al. 2006): the evaluation these authors used ondly, we address the important issue of repro- assigned bug reports randomly to training and ducibility in research in bug triage automation evaluation sets, discarding the temporal sequenc- by making available the data sets which we col- ing of the data stream. lected and used, in both their raw and prepro- cessed forms. Bhattacharya and Neamtiu (2010) explicitly address the issue of online training and evalua- 2 Open issue-tracker data tion. In their setup, the system predicts the out- put for an item based only on items preceding it Open source software repositories and their as- in time. However, their approach to incremen- sociated issue trackers are a naturally occurring tal learning is simplistic: they use a batch clas- source of large amounts of (partially) labeled data. sifier, but retrain it from scratch after receiving There seems to be growing interest in exploiting each training example. A fully retrained batch this rich resource as evidenced by existing publi- classifier will adapt only slowly to changing data cations as well as the appearance of a dedicated stream, as more recent example have no more in- workshop (Working Conference on Mining Soft- fluence on the decision function that less recent ware Repositories). ones. In spite of the fact that the data is publicly avail- Tamrawi et al. 
(2011) propose an incremental able in open repositories, it is not possible to di- approach to bug triage: the classes are ranked rectly compare the results of the research con- according to a fuzzy set membership function, ducted on bug triage so far: authors use non- which is based on incrementally updated fea- trivial project-specific filtering, re-labeling and ture/class co-occurrence counts. The model is ef- pre-processing heuristics; these steps are usually ficient in online classification, but also adapts only not specified in enough detail that they could be slowly. easily reproduced. 614 Field Meaning of the closed statuses. We generated two data sets Identifier Issue ID from the Chromium issues: Title Short description of issue Description Content of issue report, which • Chromium S UBCOMPONENT. Chromium may include steps to reproduce, uses special tags to help triage the bug re- error messages, stack traces etc. ports. Tags prefixed with Area- specify Author ID of report submitter which subcomponent of the project the bug CCS List of IDs of people CC’d on the issue report should be routed to. In some cases more Labels List of tags associated with is- than one Area- tag is present. Since this sue affects less than 1% of reports, for simplic- Status Label describing the current sta- ity we treat these as single, compound labels. tus of the issue (e.g. Invalid, The development set contains 31,953 items, Fixed, Won’t Fix) and 75 unique output labels. Assigned To ID of person who has been as- signed to deal with the issue • Chromium A SSIGNED. In this dataset the Published Date on which issue report was output is the value of the assignedTo submitted field. We discarded issues where the field was left empty, as well as the Table 1: Issue report record ones which contained the placeholder value all-bugs-test.chromium.org. 
The To help remedy this situation we decided to col- development set contains 16,154 items and lect data from several open issue trackers, use the 591 unique output labels. minimal amount of simple preprocessing and fil- Android Android is a mobile operating sys- ter heuristics to get useful input data, and publicly tem project (http://code.google.com/ share both the raw and preprocessed data. p/android/). We retrieved all the bugs reports, We designed a simple record type which acts of which 6,341 had a closed status. We generated as a common denominator for several tracker for- two datasets: mats. Thus we can use a common representation for issue reports from various trackers. The fields • Android S UBCOMPONENT. The reports in our record are shown in Table 1. which are labeled with tags prefixed with Below we describe the issue trackers used Component-. The development set con- and the datasets we build from them. As dis- tains 888 items and 12 unique output labels. cussed above (and in more detail in Section 4.1), • Android A SSIGNED. The output label is the we use progressive validation rather than a split value of the assignedTo field. We dis- into training and test set. However, in order carded issues with the field left empty. The to avoid developing on the test data, we split development set contains 718 items and 72 each data stream into two substreams, by assign- unique output labels. ing odd-numbered examples to the test stream and the even-numbered ones to the development Firefox Firefox is the well-known web-browser stream. We can use the development stream for project (https://bugzilla.mozilla. exploratory data analysis and feature and param- org). eter tuning, and then use progressive validation to We obtained a total of 81,987 issues with a evaluate on entirely unseen test data. Below we closed status. specify the size and number of unique labels in • Firefox A SSIGNED. 
We discarded issues the development sets; the test sets are very similar where the field was left empty, as well as in size. the ones which contained a placeholder value (nobody). The development set contains Chromium Chromium is the open source- 12,733 items and 503 unique output labels. project behind Google’s Chrome browser (http://code.google.com/p/ Launchpad Launchpad is an issue tracker chromium/). We retrieved all the bugs run by Canonical Ltd for mostly Ubuntu-related from the issue tracker, of which 66,704 have one projects (https://bugs.launchpad. 615 net/). We obtained a total of 99,380 issues with a closed status. • Launchpad A SSIGNED. We discarded issues where the field was left empty. The devel- opment set contains 18,634 items and 1,970 unique output labels. 3 Analysis of concept drift In the introduction we have hypothesized that in issue tracker streams concept drift would be an especially acute problem. In this section we show how class distributions evolve over time in the data we collected. A time-varying distribution is difficult to sum- marize with a single number, but it is easy to ap- preciate in a graph. Figures 1 and 2 show concept drift for several of our data streams. The horizon- tal axis indexes the position in the data stream. The vertical axis shows the class proportions at Figure 1: S UBCOMPONENT class distribution change each position, averaged over a window containing over time 7% of all the examples in the stream, i.e. in each thin vertical bar the proportion of colors used cor- 4 Experimental results responds to the smoothed class distribution at a particular position in the stream. In an online setting it is important to use an evalu- ation regime which closely mimics the continuous Consider the plot for Chromium S UBCOMPO - use of the system in a real-life situation. NENT . 
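The smoothed class proportions plotted in Figures 1 and 2 can be computed with a trailing window over the label stream; a minimal sketch (the paper averages over a window of 7% of the stream, while the window size here is just a parameter):

```python
from collections import Counter

def windowed_proportions(labels, window):
    """For each position in a label stream, return the class proportions
    over the trailing window of examples ending at that position."""
    out = []
    for i in range(len(labels)):
        chunk = labels[max(0, i - window + 1): i + 1]
        counts = Counter(chunk)
        total = len(chunk)
        out.append({c: n / total for c, n in counts.items()})
    return out
```

Plotting these per-position proportions as stacked colors gives exactly the kind of drift visualization described for Figures 1 and 2.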
We can see that a bit before the middle point in the stream class proportions change quite 4.1 Progressive validation dramatically: The orange BROWSERUI and vio- When learning from data streams the standard let MISC almost disappears, while blue INTER - evaluation methodology where data is split into a NALS , pink UI and dark red UNDEFINED take separate training and test set is not applicable. An over. This likely corresponds to an overhaul in the evaluation regime know as progressive validation label inventory and/or recommended best practice has been used to accurately measure the general- for triage in this project. There are also more ization performance of online algorithms (Blum gradual and smaller scale changes throughout the et al. 1999). Under progressive evaluation, an in- data stream. put example from a temporally ordered sequence The Android S UBCOMPONENT stream con- is sent to the learner, which returns the prediction. tains much less data so the plot is less smooth, but The error incurred on this example is recorded, there are clear transitions in this image also. We and the true output is only then sent to the learner see that light blue GOOGLE all but disappears after which may update its model based on it. The fi- about two thirds point and the proportion of vio- nal error is the mean of the per-example errors. let TOOLS and light-green DALVIK dramatically Thus even though there is no separate test set, the increases. prediction for each input is generated based on a In Figure 2 we see the evolution of class pro- model trained on examples which do not include portions in the A SSIGNED datasets. Each plot’s it. idiosyncratic shape illustrates that there is wide In previous work on bug report triage, Bhat- variation in the amount and nature of concept drift tacharya and Neamtiu (2010) and Tamrawi et al. in different software project issue trackers. 
(2011) used an evaluation scheme (close to) pro- 616 Figure 2: A SSIGNED class distribution change over time 1 th whole rankings for all the examples. MRR is also gressive validation. They omit the initial 11 of the examples from the mean. a special case of Mean Average Precision when there is only one true output per item. 4.2 Mean reciprocal rank A bug report triaging agent is most likely to be 4.3 Input representation used in a semi-automatic workflow, where a hu- Since in this paper we focus on the issues related man triager is presented with a ranked list of to concept drift and online learning, we kept the possible outputs (component labels or developer feature set relatively simple. We preprocess the IDs). As such it is important to evaluate not only text in the issue report title and description fields accuracy of the top ranking suggesting, but rather by removing HTML markup, tokenizing, lower- the quality of the whole ranked list. casing and removing most punctuation. We then Previous research (Bhattacharya and Neamtiu extracted the following feature types: 2010, Tamrawi et al. 2011) made an attempt at approximating this criterion by reporting scores • Title unigram and bigram counts which indicate whether the true output is present • Description unigram and bigram counts in the top n elements of the ranking, for several • Author ID (binary indicator feature) values of n. Here we suggest borrowing the mean reciprocal rank (MRR) metric from the informa- • Year, month and day of submission (binary tion retrieval domain (Voorhees 2000). It is de- indicator features) fined as the mean of the reciprocals of the rank at 4.4 Models which the true output is found: We tested a simple online baseline, a pseudo- 1 N X online algorithm which uses a batch model and MRR = rank(i)−1 repeatedly retrains it, an online model used in pre- N i=1 vious research on bug triage and two generic on- line learning algorithms. where rank(i) indicates the rank of the ith true output. 
Window Frequency Baseline  This baseline does not use any input features. It outputs the ranked list of labels for the current item based on the relative frequencies of the output labels in the window of k previous items. We tested windows of size 100 and 1000 and report the better result.

SVM Minibatch  This model uses the multiclass linear Support Vector Machine (Crammer and Singer 2002) as implemented in SVM Light (Joachims 1999). The SVM is known as a state-of-the-art batch model for classification in general and for text categorization in particular. The output classes for an input example are ranked according to the discriminant values returned by the SVM classifier. In order to adapt the model to an online setting, we retrain it every n examples on the window of k previous examples. The parameters n and k can have a large influence on the predictions, but it is not clear how to set them when learning from streams. Here we chose the values (100, 1000) based on how feasible the run time was and on the performance during exploratory experiments on Chromium SUBCOMPONENT. Interestingly, keeping the window parameter relatively small helps performance: a window of 1,000 works better than a window of 5,000.

Algorithm 1 Multiclass online perceptron
    function PREDICT(Y, W, x)
        return {(y, W_y^T x) | y in Y}
    procedure UPDATE(W, x, y_hat, y)
        if y_hat != y then
            W_y_hat <- W_y_hat - x
            W_y <- W_y + x

Perceptron  We implemented a single-pass online multiclass Perceptron with a constant learning rate. It maintains a weight vector for each output seen so far: the prediction function ranks outputs according to the inner product of the current example with the corresponding weight vector. The update function takes the true output and the predicted output. If they are not equal, the current input is subtracted from the weight vector corresponding to the predicted output and added to the weight vector corresponding to the true output (see Algorithm 1). We hash each feature to an integer value and use it as the feature's index into the weight vectors in order to bound memory usage in an online setting (Weinberger et al. 2009). The Perceptron is a simple but strong baseline for online learning.

Bugzie  This is the model described in Tamrawi et al. (2011). The output classes are ranked according to the fuzzy set membership function defined as follows:

    mu(y, X) = 1 - prod_{x in X} (1 - n(y, x) / (n(y) + n(x) - n(y, x)))

where y is the output label, X the set of features in the input issue report, n(y, x) the number of examples labeled as y which contain feature x, n(y) the number of examples labeled y, and n(x) the number of examples containing feature x. The counts are updated online. Tamrawi et al. (2011) also use two so-called caches: the label cache keeps the j% most recent labels and the term cache the k most significant features for each label. Since in Tamrawi et al. (2011)'s experiments the label cache did not affect the results significantly, here we always set j to 100%. We select the optimal k parameter from {100, 1000, 5000} based on the development set.

Regression with Stochastic Gradient Descent  This model performs online multiclass learning by means of a reduction to regression. The regressor is a linear model trained using Stochastic Gradient Descent (Zhang 2004). SGD updates the current parameter vector w^(t) based on the gradient of the loss incurred by the regressor on the current example (x^(t), y^(t)):

    w^(t+1) = w^(t) - eta^(t) grad L(y^(t), w^(t)T x^(t))

The parameter eta^(t) is the learning rate at time t, and L is the loss function. We use the squared loss:

    L(y, y_hat) = (y - y_hat)^2

We reduce multiclass learning to regression using a one-vs-all-type scheme, by effectively transforming an example (x, y) in X x Y into |Y| examples (x', y') in X' x {0, 1}, where Y is the set of labels seen so far. The transform T is defined as follows:

    T(x, y) = {(x', I(y = y')) | y' in Y, x'_h(i,y') = x_i}

where h(i, y') composes the index i with the label y' (by hashing). For a new input x, the ranking of the outputs y in Y is obtained according to the value of the prediction of the base regressor on the binary example corresponding to each class label.

As our basic regression learner we use an efficient implementation of regression via SGD, Vowpal Wabbit (VW) (Langford et al. 2011). VW implements adaptive individual learning rates for each feature, as proposed by Duchi et al. (2010) and McMahan and Streeter (2010). This is appropriate when there are many sparse features, and is especially useful when learning from fast-evolving text data. Features such as the unigram and bigram counts that we rely on are notoriously sparse, and this is exacerbated by the change over time in bug report streams.

Table 2: Best model's error relative to baseline (RER) on the development set
    Dataset        RER
    Chromium SUB   0.36
    Android SUB    0.38
    Chromium AS    0.21
    Android AS     0.19
    Firefox AS     0.16
    Launchpad AS   0.49

Table 3: SUBCOMPONENT evaluation results on the test set
    Task       Model       MRR     Acc
    Chromium   Window      0.5747  0.3467
               SVM         0.5766  0.4535
               Perceptron  0.5793  0.4393
               Bugzie      0.4971  0.2638
               Regression  0.7271  0.5672
    Android    Window      0.5209  0.3080
               SVM         0.5459  0.4255
               Perceptron  0.5892  0.4390
               Bugzie      0.6281  0.4614
               Regression  0.7012  0.5610

4.5 Results
Figures 3 and 4 show the progressive validation results on all the development data streams. The horizontal lines indicate the mean MRR scores for the whole stream. The curves show a moving average of MRR in a window comprising 7% of the total number of items. In most of the plots it is evident how the prediction performance depends on the concept drift illustrated in the plots in Section 3: for example, on Chromium SUBCOMPONENT the performance of all the models drops a bit before the midpoint of the stream while the learners adapt to the change in label distribution that is happening at this time. This is especially pronounced for Bugzie, since it is not able to learn from mistakes and adapt rapidly, but simply accumulates counts.

As the scores in Table 2 indicate, Chromium SUBCOMPONENT, Android SUBCOMPONENT and Launchpad ASSIGNED contain enough high-quality signal for the best model to substantially outperform the label frequency baseline. For five out of the six datasets, Regression SGD gives the best overall performance. On Launchpad ASSIGNED, Bugzie scores higher – we investigate this anomaly below.

Another observation is that the window-based frequency baseline can be quite hard to beat: in three out of the six cases, the minibatch SVM model is no better than the baseline. Bugzie sometimes performs quite well, but for Chromium SUBCOMPONENT and Firefox ASSIGNED it scores below the baseline.

On Launchpad ASSIGNED, Regression SGD performs worse than Bugzie. The concept drift plot for these data suggests one reason: there is very little change in class distribution over time as compared to the other datasets. In fact, even though the issue reports in Launchpad range from the year 2005 to 2011, the more recent ones are heavily overrepresented: 84% of the items in the development data are from 2011. Thus fast adaptation is less important in this case and Bugzie is able to perform well.
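The mistake-driven learning that lets the Perceptron adapt quickly can be made concrete. Below is an illustrative Python sketch of Algorithm 1 combined with feature hashing (Weinberger et al. 2009); it is not our actual implementation, the slot count and example reports are invented, and Python's built-in hash stands in for a real hash function:

```python
class OnlinePerceptron:
    """Single-pass multiclass perceptron (cf. Algorithm 1): one weight
    vector per output label seen so far, over a hashed feature space."""

    def __init__(self, n_slots=2**18):
        self.n_slots = n_slots
        self.W = {}  # label -> dense weight vector over hash slots

    def hash_features(self, counts):
        # the "hashing trick": string feature -> bounded integer index
        return [(hash(f) % self.n_slots, v) for f, v in counts.items()]

    def rank(self, x):
        # PREDICT: order all labels seen so far by the inner product W_y . x
        return sorted(self.W,
                      key=lambda y: sum(self.W[y][i] * v for i, v in x),
                      reverse=True)

    def learn(self, counts, y_true):
        x = self.hash_features(counts)
        if y_true not in self.W:
            self.W[y_true] = [0.0] * self.n_slots
        y_hat = self.rank(x)[0]
        if y_hat != y_true:
            # UPDATE: subtract the input from the wrongly predicted
            # label's weights, add it to the true label's weights
            for i, v in x:
                self.W[y_hat][i] -= v
                self.W[y_true][i] += v

p = OnlinePerceptron()
for counts, label in [({"login": 1, "oauth": 1}, "auth"),
                      ({"crash": 2}, "gfx"),
                      ({"crash": 1}, "gfx")]:
    p.learn(counts, label)
print(p.rank(p.hash_features({"crash": 1}))[0])  # -> gfx
```

Only errors trigger weight changes, which is exactly what allows the model to track a drifting label distribution.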
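The count-accumulating fuzzy score that Bugzie ranks by (Section 4.4) can likewise be sketched in a few lines. This is an illustrative toy version with invented feature sets and developer labels, not Tamrawi et al.'s implementation:

```python
from collections import Counter

class Bugzie:
    """Sketch of Bugzie's fuzzy-set ranking (Tamrawi et al. 2011):
    mu(y, X) = 1 - prod_{x in X} (1 - n(y,x) / (n(y) + n(x) - n(y,x)))
    with all counts accumulated online."""

    def __init__(self):
        self.n_yx = Counter()  # (label, feature) -> co-occurrence count
        self.n_y = Counter()   # label -> number of examples
        self.n_x = Counter()   # feature -> number of examples

    def score(self, y, features):
        prod = 1.0
        for x in features:
            denom = self.n_y[y] + self.n_x[x] - self.n_yx[(y, x)]
            if denom > 0:
                prod *= 1.0 - self.n_yx[(y, x)] / denom
        return 1.0 - prod

    def rank(self, features):
        return sorted(self.n_y, key=lambda y: self.score(y, features),
                      reverse=True)

    def update(self, features, y):
        # pure count accumulation: no mistake-driven correction
        self.n_y[y] += 1
        for x in set(features):
            self.n_x[x] += 1
            self.n_yx[(y, x)] += 1

b = Bugzie()
b.update({"crash", "render"}, "dev1")
b.update({"crash"}, "dev1")
b.update({"docs"}, "dev2")
print(b.rank({"crash"})[0])  # -> dev1
```

Because the update only ever adds to the counts, past (possibly stale) evidence is never discounted, which illustrates why Bugzie adapts slowly under drift.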
Regarding the quality of the different datasets, an interesting indicator is the relative error reduction by the best model over the baseline (see Table 2). It is especially hard to extract meaningful information about the labeling from the inputs on the Firefox ASSIGNED dataset. One possible cause is that the assignment labeling practices in this project are not consistent, an impression that informal inspection seems to bear out.

On the other hand, the reason for the less than stellar score achieved with Regression SGD on Launchpad ASSIGNED is another special feature of that dataset: it has by far the largest number of labels, almost 2,000. This degrades the performance of the one-vs-all scheme we use with SGD Regression. Preliminary investigation indicates that the problem is mostly caused by our application of the "hashing trick" to feature-label pairs (see Section 4.4), which leads to excessive collisions with very large label sets. Our current implementation can use hashes of at most 29 bits, which is insufficient for datasets like Launchpad ASSIGNED. We are currently removing this limitation and expect that this will lead to substantial gains on massively multiclass problems.

In Tables 3 and 4 we present the overall MRR results on the test data streams. The picture is similar to the development data discussed above.

Table 4: ASSIGNED evaluation results on the test set
    Task       Model       MRR     Acc
    Chromium   Window      0.0999  0.0472
               SVM         0.0908  0.0550
               Perceptron  0.1817  0.1128
               Bugzie      0.2063  0.0960
               Regression  0.3074  0.2157
    Android    Window      0.3198  0.1684
               SVM         0.2541  0.1684
               Perceptron  0.3225  0.2057
               Bugzie      0.3690  0.2086
               Regression  0.4446  0.2951
    Firefox    Window      0.5695  0.4426
               SVM         0.4604  0.4166
               Perceptron  0.5191  0.4306
               Bugzie      0.5402  0.4100
               Regression  0.6367  0.5245
    Launchpad  Window      0.0725  0.0337
               SVM         0.1006  0.0704
               Perceptron  0.3323  0.2607
               Bugzie      0.5271  0.4339
               Regression  0.4702  0.3879

Figure 3: SUBCOMPONENT evaluation results on the development set

5 Discussion and related work
Our results show that by choosing the appropriate learner for the scenario of learning from data streams, we can achieve much better results than by attempting to twist a batch algorithm to fit the online learning setting. Even a simple and well-known algorithm such as the Perceptron can be effective, but by using recent advances in research on SGD algorithms we can obtain substantial improvements over the best previously used approach. Below we review the research on bug report triage most relevant to our work.

Čubranić and Murphy (2004) seems to be the first attempt to automate bug triage. The authors cast bug triage as a text classification task and use the data representation (bag of words) and learning algorithm (Naive Bayes) typical for text classification at the time. They collect over 15,000 bug reports from the Eclipse project. The maximum accuracy they report is 30%, which was achieved by using 90% of the data for training.

In Anvik et al. (2006) the authors experiment with three learning algorithms: Naive Bayes, SVM and Decision Tree; SVM performs best in their experiments. They evaluate using precision and recall rather than accuracy. They report results on the Eclipse and Firefox projects, with precision 57% and 64% respectively, but very low recall (7% and 2%).

Matter et al. (2009) adopt a different approach to bug triage. In addition to the project's issue tracker data, they also use the source-code version control data. They build an expertise model for each developer, which is a word count vector of the source code changes committed. They also build a word count vector for each bug report, and use the cosine between the report and the expertise model to rank developers. Using this approach (with a heuristic term weighting scheme) they report 33.6% accuracy on Eclipse.

Bhattacharya and Neamtiu (2010) acknowledge the evolving nature of bug report streams and attempt to apply incremental learning methods to bug triage.
They use a two-step approach: first they predict the most likely developer to assign to a bug using a classifier. In a second step they rank candidate developers according to how likely they were to take over a bug from the developer predicted in the first step. Their approach to incremental learning simply involves fully retraining a batch classifier after each item in the data stream. They test their approach on fixed bugs in Mozilla and Eclipse, reporting accuracies of 27.5% and 38.2% respectively.

Figure 4: ASSIGNED evaluation results on the development set

Tamrawi et al. (2011) propose the Bugzie model, where developers are ranked according to the fuzzy set membership function defined in Section 4.4. They also use the label (developer) cache and term cache to speed up processing and make the model adapt better to the evolving data stream. They evaluate Bugzie and compare its performance to the models used in Bhattacharya and Neamtiu (2010) on seven issue trackers: Bugzie has superior performance on all of them, ranging from 29.9% to 45.7% for top-1 output. They do not use separate validation sets for system development and parameter tuning.

In comparison to Bhattacharya and Neamtiu (2010) and Tamrawi et al. (2011), here we focus much more on the analysis of concept drift in data streams and on the evaluation of learning under its constraints. We also show that for evolving issue tracker data, in a large majority of cases SGD Regression handily outperforms Bugzie.

6 Conclusion
We demonstrate that concept drift is a real, pervasive issue for learning from issue tracker streams. We show how to adapt to it by leveraging recent research in online learning algorithms. We also make our dataset collection publicly available to enable direct comparisons between different bug triage systems.¹

We have identified a good learning framework for mining bug reports: in future work we would like to explore smarter ways of extracting useful signals from the data by using more linguistically informed preprocessing and higher-level features such as word classes.

Acknowledgments
This work was carried out in the context of the Software-Cluster project EMERGENT and was partially funded by BMBF under grant number 01IC10S01O.

¹ Available from http://goo.gl/ZquBe

References
Anvik, J., Hiew, L., and Murphy, G. (2006). Who should fix this bug? In Proceedings of the 28th International Conference on Software Engineering, pages 361-370. ACM.
Bhattacharya, P. and Neamtiu, I. (2010). Fine-grained incremental learning and multi-feature tossing graphs to improve bug triaging. In International Conference on Software Maintenance (ICSM), pages 1-10. IEEE.
Blum, A., Kalai, A., and Langford, J. (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 203-208. ACM.
Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292.
Čubranić, D. and Murphy, G. C. (2004). Automatic bug triage using text categorization. In SEKE 2004: Proceedings of the Sixteenth International Conference on Software Engineering & Knowledge Engineering, pages 92-97. KSI Press.
Duchi, J., Hazan, E., and Singer, Y. (2010). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Halpin, H., Robu, V., and Shepherd, H. (2007). The complex dynamics of collaborative tagging. In Proceedings of the 16th International Conference on World Wide Web, pages 211-220. ACM.
Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.
Langford, J., Hsu, D., Karampatziakis, N., Chapelle, O., Mineiro, P., Hoffman, M., Hofman, J., Lamkhede, S., Chopra, S., Faigon, A., Li, L., Rios, G., and Strehl, A. (2011). Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki.
Matter, D., Kuhn, A., and Nierstrasz, O. (2009). Assigning bug reports using a vocabulary-based expertise model of developers. In Sixth IEEE Working Conference on Mining Software Repositories.
McMahan, H. and Streeter, M. (2010). Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
Tamrawi, A., Nguyen, T., Al-Kofahi, J., and Nguyen, T. (2011). Fuzzy set and cache-based approach for bug triaging. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 365-375. ACM.
Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Technical report, Computer Science Department, Trinity College Dublin.
Voorhees, E. (2000). The TREC-8 question answering track report. NIST Special Publication, pages 77-82.
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113-1120. ACM.
Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM.
Towards a model of formal and informal address in English

Manaal Faruqui
Computer Science and Engineering
Indian Institute of Technology
Kharagpur, India
[email protected]

Sebastian Padó
Institute of Computational Linguistics
Heidelberg University
Heidelberg, Germany
[email protected]

Abstract
Informal and formal ("T/V") address in dialogue is not distinguished overtly in modern English, e.g. by pronoun choice as in many other languages such as French ("tu"/"vous"). Our study investigates the status of the T/V distinction in English literary texts. Our main findings are: (a) human raters can label monolingual English utterances as T or V fairly well, given sufficient context; (b) a bilingual corpus can be exploited to induce a supervised classifier for T/V without human annotation. It assigns T/V at sentence level with up to 68% accuracy, relying mainly on lexical features; (c) there is a marked asymmetry between lexical features for formal speech (which are conventionalized and therefore general) and informal speech (which are text-specific).

1 Introduction
In many Indo-European languages, there are two pronouns corresponding to the English you. This distinction is generally referred to as the T/V dichotomy, from the Latin pronouns tu (informal, T) and vos (formal, V) (Brown and Gilman, 1960). The V form (such as Sie in German and Vous in French) can express neutrality or polite distance and is used to address social superiors. The T form (German du, French tu) is employed towards friends or addressees of lower social standing, and implies solidarity or lack of formality.

English used to have a T/V distinction until the 18th century, using you as the V pronoun and thou for T. However, in contemporary English, you has taken over both uses, and the T/V distinction is not marked anymore. In NLP, this makes generation in English and translation into English easy. Conversely, many NLP tasks suffer from the lack of information about formality, e.g. the extraction of social relationships or, notably, machine translation from English into languages with a T/V distinction, which involves a pronoun choice.

In this paper, we investigate the possibility of recovering the T/V distinction for (monolingual) sentences of 19th- and 20th-century English such as:

(1) Can I help you, Sir? (V)
(2) You are my best friend! (T)

After describing the creation of an English corpus of T/V labels via annotation projection (Section 3), we present an annotation study (Section 4) which establishes that taggers can indeed assign T/V labels to monolingual English utterances in context fairly reliably. Section 5 investigates how T/V is expressed in English texts by experimenting with different types of features, including words, semantic classes, and expressions based on Politeness Theory. We find word features to be most reliable, obtaining an accuracy of close to 70%.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 623-633, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Related Work
There is a large body of work on the T/V distinction in (socio-)linguistics and translation studies, covering in particular the conditions governing T/V usage in different languages (Kretzenbacher et al., 2006; Schüpbach et al., 2006) and the difficulties in translation (Ardila, 2003; Künzli, 2010). However, many observations from this literature are difficult to operationalize. Brown and Levinson (1987) propose a general theory of politeness which makes many detailed predictions. They assume that the pragmatic goal of being polite gives rise to general communication strategies, such as avoiding to lose face (cf. Section 5.2).

In computational linguistics, it is a common observation that for almost every language pair, there are distinctions that are expressed overtly in one language, but remain covert in the other. Examples include morphology (Fraser, 2009) and tense (Schiehlen, 1998). A technique that is often applied in such cases is annotation projection, the use of parallel corpora to copy information from a language where it is overtly realized to one where it is not (Yarowsky and Ngai, 2001; Hwa et al., 2005; Bentivogli and Pianta, 2005).

The phenomenon of formal and informal address has been considered in the contexts of translation (Hobbs and Kameyama, 1990; Kanayama, 2003) and generation in Japanese (Bateman, 1988). Li and Yarowsky (2008) learn pairs of formal and informal constructions in Chinese with a paraphrase mining strategy. Other relevant recent studies consider the extraction of social networks from corpora (Elson et al., 2010). A related study is Bramsen et al. (2011), which considers another sociolinguistic distinction, classifying utterances as "upspeak" and "downspeak" based on the social relationship between speaker and addressee.

This paper extends a previous pilot study (Faruqui and Padó, 2011). It presents more annotation, investigates a larger and better motivated feature set, and discusses the findings in detail.

3 A Parallel Corpus of Literary Texts
This section discusses the construction of T/V gold standard labels for English sentences. We obtain these labels from a parallel English-German corpus using the technique of annotation projection (Yarowsky and Ngai, 2001) sketched in Figure 1: we first identify the T/V status of German pronouns, then copy this T/V information onto the corresponding English sentence.

Figure 1: T/V label induction for English sentences in a parallel corpus with annotation projection. Step 1: a German pronoun provides an overt T/V label ("Darf ich Sie etwas fragen?" -> V). Step 2: the T/V class label is copied to the aligned English sentence ("Please permit me to ask you a question." -> V).

3.1 Data Selection and Preparation
Annotation projection requires a parallel corpus. We found commonly used parallel corpora like EUROPARL (Koehn, 2005) or the JRC Acquis corpus (Steinberger et al., 2006) to be unsuitable for our study, since they either contain almost no direct address at all or, if they do, just formal address (V). Fortunately, for many literary texts from the 19th and early 20th century, copyright has expired, and they are freely available in several languages.

We identified 110 stories and novels among the texts provided by Project Gutenberg (English) and Project Gutenberg-DE (German)¹ that were available in both languages, with a total of 0.5M sentences per language. Examples are Dickens' David Copperfield or Tolstoy's Anna Karenina. We excluded plays and poems, as well as 19th-century adventure novels by Sir Walter Scott and James F. Cooper, which use anachronistic English for stylistic reasons, including words that previously (until the 16th century) indicated T ("thee", "didst").

¹ http://www.gutenberg.org, http://gutenberg.spiegel.de/

We cleaned the English and German novels manually by deleting the tables of contents, prologues and epilogues, as well as the chapter numbers and titles occurring at the beginning of each chapter, to obtain properly parallel texts. The files were then formatted to contain one sentence per line using the sentence splitter and tokenizer provided with EUROPARL (Koehn, 2005). Blank lines were inserted to preserve paragraph boundaries. All novels were lemmatized and POS-tagged using TreeTagger (Schmid, 1994).² Finally, they were sentence-aligned using Gargantuan (Braune and Fraser, 2010), an aligner that supports one-to-many alignments, and word-aligned in both directions using Giza++ (Och and Ney, 2003).

² It must be expected that the tagger degrades on this dataset; however, we did not quantify this effect.

3.2 T/V Gold Labels for English Utterances
As Figure 1 shows, the automatic construction of T/V labels for English involves two steps.

Step 1: Labeling German Pronouns as T/V. German has three personal pronouns relevant for the T/V distinction: du (T), sie (V), and ihr (T/V). However, various ambiguities make their interpretation non-straightforward.

The pronoun ihr can be used both for plural T address and for a somewhat archaic singular or plural V address. In principle, these usages should be distinguished by capitalization (V pronouns are generally capitalized in German), but many T instances of informal use in our corpora are nevertheless capitalized. Additionally, ihr can be the
We believe that this is third person (she/they) when not capitalized. We a reasonable assumption, given that T/V is deter- therefore interpret only capitalized instances of Sie mined by the social relation between interlocutors; as V. Furthermore, we ignore utterance-initial po- but see Section 4 for discussion. sitions, where all words are capitalized. This is defined as tokens directly after a sentence bound- 3.3 Data Splitting ary (POS $.) or after a bracket (POS $(). Finally, we divided our English data into train- These rules concentrate on precision rather than ing, development and test sets with 74 novels recall. They leave many instances of German sec- (26K sentences), 19 novels (9K sentences) and ond person pronouns unlabeled; however, this is 13 novels (8K sentences), respectively. The cor- not a problem since we do not currently aim at pus is available for download at http://www. obtaining complete coverage on the English side nlpado.de/~sebastian/data.shtml. of our parallel corpus. From the 0.5M German sen- tences, about 14% of the sentences were labeled 4 Human Annotation of T/V for English as T or V (37K for V and 28K for T). In a random sample of roughly 300 German sentences which This section investigates how well the T/V distinc- we analysed, we did not find any errors. This puts tion can be made in English by human raters, and the precision of our heuristics at above 99%. on the basis of what information. Two annotators with near native-speaker competence in English Step 2: Annotation Projection. We now copy were asked to label 200 random sentences from the information over onto the English side. We the training set as T or V. Sentences were first pre- originally intended to transfer T/V labels between sented in isolation (“no context”). Subsequently, German and English word-aligned pronouns. How- they were presented with three sentences pre- and ever, we pronouns are not necessarily translated post-context each (“in context”). 
into pronouns; additionally, we found word align- Table 1 shows the results of the annotation ment accuracy for pronouns to be far from perfect, study. The first line compares the annotations due to the variability in function word translation. of the two annotators against each other (inter- For these reason, we decided to look at T/V labels annotator agreement). The next two lines compare at the level of complete sentences, ignoring word the taggers’ annotations against the gold standard alignment. This is generally unproblematic – ad- labels projected from German (GS). The last line dress is almost always consistent within sentences: compares the annotator-assigned labels to the GS of the 65K German sentences with T or V labels, for the instances on which the annotators agree. only 269 (< 0.5%) contain both T and V. Our pro- For all cases, we report raw accuracy and Co- jection on the English side results in 25K V and hen’s κ (1960), i.e. chance-corrected agreement. 3 4 Instances of ihr as possessive pronoun occurred as well, Our sentence aligner supports one-to-many alignments but could be filtered out on the basis of the POS tag. and often aligns single German to multiple English sentences. 625 We first observe that the T/V distinction is con- be T, as presumed by both annotators. Conver- siderably more difficult to make for individual sations between lovers or family members form sentences (no context) than when the discourse is another example, where T is modern usage, but available. In context, inter-annotator agreement in- the novels tend to use V: creases from 75% to 79%, and agreement with the (6) [...] she covered her face with the other gold standard rises by 10%. It is notable that the to conceal her tears. “Corinne!”, said Os- two annotators agree worse with one another than wald, “Dear Corinne! My absence has with the gold standard (see below for discussion). 
then rendered you unhappy!”6 On those instances where they agree, Cohen’s κ reaches 0.58 in context, which is interpreted as In sum, our annotation study establishes that the approaching good agreement (Fleiss, 1981). Al- T/V distinction, although not realized by different though far from perfect, this inter-annotator agree- pronouns in English, can be recovered manually ment is comparable to results for the annotation from text, provided that discourse context is avail- of fine-grained word sense or sentiment (Navigli, able. A substantial part of the errors is due to social 2009; Bermingham and Smeaton, 2009). changes in T/V usage. An analysis of disagreements showed that many sentences can be uttered in both T and V contexts 5 Monolingual T/V Modeling and cannot be labeled without context: The second part of the paper explores the auto- matic prediction of the T/V distinction for English (3) “And perhaps sometime you may see her.” sentences. Given the ability to create an English This case (gold label: V) is disambiguated by the training corpus with T/V labels with the annotation previous sentence which indicates a hierarchical projection methods described in Section 3.2, we social relation between speaker and addressee: can phrase T/V prediction for English as a standard supervised learning task. Our experiments have (4) “And she is a sort of relation of your lord- a twin motivation: (a), on the NLP side, we are ship’s,” said Dawson. . . . mainly interested in obtaining a robust classifier to assign the labels T and V to English sentences; Still, even a three-sentence window is often not (b), on the sociolinguistic side, we are interested in sufficient, since the surrounding sentences may be investigating through which features the categories just as uninformative. In these cases, more global T and V are expressed in English. information about the situation is necessary. 
Even with perfect information, however, judgments can 5.1 Classification Framework sometimes deviate, as there are considerable “grey We phrase T/V labeling as a binary classification areas” in T/V usage (Kretzenbacher et al., 2006). task at the sentence level, performing the classifica- In addition, social rules like T/V usage vary tion with L2-regularized logistic regression using in time and between countries (Schüpbach et al., the LibLINEAR library (Fan et al., 2008). Logis- 2006). This helps to explain why annotators agree tic regression defines the probability that a binary better with one another than with the gold standard: response variable y takes some value as a logit- 21st century annotators tend to be unfamiliar with transformed linear combination of the features fi , 19th century T/V usage. Consider this example each of which is assigned a coefficient βi . from a book written in second person perspective: 1 X p(y = 1) = with z = βi fi (7) (5) Finally, you acquaint Caroline with the 1 + e−z i fatal result: she begins by consoling you. Regularization incorporates the size of the coef- “One hundred thousand francs lost! We ficient vector β into the objective function, sub- shall have to practice the strictest econ- tracting it from the likelihood of the data given the omy”, you imprudently add.5 model. This allows the user to trade faithfulness to the data against generalization.7 Here, the author and translator use V to refer to the 6 reader, while today’s usage would almost certainly A.L.G. de Staël: Corinne 7 We use LIBLINEAR’s default parameters and set the 5 H. de Balzac: Petty Troubles of Married Life cost (regularization) parameter to 0.01. 626 p(C|V ) p(C|T ) Words indicative for V, ranked by the ratio of probabilities 4.59 Mister, sir, Monsieur, sirrah, . . . for T and V, estimated on the training set. 2.36 Mlle., Mr., M., Herr, Dr., . . . 1.60 Gentlemen, patients, rascals, . . . Politeness Theory Features. 
The third feature type is based on the Politeness Theory (Brown Table 2: 3 of the 400 clustering-based semantic classes and Levinson, 1987). Brown and Levinson’s pre- (classes most indicative for V) diction is that politeness levels will be detectable in concrete utterances in a number of ways, e.g. 5.2 Feature Types a higher use of conjunctive or hedges in polite We experiment with three features types that are speech. Formal address (i.e., V as opposed to T) is candidates to express the T/V English distinction. one such expression. Politeness Theory therefore predicts that other politeness indicators should cor- Word Features. The intuition to use word fea- relate with the T/V classification. This holds in tures draws on the parallel between T/V and infor- particular for English, where pronoun choice is mation retrieval tasks like document classification: unavailable to indicate politeness. some words are presumably correlated with formal We constructed 16 features on the basis of Po- address (like titles), while others should indicate liteness Theory predictions, that is, classes of ex- informal address (like first names). In a prelimi- pressions indicating either formality or informality. nary experiment, we noticed that in the absence of From a computational perspective, the problem further constraints, many of the most indicative fea- with Politeness Theory predictions is that they are tures are names of persons from particular novels only described qualitatively and by example, with- which are systematically addressed formally (like out detailed lists. For each feature, we manually Phileas Fogg from J. Vernes’ Around the world in identified around 10 words or multi-word relevant eighty days) or informally (like Mowgli, Baloo, expressions. Table 3 shows these 16 features with and Bagheera from R. Kipling’s Jungle Book). their intended classes and some example expres- These features clearly do not generalize to new sions. 
Similar to the semantic class features, the books. We therefore added a constraint to remove value of each politeness feature is the sum of the all features which did not occur in at least three frequencies of its members in a sentence. novels. To reduce the number of word features to a reasonable order of magnitude, we also performed 5.3 Context: Size and Type a χ2 -based feature selection (Manning et al., 2008) As our annotation study in Section 4 found, con- on the training set. Preliminary experiments es- text is crucial for human annotators, and this pre- tablished that selecting the top 800 word features sumably carries over to automatic methods human yielded a model with good generalization. annotators: if the features for a sentence are com- Semantic Class Features. Our second feature puted just on that sentence, we will face extremely type is semantic class features. These can be seen sparse data. We experiment with symmetrical win- as another strategy to counteract the sparseness dow contexts, varying the size between n = 0 (just at the level of word features. We cluster words the target sentence) and n = 10 (target sentence into 400 semantic classes on the basis of distribu- plus 10 preceding and 10 succeeding sentences). tional and morphological similarity features which This kind of simple “sentence context” makes an are extracted from an unlabeled English collec- important oversimplification, however. It lumps to- tion of Gutenberg novels comprising more than gether material from different speech turns as well 100M tokens, using the approach by Clark (2003). as from “narrative” sentences, which may generate These features measure how similar tokens are to misleading features. For example, narrative sen- one another in terms of their occurrences in the tences may refer to protagonists by their full names document and are useful in Named Entity Recog- including titles (strong features for V) even when nition (Finkel and Manning, 2009). 
As features these protagonists are in T-style conversations: in the T/V classification of a given sentence, we (8) “You are the love of my life”, said Sir simply count for each class the number of tokens Phileas Fogg.8 (T) in this class present in the current sentence. For 8 illustration, Table 2 shows the three classes most J. Verne: Around the world in 80 days 627 Class Example expressions Class Example expressions Inclusion (T) let’s, shall we Exclamations (T) hey, yeah Subjunctive I (T) can, will Subjunctive II (V) could, would Proximity (T) this, here Distance (V) that, there Negated question (V) didn’t I, hasn’t it Indirect question (V) would there, is there Indefinites (V) someone, something Apologizing (V) bother, pardon Polite adverbs (V) marvellous, superb Optimism (V) I hope, would you Why + modal (V) why would(n’t) Impersonals (V) necessary, have to Polite markers (V) please, sorry Hedges (V) in fact, I guess Table 3: 16 Politeness theory-based features with intended classes and example expressions Example (8) also demonstrates that narrative mate- 67 ● rial and direct speech may even be mixed within ● ● 66 ● ● individual sentences. ● ● Accuracy (%) For these reasons, we introduce an alternative ● ● ● 65 ● ● ● ● ● concept of context, namely direct speech context, 64 ● whose purpose is to exclude narrative material. We ● compute direct speech context in two steps: (a), 63 ● ● ● ● segmentation of sentences into chunks that are 62 either completely narrative or speech, and (b), la- beling of chunks with a classifier that distinguishes 61 ● these two classes. The segmentation step (a) takes place with a regular expression that subdivides sen- 0 2 4 6 8 10 tences on every occurrence of quotes (“ , ” , ’ , ‘, Context size (n) etc.). As training data for the classification step (b), we manually tagged 1000 chunks from our Figure 2: Accuracy vs. 
We used this dataset to train the CRF-based sequence tagger Mallet (McCallum, 2002), using all tokens, including punctuation, as features.10 This tagger is used to classify all chunks in our dataset, resulting in output like the following example:

(9) (B-DS) “I am going to see his Ghost!
    (I-DS) It will be his Ghost not him!”
    (O) Mr. Lorry quietly chafed the hands that held his arm.11

Direct speech chunks belonging to the same sentence are subsequently recombined. We define the direct speech context of size n for a given sentence as the n preceding and following direct speech chunks that are labeled B-DS or I-DS, while skipping any chunks labeled O. Note that this definition of direct speech context still lumps together utterances by different speakers and can therefore yield misleading features in the case of asymmetric conversational situations, in addition to possible direct speech misclassifications.

9 The labels are chosen after IOB notation conventions (Ramshaw and Marcus, 1995).
10 We also experimented with rule-based chunk labeling based on quotes, but found the use of quotes too inconsistent.
11 C. Dickens: A tale of two cities

6 Experimental Evaluation

6.1 Evaluation on the Development Set

We first perform model selection on the development set and then validate our results on the test set (cf. Section 3.3).

Figure 2: Accuracy vs. number of sentences in context (empty circles: sentence context; solid circles: direct speech context)

Influence of Context. Figure 2 shows the influence of size and type of context, using only words as features. Without context, we obtain a performance of 61.1% (sentence context) and of 62.9% (direct speech context). These numbers beat the random baseline (50.0%) and the frequency baseline (59.1%). The addition of more context further improves performance substantially for both context types. The ideal context size is fairly large, namely 7 sentences and 8 direct speech chunks, respectively. This indicates that sparseness is indeed a major challenge, and context can become large before the effects mentioned in Section 5.3 counteract the positive effect of more data. Direct speech context outperforms sentence context throughout, with a maximum accuracy of 67.0% as compared to 65.2%, even though it shows higher variation, which we attribute to the less stable nature of the direct speech chunks and their automatically created labels. From now on, we adopt a direct speech context of size 8 unless specified differently.

Model                              Accuracy
Random Baseline                    50.0
Frequency Baseline                 59.1
Words                              67.0∗∗
SemClass                           57.5
PoliteClass                        59.6
Words + SemClass                   66.6∗∗
Words + PoliteClass                66.4∗∗
Words + PoliteClass + SemClass     66.2∗∗
Raw human IAA (no context)         75.0
Raw human IAA (in context)         79.0

Table 4: T/V classification accuracy on the development set (direct speech context, size 8). ∗∗: Significant difference to frequency baseline (p<0.01)

Influence of Features. Table 4 shows the results for different feature types. The best model (word features only) is highly significantly better than the frequency baseline (which it beats by 8%) as determined by a bootstrap resampling test (Noreen, 1989). It gains 17% over the random baseline, but is still more than 10% below inter-annotator agreement in context, which is often seen as an upper bound for automatic models.

Disappointingly, the comparison of the feature groups yields a null result: we are not able to improve over the results for just word features with either the semantic class or the politeness features. Neither feature type outperforms the frequency baseline significantly (p>0.05). Combinations of the different feature types also do worse than just words. The differences between the best model (just words) and the combination models are all not significant (p>0.05). These negative results warrant further analysis, which follows in Section 6.3.

6.2 Results on the Test Set

Table 5 shows the results of evaluating models with the best feature set and with different context sizes on the test set, in order to verify that we did not overfit on the development set when picking the best model. The tendencies correspond well to the development set: the frequency baseline is almost identical, as are the results for the different models. The differences to the development set are all equal to or smaller than 1% accuracy, and the best result at 67.5% is 0.5% better than on the development set. This is a reassuring result, as our model appears to generalize well to unseen data.

Model                      Accuracy    ∆ to dev set
Frequency baseline         59.3        + 0.2
Words (no context)         62.5        − 0.4
Words (context size 6)     67.3        + 1.0
Words (context size 8)     67.5        + 0.5
Words (context size 10)    66.8        + 1.0

Table 5: T/V classification accuracy on the test set and differences to dev set results (direct speech context)

6.3 Analysis by Feature Types

The results from Section 6.1 motivate further analysis of the individual feature types.

Analysis of Word Features. Word features are by far the most effective features. Table 6 lists the top twenty words indicating T and V (ranked by the ratio of probabilities for the two classes on the training set). The list still includes some proper names like Vrazumihin or Louis-Gaston (even though all features have to occur in at least three novels), but they are relatively infrequent. The most prominent indicators for the formal class V are titles (monsieur, (ma)’am) and instances of formulaic language (Permit (me), Excuse (me)). There are also some terms which are not straightforward indicators of formal address (angelic, stubbornness), but are associated with a high register.

There is a notable asymmetry between T and V. The word features for T are considerably more difficult to interpret. We find some forms of earlier-period English (thee, hast, thou, wilt) that result from occasional archaic passages in the novels, as well as first names (Louis-Gaston, Justine). Nevertheless, most features are not straightforward to connect to specifically informal speech.

Top 20 words for V                Top 20 words for T
Word w          P(w|V)/P(w|T)     Word w           P(w|T)/P(w|V)
Excuse          36.5              thee             94.3
Permit          35.0              amenable         94.3
’ai             29.2              stuttering       94.3
’am             29.2              guardian         94.3
stubbornness    29.2              hast             92.0
flights         29.2              Louis-Gaston     92.0
monsieur        28.6              lease-making     92.0
Vrazumihin      28.6              melancholic      92.0
mademoiselle    26.5              ferry-boat       92.0
angelic         26.5              Justine          92.0
Allow           24.5              Thou             66.0
madame          21.2              responsibility   63.8
delicacies      21.2              thou             63.8
entrapped       21.2              Iddibal          63.8
lack-a-day      21.2              twenty-fifth     63.8
ma              21.0              Chic             63.8
duke            18.0              allegiance       63.8
policeman       18.0              Jouy             63.8
free-will       18.0              wilt             47.0
Canon           18.0              shall            47.0

Table 6: Most indicative word features for T or V

Analysis of Semantic Class Features. We ranked the semantic classes we obtained by distributional clustering in a similar manner to the word features. Table 2 shows the top three classes indicative for V. Almost all others of the 400 clusters do not have a strong formal/informal association but mix formal, informal, and neutral vocabulary. This tendency is already apparent in class 3: Gentlemen is clearly formal, while rascals is informal, and patients can belong to either class. Even in class 1, we find Sirrah, a contemptuous term used in addressing a man or boy, with a low formality score (p(w|V)/p(w|T) = 0.22). From cluster 4 onward, none of the clusters is strongly associated with either V or T (p(c|V)/p(c|T) ≈ 1).

Our interpretation of these observations is that in contrast to text categorization, there is no clear-cut topical or domain difference between T and V: both categories co-occur with words from almost any domain. In consequence, semantic classes do not, in general, represent strong unambiguous indicators. Similar to the word features, the situation is worse for T than for V: there still are reasonably strong features for V, the “marked” case, but it is more difficult to find indicators for T.

Analysis of Politeness Features. A major reason for the ineffectiveness of the Politeness Theory-based features seems to be their low frequency: even in the best model, with a direct speech context of size 8, only an average of 7 politeness features was active for any given sentence. However, frequency was not the only problem – the politeness features were generally unable to discriminate well between T and V. For all features, the values of p(f|V)/p(f|T) are between 0.9 and 1.3, that is, the features were only weakly indicative of one of the classes. Furthermore, not all features turned out to be indicative of the class we designed them for. The best indicator for V was the Indefinites feature (somehow, someone, cf. Table 3), as expected. In contrast, the best indicator for T was the Negated question feature, which was supposedly an indicator for V (didn’t I, haven’t we).

A majority of politeness features (13 of the 16) had p(f|V)/p(f|T) values above 1, that is, were indicative for the class V. Thus for this feature type, like for the others, it appears to be more difficult to identify T than to identify V. This negative result can be attributed at least in part to our method of hand-crafting lists of expressions for these features. The inadvertent inclusion of overly general terms might be responsible for the features’ inability to discriminate well, while we have presumably missed specific terms, which has hurt coverage. This situation may in the future be remedied with the semi-automatic acquisition of instantiations of politeness features.

6.4 Analysis of Individual Novels

One possible hypothesis regarding the difficulty of finding indicators for the class T is that indicators for T tend to be more novel-specific than indicators for V, since formal language is more conventionalized (Brown and Levinson, 1987). If this were the case, then our strategy of building well-generalizing models by combining text from different novels would naturally result in models that have problems with picking up T features.

To investigate this hypothesis, we trained models with the best parameters as before (8-sentence direct speech context, words as features). However, this time we trained novel-specific models, splitting each novel into 50% training data and 50% testing data. We required novels to contain more than 200 labeled sentences. This ruled out most short stories, leaving us with 7 novels in the test set. The results are shown in Table 7 and show a clear improvement. The accuracy is 13% higher than in our main experiment (67% vs. 80%), even though the models were trained on considerably less data. Six of the seven novels perform above the 67.5% result from the main experiment.

Novel                                    Accuracy
H. Beecher-Stowe: Uncle Tom’s Cabin      90.0
J. Spyri: Cornelli                       88.3
E. Zola: Lourdes                         83.9
H. de Balzac: Cousin Pons                82.3
C. Dickens: The Pickwick Papers          77.7
C. Dickens: Nicholas Nickleby            74.8
F. Hodgson Burnett: Little Lord          61.6
All (micro average)                      80.0

Table 7: T/V prediction models for individual novels (50% of each novel for training and 50% testing)

The top-ranked features for T and V show a much higher percentage of names for both T and V than in the main experiment. This is to be expected, since this experiment does not restrict itself to features that occurred in at least three novels. The price we pay for this is worse generalization to other novels. There is also still a T/V asymmetry: more top features are shared among the V lists of individual novels and with the main experiment V list than on the T side. Like in the main experiment (cf. Section 6.3), V features indicate titles and other features of elevated speech, while T features mostly refer to novel-specific protagonists and events. In sum, these results provide evidence for a difference in status of T and V.

7 Discussion and Conclusions

In this paper, we have studied the distinction between formal and informal (T/V) address, which is not expressed overtly through pronoun choice or morphosyntactic marking in modern English. Our hypothesis was that the T/V distinction can be recovered in English nevertheless. Our manual annotation study has shown that annotators can in fact tag monolingual English sentences as T or V with reasonable accuracy, but only if they have sufficient context. We exploited the overt information from German pronouns to induce T/V labels for English and used this labeled corpus to train a monolingual T/V classifier for English. We experimented with features based on words, semantic classes, and Politeness Theory predictions.

With regard to our NLP goal of building a T/V classifier, we conclude that T/V classification is a phenomenon that can be modelled on the basis of corpus features. A major factor in classification performance is the inclusion of a wide context to counteract sparse data, and more sophisticated context definitions improve results. We currently achieve top accuracies of 67%–68%, which still leave room for improvement. We next plan to couple our T/V classifier with a machine translation system for a task-based evaluation on the translation of direct address into German and other languages with different T/V pronouns.

Considering our sociolinguistic goal of determining the ways in which English realizes the T/V distinction, we first obtained a negative result: only word features perform well, while semantic classes and politeness features do hardly better than a frequency baseline. Notably, there are no clear “topical” divisions between T and V, like for example in text categorization: almost all words are very weakly correlated with either class, and semantically similar words can co-occur with different classes. Consequently, distributionally determined semantic classes are not helpful for the distinction. Politeness features are difficult to operationalize with sufficiently high precision and recall.

An interesting result is the asymmetry between the linguistic features for V and T at the lexical level. V language appears to be more conventionalized; the models therefore identified formulaic expressions and titles as indicators for V. On the other hand, very few such generic features exist for the class T; consequently, the classifier has a hard time learning good discriminating and yet generic features. Those features that are indicative of T, such as first names, are highly novel-specific and were deliberately excluded from the main experiment. When we switched to individual novels, the models picked up such features, and accuracy increased – at the cost of lower generalizability between novels. A more technical solution to this problem would be the training of a single-class classifier for V, treating T as the “default” class (Tax and Duin, 1999).

Finally, an error analysis showed that many errors arise from sentences that are too short or unspecific to determine T or V reliably. This points to the fact that T/V should not be modelled as a sentence-level classification task in the first place: T/V is not a choice made for each sentence, but one that is determined once for each pair of interlocutors and rarely changed. In future work, we will attempt to learn social networks from novels (Elson et al., 2010), which should provide constraints on all instances of communication between a speaker and an addressee. However, the big – and unsolved, as far as we know – challenge is to automatically assign turns to interlocutors, given the varied and often inconsistent presentation of direct speech turns in novels.

References

John Ardila. 2003. (Non-Deictic, Socio-Expressive) T-/V-Pronoun Distinction in Spanish/English Formal Locutionary Acts. Forum for Modern Language Studies, 39(1):74–86.

John A. Bateman. 1988. Aspects of clause politeness in Japanese: An extended inquiry semantics treatment. In Proceedings of ACL, pages 147–154, Buffalo, New York.

Luisa Bentivogli and Emanuele Pianta. 2005. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Journal of Natural Language Engineering, 11(3):247–261.

Adam Bermingham and Alan F. Smeaton. 2009. A study of inter-annotator agreement for opinion retrieval. In Proceedings of ACM SIGIR, pages 784–785.

Joseph L. Fleiss. 1981. Statistical methods for rates and proportions. John Wiley, New York, 2nd edition.

Alexander Fraser. 2009. Experiments in morphosyntactic processing for translating to and from German. In Proceedings of the EACL MT workshop, pages 115–119, Athens, Greece.

Jerry Hobbs and Megumi Kameyama. 1990. Translation by abduction. In Proceedings of COLING, pages 155–161, Helsinki, Finland.

Rebecca Hwa, Philipp Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Journal of Natural Language Engineering, 11(3):311–325.

Hiroshi Kanayama. 2003. Paraphrasing rules for automatic evaluation of translation into Japanese. In Proceedings of the Second International Workshop on Paraphrasing, pages 88–93, Sapporo, Japan.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand.

Philip Bramsen, Martha Escobar-Molano, Ami Patel, and Rafael Alonso. 2011.
Extracting social power relationships from natural language. In Proceedings of ACL/HLT, pages 773–782, Portland, OR.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Coling 2010: Posters, pages 81–89, Beijing, China.

Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA.

Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of EACL, pages 59–66, Budapest, Hungary.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46.

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of ACL, pages 138–147, Uppsala, Sweden.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Manaal Faruqui and Sebastian Padó. 2011. “I Thou Thee, Thou Traitor”: Predicting formal vs. informal address in English literature. In Proceedings of ACL/HLT 2011, pages 467–472, Portland, OR.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of EMNLP, pages 141–150, Singapore.

Heinz L. Kretzenbacher, Michael Clyne, and Doris Schüpbach. 2006. Pronominal Address in German: Rules, Anarchy and Embarrassment Potential. Australian Review of Applied Linguistics, 39(2):17.1–17.18.

Alexander Künzli. 2010. Address pronouns as a problem in French-Swedish translation and translation revision. Babel, 55(4):364–380.

Zhifei Li and David Yarowsky. 2008. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of EMNLP, pages 1031–1040, Honolulu, Hawaii.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 1st edition.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Roberto Navigli. 2009. Word Sense Disambiguation: a survey. ACM Computing Surveys, 41(2):1–69.

Eric W. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons Inc.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the 3rd ACL Workshop on Very Large Corpora, Cambridge, MA.

Michael Schiehlen. 1998. Learning tense translation from bilingual corpora. In Proceedings of ACL/COLING, pages 1183–1187, Montreal, Canada.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Doris Schüpbach, John Hajek, Jane Warren, Michael Clyne, Heinz Kretzenbacher, and Catrin Norrby. 2006. A cross-linguistic comparison of address pronoun use in four European languages: Intralingual and interlingual dimensions. In Proceedings of the Annual Meeting of the Australian Linguistic Society, Brisbane, Australia.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, and Dan Tufis. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC, pages 2142–2147, Genoa, Italy.

David M. J. Tax and Robert P. W. Duin. 1999. Support vector domain description. Pattern Recognition Letters, 20:1191–1199.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL, pages 200–207, Pittsburgh, PA.

Character-based Kernels for Novelistic Plot Structure

Micha Elsner
Institute for Language, Cognition and Computation (ILCC)
School of Informatics
University of Edinburgh

[email protected]

Abstract text to fiction is that the most important struc- ture underlying the narrative—its plot—occurs at Better representations of plot structure a high level of abstraction, while the actual narra- could greatly improve computational meth- tion is of a series of lower-level events. ods for summarizing and generating sto- ries. Current representations lack abstrac- A short synopsis of Jane Austen’s novel Pride tion, focusing too closely on events. We and Prejudice, for example, is that Elizabeth Ben- present a kernel for comparing novelistic net first thinks Mr. Darcy is arrogant, but later plots at a higher level, in terms of the grows to love him. But this is not stated straight- cast of characters they depict and the so- forwardly in the text; the reader must infer it from cial relationships between them. Our kernel the behavior of the characters as they participate compares the characters of different nov- in various everyday scenes. els to one another by measuring their fre- quency of occurrence over time and the In this paper, we present the plot kernel, a descriptive and emotional language associ- coarse-grained, but robust representation of nov- ated with them. Given a corpus of 19th- elistic plot structure. The kernel evaluates the century novels as training data, our method similarity between two novels in terms of the can accurately distinguish held-out novels characters and their relationships, constructing in their original form from artificially dis- functional analogies between them. These are in- ordered or reversed surrogates, demonstrat- tended to correspond to the labelings produced by ing its ability to robustly represent impor- tant aspects of plot structure. human literary critics when they write, for exam- ple, that Elizabeth Bennet and Emma Woodhouse are protagonists of their respective novels. 
By fo- 1 Introduction cusing on which characters and relationships are Every culture has stories, and storytelling is one important, rather than specifically how they inter- of the key functions of human language. Yet while act, our system can abstract away from events and we have robust, flexible models for the structure focus on more easily-captured notions of what of informative documents (for instance (Chen et makes a good story. al., 2009; Abu Jbara and Radev, 2011)), current The ability to find correspondences between approaches have difficulty representing the nar- characters is key to eventually summarizing or rative structure of fictional stories. This causes even generating interesting stories. Once we can problems for any task requiring us to model effectively model the kinds of people a romance fiction, including summarization and generation or an adventure story is usually about, and what of stories; Kazantseva and Szpakowicz (2010) kind of relationships should exist between them, show that state-of-the-art summarizers perform we can begin trying to analyze new texts by com- extremely poorly on short fictional texts1 . A ma- parison with familiar ones. In this work, we eval- jor problem with applying models for informative uate our system on the comparatively easy task 1 Apart from Kazantseva, we know of one other at- projects/autosummarize. Although this cannot be tempt to apply a modern summarizer to fiction, by the treated as a scientific experiment, the results are unusably artist Jason Huff, using Microsoft Word 2008’s extrac- bad; they consist mostly of short exclamations containing tive summary feature: http://jason-huff.com/ the names of major characters. 634 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 634–644, Avignon, France, April 23 - 27 2012. 
of recognizing acceptable novels (section 6), but recognition is usually a good first step toward generation—a recognition model can always be used as part of a generate-and-rank pipeline, and potentially its underlying representation can be used in more sophisticated ways. We show a detailed analysis of the character correspondences discovered by our system, and discuss their potential relevance to summarization, in section 9.

2 Related work

Some recent work on story understanding has focused on directly modeling the series of events that occur in the narrative. McIntyre and Lapata (2010) create a story generation system that draws on earlier work on narrative schemas (Chambers and Jurafsky, 2009). Their system ensures that generated stories contain plausible event-to-event transitions and are coherent. Since it focuses only on events, however, it cannot enforce a global notion of what the characters want or how they relate to one another.

Our own work draws on representations that explicitly model emotions rather than events. Alm and Sproat (2005) were the first to describe stories in terms of an emotional trajectory. They annotate emotional states in 22 Grimms' fairy tales and discover an increase in emotion (mostly positive) toward the ends of stories. They later use this corpus to construct a reasonably accurate classifier for emotional states of sentences (Alm et al., 2005). Volkova et al. (2010) extend the human annotation approach using a larger number of emotion categories and applying them to freely-defined chunks instead of sentences. The largest-scale emotional analysis is performed by Mohammad (2011), using crowd-sourcing to construct a large emotional lexicon with which he analyzes adult texts such as plays and novels. In this work, we adopt the concept of emotional trajectory, but apply it to particular characters rather than works as a whole.

In focusing on characters, we follow Elson et al. (2010), who analyze narratives by examining their social network relationships. They use an automatic method based on quoted speech to find social links between characters in 19th century novels. Their work, designed for computational literary criticism, does not extract any temporal or emotional structure.

A few projects attempt to represent story structure in terms of both characters and their emotional states. However, they operate at a very detailed level and so can be applied only to short texts. Scheherazade (Elson and McKeown, 2010) allows human annotators to mark character goals and emotional states in a narrative, and indicate the causal links between them. AESOP (Goyal et al., 2010) attempts to learn a similar structure automatically. AESOP's accuracy, however, is relatively poor even on short fables, indicating that this fine-grained approach is unlikely to scale to novel-length texts; our system relies on a much coarser analysis.

Kazantseva and Szpakowicz (2010) summarize short stories, although unlike the other projects we discuss here, they explicitly try to avoid giving away plot details—their goal is to create "spoiler-free" summaries focusing on characters, settings and themes, in order to attract potential readers. They do find it useful to detect character mentions, and also use features based on verb aspect to automatically exclude plot events while retaining descriptive passages. They compare their genre-specific system with a few state-of-the-art methods for summarizing news, and find it outperforms them substantially.

We evaluate our system by comparing real novels to artificially produced surrogates, a procedure previously used to evaluate models of discourse coherence (Karamanis et al., 2004; Barzilay and Lapata, 2005) and models of syntax (Post, 2011). As in these settings, we anticipate that performance on this kind of task will be correlated with performance in applied settings, so we use it as an easier preliminary test of our capabilities.

3 Dataset

We focus on the 19th century novel, partly following Elson et al. (2010) and partly because these texts are freely available via Project Gutenberg. Our main dataset is composed of romances (which we loosely define as novels focusing on a courtship or love affair). We select 41 texts, taking 11 as a development set and the remaining 30 as a test set; a complete list is given in Appendix A. We focus on the novels used in Elson et al. (2010), but in some cases add additional romances by an already-included author. We also selected 10 of the least romantic works as an out-of-domain set; experiments on these are in section 8.

4 Preprocessing

In order to compare two texts, we must first extract the characters in each and some features of their relationships with one another. Our first step is to split the text into chapters, and each chapter into paragraphs; if the text contains a running dialogue where each line begins with a quotation mark, we append it to the previous paragraph. We segment each paragraph with MXTerminator (Reynar and Ratnaparkhi, 1997) and parse it with the self-trained Charniak parser (McClosky et al., 2006). Next, we extract a list of characters, compute dependency tree-based unigram features for each character, and record character frequencies and relationships over time.

4.1 Identifying characters

We create a list of possible character references for each work by extracting all strings of proper nouns (as detected by the parser), then discarding those which occur less than 5 times. Grouping these into a useful character list is a problem of cross-document coreference.

Although cross-document coreference has been extensively studied (Bhattacharya and Getoor, 2005) and modern systems can achieve quite high accuracy on the TAC-KBP task, where the list of available entities is given in advance (Dredze et al., 2010), novelistic text poses a significant challenge for the methods normally used. The typical 19th-century novel contains many related characters, often named after one another. There are complicated social conventions determining which titles are used for whom—for instance, the eldest unmarried daughter of a family can be called "Miss Bennet", while her younger sister must be "Miss Elizabeth Bennet". And characters often use nicknames, such as "Lizzie".

Our system uses the multi-stage clustering approach outlined in Bhattacharya and Getoor (2005), but with some features specific to 19th century European names. To begin, we merge all identical mentions which contain more than two words (leaving bare first or last names unmerged). Next, we heuristically assign each mention a gender (masculine, feminine or neuter) using a list of gendered titles, then a list of male and female first names². We then merge mentions where each is longer than one word, the genders do not clash, and the first and last names are consistent (Charniak, 2001). We then merge single-word mentions with matching multiword mentions if they appear in the same paragraph, or if not, with the multiword mention that occurs in the most paragraphs. When this process ends, we have resolved each mention in the novel to some specific character. As in previous work, we discard very infrequent characters and their mentions.

² The most frequent names from the 1990 US census.

For the reasons stated, this method is error-prone. Our intuition is that the simpler method described in Elson et al. (2010), which merges each mention to the most recent possible coreferent, must be even more so. However, due to the expense of annotation, we make no attempt to compare these methods directly.

4.2 Unigram character features

Once we have obtained the character list, we use the dependency relationships extracted from our parse trees to compute features for each character. Similar feature sets are used in previous work in word classification, such as (Lin and Pantel, 2001). A few example features are shown in Table 1.

reply   left-of-[name]    17
feel    right-of-[name]   14
look    right-of-[name]   10
mind    right-of-[name]    7
make    right-of-[name]    7

Table 1: Top five stemmed unigram dependency features for "Miss Elizabeth Bennet", protagonist of Pride and Prejudice, and their frequencies.

To find the features, we take each mention in the corpus and count up all the words outside the mention which depend on the mention head, except proper nouns and stop words. We also count the mention's own head word, and mark whether it appears to the right or the left (in general, this word is a verb and the direction reflects the mention's role as subject or object). We lemmatize all feature words with the WordNet (Miller et al., 1990) stemmer. The resulting distribution over words is our set of unigram features for the character. (We do not prune rare features, although they have proportionally little influence on our measurement of similarity.)

4.3 Temporal relationships

We record two time-varying features for each character, each taking one value per chapter. The first is the character's frequency as a proportion of all character mentions in the chapter. The second is the frequency with which the character is associated with emotional language—their emotional trajectory (Alm et al., 2005). We use the strong subjectivity cues from the lexicon of Wilson et al. (2005) as a measurement of emotion. If, in a particular paragraph, only one character is mentioned, we count all emotional words in that paragraph and add them to the character's total. To render the numbers comparable across works, each paragraph subtotal is normalized by the amount of emotional language in the novel as a whole. Then the chapter score is the average over paragraphs.

For pairwise character relationships, we count the number of paragraphs in which only two characters are mentioned, and treat this number (as a proportion of the total) as a measurement of the strength of the relationship between that pair³. Elson et al. (2010) show that their method of finding conversations between characters is more precise in showing whether a relationship exists, but the co-occurrence technique is simpler, and we care mostly about the strength of key relationships rather than the existence of infrequent ones.

³ We tried also counting emotional language in these paragraphs, but this did not seem to help in development experiments.

Finally, we perform some smoothing, by taking a weighted moving average of each feature value with a window of the three values on either side. Then, in order to make it easy to compare books with different numbers of chapters, we linearly interpolate each series of points into a curve and project it onto a fixed basis of 50 evenly spaced points. An example of the final output is shown in Figure 1.

Figure 1: Normalized frequency and emotions associated with "Miss Elizabeth Bennet", protagonist of Pride and Prejudice, and frequency of paragraphs about her and "Mr. Darcy", smoothed and projected onto 50 basis points.

5 Kernels

Our plot kernel k(x, y) measures the similarity between two novels x and y in terms of the features computed above. It takes the form of a convolution kernel (Haussler, 1999) where the "parts" of each novel are its characters u ∈ x, v ∈ y and c is a kernel over characters:

    k(x, y) = Σ_{u∈x} Σ_{v∈y} c(u, v)    (1)

We begin by constructing a first-order kernel over characters, c₁(u, v), which is defined in terms of a kernel d over the unigram features and a kernel e over the single-character temporal features. We represent the unigram feature counts as distributions p_u(w) and p_v(w), and compute their similarity as the amount of shared mass, times a small penalty of .1 for mismatched genders:

    d(p_u, p_v) = exp(−α(1 − Σ_w min(p_u(w), p_v(w)))) · 0.1^{I[gen_u ≠ gen_v]}

We compute similarity between a pair of time-varying curves (which are projected onto 50 evenly spaced points) using standard cosine distance, which approximates the normalized integral of their product:

    e(u, v) = (u · v / (‖u‖ ‖v‖))^β    (2)

The weights α and β are parameters of the system, which scale d and e so that they are comparable to one another, and also determine how fast the similarity scales up as the feature sets grow closer; we set them to 5 and 10 respectively.
We sum together the similarities of the character frequency and emotion curves to measure overall temporal similarity between the characters. Thus our first-order character kernel c₁ is:

    c₁(u, v) = d(p_u, p_v) (e(u_freq, v_freq) + e(u_emo, v_emo))

We use c₁ and equation 1 to construct a first-order plot kernel (which we call k₁), and also as an ingredient in a second-order character kernel c₂ which takes into account the curve of pairwise frequencies (written uu′) between two characters u and u′ in the same novel:

    c₂(u, v) = c₁(u, v) Σ_{u′∈x} Σ_{v′∈y} e(uu′, vv′) c₁(u′, v′)

In other words, u is similar to v if, for some relationships of u with other characters u′, there are similar characters v′ who serve the same role for v. We use c₂ and equation 1 to construct our full plot kernel k₂.

5.1 Sentiment-only baseline

In addition to our plot kernel systems, we implement a simple baseline intended to test the effectiveness of tracking the emotional trajectory of the novel without using character identities. We give our baseline access to the same subjectivity lexicon used for our temporal features. We compute the number of emotional words used in each chapter (regardless of which characters they co-occur with), smoothed and normalized as described in subsection 4.3. This produces a single time-varying curve for each novel, representing the average emotional intensity of each chapter. We use our curve kernel e (equation 2) to measure similarity between novels.

6 Experiments

We evaluate our kernels on their ability to distinguish between real novels from our dataset and artificial surrogate novels of three types. First, we alter the order of a real novel by permuting its chapters before computing features. We construct one uniformly random permutation for each test novel. Second, we change the identities of the characters by reassigning the temporal features for the different characters uniformly at random while leaving the unigram features unaltered. (For example, we might assign the frequency, emotion and relationship curves for "Mr. Collins" to "Miss Elizabeth Bennet" instead.) Again, we produce one test instance of this type for each test novel. Third, we experiment with a more difficult ordering task by taking the chapters in reverse.

In each case, we use our kernel to perform a ranking task, deciding whether k(x, y) > k(x, y_perm). Since this is a binary forced-choice classification, a random baseline would score 50%. We evaluate performance in the case where we are given only a single training document x, and for a whole training set X, in which case we combine the decisions using a weighted nearest neighbor (WNN) strategy:

    Σ_{x∈X} k(x, y) > Σ_{x∈X} k(x, y_perm)

In each case, we perform the experiment in a leave-one-out fashion; we include the 11 development documents in X, but not in the test set. Thus there are 1200 single-document comparisons and 30 with WNN. The results of our three systems (the baseline, the first-order kernel k₁ and the second-order kernel k₂) are shown in Table 2. (The sentiment-only baseline has no character-specific features, and so cannot perform the character task.)

                    order   character   reverse
sentiment only      46.2    -           51.5
single doc k1       59.5    63.7        50.7
single doc k2       61.8    67.7        51.6
WNN sentiment       50      -           53
WNN k1              77      90          63
WNN k2              90      90          67

Table 2: Accuracy of kernels ranking 30 real novels against artificial surrogates (chance accuracy 50%).

Using the full dataset and second-order kernel k₂, our system's performance on these tasks is quite good; we are correct 90% of the time for order and character examples, and 67% for the more difficult reverse cases. Results of this quality rely heavily on the WNN strategy, which trusts close neighbors more than distant ones.

In the single training point setup, the system is much less accurate. In this setting, the system is forced to make decisions for all pairs of texts independently, including pairs it considers very dissimilar because it has failed to find any useful correspondences. Performance for these pairs is close to chance, dragging down overall scores (52% for reverse) even if the system performs well on pairs where it finds good correspondences, enabling a higher WNN score (67%).

The reverse case is significantly harder than order. This is because randomly permuting a novel actually breaks up the temporal continuity of the text—for instance, a minor character who appeared in three adjacent chapters might now appear in three separate places. Reversing the text does not cause this kind of disruption, so correctly detecting a reversal requires the system to represent patterns with a distinct temporal orientation, for instance an intensification in the main character's emotions, or in the number of paragraphs focusing on pairwise relationships, toward the end of the text.

The baseline system is ineffective at detecting either ordering or reversals⁴. The first-order kernel k₁ is as good as k₂ in detecting character permutations, but less effective on reorderings and reversals. As we will show in section 9, k₁ places more emphasis on correspondences between minor characters and between places, while k₂ is more sensitive to protagonists and their relationships, which carry the richest temporal information.

⁴ The baseline detects reversals as well as the plot kernels given only a single point of comparison, but these results do not transfer to the WNN strategy. This suggests that unlike the plot kernels, the baseline is no more accurate for documents it considers similar than for those it judges are distant.

7 Significance testing

In addition to using our kernel as a classifier, we can directly test its ability to distinguish real from altered novels via a non-parametric two-sample significance test, the Maximum Mean Discrepancy (MMD) test (Gretton et al., 2007). Given samples from a pair of distributions p and q and a kernel k, this test determines whether the null hypothesis that p and q are identically distributed in the kernel's feature space can be rejected. The advantage of this test is that, since it takes all pairwise comparisons (except self-comparisons) within and across the classes into account, it uses more information than our classification experiments, and can therefore be more sensitive.

As in Gretton et al. (2007), we find an unbiased estimate of the test statistic MMD² for sample sets x ∼ p, y ∼ q, each with m samples, by pairing the two as z_i = (x_i, y_i) and computing:

    MMD²(x, y) = 1/(m(m − 1)) Σ_{i≠j} h(z_i, z_j)

    h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i)

Intuitively, MMD² approaches 0 if the kernel cannot distinguish x from y and is positive otherwise. The null distribution is computed by the bootstrap method; we create null-distributed samples by randomly swapping x_i and y_i in elements of z and computing the test statistic. We use 10000 test permutations. Using both k₁ and k₂, we can reject the null hypothesis that the distribution of novels is equal to order or characters with p < .001; for reversals, we cannot reject the null hypothesis.
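As an illustration, the unbiased estimator and bootstrap procedure above can be sketched in Python; `kernel` stands in for the plot kernel k, and the function names are ours, not the paper's.

```python
import random

def mmd2(xs, ys, kernel):
    # Unbiased estimate of MMD^2 over paired samples (x_i, y_i), summing h(z_i, z_j)
    # for all i != j and normalizing by m(m - 1).
    m = len(xs)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += (kernel(xs[i], xs[j]) + kernel(ys[i], ys[j])
                      - kernel(xs[i], ys[j]) - kernel(xs[j], ys[i]))
    return total / (m * (m - 1))

def bootstrap_pvalue(xs, ys, kernel, trials=10000, rng=random):
    # Null distribution by randomly swapping x_i and y_i within each pair,
    # then counting how often the permuted statistic reaches the observed one.
    observed = mmd2(xs, ys, kernel)
    exceed = 0
    for _ in range(trials):
        sx, sy = [], []
        for x, y in zip(xs, ys):
            if rng.random() < 0.5:
                x, y = y, x
            sx.append(x)
            sy.append(y)
        if mmd2(sx, sy, kernel) >= observed:
            exceed += 1
    return exceed / trials
```

With a kernel that separates the two samples, the estimate is positive; with identical samples it is zero, matching the intuition stated above.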
8 Out-of-domain data

In our main experiments, we tested our kernel only on romances; here we investigate its ability to generalize across genres. We take as our training set X the same romances as above, but as our test set Y a disjoint set of novels focusing mainly on crime, children and the supernatural.

Our results (Table 3) are not appreciably different from those of the in-domain experiments (Table 2) considering the small size of the dataset. This shows our system to be robust, but shallow; the patterns it can represent generalize acceptably across domains, but this suggests it is describing broad concepts like "main character" rather than genre-specific ones like "female romantic lead".

                    order   character   reverse
sentiment only      33.0    -           53.4
single doc k1       59.5    61.7        52.7
single doc k2       63.7    62.0        57.3
WNN sentiment       20      -           70
WNN k1              80      90          80
WNN k2              100     80          70

Table 3: Accuracy of kernels ranking 10 non-romance novels against artificial surrogates, with 41 romances used for comparison.

9 Character-level analysis

To gain some insight into exactly what kinds of similarities the system picks up on when comparing two works, we sorted the characters detected by our system into categories and measured their contribution to the kernel's overall scores. We selected four Jane Austen works from the development set⁵ and hand-categorized each character detected by our system. (We performed the categorization based on the most common full name mention in each cluster. This name is usually a good identifier for all the mentions in the cluster, but if our coreference system has made an error, it may not be.)

⁵ Pride and Prejudice, Emma, Mansfield Park and Persuasion.

Our categorization for characters is intended to capture the stereotypical plot dynamics of literary romance, sorting the characters according to their gender and a simple notion of their plot function. The genders are female, male, plural ("the Crawfords") or not a character ("London"). The functional classes are protagonist (used for the female viewpoint character and her eventual husband), marriageable (single men and women who are seeking to marry within the story) and other (older characters, children, and characters married before the story begins).

We evaluate the pairwise kernel similarities among our four works, and add up the proportional contribution made by character pairs of each type to the eventual score. (For instance, the similarity between "Elizabeth Bennet" and "Emma Woodhouse", both labeled "female protagonist", contributes 26% of the kernel similarity between the works in which they appear.) We plot these as Hinton-style diagrams in Figure 2. The size of each black rectangle indicates the magnitude of the contribution. (Since kernel functions are symmetric, we show only the lower diagonal.)

Under the kernel for unigram features, d (top), the most common character types—non-characters (almost always places) and non-marriageable women—contribute most to the kernel scores; this is especially true for places, since they often occur with similar descriptive terms. The diagram also shows the effect of the kernel's penalty for gender mismatches, since females pair more strongly with females and males with males. Character roles have relatively little impact.

The first-order kernel c₁ (middle), which takes into account frequency and emotion as well as unigrams, is much better than d at distinguishing places from real characters, and assigns somewhat more weight to protagonists.

Finally, c₂ (bottom), which takes into account second-order relationships, places much more emphasis on female protagonists and much less on places. This is presumably because the female protagonists of Jane Austen's novels are the viewpoint characters, and the novels focus on their relationships, while characters do not tend to have strong relationships with places. An increased tendency to match male marriageable characters with marriageable females, and "other" males with "other" females, suggests that c₂ relies more on character function and less on unigrams than c₁ when finding correspondences between characters.

As we concluded in the previous section, the frequent confusion between categories suggests that the analogies we construct are relatively non-specific. We might hope to create role-based summaries of novels by finding their nearest neighbors and then propagating the character categories (for example, "___ is the protagonist of this novel. She lives at ___. She eventually marries ___, her other suitors are ___ and her older guardian is ___.") but the present system is probably not adequate for the purpose. We expect that detecting a fine-grained set of emotions will help to separate character functions more clearly.
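The per-category tally described above (the share of the overall convolution-kernel score contributed by each pair of character categories, as plotted in Figure 2) can be sketched as follows. The representation of characters as (category, features) pairs is our own simplification, and `c` stands in for any of the character kernels d, c₁ or c₂.

```python
from collections import defaultdict

def contribution_shares(chars_x, chars_y, c):
    # Tally the contribution of each (category, category) pair to the
    # convolution kernel sum k(x, y) = sum over u, v of c(u, v),
    # then normalize so the shares sum to one.
    totals = defaultdict(float)
    grand = 0.0
    for cat_u, u in chars_x:
        for cat_v, v in chars_y:
            val = c(u, v)
            totals[(cat_u, cat_v)] += val
            grand += val
    return {pair: val / grand for pair, val in totals.items()} if grand else {}
```

The resulting dictionary is exactly the kind of proportional breakdown a Hinton-style affinity diagram visualizes.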
Figure 2: Affinity diagrams showing character types contributing to the kernel similarity between four works by Jane Austen. (Three panels: unigram features (d), first-order (c₁) and second-order (c₂); rows and columns range over the categories F Prot, M Prot, F Marr., M Marr., Pl Marr., F Other, M Other, Pl Other, Non-char.)

10 Conclusions

This work presents a method for describing novelistic plots at an abstract level. It has three main contributions: the description of a plot in terms of analogies between characters, the use of emotional and frequency trajectories for individual characters rather than whole works, and evaluation using artificially disordered surrogate novels. In future work, we hope to sharpen the analogies we construct so that they are useful for summarization, perhaps by finding an external standard by which we can make the notion of "analogous" characters precise. We would also like to investigate what gains are possible with a finer-grained emotional vocabulary.

Acknowledgements

Thanks to Sharon Goldwater, Mirella Lapata, Victoria Adams and the ProbModels group for their comments on preliminary versions of this work, Kira Mourão for suggesting graph kernels, and three reviewers for their comments.

References

Amjad Abu Jbara and Dragomir Radev. 2011. Coherent citation-based summarization of scientific papers. In Proceedings of ACL 2011, Portland, Oregon.

Cecilia Ovesdotter Alm and Richard Sproat. 2005. Emotional sequencing and development in fairy tales. In ACII, pages 668–674.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579–586, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: an entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).

Indrajit Bhattacharya and Lise Getoor. 2005. Relational clustering for multi-type entity resolution. In Proceedings of the 4th International Workshop on Multi-relational Mining, MRDM '05, pages 3–12, New York, NY, USA. ACM.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore, August. Association for Computational Linguistics.

Eugene Charniak. 2001. Unsupervised learning of name structure from coreference data. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01).

Harr Chen, S.R.K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global models of document structure using latent permutations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 371–379, Boulder, Colorado, June. Association for Computational Linguistics.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 277–285, Beijing, China, August. Coling 2010 Organizing Committee.

David K. Elson and Kathleen R. McKeown. 2010. Building a bank of semantically encoded narratives. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147, Uppsala, Sweden, July. Association for Computational Linguistics.

Amit Goyal, Ellen Riloff, and Hal Daumé III. 2010. Automatically producing plot unit representations for narrative text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 77–86, Cambridge, MA, October. Association for Computational Linguistics.

Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola. 2007. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, Cambridge, MA.

David Haussler. 1999. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz.

Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. 2004. Evaluating centering-based metrics of coherence. In ACL, pages 391–398.

Anna Kazantseva and Stan Szpakowicz. 2010. Summarizing short stories. Computational Linguistics, pages 71–109.

Dekang Lin and Patrick Pantel. 2001. Induction of semantic classes from natural language text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 317–322, New York, NY, USA. ACM.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562–1572, Uppsala, Sweden, July. Association for Computational Linguistics.

G. Miller, A.R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4).

Saif Mohammad. 2011. From once upon a time to happily ever after: Tracking emotions in novels and fairy tales. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 105–114, Portland, OR, USA, June. Association for Computational Linguistics.

Matt Post. 2011. Judging grammaticality with tree substitution grammar derivations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 217–222, Portland, Oregon, USA, June. Association for Computational Linguistics.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16–19, Washington, D.C.

Ekaterina P. Volkova, Betty Mohler, Detmar Meurers, Dale Gerdemann, and Heinrich H. Bülthoff. 2010. Emotional perception of fairy tales: Achieving agreement in emotion annotation of text. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 98–106, Los Angeles, CA, June. Association for Computational Linguistics.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 347–354, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

A List of texts

Dev set (11 works):
Austen: Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility
Brontë, Emily: Wuthering Heights
Burney: Cecilia (1782)
Hardy: Tess of the D'Urbervilles
James: The Ambassadors
Scott: Ivanhoe

Test set (30 works):
Braddon: Aurora Floyd
Brontë, Anne: The Tenant of Wildfell Hall
Brontë, Charlotte: Jane Eyre, Villette
Bulwer-Lytton: Zanoni
Disraeli: Coningsby, Tancred
Edgeworth: The Absentee, Belinda, Helen
Eliot: Adam Bede, Daniel Deronda, Middlemarch
Gaskell: Mary Barton, North and South
Gissing: In the Year of Jubilee, New Grub Street
Hardy: Far From the Madding Crowd, Jude the Obscure, Return of the Native, Under the Greenwood Tree
James: The Wings of the Dove
Meredith: The Egoist, The Ordeal of Richard Feverel
Scott: The Bride of Lammermoor
Thackeray: History of Henry Esmond, History of Pendennis, Vanity Fair
Trollope: Doctor Thorne

Out-of-domain set (10 works):
Ainsworth: The Lancashire Witches
Bulwer-Lytton: Paul Clifford
Collins: The Moonstone
Conan-Doyle: A Study in Scarlet, The Sign of the Four
Dickens: Oliver Twist, The Pickwick Papers
Hughes: Tom Brown's Schooldays
Stevenson: Treasure Island
Stoker: Dracula

Table 4: 19th century novels used in our study.

Smart Paradigms and the Predictability and Complexity of Inflectional Morphology

Grégoire Détrez and Aarne Ranta
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
Morphological lexica are often implemented on top of morphological paradigms, corresponding to different ways of building the full inflection table of a word. Computationally precise lexica may use hundreds of paradigms, and it can be hard for a lexicographer to choose among them. To automate this task, this paper introduces the notion of a smart paradigm. It is a meta-paradigm, which inspects the base form and tries to infer which low-level paradigm applies. If the result is uncertain, more forms are given for discrimination. The number of forms needed on average is a measure of the predictability of an inflection system. The overall complexity of the system also has to take into account the code size of the paradigm definitions themselves. This paper evaluates the smart paradigms implemented in the open-source GF Resource Grammar Library. Predictability and complexity are estimated for four different languages: English, French, Swedish, and Finnish. The main result is that predictability does not decrease when the complexity of the morphology grows, which means that smart paradigms provide an efficient tool for the manual construction and automatic bootstrapping of lexica.

1  Introduction

Paradigms are a cornerstone of grammars in the European tradition. A classical Latin grammar has five paradigms for nouns ("declensions") and four for verbs ("conjugations"). The modern reference on French verbs, Bescherelle (Bescherelle, 1997), has 88 paradigms for verbs. Swedish grammars traditionally have, like Latin, five paradigms for nouns and four for verbs, but a modern computational account (Hellberg, 1978), aiming for more precision, has 235 paradigms for Swedish.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 645-653, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

Mathematically, a paradigm is a function that produces inflection tables. Its argument is a word string (either a dictionary form or a stem), and its value is an n-tuple of strings (the word forms):

    P : String -> String^n

We assume that the exponent n is determined by the language and the part of speech. For instance, English verbs might have n = 5 (for sing, sings, sang, sung, singing), whereas for French verbs in Bescherelle, n = 51. We assume the tuples to be ordered, so that for instance the French second person singular present subjunctive is always found at position 17. In this way, word-paradigm pairs can easily be converted to morphological lexica and to transducers that map form descriptions to surface forms and back. A properly designed set of paradigms permits a compact representation of a lexicon and a user-friendly way to extend it.

Different paradigm systems may have different numbers of paradigms. There are two reasons for this. One is that traditional paradigms often require more arguments than one:

    P : String^m -> String^n

Here m <= n and the set of arguments is a subset of the set of values. Thus the so-called fourth verb conjugation in Swedish actually needs three forms to work properly, for instance sitta, satt, suttit for the equivalent of sit, sat, sat in English. In Hellberg (1978), as in the French Bescherelle, each paradigm is defined to take exactly one argument, and hence each vowel alternation pattern must be a different paradigm.

The other factor that affects the number of paradigms is the nature of the string operations allowed in the function P.
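To make the function view of paradigms concrete, here is a toy Python rendering (not from the paper) of a one-argument paradigm P : String -> String^5 for English verbs; the spelling rules are a deliberate simplification that covers only fully regular verbs:

```python
def verb_paradigm(base):
    """A toy paradigm P : String -> String^5 for English verbs, producing
    (infinitive, 3rd sg present, past, past participle, present participle).
    Only regular verbs are handled; irregulars like "sing" would need
    another paradigm."""
    if base.endswith("e"):  # love -> loves, loved, loving
        return (base, base + "s", base + "d", base + "d", base[:-1] + "ing")
    return (base, base + "s", base + "ed", base + "ed", base + "ing")

# The tuple is ordered, so a given form is always at the same position.
print(verb_paradigm("walk"))  # ('walk', 'walks', 'walked', 'walked', 'walking')
```

Since the tuple positions are fixed, a word-paradigm pair is enough to regenerate the whole table, which is what makes the compact lexicon representation possible.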
In Hellberg (1978), noun paradigms only permit the concatenation of suffixes to a stem. Thus the paradigms are identified with suffix sets. For instance, the inflection patterns bil-bilar ("car-cars") and nyckel-nycklar ("key-keys") are traditionally both treated as instances of the second declension, with the plural ending ar and the contraction of the unstressed e in the case of nyckel. But in Hellberg, the word nyckel has nyck as its "technical stem", to which the paradigm numbered 231 adds the singular ending el and the plural ending lar.

The notion of paradigm used in this paper allows multiple arguments and powerful string operations. In this way, we will be able to reduce the number of paradigms drastically: in fact, each lexical category (noun, adjective, verb) will have just one paradigm, but with a variable number of arguments. Paradigms that follow this design will be called smart paradigms and are introduced in Section 2. Section 3 defines the notions of predictability and complexity of smart paradigm systems. Section 4 estimates these figures for four languages of increasing richness in morphology: English, Swedish, French, and Finnish. We also evaluate the smart paradigms as a data compression method. Section 5 explores some uses of smart paradigms in lexicon building. Section 6 compares smart paradigms with related techniques such as morphology guessers and extraction tools. Section 7 concludes.

2  Smart paradigms

In this paper, we will assume a notion of paradigm that allows multiple arguments and arbitrary computable string operations. As argued in (Kaplan and Kay, 1994) and amply demonstrated in (Beesley and Karttunen, 2003), no generality is lost if the string operators are restricted to ones computable by finite-state transducers. Thus the examples of paradigms that we will show (only informally) can be converted to matchings and replacements with regular expressions.

For example, a majority of French verbs can be defined by the following paradigm, which analyzes a variable-size suffix of the infinitive form and dispatches to the Bescherelle paradigms (identified by a number and an example verb):

    mkV : String -> String^51
    mkV(s) =
      - conj19finir(s),    if s ends ir
      - conj53rendre(s),   if s ends re
      - conj14assiéger(s), if s ends éger
      - conj11jeter(s),    if s ends eler or eter
      - conj10céder(s),    if s ends éder
      - conj07placer(s),   if s ends cer
      - conj08manger(s),   if s ends ger
      - conj16payer(s),    if s ends yer
      - conj06parler(s),   if s ends er

Notice that the cases must be applied in the given order; for instance, the last case applies only to those verbs ending with er that are not matched by the earlier cases.

Also notice that the above paradigm is just like the more traditional ones, in the sense that we cannot be sure whether it really applies to a given verb. For instance, the verb partir ends with ir and would hence receive the same inflection as finir; however, its real conjugation is number 26 in Bescherelle. That mkV uses 19 rather than number 26 has a good reason: a vast majority of ir verbs are inflected in this conjugation, and it is also the productive one, to which new ir verbs are added.

Even though there is no mathematical difference between the mkV paradigm and traditional paradigms like those in Bescherelle, there is a reason to call mkV a smart paradigm. This name implies two things. First, a smart paradigm implements some "artificial intelligence" to pick the underlying "stupid" paradigm. Second, a smart paradigm uses heuristics (informed guessing) if string matching doesn't decide the matter; the guess is informed by statistics of the distributions of the different inflection classes.

One could thus say that smart paradigms are "second-order" or "meta-paradigms", compared to more traditional ones. They implement a lot of linguistic knowledge and intelligence, and thereby enable tasks such as lexicon building to be performed with less expertise than before. For instance, instead of "07" for foncer and "06" for marcher, the lexicographer can simply write "mkV" for all verbs instead of choosing from 88 numbers.
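The ordered case analysis of mkV can be mimicked in Python (a simplified stand-in, not the GF implementation; the conjugation functions are stubbed out to return just their class name rather than a 51-tuple of forms):

```python
# The cases of mkV, tried in order: the first matching suffix wins, so the
# generic "er" case only applies when no more specific case matched.
CASES = [
    ("ir",   "conj19finir"),
    ("re",   "conj53rendre"),
    ("éger", "conj14assiéger"),
    ("eler", "conj11jeter"),
    ("eter", "conj11jeter"),
    ("éder", "conj10céder"),
    ("cer",  "conj07placer"),
    ("ger",  "conj08manger"),
    ("yer",  "conj16payer"),
    ("er",   "conj06parler"),
]

def mk_v(infinitive):
    """Guess the Bescherelle conjugation class from the infinitive alone."""
    for suffix, conjugation in CASES:
        if infinitive.endswith(suffix):
            return conjugation
    raise ValueError("no case matches: " + infinitive)

print(mk_v("jeter"))   # conj11jeter: eter is checked before the er catch-all
print(mk_v("partir"))  # conj19finir: the statistically best guess, although
                       # partir really belongs to Bescherelle class 26
```

The partir example shows the "informed guessing" at work: with only one form available, the dispatch picks the class that covers the vast majority of ir verbs.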
In fact, just "V", indicating that the word is a verb, will be enough, since the name of the paradigm depends only on the part of speech. This follows the model of many dictionaries and methods of language teaching, where characteristic forms are used instead of paradigm identifiers. For instance, another variant of mkV could use as its second argument the first person plural present indicative to decide whether an ir verb is in conjugation 19 or in 26:

    mkV : String^2 -> String^51
    mkV(s, t) =
      - conj26partir(s), if for some x, s = x+ir and t = x+ons
      - conj19finir(s),  if s ends with ir
      - (all the other cases that can be recognized by this extra form)
      - mkV(s) otherwise (fall-back to the one-argument paradigm)

In this way, a series of smart paradigms is built for each part of speech, with more and more arguments. The trick is to investigate which new forms have the best discriminating power. For ease of use, the paradigms should be displayed to the user in an easy-to-understand format, e.g. as a table specifying the possible argument lists:

    verb   parler
    verb   parler, parlons
    verb   parler, parlons, parlera, parla, parlé
    noun   chien
    noun   chien, masculine
    noun   chien, chiens, masculine

Notice that, for French nouns, the gender is listed as one of the pieces of information needed for lexicon building. In many cases, it can be inferred from the dictionary form just like the inflection; for instance, most nouns ending in e are feminine. A gender argument in the smart noun paradigm makes it possible to override this default behaviour.

2.1  Paradigms in GF

Smart paradigms as used in this paper have been implemented in the GF programming language (Grammatical Framework, (Ranta, 2011)). GF is a functional programming language enriched with regular expressions. For instance, the following function implements a part of the one-argument French verb paradigm shown above. It uses a case expression to pattern match with the argument s; the pattern _ matches anything, + divides a string into two pieces, and | expresses alternation. Function application is expressed without parentheses, by the juxtaposition of the function and the argument.

    mkV : Str -> V
    mkV s = case s of {
      _ + "ir"              -> conj19finir s ;
      _ + ("eler" | "eter") -> conj11jeter s ;
      _ + "er"              -> conj06parler s
    }

The functions conj19finir etc. are defined elsewhere in the library.

The GF Resource Grammar Library(1) has comprehensive smart paradigms for 18 languages: Amharic, Catalan, Danish, Dutch, English, Finnish, French, German, Hindi, Italian, Nepalese, Norwegian, Romanian, Russian, Spanish, Swedish, Turkish, and Urdu. A few other languages have complete sets of "traditional" inflection paradigms but no smart paradigms.

Six languages in the library have comprehensive morphological dictionaries: Bulgarian (53k lemmas), English (42k), Finnish (42k), French (92k), Swedish (43k), and Turkish (23k). They have been extracted from other high-quality resources via conversions to GF using the paradigm systems. In Section 4, four of them will be used for estimating the strength of the smart paradigms, that is, the predictability of each language.

3  Cost, predictability, and complexity

Given a language L, a lexical category C, and a set P of smart paradigms for C, the predictability of the morphology of C in L by P depends inversely on the average number of arguments needed to generate the correct inflection table for a word. The lower the number, the more predictable the system.

Predictability can be estimated from a lexicon that contains such a set of tables. Formally, a smart paradigm is a family P_m of functions

    P_m : String^m -> String^n

where m ranges over some set of integers from 1 to n, but need not contain all those integers. A lexicon L is a finite set of inflection tables,

    L = {w_i : String^n | i = 1, ..., M_L}
(1) Source code and documentation at http://www.grammaticalframework.org/lib.

As n is fixed for a given category, such a lexicon is specialized to one part of speech. A word is an element of the lexicon, that is, an inflection table of size n.

An application of a smart paradigm P_m to a word w in L is an inflection table resulting from applying P_m to the appropriate subset sigma_m(w) of the inflection table w,

    P_m[w] = P_m(sigma_m(w)) : String^n

Thus we assume that all arguments are existing word forms (rather than e.g. stems), or features such as the gender.

An application is correct if

    P_m[w] = w

The cost of a word w is the minimum number of arguments needed to make the application correct:

    cost(w) = argmin_m (P_m[w] = w)

For practical applications, it is useful to require P_m to be monotonic, in the sense that increasing m preserves correctness.
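The cost of a single word can be sketched in Python (a simplification under the assumption that each member of the paradigm family is a function from a tuple of characteristic forms to a full table; the toy noun paradigms are invented):

```python
def word_cost(table, paradigms, select_args):
    """cost(w): the least m such that the paradigm family reproduces the
    full inflection table w from its first m characteristic forms.

    table:       the full inflection table, a tuple of strings
    paradigms:   dict mapping m to a function of m forms returning a table
    select_args: function (table, m) -> the m characteristic forms
    """
    # Trying m in growing order relies on monotonicity: once some m
    # succeeds, any larger m would succeed as well.
    for m in sorted(paradigms):
        if paradigms[m](*select_args(table, m)) == table:
            return m
    raise ValueError("no paradigm variant reproduces the table")

# Toy English noun paradigms: P1 guesses the plural, P2 is given it.
paradigms = {1: lambda sg: (sg, sg + "s"),
             2: lambda sg, pl: (sg, pl)}
firstforms = lambda t, m: t[:m]

print(word_cost(("car", "cars"), paradigms, firstforms))     # 1
print(word_cost(("city", "cities"), paradigms, firstforms))  # 2
```

The worst-case member (here P2, which is simply given every form) guarantees that the search terminates for any table.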
The cost of a lexicon L is the average cost of its words,

    cost(L) = (1 / M_L) * sum_{i=1..M_L} cost(w_i)

where M_L is the number of words in the lexicon, as defined above.

The predictability of a lexicon could be defined as a quantity inversely dependent on its cost. For instance, an information-theoretic measure could be defined

    predict(L) = 1 / (1 + log cost(L))

with the intuition that each added argument corresponds to a choice in a decision tree. However, we will not use this measure in this paper, but just the concrete cost.

The complexity of a paradigm system is defined as the size of its code in a given coding system, following the idea of Kolmogorov complexity (Solomonoff, 1964). The notion assumes a coding system, which we fix to be GF source code. As the results are relative to the coding system, they are only usable for comparing definitions in the same system. However, using GF source code size rather than e.g. a finite automaton size gives in our view a better approximation of the "cognitive load" of the paradigm system, its "learnability". As a functional programming language, GF permits abstractions comparable to those available to human language learners, who don't need to learn the repetitive details of a finite automaton.

We define the code complexity as the size of the abstract syntax tree of the source code. This size is given as the number of nodes in the syntax tree; for instance,

    size(f(x_1, ..., x_n)) = 1 + sum_{i=1..n} size(x_i)
    size(s) = 1, for a string literal s

Using the abstract syntax size makes it possible to ignore programmer-specific variation such as identifier size. Measurements of the GF Resource Grammar Library show that code size measured in this way is on average 20% of the size of the source files in bytes. Thus a source file of 1 kB has a code complexity of around 200 on average.

Notice that code complexity is defined in a way that makes it a straightforward generalization of the cost of a word as expressed in terms of paradigm applications in GF source code. The source code complexity of a paradigm application is

    size(P_m[w]) = 1 + m

Thus the complexity for a word w is its cost plus one; the added one comes from the application node for the function P_m and corresponds to knowing the part of speech of the word.
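The node-counting definition can be imitated on Python syntax trees with the standard ast module (this measures Python rather than GF code, so the absolute numbers are only analogous, but the invariance to identifier length carries over):

```python
import ast

def code_complexity(source):
    """Count every node of the abstract syntax tree, in the spirit of
    size(f(x1, ..., xn)) = 1 + size(x1) + ... + size(xn), size(literal) = 1."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

# Identifier length is ignored: only the tree structure counts.
assert code_complexity("f(x, y)") == code_complexity("frequency(alpha, beta)")
```

This is the property the paper relies on when comparing collaboratively written grammar modules: renaming variables does not change the measured complexity, while adding a case or an argument does.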
4  Experimental results

We conducted experiments in four languages (English, Swedish, French and Finnish(2)), presented here in order of morphological richness. We used trusted full-form lexica (i.e. lexica giving the complete inflection table of every word) to compute the predictability, as defined above, in terms of the smart paradigms in the GF Resource Grammar Library.

We used a simple algorithm for computing the cost c of a lexicon L with a set P_m of smart paradigms:

  - set c := 0
  - for each word w_i in L:
      - for each m, in growing order, for which P_m is defined:
        if P_m[w_i] = w_i, then c := c + m; else try with the next m
  - return c

The average cost is c divided by the size of L.

The procedure presupposes that it is always possible to get the correct inflection table. For this to be true, the smart paradigms must have a "worst case scenario" version that is able to generate all forms. In practice, this was not always the case, but we checked that the number of problematic words is so small that it wouldn't be statistically significant. A typical problem word was the equivalent of the verb be in each language.

Another source of deviation is that a lexicon may have inflection tables whose size deviates from the number n that normally defines a lexical category. Some words may be "defective", i.e. lack some forms (e.g. the singular forms of "plurale tantum" words), whereas some words may have several variants for a given form (e.g. learned and learnt in English). We made no effort to predict defective words, but just ignored them. With variant forms, we treated a prediction as correct if it matched any of the variants.

The above algorithm can also be used to help select the optimal sets of characteristic forms; we used it in this way to select the first form of Swedish verbs and the second form of Finnish nouns.

The results are collected in Table 1. The sections below give more details of the experiment in each language.

(2) This choice corresponds to the set of languages for which both comprehensive smart paradigms and morphological dictionaries were present in GF, with the exception of Turkish, which was left out because of time constraints.

4.1  English

As gold standard, we used the electronic version of the Oxford Advanced Learner's Dictionary of Current English,(3) which contains about 40,000 root forms (about 70,000 word forms).

Nouns. We considered English nouns as having only two forms (singular and plural), excluding the genitive forms, which can be considered to be clitics and are completely predictable. About one third of the nouns of the lexicon were not included in the experiment because one of the forms was missing. The vast majority of the remaining 15,000 nouns are very regular, with predictable deviations such as kiss-kisses and fly-flies, which can easily be predicted by the smart paradigm. With an average cost of 1.05, this was the most predictable lexicon in our experiment.

Verbs. Verbs are the most interesting category in English because they present the richest morphology. Indeed, as shown by Table 1, the cost for English verbs, 1.21, is similar to what we got for morphologically richer languages.

4.2  Swedish

As gold standard, we used the SALDO lexicon (Borin et al., 2008).

Nouns. The noun inflection tables had 8 forms (singular/plural indefinite/definite nominative/genitive) plus a gender (uter/neuter). Swedish nouns are intrinsically very unpredictable, and there are many examples of homonyms falling under different paradigms (e.g. val-val "choice" vs. val-valar "whale"). The cost 1.70 is the highest of all the lexica considered. Of course, there may be room for improving the smart paradigm.

Verbs. The verbs had 20 forms, which included past participles. We ran two experiments, choosing either the infinitive or the present indicative as the base form. In traditional Swedish grammar, the base form of the verb is considered to be the infinitive, e.g. spela, leka ("play" in two different senses). But this form doesn't distinguish between the "first" and the "second" conjugation. However, the present indicative, here spelar, leker, does. Using it gives a predictive power of 1.13, as opposed to 1.22 with the infinitive. Some modern dictionaries such as Lexin(4) therefore use the present indicative as the base form.

4.3  French

For French, we used the Morphalou morphological lexicon (Romary et al., 2004). As stated in the documentation,(5) the current version of the lexicon (version 2.0) is not complete, and in particular, many entries are missing some or all inflected forms.
So for those experiments we only included entries where all the necessary forms were present.

Table 1: Lexicon size and average cost for the nouns (N) and verbs (V) in four languages, with the percentage of words correctly inferred from one and two forms (i.e. m = 1 and m <= 2, respectively).

    Lexicon  Forms  Entries  Cost  m=1  m<=2
    Eng N        2   15,029  1.05  95%  100%
    Eng V        5    5,692  1.21  84%   95%
    Swe N        9   59,225  1.70  46%   92%
    Swe V       20    4,789  1.13  97%   97%
    Fre N        3   42,390  1.25  76%   99%
    Fre V       51    6,851  1.27  92%   94%
    Fin N       34   25,365  1.26  87%   97%
    Fin V      102   10,355  1.09  96%   99%

Nouns. Nouns in French have two forms (singular and plural) and an intrinsic gender (masculine or feminine), which we also considered to be a part of the inflection table. Most of the unpredictability comes from the impossibility of guessing the gender.

Verbs. The paradigms generate all of the simple (as opposed to compound) tenses given in traditional grammars such as the Bescherelle. The participles are also generated. The auxiliary verb of compound tenses would be impossible to guess from morphological clues, and was left out of consideration.

4.4  Finnish

The Finnish gold standard was the KOTUS lexicon (Kotimaisten Kielten Tutkimuskeskus, 2006). It has around 90,000 entries tagged with part of speech, 50 noun paradigms, and 30 verb paradigms. Some of these paradigms are rather abstract and powerful; for instance, grade alternation would multiply many of the paradigms by a factor of 10 to 20, if it was treated in a concatenative way. For instance, singular nominative-genitive pairs show alternations such as talo-talon ("house"), katto-katon ("roof"), kanto-kannon ("stub"), rako-raon ("crack"), and sato-sadon ("harvest"). All of these are treated with one and the same paradigm, which makes the KOTUS system relatively abstract.

The total number of forms of Finnish nouns and verbs is a question of definition. Koskenniemi (1983) reports 2,000 for nouns and 12,000 for verbs, but most of these forms result from adding particles and possessive suffixes in an agglutinative way. The traditional number and case count for nouns gives 26, whereas for verbs the count is between 100 and 200, depending on how participles are counted. Notice that the definition of predictability used in this paper doesn't depend on the number of forms produced (i.e. not on n but only on m); therefore we can simply ignore this question. However, the question is interesting if we think about paradigms as a data compression method (Section 4.5).

Nouns. Compound nouns are a problem for morphology prediction in Finnish, because inflection is sensitive to vowel harmony and the number of syllables, which depend on where the compound boundary goes. While many compounds are marked in KOTUS, we had to remove some compounds with unmarked boundaries. Another peculiarity was that adjectives were included in nouns; this is no problem, since the inflection patterns are the same if comparison forms are ignored. The figure 1.26 is better than the one reported in (Ranta, 2008), which is 1.42; the reason is mainly that the current set of paradigms has better coverage of three-syllable nouns.

Verbs. Even though more numerous in forms than nouns, Finnish verbs are highly predictable (1.09).

4.5  Complexity and data compression

The cost of a lexicon has an effect on learnability. For instance, even though Finnish words have ten or a hundred times more forms than English words, these forms can be derived from roughly the same number of characteristic forms as in English.

(3) Available in electronic form at http://www.eecs.qmul.ac.uk/~mpurver/software.html
(4) http://lexin.nada.kth.se/lexin/
(5) http://www.cnrtl.fr/lexiques/morphalou/LMF-Morphalou.php#body_3.4.11, accessed 2011-11-04
But this is of course just a part of the truth: it might still be that the paradigm system itself is much more complex in some languages than in others.

Following the definitions of Section 3, we have counted the complexity of the smart paradigm definitions for nouns and verbs in the different languages in the GF Resource Grammar Library. Notice that the total complexity of the system is lower than the sum of the parts, because many definitions (such as morphophonological transformations) are reused in different parts of speech. The results are in Table 2.

Table 2: Paradigm complexities for nouns and verbs in the four languages, computed as the syntax tree size of the GF code.

    language  noun  verb  total
    English    403   837   991
    Swedish    918  1039  1884
    French     351  2193  2541
    Finnish   4772  3343  6885

These figures suggest that Finnish indeed has a more complex morphology than French, and English is the simplest. Of course, the paradigms were not implemented with such comparisons in mind, and it may happen that some of the differences come from the different coding styles involved in the collaboratively built library. Measuring code syntax trees rather than source code text neutralizes some of this variation (Section 3).

Finally, we can estimate the power of smart paradigms as a data compression function. In a sense, a paradigm is a function designed for the very purpose of compressing a lexicon, and one can expect better compression than with generic tools such as bzip2. Table 3 shows the compression rates for the same full-form lexica as used in the predictability experiment (Table 1). The sizes are in kilobytes, where the code size for paradigms is calculated as the number of constructors multiplied by 5 (Section 3). The source lexicon size is a simple character count, similar to the full-form lexicon.

Unexpectedly, the compression rate of the paradigms improves as the number of forms in the full-form lexicon increases (see Table 1 for these numbers). For English and French nouns, bzip2 is actually better. But of course, unlike the paradigms, it also gives a global compression over all entries in the lexicon. Combining the two methods by applying bzip2 to the source code gives, for the Finnish verb lexicon, a file of 60 kB, which implies a joint compression rate of 227. That the compression rates for the code can be higher than the numbers of forms in the full-form lexicon is explained by the fact that the generated forms are longer than the base forms. For instance, the full-form entry of the Finnish verb uida ("swim") is 850 bytes, which means that the average form size is twice the size of the basic form.

5  Smart paradigms in lexicon building

Building a high-quality lexicon needs a lot of manual work. Traditionally, when one is not writing all the forms by hand (which would be almost impossible in languages with rich morphology), sets of paradigms are used that require the lexicographer to specify the base form of the word and an identifier for the paradigm to use. This has several usability problems: one has to remember all the paradigm identifiers and choose correctly among them.

Smart paradigms can make this task easier, even accessible to non-specialists, because of their ability to guess the most probable paradigm from a single base form. As shown by Table 1, this is more often correct than not, except for Swedish nouns. If this information is not enough, only a few more forms are needed, requiring only practical knowledge of the language. Usually (92% to 100% in Table 1), adding a second form (m = 2) is enough to cover all words. Then the best practice for lexicon writing might be to always give these two forms instead of just one.

Smart paradigms can also be used for automatic bootstrapping of a list of base forms into a full-form lexicon. As again shown by the last column of Table 1, one form alone can provide an excellent first approximation in most cases. What is more, it is often the case that uncaught words belong to a limited set of "irregular" words, such as the irregular verbs in many languages. All new words can then be safely inferred from the base form by using smart paradigms.
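As a side note on the compression comparison in Section 4.5, the bzip2 rate on a full-form list can be reproduced in miniature with Python's standard bz2 module standing in for the bzip2 tool (the toy lexicon here is invented, so the resulting rate only illustrates the computation, not the paper's figures):

```python
import bz2

# A tiny invented full-form "lexicon": one line per entry, all forms spelled out.
entries = [("walk", "walks", "walked", "walked", "walking"),
           ("play", "plays", "played", "played", "playing")] * 500
fullform = "\n".join(" ".join(entry) for entry in entries).encode("utf-8")

compressed = bz2.compress(fullform)
rate = len(fullform) / len(compressed)
print(f"full-form: {len(fullform)} B, bzip2: {len(compressed)} B, rate: {rate:.1f}")
```

A paradigm-based representation would instead store one base form plus the (shared) paradigm code per entry, which is why its rate can grow with the number of forms per table.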
6  Related work

Smart paradigms were used for a study of Finnish morphology in (Ranta, 2008). The present paper can be seen as a generalization of that experiment to more languages and with the notion of code complexity. The paradigms for Finnish are also improved here (cf. Section 4.4 above).

Even though smart paradigm-like descriptions are common in language textbooks, there is to our knowledge no computational equivalent to the smart paradigms of GF. Finite state morphology systems often have a function called a guesser, which, given a word form, tries to guess either the paradigm this form belongs to or the dictionary form (or both). A typical guesser differs from a smart paradigm in that it does not make it possible to correct the result by giving more forms. Examples of guessers include (Chanod and Tapanainen, 1995) for French, (Hlaváčová, 2001) for Czech, and (Nakov et al., 2003) for German.

Another related domain is the unsupervised learning of morphology, where machine learning is used to automatically build a language morphology from corpora (Goldsmith, 2006). The main difference is that with smart paradigms, the paradigms and the guessing heuristics are implemented manually and with high certainty; in unsupervised learning of morphology, the paradigms are induced from the input forms with much lower certainty. Of particular interest are (Chan, 2006) and (Dreyer and Eisner, 2011), which deal with the automatic extraction of paradigms from text and investigate how good these can become. The main contrast is, again, that our work deals with hand-written paradigms that are correct by design, and we try to see how much information we can drop before losing correctness.

Once given, a set of paradigms can be used in automated lexicon extraction from raw data, as in (Forsberg et al., 2006) and (Clément et al., 2004), by a method that tries to collect a sufficient number of forms to determine that a word belongs to a certain paradigm. Smart paradigms can then give the method to actually construct the full inflection tables from the characteristic forms.

Table 3: Comparison between using bzip2 and paradigms+lexicon source as a compression method. Sizes in kB.

    Lexicon  Fullform  bzip2  fullform/bzip2  Source  fullform/source
    Eng N         264     99             2.7     135              2.0
    Eng V         245     78             3.2      57              4.4
    Swe N       6,243  1,380             4.5   1,207              5.3
    Swe V         840    174             4.8      58               15
    Fre N         952    277             3.4     450              2.2
    Fre V       3,888    811             4.8      98               40
    Fin N      11,295  2,165             5.2     343               34
    Fin V      13,609  2,297             5.9     123              114

7  Conclusion

We have introduced the notion of smart paradigms, which implement the linguistic knowledge involved in inferring the inflection of words. We have used the paradigms to estimate the predictability of nouns and verbs in English, Swedish, French, and Finnish. The main result is that, with the paradigms used, less than two forms on average is always enough. In half of the languages and categories, one form is enough to predict more than 90% of forms correctly. This gives promise for both manual lexicon building and automatic bootstrapping of lexica from word lists.

To estimate the overall complexity of inflection systems, we have also measured the size of the source code for the paradigm systems. Unsurprisingly, Finnish is around seven times as complex as English, and around three times as complex as Swedish and French. But this cost is amortized when big lexica are built.

Finally, we looked at smart paradigms as a data compression method. With simple morphologies, such as English nouns, bzip2 gave a better compression of the lexicon than the source code using paradigms.
But with Finnish verbs, the compression rate was almost 20 times higher with paradigms than with bzip2.

The general conclusion is that smart paradigms are a good investment when building morphological lexica, as they ease the task of both human lexicographers and automatic bootstrapping methods. They also suggest a method to assess the complexity and learnability of languages, related to Kolmogorov complexity. The results in the current paper are just preliminary in this respect, since they might still tell more about particular implementations of paradigms than about the languages themselves.

Acknowledgements

We are grateful to the anonymous referees for valuable remarks and questions. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no FP7-ICT-247914 (the MOLTO project).

References

[Beesley and Karttunen2003] Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.
[Bescherelle1997] Bescherelle. 1997. La conjugaison pour tous. Hatier.
[Borin et al.2008] Lars Borin, Markus Forsberg, and Lennart Lönngren. 2008. SALDO 1.0 (svenskt associationslexikon version 2). Språkbanken, 05.
[Chan2006] Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69-78, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Chanod and Tapanainen1995] Jean-Pierre Chanod and Pasi Tapanainen. 1995. Creating a tagset, lexicon and guesser for a French tagger. CoRR, cmp-lg/9503004.
[Clément et al.2004] Lionel Clément, Benoît Sagot, and Bernard Lang. 2004. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of LREC-04, Lisboa, Portugal, pages 1841-1844.
[Dreyer and Eisner2011] Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 616-627, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Forsberg et al.2006] Markus Forsberg, Harald Hammarström, and Aarne Ranta. 2006. Morphological Lexicon Extraction from Raw Text Data. In T. Salakoski, editor, FinTAL 2006, volume 4139 of LNCS/LNAI.
[Goldsmith2006] John Goldsmith. 2006. An Algorithm for the Unsupervised Learning of Morphology. Nat. Lang. Eng., 12(4):353-371.
[Hellberg1978] Staffan Hellberg. 1978. The Morphology of Present-Day Swedish. Almqvist & Wiksell.
[Hlaváčová2001] Jaroslava Hlaváčová. 2001. Morphological guesser of Czech words. In Václav Matoušek, Pavel Mautner, Roman Moucek, and Karel Taušer, editors, Text, Speech and Dialogue, volume 2166 of Lecture Notes in Computer Science, pages 70-75. Springer Berlin / Heidelberg.
[Kaplan and Kay1994] R. Kaplan and M. Kay. 1994. Regular Models of Phonological Rule Systems. Computational Linguistics, 20:331-380.
[Koskenniemi1983] Kimmo Koskenniemi. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. thesis, University of Helsinki.
[Kotimaisten Kielten Tutkimuskeskus2006] Kotimaisten Kielten Tutkimuskeskus. 2006. KOTUS Wordlist. http://kaino.kotus.fi/sanat/nykysuomi.
[Nakov et al.2003] Preslav Nakov, Yury Bonev, et al. 2003. Guessing morphological classes of unknown German nouns.
[Ranta2008] Aarne Ranta. 2008. How predictable is Finnish morphology? An experiment on lexicon construction. In J. Nivre, M. Dahllöf, and B. Megyesi, editors, Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, pages 130-148. University of Uppsala. http://publications.uu.se/abstract.xsql?dbid=8933.
[Ranta2011] Aarne Ranta. 2011. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth).
[Romary et al.2004] Laurent Romary, Susanne Salmon-Alt, and Gil Francopoulo. 2004. Standards going concrete: from LMF to Morphalou. In The 20th International Conference on Computational Linguistics - COLING 2004, Genève/Switzerland.
[Solomonoff1964] Ray J. Solomonoff. 1964. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1-22 and 224-254.


Probabilistic Hierarchical Clustering of Morphological Paradigms

Burcu Can and Suresh Manandhar
Department of Computer Science, University of York
Heslington, York, YO10 5GH, UK

[email protected] [email protected]

Abstract

We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This allows morphological similarities to be detected easily, leading to improved morphological segmentation. Our evaluation using the Morpho Challenge datasets (Kurimo et al., 2011a; Kurimo et al., 2011b) shows that our method performs competitively when compared with current state-of-the-art systems.

1 Introduction

Unsupervised morphological segmentation of a text involves learning rules for segmenting words into their morphemes. Morphemes are the smallest meaning-bearing units of words. The learning process is fully unsupervised, using only raw text as input to the learning system. For example, the word respectively is split into the morphemes respect, ive and ly. Many fields, such as machine translation, information retrieval, speech recognition etc., require morphological segmentation, since new words are always being created and storing all the word forms would require a massive dictionary. The task is even more complex when morphologically complicated languages (i.e. agglutinative languages) are considered. The sparsity problem is more severe for morphologically complex languages. Applying morphological segmentation mitigates data sparsity by tackling the issue of out-of-vocabulary (OOV) words.

In this paper, we propose a paradigmatic approach. A morphological paradigm is a pair (StemList, SuffixList) such that each concatenation Stem+Suffix (where Stem ∈ StemList and Suffix ∈ SuffixList) is a valid word form. The learning of morphological paradigms is not novel, as there is existing work in this area such as Goldsmith (2001), Snover et al. (2002), Monson et al. (2009), Can and Manandhar (2009) and Dreyer and Eisner (2011). However, none of these existing approaches addresses learning the hierarchical structure of paradigms.

Hierarchical organisation of words helps capture morphological similarities between words in a compact structure by factoring these similarities through stems, suffixes or prefixes. Our inference algorithm simultaneously infers latent variables (i.e. the morphemes) along with their hierarchical organisation. Most hierarchical clustering algorithms are single-pass: once the hierarchical structure is built, the structure does not change further.

The paper is structured as follows: section 2 gives the related work, section 3 describes the probabilistic hierarchical clustering scheme, section 4 explains the morphological segmentation model by embedding it into the clustering scheme and describes the inference algorithm along with how the morphological segmentation is performed, section 5 presents the experiment settings along with the evaluation scores, and finally section 6 presents a discussion with a comparison with other systems that participated in Morpho Challenge 2009 and 2010.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654–663, Avignon, France, April 23 - 27 2012. © 2012 Association for Computational Linguistics

Figure 1: A sample tree structure (grouping the words walk, walking, talked, talks, quick and quickly).

Figure 2: A segment of a tree with internal nodes Di, Dj, Dk having data points {x1, x2, x3, x4}. The subtree below the internal node Di is called Ti, the subtree below the internal node Dj is Tj, and the subtree below the internal node Dk is Tk.

2 Related Work

We propose a Bayesian approach for learning paradigms in a hierarchy. If we ignore the hierarchical aspect of our learning algorithm, then our
tree below the internal node Dk is Tk . method is similar to the Dirichlet Process (DP) model can be denoted as p(xi |θ) where θ denotes based model of Goldwater et al. (2006). From the parameters of the probabilistic model. this perspective, our method can be understood The marginal probability of data in any node as adding a hierarchical structure learning layer can be calculated as: on top of the DP based learning method proposed ∫ in Goldwater et al. (2006). Dreyer and Eisner p(Dk ) = p(Dk |θ)p(θ|β)dθ (1) (2011) propose an infinite Diriclet mixture model for capturing paradigms. However, they do not The likelihood of data under any subtree is de- address learning of hierarchy. fined as follows: The method proposed in Chan (2006) also learns within a hierarchical structure where La- p(Dk |Tk ) = p(Dk )p(Dl |Tl )p(Dr |Tr ) (2) tent Dirichlet Allocation (LDA) is used to find stem-suffix matrices. However, their work is su- where the probability is defined in terms of left Tl pervised, as true morphological analyses of words and right Tr subtrees. Equation 2 provides a re- are provided to the system. In contrast, our pro- cursive decomposition of the likelihood in terms posed method is fully unsupervised. of the likelihood of the left and the right sub- trees until the leaf nodes are reached. We use the 3 Probabilistic Hierarchical Model marginal probability (Equation 1) as prior infor- The hierarchical clustering proposed in this work mation since the marginal probability bears the is different from existing hierarchical clustering probability of having the data from the left and algorithms in two aspects: right subtrees within a single cluster. 4 Morphological Segmentation • It is not single-pass as the hierarchical struc- ture changes. In our model, data points are words to be clus- tered and each cluster represents a paradigm. In • It is probabilistic and is not dependent on a the hierarchical structure, words will be organised distance metric. 
in such a way that morphologically similar words will be located close to each other to be grouped 3.1 Mathematical Definition in the same paradigms. Morphological similarity In this paper, a hierarchical structure is a binary refers to at least one common morpheme between tree in which each internal node represents a clus- words. However, we do not make a distinction be- ter. tween morpheme types. Instead, we assume that Let a data set be D = {x1 , x2 , . . . , xn } and each word is organised as a stem+suffix combina- T be the entire tree, where each data point xi is tion. located at one of the leaf nodes (see Figure 2). Here, Dk denotes the data points in the branch 4.1 Model Definition Tk . Each node defines a probabilistic model for Let a dataset D consist of words to be analysed, words that the cluster acquires. The probabilistic where each word wi has a latent variable which is 655 the split point that analyses the word into its stem βs βm si and suffix mi : D = {w1 = s1 + m1 , . . . , wn = sn + mn } Ps Gs Gm Pm The marginal likelihood of words in the node k is defined such that: si mi p(Dk ) = p(Sk )p(Mk ) L N = p(s1 , s2 , . . . , sn )p(m1 , m2 , . . . , mn ) The words in each cluster represents a wi paradigm that consists of stems and suffixes. The n hierarchical model puts words sharing the same stems or suffixes close to each other in the tree. Figure 3: The plate diagram of the model, representing Each word is part of all the paradigms on the the generation of a word wi from the stem si and the path from the leaf node having that word to the suffix mi that are generated from Dirichlet processes. root. The word can share either its stem or suffix In the representation, solid-boxes denote that the pro- with other words in the same paradigm. Hence, cess is repeated with the number given on the corner a considerable number of words can be generated of each box. through this approach that may not be seen in the corpus. 
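To make the recursive decomposition of Equation 2 concrete, it can be sketched in a few lines of Python. This is a hypothetical minimal implementation: `Node` and `subtree_likelihood` are illustrative names, and the node marginal p(Dk) of Equation 1 is abstracted as a callable rather than derived from a concrete probabilistic model.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data    # data points (words) under this node
        self.left = left    # left subtree T_l (None for a leaf)
        self.right = right  # right subtree T_r (None for a leaf)

def subtree_likelihood(node, marginal):
    """Equation 2: p(D_k | T_k) = p(D_k) * p(D_l | T_l) * p(D_r | T_r);
    at a leaf, the likelihood is just the marginal of the leaf's data."""
    if node.left is None and node.right is None:
        return marginal(node.data)
    return (marginal(node.data)
            * subtree_likelihood(node.left, marginal)
            * subtree_likelihood(node.right, marginal))

# Toy node marginal p(D_k): a product of per-word probabilities under a
# uniform model (stands in for the integral of Equation 1).
marginal = lambda data: 0.1 ** len(data)

leaf1, leaf2 = Node(["walk"]), Node(["talk"])
root = Node(["walk", "talk"], left=leaf1, right=leaf2)
print(subtree_likelihood(root, marginal))  # ≈ 1e-4
```

In an actual system the `marginal` callable would implement the stem/suffix model of Section 4.1; the recursion itself is independent of that choice.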
We postulate that stems and suffixes are generated independently from each other. Thus, the probability of a word becomes:

p(w = s + m) = p(s) p(m)   (3)

We define two Dirichlet processes to generate stems and suffixes independently:

Gs | βs, Ps ∼ DP(βs, Ps)
Gm | βm, Pm ∼ DP(βm, Pm)
s | Gs ∼ Gs
m | Gm ∼ Gm

where DP(βs, Ps) denotes a Dirichlet process that generates stems. Here, βs is the concentration parameter, which determines the number of stem types generated by the Dirichlet process. The smaller the value of the concentration parameter, the less likely the process is to generate new stem types. In contrast, the larger the value of the concentration parameter, the more likely it is to generate new stem types, yielding a more uniform distribution over stem types. If βs < 1, sparse stems are supported, yielding a more skewed distribution. To support a small number of stem types in each cluster, we chose βs < 1.

Here, Ps is the base distribution. We use the base distribution as a prior probability distribution over morpheme lengths. We model morpheme lengths implicitly through the morpheme letters:

Ps(si) = ∏_{c ∈ si} p(c)   (4)

where c ranges over the letters of si, which are distributed uniformly. Modelling morpheme letters is a way of modelling morpheme length, since shorter morphemes are favoured in order to have fewer factors in Equation 4 (Creutz and Lagus, 2005b). The Dirichlet process DP(βm, Pm) is defined analogously for suffixes. The graphical representation of the entire model is given in Figure 3.

Once the probability distributions G = {Gs, Gm} are drawn from both Dirichlet processes, words can be generated by drawing a stem from Gs and a suffix from Gm. However, we do not attempt to estimate the probability distributions G; instead, G is integrated out. The joint probability of the stems is calculated by integrating out Gs:

p(s1, s2, ..., sL) = ∫ p(Gs) ∏_{i=1..L} p(si|Gs) dGs   (5)

where L denotes the number of stem tokens. The joint probability distribution of the stems can be tackled as a Chinese restaurant process. The Chinese restaurant process introduces dependencies between stems. Hence, the joint probability of the stems S = {s1, ..., sL} becomes:

p(s1, s2, ..., sL) = p(s1) p(s2|s1) ... p(sL|s1, ..., sL−1)
= (Γ(βs) / Γ(L + βs)) · βs^(K−1) · ∏_{i=1..K} Ps(si) · ∏_{i=1..K} (nsi − 1)!   (6)

where K denotes the number of stem types. In the equation, the second and third factors correspond to the case where novel stems are generated for the first time; the last factor corresponds to the case in which stems that have already been generated nsi times previously are generated again. The first factor consists of all the denominators from both cases.

The integration process is applied analogously to the probability distribution Gm for suffixes. Hence, the joint probability of the suffixes M = {m1, ..., mN} becomes:

p(m1, m2, ..., mN) = p(m1) p(m2|m1) ... p(mN|m1, ..., mN−1)
= (Γ(βm) / Γ(N + βm)) · βm^T · ∏_{i=1..T} Pm(mi) · ∏_{i=1..T} (nmi − 1)!   (7)

where T denotes the number of suffix types and nmi is the number of instances of the suffix mi that have already been generated.

Figure 4: A portion of a sample tree.

Following the joint probability distribution of the stems, the conditional probability of a stem given the previously generated stems can be derived as:

p(si | S−si, βs, Ps) = nsi(S−si) / (L − 1 + βs)   if si ∈ S−si
                    = βs · Ps(si) / (L − 1 + βs)   otherwise   (8)

where nsi(S−si) denotes the number of instances of the stem si that have been generated previously, and S−si denotes the stem set excluding the new instance of the stem si.
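Equations 4 and 8 can be sketched together as follows. The helper names are hypothetical, and the uniform letter distribution is assumed to range over the 26 lowercase English letters; a real system would use the alphabet of the training data.

```python
from collections import Counter

ALPHABET_SIZE = 26  # assumption: uniform p(c) over lowercase English letters

def base_prob(morpheme):
    """Base distribution P_s of Equation 4: a uniform letter model, so
    longer morphemes receive geometrically smaller prior probability."""
    return (1.0 / ALPHABET_SIZE) ** len(morpheme)

def conditional_stem_prob(stem, counts, beta_s):
    """Equation 8: probability of a stem given the previously generated
    stems; `counts` holds token counts excluding the new instance."""
    denom = sum(counts.values()) + beta_s    # (L - 1) + beta_s
    if stem in counts:
        return counts[stem] / denom          # seen stem: proportional to count
    return beta_s * base_prob(stem) / denom  # novel stem: backed off to P_s

counts = Counter({"walk": 3, "talk": 1})     # previously generated stem tokens
print(conditional_stem_prob("walk", counts, 0.1))  # 3 / 4.1
print(conditional_stem_prob("jump", counts, 0.1))  # 0.1 * (1/26)**4 / 4.1
```

The same sketch applies to suffixes (Equation 9 below) with βm and Pm in place of βs and Ps.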
The conditional probability of a suffix given the other suffixes that have been previously generated is defined similarly:

p(mi | M−mi, βm, Pm) = nmi(M−mi) / (N − 1 + βm)   if mi ∈ M−mi
                     = βm · Pm(mi) / (N − 1 + βm)   otherwise   (9)

where nmi(M−mi) is the number of instances of mi that have been generated previously, and M−mi is the set of suffixes excluding the new instance of the suffix mi.

A portion of a tree is given in Figure 4. As can be seen in the figure, all words are located at leaf nodes. Therefore, the root node of this subtree consists of the words {plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, consist+ed}.

4.2 Inference

The initial tree is constructed by randomly choosing a word from the corpus and adding it to a randomly chosen position in the tree. When constructing the initial tree, the latent variables are also assigned randomly, i.e. each word is split at a random position (see Algorithm 1).

Algorithm 1 Creating the initial tree
1: input: data D = {w1 = s1 + m1, ..., wn = sn + mn}
2: initialise: root ← D1 where D1 = {w1 = s1 + m1}
3: initialise: c ← n − 1
4: while c >= 1 do
5:   Draw a word wj from the corpus.
6:   Split the word randomly such that wj = sj + mj
7:   Create a new node Dj where Dj = {wj = sj + mj}
8:   Choose a sibling node Dk for Dj
9:   Merge Dnew ← Dj ⊎ Dk
10:  Remove wj from the corpus
11:  c ← c − 1
12: end while
13: output: Initial tree

We use the Metropolis–Hastings algorithm (Hastings, 1970), an instance of the Markov Chain Monte Carlo (MCMC) algorithms, to infer the optimal hierarchical structure along with the morphological segmentation of the words (given in Algorithm 2). During each iteration i, a leaf node Di = {wi = si + mi} is drawn from the current tree structure. The drawn leaf node is removed from the tree. Next, a node Dk is drawn uniformly from the tree to make it a sibling node to Di. In addition to a sibling node, a split point wi = s′i + m′i is drawn uniformly. Next, the node Di = {wi = s′i + m′i} is inserted as a sibling node to Dk. After updating all probabilities along the path to the root, the new tree structure is either accepted or rejected by applying the Metropolis–Hastings update rule. The likelihood of the data under the given tree structure is used as the sampling probability.

Algorithm 2 Inference algorithm
1: input: data D = {w1 = s1 + m1, ..., wn = sn + mn}, initial tree T, initial temperature of the system γ, target temperature of the system κ, temperature decrement η
2: initialise: i ← 1, w ← wi = si + mi, pcur(D|T) ← p(D|T)
3: while γ > κ do
4:   Remove the leaf node Di that has the word wi = si + mi
5:   Draw a split point for the word such that wi = s′i + m′i
6:   Draw a sibling node Dj
7:   Dm ← Di ⊎ Dj
8:   Update pnext(D|T)
9:   if pnext(D|T) >= pcur(D|T) then
10:    Accept the new tree structure
11:    pcur(D|T) ← pnext(D|T)
12:  else
13:    random ∼ Normal(0, 1)
14:    if random < (pnext(D|T)/pcur(D|T))^(1/γ) then
15:      Accept the new tree structure
16:      pcur(D|T) ← pnext(D|T)
17:    else
18:      Reject the new tree structure
19:      Re-insert the node Di at its previous position with the previous split point
20:    end if
21:  end if
22:  w ← wi+1 = si+1 + mi+1
23:  γ ← γ − η
24: end while
25: output: A tree structure where each node corresponds to a paradigm.

We use a simulated annealing schedule to update PAcc:

PAcc = (pnext(D|T) / pcur(D|T))^(1/γ)   (10)

where γ denotes the current temperature, pnext(D|T) denotes the marginal likelihood of the data under the new tree structure, and pcur(D|T) denotes the marginal likelihood of the data under the latest accepted tree structure. If pnext(D|T) > pcur(D|T), the update is accepted (see line 9, Algorithm 2); otherwise, the tree structure is still accepted with probability PAcc (see line 14, Algorithm 2). In our experiments (see section 5) we set γ to 2 initially. The system temperature is reduced in each iteration of the Metropolis–Hastings algorithm:

γ ← γ − η   (11)

Most tree structures are accepted in the earlier stages of the algorithm; however, as the temperature decreases, only tree structures that lead to a considerable improvement in the marginal probability p(D|T) are accepted.
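The acceptance step of Algorithm 2, together with the annealed acceptance probability of Equation 10, can be sketched as follows. All names are illustrative; the sketch works in log space for numerical stability (an implementation choice not stated in the paper), and it draws the comparison variable uniformly from [0, 1), the standard Metropolis–Hastings choice, whereas line 13 of Algorithm 2 as printed draws it from Normal(0, 1).

```python
import math
import random

def accept_new_tree(loglik_next, loglik_cur, gamma, rng=random.random):
    """Annealed MH update: accept with probability
    P_Acc = (p_next / p_cur)**(1/gamma) of Equation 10."""
    if loglik_next >= loglik_cur:
        return True  # an improvement is always accepted (line 9)
    p_acc = math.exp((loglik_next - loglik_cur) / gamma)
    return rng() < p_acc  # a worse tree is accepted with probability P_Acc

# Annealing schedule of Equation 11: gamma shrinks linearly, so worse
# trees become ever harder to accept as sampling proceeds.
gamma, kappa, eta = 2.0, 0.01, 0.0001
loglik_cur = -1000.0
while gamma > kappa:
    loglik_next = loglik_cur + random.uniform(-1.0, 1.0)  # proposed tree
    if accept_new_tree(loglik_next, loglik_cur, gamma):
        loglik_cur = loglik_next                          # keep the proposal
    gamma -= eta
```

In the full sampler the proposal would be the remove/re-insert move described above, and `loglik_next` would come from re-evaluating p(D|T) along the changed path.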
accepted (see line 9, Algorithm 2), otherwise, the tree structure is still accepted with a probability An illustration of sampling a new tree structure of pAcc (see line 14, Algorithm 2). In our is given in Figure 5 and 6. Figure 5 shows that experiments (see section 5) we set γ to 2. The D0 will be removed from the tree in order to sam- system temperature is reduced in each iteration ple a new position on the tree, along with a new of the Metropolis Hastings algorithm: split point of the word. Once the leaf node is re- moved from the tree, the parent node is removed from the tree, as the parent node D5 will consist γ ←γ−η (11) of only one child. Figure 6 shows that D8 is sam- Most tree structures are accepted in the earlier pled to be the sibling node of D0 . Subsequently, stages of the algorithm, however, as the tempera- the two nodes are merged within a new cluster that 658 D6 p(sj |Sroot , βs , Ps ) p(mj |Mroot , βm , Pm ) (13) D7 where Sroot denotes all the stems in Droot and D5 D8 Mroot denotes all the suffixes in Droot . Here p(sj |Sroot , βs , Ps ) is calculated as given below: D0 D1 D2 D3 D4 p(si |Sroot  , βSs , Ps ) = Figure 5: D0 will be removed from the tree.  nsiroot L+βs if si ∈ Sroot (14)  βs ∗Ps (si ) otherwise D6 L+βs D7 Similarly, p(mj |Mroot , βm , Pm ) is calculated D9 as: D8 p(mi |Mroot , β m , Pm ) =  nM root N +βm if mi ∈ Mroot mi D1 D2 D3 D4 D0 (15)  βm ∗Pm (mi ) otherwise N +βm Figure 6: D8 is sampled to be the sibling of D0 . 4.3.2 Multiple Split Points In order to discover words with multiple split introduces a new node D9 . points, we propose a hierarchical segmentation 4.3 Morphological Segmentation where each segment is split further. The rules for generating multiple split points is given by the fol- Once the optimal tree structure is inferred, along lowing context free grammar: with the morphological segmentation of words, any novel word can be analysed. 
For the segmen- tation of novel words, the root node is used as it w ← s1 m1 |s2 m2 (16) contains all stems and suffixes which are already s1 ← s m|s s (17) extracted from the training data. Morphological s2 ← s (18) segmentation is performed in two ways: segmen- tation at a single point and segmentation at multi- m1 ← m m (19) ple points. m2 ← s m|m m (20) 4.3.1 Single Split Point In order to find single split point for the mor- Here, s is a pre-terminal node that generates all phological segmentation of a word, the split point the stems from the root node. And similarly, m is yielding the maximum probability given inferred a pre-terminal node that generates all the suffixes stems and suffixes is chosen to be the final analy- from the root node. First, using Equation 16, the sis of the word: word (e.g. housekeeper) is split into s1 m1 (e.g. housekeep+er) or s2 m2 (house+keeper). The first segment is regarded as a stem, and the second arg max p(wi = sj + mj |Droot , βm , Pm , βs , Ps ) j segment is either a stem or a suffix, consider- (12) ing the probability of having a compound word. where Droot refers to the root of the entire tree. Equation 12 is used to decide whether the sec- Here, the probability of a segmentation of a ond segment is a stem or a suffix. At the sec- given word given Droot is calculated as given be- ond segmentation level, each segment is split once low: more. If the first production rule is followed in the first segmentation level, the first segment s1 p(wi = sj + mj |Droot , βm , Pm , βs , Ps ) = can be analysed as s m (e.g. housekeep+∅) or s s 659 !"#$%&%%'%( ! !"#$% &%%'%( !"#$% ) &%%' %( Figure 7: An example that depicts how the word housekeeper can be analysed further to find more split Figure 8: Marginal likelihood convergence for datasets points. of size 16K and 22K words. (e.g. house+keep) (Equation 17). 
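The single-split scoring of Equations 12–15, which also underlies the segment-type decisions above, amounts to trying every split point of a word and scoring the stem and suffix against the root-node counts. A minimal sketch under stated assumptions (hypothetical names; the base distributions are reduced to the same uniform 26-letter model as in Equation 4):

```python
def split_prob(word, i, stem_counts, suffix_counts, beta_s=0.1, beta_m=0.1):
    """Equation 13: p(w = s + m | D_root) as the product of the stem and
    suffix probabilities of Equations 14-15, from root-node counts."""
    stem, suffix = word[:i], word[i:]
    L, N = sum(stem_counts.values()), sum(suffix_counts.values())
    # Seen morphemes are scored by count; unseen ones back off to the prior.
    p_s = (stem_counts.get(stem) or beta_s * (1 / 26) ** len(stem)) / (L + beta_s)
    p_m = (suffix_counts.get(suffix) or beta_m * (1 / 26) ** len(suffix)) / (N + beta_m)
    return p_s * p_m

def segment(word, stem_counts, suffix_counts):
    """Equation 12: choose the split point maximising the probability."""
    best = max(range(1, len(word) + 1),
               key=lambda i: split_prob(word, i, stem_counts, suffix_counts))
    return word[:best], word[best:]

# Toy root-node counts; "" stands for the empty suffix.
stems = {"walk": 5, "talk": 3}
suffixes = {"ed": 4, "ing": 2, "": 2}
print(segment("walked", stems, suffixes))   # ('walk', 'ed')
print(segment("talking", stems, suffixes))  # ('talk', 'ing')
```

The two-level analysis of Section 4.3.2 would apply the same scoring recursively to each resulting segment.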
The decision about which production rule to apply is made using:

s1 ← s s if p(s|S, βs, Ps) > p(m|M, βm, Pm); s1 ← s m otherwise   (21)

where S and M denote all the stems and suffixes in the root node.

Following the same production rule, the second segment m1 can only be analysed as m m (e.g. er+∅). We postulate that words cannot have more than two stems and that suffixes always follow stems. We do not allow any prefixes, circumfixes, or infixes. Therefore, the first production rule can output two different analyses: s m m m and s s m m (e.g. housekeep+er and house+keep+er).

On the other hand, if the word is analysed as s2 m2 (e.g. house+keeper), then s2 cannot be analysed further (e.g. house). The second segment m2 can be analysed further, as s m (stem+suffix) (e.g. keep+er, keeper+∅) or m m (suffix+suffix). The decision about which production rule to apply is made as follows:

m2 ← s m if p(s|S, βs, Ps) > p(m|M, βm, Pm); m2 ← m m otherwise   (22)

Thus, the second production rule yields two different analyses: s s m and s m m (e.g. house+keep+er or house+keeper).

5 Experiments & Results

Two sets of experiments were performed for the evaluation of the model. In the first set of experiments, each word is split at a single point, giving a single stem and a single suffix. In the second set of experiments, potentially multiple split points are generated by splitting each stem and suffix once more, where possible.

Morpho Challenge (Kurimo et al., 2011b) provides a well-established evaluation framework that additionally allows comparing our model on a range of languages. In both sets of experiments, the Morpho Challenge 2010 dataset is used (Kurimo et al., 2011b). Experiments are performed for English, where the dataset consists of 878,034 words. Although the dataset provides word frequencies, we have not used any frequency information. However, for training our model, we only chose words with a frequency greater than 200.

In our experiments, we used dataset sizes of 10K, 16K and 22K words. However, for the final evaluation, we trained our models on 22K words. We were unable to complete the experiments with larger training datasets due to memory limitations. We plan to report this in future work. Once the tree is learned by the inference algorithm, the final tree is used for the segmentation of the entire dataset. Several experiments are performed for each setting, where the setting varies in the tree size and the model parameters. The model parameters are the concentration parameters β = {βs, βm} of the Dirichlet processes. The concentration parameters set for the experiments are 0.1, 0.2, 0.02, 0.001, 0.002.

In all experiments, the initial temperature of the system is set to γ = 2 and it is reduced to the temperature γ = 0.01 with decrements η = 0.0001. Figure 8 shows how the log likelihoods of the trees of size 16K and 22K converge in time (where the time axis refers to sampling iterations).

Since different training sets will lead to different tree structures, each experiment is repeated three times, keeping the experiment setting the same.

5.1 Experiments with Single Split Points

In the first set of experiments, words are split into a single stem and suffix. During the segmentation, Equation 12 is used to determine the split position of each word. The evaluation scores are given in Table 1. The highest F-measure obtained is 51.28%, with the dataset of 22K words. The scores are noticeably higher with the largest training set.

Table 1: Highest evaluation scores of the single split point experiments obtained from the trees with 10K, 16K, and 22K words.
Data Size | P(%)  | R(%)  | F(%)  | βs, βm
10K       | 81.48 | 33.03 | 47.01 | 0.1, 0.1
16K       | 86.48 | 35.13 | 50.02 | 0.002, 0.002
22K       | 89.04 | 36.01 | 51.28 | 0.002, 0.002

5.2 Experiments with Multiple Split Points

The evaluation scores of the experiments with multiple split points are given in Table 2. The highest F-measure obtained is 62.56%, with the dataset of 22K words. As for single split points, the scores are noticeably higher with the largest training set. For both single and multiple segmentation, the same inferred tree has been used.

Table 2: Evaluation scores of the multiple split point experiments obtained from the trees with 10K, 16K, and 22K words.
Data Size | P(%)  | R(%)  | F(%)  | βs, βm
10K       | 62.45 | 57.62 | 59.98 | 0.1, 0.1
16K       | 67.80 | 57.72 | 62.36 | 0.002, 0.002
22K       | 68.71 | 62.56 | 62.56 | 0.001, 0.001

5.3 Comparison with Other Systems

For all our evaluation experiments using Morpho Challenge 2010 (English and Turkish) and Morpho Challenge 2009 (English), we used 22k words for training. For each evaluation, we randomly chose 22k words for training and ran our MCMC inference procedure to learn the model. We generated 3 different models by choosing 3 different randomly generated training sets, each consisting of 22k words. We report the best results over these 3 models, due to the small (22k word) datasets used; using larger datasets would have resulted in less variation and better results.

We compare our system with the other participant systems in Morpho Challenge 2010. The results are given in Table 6 (Virpioja et al., 2011). Since the model is evaluated using the official (hidden) Morpho Challenge 2010 evaluation dataset, for which we submitted our system to the organisers for evaluation, the scores differ from the ones presented in Table 1 and Table 2.

We also report experiments with the Morpho Challenge 2009 English dataset, which consists of 384,904 words. Our results and the results of the other participant systems in Morpho Challenge 2009 are given in Table 3 (Kurimo et al., 2009).

Table 3: Comparison with other unsupervised systems that participated in Morpho Challenge 2009 for English.
System                                   | P(%)  | R(%)  | F(%)
Allomorf (Virpioja et al., 2009)         | 68.98 | 56.82 | 62.31
Morf. Base. (Creutz and Lagus, 2002)     | 74.93 | 49.81 | 59.84
PM-Union (Monson et al., 2009)           | 55.68 | 62.33 | 58.82
Lignos (Lignos et al., 2009)             | 83.49 | 45.00 | 58.48
Prob. Clustering (multiple)              | 57.08 | 57.58 | 57.33
PM-mimic (Monson et al., 2009)           | 53.13 | 59.01 | 55.91
MorphoNet (Bernhard, 2009)               | 65.08 | 47.82 | 55.13
Rali-cof (Lavallée and Langlais, 2009)   | 68.32 | 46.45 | 55.30
CanMan (Can and Manandhar, 2009)         | 58.52 | 44.82 | 50.76

It should be noted that we only present the top systems that participated in Morpho Challenge 2009.
If all the systems are considered, our system comes 5th out of 16 systems. For all our evaluation experiments using Mor- The problem of morphologically rich lan- pho Challenge 2010 (English and Turkish) and guages is not our priority within this research. Morpho Challenge 2009 (English), we used 22k Nevertheless, we provide evaluation scores on words for training. For each evaluation, we ran- Turkish. The Turkish dataset consists of 617,298 domly chose 22k words for training and ran our words. We chose words with frequency greater MCMC inference procedure to learn our model. than 50 for Turkish since the Turkish dataset is not We generated 3 different models by choosing 3 large enough. The results for Turkish are given in different randomly generated training sets each Table 4. Our system comes 3rd out of 7 systems. consisting of 22k words. The results are the best results over these 3 models. We are reporting the 6 Discussion best results out of the 3 models due to the small (22k word) datasets used. Use of larger datasets The model can easily capture common suffixes would have resulted in less variation and better such as -less, -s, -ed, -ment, etc. Some sample tree results. nodes obtained from trees are given in Table 6. 661 System P(%) R(%) F(%) System P(%) R(%) F(%) Morf. CatMAP 79.38 31.88 45.49 Base Inference1 80.77 53.76 64.55 Aggressive Comp. 55.51 34.36 42.45 Iterative Comp.1 80.27 52.76 63.67 Prob. Clustering (multiple) 72.36 25.81 38.04 Aggressive Comp.1 71.45 52.31 60.40 Iterative Comp. 68.69 21.44 32.68 Nicolas2 67.83 53.43 59.78 Nicolas 79.02 19.78 31.64 Prob. Clustering (multiple) 57.08 57.58 57.33 Morf. Base. 89.68 17.78 29.67 Morf. Baseline3 81.39 41.70 55.14 Base Inference 72.81 16.11 26.38 Prob. Clustering (single) 70.76 36.51 48.17 Morf. CatMAP4 86.84 30.03 44.63 Table 4: Comparison with other unsupervised systems 1 Lignos (2010) that participated in Morpho Challenge 2010 for Turk- 2 Nicolas et al. (2010) ish. 
3 Creutz and Lagus (2002) 4 Creutz and Lagus (2005a) regard+less, base+less, shame+less, bound+less, harm+less, regard+ed, relent+less Table 6: Comparison of our model with other unsuper- solve+d, high+-priced, lower+s, lower+-level, vised systems that participated in Morpho Challenge high+-level, lower+-income, histor+ians 2010 for English. pre+mise, pre+face, pre+sumed, pre+, pre+gnant base+ment, ail+ment, over+looked, predica+ment, deploy+ment, compart+ment, embodi+ment Sometimes similarities may not yield a valid anti+-fraud, anti+-war, anti+-tank, anti+-nuclear, analysis of words. For example, the prefix pre- anti+-terrorism, switzer+, anti+gua, switzer+land leads the words pre+mise, pre+sumed, pre+gnant sharp+ened, strength+s, tight+ened, strength+ened, black+ened to be analysed wrongly, whereas pre- is a valid inspir+e, inspir+ing, inspir+ed, inspir+es, earn+ing, prefix for the word pre+face. Another nice fea- ponder+ing ture about the model is that compounds are easily downgrade+s, crash+ed, crash+ing, lack+ing, captured through common stems: e.g. doubt+fire, blind+ing, blind+, crash+, compris+ing, com- bon+fire, gun+fire, clear+cut. pris+es, stifl+ing, compris+ed, lack+s, assist+ing, blind+ed, blind+er, 7 Conclusion & Future Work Table 5: Sample tree nodes obtained from various trees. In this paper, we present a novel probabilis- tic model for unsupervised morphology learn- As seen from the table, morphologically similar ing. The model adopts a hierarchical structure words are grouped together. Morphological sim- in which words are organised in a tree so that ilarity refers to at least one common morpheme morphologically similar words are located close between words. For example, the words high- to each other. 
In hierarchical clustering, tree-cutting would be very useful, but it is not addressed in the current paper. We used just the root node as a morpheme lexicon to apply segmentation. Clearly, adding tree-cutting would improve the accuracy of the segmentation and would help us identify paradigms with higher accuracy. However, the segmentation accuracy obtained without using tree-cutting provides a very useful indicator of whether this approach is promising, and the experimental results show that this is indeed the case.

In the current model, we did not use any syntactic information, only words. POS tags could be utilised to group words which are both morphologically and syntactically similar.

References

Delphine Bernhard. 2009. Morphonet: Exploring the use of community structure for unsupervised morpheme analysis. In Working Notes for the CLEF 2009 Workshop, September.
Burcu Can and Suresh Manandhar. 2009. Clustering morphological paradigms using syntactic categories. In Working Notes for the CLEF 2009 Workshop, September.
Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69–78, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, MPL '02, pages 21–30, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 106–113.
Mathias Creutz and Krista Lagus. 2005b. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report A81.
Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616–627, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.
John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18.
W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2009. ... ments, CLEF'09, pages 578–597, Berlin, Heidelberg. Springer-Verlag.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011a. Morpho Challenge 2009. http://research.ics.tkk.fi/events/morphochallenge2009/, June.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011b. Morpho Challenge 2010. http://research.ics.tkk.fi/events/morphochallenge2010/, June.
Jean François Lavallée and Philippe Langlais. 2009. Morphological acquisition by formal analogy. In Working Notes for the CLEF 2009 Workshop, September.
Constantine Lignos, Erwin Chan, Mitchell P. Marcus, and Charles Yang. 2009. A rule-based unsupervised morphology learning framework. In Working Notes for the CLEF 2009 Workshop, September.
Constantine Lignos. 2010. Learning from unseen data. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38, Aalto University, Espoo, Finland.
Christian Monson, Kristy Hollingshead, and Brian Roark. 2009. Probabilistic ParaMor. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, September.
Lionel Nicolas, Jacques Farré, and Miguel A. Molinero. 2010. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 39–43, Aalto University, Espoo, Finland.
Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 11–20, Morristown, NJ, USA. ACL.
Sami Virpioja, Oskar Kohonen, and Krista Lagus. 2009. Unsupervised morpheme discovery with Allomorfessor. In Working Notes for the CLEF 2009 Workshop, September.
Sami Virpioja, Ville T. Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. Traitement Automatique des Langues.
Overview and results of morpho challenge 2009. In Proceedings of the 10th cross-language eval- uation forum conference on Multilingual infor- mation access evaluation: text retrieval experi- 663 Modeling Inflection and Word-Formation in SMT Alexander Fraser∗ Marion Weller∗ Aoife Cahill† Fabienne Cap∗ ∗ † Institut f¨ur Maschinelle Sprachverarbeitung Educational Testing Service Universit¨at Stuttgart Princeton, NJ 08541 D–70174 Stuttgart, Germany USA {fraser,wellermn,cap}@ims.uni-stuttgart.de

[email protected]

Abstract

The current state of the art in statistical machine translation (SMT) suffers from issues of sparsity and inadequate modeling power when translating into morphologically rich languages. We model both inflection and word-formation for the task of translating into German. We translate from English words to an underspecified German representation and then use linear-chain CRFs to predict the fully specified German representation. We show that improved modeling of inflection and word-formation leads to improved SMT.

1 Introduction

Phrase-based statistical machine translation (SMT) suffers from problems of data sparsity with respect to inflection and word-formation which are particularly strong when translating into a morphologically rich target language such as German. We address the problem of inflection prediction by first translating to a stem-based representation and then using a second process to inflect these stems. We study several models for doing this, including strongly lexicalized models, unlexicalized models using linguistic features, and models combining the strengths of both approaches. We address the problem of word-formation for compounds in German by translating from English into German word parts, and then determining whether to merge these parts to form compounds.

We make the following new contributions: (i) we introduce the first SMT system combining inflection prediction with synthesis of portmanteaus and compounds. (ii) For inflection, we compare the mostly unlexicalized prediction of linguistic features (with a subsequent surface form generation step) with the direct prediction of surface forms, and show that both approaches have complementary strengths. (iii) We combine the advantages of the prediction of linguistic features with the prediction of surface forms, implementing this in a CRF framework which improves on a standard phrase-based SMT baseline. (iv) We develop separate (but related) procedures for inflection prediction and for word-formation (compounds and portmanteaus), in contrast with most previous work, which usually approaches both problems either as inflectional problems or as word-formation problems.

We evaluate on the end-to-end SMT task of translating from English to German of the 2009 ACL Workshop on SMT. We achieve BLEU score increases on both the test set and the blind test set.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 664–674, Avignon, France, April 23-27 2012. c 2012 Association for Computational Linguistics

2 Overview of the translation process for inflection prediction

The work we describe is focused on generalizing phrase-based statistical machine translation to better model German NPs and PPs. We particularly want to ensure that we can generate novel German NPs, where by novel we mean that the (inflected) realization is not present in the parallel German training data used to build the SMT system, and hence cannot be produced by our baseline (a standard phrase-based SMT system). We first present our system for dealing with the difficult problem of inflection in German, including the inflection-dependent phenomenon of portmanteaus. Later, after performing an extensive analysis of this system, we extend it to model compounds, a highly productive phenomenon in German (see Section 8).

The key linguistic knowledge sources that we use are morphological analysis and generation of German based on SMOR, a morphological analyzer/generator of German (Schmid et al., 2004), and the BitPar parser, a state-of-the-art parser of German (Schmid, 2004).

2.1 Issues of inflection prediction

In order to ensure coherent German NPs, we model linguistic features of each word in an NP. We model case, gender, and number agreement, and whether or not the word is in the scope of a determiner (such as a definite article), which we label in-weak-context (this linguistic feature is necessary to determine the type of inflection of adjectives and other words: strong, weak, mixed).

This is a diverse group of features. The number of a German noun can often be determined given only the English source word. The gender of a German noun is innate and often difficult to determine given only the English source word. Case is a function of the slot in the subcategorization frame of the verb (or preposition). There is agreement in all of these features in an NP: for instance, the number of an article or adjective is determined by the head noun, while the type of inflection of an adjective is determined by the choice of article.

We can have a large number of surface forms. For instance, English blue can be translated as German blau, blaue, blauer, blaues, or blauen. We predict which form is correct given the context, and our system can generate forms not seen in the training data. We follow a two-step process: in step 1 we translate to blau (the stem); in step 2 we predict features and generate the inflected form.[1]

[1] E.g., case=nominative, gender=masculine, number=singular, in-weak-context=true; inflected: blaue.

2.2 Procedure

We begin building an SMT system by parsing the German training data with BitPar. We then extract morphological features from the parse. Next, we look up the surface forms in the SMOR morphological analyzer, using the morphological features in the parse to disambiguate the set of possible SMOR analyses. Finally, we output the "stems" of the German text, with the addition of markup taken from the parse (discussed in Section 2.3).

We then build a standard Moses system translating from English to German stems. We obtain a sequence of stems and POS[2] from this system, then predict the correct inflection using a sequence model, and finally generate surface forms.

[2] We use an additional target factor to obtain the coarse POS for each stem, applying a 7-gram POS model. Koehn and Hoang (2007) showed that the use of a POS factor only results in negligible BLEU improvements, but we need access to the POS in our inflection prediction models.

2.3 German Stem Markup

The translation process consists of two major steps. The first step is translation of English words to German stems, which are enriched with some inflectional markup. The second step is the full inflection of these stems (plus markup) to obtain the final sequence of inflected words. The purpose of the additional German inflectional markup is to strongly improve prediction of inflection in the second step through the addition of markup to the stems in the first step.

In general, all features to be predicted are stripped from the stemmed representation because they are subject to agreement restrictions of a noun or prepositional phrase (such as case of nouns, or all features of adjectives). However, we need to keep all morphological features that are not dependent on, and thus not predictable from, the (German) context; they serve as known input for the inflection prediction model. We now describe this markup in detail.

Nouns are marked with gender and number: we consider the gender of a noun part of its stem, whereas number is a feature which we can obtain from English nouns.

Personal pronouns have number and gender annotation, and are additionally marked as nominative or not-nominative, because English pronouns are marked for this (except for you).

Prepositions are marked with the case their object takes: this moves some of the difficulty in predicting case from the inflection prediction step to the stem translation step. Since the choice of case in a PP is often determined by the PP's meaning (and there are often different meanings possible given different case choices), it seems reasonable to make this decision during stem translation.

Verbs are represented using their inflected surface form. Having access to inflected verb forms has a positive influence on case prediction in the second step through subject-verb agreement.

Articles are reduced to their stems (the stem itself makes the definite/indefinite distinction clear, but lemmatizing involves removing markings of the case, gender and number features).

Other words are also represented by their stems (except for words not covered by SMOR, where surface forms are used instead).

3 Portmanteaus

Portmanteaus are a word-formation phenomenon dependent on inflection. As we have discussed, standard phrase-based systems have problems picking a definite article with the correct case, gender and number (typically due to sparsity in the language model; e.g., a noun which was never before seen in dative case will often not receive the correct article). In German, portmanteaus increase this sparsity further, as they are compounds of prepositions and articles which must agree with a noun.

We adopt the linguistically strict definition of the term portmanteau: the merging of two function words.[3] We treat this phenomenon by splitting the component parts during training and re-merging during generation. Specifically for German, this requires splitting the words which have German POS tag APPRART into an APPR (preposition) and an ART (article). Merging is restricted: the article must be definite and singular,[4] and the preposition can only take accusative or dative case. Some prepositions allow for merging with an article only for certain noun genders; for example, the preposition inDative is only merged with the following article if the following noun is of masculine or neuter gender. The definite article must be inflected before making a decision about whether to merge a preposition and the article into a portmanteau. See Table 1 for examples.

[3] Some examples are: zum (to the) = zu (to) + dem (the) [German], du (from the) = de (from) + le (the) [French], or al (to the) = a (to) + el (the) [Spanish].
[4] This is why the preposition + article in Table 2 remain unmerged.

gloss     | input                     | inflected | merged
in        | in<APPR><Dat>             | in        | im
the       | die<+ART><Def>            | dem       |
contrast  | Gegensatz<+NN><Masc><Sg>  | Gegensatz | Gegensatz
to        | zu<APPR><Dat>             | zu        | zur
the       | die<+ART><Def>            | der       |
animated  | lebhaft<+ADJ><Pos>        | lebhaften | lebhaften
debate    | Debatte<+NN><Fem><Sg>     | Debatte   | Debatte

Table 1: Re-merging of prepositions and articles after inflection to form portmanteaus; in dem means in the.

4 Models for Inflection Prediction

We present 5 procedures for inflection prediction using supervised sequence models. The first two procedures use simple n-gram models over fully inflected surface forms.

1. Surface with no features is presented with an underspecified input (a sequence of stems) and returns the most likely inflected sequence.

2. Surface with case, number, gender is a hybrid system giving the surface model access to linguistic features. In this system, prepositions have additionally been labeled with the case they mark (in both the underspecified input and the fully specified output the sequence model is built on), and gender and number markup is also available.

The rest of the procedures predict morphological features (which are input to a morphological generator) rather than surface words. We have developed a two-stage process for predicting fully inflected surface forms. The first stage takes a stem and predicts four morphological features for it, based on the surrounding context: case, gender, number and type of inflection. We experiment with a number of models for doing this. The second stage takes the stems marked with the morphological features predicted in the first stage and uses a morphological generator to generate the full surface form. For the second stage, a modified version of SMOR (Schmid et al., 2004) is used which, given a stem annotated with morphological features, generates exactly one surface form.

We now introduce our first linguistic feature prediction systems, which we call joint sequence models (JSMs). These are standard language models where the "word" tokens are represented not as surface forms, but using POS and features. In testing, we supply the input as a sequence in underspecified form, where some of the features are specified in the stem markup (for instance, POS=Noun, gender=masculine, number=plural), and then use Viterbi search to find the most probable fully specified form (for instance, POS=Noun, gender=masculine, number=plural, case=nominative, in-weak-context=true).[5]

[5] Joint sequence models are a particularly simple HMM. Unlike the HMMs used for POS tagging, an HMM as used here has only a single emission possibility for each state, with probability 1. The states in the HMM are the fully specified representations; the emissions are the stems+markup (the underspecified representation).

3. Single joint sequence model on features. We illustrate the different stages of inflection prediction when using a joint sequence model. The stemmed input sequence (cf. Section 2.3) contains several features that will be part of the input to the inflection prediction. With the exception of verbs and prepositions, the representation for feature prediction is based on POS tags.

As gender and number are given by the heads of noun phrases and prepositional phrases, and the expected type of inflection is set by articles, the model has sufficient information to compute values for these features and there is no need to know the actual words. In contrast, the prediction of case is more difficult, as it largely depends on the content of the sentence (e.g., which phrase is object and which phrase is subject). Assuming that verbs and prepositions indicate subcategorization frames, the model is provided crucial information for the prediction of case by keeping verbs (recall that verbs are produced by the stem translation system in their inflected form) and prepositions (the prepositions also have case markup) instead of replacing them with their tags.

After having predicted a single label with values for all features, an inflected word form for the stem and the features is generated. The prediction steps are illustrated in Table 2.

decoder output           | input to prediction       | output of prediction                   | inflected form | gloss
haben<VAFIN>             | haben-V                   | haben-V                                | haben          | have
Zugang<+NN><Masc><Sg>    | NN-Sg-Masc                | NN-Masc.Acc.Sg.in-weak-context=false   | Zugang         | access
zu<APPR><Dat>            | APPR-zu-Dat               | APPR-zu-Dat                            | zu             | to
die<+ART><Def>           | ART-in-weak-context=true  | ART-Neut.Dat.Pl.in-weak-context=true   | den            | the
betreffend<+ADJ><Pos>    | ADJA                      | ADJA-Neut.Dat.Pl.in-weak-context=true  | betreffenden   | respective
Land<+NN><Neut><Pl>      | NN-Pl-Neut                | NN-Neut.Dat.Pl.in-weak-context=true    | Ländern        | countries

Table 2: Overview: inflection prediction steps using a single joint sequence model. All words except verbs and prepositions are replaced by their POS tags in the input. Verbs are inflected in the input ("haben", meaning "have" as in "they have", in the example). Prepositions are lexicalized ("zu" in the example) and indicate which case value they mark ("Dat", i.e., dative, in the example).

4. Using four joint sequence models (one for each linguistic feature). Here the four linguistic feature values are predicted separately. The assumption that the different linguistic features can be predicted independently of one another is a reasonable linguistic assumption to make given the additional German markup that we use. By splitting the inflection prediction problem into 4 component parts, we end up with 4 simpler models which are less sensitive to data sparseness.

Each linguistic feature is modeled independently (by a JSM) and has a different input representation based on the previously described markup. The input consists of a sequence of coarse POS tags and, for those stems that are marked up with the relevant feature, its value. Finally, we combine the predicted features together to produce the same final output as the single joint sequence model, and then generate each surface form using SMOR.

5. Using four CRFs (one for each linguistic feature). The sequence models already presented are limited to the n-gram feature space, and those that predict linguistic features are not strongly lexicalized. Toutanova et al. (2008) use an MEMM, which allows the integration of a wide variety of feature functions. We also wanted to experiment with additional feature functions, and so we train 4 separate linear-chain CRF[6] models on our data (one for each linguistic feature we want to predict). We chose CRFs over MEMMs to avoid the label bias problem (Lafferty et al., 2001).

[6] We use the Wapiti toolkit (Lavergne et al., 2010) on 4 x 12-core Opteron 6176 2.3 GHz machines with 256 GB RAM to train our CRF models. Training a single CRF model on our data was not tractable, so we use one for each linguistic feature.

The CRF feature functions, for each German word wi, are given in Table 3. The common feature functions are used in all models, while each of the 4 separate models (one for each linguistic feature) includes the context of only that linguistic feature. We use L1 regularization to eliminate irrelevant feature functions; the regularization parameter is optimized on held-out data.

Common          | lemma_{wi-5..wi+5}, tag_{wi-7..wi+7}
Case            | case_{wi-5..wi+5}
Gender          | gender_{wi-5..wi+5}
Number          | number_{wi-5..wi+5}
in-weak-context | in-weak-context_{wi-5..wi+5}

Table 3: Feature functions used in CRF models (feature functions are binary indicators of the pattern).

5 Experimental Setup

To evaluate our end-to-end system, we perform the well-studied task of news translation, using the Moses SMT package. We use the English/German data released for the 2009 ACL Workshop on Machine Translation shared task on translation.[7] There are 82,740 parallel sentences from news-commentary09.de-en and 1,418,115 parallel sentences from europarl-v4.de-en. The monolingual data contains 9.8M sentences.[8]

[7] http://www.statmt.org/wmt09/translation-task.html
[8] However, we reduced the monolingual data (only) by retaining only one copy of each unique line, which resulted in 7.55M sentences.

To build the baseline, the data was tokenized using the Moses tokenizer and lowercased. We use GIZA++ to generate alignments, by running 5 iterations of Model 1, 5 iterations of the HMM model, and 4 iterations of Model 4. We symmetrize using the "grow-diag-final-and" heuristic. Our Moses systems use default settings. The LM uses the monolingual data and is trained as a five-gram[9] using the SRILM toolkit (Stolcke, 2002). We run MERT separately for each system. The recaser is the same for all systems: the standard recaser supplied with Moses, trained on all German training data. The dev set is wmt-2009-a and the test set is wmt-2009-b, and we report end-to-end case-sensitive BLEU scores against the unmodified reference SGML file. The blind test set used is wmt-2009-blind (all lines).

[9] Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher-order n-grams, pruning defaults.

In developing our inflection prediction systems (and making such decisions as the n-gram order used), we worked on the so-called "clean data" task: predicting the inflection of stemmed reference sentences (rather than MT output). We used the 2000-sentence dev-2006 corpus for this task.

Our contrastive systems consist of two steps: the first is a translation step using a similar Moses system (except that the German side is stemmed, with the markup indicated in Section 2.3), and the second is inflection prediction as described previously in the paper. To derive the stem+markup representation we first parse the German training data and then produce the stemmed representation. We then build a system for translating from English words to German stems (the stem+markup representation) on the same data, so the German side of the parallel data and the German language modeling use the stem+markup representation. Likewise, MERT is performed using references which are in the stem+markup representation.

To train the inflection prediction systems, we use the monolingual data. The basic surface form model is trained on lowercased surface forms; the hybrid surface form model with features is trained on lowercased surface forms annotated with markup. The linguistic feature prediction systems are trained on the monolingual data processed as described previously (see Table 2).

Our JSMs are trained using the SRILM toolkit. We use the SRILM disambig tool for predicting inflection, which takes a "map" that specifies the set of fully specified representations that each underspecified stem can map to. For surface form models, it specifies the mapping from stems to lowercased surface forms (or surface forms with markup for the hybrid surface model).

6 Results for Inflection Prediction

We build two different kinds of translation system: the baseline, and the stem translation system (where MERT is used to train the system to produce a stem+markup sequence which agrees with the stemmed reference of the dev set). In this section we present the end-to-end translation results for the different inflection prediction models defined in Section 4; see Table 4.

1 | baseline                                             | 14.16
2 | unigram surface (no features)                        |  9.97
3 | surface (no features)                                | 14.26
4 | surface (with case, number, gender features)         | 14.58
5 | 1 JSM morphological features                         | 14.53
6 | 4 JSMs morphological features                        | 14.29
7 | 4 CRFs morphological features, lexical information   | 14.72

Table 4: BLEU scores (detokenized, case sensitive) on the development test set wmt-2009-b.

If we translate from English into a stemmed German representation and then apply a unigram stem-to-surface-form model to predict the surface form, we achieve a BLEU score of 9.97 (line 2). This is only presented for comparison.

The baseline[10] is 14.16 (line 1). We compare this with a 5-gram sequence model[11] that predicts surface forms without access to morphological features, resulting in a BLEU score of 14.26. Introducing morphological features (case on prepositions, number and gender on nouns) increases the BLEU score to 14.58, which is in the same range as the single JSM system predicting all linguistic features at once.

[10] This is a better case-sensitive score than the baselines on wmt-2009-b in experiments by top performers Edinburgh and Karlsruhe at the shared task. We use Moses with default settings.
[11] Note that we use a different set, the "clean data" set, to determine the choice of n-gram order, see Section 7. We use a 5-gram for surface forms and a 4-gram for JSMs, and the same smoothing (Kneser-Ney, add-1 for unigrams, default pruning).

This result shows that the mostly unlexicalized single JSM can produce results competitive with direct surface form prediction, despite not having access to a model of inflected forms, which is the desired final output. This strongly suggests that the prediction of morphological features can be used to achieve additional generalization over direct surface form prediction. When comparing the simple direct surface form prediction (line 3) with the hybrid system enriched with number, gender and case (line 4), it becomes evident that feature markup can also aid surface form prediction.

Since the single JSM has no access to lexical information, we used a language model to score different feature predictions: for each sentence of the development set, the 100-best feature predictions were inflected and scored with a language model. We then optimized weights for the two scores LM (language model on surface forms) and FP (feature prediction, the score assigned by the JSM). This method disprefers feature predictions with a top FP score if the inflected sentence obtains a bad LM score, and likewise disfavors low-ranked feature predictions with a high LM score. The prediction of case is the most difficult given no lexical information, so scoring different prediction possibilities on inflected words is helpful. An example is when the case of a noun phrase leads to an inflected phrase which never occurs in the (inflected) language model (e.g., case=genitive vs. case=other). Applying this method to the single JSM leads to a negligible improvement (14.53 vs. 14.56). Using the n-best output of the stem translation system did not lead to any improvement.

The comparison between different feature prediction models is also illustrative. Performance decreases somewhat when using individual joint sequence models (one for each linguistic feature) compared to one single model (14.29, line 6). The framework using the individual CRFs for each linguistic feature performs best (14.72, line 7). The CRF framework combines the advantages of surface form prediction and linguistic feature prediction by using feature functions that effectively cover the feature function spaces used by both forms of prediction. The performance of the CRF models is a statistically significant improvement[12] (p < 0.05) over the baseline. We also tried CRFs with bilingual features (projected from English parses via the alignment output by Moses), but obtained only a small improvement of 0.03, probably because the required information is transferred in our stem markup (a poor improvement beyond monolingual features is also consistent with previous work, see Section 8.3). Details are omitted due to space.

[12] We used Kevin Gimpel's implementation of pairwise bootstrap resampling with 1000 samples.

We further validated our results by translating the blind test set from wmt-2009, which we had never looked at in any way. Here we also had a statistically significant difference between the baseline and the CRF-based prediction; the scores were 13.68 and 14.18.

7 Analysis of Inflection-based System

Stem Markup. The first step of translating from English to German stems (with the markup we previously discussed) is substantially easier than translating directly to inflected German: we see BLEU scores on stems+markup that are over 2.0 BLEU higher than the BLEU scores on inflected forms when running MERT. The addition of case to prepositions lowered the BLEU score reached by MERT by only about 0.2, but is very helpful for prediction of the case feature.

Inflection Prediction Task. Clean data task results[13] are given in Table 5. The 4 CRFs outperform the 4 JSMs by more than 2%.

[13] 26,061 of 55,057 tokens in our test set are ambiguous. We report % surface form matches for ambiguous tokens.
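The "% surface form matches for ambiguous tokens" metric can be sketched as follows. This is an illustrative reconstruction, not the paper's actual evaluation code: the parallel lists of predictions, references, and per-token candidate surface forms are invented for the example.

```python
def ambiguous_token_accuracy(predicted, reference, candidates):
    """Share of correctly inflected tokens, counted only over ambiguous
    tokens, i.e. stems with more than one possible surface form."""
    pairs = [(p, r) for p, r, c in zip(predicted, reference, candidates)
             if len(c) > 1]  # unambiguous tokens are excluded
    return sum(p == r for p, r in pairs) / len(pairs)

# Toy example: "blau" and the article are ambiguous, "Haus" is not,
# so "Haus" does not count towards the score.
predicted = ["blaue", "Haus", "der"]
reference = ["blauen", "Haus", "der"]
candidates = [["blau", "blaue", "blauen", "blauer", "blaues"],
              ["Haus"],
              ["der", "die", "das"]]
print(ambiguous_token_accuracy(predicted, reference, candidates))  # 0.5
```

Restricting the count to ambiguous tokens keeps trivially invariant words (which any system inflects correctly) from inflating the score.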
Model                                                 Accuracy
unigram surface (no features)                         55.98
surface (no features)                                 86.65
surface (with case, number, gender features)          91.24
1 JSM, morphological features                         92.45
4 JSMs, morphological features                        92.01
4 CRFs, morphological features, lexical information   94.29

Table 5: Comparing predicting surface forms directly with predicting morphological features.

training data       1 model   4 models
7.3 M sentences     92.41     91.88
1.5 M sentences     92.45     92.01
100000 sentences    90.20     90.64
1000 sentences      83.72     86.94

Table 6: Accuracy for different training data sizes of the single and the four separate joint sequence models.

As we mentioned in Section 4, there is a sparsity issue at small training data sizes for the single joint sequence model. This is shown in Table 6. At the largest training data sizes, modeling all 4 features together results in the best predictions of inflection. However, using 4 separate models is worth this minimal decrease in performance, since it facilitates experimentation with the CRF framework, for which the training of a single model is not currently tractable.

Overall, the inflection prediction works well for gender, number and type of inflection, which are features local to the NP that normally agree with the explicit markup output by the stem translation system (for example, the gender of a common noun, which is marked in the stem markup, is usually successfully propagated to the rest of the NP). Prediction of case does not always work well, and could perhaps be improved through hierarchical labeled-syntax stem translation.

Portmanteaus. An example of where the system is improved because of the new handling of portmanteaus can be seen in the dative phrase im internationalen Rampenlicht (in the international spotlight), which does not occur in the parallel data. The accusative phrase in das internationale Rampenlicht does occur; however, in this case there is no portmanteau, but a one-to-one mapping between "in the" and "in das". For a given context, only one of accusative or dative case is valid, and a strongly disfluent sentence results from the incorrect choice. In our system, these two cases are handled in the same way (def-article international Rampenlicht). This allows us to generalize from the accusative example with no portmanteau and take advantage of longer phrase pairs, even when translating to something that will be inflected as dative and should be realized as a portmanteau. The baseline does not have this capability. It should be noted that the portmanteau merging method described in Section 3 remerges all occurrences of APPR and ART that can technically form a portmanteau. There are a few cases where merging, despite being grammatical, does not lead to a good result. Such exceptions require semantic interpretation and are difficult to capture with a fixed set of rules.

8 Adding Compounds to the System

Compounds are highly productive in German and lead to data sparsity. We split the German compounds in the training data, so that our stem translation system can now work with the individual words in the compounds. After we have translated to a split/stemmed representation, we determine whether to merge words together to form a compound. Then we merge them to create stems in the same representation as before, and we perform inflection and portmanteau merging exactly as previously discussed.

8.1 Details of Splitting Process

We prepare the training data by splitting compounds in two steps, following the technique of Fritzinger and Fraser (2010). First, possible split points are extracted using SMOR, and second, the best split points are selected using the geometric mean of word part frequencies.

compound         word parts        gloss
Inflationsrate   Inflation Rate    inflation rate
auszubrechen     aus zu brechen    out to break (to break out)

Training data is then stemmed as described in Section 2.3. The formerly modifying words of the compound (in our example the words to the left of the rightmost word) do not have a stem markup assigned, except in two cases: i) they are nouns themselves or ii) they are particles separated from a verb. In these cases, former modifiers are represented identically to their individually occurring counterparts, which helps generalization.

8.2 Model for Compound Merging

After translation, compound parts have to be resynthesized into compounds before inflection. Two decisions have to be taken: i) where to merge and ii) how to merge. Following the work of Stymne and Cancedda (2011), we implement a linear-chain CRF merging system using the following features: stemmed (separated) surface form, part-of-speech[14] and frequencies from the training corpus for bigrams/merging of word and word+1, word as true prefix, word+1 as true suffix, plus frequency comparisons of these. The CRF is trained on the split monolingual data. It only proposes merging decisions; merging itself uses a list extracted from the monolingual data (Popovic et al., 2006).

1   1 JSM, morphological features                          13.94
2   4 CRFs, morphological features, lexical information    14.04

Table 7: Results with compounds on the test set.

8.3 Experiments

We evaluated the end-to-end inflection system with the addition of compounds.[15] As in the inflection experiments described in Section 5, we use a 5-gram surface LM and a 7-gram POS LM, but for this experiment they are trained on stemmed, split data. The POS LM helps compound parts and heads appear in the correct order. The results are in Table 7. The BLEU score of the CRF on test is 14.04, which is low. However, the system produces 19 compound types which are in the reference but not in the parallel data, and therefore not accessible to other systems. We also observe many more compounds in general. The 100-best inflection rescoring technique previously discussed reached 14.07 on the test set. Blind test results with CRF prediction are much better, 14.08, which is a statistically significant improvement over the baseline (13.68) and approaches the result we obtained without compounds (14.18). Correctly generated compounds are single words which usually carry the same information as multiple words in English, and are hence likely underweighted by BLEU. We again see many interesting generalizations. For instance, take the case of translating English miniature cameras to the German compound Miniaturkameras. Neither miniature camera nor miniature cameras occurs in the training data, and so there is no appropriate phrase pair in any system (baseline, inflection, or inflection&compound-splitting). However, our system with compound splitting has learned from split composita that English miniature can be translated as German Miniatur-, and gets the correct output.

[14] Compound modifiers get assigned a special tag based on the POS of their former heads, e.g., Inflation in the example is marked as a non-head of a noun.
[15] We found it most effective to merge word parts during MERT (so MERT uses the same stem references as before).

9 Related Work

There has been a large amount of work on translating from a morphologically rich language to English; we omit a literature review here due to space considerations. Our work is in the opposite direction, which primarily involves problems of generation, rather than problems of analysis.

The idea of translating to stems and then inflecting is not novel. We adapted the work of Toutanova et al. (2008), which is effective but limited by the conflation of two separate issues: word formation and inflection. Given a stem such as brother, Toutanova et al.'s system might generate the "stem and inflection" corresponding to and his brother. Viewing and and his as inflection is problematic, since a mapping from the English phrase and his brother to the Arabic stem for brother is required. The situation is worse if there are English words (e.g., adjectives) separating his and brother. This required mapping is a significant problem for generalization. We view this issue as a different sort of problem entirely, one of word-formation (rather than inflection). We apply a "split in preprocessing and resynthesize in postprocessing" approach to these phenomena, combined with inflection prediction that is similar to that of Toutanova et al. The only work that we are aware of which deals with both issues is the work of de Gispert and Mariño (2008), which deals with verbal morphology and attached pronouns.

There has been other work on solving inflection. Koehn and Hoang (2007) introduced factored SMT. We use more complex context features. Fraser (2009) tried to solve the inflection prediction problem by simply building an SMT system for translating from stems to inflected forms. Bojar and Kos (2010) improved on this by marking prepositions with the case they mark (one of the most important markups in our system). Both efforts were ineffective on large data sets. Williams and Koehn (2011) used unification in an SMT system to model some of the agreement phenomena that we model. Our CRF framework allows us to use more complex context features.

We have directly addressed the question as to whether inflection should be predicted using surface forms as the target of the prediction, or whether linguistic features should be predicted, along with the use of a subsequent generation step. The direct prediction of surface forms is limited to those forms observed in the training data, which is a significant limitation. However, it is reasonable to expect that the use of features (and morphological generation) could also be problematic, as this requires the use of morphologically-aware syntactic parsers to annotate the training data with such features, and additionally depends on the coverage of morphological analysis and generation. Despite this, our research clearly shows that the feature-based approach is superior for English-to-German SMT. This is a striking result considering that state-of-the-art performance of German parsing is poor compared with the best performance on English parsing. As parsing performance improves, the performance of linguistic-feature-based approaches will increase.

Virpioja et al. (2007), Badr et al. (2008), Luong et al. (2010), Clifton and Sarkar (2011), and others are primarily concerned with using morpheme segmentation in SMT, which is a useful approach for dealing with issues of word-formation. However, this does not deal directly with linguistic features marked by inflection. In German these linguistic features are marked very irregularly and there is widespread syncretism, making it difficult to split off morphemes specifying these features. So it is questionable whether morpheme segmentation techniques are sufficient to solve the inflectional problem we are addressing.

Much previous work looks at the impact of using source side information (i.e., feature functions on the aligned English), such as that of Avramidis and Koehn (2008), Yeniterzi and Oflazer (2010) and others. Toutanova et al.'s work showed that it is most important to model target side coherence, and our stem markup also allows us to access source side information. Using additional source side information beyond the markup did not produce a gain in performance.

For compound splitting, we follow Fritzinger and Fraser (2010), using linguistic knowledge encoded in a rule-based morphological analyser and then selecting the best analysis based on the geometric mean of word part frequencies. Other approaches use less deep linguistic resources (e.g., POS tags, Stymne (2008)) or are (almost) knowledge-free (e.g., Koehn and Knight (2003)). Compound merging is less well studied. Popovic et al. (2006) used a simple, list-based merging approach, merging all consecutive words included in a merging list. This approach resulted in too many compounds. We follow Stymne and Cancedda (2011) for compound merging. We trained a CRF using (nearly all) of the features they used and found their approach to be effective (when combined with inflection and portmanteau merging) on one of our two test sets.

10 Conclusion

We have shown that both the prediction of surface forms and the prediction of linguistic features are of interest for improving SMT. We have obtained the advantages of both in our CRF framework, and also integrated handling of compounds, and an inflection-dependent word formation phenomenon, portmanteaus. We validated our work on a well-studied large-corpora translation task.

Acknowledgments

The authors wish to thank the anonymous reviewers for their comments. Aoife Cahill was partly supported by Deutsche Forschungsgemeinschaft grant SFB 732. Alexander Fraser, Marion Weller and Fabienne Cap were funded by Deutsche Forschungsgemeinschaft grant Models of Morphosyntax for Statistical Machine Translation. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement Nr. 248005. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views. We thank Thomas Lavergne and Helmut Schmid.

References

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of ACL-08: HLT, pages 763-770, Columbus, Ohio, June. Association for Computational Linguistics.

Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153-156, Columbus, Ohio, June. Association for Computational Linguistics.

Ondřej Bojar and Kamil Kos. 2010. 2010 Failures in English-Czech Phrase-Based MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 60-66, Uppsala, Sweden, July. Association for Computational Linguistics.

Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 32-42, Portland, Oregon, USA, June. Association for Computational Linguistics.

Adrià de Gispert and José B. Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12):1034-1046.

Alexander Fraser. 2009. Experiments in Morphosyntactic Processing for Translating to and from German. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 115-119, Athens, Greece, March. Association for Computational Linguistics.

Fabienne Fritzinger and Alexander Fraser. 2010. How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing. In Proceedings of the Fifth Workshop on Statistical Machine Translation, pages 224-234. Association for Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In EACL '03: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 187-193, Morristown, NJ, USA. Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.

Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504-513. Association for Computational Linguistics, July.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148-157, Cambridge, MA, October. Association for Computational Linguistics.

Maja Popovic, Daniel Stein, and Hermann Ney. 2006. Statistical Machine Translation of German Compound Words. In Proceedings of FINTAL-06, pages 616-624, Turku, Finland. Springer Verlag, LNCS.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection. In 4th International Conference on Language Resources and Evaluation.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of Coling 2004, pages 162-168, Geneva, Switzerland, Aug 23-Aug 27. COLING.

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In International Conference on Spoken Language Processing.

Sara Stymne and Nicola Cancedda. 2011. Productive Generation of Compound Words in Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 250-260, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Sara Stymne. 2008. German Compounds in Factored Statistical Machine Translation. In Proceedings of GOTAL-08, pages 464-475, Gothenburg, Sweden. Springer Verlag, LNCS/LNAI.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying Morphology Generation Models to Machine Translation. In Proceedings of ACL-08: HLT, pages 514-522, Columbus, Ohio, June. Association for Computational Linguistics.

Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi. 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proc. of MT Summit XI, pages 491-498.

Philip Williams and Philipp Koehn. 2011. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 217-226, Edinburgh, Scotland, July. Association for Computational Linguistics.

Reyyan Yeniterzi and Kemal Oflazer. 2010. Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 454-464, Uppsala, Sweden, July. Association for Computational Linguistics.
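The split-point selection described in Section 8.1 above (choose the analysis whose parts have the highest geometric mean of corpus frequencies) can be illustrated with a toy sketch. Here the candidates are all binary character split points rather than SMOR analyses, and all names and data are ours:

```python
from math import prod

def best_split(word, freq):
    """Pick the compound split whose parts have the highest geometric
    mean of corpus frequencies.  Toy version: candidates are the unsplit
    word plus every binary split with parts of length >= 3; in the paper
    the candidate splits come from the SMOR analyzer instead."""
    candidates = [(word,)]                       # leaving it unsplit is an option
    for i in range(3, len(word) - 2):
        candidates.append((word[:i], word[i:]))

    def geo_mean(parts):
        # Unseen parts get frequency 0, which zeroes out the candidate.
        return prod(freq.get(p, 0) for p in parts) ** (1 / len(parts))

    return max(candidates, key=geo_mean)

# Hypothetical counts from a stemmed, lowercased corpus.
freq = {"miniatur": 20, "kameras": 35}
best_split("miniaturkameras", freq)   # -> ("miniatur", "kameras")
```

Real German compounds also involve linking elements (e.g., the -s- in Inflationsrate), which a rule-based analyzer such as SMOR handles and this character-level sketch does not.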
674 Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text Sarah Alkuhlani and Nizar Habash Center for Computational Learning Systems Columbia University {sma2149,nh2142}@columbia.edu Abstract that look masculine), and the semantic feature of rationality, which has no morphological re- Arabic morphology is complex, partly be- alization (Smrž, 2007b; Alkuhlani and Habash, cause of its richness, and partly because 2011). These features heavily participate in Ara- of common irregular word forms, such as broken plurals (which resemble singular bic morpho-syntactic agreement. Alkuhlani and nouns), and nouns with irregular gender Habash (2011) show that without proper model- (feminine nouns that look masculine and ing, Arabic agreement cannot be accounted for vice versa). In addition, Arabic morpho- in about a third of all noun-adjective pairs and syntactic agreement interacts with the lex- a quarter of verb-subject pairs. They also report ical semantic feature of rationality, which that over half of all plurals in Arabic are irregular, has no morphological realization. In this 8% of nominals have irregular gender and almost paper, we present a series of experiments half of all proper nouns and 5% of all nouns are on the automatic prediction of the latent linguistic features of functional gender and rational. number, and rationality in Arabic. We com- In this paper, we present results on the task pare two techniques, using simple maxi- of automatic identification of functional gender, mum likelihood (MLE) with back-off and number and rationality of Arabic words in con- a support vector machine based sequence tagger (Yamcha). We study a number of text. We consider two supervised learning tech- orthographic, morphological and syntactic niques: a simple maximum-likelihood model with learning features. 
Our results show that back-off (MLE) and a support-vector-machine- the MLE technique is preferred for words based sequence tagger, Yamcha (Kudo and Mat- seen in the training data, while the Yam- sumoto, 2003). We consider a large number of cha technique is optimal for unseen words, orthographic, morphological and syntactic learn- which are our real target. Furthermore, we ing features. Our results show that the MLE tech- show that for unseen words, morphological nique is preferred for words seen in the training features help beyond orthographic features and that syntactic features help even more. data, while the Yamcha technique is optimal for A combination of the two techniques im- unseen words, which are our real target. Further- proves overall performance even further. more, we show that for unseen words, morpho- logical features help beyond orthographic features 1 Introduction and that syntactic features help even more. A Arabic morphology is complex, partly because combination of the two techniques improves over- of its richness, and partly because of its com- all performance even further. plex morpho-syntactic agreement rules which de- This paper is structured as follows: Sec- pend on functional features not necessarily ex- tions 2 and 3 present relevant linguistic facts and pressed in word forms. Particularly challeng- related work, respectively. Section 4 presents the ing are broken plurals (which resemble singu- data collection we use and the metrics we target. lar nouns), nouns with irregular gender (mascu- Section 5 discusses our approach. And Section 6 line nouns that look feminine and feminine nouns presents our results. 675 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 675–685, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics VRB ÑêÊJ‚ SBJ OBJ MOD NOM NOM PRT H. AJºË@ A’’¯ áÓ MOD MOD OBJ NOM NOM NOM JK YmÌ '@ àñ èYK Yg ©ÒJj.ÖÏ@ . 
MOD MOD NOM NOM ú G. QªË@ Õç' Y®Ë@ Word ystlhm AlktAb AlHdyθwn qSSA jdyd¯ h mn Almjtmς Alςrby Alqdym Form MS MS MP MS FS NaNa MS MS MS Func MSN MPR MPN FPI FSN NaNaNa MSI MSN MSN Gloss be-inspired the-writers the-modern stories new from culture Arab ancient English ‘Modern writers are inspired by ancient Arab culture to write new stories .’ Figure 1: An example Arabic sentence showing its dependency representation together with the form-based and functional gender and number features and rationality. The dependency tree is in the CATiB treebank represen- tation (Habash and Roth, 2009). The shown POS tags are VRB “verb”, NOM “nominal (noun/adjective)”, and PRT “particle”. The relations are SBJ “subject”, OBJ “object” and MOD “modifier”. The form-based features are only for gender and number. 2 Linguistic Facts àðQëAÓ mAhrwn (M P ), and H@ QëAÓ mAhrAt (F P ). For a sizable minority of words, these Arabic has a rich and complex morphology. In features are expressed templatically, i.e., through addition to being both templatic (root/pattern) and pattern change, coupled with some singular suf- concatenative (stems/affixes/clitics), Arabic’s op- fix. A typical example of this phenomenon is the tional diacritics add to the degree of word ambi- class of broken plurals, which accounts for over guity. We focus on two problems of Arabic mor- half of all plurals (Alkuhlani and Habash, 2011). phology: the discrepancy between morphological In such cases, the form of the morphology (sin- form and function; and the complexity of morpho- gular suffix) is inconsistent with the word’s func- syntactic agreement rules. tional number (plural). For example, the word 2.1 Form and Function I.KA¿ kAtb (M S) ‘writer’ has the broken plural: H. AJ» ktAb ( MMPS ).2 See the second word in the ex- Arabic nominals (i.e. nouns, proper nouns and ample in Figure 1, which is the word H adjectives) and verbs inflect for gender: mascu- . 
AJ» ktAb line (M ) and feminine (F ), and for number: sin- ‘writers’ prefixed with the definite article Al+. In gular (S), dual (D) and plural (P ). These features addition to broken plurals, Arabic has words with are regularly expressed using a set of suffixes that irregular gender, e.g., the feminine singular ad- uniquely convey gender and number combina- jective ‘red’ Z@QÔg HmrA’ ( M S F S ), and the nouns h1 (F S), àð tions: +φ (M S), è+ +¯ + +wn (M P ), é®J Êg xlyf¯h ( MF SS ) ‘caliph’ and ÉÓAg HAml ( MF SS ) and H@ + +At (F P ). For example, the adjective ‘pregnant’. Verbs and nominal duals do not dis- play this discrepancy. QëAÓ mAhr ‘clever’ has the following forms among others: QëAÓ mAhr (M S), èQëAÓ mAhr¯ h (F S), 2.2 Morpho-syntactic Agreement 1 Arabic transliteration is presented in the Habash-Soudi- Arabic gender and number features participate in Buckwalter (HSB) scheme (Habash et al., 2007): (in alpha- morpho-syntactic agreement within specific con- ˇ betical order) AbtθjHxdðrzsšSDTDςγfqklmnhwy and the ad- ˇ ¯ ditional symbols: ’ Z,  @, A @ , A @, wˆ ð', yˆ Zø', ¯h è, ý ø. 2 F orm This nomenclature denotes ( F unction ). 676 structions such as nouns with their adjectives Altantawy et al., 2010; Alkuhlani and Habash, and verbs with their subjects. Arabic agreement 2011). rules are more complex than the simple match- In terms of resources, Smrž (2007b)’s work ing rules found in languages such as Spanish contrasting illusory (form) features and functional (Holes, 2004; Habash, 2010). For instance, Ara- features inspired our distinction of morphologi- bic adjectives agree with the nouns they mod- cal form and function. However, unlike him, we ify in gender and number except for plural ir- do not distinguish between sub-functional (logi- rational (non-human) nouns, which always take cal and formal) features. His ElixirFM analyzer feminine singular adjectives. 
Rationality (‘hu- (Smrž, 2007a) extends BAMA by including func- manness’ ‘ ɯA« Q «/ ɯA«’) is a morpho-lexical tional number and some functional gender infor- feature that is narrower than animacy. English mation, but not rationality. This analyzer was expresses it mainly in pronouns (he/she vs. it) used as part of the annotation of the Prague Ara- and relativizers (men who... vs. cars/cows bic Dependency Treebank (PADT) (Smrž and Ha- which...). We follow the convention by Alkuh- jiˇc, 2006). More recently, Alkuhlani and Habash lani and Habash (2011) who specify rationality (2011) built on the work of Smrž (2007b) and ex- as part of the functional features of the word. tended beyond it to fully annotate functional gen- The values of this feature are: rational (R), irra- der, number and rationality in the PATB part 3. tional (I), and not-specified (N ). N is assigned to We use their resource to train and evaluate our verbs, adjectives, numbers and quantifiers.3 For system. example, in Figure 1, the plural rational noun In terms of techniques, Goweder et al. (2004) H. AJºË@ AlktAb ( MMPSR ) ‘writers’ takes the plural investigated several approaches using root and adjective àñ JK YmÌ '@ AlHdyθwn ( M P ) ‘modern’; pattern morphology for identifying broken plu- MP N while the plural irrational word A’’¯ qSSA ‘sto- rals in undiacritized Arabic text. Their effort re- ries’ ( FMPSI ) takes the feminine singular adjective sulted in an improved stemming system for Ara- èYK Yg jdyd¯h ( F S ). bic information retrieval that collapses singulars . F SN and plurals. They report results on identifying 3 Related Work broken plurals out of context. Similar to them, we undertake the task of identifying broken plu- Much work has been done on Arabic morpholog- rals; however, we also target the templatic gen- ical analysis, morphological disambiguation and der and rationality features, and we do this in- part-of-speech (POS) tagging (Al-Sughaiyer and context. Elghamry et al. 
(2008) presented an auto- Al-Kharashi, 2004; Soudi et al., 2007; Habash, matic cue-based algorithm that uses bilingual and 2010). The bulk of this work does not address monolingual cues to build a web-extracted lexi- form-function discrepancy or morpho-syntactic con enriched with gender, number and rationality agreement issues. This includes the most com- features. Their automatic technique achieves an monly used resources and tools for Arabic NLP: F-score of 89.7% against a gold standard set. Un- the Buckwalter Arabic Morphological Analyzer like them, we use a manually annotated corpus to (BAMA) (Buckwalter, 2004) which is used in the train and test the prediction of gender, number and Penn Arabic Tree Bank (PATB) (Maamouri et al., rationality features. 2004), and the various POS tagging and morpho- Our approach to identifying these features ex- logical disambiguation tools trained using them plores a large set of orthographic, morphological (Diab et al., 2004; Habash and Rambow, 2005). and syntactic learning features. This is very much There are some important exceptions (Goweder et following several previous efforts in Arabic NLP al., 2004; Habash, 2004; Smrž, 2007b; Elghamry in which different tagsets and morphological fea- et al., 2008; Abbès et al., 2004; Attia, 2008; tures have been studied for a variety of purposes, 3 We previously defined the rationality value N as not- e.g., base phrase chunking (Diab, 2007) and de- applicable when we only considered nominals (Alkuhlani pendency parsing (Marton et al., 2010). In this and Habash, 2011). In this work, we rename the rationality paper we use the parser of Marton et al. (2010) value N as not-specified without changing its meaning. We use the value N a (not-applicable) for parts-of-speech that as our source of syntactic learning features. We do not have a meaningful value for any feature, e.g., prepo- follow their splits for training, development and sitions have gender, number and rationality values of N a. 
testing. 677 4 Problem Definition 5 Approach Our approach involves using two techniques: Our goal is to predict the functional gender, num- MLE with back-off and Yamcha. For each tech- ber and rationality features for all words. nique, we explore the effects of different learning features and try to come up with the best tech- 4.1 Corpus and Experimental Settings nique and feature set for each target feature. We use the corpus of Alkuhlani and Habash 5.1 Learning Features (2011), which is based on the PATB. The corpus We investigate the contribution of different learn- contains around 16.6K sentences and over 400K ing features in predicting functional gender, num- tokens. We use the train/development/test splits ber and rationality features. The learning features of Marton et al. (2010). We train on a quarter of are explored in the following order: the training set and classify words in sequence. We only use a portion of the training data to in- Orthographic Features These features are or- crease the percentage of words unseen in training. ganized in two sets: W1 is the unnormalized form We also compare to using all of the training data of the word, and W2 includes W1 plus letter n- in Section 6.7. grams. The n-grams used are the first letter, first two letters, last letter, and last two letters of the Our data is gold tokenized; however, all of word form. We tried using the Alif/Ya normalized the features we use are predicted using MADA forms of the words (Habash, 2010), but these be- (Habash and Rambow, 2005) following the work haved consistently worse than the unnormalized of Marton et al. (2010). Words whose tags are un- forms. known in the training set are excluded from the evaluation, but not training. In terms of ambigu- Morphological Features We explore the fol- ity, the percentage of word types with ambiguous lowing morphological features inspired by the gender, number and rationality in the train set is work of Marton et al. 
(2010): 1.35%, 0.79%, and 4.8% respectively. These per- • POS tags. We experiment with different POS centages are consistent with how we perform on tag sets: CATiB-6 (6 tags) (Habash et al., 2009), these features, with number being the easiest and CATiB-EX (44 tags), Kulick (34 tags) (Kulick et rationality the hardest. al., 2006), Buckwalter (BW) (Buckwalter, 2004), which is the tag used in the PATB (430 tags), and a reduced form of BW tag that ignores case 4.2 Metrics and mood (BW-) (217 tags). These tags differ in We report all results in terms of token accuracy. their granularity and range from very specific tags Evaluation is done for the following sets: all (Buckwalter) to more general tags (CATiB). words, seen words, and unseen words. A word is • Lemma. We use the diacritized lemma considered seen if it is in the training data regard- (Lemma), and the normalized and undiacritized less of whether it appears with the same lemma form of the lemma, the LMM (LMM). and POS tag or not. Defining seen words this way • Form-based features. Form-based features makes the decision on whether a word is seen or (F) are extracted from the word form and do not unseen unaffected by lemma and/or POS predic- necessarily reflect functional features. These fea- tion errors in the development and test sets. Us- tures are form-based gender, form-based number, ing our definition of seen words, 34.3% of words person and the definite article. types (and 10.2% of word tokens) in the devel- Syntactic Features We use the following syn- opment set have not been seen in quarter of the tactic features (SYN) derived from the CATiB de- training set. pendency version of the PATB (Habash and Roth, We train single classifiers for G (gender), N 2009): parent, dependency relation, order of ap- (number), R (rationality), GN and GNR, and eval- pearance (the word comes before or after its par- uate them. 
For all of these features, we train on gold values, but experiment only with predicted values in the development and test sets. For predicting morphological features, we use the MADA system (Habash and Rambow, 2005). The MADA system corrects for suboptimal orthographic choices and effectively produces a consistent and unnormalized orthography. For the syntactic features, we use Marton et al. (2010)'s system.

5.2 Techniques

We describe below the two techniques we explored.

MLE with Back-off  We implemented an MLE system with multiple back-off modes using our set of linguistic features. The order of the back-off is from specific to general. We start with an MLE system that uses only the word form, and backs off to the most common feature value across all words (excluding unknown and NA values). This simple MLE system is used as a baseline.

As we add more features to the MLE system, it tries to match all of these features to predict the value for a given word. If such a combination of features is not seen in the training set, the system backs off to a more general combination of features. For example, if an MLE system is using the features W2+LMM+BW, the system first tries to match this combination. If it is not seen in training, the system backs off to the set LMM+BW, and tries to return the most common value for this POS tag and lemma combination. If again it fails to find a match, it backs off to BW, and returns the most common value for that particular POS tag. If no word is seen with this POS tag, the system returns the most common value across all words.

Yamcha Sequence Tagger  We use Yamcha (Kudo and Matsumoto, 2003), a support-vector-machine-based sequence tagger. We perform different experiments with the different sets of features presented above. After that, we apply a consistency filter that ensures that every word-lemma-pos combination always gets the same value for the gender, number and rationality features. Yamcha in its default settings tags words using a window of two words before and two words after the word being tagged. This gives Yamcha an advantage over the MLE system, which tags each word independently.

Single vs Joint Classification  In this paper, we only discuss systems trained as single classifiers (one each for gender, number and rationality). In the experiments we have done, we found that training single classifiers and combining their outcomes almost always outperforms a single joint classifier for the three target features. In other words, combining the results of G and N (G+N) outperforms the results of the single classifier GN. The same is also true for G+N+R, which outperforms GNR and GN+R. Therefore, we only present the results for the single classifiers G, N, R and their combination G+N+R.

6 Results

We perform a series of experiments increasing in feature complexity. We greedily select which features to pass on to the next level of experiments. In cases of ties, we pass the top two performers to the next step. We discuss each of these experiments next for both the MLE and Yamcha techniques. Statistical significance is measured using the McNemar test (McNemar, 1947).

6.1 Experiment Set I: Orthographic Features

The first set of experiments uses the orthographic features. See Table 1. The MLE system with the word-only feature (W1) is effectively our baseline. It does surprisingly well for seen cases; in fact, it is the highest performer across all experiments in this paper for seen cases. For unseen cases, it produces a miserably low, but expected, score of 21.0% accuracy. The addition of the n-gram features (W2) improves statistically significantly over W1 for unseen cases, but is indistinguishable for seen cases. The Yamcha system shows the same difference in results between W1 and W2.

Across the two sets of features, the MLE system consistently outperforms Yamcha in the case of seen words, while Yamcha does better for unseen words. This can be explained by the fact that the MLE system matches only on the word form, and if the word is unseen, it backs off to the most common value across all words. Yamcha, on the other hand, uses some limited context information that allows it to generalize to unseen words.

Among the target features, number is the easiest to predict, while rationality is the hardest.

          MLE                                                     Yamcha
          G           N           R           G+N+R               G           N           R           G+N+R
Features  seen unseen seen unseen seen unseen seen unseen         seen unseen seen unseen seen unseen seen unseen
W1        99.2  61.6  99.3  69.2  97.4  44.7  97.0  21.0          95.9  67.8  96.7  72.0  94.5  67.4  90.2  35.2
W2        99.2  81.7  99.3  81.6  97.4  63.4  97.0  49.1          97.1  86.6  97.7  87.1  95.6  82.0  92.8  65.5

Table 1: Experiment Set I: Baselines and simple orthographic features. W1 is the word only. W2 is the word with additional 1-gram and 2-gram prefix and suffix features. All numbers are accuracy percentages.
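The MLE back-off cascade of Section 5.2 can be sketched as follows (a minimal sketch, not the authors' implementation; the feature names and toy labels are illustrative):

```python
from collections import Counter, defaultdict

def train_mle(examples, feature_sets):
    """examples: (features_dict, label) pairs.
    feature_sets: back-off order, most specific first,
    e.g. [("word", "lmm", "bw"), ("lmm", "bw"), ("bw",), ()].
    The empty tuple () collects counts over all words."""
    tables = {fs: defaultdict(Counter) for fs in feature_sets}
    for feats, label in examples:
        for fs in feature_sets:
            key = tuple(feats.get(f) for f in fs)
            tables[fs][key][label] += 1
    return tables

def predict_mle(tables, feature_sets, feats):
    # Try the most specific feature combination first; on a miss,
    # back off to a more general one.  The final () level returns
    # the most common label across all training words.
    for fs in feature_sets:
        key = tuple(feats.get(f) for f in fs)
        if tables[fs][key]:
            return tables[fs][key].most_common(1)[0][0]
    return None

# Toy usage with hypothetical feature values and labels
FEATURE_SETS = [("word", "lmm", "bw"), ("lmm", "bw"), ("bw",), ()]
train = [({"word": "ktb", "lmm": "kitAb", "bw": "NN"}, "M"),
         ({"word": "qlm", "lmm": "qalam", "bw": "NN"}, "M"),
         ({"word": "mdrsp", "lmm": "madrasa", "bw": "NN"}, "F")]
tables = train_mle(train, FEATURE_SETS)
```

A seen word matches at the most specific level, while an unseen word backs off until a level matches.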
             MLE                                                     Yamcha
             G           N           R           G+N+R               G           N           R           G+N+R
Features     seen unseen seen unseen seen unseen seen unseen         seen unseen seen unseen seen unseen seen unseen
W2+F         99.2  86.9  99.3  88.9  97.4  63.4  96.9  51.9          97.7  89.8  98.1  91.7  96.0  83.5  93.8  72.0
W2+Lemma     97.4  68.3  97.6  71.5  95.6  70.3  95.2  33.8          97.4  86.8  97.7  86.4  96.1  82.2  93.3  65.4
W2+LMM       99.1  68.8  99.3  71.7  97.2  67.6  96.8  33.2          97.5  86.7  97.9  86.6  96.1  82.6  93.5  65.7
W2+CATIB     99.1  85.0  99.3  83.8  97.4  70.0  97.1  56.2          97.5  87.9  98.0  88.6  96.0  83.5  93.6  69.7
W2+CATIB-EX  99.1  85.7  99.3  84.3  97.4  70.4  97.1  56.7          97.5  88.0  97.9  88.1  96.0  83.6  93.6  69.9
W2+Kulick    99.0  86.7  99.1  85.6  97.1  78.7  96.7  65.5          97.3  88.8  97.9  89.4  95.8  83.5  93.3  70.9
W2+BW-       99.0  88.8  99.0  88.8  97.0  80.7  96.6  68.5          97.5  89.7  98.0  91.2  96.0  85.2  93.7  73.2
W2+BW        98.6  87.9  98.5  88.8  96.8  80.3  95.9  67.8          97.5  89.5  97.9  89.5  96.1  85.7  93.7  72.8

Table 2: Experiment Set II.a: Morphological features: (i) form-based gender and number, (ii) lemma and LMM (undiacritized lemma) and (iii) a variety of POS tag sets. For each subset, the best performers are bolded.

6.2 Experiment Set II: Morphological Features

Individual Morphological Features  In this set of experiments, we use our best system from the previous set, W2, and add individual morphological features to it. We organize these features in three sub-groups: (i) form-based features (F), (ii) lemma and LMM, and (iii) the five POS tag sets. See Table 2.

The F, Lemma and LMM features improve over the baseline in terms of unseen words for both the MLE and Yamcha techniques. However, for seen words, these systems do worse than or equal to the baseline when the MLE technique is used. The MLE system in these cases tries to match the word and its morphological features as a single unit, and if such a combination is not seen, it backs off to the morphological feature, which is more general. Since we are using predicted data, prediction errors could be the reason behind this decrease in accuracy for seen words. Among these systems, W2+F is the best for both Yamcha and MLE, except for rationality, which is expected since there are no form-based features for rationality. In this set of experiments, Yamcha consistently outperforms MLE when it comes to unseen words, but for seen words, MLE does better almost always.

LMM overall does better than Lemma. This is reasonable given that LMM is easier to predict, although LMM is more ambiguous.

As for the POS tag sets, looking at the MLE results, CATIB-EX is the best performer for seen words, and BW- is the best for unseen. CATIB-6 is a general POS tag set, and since the MLE technique is very strict in its matching process (an exact match or no match), using a general key to match on adds a lot of ambiguity. With Yamcha, BW and BW- are the best among all POS tag sets. Yamcha is still doing consistently better in terms of unseen words. The best two systems from both Yamcha and MLE are used as the basic systems for the next subset of experiments, where we combine the morphological features.

Combined Morphological Features  Until this point, all experiments using the two techniques are similar. In this subset, MLE explores the effect of using CATIB-EX and BW- with other morphological features, and Yamcha explores the effect of using BW- and BW with other morphological features. See Table 3. Again, Yamcha is still doing consistently better in terms of unseen words, but when it comes to seen words, MLE performs better. For seen words, our best results come from MLE using CATIB-EX and LMM. For unseen words, our best results come from Yamcha with the BW- tag and the form-based features for both gender and number. For rationality, the best features to use with Yamcha are BW, LMM and the form-based features. The lemma seems to actually hurt when predicting gender and number. This can be explained by the fact that gender and number features are often properties of the word form and not of the lemma. This is different for rationality, which is a property of the lemma, and therefore we expect the lemma to help.

The fact that the predicted BW tag set helps is not consistent with previous work by Marton et al. (2010). In that effort, BW helps parsing only in the gold condition. BW prediction accuracy is low because it includes case endings. We postulate that perhaps in our task, which is far more limited than general parsing, errors in case prediction may not matter too much. The more complex tag set may actually help establish good local agreement sequences (even if incorrect case-wise), which is relevant to the target features.

MLE (base features: W2)
             G           N           R           G+N+R
             seen unseen seen unseen seen unseen seen unseen
+CATIB-EX    99.1  85.7  99.3  84.3  97.4  70.4  97.0  56.7
  +F         98.7  88.6  99.1  89.4  94.9  70.4  94.3  59.7
  +LMM       99.1  78.9  99.3  80.4  97.3  69.6  96.9  44.7
  +LMM+F     98.7  89.9  99.0  89.7  94.8  69.6  94.2  58.1
+BW-         99.0  88.8  99.0  88.8  97.0  80.7  96.6  68.5
  +F         99.0  88.8  99.1  89.9  97.0  80.7  96.6  69.6
  +LMM       98.9  90.0  99.0  88.0  97.0  83.6  96.6  69.8
  +LMM+F     98.9  90.0  99.0  89.1  97.0  83.6  96.6  70.8

Yamcha (base features: W2)
             G           N           R           G+N+R
             seen unseen seen unseen seen unseen seen unseen
+BW          97.5  89.5  97.9  89.5  96.1  85.7  93.7  72.8
  +F         97.8  90.6  98.2  92.4  96.3  85.3  94.2  75.4
  +LMM       97.6  88.9  98.1  88.9  96.5  85.7  94.1  72.3
  +LMM+F     98.1  90.4  98.4  92.5  96.7  85.8  94.8  75.9
+BW-         97.5  89.7  98.0  91.2  96.0  85.2  93.7  73.2
  +F         97.7  90.7  98.2  92.5  96.1  85.6  94.0  75.3
  +LMM       97.7  89.6  98.1  90.4  96.2  85.1  94.0  72.5
  +LMM+F     98.0  90.3  98.2  92.4  96.5  85.7  94.5  75.1

Table 3: Experiment Set II.b: Combining different morphological features.

Yamcha              G           N           R           G+N+R
Features            seen unseen seen unseen seen unseen seen unseen
W2+BW+F+SYN         97.3  90.6  97.8  92.5  96.1  86.1  93.5  76.0
W2+BW+LMM+SYN       97.4  89.1  97.5  88.3  96.2  86.0  93.4  71.7
W2+BW+LMM+F+SYN     97.5  90.8  98.0  92.5  96.4  86.2  93.8  76.2
W2+BW-+F+SYN        97.4  90.7  97.9  92.7  96.1  85.2  93.5  75.0
W2+BW-+LMM+SYN      97.4  89.5  97.7  89.8  96.1  85.7  93.4  72.1
W2+BW-+LMM+F+SYN    97.4  90.8  97.9  92.7  96.2  85.3  93.6  75.2

Table 4: Experiment Set III: Syntactic features.

6.3 Experiment Set III: Syntactic Features

This set of experiments adds syntactic features to the experiments in Set II. We add syntax to the systems that use Yamcha only, since it is not obvious how to add syntactic information to the MLE system. Syntax improves the prediction accuracy for unseen words but not for seen words. In Yamcha, we can argue that the +/-2 word window allows some form of shallow syntax modeling, which is why Yamcha is doing better from the start. But the longer-distance features are helping even more, perhaps because they capture agreement relations. The overall best system for unseen words is W2+BW+LMM+F+SYN, except for number, where W2+BW-+F+SYN is slightly better. In terms of G+N+R scores, W2+BW+LMM+F+SYN is statistically significantly better than all other systems in this set for seen and unseen words, except for unseen words with W2+BW+F+SYN. W2+BW+LMM+F+SYN is also statistically significantly better than its non-syntactic variant for both seen and unseen words. The prediction accuracy for seen words is still not as good as that of the MLE systems.

6.4 System Combination

The simple MLE W1 system, which happens to be the baseline, is the best predictor for seen words, and the more advanced Yamcha system using syntactic features is the best predictor for unseen words. Next, we create a new system that takes advantage of the two systems. We use the simple MLE W1 system for seen words, and Yamcha with syntax for unseen words. For unseen words, since each target feature has its own set of best learning features, we also build a combination system that uses the best systems for gender, number and rationality and combines their output into a single system for unseen words. For gender and rationality, we use W2+BW+LMM+F+SYN, and for number, we use W2+BW-+F+SYN. As expected, the combination system outperforms the basic systems. For comparison, the MLE W1 system achieves (all, seen, unseen) scores of (89.3, 97.0, 21.0) for G+N+R, while the best single Yamcha syntactic system achieves (92.0, 93.8, 76.2); the combination, on the other hand, achieves (94.9, 97.0, 76.2). The overall (all) improvement over the MLE baseline or the best Yamcha system translates into a 52% or 36% error reduction, respectively.

                       All   seen  unseen
MLE W1                 88.5  96.8  21.2
Yamcha BW+LMM+F        91.4  94.1  70.4
Yamcha BW+LMM+F+SYN    91.0  93.3  72.2
Combination            94.1  96.8  72.4

Table 5: Results on blind test. Scores for All/Seen/Unseen are shown for the G+N+R condition. We compare the MLE word baseline with the best Yamcha system with and without syntactic features, and the combined system.

6.5 Error Analysis

We conducted an analysis of the errors in the output of the combination system as well as the two systems that contributed to it.
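The 52% and 36% error reductions quoted in Section 6.4 follow directly from the overall accuracies (89.3 for the MLE W1 baseline, 92.0 for the best Yamcha syntactic system, and 94.9 for the combination); a quick check:

```python
def error_reduction(base_acc, new_acc):
    """Relative reduction in error rate, in percent."""
    base_err, new_err = 100.0 - base_acc, 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err

# Over the MLE W1 baseline: (10.7 - 5.1) / 10.7, about 52%
print(round(error_reduction(89.3, 94.9)))  # 52
# Over the best single Yamcha syntactic system: (8.0 - 5.1) / 8.0, about 36%
print(round(error_reduction(92.0, 94.9)))  # 36
```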
In the combination system, out of the total error in G+N+R (5.1%), 53% of the cases are for seen words (3.0% of all seen) and 47% for unseen words (23.8% of all unseen). Overall, rationality errors are the biggest contributor to G+N+R error at 73% relative, followed by gender (33% relative) and number (26% relative). Among error cases of seen words, rationality errors soar to 87% relative, almost four times the corresponding gender and number errors (27% and 22%, respectively). However, among error cases of unseen words, rationality errors are 57% relative, while the corresponding gender and number errors are 39% and 31%, respectively. As expected, rationality is much harder to tag than gender and number due to its higher word-form ambiguity and dependence on context.

We classified the types of errors in the MLE system for seen words, which we use in the combination system. We found that 86% of the G+N+R errors involve an ambiguity in the training data where the correct answer was present but not chosen. This is an expected limitation of the MLE approach. In the rest of the cases, the correct answer was not actually present in the training data. The proportion of ambiguity errors is almost identical for gender, number and rationality. However, rationality overall is the biggest cause of error, simply due to its higher degree of ambiguity.

Since the Yamcha system uses MADA features, we investigated the effect of the correctness of MADA features on the system's prediction accuracy. The overall MADA accuracy in identifying the lemma and the Buckwalter tag together – a very harsh measure – is 77.0% (79.3% for seen and 56.8% for unseen). Our error analysis shows that when MADA is correct, the prediction accuracy for G+N+R is 95.6%, 96.5% and 84.4% for all, seen and unseen, respectively. However, this accuracy goes down to 79.2%, 82.5% and 65.5% for all, seen and unseen, respectively, when MADA is wrong. This suggests that the Yamcha system suffers when MADA makes wrong choices, and that improving MADA would lead to improvement in the system's performance.

6.6 Blind Test

Finally, we apply our baseline, our best combination model, and our best single Yamcha model (with and without syntactic features) to the blind test set. The results are in Table 5 and are consistent with the development set: the MLE baseline is best on seen words, Yamcha is best on unseen words, syntactic features help in handling unseen words, and the overall combination improves over all specific systems.
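The seen/unseen routing behind the combination system (Section 6.4) and the three-way evaluation of Section 4.2 can be sketched together; the two predictor callables below are stand-ins for the trained MLE and Yamcha models, used here purely for illustration:

```python
def combine(words, train_vocab, mle_predict, yamcha_predict):
    """Route seen word forms to the MLE system and unseen word forms
    to Yamcha with syntactic features, as in the combination system."""
    return [mle_predict(w) if w in train_vocab else yamcha_predict(w)
            for w in words]

def accuracy(gold, pred, mask=None):
    """Token accuracy over all positions, or only over positions
    where mask is True (e.g. seen-only or unseen-only evaluation)."""
    pairs = [(g, p) for i, (g, p) in enumerate(zip(gold, pred))
             if mask is None or mask[i]]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs)

# Toy illustration with dummy predictors and a two-word vocabulary
train_vocab = {"ktb", "qlm"}
words = ["ktb", "xyz"]
pred = combine(words, train_vocab, lambda w: "M", lambda w: "F")
# "ktb" is seen -> routed to MLE; "xyz" is unseen -> routed to Yamcha
```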
6.7 Additional Training Data

After experimenting on a quarter of the training set to optimize the various settings, we train our combination system on the full training set and achieve (96.0, 96.8, 74.9) for G+N+R (all, seen, unseen) on the development set and (96.5, 96.8, 65.6) on the blind test set. As expected, the overall (all) scores are higher, simply due to the additional training data. The results on seen and unseen words, which are redefined against the larger training set, are not higher than the results for the quarter training data. Of course, these numbers should not be compared directly: the percentage of unseen word tokens against the full training set is 3.7%, compared to 10.2% against the quarter of the training set.

6.8 Comparison with MADA

We compare our results with the form-based features from the state-of-the-art morphological analyzer MADA (Habash and Rambow, 2005). We use the form-based gender and number features produced by MADA after we filter MADA's choices by tokenization. Since MADA does not give a rationality value, we assign the value I (irrational) to nouns and proper nouns and the value N (not-specified) to verbs and adjectives. Everything else receives NA (not-applicable). The POS tags are determined by MADA.

On the development set, MADA achieves (72.6, 73.1, 58.6) for G+N+R (all, seen, unseen), where the seen/unseen distinction is based on the full training set as in the previous section and is provided for comparison reasons only. The results for the test set are (71.4, 72.2, 53.7). These results are consistent with our expectation that MADA will do badly on this task, since it is not designed for it (Alkuhlani and Habash, 2011). We should remind the reader that MADA-derived features are used as machine learning features in this paper, where they actually help. In the future, we plan to integrate this task inside MADA.

6.9 Extrinsic Evaluation

We use the predicted gender, number and rationality features that we get from training on the full training set in a dependency syntactic parsing experiment. The parsing feature set we use is the best performing feature set described in Marton et al. (2011), which used an earlier unpublished version of our MLE model. The parser we use is the Easy-First Parser (Goldberg and Elhadad, 2010). More details on this parsing experiment are in Marton et al. (2012).

The functional gender and number features increase the labeled attachment score by 0.4% absolute over a comparable model that uses the form-based gender and number features. Rationality, on the other hand, does not help much. One possible reason for this is the lower quality of the predicted rationality feature compared to the other features. Another possible reason is that the rationality feature is not utilized optimally in the parser.

7 Conclusions and Future Work

We presented a series of experiments for automatic prediction of the latent features of functional gender and number, and rationality in Arabic. We compared two techniques, a simple MLE with back-off and an SVM-based sequence tagger, Yamcha, using a number of orthographic, morphological and syntactic features. Our conclusions are that for words seen in training, the MLE model does best; for unseen words, Yamcha does best; and, most interestingly, we found that syntactic features help the prediction for unseen words.

In the future, we plan to explore training on predicted features instead of gold features to minimize the effect of tagger errors. Furthermore, we plan to use our tools to collect vocabulary not covered by commonly used morphological analyzers and try to assign them correct functional features. Finally, we would like to use our predictions for gender, number and rationality as learning features for relevant NLP applications such as sentiment analysis, phrase-based chunking and named entity recognition.

Acknowledgments

We would like to thank Yuval Marton for help with the parsing experiments. The first author was funded by a scholarship from the Saudi Arabian Ministry of Higher Education. The rest of the work was funded under DARPA projects number HR0011-08-C-0004 and HR0011-08-C-0110.

References

Ramzi Abbès, Joseph Dichy, and Mohamed Hassoun. 2004. The Architecture of a Standard Arabic Lexical Database. Some Figures, Ratios and Categories from the DIINAR.1 Source Program. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 15–22, Geneva, Switzerland, August 28th. COLING.

Imad Al-Sughaiyer and Ibrahim Al-Kharashi. 2004. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology, 55(3):189–213.

Sarah Alkuhlani and Nizar Habash. 2011. A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Mohamed Altantawy, Nizar Habash, Owen Rambow, and Ibrahim Saleh. 2010. Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta.

Mohammed Attia. 2008. Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Ph.D. thesis, The University of Manchester, Manchester, UK.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL04), pages 149–152, Boston, MA.

Mona Diab. 2007. Towards an Optimal POS Tag Set for Modern Standard Arabic Processing. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Khaled Elghamry, Rania Al-Sabbagh, and Nagwa El-Zeiny. 2008. Cue-based bootstrapping of Arabic semantic features. In JADT 2008: 9es Journées internationales d'Analyse statistique des Données Textuelles.

Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, California, June. Association for Computational Linguistics.

Abduelbaset Goweder, Massimo Poesio, Anne De Roeck, and Jeff Reynolds. 2004. Identifying Broken Plurals in Unvowelised Arabic Text. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 246–253, Barcelona, Spain, July.

Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan.

Nizar Habash and Ryan Roth. 2009. CATiB: The Columbia Arabic Treebank. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 221–224, Suntec, Singapore.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Nizar Habash, Reem Faraj, and Ryan Roth. 2009. Syntactic Annotation in the Columbia Arabic Treebank. In Proceedings of the MEDAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt.

Nizar Habash. 2004. Large Scale Lexeme Based Arabic Morphological Generation. In Proceedings of Traitement Automatique des Langues Naturelles (TALN-04), pages 271–276, Fez, Morocco.

Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.

Clive Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown Classics in Arabic Language and Linguistics. Georgetown University Press.

Taku Kudo and Yuji Matsumoto. 2003. Fast Methods for Kernel-Based Text Analysis. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'03), pages 24–31, Sapporo, Japan, July.

Seth Kulick, Ryan Gabbard, and Mitch Marcus. 2006. Parsing the Arabic Treebank: Analysis and Improvements. In Proceedings of the Treebanks and Linguistic Theories Conference, pages 31–42, Prague, Czech Republic.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.

Yuval Marton, Nizar Habash, and Owen Rambow. 2010. Improving Arabic Dependency Parsing with Lexical and Inflectional Morphological Features. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 13–21, Los Angeles, CA, USA, June.

Yuval Marton, Nizar Habash, and Owen Rambow. 2011. Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Yuval Marton, Nizar Habash, and Owen Rambow. 2012. Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features. Manuscript submitted for publication.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.

Otakar Smrž and Jan Hajič. 2006. The Other Arabic Treebank: Prague Dependencies and Functions. In Ali Farghaly, editor, Arabic Computational Linguistics: Current Implementations. CSLI Publications.

Otakar Smrž. 2007a. ElixirFM – Implementation of Functional Arabic Morphology. In ACL 2007 Proceedings of the Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 1–8, Prague, Czech Republic. ACL.

Otakar Smrž. 2007b. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague, Prague, Czech Republic.

Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, editors. 2007. Arabic Computational Morphology. Knowledge-based and Empirical Methods, volume 38 of Text, Speech and Language Technology. Springer, August.


Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison with VerbNet and FrameNet

Yuichiroh Matsubayashi†  Yusuke Miyao†  Akiko Aizawa†
† National Institute of Informatics, Japan
{y-matsu,yusuke,aizawa}@nii.ac.jp

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 686–695, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

Widely accepted resources for semantic parsing, such as PropBank and FrameNet, are not perfect as a semantic role labeling framework. Their semantic roles are not strictly defined; therefore, their meanings and semantic characteristics are unclear. In addition, it is presupposed that a single semantic role is assigned to each syntactic argument. This is not necessarily true when we consider the internal structures of verb semantics. We propose a new framework for semantic role annotation which solves these problems by extending the theory of lexical conceptual structure (LCS). By comparing our framework with those of existing resources, including VerbNet and FrameNet, we demonstrate that our extended LCS framework can give a formal definition of semantic role labels, and that multiple roles of arguments can be represented strictly and naturally.

Sentence:  [John]   threw   [a ball]  [from the window].
Affection   Agent            Patient
Movement    Source           Theme     Source/Path
PropBank    Arg0             Arg1      Arg2
VerbNet     Agent            Theme     Source
FrameNet    Agent            Theme     Source

Table 1: Examples of single role assignments with existing resources.

1 Introduction

Recent developments of large semantic resources have accelerated empirical research on semantic processing (Màrquez et al., 2008). Specifically, corpora with semantic role annotations, such as PropBank (Kingsbury and Palmer, 2002) and FrameNet (Ruppenhofer et al., 2006), are indispensable resources for semantic role labeling. However, there are two topics we have to carefully take into consideration regarding role assignment frameworks: (1) the clarity of semantic role meanings and (2) the constraint that a single semantic role is assigned to each syntactic argument.

While these resources are undoubtedly invaluable for empirical research on semantic processing, the current usage of semantic labels for SRL systems is questionable from a theoretical viewpoint. For example, most of the works on SRL have used PropBank's numerical role labels (Arg0 to Arg5). However, the meanings of these numbers depend on each verb in principle, and PropBank does not expect semantic consistency, namely on Arg2 to Arg5. Moreover, Yi et al. (2007) explicitly showed that Arg2 to Arg5 are semantically inconsistent. The reason why such labels have been used in SRL systems is that verb-specific roles generally have a small number of instances and are not suitable for learning. However, it is necessary to avoid using inconsistent labels, since those labels confuse machine learners and can be a cause of low accuracy in automatic processing. In addition, the clarity of the definition of roles is particularly important for users to rationally know how to use each role in their applications. For these reasons, well-organized and generalized labels grounded in linguistic characteristics are needed in practice. The semantic roles of FrameNet and VerbNet (Kipper et al., 2000) are used more consistently to some extent, but the definition of the roles is not given in a formal manner and their semantic characteristics are unclear.

Another, somewhat related, problem of existing annotation frameworks is that it is presupposed that a single semantic role is assigned to each syntactic argument.¹ In fact, one syntactic argument can play multiple roles in the event (or events) expressed by a verb. For example, Table 1 shows a sentence containing the verb "throw" and the semantic roles assigned to its arguments in each framework. The table shows that each framework assigns a single role, such as Arg0 and Agent, to each syntactic argument. However, we can acquire the information from this sentence that John is an agent of the throwing event (the "Affection" row), as well as a source of the movement event of the ball (the "Movement" row). Existing frameworks that assign single roles simply ignore such information, which verbs inherently have in their semantics. We believe that giving a clear definition of multiple argument roles would be beneficial not only as a theoretical framework but also for practical applications that require detailed meanings derived from secondary roles.
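The multiple-role reading of Table 1 can be pictured as one label per event tier for each argument; the dictionary layout below is a toy encoding of ours, not a format used by any of the resources:

```python
# Each syntactic argument receives one role per event tier, so
# "John" is simultaneously Agent (affection tier) and Source
# (movement tier) of the throwing event.
throw_roles = {
    "John":            {"Affection": "Agent",   "Movement": "Source"},
    "a ball":          {"Affection": "Patient", "Movement": "Theme"},
    "from the window": {"Movement": "Source/Path"},
}

# A single-role scheme such as VerbNet keeps only one label per argument:
verbnet_view = {"John": "Agent", "a ball": "Theme",
                "from the window": "Source"}
```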
This issue is also related to the fragmentation and unclear definition of semantic roles in these frameworks. As we exemplify in this paper, multiple semantic characteristics are conflated in a single role label in these resources due to the manner of single-role assignment. This means that the semantic roles of existing resources are not monolithic and are inherently not mutually independent, but share some semantic characteristics.

The aim of this paper is more a theoretical discussion of role-labeling frameworks than the introduction of a new resource. We developed a framework of verb lexical semantics which is an extension of lexical conceptual structure (LCS) theory, and compare it with the existing frameworks used in VerbNet and FrameNet as annotation schemes for SRL. LCS is a decomposition-based approach to verb semantics that describes a meaning by composing a set of primitive predicates. The advantage of this approach is that the primitive predicates and their compositions are formally defined. As a result, we can give a strict definition of semantic roles by grounding them in the lexical semantic structures of verbs. In fact, we define semantic roles as argument slots in primitive predicates. With this approach, we demonstrate that some of the semantic characteristics that VerbNet and FrameNet informally/implicitly describe in their roles can be given formal definitions, and that multiple argument roles can be represented strictly and naturally by extending the LCS theory.

In the first half of this paper, we define our extended LCS framework and describe how it gives a formal definition of roles and solves the problem of multiple roles. In the latter half, we discuss the analysis of the empirical data we collected for 60 Japanese verbs, and also discuss theoretical relationships with the frameworks of existing resources. We discuss in detail the relationships between our role labels and VerbNet's thematic roles. We also describe the relationship between our framework and FrameNet with regard to the definitions of the relationships between semantic frames.

¹ To be precise, FrameNet permits multiple-role assignment, but it does not perform this systematically, as we show in Table 1. It mostly defines a single role label for a corresponding syntactic argument that plays multiple roles in several sub-events of a verb.

2 Related works

There have been several attempts in linguistics to assign multiple semantic properties to one argument. Gruber (1965) demonstrated the dispensability of the constraint that an argument takes only one semantic role, with some concrete examples. Rozwadowska (1988) suggested an approach of feature decomposition for semantic roles using her three features of change, cause, and sentient, and defined typical thematic roles by combining these features. This approach made it possible to classify semantic properties across thematic roles. However, Levin and Rappaport Hovav (2005) argued that the number of combinations of defined features is usually larger than the actual number of possible combinations; therefore, feature decomposition approaches should predict the possible feature combinations.

Culicover and Wilkins (1984) divided their roles into two groups, action and perceptional roles, and explained that dual assignment of roles always involves one role from each set. Jackendoff (1990) proposed an LCS framework for representing the meaning of a verb by using several primitive predicates. Jackendoff also stated that an LCS represents two tiers in its structure, an action tier and a thematic tier, which are similar to Culicover and Wilkins's two sets. Essentially, these two approaches distinguished roles related to action and change, and successfully restricted combinations of roles by taking a role from each set.

Dorr (1997) created an LCS-based lexical resource as an interlingual representation for machine translation. This framework was also used for text generation (Habash et al., 2003). However, the problem of multiple-role assignment was not completely solved in the resource. As a comparison of different semantic structures, Dorr (2001) and Hajičová and Kučerová (2002) analyzed the connection between LCS and PropBank roles, and showed that the mapping between LCS and PropBank roles is a many-to-many correspondence, and that roles can be mapped only by comparing the whole argument structure of a verb. Habash and Dorr (2001) tried to map LCS structures to thematic roles by using their thematic hierarchy.

cause(affect(i, j), go(j, [from(locate(in(i))),
                           fromward(locate(at(k))),
                           toward(locate(at(l)))]))

Figure 1: LCS of the verb throw.

Predicates    Semantic Functions
state(x, y)   First argument is in the state specified by the second argument.
cause(x, y)   Action in the first argument causes the change specified in the second argument.
act(x)        First argument affects itself.
affect(x, y)  First argument affects the second argument.
react(x, y)   First argument affects itself, due to the effect from the second argument.
go(x, y)      First argument changes according to the path described in the second argument.
from(x)       Starting point of a certain change event.
fromward(x)   Direction of the starting point.
via(x)        Pass point of a certain change event.
toward(x)     Direction of the end point.
to(x)         End point of a certain change event.
along(x)      Linear-shaped path of a change event.

Table 2: Major primitive predicates and their semantic functions.

Figure 1 represents the action changing the state of j. The inner structure of the second argument of go represents the path of the change.
The overall definition of our extended LCS framework is shown in Figure 2.2 Basically, our 3 Multiple role expression using lexical definition is based on Jackendoff’s LCS frame- conceptual structure work (1990), but performed some simplifications and added extensions. The modification is per- Lexical conceptual structure is an approach to de- formed in order to increase strictness and gen- scribe a generalized structure of an event or state erality of representation and also a coverage for represented by a verb. A meaning of a verb is rep- various verbs appearing in a corpus. The main resented as a structure composed of several prim- differences between the two LCS frameworks are itive predicates. For example, the LCS structure as follows. In our extended LCS framework, (i) for the verb “throw” is shown in Figure 1 and the possible combinations of cause, act, affect, includes the predicates cause, affect, go, from, react, and go are clearly restricted, (ii) multiple fromward, toward, locate, in, and at. The argu- actions or changes in an event can be described ments of primitive predicates are filled by core ar- by introducing a combination function (comb for guments of the verb. This type of decomposition short), (iii) GO, STAY and INCH in Jackendoff’s approach enables us to represent a case that one theory are incorporated into one function go, and syntactic argument fills multiple slots in the struc- (iv) most of the change-of-state events are repre- ture. In Figure 1, the argument i appears twice in sented as a metaphor using a spatial transition. the structure: as the first argument of affect and The idea of a comb function comes from a nat- the argument in from. ural extension of Jackendoff’s EXCH function. 
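The nested structure in Figure 1 can be made concrete as plain data. The following minimal Python sketch is our own illustration (the tuple encoding and the helper are not part of the paper's framework); it shows that the argument i fills two slots:

```python
# Hypothetical nested-tuple encoding of the LCS of "throw" (Figure 1).
# A predicate application is (name, arg1, arg2, ...); variables are strings.
THROW = ("cause",
         ("affect", "i", "j"),
         ("go", "j", [("from", ("locate", ("in", "i"))),
                      ("fromward", ("locate", ("at", "k"))),
                      ("toward", ("locate", ("at", "l")))]))

def variables(term, found=None):
    """Collect every variable occurrence in a structure, in reading order."""
    if found is None:
        found = []
    if isinstance(term, str):
        found.append(term)
    elif isinstance(term, tuple):
        for part in term[1:]:          # skip the predicate name
            variables(part, found)
    elif isinstance(term, list):
        for part in term:
            variables(part, found)
    return found

occ = variables(THROW)
# i appears twice: as the first argument of affect and inside from(...)
assert occ.count("i") == 2
```

Because one variable can occupy several slots, any reading of roles off the slots naturally yields multiple roles per syntactic argument.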
In our case, comb is not limited to describing a counter-transfer of the main event but can describe subordinate events occurring in relation to the main event.[3] We can also describe multiple main events if the agent performs two or more actions simultaneously and all the actions are in focus (e.g., John exchanges A with B). This extension is simple, but essential for creating LCS structures for predicates appearing in actual data. In our development of 60 Japanese predicates (verbs and verbal nouns) frequently appearing in the Kyoto University Text Corpus (KTC) (Kurohashi and Nagao, 1997), 37.6% of the frames included multiple events. By using the comb function, we can express complicated events with predicate decomposition and prevent missing (multiple) roles.

    LCS   = [ EVENT+ ; comb EVENT* ]
    STATE = { be | locate(PLACE) | orient(PLACE) | extent(PLACE) | connect(arg) }
    EVENT = [ { state(arg, STATE)
              | go(arg, PATH)
              | cause(act(arg1), go(arg1, PATH))
              | cause(affect(arg1, arg2), go(arg2, PATH))
              | cause(react(arg1, arg2), go(arg1, PATH)) } ;
              manner(constant)? ; mean(constant)? ;
              instrument(constant)? ; purpose(EVENT)* ]
    PATH  = [ from(STATE)? ; fromward(STATE)? ; via(STATE)? ;
              toward(STATE)? ; to(STATE)? ; along(arg)? ]
    PLACE = { in(arg) | on(arg) | cover(arg) | fit(arg) | inscribed(arg)
            | beside(arg) | around(arg) | near(arg) | inside(arg) | at(arg) }

Figure 2: Description system of our LCS. The operators +, *, ? follow basic regular expression syntax; {} represents a choice among the elements.

[2] Here we omit the attributes taken by each predicate in order to simplify the explanation. We also omit an explanation of the lower-level primitives, such as the STATE and PLACE groups, which are not necessarily important for the topic of this paper.

[3] In our extended LCS theory, we can describe multiple events in the semantic structure of a verb. However, a verb generally focuses on one of those events, and this makes for semantic variation among verbs such as buy, sell, and pay, as well as differences in the syntactic behavior of their arguments. Therefore, the focused event should be distinguished from the others as lexical information. We express focused events as main formulae (formulae that are not surrounded by a comb function).

The primitives are designed to represent a full or partial action-change-state chain, which consists of a state, a change in or maintenance of a state, or an action that changes/maintains a state. Table 2 shows the primitives that play important roles in representing that chain. Some primitives embed other primitives as their arguments, and the semantics of an entire LCS structure is calculated according to the definition of each primitive.

A key point in associating the LCS framework with existing frameworks of semantic roles is that each primitive predicate of LCS represents a fundamental function in semantics. The functions of the arguments of the primitive predicates can be explained using generalized semantic roles such as typical thematic roles. In order to represent the semantic functions of the arguments in the LCS primitives simply, and to make it easier to compare our extended LCS framework with other SRL frameworks, we define a semantic role set that corresponds to the semantic functions of the primitive predicates in the LCS structure (Table 3). We employed role names similar to typical thematic roles in order to compare the role sets easily, but the definitions are different. Also, due to the increased generality of the LCS representation, we obtained a clearer definition with which to explain the correspondence between LCS primitives and typical thematic roles than Jackendoff's predicates provide.

    Role         Description
    Protagonist  Entity which is the viewpoint of the verb.
    Theme        Entity whose state or change of state is mentioned.
    State        Current state of a certain entity.
    Actor        Entity which performs an action that changes/maintains its own state.
    Effector     Entity which performs an action that changes/maintains the state of another entity.
    Patient      Entity whose state is changed/maintained by another entity.
    Stimulus     Entity which is the cause of the action.
    Source       Starting point of a certain change event.
    Source_dir   Direction of the starting point.
    Middle       Pass point of a certain change event.
    Goal         End point of a certain change event.
    Goal_dir     Direction of the end point.
    Route        Linear-shaped path of a certain change event.

Table 3: Semantic role list for the proposed extended LCS framework.

Note that the core semantic information of a verb represented by the LCS framework is embodied directly in its LCS structure, and that information decreases if the structure is mapped to semantic roles. The mapping is just for contrasting thematic roles. Each role is given an obvious meaning and is designed to fit the upper-level primitives of the LCS structure, which are the arguments of the EVENT and PATH functions. In Table 4, we can see that these roles correspond almost one-to-one to the primitive arguments. One special role is Protagonist, which does not match an argument of a specific primitive. The Protagonist is assigned to the first argument in the main formula, to distinguish that formula from the sub-formulae. There are 13 defined roles, and this number is comparatively smaller than that of VerbNet. The discussion with regard to this number is given in the next section.

    Predicate   1st arg      2nd arg
    state       Theme        State
    act         Actor        –
    affect      Effector     Patient
    react       Actor        Stimulus
    go          Theme        PATH
    from        Source       –
    fromward    Source_dir   –
    via         Middle       –
    toward      Goal_dir     –
    to          Goal         –
    along       Route        –

Table 4: Correspondence between semantic roles and arguments of LCS primitives.

Essentially, the semantic functions of the arguments in LCS primitives are similar to those of traditional, or basic, thematic roles. However, there are two important differences. Our extended LCS framework principally guarantees that the primitive predicates do not contain any information concerning (i) selectional preference and (ii) complex structural relations of arguments. Primitives are designed to purely represent a function in an action-change-state chain; thus the information about selectional preference is annotated in a different layer; specifically, it is directly annotated on core arguments (e.g., we can annotate i with selPref(animate ∨ organization) in Figure 1). Also, the semantic function is already decomposed, and the structural relation among the arguments is represented as a structure of primitives in the LCS representation. Therefore, each argument slot of the primitive predicates does not include complicated meanings; it represents a primitive semantic property which is highly functional. These characteristics are necessary to ensure the clarity of the semantic role meanings. We believe that even though a certain type of complex semantic role surely exists, it is reasonable to represent that role based on decomposed properties.

In order to show an instance of our extended LCS theory, we constructed a dictionary of LCS structures for 60 Japanese verbs (including event nouns) using our extended LCS framework. The 60 verbs were the most frequent verbs in KTC after excluding the 100 most frequent ones.[4] We created the dictionary by looking at the instances of the target verbs in KTC. To increase the coverage of senses and case frames, we also consulted the online Japanese dictionary Digital Daijisen[5] and the Kyoto University case frames (Kawahara and Kurohashi, 2006), a compilation of case frames automatically acquired from a huge web corpus. There were 97 constructed frames in the dictionary. We then analyzed how many roles are additionally assigned by permitting multiple-role assignment (see Table 5).

[4] We omitted the top 100 verbs since these most frequent ones contain a phonogram (Hiragana) form of a verb normally written with Kanji characters, and such a phonogram form generally has huge ambiguity because many different verbs have the same pronunciation in Japanese.

[5] Available at http://dictionary.goo.ne.jp/jn/.

    Role         Single   Multiple   Growth (%)
    Theme            21        108          414
    State             1          1            0
    Actor            12         13          8.3
    Effector         73         92           26
    Patient          77         79          2.5
    Stimulus          0          0            0
    Source           11         44          300
    Source_dir        4          4            0
    Middle            1          8          700
    Goal             42         81           93
    Goal_dir          2          3           50
    Route             2          2            0
    w/o Theme       225        327           45
    Total           246        435           77

Table 5: Number of appearances of each role.

The numbers of assigned roles for single-role assignment are calculated by counting the role that appears first for each target argument in the structure. Table 5 shows that the total number of assigned roles is 1.77 times larger than under single-role assignment. The main reason is an increase in Theme. Under single-role assignment, Theme, in our sense, is always duplicated with Actor/Patient in action verbs. On the other hand, LCS strictly divides the functions for action and change; therefore the duplicated Theme is correctly annotated. Moreover, we obtained a 45% increase even when we did not count duplicated Themes. Most of the increase results from the increase in Source and Goal. For example, Effectors of transmission verbs are also annotated with a Source, and Effectors of movement verbs are sometimes annotated with Source or Goal.
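The counts above come from reading, for each argument, every primitive slot it fills. The following Python sketch is our own illustration of that idea, applying the Table 4 correspondence to a hypothetical nested-tuple encoding of the throw LCS (none of this code is from the paper):

```python
# Table 4 correspondence: (predicate name, slot index) -> semantic role.
SLOT_ROLE = {
    ("state", 0): "Theme",     ("state", 1): "State",
    ("act", 0): "Actor",
    ("affect", 0): "Effector", ("affect", 1): "Patient",
    ("react", 0): "Actor",     ("react", 1): "Stimulus",
    ("go", 0): "Theme",
    ("from", 0): "Source",     ("fromward", 0): "Source_dir",
    ("via", 0): "Middle",      ("toward", 0): "Goal_dir",
    ("to", 0): "Goal",         ("along", 0): "Route",
}

# Hypothetical nested-tuple encoding of the LCS of "throw" (Figure 1).
THROW = ("cause",
         ("affect", "i", "j"),
         ("go", "j", [("from", ("locate", ("in", "i"))),
                      ("fromward", ("locate", ("at", "k"))),
                      ("toward", ("locate", ("at", "l")))]))

def first_var(term):
    """Return the first variable occurring inside a term, if any."""
    if isinstance(term, str):
        return term
    seq = term[1:] if isinstance(term, tuple) else term
    for part in seq:
        v = first_var(part)
        if v is not None:
            return v
    return None

def collect(term, out):
    """Record, for each variable, every role of a slot it (or its state) fills."""
    if isinstance(term, tuple):
        name, args = term[0], term[1:]
        for idx, arg in enumerate(args):
            role = SLOT_ROLE.get((name, idx))
            var = first_var(arg)
            if role and var:
                out.setdefault(var, set()).add(role)
            collect(arg, out)
    elif isinstance(term, list):
        for part in term:
            collect(part, out)

def assign_roles(lcs):
    out = {}
    collect(lcs, out)
    # Protagonist: the first argument of the main formula.
    out.setdefault(first_var(lcs), set()).add("Protagonist")
    return out

roles_of = assign_roles(THROW)
assert roles_of["i"] == {"Protagonist", "Effector", "Source"}
assert roles_of["j"] == {"Patient", "Theme"}
```

For throw this yields i = {Protagonist, Effector, Source} and j = {Patient, Theme}, i.e., exactly the kind of duplicated role (Theme alongside Patient, Source alongside Effector) that single-role assignment collapses.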
4 Comparison with other resources

4.1 Number of semantic roles

The number of roles is related to the number of semantic properties represented in a framework and to the generality of those properties. Table 6 lists the number of semantic roles defined in our extended LCS framework, VerbNet and FrameNet.

    Resource          Frame-independent   # of roles
    LCS               yes                         13
    VerbNet (v3.1)    yes                         30
    FrameNet (r1.4)   no                        8884

Table 6: Number of roles in each resource.

There are two ways to define semantic roles. One is frame-specific, where the definition of each role depends on a specific lexical entry and such a role is never used in other frames. The other is frame-independent, which is to construct roles whose semantic function is generalized across all verbs. The number of roles in FrameNet is comparatively large because it defines roles in a frame-specific way. FrameNet respects the individual meanings of arguments rather than the generality of roles.

Compared with VerbNet, the number of roles defined in our extended LCS framework is less than half. However, this fact does not mean that the representation ability of our framework is lower than VerbNet's. We manually checked and listed a corresponding representation in our extended LCS framework for each thematic role in VerbNet in Table 7. This table does not provide a perfect or complete mapping between the roles in these two frameworks, because the mappings are not based on annotated data. However, we can roughly say that the VerbNet roles combine three types of information, namely a function of the argument in the action-change-state chain, selectional preference, and structural information about the arguments, which are in different layers in the LCS representation. VerbNet has many roles whose functions in the action-change-state chain are duplicated. For example, Destination, Recipient, and Beneficiary have the same property, the end state (Goal in LCS) of a changing event. The difference between such roles comes from a specific sub-type of the changing event (possession), selectional preference, and structural information among the arguments. By distinguishing such roles, VerbNet roles may take into account specific syntactic behaviors of certain semantic roles. Packing such complex information into semantic roles is useful for analyzing argument realization. However, from the viewpoint of semantic representation, the clarity of semantic properties provided by a predicate decomposition approach is beneficial. The 13 roles of the LCS approach are sufficient for obtaining a function in the action-change-state chain. In our LCS framework, selectional preference can be assigned to arguments at the individual-verb or verb-class level instead of to the role labels themselves, to maintain the generality of semantic functions. In addition, our extended LCS framework can easily separate complex structural information from role labels because LCS directly represents a structure among the arguments. We can calculate the information from the LCS structure instead of coding it into role labels. As a result, our extended LCS framework maintains the generality of roles, and the number of roles is smaller than in the other frameworks.

4.2 Clarity of role meanings

We showed that the approach of predicate decomposition used in LCS theory clarifies the role meanings assigned to syntactic arguments. Moreover, LCS achieves high generality of roles by separating selectional preference and structural information from role labels. The complex meaning of one syntactic argument is represented by multiple appearances of the argument in an LCS structure. As an example, Figure 3 shows an LCS structure and a VerbNet frame for the verb "buy". The LCS structure consists of four formulae. The first one is the main formula and the others are sub-formulae that represent co-occurring actions. The semantic-role-like representation of the structure is given by Table 4: i = {Protagonist, Effector, Source, Goal}, j = {Patient, Theme}, k = {Effector, Source, Goal}, and l = {Patient, Theme}. Selectional preference is annotated on each argument as i: selPref(animate ∨ organization), j: selPref(any), k: selPref(animate ∨ organization), and l: selPref(valuable entity). If we want to represent information such as "Source of what?", we can extend the notation to Source(j) to refer to the changing object.

    Example: "John bought a book from Mary for $10."

    VerbNet: Agent V Theme {from} Source {for} Asset
      has_possession(start(E), Source, Theme),
      has_possession(end(E), Agent, Theme),
      transfer(during(E), Theme), cost(E, Asset)

    LCS:
      cause(aff(i:John, j:a book), go(j, [ to(loc(in(i))) ]))
      comb cause(aff(i, l:$10),  go(l, [ from(loc(in(i))), to(loc(at(k:Mary))) ]))
      comb cause(aff(k, j),      go(j, [ from(loc(in(k))), to(loc(at(i))) ]))
      comb cause(aff(k, l),      go(l, [ to(loc(in(k))) ]))

Figure 3: Comparison between the semantic predicate representation and the LCS structure of the verb buy.

On the other hand, VerbNet combines multiple types of information into a single role, as mentioned above. Also, the meaning of some roles depends more on selectional preference or the structure of the arguments than on a primitive function in the action-change-state chain. Such VerbNet roles are used for several different functions depending on verbs and their alternations, and it is therefore difficult to capture decomposed properties from the role label without specific lexical knowledge. Moreover, some semantic functions, such as Mary being a Goal of the money in Figure 3, are completely discarded from the representation at the level of role labels.

    VerbNet role (# of uses)           Representation in LCS
    Actor (9), Actor1 (9), Actor2 (9)  Actor or Effector in symmetric formulae in the structure
    Agent (212)                        (Actor ∨ Effector) ∧ Protagonist
    Asset (6)                          Theme ∧ Source of the change is (locate(in()) ∧ Protagonist) ∧ selPref(valuable entity)
    Beneficiary (9)                    (peripheral role ∨ (Goal ∧ locate(in()))) ∧ selPref(animate ∨ organization) ∧ ¬(Actor ∨ Effector) ∧ the transferred entity is something beneficial
    Cause (21)                         (Effector ∧ selPref(¬animate ∧ ¬organization)) ∨ Stimulus ∨ peripheral role
    Destination (32)                   Goal
    Experiencer (24)                   Actor of react()
    Instrument (25)                    (Effector ∧ selPref(¬animate ∧ ¬organization)) ∨ peripheral role
    Location (45)                      (Theme ∨ PATH roles ∨ peripheral role) ∧ selPref(location)
    Material (6)                       Theme ∨ Source of a change ∧ the Goal of the change is locate(fit()) ∧ the Goal fulfills selPref(physical object)
    Patient (59), Patient1 (11)        Patient ∨ Theme
    Patient2 (11)                      (Source ∨ Goal) ∧ connect()
    Predicate (23)                     Theme ∨ (Goal ∧ locate(fit())) ∨ peripheral role
    Product (7)                        Theme ∨ (Goal ∧ locate(fit()) ∧ selPref(physical object))
    Proposition (11)                   Theme
    Recipient (33)                     Goal ∧ locate(in()) ∧ selPref(animate ∨ organization)
    Source (34)                        Source
    Theme (162)                        Theme
    Theme1 (13), Theme2 (13)           both are Theme, or Theme1 is Theme and Theme2 is State
    Topic (18)                         Theme ∧ selPref(knowledge ∨ information)

Table 7: Relationship of roles between VerbNet and our LCS framework. VerbNet roles that appear more than five times in frame definitions are analyzed. Each relationship shown here is only a partial and consistent part of the complete correspondence table; the complete mapping highly depends on each lexical entry (or verb class). Here, locate(in()) generally means possession or recognition.

There is another representation related to argument meanings in VerbNet. This representation is a type of predicate decomposition using VerbNet's original set of predicates, which are referred to as semantic predicates. For example, the verb "buy" in Figure 3 has the predicates has_possession, transfer and cost for composing the meaning of its event structure. The thematic roles are the fillers of the predicates' arguments; thus the semantic predicates may implicitly provide additional functions to the roles and possibly represent multiple roles. Unfortunately, we cannot discover what each argument of the semantic predicates exactly means, since the definition of each predicate is not publicly available. A requirement for obtaining implicit semantic functions from these semantic predicates is to define clearly how the roles (or functions) are calculated from these complex relations of semantic predicates.

FrameNet neither uses semantic roles generalized among all verbs nor represents the semantic properties of roles using a predicate decomposition approach; instead, it defines specific roles for each conceptual event/state to represent a specific background of the roles in that event/state. However, at the same time, FrameNet defines several types of parent-child relations between most of the frames and between their roles; therefore, we may say that FrameNet implicitly describes a sort of decomposed property using the roles of highly general or abstract frames and represents the inheritance of these semantic properties. One advantage of this approach is that the inheritance of a meaning between roles is controlled through the relations, which are carefully maintained by human effort, and is not restricted by the representation ability of a decomposition system. On the other hand, the only way to represent the generalized properties of a certain semantic role is to enumerate all inherited roles by tracing ancestors. Also, a semantic relation between arguments in a certain frame, which is given by the LCS structure and by the semantic predicates of VerbNet, is defined only by a natural-language description for each frame in FrameNet. From a CL point of view, we consider that at least a certain level of formalization of the semantic relations of arguments is important in order to utilize this information in applications. The LCS approach, or any approach using a well-defined predicate decomposition, can explicitly describe semantic properties and the relationships between arguments in a lexical structure. The primitive properties can be clearly defined, even though the representation ability is restricted by the generality of the roles.

In addition, the frame-to-frame relations in FrameNet may be a useful resource for application tasks such as paraphrasing and entailment. We argue that some types of relationships between frames can be automatically calculated using the LCS approach. For example, one of the relations is based on an inclusion relation between two LCS structures. Figure 4 shows the automatically calculated relations surrounding the verb "buy". Note that we chose a sense related to a commercial transaction, meaning an exchange of goods and money, for each word in order to compare the resulting relation graph with that of FrameNet. We call the relations among "buy", "sell", "pay" and "collect" different viewpoints, since these verbs contain exactly the same formulae and the only difference is the main formula. The relation between "buy" and "get" is defined as inheritance; a part of the child structure exactly equals the parent structure. Interestingly, the relations surrounding "buy" are similar to those in FrameNet (see Figure 5). We cannot describe all the types of relations we considered due to space limitations. However, the point is that these relationships are represented as rewriting rules between two LCS representations, and thus they are automatically calculated.

Figure 4: LCS of the verbs get, buy, sell, pay, and collect and their relationships calculated from the structures. (The structures share the annotations i: selPref(animate ∨ organization), j: selPref(any), k: selPref(animate ∨ organization), l: selPref(valuable entity); the graph itself is not reproduced.)

Figure 5: The frame relations among the verbs get, buy, sell, pay, and collect in FrameNet. (Graph not reproduced.)
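The relation calculation described above can be phrased as a comparison of formula sets. A toy Python illustration, entirely our own; the formula labels are stand-ins for actual LCS formulae:

```python
# Toy sketch: a frame is a set of formulae plus one designated main formula.
# Labels such as "F_transfer_goods" are placeholders for real LCS formulae.
BUY  = {"main": "F_transfer_goods",
        "formulae": {"F_transfer_goods", "F_transfer_money"}}
SELL = {"main": "F_transfer_money",
        "formulae": {"F_transfer_goods", "F_transfer_money"}}
GET  = {"main": "F_transfer_goods",
        "formulae": {"F_transfer_goods"}}

def relation(child, parent):
    """Classify the structural relation between two frames."""
    if child["formulae"] == parent["formulae"]:
        # Same formulae, different focus: e.g., buy vs. sell.
        return "same" if child["main"] == parent["main"] else "different viewpoint"
    if parent["formulae"] <= child["formulae"]:
        # The whole parent structure is contained in the child: inheritance.
        return "inheritance"
    return "unrelated"

assert relation(BUY, SELL) == "different viewpoint"
assert relation(BUY, GET) == "inheritance"
```

Since the classification only inspects the structures, the whole relation graph over a dictionary of frames can be computed mechanically, which is the point of the argument above.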
Moreover, the grounds for the relations maintain clarity, based on concrete structural relations. A semantic relation construction for frames based on structural relationships is another possible application of LCS approaches, one that connects traditional LCS theories with resources representing a lexical network such as FrameNet.

4.3 Consistency on semantic structures

Constructing an LCS dictionary is generally difficult work, since LCS has high flexibility in describing structures and different people tend to write different structures for a single verb. We maintained the consistency of the dictionary by taking into account the similarity of the structures between verbs that are in paraphrasing or entailment relations. This idea was inspired by the automatic calculation of semantic relations in the lexicon mentioned above. We created an LCS structure for each lexical entry such that we can calculate the semantic relations between related verbs, and thereby maintained high-level consistency among the verbs.

Using our extended LCS theory, we successfully created 97 frames for 60 predicates without any extra modification. From this result, we believe that our extended theory is stable to some extent. On the other hand, we found that an extra extension of the LCS theory is needed for some verbs to explain the different syntactic behaviors of one verb. For example, a condition for a certain syntactic behavior of verbs related to the reciprocal alternation (see class 2.5 of Levin (1993)), such as つながる (connect) and 統一 (integrate), cannot be explained without considering the number of entities in some arguments. Also, some verbs need to define an order over their internal events. For example, the Japanese verb 往復する (shuttle) means that going is the first action and coming back is the second action. These are not problems directly related to the semantic role annotation on which we focus in this paper, but we plan to solve them with further extensions.

5 Conclusion

We discussed two problems in current labeling approaches for argument-structure analysis: the problems of clarity of role meanings and of multiple-role assignment. Focusing on the fact that an approach based on predicate decomposition is suitable for solving these problems, we proposed a new framework for semantic role assignment by extending Jackendoff's LCS framework. The statistics of our LCS dictionary for 60 Japanese verbs showed that 37.6% of the created frames included multiple events and that the number of roles assigned to one syntactic argument increased 77% over single-role assignment.

Compared to other resources such as VerbNet and FrameNet, the role definitions in our extended LCS framework are clearer, since the primitive predicates limit the meaning of each role to a function in the action-change-state chain. We also showed that LCS can separate three types of information, the functions represented by primitives, the selectional preference, and the structural relation of arguments, which are conflated in the role labels of existing resources. As a further potential of LCS, we demonstrated that several types of frame relations, which are similar to those in FrameNet, are automatically calculated using the structural relations between LCSs. We still must perform a thorough investigation to enumerate the relations that can be represented in terms of rewriting rules for LCS structures. However, the automatic construction of a consistent relation graph of semantic frames may be possible based on lexical structures.

We believe that this kind of decomposed analysis will accelerate both fundamental and application research on argument-structure analysis. As future work, we plan to expand the dictionary and construct a corpus based on our LCS dictionary.

Acknowledgment

This work was partially supported by JSPS Grant-in-Aid for Scientific Research #22800078.

References
Berkeley FrameNet linguistic theory. Academic Press. Release, 1. Bonnie J. Dorr. 1997. Large-scale dictionary con- Szu-ting Yi, Edward Loper, and Martha Palmer. 2007. struction for foreign language tutoring and inter- Can semantic roles generalize across genres? In lingual machine translation. Machine Translation, Proceedings of HLT-NAACL 2007, pages 548–555. 12(4):271–322. Bonnie J. Dorr. 2001. Lcs database. http://www. umiacs.umd.edu/˜bonnie/LCS Database Document ation.html. Jeffrey S Gruber. 1965. Studies in lexical relations. Ph.D. thesis, MIT. N. Habash and B. Dorr. 2001. Large scale language independent generation using thematic hierarchies. In Proceedings of MT summit VIII. N. Habash, B. Dorr, and D. Traum. 2003. Hybrid natural language generation from lexical conceptual structures. Machine Translation, 18(2):81–128. Eva Hajiˇcov´a and Ivona Kuˇcerov´a. 2002. Argu- ment/valency structure in propbank, lcs database and prague dependency treebank: A comparative pilot study. In Proceedings of the Third Inter- national Conference on Language Resources and Evaluation (LREC 2002), pages 846–851. Ray Jackendoff. 1990. Semantic Structures. The MIT Press. D. Kawahara and S. Kurohashi. 2006. Case frame compilation from the web using high-performance computing. In Proceedings of LREC-2006, pages 1344–1347. Paul Kingsbury and Martha Palmer. 2002. From Tree- bank to PropBank. In Proceedings of LREC-2002, pages 1989–1993. Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the National Conference on Arti- ficial Intelligence, pages 691–696. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Sadao Kurohashi and Makoto Nagao. 1997. Kyoto university text corpus project. Proceedings of the Annual Conference of JSAI, 11:58–61. Beth Levin and Malka Rappaport Hovav. 2005. Argu- ment realization. Cambridge University Press. Beth Levin. 1993. 
English verb classes and alternations: A preliminary investigation. University of Chicago Press.

Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: an introduction to the special issue. Computational Linguistics, 34(2):145–159.

B. Rozwadowska. 1988. Thematic restrictions on derived nominals. In W. Wilkins, editor, Syntax and Semantics, volume 21, pages 147–165. Academic Press.

Unsupervised Detection of Downward-Entailing Operators by Maximizing Classification Certainty

Jackie C.K. Cheung and Gerald Penn
Department of Computer Science, University of Toronto, Toronto, ON, M5S 3G4, Canada
{jcheung,gpenn}@cs.toronto.edu

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 696–705, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

We propose an unsupervised, iterative method for detecting downward-entailing operators (DEOs), which are important for deducing entailment relations between sentences. Like the distillation algorithm of Danescu-Niculescu-Mizil et al. (2009), the initialization of our method depends on the correlation between DEOs and negative polarity items (NPIs). However, our method trusts the initialization more and aggressively separates likely DEOs from spurious distractors and other words, unlike distillation, which we show to be equivalent to one iteration of EM prior re-estimation. Our method is also amenable to a bootstrapping method that co-learns DEOs and NPIs, and achieves the best results in identifying DEOs in two corpora.

1 Introduction

Reasoning about text has been a long-standing challenge in NLP, and there has been considerable debate both on what constitutes inference and on what techniques should be used to support inference. One task involving inference that has recently received much attention is that of recognizing textual entailment (RTE), in which the goal is to determine whether a hypothesis sentence can be entailed from a piece of source text (Bentivogli et al., 2010, for example).

An important consideration in RTE is whether a sentence or context produces an entailment relation for events that are a superset or subset of the original sentence (MacCartney and Manning, 2008). By default, contexts are upward-entailing, allowing reasoning from a set of events to a superset of events, as seen in (1). In the scope of a downward-entailing operator (DEO), however, this entailment relation is reversed, such as in the scope of the classical DEO not (2). There are also operators which are neither upward- nor downward-entailing, such as the expression exactly three (3).

(1) She sang in French. ⇒ She sang. (upward-entailing)

(2) She did not sing in French. ⇐ She did not sing. (downward-entailing)

(3) Exactly three students sang. ⇎ Exactly three students sang in French. (neither upward- nor downward-entailing)

Danescu-Niculescu-Mizil et al. (2009) (henceforth DLD09) proposed the first computational methods for detecting DEOs from a corpus. They proposed two unsupervised algorithms which rely on the correlation between DEOs and negative polarity items (NPIs), which by the definition of Ladusaw (1980) must appear in the context of DEOs. An example of an NPI is yet, as in the sentence This project is not complete yet. The first, baseline method proposed by DLD09 simply calculates a ratio of the relative frequencies of a word in NPI contexts versus in a general corpus, and the second is a distillation method which appears to refine the baseline ratios using a task-specific heuristic. Danescu-Niculescu-Mizil and Lee (2010) (henceforth DL10) extend this approach to Romanian, where a comprehensive list of NPIs is not available, by proposing a bootstrapping approach to co-learn DEOs and NPIs.

DLD09 are to be commended for having identified a crucial component of inference that nevertheless lends itself to a classification-based approach, as we will show. However, as noted by DL10, the performance of the distillation method is mixed across languages and in the semi-supervised bootstrapping setting, and there is no mathematical grounding of the heuristic to explain why it works and whether the approach can be refined or extended. This paper supplies the missing mathematical basis for distillation and shows that, while its intentions are fundamentally sound, the formulation of distillation neglects an important requirement: that the method not be easily distracted by other word co-occurrences in NPI contexts. We call our alternative certainty; it uses an unusual posterior classification confidence score (based on the max function) to favour single, definite assignments of DEO-hood within every NPI context. DLD09 actually speculated on the use of max as an alternative, but within the context of an EM-like optimization procedure that throws away its initial parameter settings too willingly. Certainty iteratively and directly boosts the scores of the currently best-ranked DEO candidates relative to the alternatives in a Naïve Bayes model, which thus pays more respect to the initial weights, constructively building on top of what the model already knows. This method proves to perform better on two corpora than distillation, and is more amenable to the co-learning of NPIs and DEOs. In fact, the best results are obtained by co-learning the NPIs and DEOs in conjunction with our method.

2 Related Work

There is a large body of literature in linguistic theory on downward entailment and polarity items [Footnote 1: See van der Wouden (1997) for a comprehensive reference.], of which we will only mention the most relevant work here. The connection between downward-entailing contexts and negative polarity items was noticed by Ladusaw (1980), who stated the hypothesis that NPIs must be grammatically licensed by a DEO. However, DEOs are not the sole licensors of NPIs, as NPIs can also be found in the scope of questions, certain numeric expressions (i.e., non-monotone quantifiers), comparatives, and conditionals, among others. Giannakidou (2002) proposes that the property shared by these constructions and downward entailment is non-veridicality. If F is a propositional operator for proposition p, then the operator is non-veridical if Fp ⇏ p. Positive operators such as past-tense adverbials are veridical (4), whereas questions, negation, and other DEOs are non-veridical (5, 6).

(4) She sang yesterday. ⇒ She sang.

(5) She denied singing. ⇏ She sang.

(6) Did she sing? ⇏ She sang.

While Ladusaw's hypothesis is thus accepted to be insufficient from a linguistic perspective, it is nevertheless a useful starting point for computational methods for detecting NPIs and DEOs, and it has inspired successful techniques to detect DEOs, like the work by DLD09, DL10, and also this work. In addition to this hypothesis, we further assume that there should be only one plausible DEO candidate per NPI context. While there are counterexamples, this assumption is in practice very robust, and it is a useful constraint for our learning algorithm. An analogy can be drawn to the one-sense-per-discourse assumption in word sense disambiguation (Gale et al., 1992).

The related, and as we will argue more difficult, problem of detecting NPIs has also been studied, and in fact predates the work on DEO detection. Hoeksema (1997) performed the first corpus-based study of NPIs, predominantly for Dutch, and there has also been work on detecting NPIs in German which assumes linguistic knowledge of licensing contexts for NPIs (Lichte and Soehn, 2007). Richter et al. (2010) make this assumption as well, and use syntactic structure to extract NPIs that are multi-word expressions. Parse information is an especially important consideration in freer-word-order languages like German, where an MWE may not appear as a contiguous string. In this paper, we explicitly do not assume detailed linguistic knowledge about licensing contexts for NPIs and do not assume that a parser is available, since neither of these is guaranteed when extending this technique to resource-poor languages.

3 Distillation as EM Prior Re-estimation

Let us first review the baseline and distillation methods proposed by DLD09, then show that distillation is equivalent to one iteration of EM prior re-estimation in a Naïve Bayes generative probabilistic model, up to constant rescaling. The baseline method assigns a score to each word-type based on the ratio of its relative frequency within NPI contexts to its relative frequency within a general corpus. Suppose we are given a corpus C with extracted NPI contexts N, containing tokens(C) and tokens(N) tokens respectively. Let y be a candidate DEO, count_C(y) be the unigram frequency of y in the corpus, and count_N(y) be the unigram frequency of y in N. Then we define S(y) to be the ratio between the relative frequencies of y within NPI contexts and in the entire corpus [Footnote 2: DLD09 actually use the number of NPI contexts containing y rather than count_N(y), but we find that using the raw count works better in our experiments.]:

S(y) = (count_N(y)/tokens(N)) / (count_C(y)/tokens(C)).   (7)

The scores are then used as a ranking to determine word-types that are likely to be DEOs. This method approximately captures Ladusaw's hypothesis by highly ranking words that appear in NPI contexts more often than would be expected by chance. However, the problem with this approach is that DEOs are not the only words that co-occur with NPIs. In particular, there exist many piggybackers, which, as defined by DLD09, collocate with DEOs due to semantic relatedness or chance, and would thus incorrectly receive a high S(y) score.

Examples of piggybackers found by DLD09 include the proper noun Milken and the adverb vigorously, which collocate with DEOs like deny in the corpus they used. DLD09's solution to the piggybacker problem is a method that they term distillation. Let N_y be the NPI contexts that contain word y; i.e., N_y = {c ∈ N | c ∋ y}. In distillation, each word-type is given a distilled score according to the following equation:

S_d(y) = (1/|N_y|) Σ_{p ∈ N_y} S(y) / Σ_{y' ∈ p} S(y'),   (8)

where p indexes the set of NPI contexts which contain y [Footnote 3: In DLD09, the corresponding equation does not indicate that p should be the contexts that include y, but it is clear from the surrounding text that our version is the intended meaning. If all the NPI contexts were included in the summation, S_d(y) would reduce to inverse relative frequency.], and the denominator is the number of NPI contexts which contain y.

[Figure 1: Naïve Bayes formulation of DEO detection: a latent DEO variable Y generates L observed context-word indicator variables X.]

DLD09 find that distillation seems to improve the performance of DEO detection in BLLIP. Later work by DL10, however, shows that distillation does not seem to improve performance over the baseline method in Romanian, and the authors also note that distillation does not improve performance in their experiments on co-learning NPIs and DEOs via bootstrapping.

A better mathematical grounding of the distillation method's apparent heuristic in terms of existing probabilistic models sheds light on the mixed performance of distillation across languages and experimental settings. In particular, it turns out that the distillation method of DLD09 is equivalent to one iteration of EM prior re-estimation in a Naïve Bayes model. Given a lexicon L of L words, let each NPI context be one sample generated by the model. One sample consists of a latent categorical (i.e., a multinomial with one trial) variable Y whose values range over L, corresponding to the DEO that licenses the context, and observed Bernoulli variables X⃗ = X_{i=1...L} which indicate whether a word appears in the NPI context (Figure 1). This method does not attempt to model the order of the observed words, nor the number of times each word appears. Formally, the Naïve Bayes model is given by the following expression:

P(X⃗, Y) = Π_{i=1}^{L} P(X_i|Y) P(Y).   (9)

The probability of a DEO given a particular NPI context is

P(Y|X⃗) ∝ Π_{i=1}^{L} P(X_i|Y) P(Y).   (10)

The probability of a set of observed NPI contexts N is the product of the probabilities for each sample:

P(N) = Π_{X⃗ ∈ N} P(X⃗),   (11)

P(X⃗) = Σ_{y ∈ L} P(X⃗, y).   (12)

We first instantiate the baseline method of DLD09 by initializing the parameters of the model, P(X_i = 1|y) and P(Y = y), such that P(Y = y) is proportional to S(y). Recall that this initialization utilizes domain knowledge about the correlation between NPIs and DEOs, inspired by Ladusaw's hypothesis:

P(Y = y) = S(y) / Σ_{y'} S(y'),   (13)

P(X_i = 1|y) = 1 if X_i corresponds to y; 0.5 otherwise.   (14)

This initialization of P(X_i = 1|y) ensures that the value of y corresponds to one of the words in the NPI context, and the initialization of P(Y) is simply a normalization of S(y).

Since we are working in an unsupervised setting, there are no labels for Y available. A common and reasonable assumption about learning the parameter settings in this case is to find the parameters that maximize the likelihood of the observed training data, i.e., the NPI contexts:

θ̂ = argmax_θ P(N; θ).   (15)

The EM algorithm is a well-known iterative algorithm for performing this optimization. Assuming that the prior P(Y = y) is a categorical distribution, the M-step estimate of these parameters after one iteration through the corpus is as follows:

P^{t+1}(Y = y) = Σ_{X⃗ ∈ N} [ P^t(y|X⃗) / Σ_{y'} P^t(y'|X⃗) ].   (16)

We do not re-estimate P(X_i = 1|y), because their role is simply to ensure that the DEO responsible for an NPI context exists in the context. Estimating these parameters would exacerbate the problems with EM for this task, which we will discuss shortly.

P(Y) gives a prior probability that a certain word-type y is a DEO in an NPI context, without normalizing for the frequency of y in NPI contexts. Since we are interested in estimating the context-independent probability that y is a DEO, we must calculate the probability that a word is a DEO given that it appears in an NPI context. Let X_y be the observed variable corresponding to y. Then the expression we are interested in is P(y|X_y = 1). We now show that P(y|X_y = 1) = P(y)/P(X_y = 1), and that this expression is equivalent to (8):

P(y|X_y = 1) = P(y, X_y = 1) / P(X_y = 1).   (17)

Recall that P(y, X_y = 0) = 0 because of the assumption that a DEO appears in the NPI context that it generates. Thus,

P(y, X_y = 1) = P(y, X_y = 1) + P(y, X_y = 0) = P(y).   (18)

One iteration of EM to calculate this probability is equivalent to the distillation method of DLD09. In particular, the numerator of (17), which we just showed to be equal to the estimate of P(Y) given by (16), is exactly the sum of the responsibilities for a particular y, and is proportional to the summation in (8) modulo normalization, because P(X⃗|y) is constant for all y in the context. The denominator P(X_y = 1) is simply the proportion of contexts containing y, which is proportional to |N_y|. Since both the numerator and denominator are equivalent up to a constant factor, an identical ranking is produced by distillation and EM prior re-estimation.

Unfortunately, the EM algorithm does not provide good results on this task. In fact, as more iterations of EM are run, the performance drops drastically, even though the corpus likelihood is increasing. The reason is that unsupervised EM learning is not constrained or biased towards learning a good set of DEOs. Rather, a higher data likelihood can be achieved simply by assigning high prior probabilities to frequent word-types.

This can be seen qualitatively by considering the top-ranking DEOs after several iterations of EM/distillation (Figure 2). The top-ranking words are simply function words or other words common in the corpus, which have nothing to do with downward entailment. In effect, EM/distillation overrides the initialization based on Ladusaw's hypothesis and finds another solution with a higher data likelihood. We will also provide a quantitative analysis of the effects of EM/distillation in Section 5.

[Figure 2: Top 10 DEOs after iterations of EM on BLLIP.
1 iteration: denies, denied, unaware, longest, hardly, lacking, deny, nobody, opposes, highest
2 iterations: the, to, denied, than, that, if, has, denies, and, but
3 iterations: the, to, that, than, and, has, if, of, denied, denies]

4 Alternative to EM: Maximizing the Posterior Classification Certainty

We have seen that in trying to solve the piggybacker problem, EM/distillation too readily abandons the initialization based on Ladusaw's hypothesis, leading to an incorrect solution. Instead of optimizing the data likelihood, what we need is a measure of the number of plausible DEO candidates there are in an NPI context, and a method that refines the scores towards having only one such plausible candidate per context. To this end, we define the classification certainty to be the product of the maximum posterior classification probabilities over the DEO candidates. For a set of hidden variables y^N for NPI contexts N, this is the expression:

Certainty(y^N|N) = Π_{X⃗ ∈ N} max_y P(y|X⃗).   (19)

To increase this certainty score, we propose a novel iterative heuristic method for refining the baseline initializations of P(Y). Unlike EM/distillation, our method biases learning towards trusting the initialization, but refines the scores towards having only one plausible DEO per context in the training corpus. This is accomplished by treating the problem as a DEO classification problem, and then maximizing an objective ratio that favours one DEO per context. Our method is not guaranteed to increase classification certainty between iterations, but we will show that it does increase certainty very quickly in practice.

The key observation that allows us to resolve the tension between trusting the initialization and enforcing one DEO per NPI context is that the distributions of words that co-occur with DEOs and with piggybackers are different, and that this difference follows from Ladusaw's hypothesis. In particular, while DEOs may appear with or without piggybackers in NPI contexts, piggybackers do not appear without DEOs in NPI contexts, because Ladusaw's hypothesis stipulates that a DEO is required to license the NPI in the first place. Thus, the presence of a high-scoring DEO candidate among otherwise low-scoring words is strong evidence that the high-scoring word is not a piggybacker and its high score from the initialization is deserved. Conversely, a DEO candidate which always appears in the presence of other strong DEO candidates is likely a piggybacker whose initial high score should be discounted.

We now describe our heuristic method, which is based on this intuition. For clarity, we use scores rather than probabilities in the following explanation, though it is equally applicable to either. As in EM/distillation, the method is initialized with the baseline S(y) scores. One iteration of the method proceeds as follows. Let the score of the strongest DEO candidate in an NPI context p be:

M(p) = max_{y ∈ p} S_h^t(y),   (20)

where S_h^t(y) is the score of candidate y at the t-th iteration according to this heuristic method.

Then, for each word-type y in each context p, we compare the current score of y to the scores of the other words in p. If y is currently the strongest DEO candidate in p, then we give y credit equal to the proportional change to M(p) if y were removed (context p without y is denoted p \ y). A large change means that y is the only plausible DEO candidate in p, while a small change means that there are other plausible DEO candidates. If y is not currently the strongest DEO candidate, it receives no credit:

cred(p, y) = (M(p) − M(p \ y)) / M(p) if S_h^t(y) = M(p); 0 otherwise.   (21)
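The credit computation in equations (20) and (21), together with the score update that follows (each score is multiplied by its average credit over the contexts containing it), can be sketched in code. This is a minimal illustration under our own naming (certainty_iteration and its arguments are not from the paper); NPI contexts are treated as sets of word-types:

```python
from collections import defaultdict

def certainty_iteration(scores, contexts):
    """One iteration of the certainty-based heuristic.

    scores:   dict word-type -> current score S_h^t(y)
    contexts: list of NPI contexts, each a set of word-types
    Returns the updated scores S_h^{t+1}(y).
    """
    credit = defaultdict(float)    # running sum of cred(p, y) over p in N_y
    n_contexts = defaultdict(int)  # |N_y|: number of contexts containing y

    for p in contexts:
        m = max(scores[y] for y in p)          # M(p)  (Eq. 20)
        for y in p:
            n_contexts[y] += 1
            if scores[y] == m:
                rest = [scores[w] for w in p if w != y]
                m_without = max(rest) if rest else 0.0   # M(p \ y)
                credit[y] += (m - m_without) / m         # cred(p, y)  (Eq. 21)
            # a non-maximal candidate receives no credit

    # Each score is multiplied by its average credit (Eq. 22).
    return {y: scores[y] * credit[y] / n_contexts[y] for y in n_contexts}

# The configuration of Figure 3: four contexts, four word-types.
contexts = [{"A", "B", "C"}, {"B", "C"}, {"B", "C"}, {"D", "C"}]
scores = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 2.0}
print(sorted(certainty_iteration(scores, contexts).items()))
# → [('A', 1.0), ('B', 2.0), ('C', 0.0), ('D', 1.0)]
```

On the four-context example of Figure 3, one iteration maps the initial scores (5, 4, 1, 2) for (A, B, C, D) to (1, 2, 0, 1), demoting the piggybacker A below the true DEO B.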
If y is currently the strongest DEO candidate in p, then we give y credit equal Certainty(y N |N ) = ~ Y max P (y|X). (19) to the proportional change to M (p) if y were re- y ~ X∈N moved (Context p without y is denoted p \ y). A large change means that y is the only plausible To increase this certainty score, we propose DEO candidate in p, while a small change means a novel iterative heuristic method for refining that there are other plausible DEO candidates. If the baseline initializations of P (Y ). Unlike y is not currently the strongest DEO candidate, it EM/distillation, our method biases learning to- receives no credit: wards trusting the initialization, but refines the ( M (p)−M (p\y) scores towards having only one plausible DEO M (p) if Sht (y) = M (p) cred(p, y) = per context in the training corpus. This is accom- 0 otherwise. plished by treating the problem as a DEO classi- (21) 700 NPI contexts unlikely to be a DEO according to the initializa- A B C, B C, B C, D C tion. Original scores 5 Experiments S(A) = 5, S(B) = 4, S(C) = 1, S(D) = 2 We evaluate the performance of these methods on Updated scores the BLLIP corpus (∼30M words) and the AFP Sh (A) = 5 × (5 − 4)/5 =1 portion of the Gigaword corpus (∼338M words). Sh (B) = 4 × (0 + 2 × (4 − 1)/4)/3 =2 Following DLD09, we define an NPI context to be all the words to the left of an NPI, up to the Sh (C) = 1 × (0 + 0 + 0) =0 closest comma or semi-colon, and removed NPI Sh (D) = 2 × (2 − 1)/2 =1 contexts which contain the most common DEOs Figure 3: Example of one iteration of the certainty- like not. We further removed all empty NPI con- based heuristic on four NPI contexts with four words texts or those which only contain other punctua- in the lexicon. tion. After this filtering, there were 26696 NPI contexts in BLLIP and 211041 NPI contexts in AFP, using the same list of 26 NPIs defined by DLD09. 
We first define an automatic measure of per- Then, the average credit received by each y is formance that is common in information retrieval. a measure of how much we should trust the cur- We use average precision to quantify how well a rent score for y. The updated score for each DEO system separates DEOs from non-DEOs. Given a candidate is the original score multiplied by this list of known DEOs, G, and non-DEOs, the aver- average: age precision of a ranked list of items, X, is de- Sht (y) X fined by the following equation: Sht+1 (y) = × cred(p, y). (22) |Ny | Pn P (X1...k ) × 1(xk ∈ G) p∈Ny AP (X) = k=1 , |G| The probability P t+1 (Y = y) is then simply (24) Sht+1 (y) normalized: where P (X1...k ) is the precision of the first k S t+1 (y) P t+1 (Y = y) = X h . (23) items and 1(xk ∈ G) is an indicator function Sht+1 (y ′ ) which is 1 if x is in the gold standard list of DEOs y ′ ∈L and 0 otherwise. We iteratively reduce the scores in this fashion DLD09 simply evaluated the top 150 output to get better estimates of the relative suitability of DEO candidates by their systems, and qualita- word-types as DEOs. tively judged the precision of the top-k candidates An example of this method and how it solves at various values of k up to 150. Average preci- the piggybacker problem is given in Figure 3. In sion can be seen as a generalization of this evalu- this example, we would like to learn that B and ation procedure that is sensitive to the ranking of D are DEOs, A is a piggybacker, and C is a fre- DEOs and non-DEOs. For development purposes, quent word-type, such as a stop word. Using the we use the list of 150 annotations by DLD09. Of original scores, piggybacker A would appear to these, 90 were DEOs, 30 were not, and 30 were be the most likely word to be a DEO. 
However, classified as “other” (they were either difficult to by noticing that it never occurs on its own with classify, or were other types of non-veridical oper- words that are unlikely to be DEOs (in the exam- ators like comparatives or conditionals). We dis- ple, word C), our heuristic penalizes A more than carded the 30 “other” items and ignored all items B, and ranks B higher after one iteration. EM not in the remaining 120 items when evaluating a prior re-estimation would not correctly solve this ranked list of DEO candidates. We call this mea- example, as it would converge on a solution where sure AP120 . C receives all of the probability mass because it In addition, we annotated DEO candidates from appears in all of the contexts, even though it is the top-150 rankings produced by our certainty- 701 absolve, abstain, banish, bereft, boycott, cau- Method BLLIP AP120 AFP AP246 tion, clear, coy, delay, denial, desist, devoid, Baseline .879 .734 disavow, discount, dispel, disqualify, down- Distillation .946 .785 play, exempt, exonerate, foil, forbid, forego, This work .955 .809 impossible, inconceivable, irrespective, limit, Table 1: Average precision results on the BLLIP and mitigate, nip, noone, omit, outweigh, pre- AFP corpora. condition, pre-empt, prerequisite, refute, re- move5 , repel, repulse, scarcely, scotch, scuttle, seldom, sensitive, shy, sidestep, snuff, thwart, waive, zero-tolerance be obtained by examining the data likelihood and Figure 4: Lemmata of DEOs identified in this work not the classification certainty at each iteration of the found by DLD09. algorithms (Figure 5). Whereas EM/distillation maximizes the former expression, the certainty- based heuristic method actually decreases data likelihood for the first couple of iterations before based heuristic on BLLIP and also by the dis- increasing it again. 
In terms of classification cer- tillation and heuristic methods on AFP, in order tainty, EM/distillation converges to a lower classi- to better evaluate the final output of the meth- fication certainty score compared to our heuristic ods. This produced an additional 68 DEOs (nar- method. Thus, our method better captures the as- rowly defined) (Figure 4), 58 non-DEOs, and 31 sumption of one DEO per NPI context. “other” items4 . Adding the DEOs and non-DEOs we found to the 120 items from above, we have 6 Bootstrapping to Co-Learn NPIs and an expanded list of 246 items to rank, and a corre- DEOs sponding average precision which we call AP246 . The above experiments show that the heuristic We employ the frequency cut-offs used by method outperforms the EM/distillation method DLD09 for sparsity reasons. A word-type must given a list of NPIs. We would like to extend appear at least 10 times in an NPI context and this result to novel domains, corpora, and lan- 150 times in the corpus overall to be considered. guages. DLD09 and DL10 proposed the follow- We treat BLLIP as a development corpus and use ing bootstrapping algorithm for co-learning NPIs AP120 on AFP to determine the number of itera- and DEOs given a much smaller list of NPIs as a tions to run our heuristic (5 iterations for BLLIP seed set. and 13 iterations for AFP). We run EM/distillation for one iteration in development and testing, be- 1. Begin with a small set of seed NPIs cause more iterations hurt performance, as ex- 2. Iterate: plained in Section 3. We first report the AP120 results of our ex- (a) Use the current list of NPIs to learn a periments on the BLLIP corpus (Table 1 sec- list of DEOs ond column). Our method outperforms both (b) Use the current list of DEOs to learn a EM/distillation and the baseline method. These list of NPIs results are replicated on the final test set from AFP using the full set of annotations AP246 (Ta- Interestingly, DL10 report that while this ble 1 third column). 
Note that the scores are lower method works in Romanian data, it does not work when using all the annotations because there are in the English BLLIP corpus. They speculate that more non-DEOs relative to DEOs in this list, mak- the reason might be due to the nature of the En- ing the ranking task more challenging. glish DEO any, which can occur in all classes of A better understanding of the algorithms can DE contexts according to an analysis by Haspel- 4 math (1997). Further, they find that in Romanian, The complete list will be made publicly available. 5 We disagree with DLD09 that remove is not downward- distillation does not perform better than the base- entailing; e.g., The detergent removed stains from his cloth- line method during Step (2a). While this linguis- ing. ⇒ The detergent removed stains from his shirts. tic explanation may certainly be a factor, we raise 702 6 5 x 10 x 10 0 0 -0.5 -0.5 Log probability Log probability -1 -1 -1.5 -1.5 -2 -2 -2.5 -2.5 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Iterations Iterations (a) Data log likelihood. (b) Log classification certainty probabilities. Figure 5: Log likelihood and classification certainty probabilities of NPI contexts in two corpora. Thinner lines near the top are for BLLIP; thicker lines for AFP. Blue dotted: baseline; red dashed: distillation; green solid: ~ our certainty-based heuristic method. P (X|y) probabilities are not included since they would only result in a constant offset in the log domain. a second possibility that the distillation algorithm other spurious correlations such as piggybackers itself may be responsible for these results. As ev- as discussed earlier. In the other direction, it is idence, we show that the heuristic algorithm is not the case that DEOs always or nearly always able to work in English with just the single seed appear in the context of an NPI. 
Rather, the most NPI any, and in fact the bootstrapping approach in common collocations of DEOs are the selectional conjunction with our heuristic even outperforms preferences of the DEO, such as common argu- the above approaches when using a static list of ments to verbal DEOs, prepositions that are part NPIs. of the subcategorization of the DEO, and words In particular, we use the methods described in that together with the surface form of the DEO the previous sections for Step (2a), and the follow- comprise an idiomatic expression or multi-word ing ratio to rank NPI candidates in Step (2b), cor- expression. Further, NPIs are more likely to be responding to the baseline method to detect DEOs composed of multiple words, while many DEOs in reverse: are single words, possibly with PP subcategoriza- tion requirements which can be filled in post hoc. countD (x)/tokens(D) T (x) = . (25) Because of these issues, we cannot trust the ini- countC (x)/tokens(C) tialization to learn NPIs nearly as much as with DEOs, and cannot use the distillation or certainty Here, countD (x) refers to the number of oc- methods for this step. Rather, the hope is that currences of NPI candidate x in DEO contexts learning a noisy list of “pseudo-NPIs”, which of- D, defined to be the words to the right of a DEO ten occur in negative contexts but may not actu- operator up to a comma or semi-colon. We do ally be NPIs, can still improve the performance of not use the EM/distillation or heuristic methods in DEO detection. Step (2b). Learning NPIs from DEOs is a much There are a number of parameters to the method harder problem than learning DEOs from NPIs. which we tuned to the BLLIP corpus using Because DEOs (and other non-veridical opera- AP120 . At the end of Step (2a), we use the cur- tors) license NPIs, the majority of occurrences of rent top 25 DEOs plus 5 per iteration as the DEO NPIs will be in the context of a DEO, modulo am- list for the next step. 
To the initial seed NPI of biguity of DEOs such as the free-choice any and 703 Method BLLIP AP120 AFP AP246 be an instance of EM prior re-estimation, our Baseline .889 (+.010) .739 (−.005) method directly addresses the issue of piggyback- Distillation .930 (−.016) .804 (+.019) ers which spuriously correlate with NPIs but are This work .962 (+.007) .821 (+.012) not downward-entailing. This is achieved by maximizing the posterior classification certainty Table 2: Average precision results with bootstrapping of the corpus in a way that respects the initializa- on the BLLIP and AFP corpora. Absolute gain in av- erage precision compared to using a fixed list of NPIs tion, rather than maximizing the data likelihood given in brackets. as in EM/distillation. Our method outperforms distillation and a baseline method on two corpora anymore, anything, anytime, avail, bother, as well as in a bootstrapping setting where NPIs bothered, budge, budged, countenance, faze, and DEOs are jointly learned. It achieves the best fazed, inkling, iota, jibe, mince, nor, whatso- performance in the bootstrapping setting, rather ever, whit than when using a fixed list of NPIs. The perfor- mance of our algorithm suggests that it is suitable Figure 6: Probable NPIs found by bootstrapping using for other corpora and languages. the certainty-based heuristic method. Interesting future research directions include detecting DEOs of more than one word as well as distinguishing the particular word sense and sub- categorization that is downward-entailing. An- any, we add the top 5 ranking NPI candidates at other problem that should be addressed is the the end of Step (2b) in each subsequent iteration. scope of the downward entailment, generalizing We ran the bootstrapping algorithm for 11 itera- work being done in detecting the scope of nega- tions for all three algorithms. The final evaluation tion (Councill et al., 2010, for example). was done on AFP using AP246 . 
Acknowledgments The results show that bootstrapping can indeed improve performance, even in English (Table 2). We would like to thank Cristian Danescu- Using bootstrapping to co-learn NPIs and DEOs Niculescu-Mizil for his help with replicating his actually results in better performance than spec- results on the BLLIP corpus. This project was ifying a static list of NPIs. The certainty-based supported by the Natural Sciences and Engineer- heuristic in particular achieves gains with boot- ing Research Council of Canada. strapping in both corpora, in contrast to the base- line and distillation methods. Another factor that we found to be important is to add a sufficient References number of NPIs to the NPI list each iteration, as Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa T. adding too few NPIs results in only a small change Dang, and Danilo Giampiccolo. 2010. The sixth in the NPI contexts available for DEO detection. pascal recognizing textual entailment challenge. In The Text Analysis Conference (TAC 2010). DL10 only added one NPI per iteration, which Isaac G. Councill, Ryan McDonald, and Leonid Ve- may explain why they did not find any improve- likovich. 2010. What’s great and what’s not: ment with bootstrapping in English. It also ap- Learning to classify the scope of negation for im- pears that learning the pseudo-NPIs does not hurt proved sentiment analysis. In Proceedings of the performance in detecting DEO, and further, that Workshop on Negation and Speculation in Natural a number of true NPIs are learned by our method Language Processing, pages 51–59. Association for (Figure 6). Computational Linguistics. Cristian Danescu-Niculescu-Mizil and Lillian Lee. 7 Conclusion 2010. Don’t ‘have a clue’?: Unsupervised co- learning of downward-entailing operators. In Pro- We have proposed a novel unsupervised method ceedings of the ACL 2010 Conference Short Papers, for discovering downward-entailing operators pages 247–252. 
Association for Computational Lin- from raw text based on their co-occurrence with guistics. negative polarity items. Unlike the distilla- Cristian Danescu-Niculescu-Mizil, Lillian Lee, and tion method of DLD09, which we show to Richard Ducott. 2009. Without a ‘doubt’?: Un- supervised discovery of downward-entailing oper- 704 ators. In Proceedings of Human Language Tech- nologies: The 2009 Annual Conference of the North American Chapter of the Association for Computa- tional Linguistics. William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Pro- ceedings of the Workshop on Speech and Natural Language, pages 233–237. Association for Compu- tational Linguistics. Anastasia Giannakidou. 2002. Licensing and sensitiv- ity in polarity items: from downward entailment to nonveridicality. CLS, 38:29–53. Martin Haspelmath. 1997. Indefinite pronouns. Ox- ford University Press. Jack Hoeksema. 1997. Corpus study of negative po- larity items. IV-V Jornades de corpus linguistics 1996–1997. William A. Ladusaw. 1980. On the notion ‘affective’ in the analysis of negative-polarity items. Journal of Linguistic Research, 1(2):1–16. Timm Lichte and Jan-Philipp Soehn. 2007. The re- trieval and classification of negative polarity items using statistical profiles. Roots: Linguistics in Search of Its Evidential Base, pages 249–266. Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics. Frank Richter, Fabienne Fritzinger, and Marion Weller. 2010. Who can see the forest for the trees? ex- tracting multiword negative polarity items from dependency-parsed text. Journal for Language Technology and Computational Linguistics, 25:83– 110. Ton van der Wouden. 1997. Negative Contexts: Col- location, Polarity and Multiple Negation. Rout- ledge. 
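The candidate-ranking ratio of Eq. (25) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: tokenization and the extraction of DEO contexts (words to the right of a DEO up to a comma or semi-colon) are assumed to have been done already, and the function names are hypothetical.

```python
from collections import Counter

def rank_npi_candidates(deo_contexts, corpus, candidates):
    """Rank NPI candidates by Eq. (25): relative frequency inside
    DEO contexts D versus the whole corpus C.

    `deo_contexts` and `corpus` are token lists; `candidates` is the
    set of NPI candidates to score. Returns candidates sorted from
    highest to lowest T(x)."""
    count_d, count_c = Counter(deo_contexts), Counter(corpus)
    tokens_d, tokens_c = len(deo_contexts), len(corpus)
    scores = {}
    for x in candidates:
        if count_c[x] == 0:
            continue  # unseen in the corpus: ratio undefined, skip
        scores[x] = (count_d[x] / tokens_d) / (count_c[x] / tokens_c)
    return sorted(scores, key=scores.get, reverse=True)
```

A candidate that is over-represented in DEO contexts relative to its overall corpus frequency (a plausible NPI such as budge) scores high; a uniformly distributed word scores near 1.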
Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish

Luz Rello* (NLP and Web Research Groups, Univ. Pompeu Fabra, Barcelona, Spain)
Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
Ruslan Mitkov (Research Group in Computational Linguistics, Univ. of Wolverhampton, UK)

Abstract

In pro-drop languages, the detection of explicit subjects, zero subjects and non-referential impersonal constructions is crucial for anaphora and co-reference resolution. While the identification of explicit and zero subjects has attracted the attention of researchers in the past, the automatic identification of impersonal constructions in Spanish has not been addressed yet and this work is the first such study. In this paper we present a corpus to underpin research on the automatic detection of these linguistic phenomena in Spanish and a novel machine learning-based methodology for their computational treatment. This study also provides an analysis of the features, discusses performance across two different genres and offers error analysis. The evaluation results show that our system performs better in detecting explicit subjects than alternative systems.

1 Introduction

Subject ellipsis is the omission of the subject in a sentence. We consider not only missing referential subjects (zero subjects) as a manifestation of ellipsis, but also non-referential impersonal constructions.

Various natural language processing (NLP) tasks benefit from the identification of elliptical subjects, primarily anaphora resolution (Mitkov, 2002) and co-reference resolution (Ng and Cardie, 2002). The difficulty in detecting missing subjects and non-referential pronouns has been acknowledged since the first studies on the computational treatment of anaphora (Hobbs, 1977; Hirst, 1981). However, this task is of crucial importance when processing pro-drop languages, since subject ellipsis is a pervasive phenomenon in these languages (Chomsky, 1981). For instance, in our Spanish corpus, 29% of the subjects are elided.

Our method is based on the classification of all expressions in subject position, including the recognition of Spanish non-referential impersonal constructions which, to the best of our knowledge, has not yet been addressed. The necessity of identifying such elliptical constructions has been specifically highlighted in work on Spanish zero pronouns (Ferrández and Peral, 2000) and co-reference resolution (Recasens and Hovy, 2009).

The main contributions of this study are:

• A public annotated corpus in Spanish to compare different strategies for detecting explicit subjects, zero subjects and impersonal constructions.
• The first ML based approach to this problem in Spanish and a thorough analysis regarding features, learnability, genre and errors.
• The best performing algorithms to automatically detect explicit subjects and impersonal constructions in Spanish.

The remainder of the paper is organized as follows. Section 2 describes the classes of Spanish subjects, while Section 3 provides a literature review. Section 4 describes the creation and the annotation of the corpus, and in Section 5 the machine learning (ML) method is presented. The analysis of the features, the learning curves, the genre impact and the error analysis are all detailed in Section 6. Finally, in Section 7, conclusions are drawn and plans for future work are discussed. This work is an extension of the first author's master's thesis (Rello, 2010) and a preliminary version of the algorithm was presented in Rello et al. (2010).

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 706–715, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2 Classes of Spanish Subjects

Literature related to ellipsis in NLP (Ferrández and Peral, 2000; Rello and Illisei, 2009a; Mitkov, 2010) and linguistic theory (Bosque, 1989; Brucart, 1999; Real Academia Española, 2009) has served as a basis for establishing the classes of this work.

Explicit subjects are phonetically realized and their syntactic position can be pre-verbal or post-verbal. In the case of post-verbal subjects (a), the syntactic position is restricted by some conditions (Real Academia Española, 2009).

(a) Carecerán de validez las disposiciones que contradigan otra de rango superior.¹
    The dispositions which contradict higher range ones will not be valid.

Zero subjects (b) appear as the result of a nominal ellipsis. That is, a lexical element (the elliptic subject), which is needed for the interpretation of the meaning and the structure of the sentence, is elided; therefore, it can be retrieved from its context. The elision of the subject can affect the entire noun phrase and not just the noun head when a definite article occurs (Brucart, 1999).

(b) Ø Fue refrendada por el pueblo español.
    (It) was countersigned by the people of Spain.

The class of impersonal constructions is formed by impersonal clauses (c) and reflexive impersonal clauses with the particle se (d) (Real Academia Española, 2009).

(c) No hay matrimonio sin consentimiento.
    (There is) no marriage without consent.

(d) Se estará a lo que establece el apartado siguiente.
    (It) will be what is established in the next section.

3 Related Work

Identification of non-referential pronouns, although a crucial step in co-reference and anaphora resolution systems (Mitkov, 2010),² has been applied only to the pleonastic it in English (Evans, 2001; Boyd et al., 2005; Bergsma et al., 2008) and expletive pronouns in French (Danlos, 2005). Machine learning methods are known to perform better than rule-based techniques for identifying non-referential expressions (Boyd et al., 2005). However, there is some debate as to which approach may be optimal in anaphora resolution systems (Mitkov and Hallett, 2007).

Both English and French texts use an explicit word, with some grammatical information (a third person pronoun), which is non-referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not realized by expletive or pleonastic pronouns but rather by a certain kind of ellipsis. For this reason, it is easy to mistake them for zero pronouns, which are, in fact, referential.

Previous work on detecting Spanish subject ellipsis focused on distinguishing verbs with explicit subjects and verbs with zero subjects (zero pronouns), using rule-based methods (Ferrández and Peral, 2000; Rello and Illisei, 2009b). The Ferrández and Peral algorithm (2000) outperforms the (Rello and Illisei, 2009b) approach with 57% accuracy in identifying zero subjects. In (Ferrández and Peral, 2000), the implementation of a zero subject identification and resolution module forms part of an anaphora resolution system.

ML based studies on the identification of explicit non-referential constructions in English present accuracies of 71% (Evans, 2001), 87.5% (Bergsma et al., 2008) and 88% (Boyd et al., 2005), while 97.5% is achieved for French (Danlos, 2005). However, in these languages, non-referential constructions are explicit and not omitted, which makes this task more challenging for Spanish.

¹ All the examples provided are taken from our corpus. In the examples, explicit subjects are presented in italics. Zero subjects are presented by the symbol Ø, and in the English translations the subjects which are elided in Spanish are marked with parentheses. Impersonal constructions are not explicitly indicated.
² In zero anaphora resolution, the identification of zero anaphors first requires that they be distinguished from non-referential impersonal constructions (Mitkov, 2010).

4 Corpus

We created and annotated a corpus composed
of legal texts (law) and health texts (psychiatric papers) originally written in peninsular Spanish. The corpus is named after its annotated content "Explicit Subjects, Zero Subjects and Impersonal Constructions" (ESZIC es Corpus).

To the best of our knowledge, the existing corpora annotated with elliptical subjects belong to other genres. The Blue Book (handbook) and Lexesp (journalistic texts) used in (Ferrández and Peral, 2000) contain zero subjects but not impersonal constructions. On the other hand, the Spanish AnCora corpus based on journalistic texts includes zero pronouns and impersonal constructions (Recasens and Martí, 2010), while the Z-corpus (Rello and Illisei, 2009b) comprises legal, instructional and encyclopedic texts but has no annotated impersonal constructions.

The ESZIC corpus contains a total of 6,827 verbs including 1,793 zero subjects. Except for AnCora-ES, with 10,791 elliptic pronouns, our corpus is larger than the ones used in previous approaches: about 1,830 verbs including zero and explicit subjects in (Ferrández and Peral, 2000) (the exact number is not mentioned in the paper) and 1,202 zero subjects in (Rello and Illisei, 2009b).

The corpus was parsed by Connexor's Machinese Syntax (Connexor Oy, 2006), which returns lexical and morphological information as well as the dependency relations between words by employing a functional dependency grammar (Tapanainen and Järvinen, 1997).

To annotate our corpus we created an annotation tool that extracts the finite clauses, and the annotators assign to each example one of the defined annotation tags. Two volunteer graduate students of linguistics annotated the verbs after one training session. The annotations of a third volunteer with the same profile were used to compute the inter-annotator agreement. During the annotation phase, we evaluated the adequacy and clarity of the annotation guidelines and established a typology of the arising borderline cases, which is included in the annotation guidelines.

Table 1 shows the linguistic and formal criteria used to identify the chosen categories that served as the basis for the corpus annotation. For each tag, in addition to the two criteria that are crucial for identifying subject ellipsis ([± elliptic] and [± referential]), a combination of syntactic, semantic and discourse knowledge is also encoded during the annotation. The linguistic motivation for each of the three categories is shown against the thirteen annotation tags to which they belong (Table 1). Afterwards, each of the tags is grouped into one of the three main classes:

• Explicit subjects: [− elliptic, + referential].
• Zero subjects: [+ elliptic, + referential].
• Impersonal constructions: [+ elliptic, − referential].

Table 1: ESZIC Corpus Annotation Tags. Columns: phonetic realization (elliptic noun phrase, elliptic noun phrase head), syntactic category (nominal subject), verbal diathesis (active participant, active subject) and discourse interpretation (referential).

Annotation tag (class)                      EllNP  EllNPhead  Nominal  ActPart  ActSubj  Referential
Explicit subject (Explicit)                   −       −         +        +        +         +
Reflex passive subject (Explicit)             −       −         +        +        −         +
Passive subject (Explicit)                    −       −         +        −        −         +
Omitted subject (Zero)                        +       −         +        +        +         +
Omitted subject head (Zero)                   −       +         +        +        +         +
Non-nominal subject (Zero)                    −       −         −        +        +         +
Reflex passive omitted subject (Zero)         +       −         +        +        −         +
Reflex pass. omitted subject head (Zero)      −       +         +        +        −         +
Reflex pass. non-nominal subject (Zero)       −       −         −        +        −         +
Passive omitted subject (Zero)                +       −         +        −        −         +
Passive non-nominal subject (Zero)            −       −         −        −        −         +
Reflex imp. clause with se (Impersonal)       −       −         n/a      −        n/a       −
Imp. construction without se (Impersonal)     −       −         n/a      +        n/a       −

Of these annotated verbs, 71% have an explicit subject, 26% have a zero subject and 3% belong to an impersonal construction (see Table 2).

Table 2: Instances per class in the ESZIC Corpus.

Number of instances   Legal   Health   All
Explicit subjects     2,739   2,116    4,855
Zero subjects           619   1,174    1,793
Impersonals              71     108      179
Total                 3,429   3,398    6,827

To measure inter-annotator reliability we use Fleiss' Kappa statistical measure (Fleiss, 1971). We extracted 10% of the instances of each of the texts of the corpus, covering the two genres.

Table 3: Inter-annotator Agreement.

Fleiss' Kappa      Legal   Health   All
Two Annotators     0.934   0.870    0.902
Three Annotators   0.925   0.857    0.891

In Table 3 we present the Fleiss kappa inter-annotator agreement for two and three annotators. These results suggest that the annotation is reliable, since it is common practice among researchers in computational linguistics to consider 0.8 as a minimum value of acceptance (Artstein and Poesio, 2008).
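The agreement statistic reported in Table 3 is the standard Fleiss (1971) measure; a minimal sketch of its computation (generic, not the authors' tooling) is:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one row per annotated item, each row
    giving how many annotators assigned the item to each category.
    Every item must be rated by the same number of annotators n."""
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # ratings per item
    k = len(ratings[0])       # number of categories
    # observed agreement: mean per-item agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # chance agreement: from overall category proportions p_j
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

For example, three annotators tagging every verb identically yields a kappa of 1.0, while systematic disagreement pushes the value toward (or below) zero; values above 0.8, as in Table 3, are conventionally taken as reliable annotation (Artstein and Poesio, 2008).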
5 Machine Learning Approach

We opted for an ML approach given that our previous rule-based methodology improved only 0.02 over the 0.55 F-measure of a simple baseline (Rello and Illisei, 2009b). Besides, ML based methods for the identification of explicit non-referential constructions in English appear to perform better than rule-based ones (Boyd et al., 2005).

5.1 Features

We built the training data from the annotated corpus and defined fourteen features. The linguistically motivated features are inspired by previous ML approaches in Chinese (Zhao and Ng, 2007) and English (Evans, 2001). The values for the features (see Table 4) were derived from information provided both by the Connexor's Machinese Syntax parser and a set of lists. We can describe each of the features as broadly belonging to one of ten classes, as follows:

1 PARSER: the presence or absence of a subject in the clause, as identified by the parser. We are not aware of a formal evaluation of Connexor's accuracy. It presents an accuracy of 74.9% evaluated against our corpus and we used it as a simple baseline.

2 CLAUSE: the clause types considered are: main clauses, relative clauses starting with a complex conjunction, clauses starting with a simple conjunction, and clauses introduced using punctuation marks (commas, semicolons, etc.). We implemented a method to identify these different types of clauses, as the parser does not explicitly mark the boundaries of clauses within sentences. The method took into account the existence of a finite verb, its dependencies, and the existence of conjunctions and punctuation marks.

3 LEMMA: lexical information extracted from the parser, the lemma of the finite verb.

4-5 NUMBER, PERSON: morphological information of the verb, its grammatical number and its person.

6 AGREE: feature which encodes the tense, mood, person, and number of the verb in the clause, and its agreement in person, number, tense, and mood with the preceding verb in the sentence and also with the main verb of the sentence.³

7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are represented by the number of noun phrases in the clause that precede the verb, the total number of noun phrases in the clause, and the number of infinitive verbs in the clause.

³ In Spanish, when a finite verb appears in a subordinate clause, its tense and mood can assist in recognition of these features in the verb of the main clause and help to enforce some restrictions required by this verb, especially when both verbs share the same referent as subject.

Table 4: Features, definitions and values.

Feature        Definition                                        Value
1  PARSER      Parsed subject                                    True, False
2  CLAUSE      Clause type                                       Main, Rel, Imp, Prop, Punct
3  LEMMA       Verb lemma                                        Parser's lemma tag
4  NUMBER      Verb morphological number                         SG, PL
5  PERSON      Verb morphological person                         P1, P2, P3
6  AGREE       Agreement in person, number, tense and mood       FTFF, TTTT, FFFF, TFTF, TTFF, FTFT, FTTF, TFTT, FFFT, TTTF, FFTF, TFFT, FFTT, FTTT, TFFF, TTFT
7  NHPREV      Previous noun phrases                             Number of noun phrases previous to the verb
8  NHTOT       Total noun phrases                                Number of noun phrases in the clause
9  INF         Infinitive                                        Number of infinitives in the clause
10 SE          Spanish particle se                               True, False
11 A           Spanish preposition a                             True, False
12 POSpre      Four parts of speech previous to the verb         292 different values combining the parser's POS tags
13 POSpos      Four parts of speech following the verb           280 different values combining the parser's POS tags
14 VERBtype    Type of verb: copulative, impersonal, pronominal, transitive and intransitive   CIPX, XIXX, XXXT, XXPX, XXXI, CIXX, XXPT, XIPX, XIPT, XXXX, XIXI, CXPI, XXPI, XIPI, CXPX
10 SE: a binary feature encoding the presence or absence of the Spanish particle se when it occurs immediately before or after the verb, or with a maximum of one token lying between the verb and itself. The particle se occurs in passive reflex clauses with zero subjects and in some impersonal constructions.

11 A: a binary feature encoding the presence or absence of the Spanish preposition a in the clause, since the distinction between passive reflex clauses with zero subjects and impersonal constructions sometimes relies on the appearance of the preposition a (to, for, etc.). For instance, example (e) is a passive reflex clause containing a zero subject while example (f) is an impersonal construction.

(e) Se admiten los alumnos que reúnan los requisitos.
    Ø (They) accept the students who fulfill the requirements.

(f) Se admite a los alumnos que reúnan los requisitos.
    (It) is accepted for the students who fulfill the requirements.

12-13 POSpre, POSpos: the part of speech (POS) of eight tokens, that is, the 4-grams preceding and the 4-grams following the instance.

14 VERBtype: the verb is classified as copulative, pronominal, transitive, or with an impersonal use.⁴ Verbs belonging to more than one class are also accommodated, with different feature values for each of the possible combinations of verb type.

⁴ We used four lists provided by Molino de Ideas s.a. containing 11,060 different verb lemmas belonging to the Royal Spanish Academy Dictionary (Real Academia Española, 2001).

5.2 Evaluation

To determine the most accurate algorithm for our classification task, two comparisons of learning algorithms implemented in WEKA (Witten and Frank, 2005) were carried out. Firstly, the classification was performed using 20% of the training instances. Secondly, the seven highest performing classifiers were compared using 100% of the training data and ten-fold cross-validation. The corpus was partitioned into training and test sets using ten-fold cross-validation for randomly ordered instances in both cases. The lazy learning classifier K* (Cleary and Trigg, 1995), using a blending parameter of 40%, was the best performing one, with an accuracy of 87.6% for ten-fold cross-validation. K* differs from other instance-based learners in that it computes the distance between two instances using a method motivated by information theory, where a maximum entropy-based distance function is used (Cleary and Trigg, 1995). Table 5 shows the results for each class using ten-fold cross-validation. In contrast to previous work, the K* algorithm (Cleary and Trigg, 1995) was found to provide the most accurate classification in the current study.

Table 5: K* performance (87.6% accuracy for ten-fold cross validation).

Class            P       R       F       Acc.
Explicit subj.   90.1%   92.3%   91.2%   87.3%
Zero subj.       77.2%   74.0%   75.5%   87.4%
Impersonals      85.6%   63.1%   72.7%   98.8%

Other approaches have employed various classification algorithms, including JRip in WEKA (Müller, 2006), with precision of 74% and recall of 60%, and K-nearest neighbors in TiMBL: both in (Evans, 2001), with precision of 73% and recall of 69%, and in (Boyd et al., 2005), with precision of 82% and recall of 71%.

Since there is no previous ML approach for this task in Spanish, our baselines for the explicit subjects and the zero subjects are the parser output and the previous rule-based work with the highest performance (Ferrández and Peral, 2000). For the impersonal constructions the baseline is a simple greedy algorithm that classifies as an impersonal construction every verb whose lemma is categorized as a verb with impersonal use according to the RAE dictionary (Real Academia Española, 2001).

Our method outperforms the Connexor parser, which identifies the explicit subjects but makes no distinction between zero subjects and impersonal constructions. Connexor yields 74.9% overall accuracy, and 80.2% and 65.6% F-measure for explicit and elliptic subjects, respectively.

To compare with Ferrández and Peral (2000), we consider our system tested without impersonal constructions. We achieve a precision of 87% for explicit subjects compared to 80%, and a precision of 87% for zero subjects compared to their 98%. The overall accuracy is the same for both techniques, 87.5%, but our results are more balanced. Nevertheless, the approaches and corpora used in both studies are different, and hence it is not possible to do a fair comparison. For example, their corpus has 46% of zero subjects while ours has only 26%.

For impersonal constructions our method outperforms the RAE baseline (precision 6.5%, recall 77.7%, F-measure 12.0% and accuracy 70.4%). Table 6 summarizes the comparison. The low performance of the RAE baseline is due to the fact that verbs with impersonal use are often ambiguous. For these cases, we first tagged them as ambiguous and then we defined additional criteria after analyzing them manually. The resulting annotation criteria are stated in Table 1.

Table 6: Summary of accuracy comparison with previous work.

Algorithm    Explicit subjects   Zero subjects   Impersonals
RAE          –                   –               70.4%
Connexor     71.7%               83.0%           –
Ferr./Peral  79.7%               98.4%           –
Elliphant    87.3%               87.4%           98.8%

6 Analysis

Through these analyses we aim to extract the most effective features and the information that would complement the output of a standard parser to achieve this task. We also examine the learning process of the algorithm to find out how many instances are needed to train it efficiently, and determine how much Elliphant is genre dependent. The analyses indicate that our approach is robust: it performs nearly as well with just six features, has a steep learning curve, and seems to generalize well to other text collections.

6.1 Best Features

We carried out three different experiments to evaluate the most effective group of features, and the features themselves, considering the individual predictive ability of each one along with their degree of redundancy. Based on the following three feature selection methods we can state that there is a complex and balanced interaction between the features.

6.1.1 Grouping Features

In the first experiment we considered the 11 groups of relevant ordered features from the training data, which were selected using each WEKA attribute selection algorithm, and performed the classifications over the complete training data using only the different groups of features selected. The most effective group of six features (NHPREV, PARSER, NHTOT, POSpos, PERSON, LEMMA) was the one selected by WEKA's SymmetricalUncertAttribute technique, which gives an accuracy of 83.5%. The most frequently selected features by all methods are PARSER, POSpos and NHTOT, and they alone get an accuracy of 83.6% together. As expected, the two pairs of features that perform best (both 74.8% accuracy) are PARSER with either POSpos or NHTOT. Based on how frequently each feature is selected by WEKA's attribute selection algorithms, we can rank the features as follows: (1) PARSER, (2) NHTOT, (3) POSpos, (4) NHPREV and (5) LEMMA.

6.1.2 "Complex" vs. "Simple" Features

Second, a set of experiments was conducted in which features were selected on the basis of the degree of computational effort needed to generate them. We propose two sets of features. One group corresponds to "simple" features, whose values can be obtained by trivial exploitation of the tags produced in the parser's output (PARSER, LEMMA, PERSON, POSpos, POSpre). The second group, the "complex" features (CLAUSE, AGREE, NHPREV, NHTOT, VERBtype), have values that required the implementation of more sophisticated modules to identify the boundaries of syntactic constituents such as clauses and noun phrases. The accuracy obtained when the classifier exclusively exploits "complex" features is 82.6%, while for "simple" features it is 79.9%. No impersonal constructions are identified when only "complex" features are used.

6.1.3 One-left-out Feature

In the third experiment, to estimate the weight of each feature, classifications were made in which each feature was omitted from the training instances presented to the classifier. Omission of all but one of the "simple" features led to a reduction in accuracy, justifying their inclusion in the training instances. Nevertheless, the majority of features present low informativeness, except for feature A, which does not make any meaningful contribution to the classification. The feature PARSER presents the greatest difference in performance (86.3% total accuracy); however, this is no big loss, considering it is the main feature. Hence, as most features do not bring a significant loss in accuracy, the features need to be combined to improve the performance.

6.2 Learning Analysis

The learning curve of Figure 1 (left) presents the increase of the performance obtained by Elliphant using the training data randomly ordered. The performance reaches its plateau using 90% of the training instances. Using different orderings of the training set we obtain the same result.

Figure 1 (right) presents the precision for each class and overall in relation to the number of training instances for each one of them. Recall grows similarly to precision. Under all conditions, subjects are classified with a high precision, since the information given by the parser (collected in the features) achieves an accuracy of 74.9% for the identification of explicit subjects.

The impersonal construction class has the fastest learning curve. When utilizing a training set of only 163 instances (90% of the training data), it reaches a precision of 63.2%. The unstable behaviour for impersonal constructions can be attributed to not having enough training data for that class, since impersonals are not frequent in Spanish. On the other hand, the zero subject class is learned more gradually.

The learning curve for the explicit subject class is almost flat due to the great variety of subjects occurring in the training data. In addition, reaching a precision of 92.0% for explicit subjects is far more expensive in terms of the number of training instances, as seen in Figure 1 (right). Actually, with just 20% of the training data (978 instances) we can already achieve a precision of 85.9%.

This demonstrates that Elliphant does not need very large sets of expensive training data and is able to reach adequate levels of performance when exploiting far fewer training instances. In fact, we see that we only need a modest set of annotated instances (fewer than 1,500) to achieve good results.
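The incremental-training evaluation behind a learning curve like the one described in this section can be sketched as follows. This is an illustrative stand-in, not the authors' WEKA setup: a plain 1-nearest-neighbour classifier replaces K*, and the data shown in the usage is synthetic.

```python
def one_nn_predict(train, x):
    """Label of the nearest training instance by squared Euclidean
    distance; a simple stand-in for the K* instance-based learner."""
    return min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))[1]

def learning_curve(train, test, fractions):
    """Accuracy on `test` after training on growing prefixes of `train`.

    `train` and `test` are lists of (feature_vector, label) pairs;
    `fractions` are the training-set proportions to evaluate."""
    curve = []
    for frac in fractions:
        subset = train[: max(1, int(frac * len(train)))]
        correct = sum(one_nn_predict(subset, x) == y for x, y in test)
        curve.append((frac, correct / len(test)))
    return curve
```

Plotting the resulting (fraction, accuracy) pairs for a shuffled training set gives the kind of curve shown in Figure 1; a plateau well before 100% indicates, as argued above, that a modest amount of annotated data suffices.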
6.3 Impact of Genre

To examine the influence of the different text genres on this method, we divided our training data into two subgroups belonging to different genres (legal and health) and analyze the differences. A comparative evaluation using ten-fold cross-validation over the two subgroups shows that Elliphant is more successful when classifying instances of explicit subjects in legal texts (89.8% accuracy) than in health texts (85.4% accuracy). This may be explained by the greater uniformity of the sentences in the legal genre compared to ones from the health genre, as well as by the fact that there is a larger number of explicit subjects in the legal training data (2,739, compared with 2,116 in the health texts). Further, texts from the health genre present the additional complication of specialized named entities and acronyms, which are used quite frequently. Similarly, the better performance in the detection of zero subjects and impersonal sentences in the health texts may be due to their more frequent occurrence and hence greater learnability.

We have also studied the effect of training the classifier on data derived from one genre and testing on instances derived from a different genre. Table 7 shows that instances from legal texts are more homogeneous, as the classifier obtains higher accuracy when testing and training only on legal instances (90.0%). In addition, legal texts are also more informative, because when both legal and health genres are combined as training data, only instances from the health genre show a significantly increased accuracy (93.7%). These results reveal that the health texts are the most heterogeneous ones. In fact, we also found subsets of the legal documents where our method achieves an accuracy of 94.6%, implying more homogeneous texts.

Training/Testing   Legal    Health   All
Legal              90.0%    86.8%    89.3%
Health             86.8%    85.9%    88.7%
All                92.5%    93.7%    87.6%

Table 7: Accuracy of cross-genre training and testing evaluation (ten-fold evaluation).

6.4 Error Analysis

Since the features of the system are linguistically motivated, we performed a linguistic analysis of the erroneously classified instances to find out which patterns are more difficult to classify and which type of information would improve the method (Rello et al., 2011).

We extract the erroneously classified instances of our training data and classify the errors. According to the distribution of the errors per class (Table 8), we take into account the following four classes of errors for the analysis: (a) impersonal constructions classified as zero subjects, (b) impersonal constructions classified as explicit subjects, (c) zero subjects classified as explicit subjects, and (d) explicit subjects classified as zero subjects. The diagonal numbers are the true predicted cases.

Class            Zero subj.   Explicit subj.   Impers.
Zero subj.       1327         453 (c)          13
Explicit subj.   368 (d)      4481             6
Impersonals      25 (a)       41 (b)           113

Table 8: Confusion matrix (ten-fold validation).

The classification of impersonal constructions is less balanced than the ones for explicit subjects and zero subjects. Most of the wrongly identified instances are classified as explicit subject, given that this class is the largest one. On the other hand, 25% of the zero subjects are classified as explicit subject, while only 8% of the explicit subjects are identified as zero subjects.

For the analysis we first performed an exploration of the feature values, which allows us to generate smaller samples of the groups of errors for the further linguistic analyses. Then, we explore the linguistic characteristics of the instances by examining the clause in which the instance appears in our corpus. A great variety of different patterns are found. We mention only the linguistic characteristics in the errors which at least double the corpus general trends.

In all groups (a-d) there is a tendency of using the following elements: post-verbal prepositions, auxiliary verbs, future verbal tenses, subjunctive verbal mood, negation, punctuation marks appearing before the verb and the preceding noun phrases, and concessive and adverbial subordinate clauses. In groups (a) and (b) the lemma of the verb may play a relevant role: for instance, the verb haber ('there is/are') appears in the errors seven times more than in the training data, while the verb tratar ('to be about', 'to deal with') appears 12 times more. Finally, in groups (c) and (d) we notice the frequent occurrence of idioms which include verbs with impersonal uses, such as es decir ('that is to say'), and words which can be a subject on their own, i.e. ambos ('both') or todo ('all').

7 Conclusions and Future Work

In this study we learn which is the most accurate approach for identifying explicit subjects and impersonal constructions in Spanish, and which are the linguistic characteristics and features that help to perform this task. The corpus created is freely available online.5 Our method complements previous work on Spanish anaphora resolution by addressing the identification of non-referential constructions. It outperforms current approaches in explicit subject detection and impersonal constructions, doing better than the parser for every class.

5 The ESZIC es Corpus is available at: http://luzrello.com/Projects.html

A possible future avenue to explore could be to combine our approach with that of Ferrández and Peral (2000) by employing both algorithms in sequence: first Ferrández and Peral's algorithm to detect all zero subjects and then ours to identify explicit subjects and impersonals. Assuming that the same accuracy could be maintained, on our data set the combined performance could potentially be in the range of 95%.

Future research goals are the extrinsic evaluation of our system by integrating it into NLP tasks and its adaptation to other Romance pro-drop languages. Finally, we believe that our ML approach could be improved, as it is the first attempt of this kind.

Acknowledgements

We thank Richard Evans, Julio Gonzalo and the anonymous reviewers for their wise comments.

References

R. Artstein and M. Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555-596.

S. Bergsma, D. Lin, and R. Goebel. 2008. Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), pages 10-18.

I. Bosque. 1989. Clases de sujetos tácitos. In Julio Borrego Nieto, editor, Philologica: homenaje a Antonio Llorente, volume 2, pages 91-112. Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca.

A. Boyd, W. Gegg-Harrison, and D. Byron. 2005. Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing, 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 40-47.

J. M. Brucart. 1999. La elipsis. In I. Bosque and V. Demonte, editors, Gramática descriptiva de la lengua española, volume 2, pages 2787-2863. Espasa-Calpe, Madrid.

N. Chomsky. 1981. Lectures on Government and Binding. Mouton de Gruyter, Berlin, New York.

J. G. Cleary and L. E. Trigg. 1995. K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), pages 108-114.

Connexor Oy. 2006. Machinese language model.

L. Danlos. 2005. Automatic recognition of French expletive pronoun occurrences. In Robert Dale, Kam-Fai Wong, Jiang Su, and Oi Yee Kwong, editors, Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), pages 73-78, Berlin, Heidelberg, New York. Springer. Lecture Notes in Computer Science, Vol. 3651.

R. Evans. 2001. Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16(1):45-57.

A. Ferrández and J. Peral. 2000. A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 166-172.

J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

G. Hirst. 1981. Anaphora in natural language understanding: a survey. Springer-Verlag.

J. Hobbs. 1977. Resolving pronoun references. Lingua, 44:311-338.

R. Mitkov and C. Hallett. 2007. Comparing pronoun resolution algorithms. Computational Intelligence, 23(2):262-297.

R. Mitkov. 2002. Anaphora resolution. Longman, London.

R. Mitkov. 2010. Discourse processing. In Alexander Clark, Chris Fox, and Shalom Lappin, editors, The handbook of computational linguistics and natural language processing, pages 599-629. Wiley Blackwell, Oxford.

C. Müller. 2006. Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pages 49-56.

V. Ng and C. Cardie. 2002. Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), pages 1-7.

Real Academia Española. 2001. Diccionario de la lengua española. Espasa-Calpe, Madrid, 22 edition.

Real Academia Española. 2009. Nueva gramática de la lengua española. Espasa-Calpe, Madrid.

M. Recasens and E. Hovy. 2009. A deeper look into features for coreference resolution. In Lalitha Devi Sobha, António Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), pages 29-42. Springer, Berlin, Heidelberg, New York. Lecture Notes in Computer Science, Vol. 5847.

M. Recasens and M. A. Martí. 2010. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315-345.

L. Rello and I. Illisei. 2009a. A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), pages 209-214. Presses Universitaires de Franche-Comté, Besançon.

L. Rello and I. Illisei. 2009b. A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop, International Conference on Recent Advances in Natural Language Processing (RANLP-09), pages 209-214.

L. Rello, P. Suárez, and R. Mitkov. 2010. A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural, 45:281-287.

L. Rello, G. Ferraro, and A. Burga. 2011. Error analysis for the improvement of subject ellipsis detection. Procesamiento del Lenguaje Natural, 47:223-230.

L. Rello. 2010. Elliphant: A machine learning method for identifying subject ellipsis and impersonal constructions in Spanish. Master's thesis, Erasmus Mundus, University of Wolverhampton & Universitat Autònoma de Barcelona.

P. Tapanainen and T. Järvinen. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), pages 64-71.

I. H. Witten and E. Frank. 2005. Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2 edition.

S. Zhao and H. T. Ng. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), pages 541-550.

Validation of sub-sentential paraphrases acquired from parallel monolingual corpora

Houda Bouamor   Aurélien Max   Anne Vilnat
LIMSI-CNRS & Univ. Paris Sud
Orsay, France

[email protected]

Abstract

The task of paraphrase acquisition from related sentences can be tackled by a variety of techniques making use of various types of knowledge. In this work, we make the hypothesis that their performance can be increased if candidate paraphrases can be validated using information that characterizes paraphrases independently of the set of techniques that proposed them. We implement this as a bi-class classification problem (i.e. paraphrase vs. not paraphrase), allowing any paraphrase acquisition technique to be easily integrated into the combination system. We report experiments on two languages, English and French, with 5 individual techniques on monolingual parallel corpora obtained via multiple translation, and a large set of classification features including surface to contextual similarity measures. Relative improvements in F-measure close to 18% are obtained on both languages over the best performing techniques.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 716-725, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

The fact that natural language allows messages to be conveyed in a great variety of ways constitutes an important difficulty for NLP, with applications in both text analysis and generation. The term paraphrase is now commonly used in the NLP literature to refer to textual units of equivalent meaning at the phrasal level (including single words). For instance, the phrases six months and half a year form a paraphrase pair applicable in many different contexts, as they would appropriately denote the same concept. Although one can envisage to manually build high-coverage lists of synonyms, enumerating meaning equivalences at the level of phrases is too daunting a task for humans. Because this type of knowledge can however greatly benefit many NLP applications, automatic acquisition of such paraphrases has attracted a lot of attention (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010), and significant research efforts have been devoted to this objective (Callison-Burch, 2007; Bhagat, 2009; Madnani, 2010).

Central to acquiring paraphrases is the need of assessing the quality of the candidate paraphrases produced by a given technique. Most works to date have resorted to human evaluation of paraphrases on the levels of grammaticality and meaning equivalence. Human evaluation is however often criticized as being both costly and non reproducible, and the situation is even more complicated by the inherent complexity of the task, which can produce low inter-judge agreement. Task-based evaluation involving the use of paraphrasing in some application thus seems an acceptable solution, provided the evaluation methodologies for the given task are deemed acceptable. This, in turn, puts the emphasis on observing the impact of paraphrasing on the targeted application and is rarely accompanied by a study of the intrinsic limitations of the paraphrase acquisition technique used.

The present work is concerned with the task of sub-sentential paraphrase acquisition from pairs of related sentences. A large variety of techniques have been proposed that can be applied to this task. They typically make use of different kinds of automatically or manually acquired knowledge. We make the hypothesis that their performance can be increased if candidate paraphrases can be validated using information that characterizes paraphrases in complement to the set of techniques that proposed them. We propose to implement this as a bi-class classification problem (i.e. paraphrase vs. not paraphrase), allowing any paraphrase acquisition technique to be easily integrated into the combination system. In this article, we report experiments on two languages, English and French, with 5 individual techniques based on a) statistical word alignment models, b) translational equivalence, c) handcoded rules of term variation, d) syntactic similarity, and e) edit distance on word sequences. We used parallel monolingual corpora obtained via multiple translation from a single language as our sources of related sentences, and a large set of features including surface to contextual similarity measures. Relative improvements in F-measure close to 18% are obtained on both languages over the best performing techniques.

The remainder of this article is organized as follows. We first briefly review previous work on sub-sentential paraphrase acquisition in section 2. We then describe our experimental setting in section 3 and the individual techniques that we have studied in section 4. Section 5 is devoted to our approach for validating paraphrases proposed by individual techniques. Finally, section 6 concludes the article and presents some of our future work in the area of paraphrase acquisition.

2 Related work

The hypothesis that if two words or, by extension, two phrases, occur in similar contexts then they may be interchangeable has been extensively tested. The distributional hypothesis, attributed to Zellig Harris, was for example applied to syntactic dependency paths in the work of Lin and Pantel (2001). Their results take the form of equivalence patterns with two arguments such as {X asks for Y, X requests Y, X's request for Y, X wants Y, Y is requested by X, ...}.

Using comparable corpora, where the same information probably exists under various linguistic forms, increases the likelihood of finding very close contexts for sub-sentential units. Barzilay and Lee (2003) proposed a multi-sequence alignment algorithm that takes structurally similar sentences and builds a compact lattice representation that encodes local variations. The work by Bhagat and Ravichandran (2008) describes an application of a similar technique on a very large scale.

The hypothesis that two words or phrases are interchangeable if they share a common translation into one or more other languages has also been extensively studied in works on sub-sentential paraphrase acquisition. Bannard and Callison-Burch (2005) described a pivoting approach that can exploit bilingual parallel corpora in several languages. The same technique has been applied to the acquisition of local paraphrasing patterns in Zhao et al. (2008). The work of Callison-Burch (2008) has shown how the monolingual context of a sentence to paraphrase can be used to improve the quality of the acquired paraphrases.

Another approach consists in modelling local paraphrasing identification rules. The work of Jacquemin (1999) on the identification of term variants, which exploits rewriting morphosyntactic rules and descriptions of morphological and semantic lexical families, can be extended to extract the various forms corresponding to input patterns from large monolingual corpora.

When parallel monolingual corpora aligned at the sentence level are available (e.g. multiple translations into the same language), the task of sub-sentential paraphrase acquisition can be cast as one of word alignment between two aligned sentences (Cohn et al., 2008). Barzilay and McKeown (2001) applied the distributionality hypothesis on such parallel sentences, and Pang et al. (2003) proposed an algorithm to align sentences by recursive fusion of their common syntactic constituents.

Finally, there has been a recent interest in automatic evaluation of paraphrases (Callison-Burch et al., 2008; Liu et al., 2010; Chen and Dolan, 2011; Metzler et al., 2011).

3 Experimental setting

We used the main aspects of the methodology described by Cohn et al. (2008) for constructing evaluation corpora and assessing the performance of techniques on the task of sub-sentential paraphrase acquisition. Pairs of related sentences are hand-aligned to define a set of reference atomic paraphrase pairs at the level of words or phrases, denoted as Ratom.1

1 Note that in this study we do not distinguish between "Sure" and "Possible" alignments, and when reusing annotated corpora using them we considered all alignments as being correct.

We conducted a small-scale study to assess different types of corpora of related sentences:

1. single language translation: Corpora obtained by several independent human translations of the same sentences (e.g. (Barzilay and McKeown, 2001)).

2. multiple language translation: Same as above, but where a sentence is translated from 4 different languages into the same language (Bouamor et al., 2010).

3. video descriptions: Descriptions of short YouTube videos obtained via Mechanical Turk (Chen and Dolan, 2011).

4. multiply-translated subtitles: Aligned multiple translations of contributed movie subtitles (Tiedemann, 2007).

5. comparable news headlines: News headlines collected from Google News clusters (e.g. (Dolan et al., 2004)).

We collected 100 sentence pairs of each type in French, for which various comparability measures are reported in Table 1. In particular, the "% aligned tokens" row indicates the proportion of tokens from the sentence pairs that could be manually aligned by a native-speaker annotator.2 Obviously, the more common tokens two sentences from a pair contain, the fewer sub-sentential paraphrases may be extracted from that pair. However, high lexical overlap increases the probability that two sentences be indeed paraphrases, and in turn the probability that some of their phrases be paraphrases. Furthermore, the presence of common tokens may serve as useful clues to guide paraphrase extraction.

2 The same annotator hand-aligned the 5*100=500 paraphrase pairs using the YAWAT (Germann, 2008) manual alignment tool.

                                          (1)     (2)     (3)     (4)     (5)
# tokens                                  4,476   4,630   1,452   2,721   1,908
# unique tokens                           656     795     357     830     716
% aligned tokens (excluding identities)   60.58   48.80   23.82   29.76   14.46
lexical overlap (tokens)                  77.21   61.03   59.50   32.51   39.63
lexical overlap (lemmas, content words)   83.77   71.04   64.83   39.54   45.31
translation edit rate (TER)               0.32    0.55    0.76    0.68    0.62
penalized n-gram precision (BLEU)         0.33    0.15    0.13    0.14    0.39

Table 1: Various indicators of sentence pair comparability for the corpus types (1) single language translation, (2) multiple language translation, (3) video descriptions, (4) multiply-translated subtitles, and (5) comparable news headlines. Statistics are reported for French on sets of 100 sentence pairs.

For our experiments, we chose to use parallel monolingual corpora obtained by single language translation, the most direct resource type for acquiring sub-sentential paraphrase pairs. This allows us to define acceptable references for the task and resort to the most consensual evaluation technique for paraphrase acquisition to date. Using such corpora, we expect to be able to extract precise paraphrases (see Table 1), which will be natural candidates for further validation, which will be addressed in section 5.3.

Figure 1 illustrates a reference alignment obtained on a pair of English sentential paraphrases and the list of atomic paraphrase pairs that can be extracted from it, against which acquisition techniques will be evaluated. Note that we do not consider pairs of identical units during evaluation, so we filter them out from the list of reference paraphrase pairs.

The example in Figure 1 shows different cases that point to the inherent complexity of this task, even for human annotators: it could be argued, for instance, that a correct atomic paraphrase pair should be reached ↔ amounted to rather than reached ↔ amounted. Also, aligning independently 260 ↔ 0.26 and million ↔ billion is assuredly an error, while the pair 260 million ↔ 0.26 billion would have been appropriate. A case of alignment that seems non trivial can be observed in the provided example (during the entire year ↔ annual). The abovementioned reasons will explain in part the difficulties in reaching high performance values using such gold standards.

    capital ↔ investment
    utilized ↔ used
    during the entire year ↔ annual
    reached ↔ amounted
    260 ↔ 0.26
    million ↔ billion
    us dollars ↔ us$

[Figure 1: Reference alignments for a pair of English sentential paraphrases from the annotation corpus of Cohn et al. (2008) (note that possible and sure alignments are not distinguished here) and the list of atomic paraphrase pairs extracted from these alignments.]

Reference composite paraphrase pairs (denoted as R), obtained by joining adjacent atomic paraphrase pairs from Ratom up to 6 tokens3, will also be considered when measuring performance. Evaluated techniques have to output atomic candidate paraphrase pairs (denoted as Hatom) from which composite paraphrase pairs (denoted as H) are computed. The usual measures of precision (P), recall (R) and F-measure (F1) can then be defined in the following way (Cohn et al., 2008):

    P = |Hatom ∩ R| / |Hatom|     R = |H ∩ Ratom| / |Ratom|     F1 = 2pr / (p + r)

3 We used standard biphrase extraction heuristics (Koehn et al., 2007): all words from a phrase must be aligned to at least one word from the other and not to words outside, but unaligned words at phrase boundaries are not used.

We conducted experiments using two different corpora in English and French. In each case, a held-out development corpus of 150 sentential paraphrase pairs was used for development and tuning, and all techniques were evaluated on the same test set consisting of 375 sentential paraphrase pairs. For English, we used the MTC corpus described in (Cohn et al., 2008), consisting of multiply-translated Chinese sentences into English, and used as our gold standard both the alignments marked as "Sure" and "Possible". For French, we used the CESTA corpus of news articles4 obtained by translating into French from English.

4 http://www.elda.org/article125.html

We used the YAWAT (Germann, 2008) manual alignment tool. Inter-annotator agreement values (averaging with each annotation set as the gold standard) are 66.1 for English and 64.6 for French, which we interpret as acceptable values. Manual inspection of the two corpora reveals that the French corpus tends to contain more literal translations, possibly due to the original languages of the sentences, which are closer to the target language than Chinese is to English.

4 Individual techniques for paraphrase acquisition

As discussed in section 2, the acquisition of sub-sentential paraphrases is a challenging task that has previously attracted a lot of work. In this work, we consider the scenario where sentential paraphrases are available and words and phrases from one sentence can be aligned to words and phrases from the other sentence to form atomic paraphrase pairs. We now describe several techniques that perform the task of sub-sentential unit alignment. We have selected and implemented five techniques which we believe are representative of the type of knowledge that these techniques use, and have reused existing tools, initially developed for other tasks, when possible.

4.1 Statistical learning of word alignments (Giza)

The GIZA++ tool (Och and Ney, 2004) computes statistical word alignment models of increasing complexity from parallel corpora. While originally developed in the bilingual context of Statistical Machine Translation, nothing prevents building such models on monolingual corpora. However, in order to build reliable models, it is necessary to use enough training material including minimal redundancy of words. To this end, we provided GIZA++ with all possible sentence pairs from our multiply-translated corpus to improve the quality of its word alignments (note that we used symmetrized alignments from the alignments in both directions). This constitutes a significant advantage for this technique that techniques working on each sentence pair independently do not have.

4.2 Translational equivalence (Pivot)

Translational equivalence can be exploited to determine that two phrases may be paraphrases. Bannard and Callison-Burch (2005) defined a paraphrasing probability between two phrases based on their translation probability through all possible pivot phrases as:

    Ppara(p1, p2) = Σ_piv Pt(piv|p1) Pt(p2|piv)

where Pt denotes translation probabilities. We used the Europarl corpus5 of parliamentary debates in English and French, consisting of approximately 1.7 million parallel sentences: this allowed us to use the same resource to build paraphrases for English, using French as the pivot language, and for French, using English as the pivot language. The GIZA++ tool was used for word alignment and the MOSES Statistical Machine Translation toolkit (Koehn et al., 2007) was used to compute phrase translation probabilities from these word alignments. For each sentential paraphrase pair, we applied the following algorithm: for each phrase, we build the entire set of paraphrases using the previous definition. We then extract its best paraphrase as the one exactly appearing in the other sentence with maximum paraphrase probability, using a minimal threshold value of 10^-4.

5 http://statmt.org/europarl

4.3 Linguistic knowledge on term variation (Fastr)

The FASTR tool (Jacquemin, 1999) was designed to spot term/phrase variants in large corpora. Variants are described through metarules expressing how the morphosyntactic structure of a term variant can be derived from a given term by means of regular expressions on word morphosyntactic categories. Paradigmatic variation can also be expressed by expressing constraints between words, imposing that they be of the same morphological or semantic family. Both constraints rely on preexisting repertoires available for English and French. To compute candidate paraphrase pairs using FASTR, we first consider all phrases from the first sentence and search for variants in the other sentence, then do the reverse process, and finally take the intersection of the two sets.

4.4 Syntactic similarity (Synt)

The algorithm introduced by Pang et al. (2003) takes two sentences as input and merges them by top-down syntactic fusion guided by compatible syntactic substructure. A lexical blocking mechanism prevents constituents from fusing when there is evidence of the presence of a word in another constituent of one of the sentences. We use the Berkeley Probabilistic parser (Klein and Manning, 2003) to obtain syntactic trees for English and its adapted version for French (Candito et al., 2010). Because this process is highly sensitive to syntactic parse errors, we use in our implementation k-best parses and retain the most compact fusion from any pair of candidate parses.

4.5 Edit rate on word sequences (TERp)

TERp (Translation Edit Rate Plus) (Snover et al., 2010) is a score designed for the evaluation of Machine Translation output. Its typical use takes a system hypothesis to compute an optimal set of word edits that can transform it into some existing reference translation. Edit types include exact word matching, word insertion and deletion, block movement of contiguous words (computed as an approximation), as well as optional variant substitution through stemming, synonym or paraphrase matching.6 Each edit type is parameterized by at least one weight, which can be optimized using e.g. hill climbing. TERp being a tunable metric, our experiments will include tuning TERp systems towards either precision (→P), recall (→R), or F-measure (→F1).7

6 Note that for these experiments we did not use the stemming module, the interface to WordNet for synonym matching and the provided paraphrase table for English, due to the fact that these resources were available for English only.

7 Hill climbing was used for all tunings as done by Snover et al. (2010), and we used one iteration starting with uniform weights and 100 random restarts.

4.6 Evaluation of individual techniques

Results for the 5 individual techniques are given in the left part of Table 2. It is first apparent that all techniques but TERp fared better on the French corpus than on the English corpus. This can certainly be explained by the fact that the former results from more literal translations (from English to French, compared with from Chinese to English), which should consequently be easier to word-align.
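The pivoting score of section 4.2 can be sketched concretely. This is a toy illustration under stated assumptions: the two phrase-translation tables below are invented, whereas the paper estimates them from Europarl with GIZA++ and Moses.

```python
# Sketch of the pivot paraphrasing probability (Bannard and Callison-Burch,
# 2005): P_para(p1, p2) = sum over pivot phrases piv of Pt(piv|p1)*Pt(p2|piv).
# The tables below are hypothetical toy values, not real Europarl estimates.

def pivot_paraphrase_prob(p1, p2, fwd, rev):
    """fwd[p][piv] holds Pt(piv | p); rev[piv][p] holds Pt(p | piv).
    Unseen phrases or pivots simply contribute zero probability."""
    return sum(pt_pivot * rev.get(piv, {}).get(p2, 0.0)
               for piv, pt_pivot in fwd.get(p1, {}).items())

# Toy English phrases pivoting through French.
fwd_table = {"six months": {"six mois": 0.8, "semestre": 0.2}}
rev_table = {"six mois": {"six months": 0.7, "half a year": 0.3},
             "semestre": {"half a year": 0.9, "six months": 0.1}}

score = pivot_paraphrase_prob("six months", "half a year",
                              fwd_table, rev_table)
# In the paper's algorithm, candidates are then kept only if their score
# exceeds a minimal threshold such as 10^-4.
```

Marginalizing over all pivots rewards paraphrase pairs that are connected through many translations rather than through a single, possibly noisy, alignment.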
to English), which should be consequently eas- P IVOT is on par with G IZA as regards preci- ier to word-align. This is for example clearly sion, but obtains a comparatively much lower re- shown by the results of the statistical aligner call (differences of 19.32 and 19.80 on recall on G IZA, which obtains a 7.68 advantage on recall French and English respectively). This may first for French over English. be due in part to the paraphrasing score threshold The two linguistically-aware techniques, used for P IVOT, but most certainly to the use of FASTR and S YNT, have a very strong precision a bilingual corpus from the domain of parliamen- on the more parallel French corpus, but fail to tary debates to extract paraphrases when our test achieve an acceptable recall on their own. This sets are from the news domain: we may be ob- is not surprising : FASTR metarules are focussed serving differences inherent to the domain, and on term variant extraction, and S YNT requires possibly facing the issue of numerous “out-of- two syntactic trees to be highly comparable vocabulary” phrases, in particular for named en- to extract sub-sentential paraphrases. When tities which frequently occur in the news domain. these constrained conditions are met, these two Importantly, we can note that we obtain at best techniques appear to perform quite well in terms a recall of 45.98 on French (G IZA) and of 45.37 of precision. on English (TERp ). This may come as a disap- G IZA and TERp perform roughly in the same pointment but, given the broad set of techniques range on French, with acceptable precision and evaluated, this should rather underline the inher- recall, TERp performing overall better, with e.g. ent complexity of the task. Also, recall that the a 1.14 advantage on F-measure on French and metrics used do not consider identity paraphrases 4.19 on English. The fact that TERp performs (e.g. 
at the same time ↔ at the same time), as comparatively better on English than on French8 , well as the fact that gold standard alignment is with a 1.76 advantage on F-measure, is not con- a very difficult process as shown by interjudge tradictory: the implemented edit distance makes agreement values and our example from section 3. it possible to align reasonably distant words and This, again, confirms that the task that is ad- phrases independently from syntax, and to find dressed is indeed a difficult one, and provides fur- alignments for close remaining words, so the dif- ther justification for initially focussing on parallel ferences of performance between the two lan- monolingual corpora, albeit scarce, for conduct- guages are not necessarily expected to be com- ing fine-grained studies on sub-sentential para- parable with the results of a statistical alignment phrasing. technique. English being a poorly-inflected lan- Lastly, we can also note that precision is not guage, alignment clues between two sentential very high, with (at best, using TERp→P ) average paraphrases are expected to be more numerous values for all techniques of 40.97 and 40.46 on 8 French and English, respectively. Several facts Recall that all specific linguistic modules for English only from TERp had been disabled, so the better perfor- may provide explanations for this observation. mance on English cannot be explained by a difference in First, it should be noted that none of those tech- terms of resources used. niques, except S YNT, was originally developed 721 for the task of sub-sentential paraphrase acqui- Results on the test set for the two languages sition from monolingual parallel corpora. This are given in Table 3. 
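The precision, recall and F-measure figures in Table 2 follow the usual definitions over candidate and gold pair sets. A minimal sketch, in which the pair representation (unordered phrase pairs) and the exclusion of identity paraphrases are assumptions based on the description above:

```python
# Hypothetical sketch of the evaluation metrics used in Table 2.
# A paraphrase pair is represented as an unordered pair of phrase strings
# (an assumption); identity pairs are excluded, as stated in the text.

def score(candidates, gold):
    """Return (precision, recall, F1) in percent for a candidate pair set."""
    candidates = {frozenset(p) for p in candidates if p[0] != p[1]}
    gold = {frozenset(p) for p in gold if p[0] != p[1]}
    correct = len(candidates & gold)
    precision = 100.0 * correct / len(candidates) if candidates else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under these definitions, a technique proposing many pairs outside the gold standard loses precision without gaining recall, which is the behaviour discussed for the union combination below.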
Designing new techniques was not one of the objectives of our study, so we have reused existing techniques, originally developed with different aims (bilingual parallel corpora word alignment (GIZA), term variant recognition (FASTR), Machine Translation evaluation (TERp)). Also, techniques such as GIZA and TERp attempt to align as many words as possible in a sentence pair, whereas gold standard alignments sometimes contain gaps.[10] Finally, the metrics used will count small variations of gold standard paraphrases (e.g. a missing function word) as false: the acceptability of such candidates could be evaluated in a scenario where such "acceptable" variants would be taken into account, and could be considered in the context of some actual use of the acquired paraphrases in some application. Nonetheless, on average the techniques in our study produce more candidates that are not in the gold standard: this will be an important fact to keep in mind when tackling the task of combining their outputs. In particular, we will investigate the use of features indicating the combination of techniques that predicted a given paraphrase pair, aiming to capture consensus information.

5 Paraphrase validation

5.1 Technique complementarity

Before considering combining and validating the outputs of individual techniques, it is informative to look at some notion of "complementarity" between techniques, in terms of how many correct paraphrases a technique would add to a combined set. The following formula was used to account for the complementarity between the set of candidates from some technique i, ti, and the set from some technique j, tj:

C(ti, tj) = recall(ti ∪ tj) − max(recall(ti), recall(tj))

Results on the test set for the two languages are given in Table 3. A number of pairs of techniques have strong complementarity values, the strongest one being for GIZA and TERp for both languages. According to these figures, PIVOT identifies paraphrases which are slightly more similar to those of TERp than to those of GIZA. Interestingly, FASTR and SYNT exhibit a strong complementarity: in French, for instance, they only have a very small proportion of paraphrases in common. Considering the set of all other techniques, GIZA provides the most new paraphrases on French and TERp on English.

Table 3: Values of complementarity on the test set for both languages, where the following formula was used for the set of technique outputs T = {t1, t2, ..., tn}: C(ti, tj) = recall(ti ∪ tj) − max(recall(ti), recall(tj)). Complementarity values are computed between all pairs of individual techniques, and between each individual technique and the set of all other techniques. Values in bold indicate the highest value for the technique of each row.

 English     GIZA   PIVOT  FASTR  SYNT   TERp→R  all others
  GIZA        -      4.65   2.83   0.59  10.31    8.31
  PIVOT      4.65    -      2.30   1.88   3.12    3.72
  FASTR      2.83    2.30   -      2.42   1.71    0.53
  SYNT       0.59    1.88   2.42   -      0.59    0.00
  TERp→R    10.31    3.12   1.71   0.59   -      12.20
 French
  GIZA        -      9.79   3.64   2.20  10.73    8.91
  PIVOT      9.79    -      2.26   5.22   7.84    3.39
  FASTR      3.64    2.26   -      7.28   3.01    0.19
  SYNT       2.20    5.22   7.28   -      1.76    0.44
  TERp→R    10.73    7.84   3.01   1.76   -       5.65
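The complementarity measure has a direct set-based reading. A minimal sketch, assuming each technique's output and the gold standard are given as sets of hashable candidate pairs:

```python
# Sketch of the complementarity measure from section 5.1:
# C(ti, tj) = recall(ti ∪ tj) - max(recall(ti), recall(tj)).
# Representing technique outputs as Python sets is an assumption.

def recall(candidates, gold):
    """Recall of a candidate pair set against the gold standard, in percent."""
    return 100.0 * len(candidates & gold) / len(gold)

def complementarity(t_i, t_j, gold):
    """How much recall technique j adds on top of technique i (and vice versa)."""
    return recall(t_i | t_j, gold) - max(recall(t_i, gold), recall(t_j, gold))
```

The measure is symmetric, is zero when one output subsumes the other's correct pairs, and is bounded above by the smaller of the two recalls (reached when the two techniques find disjoint parts of the gold standard).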
5.2 Naive combination by union

We first implemented a naive combination obtained by taking the union of all techniques. Results are given in the first column of the right part of Table 2. The first result is quite encouraging: in both languages, more than 6 paraphrases from the gold standard out of 10 are found by at least one of the techniques, which, given our previous discussion, constitutes a good result and provides a clear justification for combining different techniques to improve performance on this task. Precision is mechanically lowered, to roughly 1 correct paraphrase over 5 candidates for both languages. F-measure values are much lower than those of TERp and GIZA, showing that the union of all techniques is only interesting for recall-oriented paraphrase acquisition. In the next section, we will show how the results of the union can be validated using machine learning to improve these figures.

[9] Recall, however, that our best performing technique on F-measure, TERp, was optimized to our task using a held out development set.
[10] It is arguable whether such cases should happen in sentence pairs obtained by translating the same original sentence into the same language, but this clearly depends on the interpretation of the expected level of annotation by the annotators.

5.3 Paraphrase validation via automatic classification

A natural improvement to the naive combination of paraphrase candidates from all techniques consists in validating candidate paraphrases using several models that may be good indicators of their paraphrasing status. We can therefore cast our problem as one of biclass classification (i.e. "paraphrase" vs. "not paraphrase"). We have used a maximum entropy classifier[11] with the following features, aiming at capturing information on the paraphrase status of a candidate pair:

Morphosyntactic equivalence (POS)
It may be the case that some sequences of parts-of-speech can be rewritten as different sequences, e.g. as a result of verb nominalization. We therefore use features to indicate the sequences of parts-of-speech for a pair of candidate paraphrases. We used the preterminal symbols of the syntactic trees of the parser used for SYNT.

Character-based distance (CAR)
Morphological variants often have close word forms, and more generally close word forms in sentential paraphrase pairs may indicate related words. We used features for discretized values of the edit distance between the two phrases of a candidate paraphrase pair, as measured by the Levenshtein distance.

Stem similarity (STEM)
Inflectional morphology, which is quite productive in languages such as French, can increase vocabulary size significantly, while in sentential paraphrases common stems may indicate related words. We used a binary feature indicating whether the stemmed phrases of a candidate paraphrase pair match.[12]

Token set identity (BOW)
Syntactic rearrangements may involve the same sets of words in various orders. We used discretized features indicating the proportion of common tokens in the sets of tokens of the two phrases of a candidate paraphrase pair.

Context similarity (CTXT)
It can be derived from the distributionality hypothesis that the more two phrases are seen in similar contexts, the more likely they are to be paraphrases. We used discretized features indicating how similar the contexts of occurrence of the two paraphrases are. For this, we used the full set of bilingual English-French data available for the translation task of the Workshop on Statistical Machine Translation[13], totalling roughly 30 million parallel sentences: this again ensures that the same resources are used for the experiments in the two languages. We collect all occurrences of the phrases in a pair, and build a vector of content words cooccurring within a distance of 10 words from each phrase. We finally compute the cosine between the vectors of the two phrases of a candidate paraphrase pair.

Relative position in a sentence (REL)
Depending on the language in which parallel sentences are analyzed, it may be the case that sub-sentential paraphrases occur at close locations in their respective sentences. We used a discretized feature indicating the relative position of the two phrases in their original sentences.

Identity check (COOC)
We used a binary feature indicating whether one of the two phrases from a candidate pair, or both, occurred at some other location in the other sentence.

Phrase length ratio (LEN)
We used a discretized feature indicating the phrase length ratio.

Source techniques (SRC)
Finally, as our setting validates paraphrase candidates produced by a set of techniques, we used features indicating which combination of techniques predicted a paraphrase candidate. This can allow learning that paraphrases in the intersection of the predicted sets of some techniques may produce good results.

We used a held out training set consisting of 150 sentential paraphrase pairs from the same corpora as our previous development and test sets for both languages. Positive examples were taken from the candidate paraphrase pairs from any of the 5 techniques in our study which belong to the gold standard, and we used a corresponding number of negative examples (randomly selected) from candidate pairs not in the gold standard. The right part of Table 2 provides the results for our validation experiments of the union set for all previous techniques.

[11] We used the implementation available at: http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
[12] We use the implementations of the Snowball stemmer for English and French available from: http://snowball.tartarus.org
[13] http://www.statmt.org/wmt11/translation-task.html
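Several of the surface features above (CAR, BOW, LEN) reduce to simple string computations. An illustrative sketch, not the authors' implementation; function names and the absence of discretization are assumptions:

```python
# Illustrative computation of three surface features for a candidate pair:
# CAR (Levenshtein edit distance), BOW (token-overlap proportion) and
# LEN (phrase length ratio). Raw values are shown; the paper discretizes them.

def levenshtein(a, b):
    """Character edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def token_overlap(p1, p2):
    """Proportion of common tokens between the token sets of two phrases."""
    s1, s2 = set(p1.split()), set(p2.split())
    return len(s1 & s2) / len(s1 | s2)

def length_ratio(p1, p2):
    """Ratio of the shorter phrase length to the longer one, in tokens."""
    t1, t2 = p1.split(), p2.split()
    return min(len(t1), len(t2)) / max(len(t1), len(t2))
```

Each raw value would then be bucketed into discretized indicator features before being fed to the maximum entropy classifier.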
We obtain our best results for this study using the output of our validation classifier over the set of all candidate paraphrase pairs. On French, it yields an improvement in F-measure (43.16) of +6.46 over the best individual technique (TERp) and of +15.63 over the naive union of all individual techniques. On English, the improvement in F-measure (45.37) is, under the same conditions, respectively +6.91 (over TERp) and +13.66. We unfortunately observe an important decrease in recall relative to the naive union, of respectively -17.54 and -19.68 for French and English. Increasing our amount of training data to better represent the full range of paraphrase types may certainly overcome this in part. This would indeed be sensible, as better covering the variety of paraphrase types as a one-time effort would help all subsequent validations. Figure 2 shows how performance varies on French with the number of training examples for various feature configurations. However, some paraphrase types will require the integration of more complex knowledge, as is the case, for instance, for paraphrase pairs involving an anaphora and its antecedent (e.g. China ↔ it).

[Figure 2: Learning curves obtained on French by removing features individually. F-measure (range 31-43) is plotted against the percentage of examples from the training corpus (10-100), for the full feature set (All) and with each of POS, SRC, CTXT, STEM, LEN and COOC removed in turn.]

While these results, which are very comparable for the two languages studied, are already satisfying given the complexity of our task, further inspection of false positives and negatives may help us to develop additional models that will improve classification performance.

6 Conclusions and future work

In this article, we have addressed the task of combining the results of sub-sentential paraphrase acquisition from parallel monolingual corpora using a large variety of techniques. We have provided justifications for using highly parallel corpora consisting of multiply translated sentences from a single language. All our experiments were conducted on both English and French using comparable resources, so although the results cannot be directly compared, they give some acceptable comparison points. The best recall of any individual technique is around 45 for both languages, with F-measure in the range 36-38, indicating that the task under study is a very challenging one. Our validation strategy, based on bi-class classification using a broad set of features applicable to all candidate paraphrase pairs, allowed us to obtain an 18% relative improvement in F-measure over the best individual technique for both languages.

Our future work includes performing a deeper error analysis of our current results, to better comprehend what characteristics of paraphrase still defy current validation. Also, we want to investigate adding new individual techniques to provide so far unseen candidates. Another possible approach would be to submit all pairs of sub-sentential phrases from a sentence pair to our validation process, which would obviously require some optimization and devising sensible heuristics to limit time complexity. We also intend to collect larger corpora for all other corpus types appearing in Table 1 and to conduct our acquisition and validation tasks anew.

Acknowledgements

The authors would like to thank the reviewers for their comments and suggestions, as well as Guillaume Wisniewski for helpful discussions. This work was partly funded by ANR project Edylex (ANR-09-CORD-008).

References

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135–187.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. In Proceedings of ACL, Ann Arbor, USA.
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL-HLT, Edmonton, Canada.

Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of ACL, Toulouse, France.

Rahul Bhagat and Deepak Ravichandran. 2008. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-HLT, Columbus, USA.

Rahul Bhagat. 2009. Learning Paraphrases from Text. Ph.D. thesis, University of Southern California.

Houda Bouamor, Aurélien Max, and Anne Vilnat. 2010. Comparison of Paraphrase Acquisition Techniques on Sentential Paraphrases. In Proceedings of IceTAL, Reykjavik, Iceland.

Chris Callison-Burch, Trevor Cohn, and Mirella Lapata. 2008. ParaMetric: An automatic evaluation metric for paraphrasing. In Proceedings of COLING, Manchester, UK.

Chris Callison-Burch. 2007. Paraphrasing and Translation. Ph.D. thesis, University of Edinburgh.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proceedings of EMNLP, Hawaii, USA.

Marie Candito, Benoît Crabbé, and Pascal Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In Proceedings of LREC, Valletta, Malta.

David Chen and William Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL, Portland, USA.

Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4).

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of COLING, Geneva, Switzerland.

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. In Proceedings of ACL-HLT, demo session, Columbus, USA.

Christian Jacquemin. 1999. Syntagmatic and paradigmatic representations of term variation. In Proceedings of ACL, College Park, USA.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL, Sapporo, Japan.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, demo session, Prague, Czech Republic.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7(4):343–360.

Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2010. PEM: A paraphrase evaluation metric exploiting parallel texts. In Proceedings of EMNLP, Cambridge, USA.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3).

Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. thesis, University of Maryland, College Park.

Donald Metzler, Eduard Hovy, and Chunliang Zhang. 2011. An empirical evaluation of data-driven paraphrase generation techniques. In Proceedings of ACL-HLT, Portland, USA.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4).

Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of NAACL-HLT, Edmonton, Canada.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2010. TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. Machine Translation, 23(2-3).

Jörg Tiedemann. 2007. Building a Multilingual Parallel Subtitle Corpus. In Proceedings of the Conference on Computational Linguistics in the Netherlands, Leuven, Belgium.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. In Proceedings of ACL-HLT, Columbus, USA.


Determining the placement of German verbs in English–to–German SMT

Anita Gojun, Alexander Fraser
Institute for Natural Language Processing, University of Stuttgart, Germany
{gojunaa, fraser}@ims.uni-stuttgart.de

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 726–735, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

When translating English to German, existing reordering models often cannot model the long-range reorderings needed to generate German translations with verbs in the correct position. We reorder English as a preprocessing step for English-to-German SMT. We use a sequence of hand-crafted reordering rules applied to English parse trees. The reordering rules place English verbal elements in the positions within the clause that they will have in the German translation. This is a difficult problem, as German verbal elements can appear in different positions within a clause (in contrast with English verbal elements, whose positions do not vary as much). We obtain a significant improvement in translation performance.

1 Introduction

Phrase-based SMT (PSMT) systems translate word sequences (phrases) from a source language into a target language, performing reordering of target phrases in order to generate fluent target language output. The reordering models, such as, for example, the models implemented in Moses (Koehn et al., 2007), are often limited to a certain reordering range, since reordering beyond this distance cannot be performed accurately. This results in problems of fluency for language pairs with large differences in constituent order, such as English and German. When translating from English to German, verbs in the German output are often incorrectly left near their position in English, creating problems of fluency. Verbs are also often omitted, since the distortion model cannot move verbs to positions which are licensed by the German language model, making the translations difficult to understand.

A common approach for handling the long-range reordering problem within PSMT is performing syntax-based or part-of-speech-based (POS-based) reordering of the input as a preprocessing step before translation (e.g., Collins et al. (2005), Gupta et al. (2007), Habash (2007), Xu et al. (2009), Niehues and Kolss (2009), Katz-Brown et al. (2011), Genzel (2010)).

We reorder English to improve the translation to German. The verb reordering process is implemented using deterministic reordering rules on English parse trees. The sequence of reorderings is derived from the clause type and the composition of a given verbal complex (a (possibly discontiguous) sequence of verbal elements in a single clause). Only one rule can be applied in a given context, and for each word to be reordered there is a unique reordered position. We train a standard PSMT system on the reordered English training and tuning data and use it to translate the reordered English test set into German.

This paper is structured as follows: in section 2, we outline related work. In section 3, English and German verb positioning is described. The reordering rules are given in section 4. In section 5, we show the relevance of the reordering, present the experiments and provide an extensive error analysis. We discuss some problems observed in section 7 and conclude in section 8.

2 Related work

There have been a number of attempts to handle the long-range reordering problem within PSMT. Many of them are based on the reordering of a source language sentence as a preprocessing step before translation.
Our approach is related to the work of Collins et al. (2005). They reordered German sentences as a preprocessing step for German-to-English SMT. Hand-crafted reordering rules are applied to German parse trees in order to move the German verbs into the positions corresponding to the positions of the English verbs. Subsequently, the reordered German sentences are translated into English, leading to better translation performance when compared with the translation of the original German sentences.

We apply this method to the opposite translation direction, thus having English as a source language and German as a target language. However, we cannot simply invert the reordering rules which are applied on German as a source language in order to reorder the English input. While the reordering of German implies movement of the German verbs into a single position, when reordering English we need to split the English verbal complexes and, where required, move their parts into different positions. Therefore, we need to identify exactly which parts of a verbal complex must be moved and their possible positions in a German sentence.

Reordering rules can also be extracted automatically. For example, Niehues and Kolss (2009) automatically extracted discontiguous reordering rules (allowing gaps between POS tags which can include an arbitrary number of words) from a word-aligned parallel corpus with a POS-tagged source side. Since many different rules can be applied to a given sentence, a number of reordered sentence alternatives are created, which are encoded as a word lattice (Dyer et al., 2008). They dealt with the translation directions German-to-English and English-to-German, but a translation improvement was obtained only for the German-to-English direction. This may be due to missing information about clause boundaries, since English verbs often have to be moved to the clause end. Our reordering has access to this kind of knowledge, since we are working with a full syntactic parser of English.

Genzel (2010) proposed a language-independent method for learning reordering rules where the rules are extracted from parsed source language sentences. For each node, all possible reorderings (permutations) of a limited number of the child nodes are considered. The candidate reordering rules are applied on the dev set, which is then translated and evaluated. Only those rule sequences are extracted which maximize the translation performance of the reordered dev set.

For the extraction of reordering rules, Genzel (2010) uses shallow constituent parse trees which are obtained from dependency parse trees. The trees are annotated using both Penn Treebank POS tags and Stanford dependency types. However, the constraints on possible reorderings are too restrictive to model all word movements required for English-to-German translation. In particular, the reordering rules involve only the permutation of direct child nodes and do not allow changing of child-parent relationships (deleting a child or attaching a node to a new father node). In our implementation, a verb can be moved to any position in a parse tree (according to the reordering rules): the reordering can be a simple permutation of child nodes, or an attachment of these nodes to a new father node (cf. the movement of bought and read in figure 1).[1] Thus, in contrast to Genzel (2010), our approach does not have any constraints with respect to the position of nodes marking a verb within the tree. Only the syntactic structure of the sentence restricts the distance of the linguistically motivated verb movements.

[1] The verb movements shown in figure 1 will be explained in detail in section 4.

3 Verb positions in English and German

3.1 Syntax of German sentences

Since in this work we concentrate on verbs, we use the notion of a verbal complex for a sequence consisting of verbs, verbal particles and negation. The verb positions in German sentences depend on clause type and tense, as shown in Table 1. Verbs can be placed in 1st, 2nd or clause-final position. Additionally, if a composed tense is given, the parts of a verbal complex can be interrupted by the middle field (MF), which contains arbitrary sentence constituents, e.g., subjects and objects (noun phrases), adjuncts (prepositional phrases), adverbs, etc. We assume that the German sentences are SVO (analogously to English); topicalization is beyond the scope of our work.

Table 1: Position of the German subjects and verbs in declarative clauses (decl), interrogative clauses and clauses with a peripheral clause (int/perif), and subordinate/infinitival (sub/inf) clauses. mainV = main verb, finV = finite verb, VC = verbal complex, any = arbitrary words, relCon = relative pronoun or conjunction. We consider extraposed constituents in perif, as well as optional interrogatives in int, to be in position 0.

              1st      2nd      MF    clause-final
  decl        subject  finV     any   ∅
              subject  finV     any   mainV
  int/perif   finV     subject  any   ∅
              finV     subject  any   mainV
  sub/inf     relCon   subject  any   finV
              relCon   subject  any   VC

In this work, we consider two possible positions of the negation in German: (1) directly in front of the main verb, and (2) directly after the finite verb. The two negation positions are illustrated in the following examples:

(1) Ich behaupte, dass ich es nicht gesagt habe.
    I claim, that I it not said have.
(2) Ich denke nicht, dass er das gesagt hat.
    I think not, that he that said has.

It should, however, be noted that in German the negative particle nicht can have several positions in a sentence depending on the context (verb arguments, emphasis). Thus, more analysis is ideally needed (e.g., discourse, etc.).

3.2 Comparison of verb positions

English and German verbal complexes differ both in their construction and in their position. The German verbal complex can be discontiguous, i.e., its parts can be placed in different positions, which implies that a (large) number of other words can be placed between the verbs (situated in the MF). In English, the verbal complex can only be interrupted by adverbials and by subjects (in interrogative clauses). Furthermore, in German, the finite verb can sometimes be the last element of the verbal complex, while in English, the finite verb is always the first verb in the verbal complex.

In terms of positions, the verbs in English and German can differ significantly. As previously noted, the German verbal complex can be discontiguous, simultaneously occupying the 1st/2nd and clause-final positions (cf. rows decl and int/perif in Table 1), which is not the case in English. While in English the verbal complex is placed in the 2nd position in declarative clauses, or in the 1st position in interrogative clauses, in German the entire verbal complex can additionally be placed at the clause end in subordinate or infinitival clauses (cf. row sub/inf in Table 1).

Because of these differences, for nearly all types of English clauses, reordering is needed in order to place the English verbs in the positions which correspond to the correct verb positions in German. Only English declarative clauses with simple present and simple past tense have the same verb position as their German counterparts. We give statistics on clause types and their relevance for the verb reordering in section 5.1.

4 Reordering of the English input

The reordering is carried out on English parse trees. We first enrich the parse trees with clause type labels, as described below. Then, for each node marking a clause (S nodes), the corresponding sequence of reordering rules is carried out. The appropriate reordering is derived from the clause type label and the composition of the given verbal complex. The reordering rules are deterministic. Only one rule can be applied in a given context, and for each verb to be reordered there is a unique reordered position.

The reordering procedure is the same for the training and the testing data. It is carried out on English parse trees, resulting in modified parse trees which are read out in order to generate the reordered English sentences. These are input for training a PSMT system or input to the decoder. The processing steps are shown in figure 1.

[Figure 1: Processing steps. Clause type labeling annotates the given original tree with clause type labels (in the figure, S-EXTR and S-SUB). Subsequently, the reordering is performed (cf. the movement of the verbs read and bought). The reordered sentence is finally read out and given to the decoder. The example sentence is "Yesterday I read a book which I bought last week".]

For the development of the reordering rules, we used a small sample of the training data. In particular, by observing English parse trees extracted randomly from the training data, we developed a set of rules which transform the original trees in such a way that the English verbs are moved to the positions which correspond to the placement of verbs in German.

4.1 Labeling clauses with their type

As shown in section 3.1, the verb positions in German depend on the clause type. Since we use English parse trees produced by the generative parser of Charniak and Johnson (2005), which do not have any function labels, we implemented a simple rule-based clause type labeling script which enriches every clause starting node with the corresponding clause type label. The label depends on the context (father and child nodes) of a given clause node. If, for example, the first child node of a given S node is WH* (wh-word) or IN (subordinating conjunction), then the clause type label is SUB (subordinate clause, cf. figure 1).

We defined five clause type labels which indicate main clauses (MAIN), main clauses with a peripheral clause in the prefield (EXTR), subordinate (SUB), infinitival (XCOMP) and interrogative clauses (INT).

4.2 Clause boundary identification

The German verbs are often placed at the clause end (cf. rows decl, int/perif and sub/inf in Table 1), making it necessary to move their English counterparts into the corresponding positions within an English tree. For this reason, we identify the clause ends (the right boundaries). The search for the clause end is implemented as a breadth-first search for the next S node or the sentence end. The starting node is the node which marks the verbal phrase in which the verbs are enclosed. When the next node marking a clause is identified, the search stops and returns the position in front of the identified clause-marking node.

When, for example, searching for the clause boundary of S-EXTR in figure 1, we search recursively for the first clause-marking node within VP1, which is S-SUB. The position in front of S-SUB is marked as the clause-final position of S-EXTR.

4.3 Basic verb reordering rules

The reordering procedure takes into account the following word categories: verbs, verb particles, the infinitival particle to and the negative particle not, as well as its abbreviated form 't. The reordering rules are based on POS labels in the parse tree.

The reordering procedure is a sequence of applications of the reordering rules. For each element of an English verbal complex, its properties are derived (tense, main verb/auxiliary, finiteness). The reordering is then carried out according to the clause type and the verbal properties of the verb to be processed.

In the following, the reordering rules are presented. Examples of reordered sentences are given in Table 2, and are discussed further here.

Main clause (S-MAIN)
(i) simple tense: no reordering required (cf. appears_finV in input 1);
(ii) composed tense: the main verb is moved to the clause end. If a negative particle exists, it is moved in front of the reordered main verb, while an optional verb particle is moved after the reordered main verb (cf. [has]_finV [been developing]_mainV in input 2).

Main clause with peripheral clause (S-EXTR)
(i) simple tense: the finite verb is moved, together with an optional particle, to the 1st position (i.e. in front of the subject);
(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end. The finite verb is moved to the 1st position, i.e. in front of the subject (cf. have_finV [gone up]_mainV in input 3).
dinate (SUB), infinitival (XCOMP) and interroga- tive clauses (INT). Main clause with peripheral clause (S-EXTR) 4.2 Clause boundary identification (i) simple tense: the finite verb is moved to- The German verbs are often placed at the clause gether with an optional particle to the 1st po- end (cf. rows decl, int/perif and sub/inf in ta- sition (i.e. in front of the subject); ble 1), making it necessary to move their En- (ii) composed tense: the main verb, as well glish counterparts into the corresponding posi- as optional negative and verb particles are tions within an English tree. For this reason, we moved to the clause end. The finite verb is identify the clause ends (the right boundaries). moved in the 1st position, i.e. in front of the The search for the clause end is implemented as subject (cf. havef inV [gone up]mainV in in- a breadth-first search for the next S node or sen- put 3). 729 Subordinate clause (S-SUB) 4.4.3 Flexible position of German verbs We stated that the English verbs are never moved (i) simple tense: the finite verb is moved to the outside the subclause they were originally in. In clause end (cf. boastsfinV in input 3); German there are, however, some constructions (ii) composed tense: the main verb, as well (infinitival and relative clauses), in which the as optional negative and verb particles are main verb can be placed after a subsequent clause. moved to the clause end, the finite verb is Consider two German translations of the English placed after the reordered main verb (cf. sentence He has promised to come: havefinV [been executed]mainV in input 5). (3a) Er hat [zu kommen]S versprochen. he has to come promised. Infinitival clause (S-XCOMP) The entire English verbal complex is moved from (3b) Er hat versprochen, [zu kommen]S . the 2nd position to the clause-final position (cf. he has promised, to come. [to discuss]VC in input 4). 
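The clause-type labeling heuristic of section 4.1 and the breadth-first boundary search of section 4.2 can be sketched roughly as follows. This is a toy illustration, not the authors' (unpublished) script: the `Node` class, the single WH*/IN test, and the example tree are all assumptions made for the sake of the sketch.

```python
from collections import deque

class Node:
    """Hypothetical parse-tree node; stands in for a Charniak-Johnson tree."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def clause_type(s_node):
    """Toy version of the section 4.1 heuristic: derive the label of an
    S node from its context, here only from its first child."""
    first = s_node.children[0].label if s_node.children else ""
    if first.startswith("WH") or first == "IN":
        return "SUB"  # wh-word or subordinating conjunction
    return "MAIN"

def find_clause_boundary(vp_node):
    """Section 4.2: breadth-first search, starting from the VP that encloses
    the verbs, for the next clause-marking (S*) node; the clause-final
    position lies in front of the returned node (None means the boundary
    is the sentence end)."""
    queue = deque(vp_node.children)
    while queue:
        node = queue.popleft()
        if node.label.startswith("S"):
            return node
        queue.extend(node.children)
    return None

# "read a book which I bought": the relative clause bounds the main clause.
rel = Node("S", [Node("WHNP"), Node("S", [Node("NP"), Node("VP")])])
vp = Node("VP", [Node("VBD"), Node("NP", [Node("DT"), Node("NN"), rel])])
print(find_clause_boundary(vp) is rel)   # True
print(clause_type(rel))                  # SUB
```

In this sketch, a main verb that must move to the clause end would be placed immediately in front of the node returned by `find_clause_boundary`, mirroring the S-EXTR example in figure 1.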
Interrogative clause (S-INT)
(i) simple tense: no reordering required;
(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end (cf. [did]_finV know_mainV in input 5).

4.4 Reordering rules for other phenomena

4.4.1 Multiple auxiliaries in English

Some English tenses require a sequence of auxiliaries, not all of which have a German counterpart. In the reordering process, non-finite auxiliaries are considered to be a part of the main verb complex and are moved together with the main verb (cf. movement of has_finV [been developing]_mainV in input 2).

4.4.2 Simple vs. composed tenses

In English, there are some tenses composed of an auxiliary and a main verb which correspond to a German tense composed of only one verb, e.g., am reading <=> lese and does John read? <=> liest John? Splitting such English verbal complexes and only moving the main verbs would lead to constructions which do not exist in German. Therefore, in the reordering process, the English verbal complex in present continuous, as well as interrogative phrases composed of do and a main verb, are not split. They are handled as one main verb complex and reordered as a single unit using the rules for main verbs (e.g. [because I am reading a book]_SUB => because I a book am reading <=> weil ich ein Buch lese).[2]

[2] We only consider present continuous and verbs in combination with do for this kind of reordering. There are also other tenses which could (or should) be treated in the same way (cf. has been developing in input 2, table 2). We do not do this in order to keep the reordering rules simple and general.

4.4.3 Flexible position of German verbs

We stated that the English verbs are never moved outside the subclause they were originally in. In German there are, however, some constructions (infinitival and relative clauses) in which the main verb can be placed after a subsequent clause. Consider two German translations of the English sentence He has promised to come:

(3a) Er hat [zu kommen]_S versprochen.
     he has to come promised.

(3b) Er hat versprochen, [zu kommen]_S.
     he has promised, to come.

In (3a), the German main verb versprochen is placed after the infinitival clause zu kommen (to come), while in (3b), the same verb is placed in front of it. Both alternatives are grammatically correct.

Whether a German verb should come after an embedded clause as in example (3a) or precede it (cf. example (3b)) depends not only on syntactic but also on stylistic factors. Regarding the verb reordering problem, we would therefore have to examine the given sentence in order to derive the correct (or more probable) new verb position, which is beyond the scope of this work. Therefore, we allow only for reorderings which do not cross clause boundaries, as shown in example (3b).

Input 1:   The programme appears to be successful for published data shows that MRSA is on the decline in the UK.
Reordered: The programme appears successful to be for published data shows that MRSA on the decline in the UK is.
Input 2:   The real estate market in Bulgaria has been developing at an unbelievable rate - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.
Reordered: The real estate market in Bulgaria has at an unbelievable rate been developing - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.
Input 3:   While Bulgaria boasts the European Union's lowest real estate prices, they have still gone up by 21 percent in the past five years.
Reordered: While Bulgaria the European Union's lowest real estate prices boasts, have they still by 21 percent in the past five years gone up.
Input 4:   Professionals and politicians from 192 countries are slated to discuss the Bali Roadmap that focuses on efforts to cut greenhouse gas emissions after 2012, when the Kyoto Protocol expires.
Reordered: Professionals and politicians from 192 countries are slated the Bali Roadmap to discuss that on efforts focuses greenhouse gas emissions after 2012 to cut, when the Kyoto Protocol expires.
Input 5:   Did you know that in that same country, since 1976, 34 mentally-retarded offenders have been executed?
Reordered: Did you know that in that same country, since 1976, 34 mentally-retarded offenders been executed have?

Table 2: Examples of reordered English sentences

5 Experiments

In order to evaluate the translation of the reordered English sentences, we built two SMT systems with Moses (Koehn et al., 2007). As training data, we used the Europarl corpus, which consists of 1,204,062 English/German sentence pairs. The baseline system was trained on the original English training data, while the contrastive system was trained on the reordered English training data. In both systems, the same original German sentences were used. We used the WMT 2009 dev and test sets to tune and test the systems. The baseline system was tuned and tested on the original data, while for the contrastive system, we used the reordered English side of the dev and test sets. The German 5-gram language model used in both systems was trained on the WMT 2009 German language modeling data, a large German newspaper corpus consisting of 10,193,376 sentences.

5.1 Applied rules

In order to see how many English clauses are relevant for reordering, we derived statistics about clause types and the number of reordering rules applied on the training data. In table 3, the numbers of English clauses for all considered clause type/tense combinations are shown. The bold numbers indicate combinations which are relevant to the reordering. Overall, 62% of all EN clauses from our training data (2,706,117 clauses) are relevant for the verb reordering. Note that there is an additional category rest which indicates incorrect clause type/tense combinations and might thus not be correctly reordered. These are mostly due to parsing and/or tagging errors.

tense      MAIN     EXTR     SUB      INT    XCOMP
simple     675,095  170,806  449,631  8,739  -
composed   343,178  116,729  277,733  8,817  314,573
rest       98,464   5,158    90,139   306    146,746

Table 3: Counts of English clause types and used tenses. Bold numbers indicate clause type/tense combinations where reordering is required.

        Baseline  Reordered
BLEU    13.02     13.63

Table 4: Scores of baseline and contrastive systems

5.2 Evaluation

The performance of the systems was measured by BLEU (Papineni et al., 2002). The evaluation results are shown in table 4. The contrastive system outperforms the baseline. Its BLEU score is 13.63, which is a gain of 0.61 BLEU points over the baseline. This is a statistically significant improvement at p<0.05 (computed with Gimpel's implementation of the pairwise bootstrap resampling method (Koehn, 2004)).

Manual examination of the translations produced by both systems confirms the result of the automatic evaluation. Many translations produced by the contrastive system now have verbs in the correct positions. If we compare the generated translations for input sentence 1 in table 5, we see that the contrastive system generates a translation in which all verbs are placed correctly. In the baseline translation, only the translation of the finite verb was, namely war, is placed correctly, while the translation of the main verb (diagnosed -> festgestellt) should be placed at the clause end, as in the translation produced by our system.

Often, the English verbal complex is translated only partially by the baseline system. For example, the English verbal complexes in sentence 2 in table 5, will climb and will drop, are only partially translated (will climb -> wird (will), will drop -> fallen (fall)). Moreover, the generated verbs are placed incorrectly. In our translation, all verbs are translated and placed correctly.

Another problem which was often observed in the baseline is the omission of verbs in the German translations. The baseline translation of example sentence 3 in table 5 illustrates such a case. There is no translation of the English infinitival verbal complex to have. In the translation generated by the contrastive system, the verbal complex does get translated (zu haben) and is also placed correctly. We think this is because the reordering model is not able to identify the position for the verb which is licensed by the language model, causing a hypothesis with no verb to be scored higher than the hypotheses with incorrectly placed verbs.

6 Error analysis

6.1 Erroneous reordering in our system

In some cases, the reordering of the English parse trees fails. Most erroneous reorderings are due to a number of different parsing and tagging errors. Coordinated verbs are also problematic due to their complexity. Their composition can vary, and it would thus require a large number of different reordering rules to fully capture this. In our reordering script, the movement of complex structures such as verbal phrases consisting of a sequence of child nodes is not implemented (only nodes with one child, namely the verb, verbal particle or negative particle, are moved).

6.2 Splitting of the English verbal complex

Since in many cases the German verbal complex is discontiguous, we need to split the English verbal complex and move its parts into different positions. This ensures the correct placement of German verbs. However, it does not ensure that the German verb forms are correct, because of highly ambiguous English verbs. In some cases, we can lose contextual information which would be useful for disambiguating ambiguous verbs and generating the appropriate German verb forms.

6.2.1 Subject-verb agreement

Let us consider the English clause in (4a) and its reordered version in (4b):

(4a) ... because they have said it to me yesterday.
(4b) ... because they it to me yesterday said have.

In (4b), the English verbs said have are separated from the subject they. The English said have can be translated in several ways into German. Without any information about the subject (the distance between the verbs and the subject can be very large), it is relatively likely that an erroneous German translation is generated. On the other hand, in the baseline SMT system, the subject they is likely to be a part of a translation phrase with the correct German equivalent (they have said -> sie haben gesagt). They is then used as a disambiguating context which is missing in the reordered sentence (but the order is wrong).

6.2.2 Verb dependency

A similar problem occurs within a verbal complex:

(5a) They have said it to me yesterday.
(5b) They have it to me yesterday said.

In sentence (5a), the English consecutive verbs have said are a sequence consisting of a finite auxiliary have and the past participle said. They should be translated into the corresponding German verbal complex haben gesagt. But if the verbs are split, we will probably get translations which are completely independent. Even if the German auxiliary is correctly inflected, it is hard to predict how said is going to be translated. If the distance between the auxiliary haben and the hypothesized translation of said is large, the language model will not be able to help select the correct translation. Here, the baseline SMT system again has an advantage, as the verbs are consecutive. It is likely they will be found in the training data and extracted with the correct German phrase (but the German order is again incorrect).

6.3 Collocations

Collocations (verb-object pairs) are another case which can lead to a problem:

(6a) I think that the discussion would take place later this evening.
(6b) I think that the discussion place later this evening take would.

The English collocation in (6a), consisting of the verb take and the object place, corresponds to the German verb stattfinden. Without this specific object, the verb take is likely to be translated literally. In the reordered sentence, the verbal complex take would is indeed separated from the object place, which would probably lead to the literal translation of both parts of the mentioned collocation. So, as already described in the preceding paragraphs, an important source of contextual information which could ensure the correct translation of the given phrase is lost.

This problem is not specific to English-to-German. For instance, the same problem occurs when translating German into English. If, for example, the object Kauf (buying) of the collocation nehmen + in Kauf (accept) is separated from the verb nehmen (take), the two parts are very likely to be translated literally (rather than as the idiom meaning "to accept"), thus leading to an erroneous English translation.

6.4 Error statistics

We manually checked 100 randomly chosen English sentences to see how often the problems described in the previous sections occur. From a total of 276 clauses, 29 were not reordered correctly. 20 errors were caused by incorrect parsing and/or POS tags, while the remaining 9 are mostly due to different kinds of coordination.

Input 1:              An MRSA - an antibiotic resistant staphylococcus - infection was recently diagnosed in the traumatology ward of János hospital.
Reordered input:      An MRSA - an antibiotic resistant staphylococcus - infection was recently in the traumatology ward of János hospital diagnosed.
Baseline translation: Ein MRSA - ein Antibiotikum resistenter Staphylococcus - war vor kurzem in der festgestellt traumatology Ward von János Krankenhaus.
                      (A MRSA - an antibiotic resistant Staphylococcus - was before recent in the diagnosed traumatology ward of János hospital.)
Reordered translation: Ein MRSA - ein Antibiotikum resistenter Staphylococcus - Infektion wurde vor kurzem in den traumatology Station der János Krankenhaus diagnostiziert.
                      (A MRSA - an antibiotic resistant Staphylococcus - infection was before recent in the traumatology ward of János hospital diagnosed.)

Input 2:              The ECB predicts that 2008 inflation will climb to 2.5 percent from the earlier 2.1, but will drop back to 1.9 percent in 2009.
Reordered input:      The ECB predicts that 2008 inflation to 2.5 percent from the earlier 2.1 will climb, but back to 1.9 percent in 2009 will drop.
Baseline translation: Die EZB sagt, dass 2008 die Inflationsrate wird auf 2,5 Prozent aus der früheren 2,1, sondern fallen zurück auf 1,9 Prozent im Jahr 2009.
                      (The ECB says, that 2008 the inflation rate will to 2.5 percent from the earlier 2.1, but fall back to 1.9 percent in the year 2009.)
Reordered translation: Die EZB prophezeit, dass 2008 die Inflation zu 2,5 Prozent aus der früheren 2,1 ansteigen wird, aber auf 1,9 Prozent in 2009 sinken wird.
                      (The ECB predicts, that 2008 the inflation rate to 2.5 percent from the earlier 2.1 climb will, but to 1.9 percent in 2009 fall will.)

Input 3:              Labour Minister Mónika Lamperth appears not to have a sensitive side.
Reordered input:      Labour Minister Mónika Lamperth appears a sensitive side not to have.
Baseline translation: Arbeitsminister Mónika Lamperth scheint nicht eine sensible Seite.
                      (Labour Minister Mónika Lamperth appears not a sensitive side.)
Reordered translation: Arbeitsminister Mónika Lamperth scheint eine sensible Seite nicht zu haben.
                      (Labour Minister Mónika Lamperth appears a sensitive side not to have.)

Table 5: Example translations; the baseline has problems with verbal elements, the reordered system is correct.
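The separation problems discussed in sections 6.2 and 6.3 can be quantified with a simple token-distance measure. The following sketch is illustrative only; the helper function and the example clauses are assumptions, not the authors' analysis code.

```python
def intervening_tokens(tokens, left_word, right_word):
    """Number of tokens standing between two words of a clause, e.g. between
    the subject and a verb that was moved to the clause end (section 6.2.1)."""
    i, j = tokens.index(left_word), tokens.index(right_word)
    return abs(j - i) - 1

# Reordered clause (4b): the verbs are moved away from the subject 'they'.
reordered = "because they it to me yesterday said have".split()
print(intervening_tokens(reordered, "they", "said"))  # 4

# Original clause (4a): subject and finite auxiliary are adjacent.
original = "because they have said it to me yesterday".split()
print(intervening_tokens(original, "they", "have"))   # 0
```

Large values of this distance are exactly the cases in which the surrounding language model can no longer help disambiguate the verb translation.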
Table 6 shows correctly reordered clauses which might pose a problem for translation (see sections 6.2-6.3). Although the positions of the verbs in the translations are now correct, the distance between subjects and verbs, or between verbs in a single VP, might lead to the generation of erroneously inflected verbs. The separate generation of German verbal morphology is an interesting area of future work, see (de Gispert and Mariño, 2008). We also found 2 problematic collocations, but note that this only gives a rough idea of the problem; further study is needed.

                 total   d >= 5 tokens
subject-verb     40      19
verb dependency  32      14
collocations     8       2

Table 6: total is the number of clauses found for the respective phenomenon. d >= 5 tokens is the number of clauses where the distance between the relevant tokens is at least 5, which is problematic.

6.5 POS-based disambiguation of the English verbs

With respect to the problems described in 6.2.1 and 6.2.2, we carried out an experiment in which we used POS tags in order to disambiguate the English verbs. For example, the English verb said corresponds to the German participle gesagt, as well as to the finite verb in simple past, e.g. sagte. We attached the POS tags to the English verbs in order to simulate a disambiguating suffix of a verb (e.g. said => said_VBN, said_VBD). The idea behind this was to extract the correct verbal translation phrases and score them with appropriate translation probabilities (e.g. p(said_VBN, gesagt) > p(said_VBN, sagte)).

We built and tested two PSMT systems using the data enriched with verbal POS tags. The first system is trained and tested on the original English sentences, while the contrastive one was trained and tested on the reordered English sentences. Evaluation results are shown in table 7. The baseline obtains a gain of 0.09 and the contrastive system of 0.05 BLEU points over the corresponding PSMT system without POS tags. Although there are verbs which are now generated correctly, the overall translation improvement lies under our expectation. We will directly model the inflection of German verbs in future work.

        Baseline + POS   Reordered + POS
BLEU    13.11            13.68

Table 7: BLEU scores of the baseline and the contrastive SMT system using verbal POS tags

7 Discussion and future work

We implemented reordering rules for English verbal complexes because their placement differs significantly from German placement. The implementation required dealing with three important problems: (i) definition of the clause boundaries, (ii) identification of the new verb positions and (iii) correct splitting of the verbal complexes.

We showed some phenomena for which a stochastic reordering would be more appropriate. For example, since in German the auxiliary and the main verb of a verbal complex can occupy different positions in a clause, we had to define the English counterparts of the two components of the German verbal complex. We defined non-finite English verbal elements as a part of the main verb complex, which are then moved together with the main verb. This rigid definition could be relaxed by considering multiple different splittings and movements of the English verbs.

Furthermore, the reordering rules are applied on a clause, not allowing for movements across the clause boundaries. However, we also showed that in some cases the main verbs may be moved after the succeeding subclause. Stochastic rules could allow for both placements or carry out the more probable reordering given a specific context. We will address these issues in future work.

Unfortunately, some important contextual information is lost when splitting and moving English verbs. When English verbs are highly ambiguous, erroneous German verbs can be generated. The experiment described in section 6.5 shows that more effort should be made in order to overcome this problem. The incorporation of separate morphological generation of inflected German verbs would improve translation.

8 Conclusion

We presented a method for reordering English as a preprocessing step for English-to-German SMT. To our knowledge, this is one of the first papers which reports on experiments regarding the reordering problem for English-to-German SMT. We showed that the reordering rules specified in this work lead to improved translation quality. We observed that verbs are placed correctly more often than in the baseline, and that verbs which were omitted in the baseline are now often generated. We carried out a thorough analysis of the rules applied and discussed problems which are related to highly ambiguous English verbs. Finally, we presented ideas for future work.

Acknowledgments

This work was funded by Deutsche Forschungsgemeinschaft grant Models of Morphosyntax for Statistical Machine Translation.

References

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL.
Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
Adrià de Gispert and José B. Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12).
Chris Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In ACL-HLT.
Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In COLING.
Deepa Gupta, Mauro Cettolo, and Marcello Federico. 2007. POS-based reordering models for statistical machine translation. In Proceedings of the Machine Translation Summit (MT-Summit).
Nizar Habash. 2007. Syntactic preprocessing for statistical machine translation. In Proceedings of the Machine Translation Summit (MT-Summit).
Jason Katz-Brown, Slav Petrov, Ryan McDonald, Franz Och, David Talbot, Hiroshi Ichikawa, Masakazu Seno, and Hideto Kazawa. 2011. Training a parser for machine translation reordering. In EMNLP.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, Demonstration Program.
Philipp Koehn. 2004.
Statistical significance tests for machine translation evaluation. In EMNLP.
Jan Niehues and Muntsin Kolss. 2009. A POS-based model for long-range reorderings in SMT. In EACL Workshop on Statistical Machine Translation.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Peng Xu, Jaecho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In NAACL.

Syntax-Based Word Ordering Incorporating a Large-Scale Language Model

Yue Zhang (University of Cambridge, Computer Laboratory)
Graeme Blackwood (University of Cambridge, Engineering Department)
Stephen Clark (University of Cambridge, Computer Laboratory)

[email protected] [email protected] [email protected]

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 736-746, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

Abstract

A fundamental problem in text generation is word ordering. Word ordering is a computationally difficult problem, which can be constrained to some extent for particular applications, for example by using synchronous grammars for statistical machine translation. There have been some recent attempts at the unconstrained problem of generating a sentence from a multi-set of input words (Wan et al., 2009; Zhang and Clark, 2011). By using CCG and learning-guided search, Zhang and Clark reported the highest scores on this task. One limitation of their system is the absence of an N-gram language model, which has been used by text generation systems to improve fluency. We take the Zhang and Clark system as the baseline, and incorporate an N-gram model by applying online large-margin training. Our system significantly improved on the baseline by 3.7 BLEU points.

1 Introduction

One fundamental problem in text generation is word ordering, which can be abstractly formulated as finding a grammatical order for a multi-set of words. The word ordering problem can also include word choice, where only a subset of the input words are used to produce the output.

Word ordering is a difficult problem. Finding the best permutation for a set of words according to a bigram language model, for example, is NP-hard, which can be proved by linear reduction from the traveling salesman problem. In practice, exploring the whole search space of permutations is often prevented by adding constraints. In phrase-based machine translation (Koehn et al., 2003; Koehn et al., 2007), a distortion limit is used to constrain the position of output phrases. In syntax-based machine translation systems such as Wu (1997) and Chiang (2007), synchronous grammars limit the search space so that polynomial-time inference is feasible. In fluency improvement (Blackwood et al., 2010), parts of translation hypotheses identified as having high local confidence are held fixed, so that word ordering elsewhere is strictly local.

Some recent work attempts to address the fundamental word ordering task directly, using syntactic models and heuristic search. Wan et al. (2009) use a dependency grammar to solve word ordering, and Zhang and Clark (2011) use CCG (Steedman, 2000) for word ordering and word choice. The use of syntax models makes their search problems harder than word permutation using an N-gram language model only. Both methods apply heuristic search. Zhang and Clark developed a bottom-up best-first algorithm to build output syntax trees from input words, where search is guided by learning for both efficiency and accuracy. The framework is flexible in allowing a large range of constraints to be added for particular tasks.

We extend the work of Zhang and Clark (2011) (Z&C) in two ways. First, we apply online large-margin training to guide search. Compared to the perceptron algorithm on "constituent-level features" by Z&C, our training algorithm is theoretically more elegant (see Section 3) and converges more smoothly empirically (see Section 5). Using online large-margin training not only improves the output quality, but also allows the incorporation of an N-gram language model into the system. N-gram models have been used as a standard component in statistical machine translation, but have not been applied to the syntactic model of Z&C. Intuitively, an N-gram model can improve local fluency when added to a syntax model. Our experiments show that a four-gram model trained using the English GigaWord corpus gave improvements when added to the syntax-based baseline system.

The contributions of this paper are as follows. First, we improve on the performance of the Z&C system for the challenging task of the general word ordering problem. Second, we develop a novel method for incorporating a large-scale language model into a syntax-based generation system. Finally, we analyse large-margin training in the context of learning-guided best-first search, offering a novel solution to this computationally hard problem.

2 The statistical model and decoding algorithm

We take Z&C as our baseline system. Given a multi-set of input words, the baseline system builds a CCG derivation by choosing and ordering words from the input set. The scoring model is trained using CCGbank (Hockenmaier and Steedman, 2007), and best-first decoding is applied. We apply the same decoding framework in this paper, but apply an improved training process, and incorporate an N-gram language model into the syntax model. In this section, we describe and discuss the baseline statistical model and decoding framework, motivating our extensions.

2.1 Combinatory Categorial Grammar

CCG, and parsing with CCG, has been described elsewhere (Clark and Curran, 2007; Hockenmaier and Steedman, 2002); here we provide only a short description.

CCG (Steedman, 2000) is a lexicalized grammar formalism, which associates each word in a sentence with a lexical category. There is a small number of basic lexical categories, such as noun (N), noun phrase (NP), and prepositional phrase (PP). Complex lexical categories are formed recursively from basic categories and slashes, which indicate the directions of arguments. The CCG grammar used by our system is read off the derivations in CCGbank, following Hockenmaier and Steedman (2002), meaning that the CCG combinatory rules are encoded as rule instances, together with a number of additional rules which deal with punctuation and type-changing. Given a sentence, its CCG derivation can be produced by first assigning a lexical category to each word, and then recursively applying CCG rules bottom-up.

2.2 The decoding algorithm

In the decoding algorithm, a hypothesis is an edge, which corresponds to a sub-tree in a CCG derivation. Edges are built bottom-up, starting from leaf edges, which are generated by assigning all possible lexical categories to each input word. Each leaf edge corresponds to an input word with a particular lexical category. Two existing edges can be combined if there exists a CCG rule which combines their category labels, and if they do not contain the same input word more times than its total count in the input. The resulting edge is assigned a category label according to the combinatory rule, and covers the concatenated surface strings of the two sub-edges in their order of combination. New edges can also be generated by applying unary rules to a single existing edge. Starting from the leaf edges, the bottom-up process is repeated until a goal edge is found, and its surface string is taken as the output.

This derivation-building process is reminiscent of a bottom-up CCG parser in the edge combination mechanism. However, it is fundamentally different from a bottom-up parser. Since, for the generation problem, the order of two edges in their combination is flexible, the search problem is much harder than that of a parser. With no input order specified, no efficient dynamic-programming algorithm is available, and less contextual information is available for disambiguation due to the lack of an input string.

In order to combat the large search space, best-first search is applied, where candidate hypotheses are ordered by their scores and kept in an agenda, and a limited number of accepted hypotheses are recorded in a chart. Here the chart is essentially a set of beams, each of which contains the highest-scored edges covering a particular number of words. Initially, all leaf edges are generated and scored, before they are put onto the agenda. During each step in the decoding process, the top edge from the agenda is expanded. If it is a goal edge, it is returned as the output, and the decoding finishes. Otherwise it is extended with the unary rules, and combined with existing edges in the chart using binary rules to produce new edges. The resulting edges are scored and put onto the agenda, while the original edge is put onto the chart. The process repeats until a goal edge is found, or a timeout limit is reached. In the latter case, a default output is produced using existing edges in the chart.

Pseudocode for the decoder is shown as Algorithm 1. Again it is reminiscent of a best-first parser (Caraballo and Charniak, 1998) in the use of an agenda and a chart, but is fundamentally different due to the fact that there is no input order.

Algorithm 1 The decoding algorithm.
  a <- InitAgenda()
  c <- InitChart()
  while not TimeOut() do
    new <- []
    e <- PopBest(a)
    if GoalTest(e) then
      return e
    end if
    for e' in Unary(e, grammar) do
      Append(new, e')
    end for
    for ê in c do
      if CanCombine(e, ê) then
        e' <- Binary(e, ê, grammar)
        Append(new, e')
      end if
      if CanCombine(ê, e) then
        e' <- Binary(ê, e, grammar)
        Append(new, e')
      end if
    end for
    for e' in new do
      Add(a, e')
    end for
    Add(c, e)
  end while

2.3 Statistical model and feature templates

The baseline system uses a linear model to score hypotheses. For an edge e, its score is defined as:

  f(e) = Φ(e) · θ,

where Φ(e) represents the feature vector of e and θ is the parameter vector of the model.

During decoding, feature vectors are computed incrementally. When an edge is constructed, its score is computed from the scores of its sub-edges and the incrementally added structure:

  f(e) = Φ(e) · θ
       = ( Σ_{e_s ∈ e} Φ(e_s) + φ(e) ) · θ
       = Σ_{e_s ∈ e} Φ(e_s) · θ + φ(e) · θ
       = Σ_{e_s ∈ e} f(e_s) + φ(e) · θ

In the equation, e_s ∈ e represents a sub-edge of e. Leaf edges do not have any sub-edges. Unary-branching edges have one sub-edge, and binary-branching edges have two sub-edges. The feature vector φ(e) represents the incremental structure when e is constructed over its sub-edges. It is called the "constituent-level feature vector" by Z&C. For leaf edges, φ(e) includes information about the lexical category label; for unary-branching edges, φ(e) includes information from the unary rule; for binary-branching edges, φ(e) includes information from the binary rule, and additionally the token, POS and lexical category bigrams and trigrams that result from the surface string concatenation of its sub-edges. The score f(e) is therefore the sum of f(e_s) (for all e_s ∈ e) plus φ(e) · θ. The feature templates we use are the same as those in the baseline system.

An important aspect of the scoring model is that edges with different sizes are compared with each other during decoding. Edges with different sizes can have different numbers of features, which can make the training of a discriminative model more difficult. For example, a leaf edge with one word can be compared with an edge over the entire input. One way of reducing the effect of the size difference is to include the size of the edge as part of the feature definitions, which can improve the comparability of edges of different sizes by reducing the number of features they have in common. Such features are applied by Z&C, and we make use of them here. Even with such features, the question of whether edges with different sizes are linearly separable is an empirical one.

3 Training

The efficiency of the decoding algorithm is dependent on the statistical model, since the best-first search is guided to a solution by the model,

Algorithm 2 The training algorithm.
Otherwise it is extended with plus φ(e) · θ. The feature templates we use are the unary rules, and combined with existing edges in same as those in the baseline system. the chart using binary rules to produce new edges. An important aspect of the scoring model is that The resulting edges are scored and put onto the edges with different sizes are compared with each agenda, while the original edge is put onto the other during decoding. Edges with different sizes chart. The process repeats until a goal edge is can have different numbers of features, which can found, or a timeout limit is reached. In the latter make the training of a discriminative model more case, a default output is produced using existing difficult. For example, a leaf edge with one word edges in the chart. can be compared with an edge over the entire in- Pseudocode for the decoder is shown as Algo- put. One way of reducing the effect of the size dif- rithm 1. Again it is reminiscent of a best-first ference is to include the size of the edge as part of parser (Caraballo and Charniak, 1998) in the use feature definitions, which can improve the compa- of an agenda and a chart, but is fundamentally dif- rability of edges of different sizes by reducing the ferent due to the fact that there is no input order. number of features they have in common. Such 2.3 Statistical model and feature templates features are applied by Z&C, and we make use of them here. Even with such features, the question The baseline system uses a linear model to score of whether edges with different sizes are linearly hypotheses. For an edge e, its score is defined as: separable is an empirical one. f (e) = Φ(e) · θ, 3 Training where Φ(e) represents the feature vector of e and The efficiency of the decoding algorithm is de- θ is the parameter vector of the model. pendent on the statistical model, since the best- 738 first search is guided to a solution by the model, Algorithm 2 The training algorithm. 
and a good model will lead to a solution being found more quickly. In the ideal situation for the best-first decoding algorithm, the model is perfect and the score of any gold-standard edge is higher than the score of any non-gold-standard edge. As a result, the top edge on the agenda is always a gold-standard edge, and therefore all edges on the chart are gold-standard before the gold-standard goal edge is found. In this oracle procedure, the minimum number of edges is expanded, and the output is correct. The best-first decoder is perfect in not only accuracy, but also speed. In practice this ideal situation is rarely met, but it determines the goal of the training algorithm: to produce the perfect model and hence decoder.

If we take gold-standard edges as positive examples, and non-gold-standard edges as negative examples, the goal of the training problem can be viewed as finding a large separating margin between the scores of positive and negative examples. However, it is infeasible to generate the full space of negative examples, which is factorial in the size of the input. Like Z&C, we apply online learning, and generate negative examples based on the decoding algorithm.

Our training algorithm is shown as Algorithm 2. The algorithm is based on the decoder, where an agenda is used as a priority queue of edges to be expanded, and a set of accepted edges is kept in a chart.

Algorithm 2 The training algorithm.
  a ← InitAgenda()
  c ← InitChart()
  while not TimeOut() do
    new ← []
    e ← PopBest(a)
    if GoldStandard(e) and GoalTest(e) then
      return e
    end if
    if not GoldStandard(e) then
      e− ← e
      e+ ← MinGold(a)
      UpdateParameters(e+, e−)
      RecomputeScores(a, c)
      continue
    end if
    for e′ ∈ Unary(e, grammar) do
      Append(new, e′)
    end for
    for ẽ ∈ c do
      if CanCombine(e, ẽ) then
        e′ ← Binary(e, ẽ, grammar)
        Append(new, e′)
      end if
      if CanCombine(ẽ, e) then
        e′ ← Binary(ẽ, e, grammar)
        Append(new, e′)
      end if
    end for
    for e′ ∈ new do
      Add(a, e′)
    end for
    Add(c, e)
  end while

Similar to the decoding algorithm, the agenda is initialized using all possible leaf edges. During each step, the top of the agenda e is popped. If it is a gold-standard edge, it is expanded in exactly the same way as the decoder, with the newly generated edges being put onto the agenda, and e being inserted into the chart. If e is not a gold-standard edge, we take it as a negative example e−, and take the lowest scored gold-standard edge on the agenda e+ as a positive example, in order to make an update to the model parameter vector θ. Our parameter update algorithm is different from the baseline perceptron algorithm, as will be discussed later. After updating the parameters, the scores of agenda edges above and including e−, together with all chart edges, are updated, and e− is discarded before the start of the next processing step. By not putting any non-gold-standard edges onto the chart, the training speed is much faster; on the other hand a wide range of negative examples is pruned. We leave for further work possible alternative methods to generate more negative examples during training.

Another way of viewing the training process is that it pushes gold-standard edges towards the top of the agenda, and crucially pushes them above non-gold-standard edges. This is the view described by Z&C. Given a positive example e+ and a negative example e−, they use the perceptron algorithm to penalize the score for φ(e−) and reward the score of φ(e+), but do not update parameters for the sub-edges of e+ and e−. An argument for not penalizing the sub-edge scores for e− is that the sub-edges must be gold-standard edges (since the training process is constructed so that only gold-standard edges are expanded).
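The update step of the training algorithm above can be sketched concretely. The sketch below is illustrative rather than the authors' implementation: feature vectors are plain Python dicts, and `margin_update` (a hypothetical name) makes the minimal change to θ that lets the gold edge e+ outscore the non-gold edge e− by a margin of 1, in the style of 1-best MIRA / passive-aggressive updates.

```python
# Illustrative sketch of the parameter update used during learning-guided
# best-first training: e- is the non-gold edge popped from the agenda, e+ is
# the lowest-scored gold-standard edge still on it. Feature vectors are dicts
# mapping feature names to counts; names here are assumptions, not the paper's.

def dot(theta, phi):
    """Linear model score f(e) = Phi(e) . theta."""
    return sum(theta.get(f, 0.0) * v for f, v in phi.items())

def margin_update(theta, phi_pos, phi_neg):
    """Minimal update of theta so that score(e+) - score(e-) >= 1.

    Closed form of: argmin ||theta' - theta|| s.t. (Phi(e+) - Phi(e-)).theta' >= 1.
    Returns theta unchanged when the margin constraint is already satisfied."""
    diff = dict(phi_pos)
    for f, v in phi_neg.items():
        diff[f] = diff.get(f, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    loss = dot(theta, phi_neg) - dot(theta, phi_pos) + 1.0
    if loss <= 0.0 or norm_sq == 0.0:
        return theta
    step = loss / norm_sq
    updated = dict(theta)
    for f, v in diff.items():
        updated[f] = updated.get(f, 0.0) + step * v
    return updated
```

After such an update the affected agenda and chart scores would be recomputed, as in Algorithm 2; note that when the constraint is violated, the closed-form step leaves e+ scoring exactly 1 higher than e−.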
From the perspective of correctness, it is unnecessary to find a margin between the sub-edges of e+ and those of e−, since both are gold-standard edges. However, since the score of an edge not only represents its correctness, but also affects its priority on the agenda, promoting the sub-edges of e+ can lead to "easier" edges being constructed before "harder" ones (i.e. those that are less likely to be correct), and therefore improve the output accuracy. This perspective has been observed by other work on learning-guided search (Shen et al., 2007; Shen and Joshi, 2008; Goldberg and Elhadad, 2010). Intuitively, the score difference between easy gold-standard and harder gold-standard edges should not be as great as the difference between gold-standard and non-gold-standard edges. The perceptron update cannot provide such control of separation, because the amount of update is fixed to 1.

As described earlier, we treat parameter update as finding a separation between correct and incorrect edges, in which the global feature vectors Φ, rather than φ, are considered. Given a positive example e+ and a negative example e−, we make a minimum update so that the score of e+ is higher than that of e− with some margin:

    θ ← argmin_{θ′} ‖θ′ − θ0‖, s.t. Φ(e+) · θ′ − Φ(e−) · θ′ ≥ 1

where θ0 and θ denote the parameter vectors before and after the update, respectively. The update is similar to the update of online large-margin learning algorithms such as 1-best MIRA (Crammer et al., 2006), and has a closed-form solution:

    θ ← θ0 + ((f(e−) − f(e+) + 1) / ‖Φ(e+) − Φ(e−)‖²) · (Φ(e+) − Φ(e−))

In this update, the global feature vectors Φ(e+) and Φ(e−) are used. Unlike Z&C, the scores of sub-edges of e+ and e− are also updated, so that the sub-edges of e− are less prioritized than those of e+. We show empirically that this training algorithm significantly outperforms the perceptron training of the baseline system in Section 5. An advantage of our new training algorithm is that it enables the accommodation of a separately trained N-gram model into the system.

4 Incorporating an N-gram language model

Since the seminal work of the IBM models (Brown et al., 1993), N-gram language models have been used as a standard component in statistical machine translation systems to control output fluency. For the syntax-based generation system, the incorporation of an N-gram language model can potentially improve the local fluency of output sequences. In addition, the N-gram language model can be trained separately using a large amount of data, while the syntax-based model requires manual annotation for training.

The standard method for the combination of a syntax model and an N-gram model is linear interpolation. We incorporate fourgram, trigram and bigram scores into our syntax model, so that the score of an edge e becomes:

    F(e) = f(e) + g(e)
         = f(e) + α · g_four(e) + β · g_tri(e) + γ · g_bi(e),

where f is the syntax model score, and g is the N-gram model score. g consists of three components, g_four, g_tri and g_bi, representing the log-probabilities of fourgrams, trigrams and bigrams from the language model, respectively. α, β and γ are the corresponding weights.

During decoding, F(e) is computed incrementally. Again, denoting the sub-edges of e as es,

    F(e) = f(e) + g(e)
         = Σ_{es∈e} F(es) + φ(e) · θ + gδ(e)

Here gδ(e) = α · gδ_four(e) + β · gδ_tri(e) + γ · gδ_bi(e) is the sum of log-probabilities of the new N-grams resulting from the construction of e. For leaf edges and unary-branching edges, no new N-grams result from their construction (i.e. gδ(e) = 0). For a binary-branching edge, new N-grams result from the surface-string concatenation of its sub-edges. The sums of log-probabilities of the new fourgrams, trigrams and bigrams contribute to gδ with weights α, β and γ, respectively.

For training, there are at least three methods to tune α, β, γ and θ.
One simple method is to train the syntax model θ independently, and select α, β, and γ empirically from a range of candidate values according to development tests. We call this method test-time interpolation. An alternative is to select α, β and γ first, initializing the vector θ as all zeroes, and then run the training algorithm for θ taking into account the N-gram language model. In this process, g is considered when finding a separation between positive and negative examples; the training algorithm finds a value of θ that best suits the precomputed α, β and γ values, together with the N-gram language model. We call this method g-precomputed interpolation. Yet another method is to initialize α, β, γ and θ as all zeroes, and run the training algorithm taking into account the N-gram language model. We call this method g-free interpolation.

The incorporation of an N-gram language model into the syntax-based generation system is weakly analogous to N-gram model insertion for syntax-based statistical machine translation systems, both of which apply a score from the N-gram model component in a derivation-building process. As discussed earlier, polynomial-time decoding is typically feasible for syntax-based machine translation systems without an N-gram language model, due to constraints from the grammar. In these cases, incorporation of N-gram language models can significantly increase the complexity of a dynamic-programming decoder (Bar-Hillel et al., 1961). Efficient search has been achieved using chart pruning (Chiang, 2007) and iterative numerical approaches to constrained optimization (Rush and Collins, 2011). In contrast, the incorporation of an N-gram language model into our decoder is more straightforward, and does not add to its asymptotic complexity, due to the heuristic nature of the decoder.

5 Experiments

We use sections 2–21 of CCGBank to train our syntax model, section 00 for development and section 23 for the final test. Derivations from CCGBank are transformed into inputs by turning their surface strings into multi-sets of words. Following Z&C, we treat base noun phrases (i.e. NPs that do not recursively contain other NPs) as atomic units for the input. Output sequences are compared with the original sentences to evaluate their quality. We follow previous work and use the BLEU metric (Papineni et al., 2002) to compare outputs with references.

Z&C use two methods to construct leaf edges. The first is to assign lexical categories according to a dictionary. There are 26.8 lexical categories for each word on average using this method, corresponding to 26.8 leaf edges. The other method is to use a pre-processing step — a CCG supertagger (Clark and Curran, 2007) — to prune candidate lexical categories according to the gold-standard sequence, assuming that for some problems the ambiguities can be reduced (e.g. when the input is already partly correctly ordered). Z&C use different probability cutoff levels (the β parameter in the supertagger) to control the pruning. Here we focus mainly on the dictionary method, which leaves lexical category disambiguation entirely to the generation system. For comparison, we also perform experiments with lexical category pruning. We chose β = 0.0001, which leaves 5.4 leaf edges per word on average.

We used the SRILM Toolkit (Stolcke, 2002) to build a true-case 4-gram language model estimated over the CCGBank training and development data and a large additional collection of fluent sentences in the Agence France-Presse (AFP) and Xinhua News Agency (XIN) subsets of the English GigaWord Fourth Edition (Parker et al., 2009), a total of over 1 billion tokens. The GigaWord data was first pre-processed to replicate the CCGBank tokenization. The total number of sentences and tokens in each LM component is shown in Table 1.

  CCGBank        Sentences     Tokens
  training       39,604        929,552
  development    1,913         45,422

  GigaWord v4    Sentences     Tokens
  AFP            30,363,052    684,910,697
  XIN            15,982,098    340,666,976

Table 1: Number of sentences and tokens by language model source.

The language model vocabulary consists of the 46,574 words that occur in the concatenation of the CCGBank training, development, and test sets.
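The incremental language-model term gδ(e) introduced in Section 4 can be made concrete with a short sketch. This is illustrative rather than the paper's implementation: `logprob` is a stand-in for a trained N-gram model, and only the N-grams that span the point where the two sub-edges' surface strings are concatenated contribute to the new edge's score.

```python
# Sketch (assumed names, not the authors' code) of the incremental N-gram
# contribution g_delta(e) for a binary-branching edge: only N-grams crossing
# the boundary between the left and right sub-edge surface strings are new.

def new_ngrams(left, right, n):
    """N-grams of left + right that span the boundary between the two halves."""
    words = left + right
    out = []
    for i in range(len(words) - n + 1):
        if i < len(left) and i + n > len(left):  # covers words from both sides
            out.append(tuple(words[i:i + n]))
    return out

def g_delta(left, right, logprob, alpha, beta, gamma):
    """g_delta(e) = alpha*g_four + beta*g_tri + gamma*g_bi over the new N-grams."""
    weights = {4: alpha, 3: beta, 2: gamma}
    total = 0.0
    for n, w in weights.items():
        total += w * sum(logprob(ng) for ng in new_ngrams(left, right, n))
    return total
```

With this term, F(e) = Σ_{es∈e} F(es) + φ(e) · θ + gδ(e) can be accumulated bottom-up exactly as the syntax score f(e) is, so the language model adds no asymptotic cost to the best-first decoder.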
The LM probabilities are estimated using modified Kneser-Ney smoothing (Kneser and Ney, 1995) with interpolation of lower n-gram orders.

5.1 Development experiments

A set of development test results without lexical category pruning (i.e. using the full dictionary) is shown in Table 2. We train the baseline system and our systems under various settings for 10 iterations, and measure the output BLEU scores after each iteration. The timeout value for each sentence is set to 5 seconds. The highest score (max BLEU) and averaged score (avg. BLEU) of each system over the 10 training iterations are shown in the table.

  Method                                         max BLEU    avg. BLEU
  baseline                                       38.47       37.36
  margin                                         41.20       39.70
  margin +LM (g-precomputed)                     41.50       40.84
  margin +LM (α = 0, β = 0, γ = 0)               40.83       —
  margin +LM (α = 0.08, β = 0.016, γ = 0.004)    38.99       —
  margin +LM (α = 0.4, β = 0.08, γ = 0.02)       36.17       —
  margin +LM (α = 0.8, β = 0.16, γ = 0.04)       34.74       —

Table 2: Development experiments without lexical category pruning.

The first three rows represent the baseline system, our large-margin training system (margin), and our system with the N-gram model incorporated using g-precomputed interpolation. For interpolation we manually chose α = 0.8, β = 0.16 and γ = 0.04, respectively. These values could be optimized by development experiments with alternative configurations, which may lead to further improvements. Our system with large-margin training gives higher BLEU scores than the baseline system consistently over all iterations. The N-gram model led to further improvements.

The last four rows in the table show results of our system with the N-gram model added using test-time interpolation. The syntax model is trained with the optimal number of iterations, and different α, β, and γ values are used to integrate the language model. Compared with the system using no N-gram model (margin), test-time interpolation did not improve the accuracies.

The row with α, β, γ = 0 represents our system with the N-gram model loaded, and the scores g_four, g_tri and g_bi computed for each N-gram during decoding, but the scores of edges are computed without using N-gram probabilities. The scoring model is the same as the syntax model (margin), but the results are lower than the row "margin", because computing N-gram probabilities made the system slower, exploring fewer hypotheses under the same timeout setting.¹

  ¹ More decoding time could be given to the slower N-gram system, but we use 5 seconds as the timeout setting for all the experiments, giving the methods with the N-gram language model a slight disadvantage, as shown by the two rows "margin" and "margin +LM (α, β, γ = 0)".

The comparison between g-precomputed interpolation and test-time interpolation shows that the system gives better scores when the syntax model takes into consideration the N-gram model during training. One question that arises is whether g-free interpolation will outperform g-precomputed interpolation. g-free interpolation offers the freedom of α, β and γ during training, and can potentially reach a better combination of the parameter values. However, the training algorithm failed to converge with g-free interpolation. One possible explanation is that real-valued features from the language model made our large-margin training harder. Another possible reason is that our training process with heavy pruning does not accommodate this complex model.

Figure 1 shows a set of development experiments with lexical category pruning (with the supertagger parameter β = 0.0001).

[Figure 1: Development experiments with lexical category pruning (β = 0.0001). BLEU (y-axis, 37–45) against training iteration (x-axis, 1–10) for the baseline, margin, and margin +LM systems.]

The scores of the three different systems are calculated by varying the number of training iterations. The large-margin training system (margin) gave consistently better scores than the baseline system, and adding a language model (margin +LM) improves the scores further.

Table 3 shows some manually chosen examples for which our system gave significant improvements over the baseline. For most other sentences the improvements are not as obvious. For each
method, the examples are chosen from the development output with lexical category pruning, after the optimal number of training iterations, with the timeout set to 5s. We also tried manually selecting examples without lexical category pruning, but the improvements were not as obvious, partly because the overall fluency was lower for all three systems.

[Table 3: Some chosen examples with significant improvements (supertagger parameter β = 0.0001); outputs of the baseline, margin, and margin +LM systems are shown side by side.]

Table 4 shows a set of examples chosen randomly from the development test outputs of our system with the N-gram model. The optimal number of training iterations is used, and a timeout of 1 minute is used in addition to the 5s timeout for comparison. With more time to decode each input, the system gave a BLEU score of 44.61, higher than 41.50 with the 5s timeout.

While some of the outputs we examined are reasonably fluent, most are to some extent fragmentary.² In general, the system outputs are still far below human fluency. Some samples are syntactically grammatical, but are semantically anomalous. For example, person names are often confused with company names, and verbs often take unrelated subjects and objects. The problem is much more severe for long sentences, which have more ambiguities. For specific tasks, extra information (such as the source text for machine translation) can be available to reduce ambiguities.

6 Final results

The final results of our system without lexical category pruning are shown in Table 5. Rows "W09 CLE" and "W09 AB" show the results of the maximum spanning tree and assignment-based algorithms of Wan et al. (2009); rows "margin" and "margin +LM" show the results of our large-margin training system and our system with the N-gram model. All these results are directly comparable since we do not use any lexical category pruning for this set of results.
  ² Part of the reason for some fragmentary outputs is the default output mechanism: partial derivations from the chart are greedily put together when timeout occurs before a goal hypothesis is found.

[Table 4: Some examples chosen at random from development test outputs without lexical category pruning, comparing timeout = 5s with timeout = 1m.]

For each of our systems, we fix the number of training iterations according to development test scores. Consistent with the development experiments, our system outperforms the baseline methods. The accuracies are significantly higher when the N-gram model is incorporated.

  System        BLEU
  W09 CLE       26.8
  W09 AB        33.7
  Z&C11         40.1
  margin        42.5
  margin +LM    43.8

Table 5: Test results without lexical category pruning.

Table 6 compares our system with Z&C using lexical category pruning (β = 0.0001) and a 5s timeout for fair comparison. The results are similar to Table 5: our large-margin training system outperforms the baseline by 1.5 BLEU points, and adding the N-gram model gave a further 1.4 point improvement. The scores could be significantly increased by using a larger timeout, as shown in our earlier development experiments.

  System        BLEU
  Z&C11         43.2
  margin        44.7
  margin +LM    46.1

Table 6: Test results with lexical category pruning (supertagger parameter β = 0.0001).

7 Related Work

There is a recent line of research on text-to-text generation, which studies the linearization of dependency structures (Barzilay and McKeown, 2005; Filippova and Strube, 2007; Filippova and Strube, 2009; Bohnet et al., 2010; Guo et al., 2011). Unlike our system, and Wan et al. (2009), input dependencies provide additional information to these systems. Although the search space can be constrained by the assumption of projectivity, permutation of modifiers of the same head word makes exact inference for tree linearization intractable. The above systems typically apply approximate inference, such as beam-search. While syntax-based features are commonly used by these systems for linearization, Filippova and Strube (2009) apply a trigram model to control local fluency within constituents. A dependency-based N-gram model has also been shown effective for the linearization task (Guo et al., 2011). The best-first inference and timeout mechanism of our system is similar to that of White (2004), a surface realizer from logical forms using CCG.

8 Conclusion

We studied the problem of word ordering using a syntactic model and allowing permutation. We took the model of Zhang and Clark (2011) as the baseline, and extended it with online large-margin training and an N-gram language model. These extensions led to improvements in the BLEU evaluation. Analyzing the generated sentences suggests that, while highly fluent outputs can be produced for short sentences (≤ 10 words), the system fluency in general is still well below the human standard. Future work remains to apply the system as a component for specific text generation tasks, for example machine translation.

Acknowledgements

Yue Zhang and Stephen Clark are supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762.

References

Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Application, Addison-Wesley, 116–150.

Regina Barzilay and Kathleen McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328.

Graeme Blackwood, Adrià de Gispert, and William Byrne. 2010. Fluency constraints for minimum Bayes-risk decoding of statistical machine translation lattices. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 71–79, Beijing, China, August.

Bernd Bohnet, Leo Wanner, Simon Mill, and Alicia Burga. 2010. Broad coverage multilingual deep sentence generation with a stochastic multi-level realizer. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 98–106, Beijing, China, August.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Sharon A. Caraballo and Eugene Charniak. 1998. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275–298, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585.

Katja Filippova and Michael Strube. 2007. Generating constituent order in German clauses. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 320–327, Prague, Czech Republic, June.

Katja Filippova and Michael Strube. 2009. Tree linearization in English: Improving language model based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 225–228, Boulder, Colorado, June.

Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, California, June.

Yuqing Guo, Deirdre Hogan, and Josef van Genabith. 2011. DCU at Generation Challenges 2011 surface realisation track. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 227–229, Nancy, France, September.

Julia Hockenmaier and Mark Steedman. 2002. Generative models for statistical parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Meeting of the ACL, pages 335–342, Philadelphia, PA.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1, pages 181–184.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL/HLT, Edmonton, Canada, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword Fourth Edition. Linguistic Data Consortium.

Alexander M. Rush and Michael Collins. 2011. Exact decoding of syntactic translation models through Lagrangian relaxation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 72–82, Portland, Oregon, USA, June.

Libin Shen and Aravind Joshi. 2008. LTAG dependency parsing with bidirectional incremental construction. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 495–504, Honolulu, Hawaii, October.

Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of ACL, pages 760–767, Prague, Czech Republic, June.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, Mass.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901–904.

Stephen Wan, Mark Dras, Robert Dale, and Cécile Paris. 2009. Improving grammaticality in statistical sentence generation: Introducing a dependency spanning tree algorithm with an argument satisfaction model. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 852–860, Athens, Greece, March.

Michael White. 2004. Reining in CCG chart realization. In Proc. INLG-04, pages 182–191.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).

Yue Zhang and Stephen Clark. 2011. Syntax-based grammaticality improvement using CCG and guided search. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1147–1157, Edinburgh, Scotland, UK, July.

[email protected]

§ Stony Brook University, {aberg,tlberg,xufhan,kyamagu}@cs.stonybrook.edu †† U. of Maryland, {hal,amit}@umiacs.umd.edu k Columbia University,

[email protected]

‡‡ U. of Washington,

[email protected]

, ∗∗ MIT,

[email protected]

Abstract

This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date.

The bus by the road with a clear blue sky
Figure 1: Example image with generated description.

1 Introduction

It is becoming a real possibility for intelligent systems to talk about the visual world. New ways of mapping computer vision to generated language have emerged in the past few years, with a focus on pairing detections in an image to words (Farhadi et al., 2010; Li et al., 2011; Kulkarni et al., 2011; Yang et al., 2011). The goal in connecting vision to language has varied: systems have started producing language that is descriptive and poetic (Li et al., 2011), summaries that add content where the computer vision system does not (Yang et al., 2011), and captions copied directly from other images that are globally (Farhadi et al., 2010) and locally similar (Ordonez et al., 2011).

A commonality between all of these approaches is that they aim to produce natural-sounding descriptions from computer vision detections. This commonality is our starting point: We aim to design a system capable of producing natural-sounding descriptions from computer vision detections that are flexible enough to become more descriptive and poetic, or include likely information from a language model, or to be short and simple, but as true to the image as possible.

Rather than using a fixed template capable of generating one kind of utterance, our approach therefore lies in generating syntactic trees. We use a tree-generating process (Section 4.3) similar to a Tree Substitution Grammar, but preserving some of the idiosyncrasies of the Penn Treebank syntax (Marcus et al., 1995) on which most statistical parsers are developed. This allows us to automatically parse and train on an unlimited amount of text, creating data-driven models that flesh out descriptions around detected objects in a principled way, based on what is both likely and syntactically well-formed.

An example generated description is given in Figure 1, and example vision output/natural language generation (NLG) input is given in Figure 2. The system ("Midge") generates descriptions in present-tense, declarative phrases, as a naïve viewer without prior knowledge of the photograph's content.¹

¹ Midge is available to try online at: http://recognition.cs.stonybrook.edu:8080/~mitchema/midge/.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756, Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics

Midge is built using the following approach: An image processed by computer vision algorithms can be characterized as a triple <Ai, Bi, Ci>, where:

• Ai is the set of object/stuff detections with bounding boxes and associated "attribute" detections within those bounding boxes.
• Bi is the set of action or pose detections associated to each ai ∈ Ai.
• Ci is the set of spatial relationships that hold between the bounding boxes of each pair ai, aj ∈ Ai.

Similarly, a description of an image can be characterized as a triple <Ad, Bd, Cd>, where:

• Ad is the set of nouns in the description with associated modifiers.
• Bd is the set of verbs associated to each ad ∈ Ad.
• Cd is the set of prepositions that hold between each pair of ad, ae ∈ Ad.

With this representation, mapping <Ai, Bi, Ci> to <Ad, Bd, Cd> is trivial. The problem then becomes: (1) how to filter out detections that are wrong; (2) how to order the objects so that they are mentioned in a natural way; (3) how to connect these ordered objects within a syntactically/semantically well-formed tree; and (4) how to add further descriptive information from language modeling alone, if required.

Our solution lies in using Ai and Ad as description anchors. In computer vision, object detections form the basis of action/pose, attribute, and spatial relationship detections; therefore, in our approach to language generation, nouns for the object detections are used as the basis for the description. Likelihood estimates of syntactic structure and word co-occurrence are conditioned on object nouns, and this enables each noun head in a description to select for the kinds of structures it tends to appear in (syntactic constraints) and the other words it tends to occur with (semantic constraints). This is a data-driven way to generate likely adjectives, prepositions, determiners, etc., taking the intersection of what the vision system predicts and how the object noun tends to be described.

stuff: sky .999
  id: 1
  atts: clear:0.432, blue:0.945, grey:0.853, white:0.501 ...
  b.box: (1,1 440,141)
stuff: road .908
  id: 2
  atts: wooden:0.722, clear:0.020 ...
  b.box: (1,236 188,94)
object: bus .307
  id: 3
  atts: black:0.872, red:0.244 ...
  b.box: (38,38 366,293)
preps: id 1, id 2: by; id 1, id 3: by; id 2, id 3: below

Figure 2: Example computer vision output and natural language generation input. Values correspond to scores from the vision detections.

2 Background

Our approach to describing images starts with a system from Kulkarni et al. (2011) that composes novel captions for images in the PASCAL sentence data set,² introduced in Rashtchian et al. (2010). This provides multiple object detections based on Felzenszwalb's mixtures of multiscale deformable parts models (Felzenszwalb et al., 2008), and stuff detections (roughly, mass nouns, things like sky and grass) based on linear SVMs for low level region features.

² http://vision.cs.uiuc.edu/pascal-sentences/

Appearance characteristics are predicted using trained detectors for colors, shapes, textures, and materials, an idea originally introduced in Farhadi et al. (2009). Local texture, Histograms of Oriented Gradients (HOG) (Dalal and Triggs, 2005), edge, and color descriptors inside the bounding box of a recognized object are binned into histograms for a vision system to learn to recognize when an object is rectangular, wooden, metal, etc. Finally, simple preposition functions are used to compute the spatial relations between objects based on their bounding boxes.

The original Kulkarni et al. (2011) system generates descriptions with a template, filling in slots by combining computer vision outputs with text-based statistics in a conditional random field to predict the most likely image labeling. Template-based generation is also used in the recent Yang et al. (2011) system, which fills in likely verbs and prepositions by dependency parsing the human-written UIUC Pascal-VOC dataset (Farhadi et al., 2010) and selecting the dependent/head relation with the highest log likelihood ratio.

Template-based generation is useful for automatically generating consistent sentences; however, if the goal is to vary or add to the text produced, it may be suboptimal (cf. Reiter and Dale (1997)).

black, blue, brown, colorful, golden, gray, green, orange, pink, red, silver, white, yellow, bare, clear, cute, dirty, feathered, flying, furry, pine, plastic, rectangular, rusty, shiny, spotted, striped, wooden

Table 1: Modifiers used to extract training corpus.

Kulkarni et al.: This is a picture of three persons, one bottle and one diningtable. The first rusty person is beside the second person. The rusty bottle is near the first rusty person, and within the colorful diningtable. The second person is by the third rusty person. The colorful diningtable is near the first rusty person, and near the second person, and near the third rusty person.
Yang et al.: Three people are showing the bottle on the street
Midge: people with a bottle at the table

Kulkarni et al.: This is a picture of two pottedplants, one dog and one person. The black dog is by the black person, and near the second feathered pottedplant.
Yang et al.: The person is sitting in the chair in the room
Midge: a person in black by potted plants with a black dog

Figure 3: Descriptions generated by Midge, Kulkarni et al. (2011) and Yang et al. (2011) on the same images. Midge uses the Kulkarni et al. (2011) front-end, and so outputs are directly comparable.
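The "simple preposition functions" over bounding boxes mentioned above can be sketched as follows. This is a hedged, illustrative sketch: the geometry and the default relation are assumptions, not the definitions used by the Kulkarni et al. front-end.

```python
# Hedged sketch of a simple preposition function over bounding boxes.
# Boxes are (x1, y1, x2, y2) with the origin at the top-left; the exact
# geometry and defaults are illustrative assumptions, not the paper's.

def spatial_relation(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if ax1 >= bx1 and ay1 >= by1 and ax2 <= bx2 and ay2 <= by2:
        return "in"      # a's box lies entirely inside b's box
    if ay2 <= by1:
        return "over"    # a's box lies strictly above b's box
    return "by"          # default: nearby or overlapping

print(spatial_relation((2, 2, 8, 8), (0, 0, 10, 10)))  # → in
print(spatial_relation((0, 0, 5, 5), (0, 10, 5, 20)))  # → over
```

A relation computed this way can then be mapped to the surface prepositions listed later in Table 3.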
Work that does not use template-based generation includes Yao et al. (2010), who generate syntactic trees, similar to the approach in this paper. However, their system is not automatic, requiring extensive hand-coded semantic and syntactic details. Another approach is provided in Li et al. (2011), who use image detections to select and combine web-scale n-grams (Brants and Franz, 2006). This automatically generates descriptions that are either poetic or strange (e.g., "tree snowing black train").

A different line of work transfers captions of similar images directly to a query image. Farhadi et al. (2010) use <object,action,scene> triples predicted from the visual characteristics of the image to find potential captions. Ordonez et al. (2011) use global image matching with local reordering from a much larger set of captioned photographs. These transfer-based approaches result in natural captions (they are written by humans) that may not actually be true of the image.

This work learns and builds from these approaches. Following Kulkarni et al. and Li et al., the system uses large-scale text corpora to estimate likely words around object detections. Following Yang et al., the system can hallucinate likely words using word co-occurrence statistics alone. And following Yao et al., the system aims for naturally varied but well-formed text, generating syntactic trees rather than filling in a template. In addition to these tasks, Midge automatically decides what the subject and objects of the description will be, leverages the collected word co-occurrence statistics to filter possible incorrect detections, and offers the flexibility to be as descriptive or as terse as possible, specified by the user at run-time. The end result is a fully automatic vision-to-language system that is beginning to generate syntactically and semantically well-formed descriptions with naturalistic variation. Example descriptions are given in Figures 4 and 5, and descriptions from other recent systems are given in Figure 3.

The results are promising, but it is important to note that Midge is a first-pass system through the steps necessary to connect vision to language at a deep syntactic/semantic level. As such, it uses basic solutions at each stage of the process, which may be improved: Midge serves as an illustration of the types of issues that should be handled to automatically generate syntactic trees from vision detections, and offers some possible solutions. It is evaluated against the Kulkarni et al. system, the Yang et al. system, and human-written descriptions on the same set of images in Section 5, and is found to significantly outperform the automatic systems.

a cow with sheep with a gray sky · people with boats · a brown cow · people at green grass by the road · a wooden table

Figure 4: Example generated outputs.

Awkward Prepositions: a person boats under the sky · cows by the road
Incorrect Detections: a black bicycle at the sky · a yellow bus · black sheep on the dog · a green potted plant with people

Figure 5: Example generated outputs: not quite right.

3 Learning from Descriptive Text

To train our system on how people describe images, we use 700,000 Flickr images (Flickr, 2011) with associated descriptions from the dataset in Ordonez et al. (2011). This is separate from our evaluation image set, which consists of 840 PASCAL images. The Flickr data is messier than datasets created specifically for vision training, but provides the largest corpus of natural descriptions of images to date.

We normalize the text by removing emoticons and mark-up language, and parse each caption using the Berkeley parser (Petrov, 2010). Once parsed, we can extract syntactic information for individual (word, tag) pairs.
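A minimal sketch of how per-noun statistics can be read off such parsed noun phrases (the tuple extraction and corpus handling are assumed; this is not the paper's code):

```python
from collections import Counter

# Toy sketch: estimate p(modifier | head noun) and p(determiner | head noun)
# from (determiner, [adjectives], head noun) tuples taken from parsed NPs.
# None stands for the absent determiner.

def estimate_np_models(noun_phrases):
    mod_counts, det_counts, noun_counts = Counter(), Counter(), Counter()
    for det, adjs, noun in noun_phrases:
        noun_counts[noun] += 1
        det_counts[(det, noun)] += 1
        for adj in adjs:
            mod_counts[(adj, noun)] += 1
    p_mod = lambda adj, noun: mod_counts[(adj, noun)] / noun_counts[noun]
    p_det = lambda det, noun: det_counts[(det, noun)] / noun_counts[noun]
    return p_mod, p_det

nps = [("a", ["blue"], "sky"), (None, ["clear"], "sky"), ("the", [], "road")]
p_mod, p_det = estimate_np_models(nps)
print(p_mod("blue", "sky"))  # → 0.5
```

Conditioning only on the head noun, as here, mirrors the open-class-word restriction described next.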
We compute the probabilities for different prenominal modifiers (shiny, clear, glowing, ...) and determiners (a/an, the, None, ...) given a head noun in a noun phrase (NP), as well as the probabilities for each head noun in larger constructions, listed in Section 4.3. Probabilities are conditioned only on open-class words, specifically nouns and verbs. This means that a closed-class word (such as a preposition) is never used to generate an open-class word.

In addition to co-occurrence statistics, the parsed Flickr data adds to our understanding of the basic characteristics of visually descriptive text. Using WordNet (Miller, 1995) to automatically determine whether a head noun is a physical object or not, we find that 92% of the sentences have no more than 3 physical objects. This informs generation by placing a cap on how many objects are mentioned in each descriptive sentence: when more than 3 objects are detected, the system splits the description over several sentences. We also find that many of the descriptions are not sentences (tagged as S, 58% of the data) but quite commonly noun phrases (tagged as NP, 28% of the data), and expect that the number of noun phrases that form descriptions will be much higher with domain adaptation. This also informs generation, and the system is capable of generating both sentences (contains a main verb) and noun phrases (no main verb) in the final image description. We use the term 'sentence' in the rest of this paper to refer to both kinds of complex phrases.

4 Generation

Following Penn Treebank parsing guidelines (Marcus et al., 1995), the relationship between two head nouns in a sentence can usually be characterized as one of the following:

1. prepositional (a boy on the table)
2. verbal (a boy cleans the table)
3. verb with preposition (a boy sits on the table)
4. verb with particle (a boy cleans up the table)
5. verb with S or SBAR complement (a boy sees that the table is clean)

The generation system focuses on the first three kinds of relationships, which capture a wide range of utterances. The process of generation is approached as a problem of generating a semantically and syntactically well-formed tree based on object nouns. These serve as head noun anchors in a lexicalized syntactic derivation process that we call tree growth.

Vision detections are associated to a {tag word} pair, and the model fleshes out the tree details around head noun anchors by utilizing syntactic dependencies between words learned from the Flickr data discussed in Section 3. The analogy of growing a tree is quite appropriate here, where nouns are bundles of constraints akin to seeds, giving rise to the rest of the tree based on the lexicalized subtrees in which the nouns are likely to occur. An example generated tree structure is shown in Figure 6, with noun anchors in bold.

(NP (NP (NP (DT -) (NN people)) (PP (IN with) (NP (DT a) (NN bottle)))) (PP (IN at) (NP (DT the) (NN table))))

Figure 6: Tree generated from the tree growth process.

Midge was developed using detections run on Flickr images, incorporating action/pose detections for verbs as well as object detections for nouns. In testing, we generate descriptions for the PASCAL images, which have been used in earlier work on the vision-to-language connection (Kulkarni et al., 2011; Yang et al., 2011) and allow us to compare systems directly. Action and pose detection for this data set still does not work well, and so the system does not receive these detections from the vision front-end. However, the system can still generate verbs when action and pose detectors have been run, and this framework allows the system to "hallucinate" likely verbal constructions between objects if specified at run-time. A similar approach was taken in Yang et al. (2011). Some examples are given in Figure 7.

We follow a three-tiered generation process (Reiter and Dale, 2000), utilizing content determination to first cluster and order the object nouns, create their local subtrees, and filter incorrect detections; microplanning to construct full syntactic trees around the noun clusters; and surface realization to order selected modifiers, realize them as postnominal or prenominal, and select final outputs. The system follows an overgenerate-and-select approach (Langkilde and Knight, 1998), which allows different final trees to be selected with different settings.

4.1 Knowledge Base

Midge uses a knowledge base that stores models for different tasks during generation. These models are primarily data-driven, but we also include a hand-built component to handle a small set of rules. The data-driven component provides the syntactically informed word co-occurrence statistics learned from the Flickr data, a model for ordering the selected nouns in a sentence, and a model to change computer vision attributes to attribute:value pairs. Below, we discuss the three main data-driven models within the generation pipeline. The hand-built component contains plural forms of singular nouns, the list of possible spatial relations shown in Table 3, and a mapping between attribute values and modifier surface forms (e.g., a green detection for person is to be realized as the postnominal modifier in green).

4.2 Content Determination

4.2.1 Step 1: Group the Nouns

An initial set of object detections must first be split into clusters that give rise to different sentences. If more than 3 objects are detected in the image, the system begins splitting these into different noun groups. In future work, we aim to compare principled approaches to this task, e.g., using mutual information to cluster similar nouns together. The current system randomizes which nouns appear in the same group.

4.2.2 Step 2: Order the Nouns

Each group of nouns is then ordered to determine when they are mentioned in a sentence. Because the system generates declarative sentences, this automatically determines the subject and objects. This is a novel contribution for a general problem in NLG, and initial evaluation (Section 5) suggests it works reasonably well.

To build the nominal ordering model, we use WordNet to associate all head nouns in the Flickr data to all of their hypernyms. A description is represented as an ordered set [a1 ... an] where each ap is a noun with position p in the set of head nouns in the sentence. For the position pi of each hypernym ha in each sentence with n head nouns, we estimate p(pi | n, ha).

Unordered → Ordered
bottle, table, person → person, bottle, table
road, sky, cow → cow, road, sky

Figure 8: Example nominal orderings.
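These position estimates can drive ordering as in the following toy sketch; the probability function and hypernym sets are invented stand-ins for the estimates gathered from the Flickr data.

```python
# Toy sketch of greedy nominal ordering: fill sentence positions 0..n-1,
# each time choosing the noun whose hypernyms make that position most
# probable. p_position stands in for the learned p(p_i | n, h_a).

def order_nouns(nouns, hypernyms, p_position):
    n = len(nouns)
    remaining, ordered = list(nouns), []
    for pos in range(n):
        best = max(remaining,
                   key=lambda w: max((p_position(pos, n, h)
                                      for h in hypernyms[w]), default=0.0))
        remaining.remove(best)
        ordered.append(best)
    return ordered

def p_position(pos, n, hyp):
    # toy model: animate nouns strongly prefer the first position
    if hyp == "animate":
        return 0.9 if pos == 0 else 0.05
    return 1.0 / n

hyps = {"person": {"animate"}, "bottle": {"artifact"}, "table": {"artifact"}}
print(order_nouns(["bottle", "table", "person"], hyps, p_position))
# → ['person', 'bottle', 'table']
```

Even this toy model reproduces the first ordering in Figure 8, placing the animate noun first.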
During generation, the system greedily maximizes p(pi | n, ha) until all nouns have been ordered. Example orderings are shown in Figure 8. This model automatically places animate objects near the beginning of a sentence, which follows psycholinguistic work in object naming (Branigan et al., 2007).

A person sitting on a sofa · Cows grazing · Airplanes flying · A person walking a dog

Figure 7: Hallucinating: creating likely actions. Straightforward to do, but can often be wrong.

4.2.3 Step 3: Filter Incorrect Attributes

For the system to be able to extend coverage as new computer vision attribute detections become available, we develop a method to automatically group adjectives into broader attribute classes,³ and the generation system uses these classes when deciding how to describe objects. To group adjectives, we use a bootstrapping technique (Kozareva et al., 2008) that learns which adjectives tend to co-occur, and groups these together to form an attribute class. Co-occurrence is computed using cosine (distributional) similarity between adjectives, considering adjacent nouns as context (i.e., JJ NN constructions). Contexts (nouns) for adjectives are weighted using Pointwise Mutual Information, and only the top 1000 nouns are selected for every adjective. Some of the learned attribute classes are given in Table 2.

³ What in computer vision are called attributes are called values in NLG. A value like red belongs to a COLOR attribute, and we use this distinction in the system.

COLOR: purple, blue, green, red, white ...
MATERIAL: plastic, wooden, silver ...
SURFACE: furry, fluffy, hard, soft ...
QUALITY: shiny, rust, dirty, broken ...

Table 2: Example attribute classes and values.

In the Flickr corpus, we find that each attribute (COLOR, SIZE, etc.) rarely has more than a single value in the final description, with the most common (COLOR) co-occurring less than 2% of the time. Midge enforces this idea to select the most likely word v for each attribute from the detections. In a noun phrase headed by an object noun, NP{NN noun}, the prenominal adjective (JJ v) for each attribute is selected using maximum likelihood.

4.2.4 Step 4: Group Plurals

How to generate natural-sounding spatial relations and modifiers for a set of objects, as opposed to a single object, is still an open problem (Funakoshi et al., 2004; Gatt, 2006). In this work, we use a simple method to group all same-type objects together, associate them to the plural form listed in the KB, discard the modifiers, and return spatial relations based on the first recognized member of the group.

4.2.5 Step 5: Gather Local Subtrees Around Object Nouns

Possible actions/poses and spatial relationships between object nouns, represented by verbs and prepositions, are selected using the subtree frames listed in Figure 9. Each head noun selects for its likely local subtrees, some of which are not fully formed until the Microplanning stage. As an example of how this process works, see Figure 10, which illustrates the combination of Trees 4 and 5. For simplicity, we do not include the selection of further subtrees. The subject noun duck selects for prepositional phrases headed by different prepositions, and the object noun grass selects for prepositions that head the prepositional phrase in which it is embedded. Full PP subtrees are created during Microplanning by taking the intersection of both.

1: NP → DT{0,1}↓ JJ*↓ NNn
2: S → NP{NNn} VP{VBZ}↓
3: NP → NP{NNn} VP{VB(G|N)}↓
4: NP → NP{NNn} PP{IN}↓
5: PP → IN↓ NP{NNn}
6: VP → VB(G|N|Z)↓ PP{IN}↓
7: VP → VB(G|N|Z)↓ NP{NNn}

Figure 9: Initial subtree frames for generation, present-tense declarative phrases. ↓ marks a substitution site, * marks that ≥ 0 sister nodes of this type are permitted, {0,1} marks that this node can be included or excluded. Input: set of ordered nouns. Output: trees preserving nominal ordering.
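Returning to Step 3, the per-class attribute selection can be sketched as below. This is a hedged illustration: the class table, likelihood values, and the reuse of a likelihood cutoff are assumptions, not the paper's implementation.

```python
# Illustrative sketch of Step 3 (not the paper's code): keep at most one
# detected adjective per attribute class, chosen by corpus likelihood.

ATTRIBUTE_CLASS = {"black": "COLOR", "red": "COLOR",
                   "wooden": "MATERIAL", "plastic": "MATERIAL"}

def filter_attributes(noun, detected, p_mod, alpha=0.01):
    """detected: adjective -> detector score; p_mod(adj, noun): likelihood
    of adj modifying noun in the training text."""
    best = {}
    for adj in detected:
        cls = ATTRIBUTE_CLASS.get(adj)
        if cls is None or p_mod(adj, noun) <= alpha:
            continue  # unknown class or not corpus-supported: filter out
        if cls not in best or p_mod(adj, noun) > p_mod(best[cls], noun):
            best[cls] = adj
    return best

corpus = {("black", "bus"): 0.2, ("red", "bus"): 0.1}
p_mod = lambda adj, noun: corpus.get((adj, noun), 0.0)
print(filter_attributes("bus", {"black": 0.87, "red": 0.24, "wooden": 0.72}, p_mod))
# → {'COLOR': 'black'}
```

Note how the wooden detection for bus (cf. Figure 2) is dropped because the corpus gives it no support, while only one COLOR value survives.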
The leftmost noun in the sequence is given a rightward directionality constraint, placing it as the subject of the sentence, and so it will only select for trees that expand to the right. The rightmost noun is given a leftward directionality constraint, placing it as an object, and so it will only select for trees that expand to its left. The noun in the middle, if there is one, selects for all its local subtrees, combining first with a noun to its right or to its left. We now walk through the derivation process for each of the listed subtree frames. Because we are following an overgenerate-and-select approach, all combinations above a probability threshold α and an observation cutoff γ are created.

Tree 1: Collect all NP → (DT det) (JJ adj)* (NN noun) and NP → (JJ adj)* (NN noun) subtrees, where:

• p((JJ adj) | (NN noun)) > α for each adj
• p((DT det) | JJ, (NN noun)) > α, and the probability of a determiner for the head noun is higher than the probability of no determiner.

Any number of adjectives (including none) may be generated, and we include the presence or absence of an adjective when calculating which determiner to include. The reasoning behind the generation of these subtrees is to automatically learn whether to treat a given noun as a mass or count noun (not taking a determiner or taking a determiner, respectively) or as a given or new noun (phrases like a sky sound unnatural because sky is given knowledge, requiring the definite article the). The selection of determiner is not independent of the selection of adjective; a sky may sound unnatural, but a blue sky is fine. These trees take the dependency between determiner and adjective into account.

Trees 2 and 3: Collect beginnings of VP subtrees headed by (VBZ verb), (VBG verb), and (VBN verb), notated here as VP{VBX verb}, where:

• p(VP{VBX verb} | NP{NN noun}=SUBJ) > α

Tree 4: Collect beginnings of PP subtrees headed by (IN prep), where:

• p(PP{IN prep} | NP{NN noun}=SUBJ) > α

Tree 5: Collect PP subtrees headed by (IN prep) with NP complements (OBJ) headed by (NN noun), where:

• p(PP{IN prep} | NP{NN noun}=OBJ) > α

Tree 6: Collect VP subtrees headed by (VBX verb) with embedded PP complements, where:

• p(PP{IN prep} | VP{VBX verb}=SUBJ) > α

Tree 7: Collect VP subtrees headed by (VBX verb) with embedded NP objects, where:

• p(VP{VBX verb} | NP{NN noun}=OBJ) > α

a over b → a above b; b below a; b beneath a; b under a; b underneath a; a upon b; a on b; a by b; b by a
a by b → a against b; b against a; a around b; b around a; a at b; b at a; a beside b; b beside a; a by b; b by a; a near b; b near a; a with b; b with a
a in b → a in b; b outside a; a within b; a by b; b by a

Table 3: Possible prepositions from bounding boxes.

Subtree frames:
NP → NP{NN n1} PP{IN}↓        PP → IN↓ NP{NN n2}
Generated subtrees:
(NP (NP (NN duck)) (PP (IN on, by, over) ...))        (PP (IN above, on, by) (NP (NN grass)))
Combined trees:
(NP (NP (NN duck)) (PP (IN on) (NP (NN grass))))
(NP (NP (NN duck)) (PP (IN by) (NP (NN grass))))

Figure 10: Example derivation.

4.3 Microplanning

4.3.1 Step 6: Create Full Trees

In Microplanning, full trees are created by taking the intersection of the subtrees created in Content Determination. Because the nouns are ordered, it is straightforward to combine the subtrees surrounding a noun in position 1 with subtrees surrounding a noun in position 2. Two further trees are necessary to allow the gathered subtrees to combine within the Penn Treebank syntax. These are given in Figure 11. If two nouns in a proposed sentence cannot be combined with prepositions or verbs, we back off to combine them using (CC and).

NP → NP↓ (CC and) NP↓        VP → VP*↓

Figure 11: Auxiliary trees for generation.

Stepping through this process, all nouns will have a set of subtrees selected by Tree 1. Prepositional relationships between nouns are created by substituting Tree 1 subtrees into the NP nodes of Trees 4 and 5, as shown in Figure 10. Verbal relationships between nouns are created by substituting Tree 1 subtrees into Trees 2, 3, and 7. Verb with preposition relationships are created between nouns by substituting the VBX node in Tree 6 with the corresponding node in Trees 2 and 3 to grow the tree to the right, and the PP node in Tree 6 with the corresponding node in Tree 5 to grow the tree to the left. Generation of a full tree stops when all nouns in a group are dominated by the same node, either an S or NP.
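A toy sketch of the Tree 4/Tree 5 combination illustrated in Figure 10: the subject noun proposes prepositions for its PP modifier, the object noun proposes prepositions that can embed it, and full PPs are built from the intersection. The probability tables are invented stand-ins.

```python
# Toy sketch of combining Tree 4 and Tree 5 subtrees (cf. Figure 10).

def combine_pp(subject, obj, p_subj_prep, p_obj_prep, alpha=0.01):
    subj_preps = {p for p, pr in p_subj_prep[subject].items() if pr > alpha}
    obj_preps = {p for p, pr in p_obj_prep[obj].items() if pr > alpha}
    # one bracketed tree per preposition shared by both nouns
    return [f"(NP (NP {subject}) (PP (IN {p}) (NP {obj})))"
            for p in sorted(subj_preps & obj_preps)]

p_subj = {"duck": {"on": 0.20, "by": 0.10, "over": 0.05}}
p_obj = {"grass": {"above": 0.10, "on": 0.30, "by": 0.20}}
print(combine_pp("duck", "grass", p_subj, p_obj))
# → ['(NP (NP duck) (PP (IN by) (NP grass)))', '(NP (NP duck) (PP (IN on) (NP grass)))']
```

With these toy values the intersection keeps exactly the duck on grass and duck by grass trees from Figure 10.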
4.4 Surface Realization

In the surface realization stage, the system selects a single tree from the generated set of possible trees and removes mark-up to produce a final string. This is also the stage where punctuation may be added. Different strings may be generated depending on different specifications from the user, as discussed at the beginning of Section 4 and shown in the online demo. To evaluate the system against other systems, we specify that the system should (1) not hallucinate likely verbs; and (2) return the longest string possible.

4.4.1 Step 7: Get Final Tree, Clear Mark-Up

We explored two methods for selecting a final string. In one method, a trigram language model built using the Europarl (Koehn, 2005) data with start/end symbols returns the highest-scoring description (normalizing for length). In the second method, we limit the generation system to selecting the most likely closed-class words (determiners, prepositions) while building the subtrees, overgenerating all possible adjective combinations. The final string is then the one with the most words. We find that the second method produces descriptions that seem more natural and varied than the n-gram ranking method on our development set, and so use the longest string method in evaluation.

4.4.2 Step 8: Prenominal Modifier Ordering

To order sets of selected adjectives, we use the top-scoring prenominal modifier ordering model discussed in Mitchell et al. (2011). This is an n-gram model constructed over noun phrases that were extracted from an automatically parsed version of the New York Times portion of the Gigaword corpus (Graff and Cieri, 2003). With this in place, blue clear sky becomes clear blue sky, wooden brown table becomes brown wooden table, etc.

5 Evaluation

Each set of sentences is generated with α (likelihood cutoff) set to .01 and γ (observation count cutoff) set to 3. We compare the system against human-written descriptions and two state-of-the-art vision-to-language systems, the Kulkarni et al. (2011) and Yang et al. (2011) systems.

Human judgments were collected using Amazon's Mechanical Turk (Amazon, 2011). We follow recommended practices for evaluating an NLG system (Reiter and Belz, 2009) and for running a study on Mechanical Turk (Callison-Burch and Dredze, 2010), using a balanced design with each subject rating 3 descriptions from each system. Subjects rated their level of agreement on a 5-point Likert scale including a neutral middle position, and since quality ratings are ordinal (points are not necessarily equidistant), we evaluate responses using a non-parametric test. Participants that took less than 3 minutes to answer all 60 questions and did not include a humanlike rating for at least 1 of the 3 human-written descriptions were removed and replaced. It is important to note that this evaluation compares full generation systems; many factors in each system may also influence participants' perception, e.g., sentence length (Napoles et al., 2011) and punctuation decisions.

The systems are evaluated on a set of 840 images evaluated in the original Kulkarni et al. (2011) system. Participants were asked to judge the statements given in Figure 12, from Strongly Disagree to Strongly Agree.

GRAMMATICALITY: This description is grammatically correct.
MAIN ASPECTS: This description describes the main aspects of this image.
CORRECTNESS: This description does not include extraneous or incorrect information.
ORDER: The objects described are mentioned in a reasonable order.
HUMANLIKENESS: It sounds like a person wrote this description.

Figure 12: Mechanical Turk prompts.

                      Grammaticality   Main Aspects    Correctness     Order           Humanlikeness
Human                 4 (3.77, 1.19)   4 (4.09, 0.97)  4 (3.81, 1.11)  4 (3.88, 1.05)  4 (3.88, 0.96)
Midge                 3 (2.95, 1.42)   3 (2.86, 1.35)  3 (2.95, 1.34)  3 (2.92, 1.25)  3 (3.16, 1.17)
Kulkarni et al. 2011  3 (2.83, 1.37)   3 (2.84, 1.33)  3 (2.76, 1.34)  3 (2.78, 1.23)  3 (3.13, 1.23)
Yang et al. 2011      3 (2.95, 1.49)   2 (2.31, 1.30)  2 (2.46, 1.36)  2 (2.53, 1.26)  3 (2.97, 1.23)

Table 4: Median scores for systems, with mean and standard deviation in parentheses. Distance between points on the rating scale cannot be assumed to be equidistant, and so we analyze results using a non-parametric test.

We report the scores for the systems in Table 4. Results are analyzed using the non-parametric Wilcoxon Signed-Rank test, which uses median values to compare the different systems. Midge outperforms all recent automatic approaches on CORRECTNESS and ORDER, and Yang et al. additionally on HUMANLIKENESS and MAIN ASPECTS. Differences between Midge and Kulkarni et al. are significant at p < .01; between Midge and Yang et al. at p < .001. For all metrics, human-written descriptions still outperform automatic approaches (p < .001).

These findings are striking, particularly because Midge uses the same input as the Kulkarni et al. system.

On the computer vision side, incorrect objects are often detected and salient objects are often missed. Midge does not yet screen out unlikely objects or add likely objects, and so provides no filter for this. On the language side, likelihood is estimated directly, and the system primarily uses simple maximum likelihood estimations to combine subtrees. The descriptive corpus that informs the system is not parsed with a domain-adapted parser; with this in place, the syntactic constructions that Midge learns will better reflect the constructions that people use.

In future work, we hope to address these issues as well as advance the syntactic derivation process, providing an adjunction operation (for example, to add likely adjectives or adverbs based on language alone). We would also like to incorporate meta-data: even when no vision detection fires for an image, the system may be able to generate descriptions of the time and place where an image was taken based on the image file alone.

7 Conclusion

We have introduced a generation system that uses a new approach to generating language, tying a syntactic model to computer vision detections. Midge generates a well-formed description of an
Using syntactically informed image by filtering attribute detections that are un- word co-occurrence statistics from a large corpus likely and placing objects into an ordered syntac- of descriptive text improves over state-of-the-art, tic structure. Humans judge Midge’s output to be allowing syntactic trees to be generated that cap- the most natural descriptions of images generated ture the variation of natural language. thus far. The methods described here are promis- 6 Discussion ing for generating natural language descriptions of the visual world, and we hope to expand and Midge automatically generates language that is as refine the system to capture further linguistic phe- good as or better than template-based systems, nomena. tying vision to language at a syntactic/semantic level to produce natural language descriptions. 8 Acknowledgements Results are promising, but, there is more work to Thanks to the Johns Hopkins CLSP summer be done: Evaluators can still tell a difference be- workshop 2011 for making this system possible, tween human-written descriptions and automati- and to reviewers for helpful comments. This cally generated descriptions. work is supported in part by Michael Collins and Improvements to the generated language are by NSF Faculty Early Career Development (CA- possible at both the vision side and the language REER) Award #1054133. 755 References Siming Li, Girish Kulkarni, Tamara L. Berg, Alexan- der C. Berg, and Yejin Choi. 2011. Composing Amazon. 2011. Amazon mechanical turk: Artificial simple image descriptions using web-scale n-grams. artificial intelligence. Proceedings of CoNLL 2011. Holly P. Branigan, Martin J. Pickering, and Mikihiro Mitchell Marcus, Ann Bies, Constance Cooper, Mark Tanaka. 2007. Contributions of animacy to gram- Ferguson, and Alyson Littman. 1995. Treebank II matical function assignment and word order during bracketing guide. production. Lingua, 118(2):172–189. George A. Miller. 1995. 
WordNet: A lexical Thorsten Brants and Alex Franz. 2006. Web 1T 5- database for english. Communications of the ACM, gram version 1. 38(11):39–41. Chris Callison-Burch and Mark Dredze. 2010. Creat- Margaret Mitchell, Aaron Dunlop, and Brian Roark. ing speech and language data with Amazon’s Me- 2011. Semi-supervised modeling for prenomi- chanical Turk. NAACL 2010 Workshop on Creat- nal modifier ordering. Proceedings of the 49th ing Speech and Language Data with Amazon’s Me- ACL:HLT. chanical Turk. Courtney Napoles, Benjamin Van Durme, and Chris Navneet Dalal and Bill Triggs. 2005. Histograms of Callison-Burch. 2011. Evaluating sentence com- oriented gradients for human detections. Proceed- pression: Pitfalls and suggested remedies. ACL- ings of CVPR 2005. HLT Workshop on Monolingual Text-To-Text Gen- Ali Farhadi, Ian Endres, Derek Hoiem, and David eration. Forsyth. 2009. Describing objects by their at- Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. tributes. Proceedings of CVPR 2009. 2011. Im2text: Describing images using 1 million Ali Farhadi, Mohsen Hejrati, Mohammad Amin captioned photographs. Proceedings of NIPS 2011. Sadeghi, Peter Young, Cyrus Rashtchian, Julia Slav Petrov. 2010. Berkeley parser. GNU General Hockenmaier, and David Forsyth. 2010. Every pic- Public License v.2. ture tells a story: generating sentences for images. Cyrus Rashtchian, Peter Young, Micah Hodosh, and Proceedings of ECCV 2010. Julia Hockenmaier. 2010. Collecting image anno- Pedro Felzenszwalb, David McAllester, and Deva Ra- tations using amazon’s mechanical turk. Proceed- maman. 2008. A discriminatively trained, mul- ings of the NAACL HLT 2010 Workshop on Creat- tiscale, deformable part model. Proceedings of ing Speech and Language Data with Amazon’s Me- CVPR 2008. chanical Turk. Flickr. 2011. http://www.flickr.com. Accessed Ehud Reiter and Anja Belz. 2009. An investiga- 1.Sep.11. 
tion into the validity of some metrics for automat- ically evaluating natural language generation sys- Kotaro Funakoshi, Satoru Watanabe, Naoko tems. Computational Linguistics, 35(4):529–558. Kuriyama, and Takenobu Tokunaga. 2004. Ehud Reiter and Robert Dale. 1997. Building ap- Generating referring expressions using perceptual plied natural language generation systems. Journal groups. Proceedings of the 3rd INLG. of Natural Language Engineering, pages 57–87. Albert Gatt. 2006. Generating collective spatial refer- Ehud Reiter and Robert Dale. 2000. Building Natural ences. Proceedings of the 28th CogSci. Language Generation Systems. Cambridge Univer- David Graff and Christopher Cieri. 2003. English Gi- sity Press. gaword. Linguistic Data Consortium, Philadelphia, Yezhou Yang, Ching Lik Teo, Hal Daum´e III, and PA. LDC Catalog No. LDC2003T05. Yiannis Aloimonos. 2011. Corpus-guided sen- Philipp Koehn. 2005. Europarl: A parallel cor- tence generation of natural images. Proceedings of pus for statistical machine translation. MT Summit. EMNLP 2011. http://www.statmt.org/europarl/. Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. Lee, and Song-Chun Zhu. 2010. I2T: Image pars- 2008. Semantic class learning from the web with ing to text description. Proceedings of IEEE 2010, hyponym pattern linkage graphs. Proceedings of 98(8):1485–1508. ACL-08: HLT. Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Sim- ing Li, Yejin Choi, Alexander C. Berg, and Tamara Berg. 2011. Baby talk: Understanding and gener- ating image descriptions. Proceedings of the 24th CVPR. Irene Langkilde and Kevin Knight. 1998. Gener- ation that exploits corpus-based statistical knowl- edge. Proceedings of the 36th ACL. 756 Generation of landmark-based navigation instructions from open-source data Markus Dr¨ager Alexander Koller Dept. of Computational Linguistics Dept. of Linguistics Saarland University University of Potsdam

[email protected] [email protected]

Abstract limited to distance-based route instructions. Even in academic research, there has been remarkably We present a system for the real-time gen- little work on NLG for landmark-based naviga- eration of car navigation instructions with landmarks. Our system relies exclusively tion systems. Some of these systems rely on map on freely available map data from Open- resources that have been hand-crafted for a par- StreetMap, organizes its output to fit into ticular city (Malaka et al., 2004), or on a com- the available time until the next driving ma- bination of multiple complex resources (Raubal neuver, and reacts in real time to driving er- and Winter, 2002), which effectively limits their rors. We show that female users spend sig- coverage. Others, such as Dale et al. (2003), fo- nificantly less time looking away from the cus on non-interactive one-shot instruction dis- road when using our system compared to a courses. However, commercially successful car baseline system. navigation systems continuously monitor whether the driver is following the instructions and pro- 1 Introduction vide modified instructions in real time when nec- Systems that generate route instructions are be- essary. That is, two key problems in designing coming an increasingly interesting application NLG systems for car navigation instructions are area for natural language generation (NLG) sys- the availability of suitable map resources and the tems. Car navigation systems are ubiquitous ability of the NLG system to generate instructions already, and with the increased availability of and react to driving errors in real time. powerful mobile devices, the wide-spread use of In this paper, we explore solutions to both of pedestrian navigation systems is on the horizon. these points. 
We present the Virtual Co-Pilot, One area in which NLG systems could improve a system which generates route instructions for existing navigation systems is in the use of land- car navigation using landmarks that are extracted marks, which would enable them to generate in- from the open-source OpenStreetMap resource.1 structions such as “turn right after the church” in- The system computes a route plan and splits it stead of “after 300 meters”. It has been shown in into episodes that end in driving maneuvers. It human-human studies that landmark-based route then selects landmarks that describe the locations instructions are easier to understand (Lovelace of these driving maneuvers, and aggregates in- et al., 1999) than distance-based ones and re- structions such that they can be presented (via duce driver distraction in in-car settings (Bur- a TTS system) in the time available within the nett, 2000), which is crucial for improved traffic episode. The system monitors the user’s position safety (Stutts et al., 2001). From an NLG per- and computes new, corrective instructions when spective, navigation systems are an obvious ap- the user leaves the intended path. We evaluate plication area for situated generation, for which our system using a driving simulator, and com- there has recently been increasing interest (see pare it to a baseline that is designed to replicate e.g. (Lessmann et al., 2006; Koller et al., 2010; a typical commercial navigation system. The Vir- Striegnitz and Majda, 2009)). tual Co-Pilot performs comparably to the baseline Current commercial navigation systems use 1 only trivial NLG technology, and in particular are http://www.openstreetmap.org/ 757 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 757–766, Avignon, France, April 23 - 27 2012. 
2012 c Association for Computational Linguistics on the number of driving errors and on user sat- is a crucial goal of improved navigation systems, isfaction, and outperforms it significantly on the as driver inattention of various kinds is a lead- time female users spend looking away from the ing cause of traffic accidents (25% of all police- road. To our knowledge, this is the first time that reported car crashes in the US in 2000, according the generation of landmarks has been shown to to Stutts et al. (2001)). Another road-based study significantly improve the instructions of a wide- conducted by May and Ross (2006) yielded simi- coverage navigation system. lar results. Plan of the paper. We start by reviewing ear- One recurring finding in studies on landmarks lier literature on landmarks, route instructions, in navigation is that some user groups are able and the use of NLG for route instructions in Sec- to benefit more from their inclusion than oth- tion 2. We then present the way in which we ers. This is particularly the case for female users. extract information on potential landmarks from While men tend to outperform women in wayfind- OpenStreetMap in Section 3. Section 4 shows ing tasks, completing them faster and with fewer how we generate route instructions, and Section 5 navigation errors (c.f. Allen (2000)), women are presents the evaluation. Section 6 concludes. likely to show improved wayfinding performance when landmark information is given (e.g. Saucier 2 Related Work et al. (2002)). What makes an object in the environment a good Despite all of this evidence from human-human landmark has been the topic of research in vari- studies, there has been remarkably little research ous disciplines, including cognitive science, com- on implemented navigation systems that use land- puter science, and urban planning. Lynch (1960) marks. 
Commercial systems make virtually no defines landmarks as physical entities that serve use of landmark information when giving direc- as external points of reference that stand out from tions, relying on metric representations instead their surroundings. Kaplan (1976) specified a (e.g. “Turn right in one hundred meters”). In aca- landmark as “a known place for which the in- demic research, there have only been a handful of dividual has a well-formed representation”. Al- relevant systems. A notable example is the DEEP though there are different definitions of land- MAP system, which was created in the SmartKom marks, a common theme is that objects are con- project as a mobile tourist information system for sidered landmarks if they have some kind of cog- the city of Heidelberg (Malaka and Zipf, 2000; nitive salience (both in terms of visual distinctive- Malaka et al., 2004). DEEP MAP uses landmarks ness and frequeny of interaction). as waypoints for the planning of touristic routes The usefulness of landmarks in route instruc- for car drivers and pedestrians, while also making tions has been shown in a number of different use of landmark information in the generation of human-human studies. Experimental results from route directions. Raubal and Winter (2002) com- Lovelace et al. (1999) show that people not only bine data from digital city maps, facade images, use landmarks intuitively when giving directions, cultural heritage information, and other sources but they also perceive instructions that are given to to compute landmark descriptions that could be them to be of higher quality when those instruc- used in a pedestrian navigation system for the city tions contain landmark information. Similar find- of Vienna. ings have also been reported by Michon and Denis The key to the richness of these systems is a (2001) and Tom and Denis (2003). set of extensive, manually curated geographic and Regarding car navigation systems specifically, landmark databases. 
However, creation and main- Burnett (2000) reports on a road-based user study tenance of such databases is expensive, which which compared a landmark-based navigation makes it impractical to use these systems outside system to a conventional car navigation system. of the limited environments for which they were Here the provision of landmark information in created. There have been a number of suggestions route directions led to a decrease of navigational for automatically acquiring landmark data from errors. Furthermore, glances at the navigation existing electronic databases, for instance cadas- display were shorter and fewer, which indicates tral data (Elias, 2003) and airborne laser scans less driver distraction in this particular experi- (Brenner and Elias, 2003). But the raw data for mental condition. Minimizing driver distraction these approaches is still hard to obtain; informa- 758 tion about landmarks is mostly limited to geomet- ric data and does not specify the semantic type of a landmark (such as “church”); and updating the landmark database frequently when the real world changes (e.g., a shop closes down) remains an open issue. The closest system in the literature to the re- search we present here is the CORAL system (Dale et al., 2003). CORAL generates a text of driving instructions with landmarks out of the out- Figure 1: A graphical representation of some nodes put of a commercial web-based route planner. Un- and ways in OpenStreetMap. like CORAL, our system relies purely on open- Landmark Type source map data. Also, our system generates driv- Street Furniture stop sign ing instructions in real time (as opposed to a sin- traffic lights gle discourse before the user starts driving) and pedestrian crossing reacts in real time to driving errors. 
Finally, we Visual Landmarks church evaluate our system thoroughly for driving errors, certain video stores user satisfaction, and driver distraction on an ac- certain supermarkets tual driving task, and find a significant improve- gas station ment over the baseline. pubs and bars Figure 2: Landmarks used by the Virtual Co-Pilot. 3 OpenStreetMap A system that generates landmark-based route di- rections requires two kinds of data. First, it must XML format for offline use. plan routes between points in space, and therefore Geographical data in OpenStreetMap is repre- needs data on the road network, i.e. the road seg- sented in terms of nodes and ways. Nodes rep- ments that make up streets along with their con- resent points in space, defined by their latitude nections. Second, the system needs information and longitude. Ways consist of sequences of about the landmarks that are present in the envi- edges between adjacent nodes; we call the in- ronment. This includes geographic information dividual edges segments below. They are used such as position, but also semantic information to represent streets (with curved streets consist- such as the landmark type. ing of multiple straight segments approximating We have argued above that the availability of their shape), but also a variety of other real-world such data has been a major bottleneck in the entities: buildings, rivers, trees, etc. Nodes and development of landmark-based navigation sys- ways can both be enriched with further infor- tems. In the Virtual Co-Pilot system, which mation by attaching tags. Tags encode a wide we present below, we solve this problem by us- range of additional information using a predefined ing data from OpenStreetMap, an on-line map type ontology. Among other things, they specify resource that provides both types of informa- the types of buildings (church, cafe, supermarket, tion mentioned above, in a unified data struc- etc.); where a shop or restaurant has a name, it too ture. 
The OpenStreetMap project is to maps what is specified in a tag. Fig. 1 is a graphical represen- Wikipedia is to encyclopedias: It is a map of tation of some OpenStreetMap data, consisting of the entire world which can be edited by anyone nodes and ways for two streets (with two and five wishing to participate. New map data is usually segments) and a building which has been tagged added by volunteers who measure streets using as a gas station. GPS devices and annotate them via a Web inter- For the Virtual Co-Pilot system, we have cho- face. The decentralized nature of the data entry sen a set of concrete landmark types that we con- process means that when the world changes, the sider useful (Fig. 2). We operationalize the crite- map will be updated quickly. Existing map data ria for good landmarks sketched in Section 2 by can be viewed as a zoomable map on the Open- requiring that a landmark should be easily visible, StreetMap website, or it can be downloaded in an and that it should be generic in that it is appli- 759 cable not just for one particular city, but for any place for which OpenStreetMap data is available. We end up with two classes of landmark types: street furniture and visual landmarks. Street fur- niture is a generic term for objects that are in- stalled on streets. In this subset, we include stop signs, traffic lights, and pedestrian crossings. Our assumption is that these objects inherently pos- sess a high salience, since they already require particular attention from the driver. “Visual land- marks” encompass roadside buildings that are not directly connected to the road infrastructure, but draw the driver’s attention due to visual salience. Churches are an obvious member of this group; in Figure 3: Schematic representation of an episode addition, we include gas stations, pubs, and bars, (dashed red line), with sample trigger positions of pre- as well as certain supermarket and video store view, turn instruction, and confirmation messages. 
chains (selected for wide distribution over differ- ent cities and recognizable, colorful signs). Given a certain location at which the Virtual Co-Pilot is to be used, we automatically extract most interesting. Our system avoids the genera- suitable landmarks along with their types and lo- tion of metric distance indicators, as in “turn left cations from OpenStreetMap. We also gather in 100 meters”. Instead, it tries to find landmarks the road network information that is required that describe the position of the decision point: for route planning, and collect informations on “Prepare to turn left after the church.” When no streets, such as their names, from the tags. We landmark is available, the system tries to use street then transform this information into a directed intersections as secondary landmarks, as in “Turn street graph. The nodes of this graph are the right at the next/second/third intersection.” Metric OpenStreetMap nodes that are part of streets; two distances are only used when both of these strate- adjacent nodes are connected by a single directed gies fail. edge for segments of one-way streets and a di- In-car NLG takes place in a heavily real-time rected edge in each direction for ordinary street setting, in which an utterance becomes uninter- segments. Each edge is weighted with the Eu- pretable or even misleading if it is given too late. clidean distance between the two nodes. This problem is exacerbated for NLG of speech 4 Generation of route directions because simply speaking the utterance takes time as well. One consequence that our system ad- We will now describe how the Virtual Co-Pilot dresses is the problem of planning preview mes- generates route directions from OpenStreetMap sages in such a way that they can be spoken be- data. The system generates three types of mes- fore the decision point without overlapping each sages (see Fig. 3). First, at every decision point, other. We handle this problem in the sentence i.e. 
at the intersection where a driving maneu- planner, which may aggregate utterances to fit ver such as turning left or right is required, the into the available time. A second problem is that user is told to turn immediately in the given di- the user’s reactions to the generated utterances are rection (“now turn right”). Second, if the driver unpredictable; if the driver takes a wrong turn, the has followed an instruction correctly, we gener- system must generate updated instructions in real ate a confirmation message after the driver has time. made the turn, letting them know they are still on the right track. Finally, we generate preview Below, we describe the individual components messages on the street leading up to the decision of the system. We mostly follow a standard NLG point. These preview messages describe the loca- pipeline (Reiter and Dale, 2000), with a focus on tion of the next driving maneuver. the sentence planner and an extension to interac- Of the three types, preview messages are the tive real-time NLG. 760 Segment123 when the road makes a sharp turn where a minor From: Node1 street forks off. To handle this case, we introduce To: Node2 On: “Main Street” decision points at nodes with multiple adjacent segments if the angle between the incoming and Segment124 outgoing segment of the street exceeds a certain From: Node2 threshold. Conversely, our heuristic will some- To: Node3 On: “Main Street” times end an episode where no driving maneuver is necessary, e.g. when an ongoing street changes Segment125 its name. This is unproblematic in practice; the From: Node3 system will simply generate an instruction to keep To: Node4 driving straight ahead. Fig. 3 shows a graphical On: “Park Street” representation of an episode, with the street seg- Segment126 ments belonging to it drawn as red dashed lines. 
From: Node4 To: Node5 4.2 Aggregation On: “Park Street” Because we generate spoken instructions that are Figure 4: A simple example of a route plan consisting given to the user while they are driving, the timing of four street segments. of the instructions becomes a crucial issue, espe- cially because a driver moves faster than the user of a pedestrian navigation system. It is undesir- 4.1 Content determination and text planning able for a second instruction to interrupt an ear- The first step in our system is to obtain a plan for lier one. On the other hand, the second instruc- reaching the destination. To this end, we com- tion cannot be delayed because this might make pute a shortest path on the directed street graph the user miss a turn or interpret the instruction in- described in Section 3. The result is an ordered correctly. list of street segments that need to be traversed in We must therefore control at which points in- the given order to successfully reach the destina- structions are given and make sure that they do tion; see Fig. 4 for an example. not overlap. We do this by always presenting pre- To be suitable as the input for an NLG system, view messages at trigger positions at certain fixed this flat list of OpenStreetMap nodes needs to be distances from the decision point. The sentence subdivided into smaller message chunks. In turn- planner calculates where these trigger positions by-turn navigation, the general delimiter between are located for each episode. In this way, we cre- such chunks are the driving maneuvers that the ate time frames during which there is enough time driver must execute at each decision point. We for instructions to be presented. call each span between two decision points an However, some episodes are too short to ac- episode. Episodes are not explicitly represented commodate the three trigger positions for the con- in the original route plan: although every segment firmation message and the two preview messages. 
has a street name associated with it, the name of In such episodes, we aggregate different mes- a street sometimes changes as we go along, and sages. We remove the trigger positions for the two because chains of segments are used to model preview messages from the episode, and instead curved streets in OpenStreetMap, even segments add the first preview message to the turn instruc- that are joined at an angle may be parts of the tion message of the previous episode. This allows same street. Thus, in Fig. 4 it is not apparent our system to generate instructions like “Now turn which segment traversals require any navigational right, and then turn left after the church.” maneuvers. We identify episode boundaries with the fol- 4.3 Generation of landmark descriptions lowing heuristic. We first assume that episode The Virtual Co-Pilot computes referring expres- boundaries occur when the street name changes sions to decision points by selecting appropriate from one segment to the next. However, stay- landmarks. To this end, it first looks up landmark ing on the road may involve a driving maneu- candidates within a given range of the decision ver (and therefore a decision point) as well, e.g. point from the database created in Section 3. This 761 yields an initial list of landmark candidates. Preview message p1 : Trigger position: Node3 − 50m Some of these landmark candidates may be un- Turn direction: right suitable for the given situation because of lack of Landmark: church uniqueness. If there are several visual landmarks Preposition: after of the same type along the course of an episode, all of these landmark candidates are removed. 
For Preview message p2 = p1 , except: Trigger position: Node3 − 100m episodes which contain multiple street furniture landmarks of the same type, the first three in each Turn instruction t1 : episode are retained; a referring expression for the Trigger position: Node3 decision point might then be “at the second traf- Turn direction: right fic light”. If the decision point is no more than Confirmation message c1 : three intersections away, we also add a landmark Trigger position: Node3 + 50m description of the form “at the third intersection”. Furthermore, a landmark must be visible from the Figure 5: Semantic representations of the different last segment of the current episode; we only retain types of instructions in one episode. a candidate if it is either adjacent to a segment of the current episode or if it is close to the end point “Turn direction preposition landmark”). of the very last segment of the episode. Among the landmarks that are left over, the system prefers 4.4 Interactive generation visual landmarks over street furniture, and street As a final point, the NLG process of a car naviga- furniture over intersections. If no landmark candi- tion system takes place in an interactive setting: dates are left over, the system falls back to metric as the system generates and utters instructions, the distances. user may either follow them correctly, or they may Second, the Virtual Co-Pilot determines the miss a turn or turn incorrectly because they mis- spatial relationship between the landmark and the understood the instruction or were forced to disre- decision point so that an appropriate preposition gard it by the traffic situation. The system must be can be used in the referring expression. If the de- able to detect such problems, recover from them, cision point occurs before the landmark along the and generate new instructions in real time. 
Second, the Virtual Co-Pilot determines the spatial relationship between the landmark and the decision point so that an appropriate preposition can be used in the referring expression. If the decision point occurs before the landmark along the course of the episode, we use the preposition "in front of"; otherwise, we use "after". Intersections are always used with "at" and metric distances with "in".

Finally, the system decides how to refer to the landmark objects themselves. Although it has access to the names of all objects from the OpenStreetMap data, the user may not know these names. We therefore refer to churches, gas stations, and any street furniture simply as "the church", "the gas station", etc. For supermarkets and bars, we assume that these buildings are more saliently referred to by their names, which are used in everyday language, and therefore use the names to refer to them.
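The preposition and naming decisions can be condensed into a small sketch (again illustrative only; the type inventory and the landmark dict shape are assumptions, and the example names are hypothetical):

```python
def choose_preposition(landmark_type, decision_point_pos, landmark_pos):
    """Mirror of the rules above: 'at' for intersections, 'in' for metric
    distances, otherwise 'in front of'/'after' depending on whether the
    decision point comes before or after the landmark along the episode."""
    if landmark_type == "intersection":
        return "at"
    if landmark_type == "metric_distance":
        return "in"
    return "in front of" if decision_point_pos < landmark_pos else "after"

# Assumed inventory: supermarkets and bars are referred to by name,
# everything else by type ("the church", "the gas station").
REFER_BY_NAME = {"supermarket", "bar"}

def refer_to(landmark):
    if landmark["subtype"] in REFER_BY_NAME:
        return landmark["name"]  # e.g. a well-known store name
    return "the " + landmark["subtype"].replace("_", " ")
```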
The result of the sentence planning stage is a list of semantic representations, specifying the individual instructions that are to be uttered in each episode; an example is shown in Fig. 5. For each type of instruction, we then use a sentence template to generate linguistic surface forms by inserting the information contained in those plans into the slots provided by the templates (e.g. "Turn direction preposition landmark").

4.4 Interactive generation

As a final point, the NLG process of a car navigation system takes place in an interactive setting: as the system generates and utters instructions, the user may either follow them correctly, or they may miss a turn or turn incorrectly because they misunderstood the instruction or were forced to disregard it by the traffic situation. The system must be able to detect such problems, recover from them, and generate new instructions in real time.

Our system receives a continuous stream of information about the position and direction of the user. It performs execution monitoring to check whether the user is still following the intended route. If a trigger position is reached, we present the instruction that we have generated for this position. If the user has left the route, the system reacts by planning a new route starting from the user's current position and generating a new set of instructions.

We check whether the user is following the intended route in the following way. The system keeps track of the current episode of the route plan, and monitors the distance of the car to the final node of the episode. While the user is following the route correctly, the distance between the car and the final node should decrease or at least stay the same between two measurements. To accommodate occasional deviations from the middle of the road, we allow five subsequent measurements to increase the distance; the sixth increase of the distance triggers a recomputation of the route plan and a freshly generated instruction. On the other hand, when the distance of the car to the final node falls below a certain threshold, we assume that the end of the episode has been reached, and activate the next episode. By monitoring whether the user is now approaching the final node of this new episode, we can in particular detect wrong turns at intersections.

Because each instruction carries the risk that it may not be followed correctly, there is a question as to whether it is worth planning out all remaining instructions for the complete route plan. After all, if the user does not follow the first instruction, the computation of all remaining instructions was a waste of time. We decided to compute all future instructions anyway because the aggregation procedure described above requires them. In practice, the NLG process is so efficient that all instructions can be computed in real time, but this decision would have to be revisited for a slower system.
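The execution-monitoring rules above (five tolerated distance increases, replanning on the sixth, threshold-based episode completion) can be sketched as follows. This is a simplified reconstruction, not the authors' code; the threshold value is an assumption, since the paper does not state it:

```python
class ExecutionMonitor:
    """Tracks the distance of the car to the final node of the current
    episode, one measurement per update."""
    MAX_INCREASES = 5       # tolerated consecutive distance increases
    DONE_THRESHOLD = 10.0   # metres; assumed value, not from the paper

    def __init__(self, final_node_dist):
        self.prev = final_node_dist
        self.increases = 0

    def update(self, dist):
        """Return 'replan', 'next_episode', or 'ok' for one measurement."""
        if dist <= self.prev:
            self.increases = 0          # approaching: reset the counter
        else:
            self.increases += 1
            if self.increases > self.MAX_INCREASES:
                return "replan"         # sixth increase triggers replanning
        self.prev = dist
        if dist < self.DONE_THRESHOLD:
            return "next_episode"       # end of episode reached
        return "ok"
```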
5 Evaluation

We will now report on an experiment in which we evaluated the performance of the Virtual Co-Pilot.

5.1 Experimental Method

5.1.1 Subjects

In total, 12 participants were recruited through printed ads and mailing lists. All of them were university students aged between 21 and 27 years. Our experiment was balanced for gender, hence we recruited 6 male and 6 female participants. All participants were compensated for their effort.

5.1.2 Design

The driving simulator used in the experiment replicates a real-world city center using a 3D model that contains buildings and streets as they can be perceived in reality. The street layout of the 3D model used by the driving simulator is based on OpenStreetMap data, and buildings were added to the virtual environment based on cadastral data. To increase the perceived realism of the model, some buildings were manually enhanced with photographic images of their real-world counterparts (see Fig. 7).

Figure 6 shows the set-up of the evaluation experiment. The virtual driving simulator environment (main picture in Fig. 7) was presented to the participants on a 20" computer screen (A). In addition, graphical navigation instructions (shown in the lower right of Fig. 7) were displayed on a separate 7" monitor (B). The driving simulator was controlled by means of a steering wheel (C), along with a pair of brake and acceleration pedals. We recorded user eye movements using a Tobii IS-Z1 table-mounted eye tracker (D). The generated instructions were converted to speech using MARY, an open-source text-to-speech system (Schröder and Trouvain, 2003), and played back on loudspeakers.

Figure 6: Experiment setup. A) Main screen, B) Navigation screen, C) Steering wheel, D) Eye tracker.

The task of the user was to drive the car in the virtual environment towards a given destination; spoken instructions were presented to them as they were driving, in real time. Using the steering wheel and the pedals, users had full control over steering angles, acceleration and braking. The driving speed was limited to 30 km/h, but there were no restrictions otherwise. The driving simulator sent the NLG system a message with the current position of the car (as GPS coordinates) once per second.

Each user was asked to drive three short routes in the driving simulator. Each route took about four minutes to complete, and the travelled distance was about 1 km. The number of episodes per route ranged from three to five. Landmark candidates were sufficiently dense that the Virtual Co-Pilot used landmarks to refer to all decision points and never had to fall back to the metric distance strategy.

There were three experimental conditions, which differed with respect to the spoken route instructions and the use of the navigation screen. In the baseline condition, designed to replicate the behavior of an off-the-shelf commercial car navigation system, participants were provided with spoken metric distance-to-turn navigation instructions. The navigation screen showed arrows depicting the direction of the next turn, along with the distance to the decision point (cf. Fig. 7). The second condition replaced the spoken route instructions by those generated by the Virtual Co-Pilot. In a third condition, the output of the navigation screen was further changed to display an icon for the next landmark along with the arrow and distance indicator. The three routes were presented to the users in different orders, and combined with the conditions in a Latin Squares design. In this paper, we focus on the first and second condition, in order to contrast the two styles of spoken instruction.

Participants were asked to answer two questionnaires after each trial run. The first was the DALI questionnaire (Pauzié, 2008), which asks subjects to report how they perceived different aspects of their cognitive workload (general, visual, auditive and temporal workload, as well as perceived stress level). In the second questionnaire, participants were asked to state their agreement with a number of statements about their subjective impression of the system on a 5-point unlabelled Likert scale, e.g. whether they had received instructions at the right time or whether they trusted the navigation system to give them the right instructions during trials.

Figure 7: Screenshot of a scene in the driving simulator. Lower right corner: matching screenshot of the navigation display.

5.2 Results

There were no significant differences between the Virtual Co-Pilot and the baseline system on task completion time, rate of driving errors, or any of the questions of the DALI questionnaire. Driving errors in particular were very rare: there were only four driving errors in total, two of which were due to problems with left/right coordination.

We then analyzed the gaze data collected by the table-mounted eye tracker, which we set up such that it recognized glances at the navigation screen. In particular, we looked at the total fixation duration (TFD), i.e. the total amount of time that a user spent looking at the navigation screen during a given trial run. We also looked at the total fixation count (TFC), i.e. the total number of times that a user looked at the navigation screen in each run. Mean values for both metrics are given in Fig. 8, averaged over all subjects, and over male and female subjects separately; the "VCP" column is for the Virtual Co-Pilot, whereas "B" stands for the baseline.

                                                     All Users     Males        Females
                                                     B     VCP     B     VCP    B     VCP
Total Fixation Duration (seconds)                    4.9   3.5     2.7   4.1    7.0   2.9*
Total Fixation Count (N)                             21.8  15.4    13.5  16.5   30.0  14.3*
"The system provided the right amount of
 information at any time"                            3.9   2.9     4.2*  3.3    3.5   2.5
"I was insecure at times about still being
 on the right track."                                2.3   3.2     1.9*  2.8    2.6   3.5
"It was important to have a visual
 representation of route directions"                 4.3   4.0     4.2   4.2    4.3   3.7
"I could trust the navigation system"                3.6   3.7     4.1   3.7    3.0   3.7

Figure 8: Mean values for gaze behavior and subjective evaluation, separated by user group and condition (B = baseline, VCP = our system). Significant differences are indicated by *; better values are printed in boldface.

We found that male users tended to look more at the navigation screen in the VCP condition than in B, although the difference is not statistically significant. However, female users looked at the navigation screen significantly fewer times (t(5) = 3.2, p < 0.05, t-test for dependent samples) and for significantly shorter amounts of time (t(5) = 3.2, p < 0.05) in the VCP condition than in B.

On the subjective questionnaire, most questions yielded no significant differences (and are not reported here). However, we found that female users tended to rate the Virtual Co-Pilot more positively than the baseline on questions concerning trust in the system and the need for the navigation screen (but not significantly). Male users found that the baseline significantly outperformed the Virtual Co-Pilot on presenting instructions at the right time (t(5) = 2.7, p < 0.05) and on giving them a sense of security in still being on the right track (t(5) = -2.7, p < 0.05).

5.3 Discussion

The most striking result of the evaluation is that there was a significant reduction of looks to the navigation display, even if only for one group of users.
Female users looked at the navigation screen for less time and less frequently with the Virtual Co-Pilot than with the baseline system. In a real car navigation system, this translates into a driver who spends less time looking away from the road, i.e. a reduction in driver distraction and an increase in traffic safety. This suggests that female users learned to trust the landmark-based instructions, an interpretation that is further supported by the trends we found in the subjective questionnaire.

We did not find these differences in the male user group. Part of the reason may be the known gender differences in landmark use we mentioned in Section 2. But interestingly, the two significantly worse ratings by male users concerned the correct timing of instructions and the feedback for driving errors, i.e. issues regarding the system's real-time capabilities. Although our system does not yet perform ideally on these measures, this confirms our initial hypothesis that the NLG system must track the user's behavior and schedule its utterances appropriately. This means that earlier systems such as CORAL, which only compute a one-shot discourse of route instructions without regard to the timing of the presentation, miss a crucial part of the problem.

Apart from the exceptions we just discussed, the landmark-based system tended to score comparably or a bit worse than the baseline on the other subjective questions. This may partly be due to the fact that the subjects were familiar with existing commercial car navigation systems and not used to landmark-based instructions. On the other hand, this finding is also consistent with results of other evaluations of NLG systems, in which an improvement in the objective task usefulness of the system does not necessarily correlate with improved scores from subjective questionnaires (Gatt et al., 2009).

6 Conclusion

In this paper, we have described a system for generating real-time car navigation instructions with landmarks. Our system is distinguished from earlier work in its reliance on open-source map data from OpenStreetMap, from which we extract both the street graph and the potential landmarks. This demonstrates that open resources are now informative enough for use in wide-coverage navigation NLG systems. The system then chooses appropriate landmarks at decision points, and continuously monitors the driver's behavior to provide modified instructions in real time when driving errors occur.

We evaluated our system using a driving simulator with respect to driving errors, user satisfaction, and driver distraction. To our knowledge, we have shown for the first time that a landmark-based car navigation system significantly outperforms a baseline, namely in the amount of time female users spend looking away from the road.

In many ways, the Virtual Co-Pilot is a very simple system, which we see primarily as a starting point for future research. The evaluation confirmed the importance of interactive real-time NLG for navigation, and we therefore see this as a key direction of future work. On the other hand, it would be desirable to generate more complex referring expressions ("the tall church"). This would require more informative map data, as well as a formal model of visual salience (Kelleher and van Genabith, 2004; Raubal and Winter, 2002).

Acknowledgments

We would like to thank the DFKI CARMINA group for providing the driving simulator, as well as for their support. We would furthermore like to thank the DFKI Agents and Simulated Reality group for providing the 3D city model.

References

G. L. Allen. 2000. Principles and practices for communicating route knowledge. Applied Cognitive Psychology, 14(4):333–359.

C. Brenner and B. Elias. 2003. Extracting landmarks for car navigation systems using existing GIS databases and laser scanning. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 34(3/W8):131–138.

G. Burnett. 2000. 'Turn right at the Traffic Lights': The Requirement for Landmarks in Vehicle Navigation Systems. The Journal of Navigation, 53(03):499–510.

R. Dale, S. Geldof, and J. P. Prost. 2003. Using natural language generation for navigational assistance. In ACSC, pages 35–44.

B. Elias. 2003. Extracting landmarks with data mining methods. Spatial Information Theory, pages 375–389.

A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood, W. Moncur, and S. Sripada. 2009. From data to text in the neonatal intensive care unit: Using NLG technology for decision support and information management. AI Communications, 22:153–186.

S. Kaplan. 1976. Adaption, structure and knowledge. In G. Moore and R. Golledge, editors, Environmental Knowing: Theories, Research and Methods, pages 32–45. Dowden, Hutchinson and Ross.

J. D. Kelleher and J. van Genabith. 2004. Visual salience and reference resolution in simulated 3-D environments. Artificial Intelligence Review, 21(3).

A. Koller, K. Striegnitz, D. Byron, J. Cassell, R. Dale, J. Moore, and J. Oberlander. 2010. The First Challenge on Generating Instructions in Virtual Environments. In E. Krahmer and M. Theune, editors, Empirical Methods in Natural Language Generation. Springer.

N. Lessmann, S. Kopp, and I. Wachsmuth. 2006. Situated interaction with a virtual human – perception, action, and cognition. In G. Rickheit and I. Wachsmuth, editors, Situated Communication, pages 287–323. Mouton de Gruyter.

K. Lovelace, M. Hegarty, and D. Montello. 1999. Elements of good route directions in familiar and unfamiliar environments. Spatial Information Theory. Cognitive and Computational Foundations of Geographic Information Science, pages 751–751.

K. Lynch. 1960. The Image of the City. MIT Press.

R. Malaka and A. Zipf. 2000. DEEP MAP – Challenging IT research in the framework of a tourist information system. Information and Communication Technologies in Tourism, 7:15–27.

R. Malaka, J. Haeussler, and H. Aras. 2004. SmartKom mobile: intelligent ubiquitous user interaction. In Proceedings of the 9th International Conference on Intelligent User Interfaces.

A. J. May and T. Ross. 2006. Presence and quality of navigational landmarks: effect on driver performance and implications for design. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2):346.

P. E. Michon and M. Denis. 2001. When and why are visual landmarks used in giving directions? Spatial Information Theory, pages 292–305.

A. Pauzié. 2008. Evaluating driver mental workload using the driving activity load index (DALI). In Proc. of the European Conference on Human Interface Design for Intelligent Transport Systems, pages 67–77.

M. Raubal and S. Winter. 2002. Enriching wayfinding instructions with local landmarks. Geographic Information Science, pages 243–259.

E. Reiter and R. Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press.

D. M. Saucier, S. M. Green, J. Leason, A. MacFadden, S. Bell, and L. J. Elias. 2002. Are sex differences in navigation caused by sexually dimorphic strategies or by differences in the ability to use the strategies? Behavioral Neuroscience, 116(3):403.

M. Schröder and J. Trouvain. 2003. The German text-to-speech synthesis system MARY: A tool for research, development and teaching. International Journal of Speech Technology, 6(4):365–377.

K. Striegnitz and F. Majda. 2009. Landmarks in navigation instructions for a virtual environment. In Online Proceedings of the First NLG Challenge on Generating Instructions in Virtual Environments (GIVE-1).

J. C. Stutts, D. W. Reinfurt, L. Staplin, and E. A. Rodgman. 2001. The role of driver distraction in traffic crashes. Washington, DC: AAA Foundation for Traffic Safety.

A. Tom and M. Denis. 2003. Referring to landmark or street information in route directions: What difference does it make? Spatial Information Theory, pages 362–374.

To what extent does sentence-internal realisation reflect discourse context? A study on word order

Sina Zarrieß, Jonas Kuhn (Institut für maschinelle Sprachverarbeitung, University of Stuttgart, Germany)
Aoife Cahill (Educational Testing Service, Princeton, NJ 08541, USA)


Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 767–776, Avignon, France, April 23–27 2012. (c) 2012 Association for Computational Linguistics

Abstract

We compare the impact of sentence-internal vs. sentence-external features on word order prediction in two generation settings: starting out from a discriminative surface realisation ranking model for an LFG grammar of German, we enrich the feature set with lexical chain features from the discourse context which can be robustly detected and reflect rough grammatical correlates of notions from theoretical approaches to discourse coherence. In a more controlled setting, we develop a constituent ordering classifier that is trained on a German treebank with gold coreference annotation. Surprisingly, in both settings, the sentence-external features perform poorly compared to the sentence-internal ones, and do not improve over a baseline model capturing the syntactic functions of the constituents.

1 Introduction

The task of surface realization, especially in a relatively free word order language like German, is only partially determined by hard syntactic constraints. The space of alternative realizations that are strictly speaking grammatical is typically considerable. Nevertheless, for any given choice of lexical items and prior discourse context, only a few realizations will come across as natural and will contribute to a coherent text. Hence, any NLP application involving a non-trivial generation step is confronted with the issue of soft constraints on grammatical alternatives in one way or another.

There are countless approaches to modelling these soft constraints, taking into account their interaction with various aspects of the discourse context (givenness or salience of particular referents, prior mentioning of particular concepts). Since so many factors are involved and there is further interaction with subtle semantic and pragmatic differentiations, lexical choice, stylistics and presumably processing factors, theoretical accounts making reliable predictions for real corpus examples have long proven elusive. As for German, only quite recently have a number of corpus-based studies (Filippova and Strube, 2007; Speyer, 2005; Dipper and Zinsmeister, 2009) made some good progress towards a coherence-oriented account of at least the left edge of the German clause structure, the Vorfeld constituent.

What makes the technological application of theoretical insights even harder is that for most relevant factors, automatic recognition cannot be performed with high accuracy (e.g., a coreference accuracy in the 70s means there is a good deal of noise), and for higher-level notions such as the information-structural focus, interannotator agreement on real corpus data tends to be much lower than for core-grammatical notions (Poesio and Artstein, 2005; Ritz et al., 2008).

On the other hand, many of the relevant discourse factors are reflected indirectly in properties of the sentence-internal material. Most notably, knowing the shape of a referring expression narrows down many aspects of givenness and salience of its referent; pronominal realizations indicate givenness, and in German there are even two variants of the personal pronoun (er and der) for distinguishing salience. So, if the generation task is set in such a way that the actual lexical choice, including functional categories such as determiners, is fully fixed (which is of course not always the case), one can take advantage of these reflexes. This explains in part the fairly high baseline performance of n-gram language models in the surface realization task. And the effect can indeed be taken much further: the discriminative training experiments of Cahill and Riester (2009) show how effective it is to systematically take advantage of asymmetry patterns in the morphosyntactic reflexes of the discourse notion of information status (i.e., using a feature set with well-chosen purely sentence-bound features).

These observations give rise to the question: in the light of the difficulty in obtaining reliable discourse information on the one hand and the effectiveness of exploiting the reflexes of discourse in the sentence-internal material on the other – can we nevertheless expect to gain something from adding sentence-external feature information?

We propose two scenarios for addressing this question: first, we choose an approximative access to context information and relations between discourse referents – lexical reiteration of head words, combined with information about their grammatical relation and topological positioning in prior sentences. We apply these features in a rich sentence-internal surface realisation ranking model for German. Secondly, we choose a more controlled scenario: we train a constituent ordering classifier based on a feature model that captures properties of discourse referents in terms of manually annotated coreference relations. As we get the same effect in both setups – the sentence-external features do not improve over a baseline that captures basic morphosyntactic properties of the constituents – we conclude that sentence-internal realisation is actually a relatively accurate predictor of discourse context, even more accurate than information that can be obtained from coreference and lexical chain relations.

2 Related Work

In the generation literature, most works on exploiting sentence-external discourse information are set in a summarisation or content ordering framework. Barzilay and Lee (2004) propose an account for constraints on topic selection based on probabilistic content models. Barzilay and Lapata (2008) propose an entity grid model which represents the distribution of referents in a discourse for sentence ordering. Karamanis et al. (2009) use Centering-based metrics to assess coherence in an information ordering system. Clarke and Lapata (2010) have improved a sentence compression system by capturing prominence of phrases or referents in terms of lexical chain information inspired by Morris and Hirst (1991) and Centering (Grosz et al., 1995). In their system, discourse context is represented in terms of hard constraints modelling whether a certain constituent can be deleted or not.

In the linearisation or surface realisation domain, there is a considerable body of work approximating information structure in terms of sentence-internal realisation (Ringger et al., 2004; Filippova and Strube, 2009; Velldal and Oepen, 2005; Cahill et al., 2007). Cahill and Riester (2009) improve realisation ranking for German – which mainly deals with word order variation – by representing precedence patterns of constituents in terms of asymmetries in their morphosyntactic properties. As a simple example, a pattern exploited by Cahill and Riester (2009) is the tendency of definite elements to precede indefinites, which, on a discourse level, reflects that given entities in a sentence tend to precede new entities.

Other work on German surface realisation has highlighted the role of the initial position in the German sentence, the so-called Vorfeld (or "prefield"). Filippova and Strube (2007) show that once the Vorfeld (i.e. the constituent that precedes the finite verb) is correctly determined, the prediction of the order in the Mittelfeld (i.e. the constituents that follow the finite verb) is very easy. Cheung and Penn (2010) extend the approach of Filippova and Strube (2007) and augment a sentence-internal constituent ordering model with sentence-external features inspired by the entity grid model proposed by Barzilay and Lapata (2008).

3 Motivation

While there would be many ways to construe or represent discourse context (e.g. in terms of the global discourse or information structure), we concentrate on capturing local coherence through the distribution of discourse referents in a text. These discourse referents basically correspond to the constituents that our surface realisation model has to put in the right order. As the order of referents or constituents is arguably influenced by the information structure of a sentence given the previous text, our main assumption was that information about the prior mentioning of a referent would be helpful for predicting the position of this referent in a sentence.

(1) a. Kurze Zeit später erklärte ein Anrufer bei Nachrichtenagenturen in Pakistan, die Gruppe Gamaa bekenne sich.
       'Shortly after, a caller declared at the news agencies in Pakistan that the group Gamaa avows itself.'
    b. Diese Gruppe wird für einen Großteil der Gewalttaten verantwortlich gemacht, die seit dreieinhalb Jahren in Ägypten verübt worden sind.
       'This group is made responsible for most of the violent acts that have been committed in Egypt in the last three and a half years.'

(2) a. Belgien wünscht, dass sich WEU und NATO darüber einigen.
       'Belgium wants WEU and NATO to agree on that.'
    b. Belgien sieht in der NATO die beste militärische Struktur in Europa.
       'Belgium sees the best military structure of Europe in NATO.'

(3) a. Frauen vom Land kämpften aktiv darum, ein Staudammprojekt zu verhindern.
       'Women from the countryside fought actively to block the dam project.'
    b. Auch in den Städten fänden sich immer mehr Frauen in Selbsthilfeorganisationen zusammen.
       'Also in the cities, more and more women team up in self-help organisations.'

The idea that the occurrence of discourse referents in a text is a central aspect of discourse structure has been systematically pursued by Centering Theory (Grosz et al., 1995). Its most important notions are related to the realisation of discourse referents (described as "centers") and the way the centers are arranged in a sequence of utterances to make this sequence a coherent discourse. Another important concept is the "ranking" of discourse referents, which basically determines the prominence of a referent in a certain sentence and is driven by several factors (e.g. their grammatical function). For free word order languages like German, word order has been proposed as one of the factors that account for the ranking (Poesio et al., 2004). In a similar spirit, Morris and Hirst (1991) have proposed that chains of (related) lexical items in a text are an important indicator of text structure.

Our main hypothesis was that it is possible to exploit these intuitions from Centering Theory and the idea of lexical chains for word order prediction. Thus, we expected that it would be easier to predict the position of a referent in a sentence if we are given not only its realisation in the current utterance but also its prominence in the previous discourse. In particular, we expected this intuition to hold for cases where the morpho-syntactic realisation of a constituent does not provide many clues. This is illustrated in Examples (1) and (2), which both exemplify the reiteration of a lexical item in two subsequent sentences (reiteration is one type of lexical chain discussed in Morris and Hirst (1991)). In Example (1), the second instance of the noun 'group' is modified by a demonstrative pronoun such that its "known" and prominent discourse status is overt in the morpho-syntactic realisation. In Example (2), both instances of "Belgium" are realised as bare proper nouns without an overt morphosyntactic clue indicating their discourse status.

Beyond the simple presence of reiterated items in sequences of sentences, we expected that it would be useful to look at the position and syntactic function of the previous mentions of a discourse referent. In Example (1), the reiterated item is first introduced in an embedded sentence and realised in the Vorfeld in the second utterance. In terms of centering, this transition would correspond to a topic shift. In Example (2), both instances are realised in the Vorfeld, such that the topic of the first sentence is carried over to the next.

In Example (3), we illustrate a further type of lexical reiteration. In this case, two identical head nouns are realised in subsequent sentences, even though they refer to two different discourse referents. While this type of lexical chain is described as "reiteration without identity of referents" by Morris and Hirst (1991), it would not be captured in Centering, since this is not a case of strict coreference. On the other hand, lexical chains do not capture types of reiterated discourse referents that have distinct morpho-syntactic realisations, e.g. nouns and pronouns.

Originally, we had the hypothesis that strict coreference information is more useful and accurate for word order prediction than rather loose lexical chains which conflate several types of referential and lexical relations. However, the advantage of chains, especially chains of reiteration, is that they can be easily detected in any corpus text and that they might capture "topics" of sentences beyond the identity of referents. Thus, we started out from the idea of lexical chains and added corresponding features in a statistical ranking model for surface realisation of German (Section 4). As this strategy did not work out, we wanted to assess whether an ideal coreference annotation would be helpful at all for predicting word order. In a second experiment, we use a corpus which is manually annotated for coreference (Section 5).

4 Experiment 1: Realisation Ranking with Lexical Chains

In this section, we present an experiment that investigates sentence-external context in a surface realisation task. The sentence-external context is represented in terms of lexical chain features and compared to sentence-internal models which are based on morphosyntactic features. The experiment thus targets a generation scenario where no coreference information is available and aims at assessing whether relatively naive context information is also useful.

4.1 System Description

We carry out our first experiment in a regeneration set-up with two components: a) a large-scale hand-crafted Lexical Functional Grammar (LFG) for German (Rohrer and Forst, 2006), used to parse and regenerate a corpus sentence, b) a stochastic ranker that selects the most appropriate regenerated sentence in context according to an underlying, linguistically motivated feature model. In contrast to fully statistical linearisation methods, our system first generates the full set of sentences that correspond to the grammatically well-formed realisations of the intermediate syntactic representation.[1] This representation is an f-structure, which underspecifies the order of constituents and, to some extent, their morphological realisation, such that the output sentences contain all possible combinations of word order permutations and morphological variants. Depending on the length and structure of the original corpus sentence, the set of regenerated sentences can be huge (see Cahill et al. (2007) for details on regenerating the German treebank TIGER).

[1] There are occasional mistakes in the grammar which sometimes lead to ungrammatical strings being generated, but this is rare.

The realisation ranking component is an SVM ranking model implemented with SVMrank, a Support Vector Machine-based learning tool (Joachims, 2006). During training, each sentence is annotated with a rank and a set of features extracted from the f-structure, its surface string and external resources (e.g. a language model). If the sentence matches the original corpus string, its rank will be highest, the assumption being that the original sentence corresponds to the optimal realisation in context. The output of generation, the top-ranked sentence, is evaluated against the original corpus sentence.

4.2 The Feature Models

As the aim of this experiment is to better understand the nature of sentence-internal features reflecting discourse context and compare them to sentence-external ones, we build several feature models which capture different aspects of the constituents in a given sentence. The sentence-internal features describe the morphosyntactic realisation of constituents, for instance their function ("subject", "object"), and can be straightforwardly extracted from the f-structure. These features are then combined into discriminative precedence features, for instance "subject-precedes-object". We implement the following types of morphosyntactic features:

• syntactic function (arguments and adjuncts)
• modification (e.g. nouns modified by relative clauses, genitives, etc.)
• syntactic category (e.g. adverbs, proper nouns, phrasal arguments)
• definiteness for nouns
• number and person for nominal elements
• types of pronouns (e.g. demonstrative, reflexive)
• constituent span and number of embedded nodes in the tree

In addition, we also include language model scores in our ranking model. In Section 4.4, we report on results for several subsets of these features, where "BaseSyn" refers to a model that only includes the syntactic function features and "FullMorphSyn" includes all features mentioned above.

For extracting the lexical chains, we check for any overlapping nouns in the n sentences previous to the current one being generated. We check
(2007) for details on regen- “FullMorphSyn” includes all features mentioned erating the German treebank TIGER). above. 1 There are occasional mistakes in the grammar which For extracting the lexical chains, we check for sometimes lead to ungrammatical strings being generated, any overlapping nouns in the n sentences previ- but this is rare. ous to the current one being generated. We check 770 Rank Sentence and Features % Diese Gruppe wird f¨ur einen Großteil der Gewalttaten verantwortlich gemacht. % This group is for a major part of the violent acts responsible made. 1 subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, overlap-in-vorfeld, lm:-7.89 % F¨ur einen Großteil der Gewalttaten wird diese Gruppe verantwortlich gemacht. % For a major part of the violent acts is this group responsible made. 3 pp-object-<-subject, indefinite-<-demonstrative, no-overlap-<-overlap, no-overlap-in-vorfeld, lm:-10.33 % Verantwortlich gemacht wird diese Gruppe f¨ur einen Großteil der Gewalttaten. % Responsible made is this group for a major part of the violent acts. 3 subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, lm:-9.41 Figure 1: Made-up training example for realisation ranking with precedence features proper and common nouns, considering full and # Sentences % Sentences with overlap in context Training Dev Test partial overlaps as shown in Examples (1) and 1 20.96 23.64 20.42 (2), where the (a) example is the previous sen- 2 35.42 40.74 35.00 tence in the corpus. For each overlap, we record 3 45.58 50.00 53.33 the following properties: (i) function in the previ- 4 52.66 53.70 58.75 5 57.45 58.18 64.58 ous sentence, (ii) position in the previous sentence 6 61.42 57.41 68.75 (e.g. Vorfeld), (iii) distance between sentences, 7 64.58 61.11 70.83 (iv) total number of overlaps. 8 67.05 62.96 72.08 These overlap features are then also 9 69.20 64.81 74.17 combined in terms of precedence, e.g. 
10 71.16 70.37 75.83 “has subject overlap:3-precedes-no overlap”, Table 1: The percentage of sentences that have at least meaning that in the current sentence a noun one overlapping entity in the previous n sentences that was previously mentioned in a subject 3 sentences ago precedes a noun that was not mentioned before. In Figure 1, we give an example of a set of gen- coreference annotation, since we already have a eration alternatives and their (partial) feature rep- number of resources available to match the syn- resentation for the sentence (1-b). Precedence is tactic analyses produced by our grammar against indicated by ”<”. the analyses in the treebank. Thus, in our regen- Basically, our sentence-external feature model eration system, we parse the sentences with the is built on the intuition that lexical chains or over- grammar, and choose the parsed f-structures that laps approximate discourse status in a way which are compatible with the manual annotation in the is similar to sentence-internal morphosyntactic TIGER treebank as is done in Cahill et al. (2007). properties. Thus, we would expect that overlaps This compatibility check eliminates noise which indicate givenness, salience or prominence and would be introduced by generating from incorrect that asymmetries between overlapping and non- parses (e.g. incorrect PP-attachments typically re- overlapping entities are helpful in the ranking. sult in unnatural and non-equivalent surface reali- sations). 4.3 Data All our models are trained on 7,039 sentences For comparing the string chosen by the mod- (subdivided into 1259 texts) from the TIGER els against the original corpus sentence, we use Treebank of German newspaper text (Brants et al., BLEU, NIST and exact match. Exact match is 2002). 
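The overlap extraction described in Section 4.2 can be sketched roughly as follows. The sentence representation, the substring test for partial overlap, and the returned field names are our own simplifying assumptions, not the authors' implementation.

```python
# Rough sketch of lexical-chain overlap detection: for each noun in the
# current sentence, look for full or partial matches among the nouns of the
# previous n sentences and record properties (i)-(iv) from the text.
# All representations here are illustrative assumptions.

def noun_overlaps(current_nouns, context, max_distance=5):
    """current_nouns: noun lemmas of the sentence being generated.
    context: previous sentences, most recent first; each sentence is a
    list of (noun_lemma, function, position) triples."""
    overlaps = []
    for distance, sentence in enumerate(context[:max_distance], start=1):
        for noun, function, position in sentence:
            for current in current_nouns:
                if current == noun:
                    kind = "full"
                elif current in noun or noun in current:
                    # crude stand-in for a partial overlap, e.g. a German
                    # compound sharing material with an earlier noun
                    kind = "partial"
                else:
                    continue
                overlaps.append({
                    "noun": current,
                    "kind": kind,
                    "antecedent_function": function,   # property (i)
                    "antecedent_position": position,   # property (ii)
                    "distance": distance,              # property (iii)
                })
    return overlaps, len(overlaps)                     # property (iv)
```

From such records, overlap features like "has subject overlap:3" can be read off the antecedent function and distance, and then paired into precedence features for the ranker.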
We tune the parameters of our SVM model a strict measure that only credits the system if it on a development set of 55 sentences and report chooses the exact same string as the original cor- the final results for our unseen test set of 240 sen- pus string. BLEU and NIST are more relaxed tences. Table 1 shows how many sentences in our measures that compare the strings on the n-gram training, development and test sets have at least level. Finally, we report accuracy scores for the one textually overlapping phrase in the previous Vorfeld position (VF) corresponding to the per- 1–10 sentences. centage of sentences generated with a correct Vor- We choose the TIGER treebank, which has no feld. 771 Sc BLEU NIST Exact VF by morphosyntactic features. However, we cannot 0 0.766 11.885 50.19 64.0 exclude the possibility that the chain features are 1 0.765 11.756 49.78 64.0 2 0.765 11.886 50.01 64.1 too noisy as they conflate several types of lexical 3 0.765 11.885 50.08 63.8 and coreferential relations. This will be adressed 4 0.761 11.723 49.43 63.2 in the following experiment. 5 0.765 11.884 49.71 64.2 6 0.768 11.892 50.42 64.6 5 Experiment 2: Constituent Ordering 7 0.765 11.885 50.01 64.5 8 0.764 11.884 49.78 64.3 with Centering-inspired Features 9 0.765 11.888 49.82 63.6 10 0.764 11.889 49.7 63.5 We now look at a simpler generation setup where we concentrate on the ordering of constituents in Table 2: Tenfold-crossvalidation for feature model the German Vorfeld and Mittelfeld. This strat- FullMorphSyn and different context windows (Sc ) egy has also been adopted in previous investiga- Model BLEU VF tions of German word order: Filippova and Strube Language Model 0.702 51.2 (2007) show that once the German Vorfeld is cor- Language Model + Context Sc = 5 0.715 54.3 rectly chosen, the prediction accuracy for the Mit- BaseSyn 0.757 62.0 telfeld (the constituents following the finite verb) BaseSyn + Context Sc = 5 0.760 63.0 FullMorphSyn 0.766 64.0 is in the 90s. 
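Of the evaluation measures used here, exact match and VF accuracy are simple enough to spell out; the sketch below is only an illustration. The Vorfeld extractor is a naive first-token placeholder, whereas the actual system reads the Vorfeld off the syntactic analysis; BLEU and NIST are standard n-gram metrics and are omitted.

```python
# Illustrative computation of exact match and Vorfeld (VF) accuracy.
# The first-token Vorfeld extractor is a placeholder assumption.

def evaluate(chosen, references, vorfeld_of=lambda s: s.split()[0]):
    """chosen/references: parallel lists of sentence strings."""
    n = len(references)
    # exact match: chosen string is identical to the corpus string
    exact = sum(c == r for c, r in zip(chosen, references)) / n
    # VF accuracy: chosen sentence opens with the correct Vorfeld
    vf = sum(vorfeld_of(c) == vorfeld_of(r)
             for c, r in zip(chosen, references)) / n
    return {"exact_match": exact, "vf_accuracy": vf}
```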
FullMorphSyn + Context Sc = 5 0.763 64.2 In order to eliminate noise introduced from po- tentially heterogeneous chain features, we look at Table 3: Evaluation for different feature models; ‘Lan- coreference features and, again, compare them to guage Model’: ranking based on language model scores, ‘BaseSyn’: precedence between constituent sentence-internal morphosyntactic features. We functions, ‘FullMorphSyn’: entire set of sentence- target a generation scenario where coreference in- internal features. formation is available. The aim is to establish an upper bound concerning the quality improvement 4.4 Results for word order prediction by recurring to manual In Table 2, we report the performance of the full corefence annotation. sentence-internal feature model combined with 5.1 Data and Setup context windows from zero to ten. The scores have been obtained from tenfold-crossvalidation. We carry out the constituent ordering experiment For none of the context windows, the model out- on the T¨uba-D/Z treebank (v5) of German news- performs the baseline with a zero context which paper articles (Telljohann et al., 2006). It com- has no sentence-external features. In Table 3, prises about 800k tokens in 45k sentences. We we compare the performance of several feature choose this corpus because it is not only annotated models corresponding to subsets of the features with syntactic analyses but also with coreference used so far which are combined with sentence- relations (Naumann, 2006). The syntactic annota- external features respectively. We note that the tion format differs from the TIGER treebank used function precedence features (i.e. the ‘BaseSyn’ in the previous experiment, for instance, it ex- model) are very powerful, leading to a major im- plicitely represents the Vorfeld and Mittelfeld as provement compared to a language model. The phrasal nodes in the tree. 
This format is very con- sentence-external features lead to an improvement venient for the extraction of constituents in the re- when combined with the language-model based spective positions. ranking. However, this improvement is leveled The T¨uba-D/Z coreference annotation distin- out in the BaseSyn model. guishes several relations between discourse ref- On the one hand, the fact that the lexical chain erents, most importantly “coreferential relation” features improve a language-model based ranking and “anaphoric relation” where the first denotes suggests these features are, to some extent, pre- a relation between noun phrases that refer to the dictive for certain patterns of German word order. same entity, and the latter refers to a link between On the other hand, the fact that they don’t improve a pronoun and a contextual antecedent, see Nau- over an informed sentence-internal baseline sug- mann (2006) for further detail. We expected the gests that these patterns are equally well captured coreferential relation to be particularly useful, as 772 it cannot always be read off the morphosyntac- # VF # MF Backward Center 3.5% 5.1% tic realisation of a noun phrase, whereas pronouns Forward Center 6.8% 6.8% are almost always used in an anaphoric relation. Coref Link 30.5% 23.4% The constituent ordering model is implemented as a classifier that is given a set of constituents Table 4: Backward and forward centers and their posi- and predicts the constituent that is most likely to tions be realised in the Vorfeld. The set of candidate constituents is determined chain model since there is no lexical overlap be- from the tree of the original corpus sentence. We tween the realisations of the discourse referents. will assume that all constituents under a Vorfeld These types of coreference features implicitly and Mittelfeld node can be freely reordered. 
Thus, carry the information that would also be consid- we do not check whether the word order variants ered in a Centering formalisation of discourse we look at are actually grammatical assuming that context. In addition to these, we designed features most of them are. In this sense, this experiment that explicitly describe centers as these might is close to fully statistical generation approaches. have a higher weight. In line with Clarke and As a further simplification, we do not look at mor- Lapata (2010), we compute backward (CB) and phological generation variants of the constituents forward centers (CF ) in the following way: or their head verb. The classifier is implemented with SVMrank 1. Extract all entities from the current sentence again. In contrast to the previous experiment and the previous sentence. where we learned to rank sentences, the classi- 2. Rank the entities of the previous sentence ac- fier now learns to rank constituents. The con- cording to their function (subject < direct stituents have been extracted using the tool de- object < indirect object ...). scribed in Bouma (2010). The final data set com- 3. Find the highest ranked entity in the previous prises 48.513 candidate sets of freely orderable sentence that has a link to an entity in the constituents. current sentence, this entity is the CB of the sentence. 5.2 Centering-inspired Feature Model To compare the discourse context model against a In the same way, we mark entities as forward sentence-based model, we implemented a number centers that are ranked highest in the current sen- of sentence-internal features that are very similar tence and have a link to an entity in the following to the features used in the previous experiment. sentence.2 In Table 4, we report the percentage of Since we extract them from the syntactic annota- sentences that have backward and forward centers tion instead of f-structures, some labels and fea- in the Vorfeld or Mittelfeld. 
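The three steps for computing the backward center in Section 5.2 can be sketched as below. The entity and link representations, and the fallback rank for unlisted functions, are assumptions made for the illustration; the paper follows the formulation of Clarke and Lapata (2010).

```python
# Sketch of the CB computation (steps 2 and 3; step 1, entity extraction,
# is assumed to have been done upstream). Entities are plain ids and
# coreference links are unordered id pairs; both are illustrative choices.

FUNCTION_RANK = {"subject": 0, "direct object": 1, "indirect object": 2}

def backward_center(prev_entities, current_entities, coref_links):
    """prev_entities: (entity_id, function) pairs from the previous sentence.
    current_entities: set of entity ids in the current sentence.
    coref_links: set of frozenset({id_a, id_b}) coreference links.
    Returns the CB of the current sentence, or None if there is none."""
    # Step 2: rank previous-sentence entities by grammatical function.
    ranked = sorted(prev_entities,
                    key=lambda e: FUNCTION_RANK.get(e[1], len(FUNCTION_RANK)))
    # Step 3: find the highest-ranked previous entity with a link (or the
    # same id) in the current sentence; the linked entity is the CB.
    for entity_id, _function in ranked:
        for current_id in current_entities:
            if entity_id == current_id \
               or frozenset((entity_id, current_id)) in coref_links:
                return current_id
    return None
```

Forward centers can be marked symmetrically by running the same procedure with the current and following sentence swapped in.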
While the percentage ture names will be different, however, the design of sentences that realise a backward center is quite of the sentence-internal model is identical to the low, the overall proportion of sentences contain- previous one in Section 4. ing some type of coreference link is in a dimen- The sentence-external features differ in some sion such that the learner could definitely pick up aspects from Section 4, since we extract coref- some predictive patterns. Going by the relative erence relations of several types (see (Naumann, frequencies, coreferential constituents have a bias 2006) for the anaphoric relations annotated in the towards appearing in the Vorfeld rather than in the Tueba-D/Z). For each type of coreference link, Mittelfeld. we extract the following properties: (i) function 5.3 Results of the antecedent, (ii) position of the antecedent, (iii) distance between sentences, (iv) type of rela- First, we build three coreference-based con- tion. We also distinguish coreference links anno- stituent classifiers on their entire training set and tated for the whole phrase (“head link”) and links compare them to their sentence-internal baseline. that are annotated for an element embedded by the The most simple baseline records the category of constituent (“contained link”). The two types are 2 In Centering, all entities in a given utterance can be seen illustrated in Examples (4) and (5). Note that both as forward centers, however we thought that this implemen- cases would not have been captured in the lexical tation would be more useful. 773 (4) a. Die Rechnung geht an die AWO. The bill goes to the AWO. b. [Hintergrund der gegenseitigen Vorw¨urfe in der Arbeiterwohlfahrt] sind offenbar scharfe Konkurrenzen zwischen Bremern und Bremerhavenern. Apparently, [the background of the mutual accusations at the labour welfare] are rivalries between people from Bremen and Bremerhaven. (5) a. 
Dies ist die Behauptung, mit der Bremens H¨afensenator die Skeptiker davon u¨ berzeugt hat, [...]. This is the claim, which Bremen’s harbour senator used to convince doubters, [...]. b. F¨ur diese Behauptung hat Beckmeyer bisher keinen Nachweis geliefert. So far, Beckmeyer has not given a prove of this claim. Model VF Model VF ConstituentLength + HeadPos 47.48% ConstituentLength + HeadPos 46.61% ConstituentLength + HeadPos + Coref 51.30% ConstituentLength + HeadPos + Coref 52.23% BaseSyn 54.82% BaseSyn 54.63% BaseSyn + Coref 56.21% BaseSyn + Coref 56.67% FullMorphSyn 57.24% FullMorphSyn 55.36% FullMorphSyn + Coref 57.40% FullMorphSyn + Coref 57.93% Table 5: Results from Vorfeld classification, training Table 6: Results from Vorfeld classification, training and evaluation on entire treebank and evaluation on sentences that contain a coreference link the constituent head and the number of words that the constituent spans. Additionally, in parallel to 5.4 Discussion the experiment in Section 4, we build a “BaseSyn” The results presented in this Section consis- model which has the syntactic function features, tently complete the picture that emerged from and a “FullMorphSyn” model which comprises the experiments in Section 4. Even if we have the entire set of sentence-internal features. To high quality information about discourse con- each of these baseline, we add the coreference text in terms of relations between referents, a features. The results are reported in Table 5. non-trivial sentence-internal model for word or- In this experiment, we find an effect of der prediction can be hardly improved. This the sentence-external features over the simple suggests that sentence-internal approximations of sentence-internal baselines. However, in the fully discourse context provide a fairly good way of spelled-out, sentence-internal model, the effect dealing with local coherence in a linearisation is, again, minimal. Moreover, for each base- task. 
It is also interesting that the sentence- line, we obtain higher improvements by adding external features improve over simple baselines, further sentence-internal features than by adding but get leveled out in rich sentence-internal fea- sentence-external ones the accuracy of the sim- ture models. From this, we conclude that the ple baseline (47.48%) improves by 7.34 points sentence-external features we implemented are to through adding function features (the accuracy some extent predictive for word order, but that of BaseSyn is 54.82%) and by only 3.48 points they can be covered by sentence-internal features through adding coreference features. as well. We run a second experiment in order to so see Our second evaluation concentrating on the whether the better performance of the sentence- sentences that have coreference information internal features is related to their coverage. We shows that the better performance of the sentence- build and evaluate the same set of classifiers on internal features is also related to their cover- the subset of sentences that contain at least one age. These results confirm our initial intuition coreference link for one of its constituents (see that coreference information can add to the pre- Table 4 for the distribution of coreference links dictive power of the morpho-syntactic features in in our data). The results are given in Table 6. In certain contexts. This positive effect disappears this experiment, the coreference features improve when sentences with and without coreferential over all sentence-internal baselines including the constituents are taken together. For future work, ‘FullMorphSyn’ model. it would be promising to investigate whether the 774 positive impact of coreference features can be to generation and summarization. In Proceedings of strengthened if the coreference annotation scheme HLT-NAACL 2004, Boston,MA. is more exhaustive, including, e.g., bridging and Anja Belz and Ehud Reiter. 2006. 
Comparing auto- event anaphora. matic and human evaluation of NLG systems. In Proceedings of EACL 2006, pages 313–320, Trento, 6 Conclusion Italy. Gerlof Bouma. 2010. Syntactic tree queries in prolog. We have carried out a number of experiments that In Proceedings of the Fourth Linguistic Annotation show that sentence-internal models for word order Workshop, ACL 2010. Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolf- are hardly improved by features which explicitely gang Lezius, and George Smith. 2002. The TIGER represent the preceding context of a sentence in Treebank. In Proceedings of the Workshop on Tree- terms of lexical and referential relations between banks and Linguistic Theories. discourse entities. This suggests that sentence- Aoife Cahill and Arndt Riester. 2009. Incorporat- internal realisation implicitly carries a lot of im- ing information status into generation ranking. In formation about discourse context. On average, Proceedings of the Joint Conference of the 47th An- the morphosyntactic properties of constituents in nual Meeting of the ACL and the 4th International a text are better approximates of their discourse Joint Conference on Natural Language Processing of the AFNLP, pages 817–825, Suntec, Singapore, status than actual coreference relations. August. Association for Computational Linguistics. This result feeds into a number of research Aoife Cahill, Martin Forst, and Christian Rohrer. questions concerning the representation of dis- 2007. Stochastic Realisation Ranking for a Free course and its application in generation systems. Word Order Language. In Proceedings of the Although we should certainly not expect a com- Eleventh European Workshop on Natural Language putational model to achieve a perfect accuracy in Generation, pages 17–24, Saarbr¨ucken, Germany. the constituent ordering task – even humans only DFKI GmbH. Aoife Cahill. 2009. 
Correlating human and automatic agree to a certain extent in rating word order vari- evaluation of a german surface realiser. In Proceed- ants (Belz and Reiter, 2006; Cahill, 2009) – the ings of the ACL-IJCNLP 2009 Conference Short Pa- average accuracy in the 60’s for prediction of Vor- pers, pages 97–100, Suntec, Singapore, August. As- feld occupance is still moderate. An obvious di- sociation for Computational Linguistics. rection would be to further investigate more com- Jackie C.K. Cheung and Gerald Penn. 2010. Entity- plex representations of discourse that take into ac- based local coherence modelling using topological count the relations between utterances, such as fields. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics topic shifts. Moreover, it is not clear whether the (ACL 2010). Association for Computational Lin- effects we find for linearisation in this paper carry guistics. over to other levels of generation such as tacti- James Clarke and Mirella Lapata. 2010. Discourse cal generation where syntactic functions are not constraints for document compression. Computa- fully specified. In a broader perspective, our re- tional Linguistics, 36(3):411–441. sults underline the need for better formalisations Stefanie Dipper and Heike Zinsmeister. 2009. The of discourse that can be translated into features for role of the German Vorfeld for local coherence. In large-scale applications such as generation. Christian Chiarcos, Richard Eckart de Castilho, and Manfred Stede, editors, Von der Form zur Bedeu- Acknowledgments tung: Texte automatisch verarbeiten/From Form to Meaning: Processing Texts Automatically, pages This work was funded by the Collaborative Re- 69–79. Narr, T¨ubingen. search Centre (SFB 732) at the University of Katja Filippova and Michael Strube. 2007. The ger- man vorfeld and local coherence. Journal of Logic, Stuttgart. Language and Information, 16:465–485. Katja Filippova and Michael Strube. 2009. 
Tree Lin- References earization in English: Improving Language Model Based Approaches. In Proceedings of Human Lan- Regina Barzilay and Mirella Lapata. 2008. Modeling guage Technologies: The 2009 Annual Conference local coherence: An entity-based approach. Com- of the North American Chapter of the Association putational Linguistics, 34:1–34. for Computational Linguistics, Companion Volume: Regina Barzilay and Lillian Lee. 2004. Catching the Short Papers, pages 225–228, Boulder, Colorado, drift: Probabilistic content models with applications June. Association for Computational Linguistics. 775 Barbara J. Grosz, Aravind Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Lin- guistics, 21(2):203–225. Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 217–226. Nikiforos Karamanis, Massimo Poesioand Chris Mel- lish, and Jon Oberlander. 2009. Evaluating center- ing for information ordering using corpora. Com- putational Linguistics, 35(1). Jane Morris and Graeme Hirst. 1991. Lexical cohe- sion, the thesaurus, and the structure of text. Com- putational Linguistics, 17(1):21–225. Karin Naumann. 2006. Manual for the annotation of in-document referential relations. Technical report, Seminar f¨ur Sprachwissenschaft, Abt. Computerlin- guistik, Universit¨at T¨ubingen. Massimo Poesio and Ron Artstein. 2005. The relia- bility of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proc. of ACL Workshop on Frontiers in Corpus Annotation. Massimo Poesio, Rosemary Stevenson, Barbara di Eu- genio, and Janet Hitzeman. 2004. Centering: A parametric theory and its instantiations. Computa- tional Linguistics, 30(3):309–363. Eric K. Ringger, Michael Gamon, Robert C. Moore, David Rojas, Martine Smets, and Simon Corston- Oliver. 2004. 
Linguistically Informed Statisti- cal Models of Constituent Structure for Ordering in Sentence Realization. In Proceedings of the 2004 International Conference on Computational Linguistics, Geneva, Switzerland. Julia Ritz, Stefanie Dipper, and Michael G¨otze. 2008. Annotation of information structure: An evaluation across different types of texts. In Proceedings of the the 6th LREC conference. Christian Rohrer and Martin Forst. 2006. Improv- ing Coverage and Parsing Quality of a Large-Scale LFG for German. In Proceedings of the Fifth In- ternational Conference on Language Resources and Evaluation (LREC), Genoa, Italy. Augustin Speyer. 2005. Competing constraints on vorfeldbesetzung in german. In Proceedings of the Constraints in Discourse Workshop, pages 79–87. Heike Telljohann, Erhard Hinrichs, Sandra K¨ubler, and Heike Zinsmeister. 2006. Stylebook for the t¨ubingen treebank of written german (t¨uba-d/z). revised version. Technical report, Seminar f¨ur Sprachwissenschaft, Universit¨at T¨ubingen. Erik Velldal and Stephan Oepen. 2005. Maximum entropy models for realization ranking. In Proceed- ings of the 10th Machine Translation Summit, pages 109–116, Thailand. 776 Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages Oliver Ferschke‡ , Iryna Gurevych†‡ and Yevgen Chebotar‡ † Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational Information ‡ Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science Technische Universit¨at Darmstadt http://www.ukp.tu-darmstadt.de Abstract chronous co-authoring tool. A unique character- istic of Wikis is the documentation of the edit In this paper, we propose an annota- history which keeps track of every change that tion schema for the discourse analysis of is made to a Wiki page. With this information, Wikipedia Talk pages aimed at the coor- dination efforts for article improvement. 
it is possible to reconstruct the writing process We apply the annotation schema to a cor- from the beginning to the end. Additionally, many pus of 100 Talk pages from the Simple Wikis offer their users a communication platform, English Wikipedia and make the resulting the Talk pages, where they can discuss the ongo- dataset freely available for download1 . Fur- ing writing process with other users. thermore, we perform automatic dialog act classification on Wikipedia discussions and The most prominent example for a successful, achieve an average F1 -score of 0.82 with large-scale Wiki is Wikipedia, a collaboratively our classification pipeline. created online encyclopedia, which has grown considerably since its launch in 2001, and con- tains a total of almost 20 million articles in 282 1 Introduction languages and dialects, as of Sept. 2011. As there Over the past decade, the paradigm of information is no editorial body that manages Wikipedia top- sharing in the web has shifted towards participa- down, it is an open question how the huge on- tory and collaborative content production. Texts line community around Wikipedia regulates and are no longer exclusively prepared by individuals enforces standards of behavior and article qual- and then shared with the community. They are in- ity. The user discussions on the article Talk pages creasingly created collaboratively by multiple au- might shed light on this issue and give an insight thors and iteratively revised by the community. into the otherwise hidden processes of collabora- When researchers first conducted surveys on tion that, until now, could only be analyzed via professional writers in the 1980s, they found that interviews or group observations in experimental the collaborative writing process differs consider- settings. ably from the way individual writing is done (Pos- The main goal of the present paper is to analyze ner and Baecker, 1992). 
In joint writing, the writ- the content of the discussion pages of the Simple ers have to externalize processes that are other- English Wikipedia with respect to the dialog acts wise not made explicit, like the planning and the aimed at the coordination efforts for article im- organization of the text. The authors have to com- provement. Dialog acts, according to the classic municate how the text should be written and what speech act theory (Austin, 1962; Searle, 1969), exactly it should contain. represent the meaning of an utterance at the level Today, many tools are available that support of illocutionary force, i.e. a dialog act label con- collaborative writing. A tool that has particu- cisely characterizes the intention and the role of a larly taken hold is the Wiki, a web-based, asyn- contribution in a dialog. We chose the Simple En- 1 http://www.ukp.tu-darmstadt.de/data/ glish Wikipedia for our initial analysis, because wikidiscourse we are able to obtain more representative results 777 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 777–786, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics by covering almost 15% of all relevant Talk pages, peculiarities of the Switchboard corpus. The re- as opposed to the much smaller fraction we could sulting SWDB-DAMSL schema contained more achieve for the English Wikipedia. The long-term than 220 distinct labels which have been clustered goal of this work is to identify relations between to 42 coarse grained labels. Both schemata have contributions on the Talk pages and particular arti- often been adapted for special purpose annotation cle edits. We plan to analyze the relation between tasks. 
article discussions and article content and identify With the rise of the social web, the amount of the edits in the article revision history that react to research analyzing user generated discourse sub- the problems discussed on the Talk page. In com- stantially increased. In addition to analyzing web bination with article quality assessment (Yaari et forums (Kim et al., 2010a), chats (Carpenter and al., 2011), this opens up the possibility to iden- Fujioka, 2011) and emails (Cohen et al., 2004), tify successful patterns of collaboration which in- Wikipedia Talk pages have recently moved into crease the article quality. Furthermore, our work the center of attention of the research community. will enable practical applications. By augment- Vi´egas et al. (2007) manually annotate 25 ing Wikipedia articles with the information de- Wikipedia article discussion pages with a set of rived from automatically labeled discussions, arti- 11 labels in order to analyze how Talk pages are cle readers can be made aware of particular prob- used for planning the work on articles and resolv- lems that are being discussed on the Talk page ing disputes among the editors. Schneider et al. “behind the article”. (2011) extend this schema and manually annotate Our primary contributions in this paper are: (1) 100 Talk pages with 15 labels. They confirm the an annotation schema for dialog acts reflecting findings of Vi´egas et al. that coordination requests the efforts for coordinating the article improve- occur most frequently in the discussions. ment; (2) the Simple English Wikipedia Dis- Bender et al. (2011) describe a corpus of 47 cussion (SEWD) corpus, consisting of 100 seg- Talk pages which have been annotated for author- mented and annotated Talk pages which we make ity claims and alignment moves. 
2 Related Work

The analysis of speech and dialog acts has its roots in the linguistic field of pragmatics. In 1962, John Austin shifted the focus from the mere declarative use of language as a means for making factual statements towards its non-declarative use as a tool for performing actions. The speech act theory was further systematized by Searle (1969), whose classification of illocutionary acts (Searle, 1976) is still used as a starting point for creating dialog act classification schemata for natural language processing.

A well known, domain- and task-independent annotation schema is DAMSL (Core and Allen, 1997). It was created as the standard annotation schema for dialog tagging on the utterance level by the Discourse Resource Initiative. It uses a four-dimensional tagset that allows arbitrary label combinations for each utterance. Jurafsky et al. (1997) augmented the DAMSL schema to fit the peculiarities of the Switchboard corpus. The resulting SWBD-DAMSL schema contained more than 220 distinct labels which have been clustered to 42 coarse grained labels. Both schemata have often been adapted for special purpose annotation tasks.

With the rise of the social web, the amount of research analyzing user generated discourse substantially increased. In addition to analyzing web forums (Kim et al., 2010a), chats (Carpenter and Fujioka, 2011) and emails (Cohen et al., 2004), Wikipedia Talk pages have recently moved into the center of attention of the research community.

Viégas et al. (2007) manually annotate 25 Wikipedia article discussion pages with a set of 11 labels in order to analyze how Talk pages are used for planning the work on articles and resolving disputes among the editors. Schneider et al. (2011) extend this schema and manually annotate 100 Talk pages with 15 labels. They confirm the findings of Viégas et al. that coordination requests occur most frequently in the discussions.

Bender et al. (2011) describe a corpus of 47 Talk pages which have been annotated for authority claims and alignment moves. With this corpus, the authors analyze how the participants in Wikipedia discussions establish their credibility and how they express agreement and disagreement towards other participants or topics.
The analysis of the graphs schema for dialog tagging on the utterance level reveals patterns that are unique to Wikipedia by the Discourse Resource Initiative. It uses a discussions and might be used as a means to four-dimensional tagset that allows arbitrary label characterize different types of Talk pages. combinations for each utterance. Jurafsky et al. To the best of our knowledge, there is no (1997) augmented the DAMSL schema to fit the work yet that uses machine learning to automati- 778 are usually headed by a topic title. Finally, the thread structure designates the sequence of turns and their indentation levels on the Talk page. A structural overview of a Talk page and its con- stituents can be seen in Figure 1. We composed an annotation schema that re- flects the coordination efforts for article improve- ment. Therefore, we manually analyzed a set of thirty Talk pages from the Simple English Wikipedia to identify the types of article defi- ciencies that are discussed and the way article Figure 1: Structure of a Talk page: a) Talk page title, improvement is coordinated. We furthermore b) untitled discussion topic, c) titled discussion topic, incorporated the findings from an information- d) unsigned turns, e) signed turns, f) topic title scientific analysis of information quality in Wikipedia (Stvilia et al., 2008), which identifies cally classify user contributions in Wikipedia Talk twelve types of quality problems, like e.g. Accu- pages. Furthermore, there is no corpus available racy, Completeness or Relevance. Our resulting that reflects the efforts of article improvement in tagset consists of 17 labels (cf. Table 1) which can Wikipedia discussions. This is the subject of our be subdivided into four higher level categories: work. Article Criticism Denote comments that iden- tify deficiencies in the article. 
The criticism 3 Annotation Schema can refer to the article as a whole or to indi- The main purpose of Wikipedia Talk pages is the vidual parts of the article. coordination of the editing process with the goal Explicit Performative Announce, report or sug- of improving and sustaining the quality of the re- gest editing activities. spective article. The criteria for article quality in Wikipedia are loosely defined in the guidelines for Information Content Describe the direction of “good articles”2 and “very good articles”3 . Ac- the communication. A contribution can be cording to these guidelines, distinguished articles used to communicate new information to must be well-written in simple English, compre- others (IP), to request information (IS), or hensive, neutral, stable, accurate, verifiable and to suggest changes to established facts (IC). follow the Wikipedia style guidelines4 . These cri- The IP label applies to most of the contri- teria are the main points of reference in the dis- butions as most comments provide a certain cussions on the Talk pages. amount of new information. Discourse analysis, as it is performed in this pa- Interpersonal Describe the attitude that is ex- per, can be carried out on various levels, depend- pressed towards other participants in the dis- ing on what is regarded as the smallest unit of the cussion and/or their comments. discourse. In this work, we focus on turns, not on individual utterances, as we are interested in a Since a single turn may consist of several utter- coarse-grained analysis of the discourse-structure ances, it can consequently comprise multiple di- as a first step towards a finer-grained discourse alog acts. Therefore, we designed the annotation analysis. We define a turn (or contribution) as the study as a multi-label classification task, i.e. 
4 Corpus Creation and Analysis

The SEWD corpus consists of 100 annotated Talk pages extracted from a snapshot of the Simple English Wikipedia from Apr 4th 2011.5 Technically speaking, a Talk page is a normal Wiki page located in one of the Talk namespaces. In this work, we focus on article Talk pages and do not regard User Talk pages. We selected the discussion pages according to the number of turns they contain. First, we discarded all discussion pages with less than four contributions. We then analyzed the distribution of turn counts per discussion page in the remaining set of pages and defined three classes: (i) discussion pages with 4-10 turns, (ii) pages with 11-20 turns, and (iii) pages with more than 20 turns. We then randomly extracted 50 discussion pages from class (i), 40 pages from class (ii) and 10 pages from class (iii). This decision is grounded in the restricted resources for the human annotation task.

5 The snapshot contains 69900 articles and 5783 Talk pages of which 683 contained more than 3 contributions.

Article Criticism
CM    Content incomplete or lacking detail
      Example: “It should be added (1) that voters may skip preferences, but (2) that skipping preferences has no impact on the result of the elections.”
CW    Lack of accuracy or correctness
      Example: “Kris Kringle is NOT a Germanic god, but an English mispronunciation of Christkind, a German word that means ‘the baby Jesus’.”
CU    Unsuitable or unnecessary content
      Example: “The references should be removed. The reason: The references are too complicated for the typical reader of simple Wikipedia.”
CS    Structural problems
      Example: “Also use sectioning, and interlinking”
CL    Deficiencies in language or style
      Example: “This section needs to be simplified further; there are a lot of words that are too complex for this wiki.”
COBJ  Objectivity issues
      Example: “This article seems to take a clear pro-Christian, anti-commercial view.”
CO    Other kind of criticism
      Example: “I have started an article on Google. It needs improvement though.”
Explicit Performative
PSR   Explicit suggestion, recommendation or request
      Example: “This section needs to be simplified further”
PREF  Explicit reference or pointer
      Example: “Got it. The URL is http://www.dmbeatles.com/history.php?year=1968”
PFC   Commitment to an action in the future
      Example: “Okay, I forgot to add that, I’ll do so later tonight.”
PPC   Report of a performed action
      Example: “I took and hopefully simplified the ‘[[en:Prehistoric music|Prehistoric music]]’ article from EnWP”
Information Content
IP    Information providing
      Example: “‘Depression’ is the most basic term there is.”
IS    Information seeking
      Example: “So what kind of theory would you use for your music composing?”
IC    Information correcting
      Example: “In linguistics and generally speaking, when talking about the lexicon in a language, words are usually categorized as ‘nouns’, ‘verbs’, ‘adjectives’ and so on. The term ‘doing word’ does not exist.”
Interpersonal
ATT+  Positive attitude towards other contributor or acceptance
      Example: “Thank you.”
ATTP  Partial acceptance or partial rejection
      Example: “Okay, I can understand that, but some citations are going to have to be included for [[WP:V]].”
ATT-  Negative attitude towards other contributor or rejection
      Example: “Now what? You think you know so much about everything, and you are not even helping?!”

Table 1: Annotation schema for the dialog act classification in Wikipedia discussion pages with examples from the SEWD Corpus. Some examples have been shortened to fit the table.
Data Preprocessing Due to a lack of discussion structure, extracting the discussion threads from the Talk pages requires a substantial amount of preprocessing. Laniado et al. (2011) tackle the thread extraction by using text indentation and inserted user signatures as clues. We found these attributes to be insufficient for a reliable reconstruction of the thread structure.6

6 Viégas et al. (2007) reported that only 67% of the contributions on Wikipedia Talk pages are signed, which makes signatures an unreliable predictor for turn boundaries.

Our preprocessing approach consists of three steps: data retrieval, topic segmentation and turn segmentation. For retrieving the discussion pages, we use the Java Wikipedia Library (JWPL) (Zesch et al., 2008), which offers efficient, database-driven access to the contents of Wikipedia. We segment the individual Talk pages into discussion topics using the MediaWiki parser that comes with JWPL. In our corpus, the parser managed to identify all topic boundaries without any errors. The most complex preprocessing step is the turn segmentation.
First, we use the revision history of the Talk page to identify the author and the creation time of each paragraph. We use the Wikipedia Revision Toolkit (Ferschke et al., 2011) to examine the changes between adjacent revisions of the Talk page in order to identify the exact time a piece of text was added as well as the author of the contribution. We have to filter out malicious edits from the history, as they would negatively affect the segmentation process. We therefore disregard all edits that are reverted in later revisions. In contrast to vandalism on article pages, this approach has proven to be sufficient to detect vandalism in the Talk page history.

Within each discussion topic, we aggregate all adjacent paragraphs with the same author and the same time stamp to one turn. In order to account for turns that were written in multiple revisions, we regard all time stamps within a window of 10 minutes7 as belonging to the same turn, unless the page was edited by another user in the meantime. Finally, the turn is marked with the indentation level of its least indented paragraph. This information is used to identify the relationship between the turns, since indentation is used to indicate a reply to an existing comment in the discussion.

7 We experimentally tested values between 1 and 60 minutes.

A co-author of this paper evaluated the acceptability of the boundaries of each turn in the SEWD corpus and found that 94% of the 1450 turns were correctly segmented. Turns with segmentation errors were not included in the gold standard.
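The aggregation rule just described (adjacent paragraphs by the same author within a 10-minute window, unless another user edited in between) can be sketched as follows. The paragraph representation and the timestamps are invented for illustration, and as a simplification a merged turn keeps the time stamp of its first paragraph:

```python
# Sketch of the turn segmentation rule: adjacent paragraphs by the same
# author whose time stamps fall within a 10-minute window are merged into
# one turn; an intervening edit by another user breaks the adjacency.
WINDOW = 10 * 60  # seconds

def segment_turns(paragraphs):
    """paragraphs: list of (author, timestamp_seconds, text), in page order."""
    turns = []
    for author, ts, text in paragraphs:
        if turns:
            last_author, last_ts, last_text = turns[-1]
            if author == last_author and ts - last_ts <= WINDOW:
                turns[-1] = (author, ts, last_text + "\n" + text)
                continue
        turns.append((author, ts, text))
    return turns

paras = [("A", 0, "p1"), ("A", 120, "p2"), ("B", 180, "p3"), ("A", 240, "p4")]
# A's first two paragraphs merge; B's edit separates A's last paragraph.
assert [t[0] for t in segment_turns(paras)] == ["A", "B", "A"]
```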
Annotation Process For our annotation study, we used the freely available MMAX2 annotation tool8. Two annotators were introduced to the annotation schema by an instructor and trained on an extra set of ten discussion pages. During the annotation of the corpus, the annotators were allowed to discuss difficult cases and could consult the instructor if in doubt. They had access to the segmented discussion pages within the MMAX2 tool as well as to the original Wikipedia articles and discussion pages on the web.

8 http://www.mmax2.net

The reconciliation of the annotations was carried out by an expert annotator. In order to obtain a consolidated gold standard, the expert decided all cases in which the annotations of the two annotators did not match. Descriptive statistics for the label assignments of each annotator and for the gold standard can be seen in Table 2 and will be further discussed in Section 4.2.

Corpus Format We publish our SEWD corpus in two formats9, the original MMAX format, and as XMI files for further processing with the Apache Unstructured Information Management Architecture10. For the latter format, we also provide the type system which defines all necessary corpus specific types needed for using the data in an NLP pipeline.

9 http://www.ukp.tu-darmstadt.de/data/wikidiscourse
10 http://uima.apache.org

4.1 Inter-Annotator Agreement

To evaluate the reliability of our dataset, we perform a detailed inter-rater agreement study. For measuring the agreement of the individual labels, we report the observed agreement, Kappa statistics (Carletta, 1996), and F1-scores. The latter are computed by treating one annotator as the gold standard and the other one as predictions (Hripcsak and Rothschild, 2005). The scores can be seen in Table 2.

The average observed agreement across all labels is P̄_O = .94. The individual Kappa scores largely fall into the range that Landis and Koch (1977) regard as substantial agreement, while three labels are above the more strict .8 threshold for reliable annotations (Artstein and Poesio, 2008).
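Computing F1 as an agreement measure reduces to ordinary precision and recall once one annotator's label assignments are treated as gold and the other's as predictions. A sketch, with turns identified by invented ids:

```python
# Per-label agreement F1: annotator 1 as "gold", annotator 2 as "prediction"
# (after Hripcsak and Rothschild, 2005). Inputs are the sets of turn ids
# to which each annotator assigned the label in question.
def agreement_f1(gold_ids, pred_ids):
    tp = len(gold_ids & pred_ids)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_ids)
    recall = tp / len(gold_ids)
    return 2 * precision * recall / (precision + recall)

a1 = {1, 2, 3, 4}   # turns annotator 1 labeled, e.g., with PSR
a2 = {3, 4, 5}      # turns annotator 2 labeled with PSR
assert round(agreement_f1(a1, a2), 4) == round(4 / 7, 4)
```

Note that, unlike kappa, this measure needs no chance correction, which is why it is usable when the number of true negatives is ill-defined.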
Furthermore, we obtain an overall pooled Kappa (De Vries et al., 2008) of κ_pool = .67, which is defined as

    κ_pool = (P̄_O − P̄_E) / (1 − P̄_E)    (1)

with

    P̄_O = (1/L) Σ_{l=1}^{L} P_O^l ,   P̄_E = (1/L) Σ_{l=1}^{L} P_E^l    (2)

where L denotes the number of labels, P_E^l the expected agreement and P_O^l the observed agreement of the l-th label. κ_pool is regarded to be more accurate than an averaged Kappa.

         Annotator 1     Annotator 2     Inter-Annotator Agreement      Gold Standard
Label    N     Percent   N     Percent   N_A1∪A2   P_O   κ     F1      N     Percent
Article Criticism
CM       183   13.4%     105   7.7%      193       .93   .63   .66     116   8.5%
CW       106   7.8%      57    4.2%      120       .95   .52   .55     70    5.1%
CU       69    5.0%      35    2.6%      83        .95   .38   .40     42    3.1%
CS       164   12.0%     101   7.4%      174       .94   .66   .69     136   9.9%
CL       195   14.3%     199   14.6%     244       .93   .73   .77     219   16.0%
COBJ     27    2.0%      23    1.7%      29        .99   .84   .84     27    2.0%
CO       20    1.5%      59    4.3%      71        .95   .18   .20     48    3.5%
Explicit Performative
PSR      458   33.5%     351   25.7%     503       .86   .66   .76     406   29.7%
PREF     43    3.1%      31    2.3%      51        .98   .61   .62     45    3.3%
PFC      73    5.3%      65    4.8%      86        .98   .76   .77     77    5.6%
PPC      357   26.1%     340   24.9%     371       .97   .92   .94     358   26.2%
Information Content
IP       1084  79.3%     1027  75.1%     1135      .89   .69   .93     1070  78.3%
IS       228   16.7%     208   15.2%     256       .95   .80   .83     220   16.1%
IC       187   13.7%     109   8.0%      221       .89   .46   .51     130   9.5%
Interpersonal
ATT+     71    5.2%      140   10.2%     151       .94   .55   .58     144   10.5%
ATTP     71    5.2%      30    2.2%      79        .96   .42   .44     33    2.4%
ATT-     67    4.9%      74    5.4%      100       .96   .56   .58     87    6.4%

Table 2: Label frequencies and inter-annotator agreement. N_A1∪A2 denotes the number of turns that have been labeled with the given label by at least one annotator. P_O denotes the observed agreement.
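Equations (1) and (2) translate directly into code once each label's annotator decisions are summarized in a 2x2 contingency table; the table layout below is an assumption for illustration:

```python
# Pooled kappa over L binary labels (after De Vries et al., 2008): average
# the observed and expected agreements first, then apply the kappa formula once.
def pooled_kappa(tables):
    """tables: list of (both_yes, only_a1, only_a2, both_no) counts per label."""
    po_sum = pe_sum = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        po_sum += (a + d) / n                        # observed agreement P_O^l
        p1_yes, p2_yes = (a + b) / n, (a + c) / n    # marginal label rates
        pe_sum += p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)  # expected P_E^l
    po_bar, pe_bar = po_sum / len(tables), pe_sum / len(tables)
    return (po_bar - pe_bar) / (1 - pe_bar)

# Perfect agreement on every label gives kappa = 1.
assert pooled_kappa([(10, 0, 0, 90), (5, 0, 0, 95)]) == 1.0
```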
For assessing the overall inter-rater reliability of the label set assignments per turn, we chose Krippendorff’s Alpha (Krippendorff, 1980) using MASI, a measure of agreement on set-valued items, as the distance function (Passonneau, 2006). MASI accounts for partial agreement if the label sets of both annotators overlap in at least one label. We achieved an Alpha score of α = .75. According to Krippendorff, datasets with this score are considered reliable and allow tentative conclusions to be drawn.

The CO label showed the lowest agreement of only κ = .18. The label was supposed to cover any criticism that is not covered by a dedicated label. However, the annotators reported that they chose this label when they were unsure whether a particular criticism label would fit a certain turn or not.

Labels in the interpersonal category all show agreement scores below 0.6. It turned out that the annotators had a different understanding of these labels. While one annotator assigned the labels for any kind of positive or negative sentiment, the other used the labels to express agreement and disagreement between the participants of a discussion.

A common problem for all labels were contributions with a high degree of indirectness and implicitness. Indirect contributions have to be interpreted in the light of conversational implicature theory (Grice, 1975), which requires contextual knowledge for decoding the intentions of a speaker. For example, the message

    Is population density allowed to be n/a?

has the surface form of a question. However, the context of the discussion revealed that the author tried to draw attention to the missing figure in the article and requested it to be filled or removed. The annotators rarely made use of the context, which was a major source for disagreement in the study.
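A common formulation of the MASI distance, as implemented for instance in NLTK, combines the Jaccard overlap of the two label sets with a monotonicity weight; a sketch:

```python
# MASI distance between two label sets (after Passonneau, 2006): one minus
# the Jaccard similarity times a monotonicity weight (1 for identical sets,
# 2/3 if one set contains the other, 1/3 for partial overlap, 0 if disjoint).
def masi_distance(a, b):
    if not a and not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif a & b:
        m = 1 / 3
    else:
        m = 0.0
    return 1.0 - jaccard * m

assert masi_distance({"PSR", "IP"}, {"PSR", "IP"}) == 0.0   # full agreement
assert masi_distance({"PSR"}, {"IP"}) == 1.0                # disjoint sets
assert round(masi_distance({"PSR"}, {"PSR", "IP"}), 4) == 0.6667  # subset
```

Plugging this distance into a Krippendorff's Alpha computation yields the set-valued reliability score reported above.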
Another difficulty for the annotators were long discussion turns. While the average turn consists of 42 tokens, the largest contribution in the corpus is 658 tokens long. Turns of this size can cover multiple aspects and potentially comprise many different dialog acts, which increases the probability of disagreement. This issue can be addressed by going from the turn level to the utterance level in future work.

A comparison of our results with the agreement reported for other datasets shows that the reliability of our annotations lies well within the field of the related work. Bender et al. (2011) carried out an annotation study of social acts in 365 discussions from 47 Wikipedia Talk pages. They report Kappa scores for thirteen labels in two categories ranging from .13 to .66 per label. The overall agreement for each category was .50 and .59, respectively, which is considerably lower than our κ_pool = .67. Kim et al. (2010b) annotate pairs of posts taken from an online forum. They use a dialog act tagset with twelve labels customized for modeling troubleshooting-oriented forum discussions. For their corpus of 1334 posts, they report an overall Kappa of .59. Kim et al. (2010a) identify unresolved discussions in student online forums by annotating 1135 posts with five different speech acts. They report Kappa scores per speech act between .72 and .94. Their better results might be due to a more coarse grained label set.
4.2 Corpus Analysis

The SEWD corpus contains 313 discussions consisting of 1367 turns by 337 users. The average length of a turn is 42 words. 208 of the 337 contributors are registered Wikipedia users, 129 wrote anonymously. On average, each contributor wrote 168 words in 4 turns. However, there was a cluster of 16 people with ≥ 20 contributions.

Table 2 shows the frequencies of all labels in the SEWD corpus. The most frequent labels are information providing (IP), requests (PSR) and reports of performed edits (PPC). The IP-label was assigned to more than 78% of all 1367 turns, because almost every contribution provides a certain amount of information. The label was only omitted if a turn merely consisted of a discussion template but did not contain any text or if it exclusively contained questions.

More than a quarter of the turns are labeled with PSR and PPC, respectively. This indicates that edit requests and reports of performed edits are the main subject of discussion. Generally, it is more common that edits are reported after they have been made than to announce them before they are carried out, as can be seen in the ratio of PPC to PFC labels. The number of turns labeled with PSR is almost the same as the number of contributions labeled with either PPC or PFC. This allows the tentative conclusion that nearly all requests potentially lead to an edit action. As a matter of fact, the most common label adjacency pair11 in the corpus is PSR→PPC, which substantiates this assumption.

11 A label transition A→B is recorded if two adjacent turns are labeled with A and B, respectively.

Article criticism labels have been assigned to 39.4% of all turns. Almost half (241) of the labels from this class are assigned to the first turn of a discussion. This shows that it is common to open a discussion in reference to a particular deficiency of the article. The large number of CL labels compared to other labels from the same category is due to the fact that the Simple English Wikipedia requires authors to write articles in a way that they are understandable for non-native speakers of English. Therefore, the use of adequate language is one of the major concerns of the Simple English Wikipedia community.
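The adjacency-pair statistic defined in footnote 11 can be counted in one pass over each discussion; the threads and labels below are invented for illustration:

```python
# Counting label adjacency pairs: a transition A -> B is recorded whenever
# two adjacent turns carry labels A and B, respectively (example data invented).
from collections import Counter

def adjacency_pairs(threads):
    """threads: list of discussions, each a list of per-turn label sets."""
    pairs = Counter()
    for turns in threads:
        for prev, curr in zip(turns, turns[1:]):
            for a in prev:
                for b in curr:
                    pairs[(a, b)] += 1
    return pairs

threads = [[{"PSR"}, {"PPC"}], [{"PSR", "IP"}, {"PPC", "IP"}]]
counts = adjacency_pairs(threads)
assert counts[("PSR", "PPC")] == 2
```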
5 Automatic Dialog Act Classification

For the automatic classification of dialog acts in Wikipedia Talk pages, we transform the multi-label classification problem into a binary classification task (Tsoumakas et al., 2010). We train a binary classifier for each label using the WEKA data-mining software (Hall et al., 2009). We use three learners for the classification task, a Naive Bayes classifier, J48, an implementation of the C4.5 decision tree algorithm (Quinlan, 1992) and SMO, an optimization algorithm for training support vector machines (Platt, 1998). Finally, we combine the best performing learners for each label in a UIMA-based classification pipeline (Ferrucci and Lally, 2004).

Features for Dialog Act Classification As features, we use all uni-, bi- and trigrams that occurred in at least three different turns. Furthermore, we include the time distance to the previous and the next turn (in seconds), the length of the current, previous and next turn (in tokens), the position of the turn within the discussion, the indentation level of the turn and two binary features indicating whether a turn references or is referenced by another turn.12 In order to capture the sequential nature of the discussions, we use the n-grams of the previous and the next turn as additional features.

12 A turn Y references a preceding turn X if the indentation level of Y is one level deeper than of X.

Balancing Positive and Negative Instances Since the number of positive instances for each label is small compared to the number of negative instances, we create a balanced dataset which contains an equal amount of positive and negative instances. Therefore, we randomly select the appropriate number of negative instances and discard the rest. This improves the classification performance on every label for all three learners.

Label   Human   Base   Naive Bayes   J48    SMO    Best
CM      .66     .07    .68           .48    .66    .68
CW      .55     .01    .70           .20    .56    .70
CU      .40     .07    .66           .35    .59    .66
CS      .69     .09    .67           .67    .75    .75
CL      .77     .11    .70           .66    .73    .73
COBJ    .84     .04    .78           .51    .63    .78
CO      .20     .02    .61           .06    .39    .61
PSR     .76     .30    .72           .70    .76    .76
PREF    .62     .00    .76           .41    .64    .76
PFC     .77     .04    .70           .62    .73    .73
PPC     .94     .25    .74           .82    .85    .85
IP      .93     .74    .83           .93    .93    .93
IS      .83     .16    .79           .86    .85    .86
IC      .51     .06    .67           .32    .59    .67
ATT+    .58     .10    .61           .65    .72    .72
ATTP    .44     .03    .72           .25    .62    .72
ATT-    .58     .07    .52           .30    .52    .52
Macro   .65     .13    .70           .52    .68    .73
Micro   .79     .35    .74           .75    .80    .82

Table 3: F1-Scores for the balanced set with feature selection on 10-fold cross-validation. Base refers to the baseline performance, Best to our classification pipeline.
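The binary-relevance transformation together with the undersampling step can be sketched as follows; the instance representation is invented, and WEKA's own filters are not reproduced here:

```python
# Binary relevance with undersampling: one balanced binary dataset per label.
# Positive = turn carries the label; negatives are randomly downsampled to match.
import random

def binary_dataset(instances, label, seed=0):
    """instances: list of (features, label_set) pairs."""
    pos = [(x, 1) for x, labels in instances if label in labels]
    neg = [(x, 0) for x, labels in instances if label not in labels]
    rng = random.Random(seed)
    neg = rng.sample(neg, min(len(pos), len(neg)))  # discard surplus negatives
    return pos + neg

data = [("t1", {"PSR"}), ("t2", {"IP"}), ("t3", {"PSR", "IP"}), ("t4", set())]
balanced = binary_dataset(data, "PSR")
assert sum(y for _, y in balanced) == 2 and len(balanced) == 4
```

One such dataset is built per label, and an independent binary classifier is trained on each, which mirrors the problem transformation described by Tsoumakas et al. (2010).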
Feature Selection Using the full set of features, we achieve the following macro/micro averaged F1-scores: 0.29 / 0.57 for Naive Bayes, 0.42 / 0.66 for J48 and 0.43 / 0.72 for SMO. To further improve the classification performance, we reduce the feature space using two feature selection techniques, the χ2 metric (Yang and Pedersen, 1997) and the Information Gain approach (Mitchell, 1997). For each label, we train separate classifiers using the top 100, 200 and 300 features obtained by each feature selection technique and choose the best performing set for our final classification pipeline.

Indentation and temporal distance to the preceding turn proved to be the best ranked non-lexical features overall. Additionally, the turn position within the topic was a crucial feature for most labels in the criticism class and for PSR and IS labels. This is not surprising, because article criticism, suggestions and questions tend to occur in the beginning of a discussion. The two reference features have not proven to be useful. The relational information was better covered by the indentation feature. The subjective quality of the lexical features seems to be correlated with the inter-annotator agreement of the respective labels. Features for labels with low agreement contain many n-grams without any recognizable semantic connection to the label. For labels with good agreement, the feature lists almost exclusively contain meaningful lexical cues.
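The χ2 score of a binary feature with respect to a binary label follows from its 2x2 contingency table; a sketch of the ranking step, with invented counts:

```python
# Chi-square feature ranking: score each binary feature against a binary
# label from its 2x2 contingency table, then keep the top-k features.
def chi2(a, b, c, d):
    """a: feature&positive, b: feature&negative, c: no-feature&positive,
    d: no-feature&negative (turn counts)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def top_k(feature_tables, k):
    """feature_tables: dict feature -> (a, b, c, d)."""
    ranked = sorted(feature_tables, key=lambda f: chi2(*feature_tables[f]),
                    reverse=True)
    return ranked[:k]

tables = {"please": (8, 2, 2, 8), "the": (5, 5, 5, 5)}
assert top_k(tables, 1) == ["please"]  # label-correlated n-gram ranks first
```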
Classification Results Table 3 shows the performance of all classifiers and our final classification pipeline evaluated on 10-fold cross validation. Naive Bayes performed surprisingly well and showed the best macro averaged scores among the three learners while SMO showed the best micro averaged performance. We compare our results to a random baseline and to the performance of the human annotators (cf. Table 3 and Figure 2). The baseline assigns the dialog act labels at random according to their frequency distribution in the gold standard. Our classifier outperformed the baseline significantly on all labels.

The comparison with the human performance shows that our system is able to reach the human performance. In most cases, the annotation agreement is reliable, and so are the results of the automatic classification. For the labels CU and CO, the inter-annotator agreement is not high. The comparably good performance of the classifiers on these labels shows that the instances do have shared characteristics. Human raters, however, have difficulties recognizing these labels consistently. Thus, their definitions need to be refined in future work.

Figure 2: F1-Scores for our classification pipeline (Best), the human performance and baseline performance.

To our knowledge, none of the related work on discourse analysis of Wikipedia Talk pages performed automatic dialog act classification. However, there has been previous work on classifying speech acts in other discourse types. Kim et al. (2010a) use Support Vector Machines (SVM) and Transformation Based Learning (TBL) for the automatic assignment of five speech acts to posts taken from student online forums. They report individual F1-scores per label which result in a macro average of 0.59 for SVM and 0.66 for TBL. Cohen et al. (2004) classify speech acts in emails. They train five binary classifiers using several learners on 1375 emails and report F1 scores per speech act between .44 and .85. Despite the larger tagset, our classification approach achieves an average F1-score of .82 and therefore lies in the top ranks of the related work.
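A frequency-based random baseline of the kind described above can be sketched as follows; the label frequencies in the example are illustrative:

```python
# Frequency baseline: assign each dialog act label at random, independently,
# according to its relative frequency in the gold standard (invented values).
import random

def baseline_predict(label_freqs, n_turns, seed=0):
    """label_freqs: dict label -> fraction of turns carrying it."""
    rng = random.Random(seed)
    return [{lab for lab, p in label_freqs.items() if rng.random() < p}
            for _ in range(n_turns)]

preds = baseline_predict({"IP": 0.78, "PSR": 0.30}, n_turns=1000)
ip_rate = sum("IP" in s for s in preds) / len(preds)
assert 0.70 < ip_rate < 0.86  # close to the sampling frequency
```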
6 Conclusions

In this paper, we proposed an annotation schema for the discourse analysis of Wikipedia discussions aimed at the coordination efforts for article improvement. We applied the annotation schema to a corpus of 100 Wikipedia Talk pages, which we make freely available for download. A thorough analysis of the inter-annotator agreement showed that the dataset is reliable. Finally, we performed automatic dialog act classification on Wikipedia Talk pages. Therefore, we combined three machine learning algorithms and two feature selection techniques to a classification pipeline, which we trained on our SEWD corpus. We achieve an average F1-score of .82, which is comparable to the human performance of .79. The ability to automatically classify discussion pages will help to investigate the relations between article discussions and article edits, which is an important step towards understanding the processes of collaboration in large-scale Wikis. Furthermore, it will be the basis for practical applications that bring the hidden content of Talk pages to the attention of article readers.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program “Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz” (LOEWE) as part of the research center “Digital Humanities”.

References

Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596, December.

John L. Austin. 1962. How to Do Things with Words. Clarendon Press, Cambridge, UK.

Emily M. Bender, Jonathan T. Morgan, Meghan Oxley, Mark Zachry, Brian Hutchinson, Alex Marin, Bin Zhang, and Mari Ostendorf. 2011. Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages. In Proceedings of the Workshop on Language in Social Media, pages 48–57, Portland, Oregon, USA.

Jean Carletta. 1996. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249–254.

Tamitha Carpenter and Emi Fujioka. 2011. The Role and Identification of Dialog Acts in Online Chat. In Proceedings of the Workshop on Analyzing Microtext at the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.

William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to Classify Email into “Speech Acts”. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 309–316, Barcelona, ES.
Mark G. Core and James F. Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In Proceedings of the Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, pages 28–35, Cambridge, MA, USA.

Han De Vries, Marc N. Elliott, David E. Kanouse, and Stephanie S. Teleki. 2008. Using Pooled Kappa to Summarize Interrater Agreement across Many Items. Field Methods, 20(3):272–282.

David Ferrucci and Adam Lally. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10:327–348.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia’s Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97–102, Portland, OR, USA.

Paul Grice. 1975. Logic and Conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics, volume 3. New York: Academic Press.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11:10–18.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298.

Dan Jurafsky, Liz Shriberg, and Debbra Biasca. 1997. Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual. Technical Report Draft 13, University of Colorado, Institute of Cognitive Science.

Jihie Kim, Jia Li, and Taehwan Kim. 2010a. Towards Identifying Unresolved Discussions in Student Online Forums. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 84–91, Los Angeles, CA, USA.

Su Nam Kim, Li Wang, and Timothy Baldwin. 2010b. Tagging and linking web forum posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10, pages 192–202, Stroudsburg, PA, USA.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage Publications.

J. Richard Landis and Gary G. Koch. 1977. An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics, 33(2):363–374, June.

David Laniado, Riccardo Tasso, Yana Volkovich, and Andreas Kaltenbrunner. 2011. When the Wikipedians Talk: Network and Tree Structure of Wikipedia Discussion Pages. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, Dublin, IE.

Tom Mitchell. 1997. Machine Learning. McGraw-Hill Education (ISE Editions), 1st edition.

Rebecca Passonneau. 2006. Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, IT.

John C. Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185–208, Cambridge, MA, USA.

Ilona R. Posner and Ronald M. Baecker. 1992. How People Write Together. In Proceedings of the 25th Hawaii International Conference on System Sciences, pages 127–138, Wailea, Maui, HI, USA.

Ross Quinlan. 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1st edition.

Jodi Schneider, Alexandre Passant, and John G. Breslin. 2011. Understanding and Improving Wikipedia Article Discussion Spaces. In Proceedings of the 26th Symposium on Applied Computing, Taichung, TW.

John R. Searle. 1969. Speech Acts. Cambridge University Press, Cambridge, UK.

John R. Searle. 1976. A classification of illocutionary acts. Language in Society, 5:1–23.

Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser. 2008. Information Quality Work Organization in Wikipedia. Journal of the American Society for Information Science, 59:983–1001.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. 2010. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.

Fernanda Viégas, Martin Wattenberg, Jesse Kriss, and Frank Ham. 2007. Talk Before You Type: Coordination in Wikipedia. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, Waikoloa, Big Island, HI, USA.

Eti Yaari, Shifra Baruchson-Arbib, and Judit Bar-Ilan. 2011. Information quality assessment of community generated content: A user study of Wikipedia. Journal of Information Science, 37:487–498.

Yiming Yang and Jan O. Pedersen. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, San Francisco, CA, USA.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, MA.

An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Accommodation

Mahaveer Jain1, John McDonough1, Gahgene Gweon2, Bhiksha Raj1, Carolyn Penstein Rosé1,2
1. Language Technologies Institute; 2.
Human Computer Interaction Institute
Carnegie Mellon University, Pittsburgh, PA 15213
{mmahavee,johnmcd,ggweon,bhiksha,cprose}@cs.cmu.edu

Abstract

Speech style accommodation refers to shifts in style that are used to achieve strategic goals within interactions. Models of stylistic shift that focus on specific features are limited in terms of the contexts to which they can be applied if the goal of the analysis is to model socially motivated speech style accommodation. In this paper, we present an unsupervised Dynamic Bayesian Model that allows us to model speech style accommodation in a way that is agnostic to which specific speech style features will shift in a way that resembles socially motivated stylistic variation. This greatly expands the applicability of the model across contexts. Our hypothesis is that stylistic shifts that occur as a result of social processes are likely to display some consistency over time, and if we leverage this insight in our model, we will achieve a model that better captures inherent structure within speech.

1 Introduction

Sociolinguistic research on speech style and its resulting social interpretation has frequently focused on the ways in which shifts in style are used to achieve strategic goals within interactions, for example the ways in which speakers may adapt their speaking style to suppress differences and accentuate similarities between themselves and their interlocutors in order to build solidarity (Coupland, 2007; Eckert & Rickford, 2001; Sanders, 1987). We refer to this stylistic convergence as speech style accommodation. In the language technologies community, one targeted practical benefit of such modeling has been the achievement of more natural interactions with speech dialogue systems (Levitan et al., 2011).

Monitoring social processes from speech or language data has other practical benefits as well, such as enabling monitoring of how beneficial an interaction is for group learning (Ward & Litman, 2007; Gweon, 2011), how equal participation is within a group (DiMicco et al., 2004), or how conducive an environment is for fostering a sense of belonging and identification with a community (Wang et al., 2011).

Typical work on computational models of speech style accommodation has focused on specific aspects of style that may be accommodated, such as the frequency or timing of pauses or backchannels (i.e., words that show attention like "Uh huh" or "ok"), pitch, or speaking rate (Edlund et al., 2009; Levitan & Hirschberg, 2011). In this paper, we present an unsupervised Dynamic Bayesian Model that allows us to model speech style accommodation in a way that does not require us to specify which linguistic features we are targeting. We explore a space of models defined by two independent factors, namely the direct influence of one speaker's style on another speaker's style, and the influence of the relational gestalt between the two speakers that motivates the stylistic accommodation and thus may keep the accommodation moving consistently, with the same momentum. Prior work has explored the influence of the first factor. However, because accommodation reflects social processes that extend over time within an interaction, one may expect a certain consistency of motion within the stylistic shift. Furthermore, we can leverage this consistency of style shift to identify socially meaningful variation without specifying ahead of time which particular stylistic elements we are focusing on. Our evaluation provides support for this hypothesis.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 787-797, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
When stylistic shifts are focused on specific linguistic features, then measuring the extent of stylistic accommodation is simple, since a speaker's style may be represented in a one or two dimensional space, and movement can then be measured precisely within this space using simple linear functions. However, the rich sociolinguistic literature on speech style accommodation highlights a much greater variety of speech style characteristics that may be associated with social status within an interaction and may thus be beneficial to monitor for stylistic shifts. Unfortunately, within any given context, the linguistic features that have these status associations, which we refer to as indexicality, are only a small subset of the linguistic features that are being used in some way. Furthermore, which features carry this indexicality is specific to a context. Thus, separating the socially meaningful variation from variation in linguistic features occurring for other reasons is akin to searching for the proverbial needle in a haystack. It is this technical challenge that we address in this paper.

In the remainder of the paper we review the literature on speech style accommodation, both from a sociolinguistic perspective and from a technological perspective, in order to motivate our hypothesis and proposed model. We then describe the technical details of our model. Next, we present an experiment in which we test our hypothesis about the nature of speech style accommodation and find statistically significant confirming evidence. We conclude with a discussion of the limitations of our model and directions for ongoing research.

2 Theoretical Framework

Our research goal is to model the structure of speech in a way that allows us to monitor social processes through speech. One common goal of prior work on modeling speech dynamics has been for the purpose of informing the design of more natural spoken dialogue systems (Levitan et al., 2011). The practical goal of our work is to measure the social processes themselves, for example in order to estimate the extent to which group discussions show signs of productive consensus building processes (Gweon, 2011). Much prior work on modeling emotional speech has sought to identify features that themselves have a social interpretation, such as features that predict emotional states like uncertainty (Liscombe et al., 2005) or surprise (Ang et al., 2002), or social strategies like flirting (Ranganath et al., 2009). However, our goal is to monitor social processes that evolve over time and are reflected in the change in speech dynamics. Examples include fostering trust, forming attachments, or building solidarity.

2.1 Defining Speech Style Accommodation

The concept of what we refer to as Speech Style Accommodation has its roots in the field of the Social Psychology of Language, where the many ways in which social processes are reflected through language, and conversely, how language influences social processes, are the objects of investigation (Giles & Coupland, 1991). As a first step towards leveraging this broad range of language processes, we refer to one very specific topic, which has been referred to as entrainment, priming, accommodation, or adaptation in other computational work (Levitan & Hirschberg, 2011). Specifically, we refer to the finding that conversational partners may shift their speaking style within the interaction, either becoming more similar or less similar to one another.

Our usage of the term accommodation specifically refers to the process of speech style convergence within an interaction. Stylistic shifts may occur at a variety of levels of speech or language representation. For example, much of the early work on speech style accommodation focused on regional dialect variation, and specifically on aspects of pronunciation, such as the occurrence of post-vocalic "r" in New York City, that reflected differences in age, regional identification, and socioeconomic status (Labov, 2010a,b). Distribution of backchannels and pauses has also been the target of prior work on accommodation (Levitan & Hirschberg, 2011). These effects may be moderated by other social factors. For example, Bilous & Krauss (1988) found that females accommodated to their male partners in conversation in terms of average number of words uttered per turn. Similarly, Hecht et al. (1989) reported that extroverts are more listener adaptive than introverts, and hence extroverts converged more in their data.

Accommodation can be measured from either the textual or the speech content of a conversation. The former relates to "what" people say, whereas the latter relates to "how" they say it. We are only interested in measuring accommodation from speech in this work. There has been work on convergence in text, such as syntactic adaptation (Reitter et al., 2006) and language similarity in online communities (Huffaker et al., 2006).
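Lexical convergence of this kind is commonly quantified with a simple bag-of-words cosine similarity between turns (the same measure used later in this paper to select content-matched turns for constructed partners). A minimal sketch, with invented example turns:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(turn_a: str, turn_b: str) -> float:
    """Bag-of-words cosine similarity between two conversational turns."""
    a = Counter(turn_a.lower().split())
    b = Counter(turn_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

sim = cosine_similarity("the empire fell because of internal decay",
                        "internal decay caused the empire to fall")
# sim ≈ 0.57 (4 shared word types, 7 word types per turn)
```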
2.2 Social Interpretation of Speech Style Accommodation

It has long been established that while some speech style shifts are subconscious, speakers may also choose to adapt their way of speaking in order to achieve social effects within an interaction (Sanders, 1987). One of the main motives for accommodation is to decrease social distance. On a variety of levels, speech style accommodation has been found to affect the impression that speakers give within an interaction. For example, Welkowitz & Feldstein (1970) found that when speakers become more similar to their partners, they are liked more by their partners. Another study, by Putman & Street Jr (1984), demonstrated that interviewees who converge to the speaking rate and response latency of their interviewers are rated more favorably by the interviewers. Giles et al. (1987) found that more accommodating speakers were rated as more intelligent and supportive by their partners. Conversely, social factors in an interaction affect the extent to which speakers engage in, and sometimes choose not to engage in, accommodation. For example, Purcell (1984) found that Hawaiian children exhibit more convergence in interactions with peer groups that they like more. Bourhis & Giles (1977) found that Welsh speakers answering an English surveyor broadened their Welsh accent when their ethnic identity was challenged. Scotton (1985) found that few people hesitated to repeat lexical patterns of their partners to maintain integrity. Nenkova et al. (2008) found that accommodation on high frequency words correlates with naturalness, task success, and coordinated turn-taking behavior.

2.3 Computational models of speech style accommodation

Prior research has attempted to quantify accommodation computationally by measuring similarity of speech and lexical features, either over full conversations or by comparing the similarity in the first half and the second half of the conversation. For example, Edlund et al. (2009) measure accommodation in pause and gap length using measures such as synchrony and convergence. Levitan & Hirschberg (2011) found that accommodation also appears in special social behaviors within conversation, such as backchannels. They show that speakers in conversation tend to use similar kinds of speech cues, such as high pitch at the end of an utterance, to invite a backchannel from their partner. In order to measure accommodation on these cues, they compute the correlation between the numerical values of these cues used by partners.

In our work we measure accommodation using Dynamic Bayesian Networks (DBNs). Our models are learnt in an unsupervised fashion. What we are specifically interested in is the manner in which the influence of one partner on the other is modeled. What is novel in our approach is the introduction of the concept of an accommodation state, or relational gestalt variable, which essentially models the momentum of the influence that one partner is having on the other partner's speaking style. It allows us to represent structurally the insight that accommodation occurs over time as a reflection of a social process, and thus has some consistency in the nature of the accommodation within some span of time. The prior work described in this section can be thought of as taking the influence of the partner's style directly on the speaker's style within an instant as the floor shifts from one speaker to the next. Thus, no consistency in the manner in which the accommodation is occurring is explicitly encouraged by the model. The major advantage of consistency of motion within the style shift over time is that it provides a signpost for identifying which style variation within the speech is salient with respect to social interpretation within a specific interaction, so that the model may remain agnostic and may thus be applied to a variety of interactions that differ with respect to which stylistic features are salient in this respect.
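The correlation-based measures used in this prior work can be sketched in a few lines; the per-exchange pitch values below are invented for illustration, not data from any of the cited studies:

```python
import numpy as np

# Hypothetical per-exchange mean pitch (Hz) for two conversational partners.
partner_a = np.array([192.0, 187.5, 190.2, 185.1, 188.4, 191.0])
partner_b = np.array([181.3, 183.9, 188.7, 182.0, 186.5, 189.2])

# Pearson correlation between the partners' numerical cue values,
# a simple proxy for accommodation on a single cue.
r = float(np.corrcoef(partner_a, partner_b)[0, 1])
```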
3 A Dynamic Bayesian Network Model for Conversation

Speech stylistic information is reflected in prosodic features such as pitch, energy, speaking rate, etc. In this work, we leverage several of these speech features to quantify accommodation. We propose a series of models that can be trained unsupervised from speech features and can be used for predicting accommodation. The models attempt to capture the dependence of speech features on speaking style, as well as the effect of persistence and accommodation on style. We use a dynamic Bayesian network (DBN) formalism to capture these relationships. Below we briefly review DBNs, and subsequently describe the speech features used and the proposed models.

Figure 1: An example Dynamic Bayesian Network (DBN) showing the temporal relationship between three random variables (A, B and C). A is observed and dependent on two hidden variables B and C. Directed edges across time (t - 1 → t) indicate temporal relationships between variables. In this example, the variables A_t and B_t are both dependent on B_{t-1}, with the relationship defined through the conditional distributions P(A_t | B_{t-1}) and P(B_t | B_{t-1}).

3.1 Dynamic Bayesian Networks

The theory of Bayesian networks is well documented and understood (Jensen, 1996; Pearl, 1988). A Bayesian network is a probabilistic model that represents statistical relationships between random variables via a directed acyclic graph (DAG). Formally, it is a directed acyclic graph whose nodes represent random variables (which may be observable quantities, latent unobservable variables, or hypotheses to be estimated). Edges represent conditional dependencies; nodes which are connected by an edge represent random variables that have a direct influence on one another. The entire network represents the joint probability of all the variables represented by the nodes, with appropriate factoring of the conditional dependencies between variables.

Consider, for instance, a joint distribution over a set of random variables x_1, x_2, ..., x_n, modeled by a Bayesian network. Let V = {v_1, v_2, ..., v_n} represent the set of n nodes in the network, representing the random variables x_1, x_2, ..., x_n respectively. Let ℘(v_i) represent the set of parent nodes of v_i, i.e. nodes in V that have a directed edge into node v_i. Then, by the dependencies specified by the network, P(x_i | x_1, x_2, ..., x_n) = P(x_i | x_j : v_j ∈ ℘(v_i)). In other words, any variable x_i is directly dependent only on its parent variables, i.e. the random variables represented by the nodes in ℘(v_i), and is independent of all other variables given these variables. The joint probability of x_1, x_2, ..., x_n is hence given by

    p(x_1, x_2, ..., x_n) = ∏_i p(x_i | x_{π_i})    (1)

where x_{π_i} represents {x_j : v_j ∈ ℘(v_i)}, i.e. the parents of x_i in the network. We note that not all of these variables need to be observable; often in such models several of the variables are unobservable, i.e. they are latent. In order to obtain the joint distribution of the observable variables, the latent variables must be marginalized out. I.e., if x_1, ..., x_m are observable and x_{m+1}, ..., x_n are latent, P(x_1, ..., x_m) = Σ_{x_{m+1}, ..., x_n} P(x_1, x_2, ..., x_n).

Dynamic Bayesian networks (DBNs) further represent time-series data through a recurrent formulation of a basic Bayesian network that represents the relationship between variables. Within a DBN a set of random variables at each time instance t is represented as a static Bayesian network, with temporal dependencies to variables at other instants. Namely, the distribution of a variable x_{i,t} at time t is dependent on other variables at times t - τ, x_{j,t-τ}, through conditional probabilities of the form Pr(x_{i,t} | x_{j,t-τ}). An example DBN, consisting of three variables (A, B and C), two of which have temporal dependencies, is shown in Figure 1.

One benefit of the DBN formalism is that in addition to providing a compact graphical way of representing statistical relationships between variables in a process, the constrained, directed network structure also allows for simplified inference. Moreover, the conditional distributions associated with the network are often assumed not to vary over time, i.e. Pr(x_{i,t} | x_{j,t-τ}) = Pr(x_{i,t'} | x_{j,t'-τ}). This allows for a very compact representation of DBNs and allows for efficient Expectation-Maximization (EM) learning algorithms to be applied.

In the discussion that follows we do not explicitly specify the random variables and the form of the associated probability distributions, but only present them graphically. The joint distribution of the variables should nevertheless be obvious from the figures. We employ EM to learn the parameters of the models from training data, and the junction tree algorithm (Lauritzen & Spiegelhalter, 1988) to perform inference.

Figure 2: The basic generative model.

Figure 3: ISM: The dynamics of each speaker are independent of the other speaker.

3.2 Speech Features

We characterize conversations as a series of spoken turns by the partners. We characterize the speech in each turn through a vector that captures several aspects of the signal that are salient to style. We used the OpenSMILE toolkit (opensmile, 2011) to compute the features. Specifically, within each turn the speech was segmented into analysis windows of 50ms, where adjacent windows overlapped by 40ms. From each analysis window a total of 7 features were computed: voice probability, harmonic-to-noise ratio, voice quality, three measures of pitch (F0, F0raw, F0env), and loudness. A 10-bin histogram of feature values was computed for each of these features, which was then normalized to sum to 1.0. The normalized histogram effectively represents both the values and the fluctuation of the features. For instance, a histogram of loudness values captures the variation in the loudness of the speaker within a turn. The logarithms of the normalized 10-bin histograms for the 7 features were concatenated to result in a single 70-dimensional observation vector for the turn. These 70-dimensional observation vectors for each turn of any speaker are represented in our model as o^i_t, where t is the turn index and i is the speaker index.

3.3 Elements of the Models

In this section we formally describe the elements of our model.

Speaking Style State: These states represent the speaking styles of the partners in a conversation. We represent these states as s^i_t, where t represents turn index and i represents speaker index. These states are assumed to belong to a finite, discrete set S = {s_1, s_2, ..., s_k}, i.e. s^i_t ∈ S ∀(i, t).

Accommodation State: An accommodation state represents the indirect influence of partners on each other in a conversation. In our present design, it can take a value of either 1 or 0. These states are represented as A_t, where t is the turn index.

Observation Vector: The observation vectors are the feature vectors o^i_t computed for each turn.
3.4 Models for Accommodation

Our models embody two premises. First, a person's speech in any turn is a function of his/her speaking style in that turn. Second, a person's speaking style at any turn depends not only on their own personal biases, but also on their accommodation to their partner. We represent these dependencies as a DBN.

Our basic model to represent the generation of speech (i.e. speech features) by a speaker in the absence of other influences is shown in Figure 2. The speech features o^i_t in any turn depend only on the speaking style s^i_t in that turn. The style s^i_t in any turn depends on the style s^i_{t-1} in the previous turn, to capture the speaker-specific patterns of variation in speaking style. We note that this is a rather simple model, and patterns of variation in style are captured only through the statistical dependence between styles in consequent turns. We now build our models for accommodation on this basic model.

3.4.1 Style-based models

Our first two models assume that accommodation is demonstrated as a direct dependence of a person's speaking style on their partner's style. Therefore the models only consider speaking styles.

The Independent Speaker Model
Our simplest model for a conversation assumes that each person's speaking style evolves independently, uninfluenced by their partner. The DBN for this is shown in Figure 3. We refer to this model as the Independent Speaker Model (ISM). Note that the set of values that the style states can take is common for both speakers. The speaking styles for the two speakers may be said to be confluent in any turn if both of them are in the same style state at that turn.

The Cross-speaker Dependence Model
Intuitively, in a conversation speakers are influenced by their partners' speaking style in previous turns. The Cross-Speaker Dependence Model (CSDM) represents this dependence as shown in the DBN in Figure 4. In this model a person's speaking style depends on both their own and their partner's speaking styles in the previous turn.

Figure 4: CSDM: A speaker's style depends on their partner's style at the previous turn.

3.4.2 Accommodation state models

Accommodation state models assume that conversations actually have an underlying state of accommodation, and that speakers in fact vary their speaking styles in response to it. We model this through a binary-valued accommodation state that is embedded into the DBN. We posit two types of accommodation state models.

The Symmetric Accommodation State Model
In the symmetric accommodation state model (SASM) we assume that accommodation is a jointly experienced characteristic of the conversation at any time, which enjoys some persistence, but is also affected by the speaking styles exhibited by the speakers at each turn. The accommodation at any time in turn affects the speaking styles of both speakers in the next turn. The DBN for this model is shown in Figure 5.

Figure 5: SASM: Both partners' styles depend on mutual accommodation to one another.

The Asymmetric Accommodation State Model
The asymmetric accommodation state model (AASM) represents accommodation as a speaker-turn-specific characteristic. In any turn, the accommodation for a speaker depends chiefly on their partner's most recent speaking style. The accommodation state can change after each speaker turn. Figure 6 shows the DBN for this model. Note that this model captures the asymmetric nature of accommodation, e.g. it may be the case that only one of the speakers is accommodating. For instance, if a^1_t = 0 and a^2_t = 1, only speaker 2 is accommodating but not speaker 1.

Figure 6: AASM: An accommodation state is associated with every speaker turn.

3.4.3 Accommodated style dependence models

While accommodation state models explicitly model accommodation, they do not explicitly represent how it is expressed. In reality, accommodation is a process of convergence: an accommodating speaker's speaking style may be expected to converge toward that of their partner. In other words, a person's speaking style depends not only on whether they are accommodating or not, but also on their partner's style at the previous turn. Accommodated style dependence models explicitly represent this dependence.

The Symmetric Accommodated Style Dependence Model
The Symmetric Accommodated Style Dependence Model (SASDM) extends the SASM to indicate that a speaker's style in any turn depends both on accommodation and on their partner's style in the previous turn. Figure 7 shows the DBN for this model.

Figure 7: SASDM: A speaker's style depends both on mutual accommodation and the partner's style in the previous turn.

The Asymmetric Accommodated Style Dependence Model
The Asymmetric Accommodated Style Dependence Model (AASDM) extends the AASM by adding a direct dependence between a speaker's style and their partner's style in their most recent turn. The DBN for this is shown in Figure 8.

Figure 8: AASDM: The accommodation state is associated with every speaker, and a speaker's style depends on the partner's style.

3.5 Interpreting the states

We note that we have referred to the states in the models above as "style" states. In reality, in all cases, we learn the parameters of the model in an unsupervised manner, since the data we use to train it do not have either speaking style or accommodation indicated (although, if they were labeled, the labels could be employed within our models). Consequently, we have no assurance that the states learned will actually correspond to speaking styles. They can only be considered a proxy for speaking style. Nevertheless, if both speakers are in the same state, they can both be expected to be producing similar prosodic features, as represented in the observation vectors. It is hence reasonable to assume that they are both speaking in a similar style. Similarly, the accommodation state cannot be expected to actually depict accommodation; nevertheless, it can capture the dependencies that govern when the two speakers are likely to be in the same state.
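To make the factorization of equation (1) concrete, marginalizing the latent style chain of the basic generative model (Figure 2) for one speaker reduces to a standard forward recursion over the style states. All probability tables below are invented for illustration, not learned values:

```python
import numpy as np

# Toy two-state version of the basic generative model: a latent style
# chain s_t emitting one discrete observation o_t per turn.
pi = np.array([0.6, 0.4])       # P(s_1)
A = np.array([[0.8, 0.2],       # P(s_t | s_{t-1})
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],       # P(o_t | s_t)
              [0.2, 0.8]])

def likelihood(obs):
    """Marginalize the latent style states (the sum in Section 3.1)
    via the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

p = likelihood([0, 0, 1])
```

The same marginalization is what EM relies on when fitting the larger two-speaker models, where the sum additionally ranges over the partner's style states and the accommodation states.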
Nevertheless, if both test our hypothesis using a 2 × 3 factorial design speakers are in the same state, they can both be in which one factor is the inclusion of direct expected to be producing similar prosodic fea- links from the style of one speaker to the style 793 of the other speaker, which we refer to as the factors. Furthermore, because the participants did DirectInfluence (DI) factor, with values True not know each other before the debate, we can (T) and False (F), and the second factor is the assume that if accommodation happened, it was inclusion of links from style states to and from only during the conversation. Accommodation states, which we refer to as the Real versus Constructed Pairs: In our analy- IndirectInfluence (II) factor, with values False sis below, we compare measured accommodation (F), Asymmetric (A), and Symmetric (S). The between pairs of humans who had a real conver- result of this 2 × 3 factorial design are the 6 sation and a constructed pair in which one per- different models described in Section 3, namely son from that conversation is paired with a con- ISM (DI=False, II=False), CSDM (DI=True, structed partner, where the partner’s side of the II=False), SASM (DI=False, II=Symmetric), conversation was constructed from turns that oc- AASM (DI=False, II=Asymmetric), SASDM curred in other conversations. We set up this com- (DI=True, II=Symmetric), and AASDM parison in order to isolate speech style conver- (DI=True, II= Asymmetric). gence from lexical convergence when we evalu- Corpus: The success criterion in our experiment ate the performance of our model. The difference is the extent to which models of speech style between the measured accommodation between accommodation are able to distinguish between real and constructed pairs is treated as a weak op- Real Pairs and Constructed pairs. In order to set erationalization of model accuracy at measuring up this comparison, we began with a corpus of de- speech style accommodation. 
Real versus Constructed Pairs: In our analysis below, we compare measured accommodation between pairs of humans who had a real conversation and constructed pairs in which one person from that conversation is paired with a constructed partner, where the partner's side of the conversation was constructed from turns that occurred in other conversations. We set up this comparison in order to isolate speech style convergence from lexical convergence when we evaluate the performance of our model. The difference between the measured accommodation of real and constructed pairs is treated as a weak operationalization of model accuracy at measuring speech style accommodation.

Corpus: The success criterion in our experiment is the extent to which models of speech style accommodation are able to distinguish between Real Pairs and Constructed Pairs. In order to set up this comparison, we began with a corpus of debates between students about the reasons for the fall of the Ottoman Empire. We obtained this corpus from researchers who originally collected it to investigate issues related to learning from conversational interactions (Nokes et al., 2010). The full corpus contains interactions between 76 pairs of students who interacted for 8 minutes. Within each pair, one student was assigned the role of arguing that the fall of the Ottoman Empire was due to internal causes, whereas the other student was assigned the role of arguing that it was due to external causes. Each student was given a 4-page packet of supporting information for their side of the debate to draw from in the interaction.

The speech from each participant was recorded on a separate channel. As a first step, we aligned the speech recordings automatically to their transcriptions at the word and turn level. After aligning the corpus at the word level, we identified the turn interval of each partner in the conversation.
We use 66 of the debates out of the complete set of 76 for the experiments discussed in this paper; we had to eliminate 10 dialogues where the segmentation and alignment failed. For each of our models, we used the same 3-fold cross-validation.

Participants: Participants were all male undergraduate students between the ages of 18 and 25. In prior studies, it has been shown that accommodation varies based on gender, age, and familiarity between partners. This corpus is particularly appropriate because it controls for most of these factors. Furthermore, because the participants did not know each other before the debate, we can assume that if accommodation happened, it was only during the conversation.

For each of the 20 Real pairs in the test corpus we composed one Constructed Pair. Each Constructed Pair comprised one student from the corresponding Real Pair (i.e., the Real Student) and a Constructed Partner that resembled the real partner in content but not necessarily in style. We did this by iterating through the real partner's turns, replacing each with a turn that matched as well as possible in terms of lexical content but came from a different conversation. Lexical content match was measured in terms of cosine similarity, and turns were selected from the other Real pairs. Thus, the Constructed Partner had similar content to the corresponding real partner on a turn-by-turn basis, but the style of expression could not be influenced by the Real Student. Ideally, then, we should not see evidence of speech style accommodation within the Constructed Pairs.
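The turn-replacement procedure just described can be sketched as follows. This is our own minimal illustration; whitespace tokenization and unweighted bag-of-words counts are simplifying assumptions that the paper does not specify:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def construct_partner(real_partner_turns, other_conversation_turns):
    """For each real turn, pick the most lexically similar turn drawn
    from other conversations (bag-of-words cosine), mirroring the
    Constructed Partner procedure at a high level."""
    constructed = []
    for turn in real_partner_turns:
        target = Counter(turn.lower().split())
        best = max(other_conversation_turns,
                   key=lambda t: cosine(target, Counter(t.lower().split())))
        constructed.append(best)
    return constructed
```

Because the substitute turns come from other conversations, their lexical content can match the real partner's while their style cannot have been shaped by the Real Student.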
Experimental Procedure: For each of the four models we computed an Accommodation Score for each of the Real Pairs and Constructed Pairs. In order to obtain a measure that can be used to compute accommodation for all the models considered, we compute the accommodation value as the fraction of turns in a session where partners exhibited the same speaking style.

Results: In order to test our hypothesis we constructed an ANOVA model with Accommodation Score as the dependent variable and DirectInfluence, IndirectInfluence, and RealVsConstructed as independent variables. Additionally, we included the interaction terms between all pairs of independent variables. Using this ANOVA model, we find a highly significant main effect of the RealVsConstructed factor that demonstrates the general ability of the models to achieve separation between Real Pairs and Constructed Pairs; on average F(1,780) = 18.22, p < .0001.

        DI  II  Real µ(σ)   Constructed µ(σ)
SASDM   T   S   .54 (.23)   .44 (.29)
SASM    F   S   .54 (.23)   .44 (.29)
CSDM    T   F   .60 (.26)   .52 (.30)
ISM     F   F   .56 (.25)   .51 (.32)
AASM    F   A   .60 (.24)   .51 (.30)
AASDM   T   A   .61 (.24)   .48 (.30)

Table 1: Accommodation measured using different models. Legend: µ = mean, σ = standard deviation, DI = "Direct Influence", II = "Indirect Influence".
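The per-session Accommodation Score described under Experimental Procedure (the fraction of turns on which the two partners exhibit the same style state) can be sketched as follows; treating the two partners' per-turn state sequences as aligned one-to-one is our assumption:

```python
def accommodation_score(states_a, states_b):
    """Fraction of turns in a session on which the two partners'
    inferred style states agree. states_a and states_b are aligned
    per-turn state sequences (the alignment convention is ours)."""
    if not states_a:
        return 0.0
    agree = sum(1 for s, t in zip(states_a, states_b) if s == t)
    return agree / len(states_a)
```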
Thus, while our eval- when we explore just the interaction between In- uation provides evidence that we have taken a first directInfluence links, we find a significant separa- important step towards our ultimate goal, we can- tion between Real vs Constructed pairs for models not yet claim that we have a model that performs with Accommodation states, but not for the cases equally effectively across contexts. In our future where no Accommodation states are included. work, we plan to formally test the extent to which However, when we do the same for the interaction this allows us to accurately measure accommoda- between DirectInfluence links and RealVsCon- tion within contexts in which very different stylis- structed, we find significant separation with or tic elements carry strategic social value. without those links. This suggests that IndirectIn- Another important direction of our current re- fluence links are more important than DirectInflu- search is to explore how measures of speech style ence links. At a finer-grained level, when we ex- accommodation may predict other important mea- amine the models individually, we only find a sig- sures such as how positively partners view one an- nificant separation between Real and Constructed other, how successful partners perform tasks to- pairs with the model that includes both Direct- gether, or how well students learn together. Influence and Symmetric IndirectInfluence links. These results suggest that Symmetric IndirectIn- 6 Acknowledgments fluence links may be slightly better than Asym- We gratefully acknowledge John Levine and Tim- metric ones, and that combining DirectInfluence othy Nokes for sharing their data with us. This links and Symmetric IndirectInfluence links may work was funded by NSF SBE 0836012. be the best combination. 795 References Lauritzen, S. L. & Spiegelhalter, D. J. (1988). 
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In Proc. ICSLP, volume 3, pages 2037–2040.

Bilous, F. & Krauss, R. (1988). Dominance and accommodation in the conversational behaviours of same- and mixed-gender dyads. Language and Communication, 8(3), 4.

Bourhis, R. & Giles, H. (1977). The language of intergroup distinctiveness. Language, ethnicity and intergroup relations, 13, 119.

Coupland, N. (2007). Style: Language variation and identity. Cambridge University Press.

DiMicco, J., Pandolfo, A., & Bender, W. (2004). Influencing group participation with a shared display. In Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work, pages 614–623. ACM.

Eckert, P. & Rickford, J. (2001). Style and sociolinguistic variation. Cambridge University Press.

Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. In Proc. Interspeech.

Giles, H. & Coupland, N. (1991). Language: Contexts and consequences. Thomson Brooks/Cole Publishing Co.

Giles, H., Mulac, A., Bradac, J., & Johnson, P. (1987). Speech accommodation theory: The next decade and beyond. Communication Yearbook, 10, 13–48.

Gweon, G. A. P. U. M. R. B. R. C. P. (2011). The automatic assessment of knowledge integration processes in project teams. In Proceedings of Computer Supported Collaborative Learning.

Hecht, M., Boster, F., & LaMer, S. (1989). The effect of extroversion and differentiation on listener-adapted communication. Communication Reports, 2(1), 1–8.

Huffaker, D., Jorgensen, J., Iacobelli, F., Tepper, P., & Cassell, J. (2006). Computational measures for language similarity across time in online communities. In ACTS: Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech, pages 15–22.

Jensen, F. V. (1996). An introduction to Bayesian networks. UCL Press.
Labov, W. (2010a). Principles of linguistic change: Internal factors, volume 1. Wiley-Blackwell.

Labov, W. (2010b). Principles of linguistic change: Social factors, volume 2. Wiley-Blackwell.

Lauritzen, S. L. & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, 50, 157–224.

Levitan, R. & Hirschberg, J. (2011). Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. In Proceedings of Interspeech.

Levitan, R., Gravano, A., & Hirschberg, J. (2011). Entrainment in speech preceding backchannels. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 113–117. Association for Computational Linguistics.

Liscombe, J., Hirschberg, J., & Venditti, J. (2005). Detecting certainness in spoken tutorial dialogues. In Proceedings of INTERSPEECH, pages 1837–1840.

Nenkova, A., Gravano, A., & Hirschberg, J. (2008). High frequency word entrainment in spoken dialogue. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

opensmile (2011). http://opensmile.sourceforge.net/.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Purcell, A. (1984). Code shifting Hawaiian style: children's accommodation along a decreolizing continuum. International Journal of the Sociology of Language, 1984(46), 71–86.

Putman, W. & Street Jr, R. (1984). The conception and perception of noncontent speech performance: Implications for speech-accommodation theory. International Journal of the Sociology of Language, 1984(46), 97–114.

Ranganath, R., Jurafsky, D., & McFarland, D. (2009). It's not you, it's me: detecting flirting and its misperception in speed-dates. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 334–342. Association for Computational Linguistics.

Reitter, D., Keller, F., & Moore, J. D. (2006). Computational modelling of structural priming in dialogue. In Proceedings of the Human Language Technology Conference of the NAACL, pages 121–124.
Sanders, R. (1987). Cognitive foundations of calculated speech. State University of New York Press.

Scotton, C. (1985). What the heck, sir: Style shifting and lexical colouring as features of powerful language. Sequence and pattern in communicative behaviour, pages 103–119.

Wang, Y., Kraut, R., & Levine, J. (2011). To stay or leave? The relationship of emotional and informational support to commitment in online health support groups. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work. ACM.

Ward, A. & Litman, D. (2007). Automatically measuring lexical and acoustic/prosodic convergence in tutorial dialog corpora. In Proceedings of the SLaTE Workshop on Speech and Language Technology in Education.

Welkowitz, J. & Feldstein, S. (1970). Relation of experimentally manipulated interpersonal perception and psychological differentiation to the temporal patterning of conversation. In Proceedings of the 78th Annual Convention of the American Psychological Association, volume 5, pages 387–388.

Learning the Fine-Grained Information Status of Discourse Entities

Altaf Rahman and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
{altaf,vince}@hlt.utdallas.edu

Abstract

While information status (IS) plays a crucial role in discourse processing, there have only been a handful of attempts to automatically determine the IS of discourse entities. We examine a related but more challenging task, fine-grained IS determination, which involves classifying a discourse entity as one of 16 IS subtypes. We investigate the use of rich knowledge sources for this task in combination with a rule-based approach and a learning-based approach. In experiments with a set of Switchboard dialogues, the learning-based approach achieves an accuracy of 78.7%, outperforming the rule-based approach by 21.3%.

1 Introduction

A linguistic notion central to discourse processing is information status (IS). It describes the extent to which a discourse entity, which is typically referred to by noun phrases (NPs) in a dialogue, is available to the hearer. Different definitions of IS have been proposed over the years. In this paper, we adopt Nissim et al.'s (2004) proposal, since it is primarily built upon Prince's (1992) and Eckert and Strube's (2001) well-known definitions, and is empirically shown by Nissim et al. to yield an annotation scheme for IS in dialogue that has good reproducibility.[1]

[1] It is worth noting that several IS annotation schemes have been proposed more recently. See Götze et al. (2007) and Riester et al. (2010) for details.
Specifically, Nissim et al. (2004) adopt a three-way classification scheme for IS, defining a discourse entity as (1) old to the hearer if it is known to the hearer and has previously been referred to in the dialogue; (2) new if it is unknown to her and has not been previously referred to; and (3) mediated (henceforth med) if it is newly mentioned in the dialogue but she can infer its identity from a previously-mentioned entity. To capture finer-grained distinctions for IS, Nissim et al. allow an old or med entity to have a subtype, which subcategorizes an old or med entity. For instance, a med entity has the subtype set if the NP that refers to it is in a set-subset relation with its antecedent.

IS plays a crucial role in discourse processing: it provides an indication of how a discourse model should be updated as a dialogue is processed incrementally. Its importance is reflected in part in the amount of attention it has received in theoretical linguistics over the years (e.g., Halliday (1976), Prince (1981), Hajičová (1984), Vallduví (1992), Steedman (2000)), and in part in the benefits it can potentially bring to NLP applications. One task that could benefit from knowledge of IS is identity coreference: since new entities by definition have not been previously referred to, an NP marked as new does not need to be resolved, thereby improving the precision of a coreference resolver. Knowledge of fine-grained or subcategorized IS is valuable for other NLP tasks. For instance, an NP marked as set signifies that it is in a set-subset relation with its antecedent, thereby providing important clues for bridging anaphora resolution (e.g., Gasperin and Briscoe (2008)).
For is primarily built upon Prince’s (1992) and Eck- instance, an NP marked as set signifies that it is in ert and Strube’s (2001) well-known definitions, a set-subset relation with its antecedent, thereby and is empirically shown by Nissim et al. to yield providing important clues for bridging anaphora an annotation scheme for IS in dialogue that has resolution (e.g., Gasperin and Briscoe (2008)). good reproducibility.1 Despite the potential usefulness of IS in NLP Specifically, Nissim et al. (2004) adopt a three- tasks, there has been little work on learning way classification scheme for IS, defining a dis- the IS of discourse entities. To investigate the course entity as (1) old to the hearer if it is known plausibility of learning IS, Nissim et al. (2004) to the hearer and has previously been referred to in annotate a set of Switchboard dialogues with the dialogue; (2) new if it is unknown to her and such information2 , and subsequently present a 1 2 It is worth noting that several IS annotation schemes These and other linguistic annotations on the Switch- have been proposed more recently. See G¨otze et al. (2007) board dialogues were later released by the LDC as part of the and Riester et al. (2010) for details. NXT corpus, which is described in Calhoun et al. (2010). 798 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 798–807, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics rule-based approach and a learning-based ap- the hand-written rules and their predictions di- proach to acquiring such knowledge (Nissim, rectly as features for the learner. In an evalua- 2006). 
More recently, we have improved Nissim's learning-based approach by augmenting her feature set, which comprises seven string-matching and grammatical features, with lexical and syntactic features (Rahman and Ng, 2011; henceforth R&N). Despite the improvements, the performance on new entities remains poor: an F-score of 46.5% was achieved.

Our goal in this paper is to investigate fine-grained IS determination, the task of classifying a discourse entity as one of the 16 IS subtypes defined by Nissim et al. (2004).[3] Owing in part to the increase in the number of categories, fine-grained IS determination is arguably a more challenging task than the 3-class IS determination task that Nissim and R&N investigated. To our knowledge, this is the first empirical investigation of automated fine-grained IS determination.

[3] One of these 16 classes is the new type, for which no subtype is defined. For ease of exposition, we will refer to the new type as one of the 16 subtypes to be predicted.

We propose a knowledge-rich approach to fine-grained IS determination. Our proposal is motivated in part by Nissim's and R&N's poor performance on new entities, which we hypothesize can be attributed to their sole reliance on shallow knowledge sources. In light of this hypothesis, our approach employs semantic and world knowledge extracted from manually and automatically constructed knowledge bases, as well as coreference information. The relevance of coreference to IS determination can be seen from the definition of IS: a new entity is not coreferential with any previously-mentioned entity, whereas an old entity may be. While our use of coreference information for IS determination and our earlier claim that IS annotation would be useful for coreference resolution may seem to have created a chicken-and-egg problem, they do not: since coreference resolution and IS determination can benefit from each other, it may be possible to formulate an approach where the two tasks can mutually bootstrap.

We investigate rule-based and learning-based approaches to fine-grained IS determination. In the rule-based approach, we manually compose rules to combine the aforementioned knowledge sources. While we could employ the same knowledge sources in the learning-based approach, we chose to encode, among other knowledge sources, the hand-written rules and their predictions directly as features for the learner. In an evaluation on 147 Switchboard dialogues, our learning-based approach to fine-grained IS determination achieves an accuracy of 78.7%, substantially outperforming the rule-based approach by 21.3%. Equally importantly, when employing these linguistically rich features to learn Nissim's 3-class IS determination task, the resulting classifier achieves an accuracy of 91.7%, surpassing the classifier trained on R&N's state-of-the-art feature set by 8.8% in absolute accuracy. Improvements on the new class are particularly substantial: its F-score rises from 46.7% to 87.2%.
2 IS Types and Subtypes: An Overview

In Nissim et al.'s (2004) IS classification scheme, an NP can be assigned one of three main types (old, med, new) and one of 16 subtypes. Below we illustrate their definitions with examples, most of which are taken from Nissim (2003) or Nissim et al.'s (2004) dataset (see Section 3).

Old. An NP is marked as old if (i) it is coreferential with an entity introduced earlier, (ii) it is a generic pronoun, or (iii) it is a personal pronoun referring to the dialogue participants. Six subtypes are defined for old entities: identity, event, general, generic, ident_generic, and relative. In Example 1, my is marked as old with subtype identity, since it is coreferent with I.

(1) I was angry that he destroyed my tent.

However, if the markable has a verb phrase (VP) rather than an NP as its antecedent, it will be marked as old/event, as can be seen in Example 2, where the antecedent of That is the VP put my phone number on the form.

(2) They ask me to put my phone number on the form. That I think is not needed.

Other NPs marked as old include (i) relative pronouns, which have the subtype relative; (ii) personal pronouns referring to the dialogue participants, which have the subtype general; and (iii) generic pronouns, which have the subtype generic. The pronoun you in Example 3 is an instance of a generic pronoun.

(3) I think to correct the judicial system, you have to get the lawyer out of it.

Note, however, that in a coreference chain of generic pronouns, every element of the chain is assigned the subtype ident_generic instead.
Mediated. An NP is marked as med if the entity it refers to has not been previously introduced in the dialogue, but can be inferred from already-mentioned entities or is generally known to the hearer. Nine subtypes are available for med entities: general, bound, part, situation, event, set, poss, func_value, and aggregation.

General is assigned to med entities that are generally known, such as the Earth, China, and most proper names. Bound is reserved for bound pronouns, an instance of which is shown in Example 4, where its is bound to the variable of the universally quantified NP, Every cat.

(4) Every cat ate its dinner.

Poss is assigned to NPs involved in intra-phrasal possessive relations, including prenominal genitives (i.e., X's Y) and postnominal genitives (i.e., Y of X). Specifically, Y will be marked as poss if X is old or med; otherwise, Y will be new. For example, in cases like a friend's boat where a friend is new, boat is marked as new.

Four subtypes, namely part, situation, event, and set, are used to identify instances of bridging (i.e., entities that are inferrable from a related entity mentioned earlier in the dialogue). As an example, consider the following sentences:

(5a) He passed by the door of Jan's house and saw that the door was painted red.
(5b) He passed by Jan's house and saw that the door was painted red.

In Example 5a, by the time the hearer processes the second occurrence of the door, she has already had a mental entity corresponding to the door (after processing the first occurrence). As a result, the second occurrence of the door refers to an old entity. In Example 5b, on the other hand, the hearer is not assumed to have any mental representation of the door in question, but she can infer that the door she saw was part of Jan's house. Hence, this occurrence of the door should be marked as med with subtype part, as it is involved in a part-whole relation with its antecedent.
If an NP is involved in a set-subset relation with its antecedent, it inherits the med subtype set. This applies to the NP the house payment in Example 6, whose antecedent is our monthly budget.

(6) What we try to do to stick to our monthly budget is we pretty much have the house payment.

If an NP is part of a situation set up by a previously-mentioned entity, it is assigned the subtype situation, as exemplified by the NP a few horses in the sentence below, which is involved in the situation set up by John's ranch.

(7) Mary went to John's ranch and saw that there were only a few horses.

Similar to old entities, an NP marked as med may be related to a previously mentioned VP. In this case, the NP will receive the subtype event, as exemplified by the NP the bus in the sentence below, which is triggered by the VP traveling in Miami.

(8) We were traveling in Miami, and the bus was very full.

If an NP refers to a value of a previously mentioned function, such as the NP 30 degrees in Example 9, which is related to the temperature, then it is assigned the subtype func_value.

(9) The temperature rose to 30 degrees.

Finally, the subtype aggregation is assigned to coordinated NPs if at least one of the NPs involved is not new. However, if all NPs in the coordinated phrase are new, the phrase should be marked as new. For instance, the NP My son and I in Example 10 should be marked as med/aggregation.

(10) I have a son ... My son and I like to play chess after dinner.
New. An entity is new if it has not been introduced in the dialogue and the hearer cannot infer it from previously mentioned entities. No subtype is defined for new entities.

There are cases where more than one IS value is appropriate for a given NP. For instance, given two occurrences of China in a dialogue, the second occurrence can be labeled as old/identity (because it is coreferential with an earlier NP) or med/general (because it is a generally known entity). To break ties, Nissim (2003) defines a precedence relation on the IS subtypes, which yields a total ordering on the subtypes. Since all the old subtypes are ordered before their med counterparts in this relation, the second occurrence of China in our example will be labeled as old/identity. Owing to space limitations, we refer the reader to Nissim (2003) for details.
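The tie-breaking step just described can be sketched as picking the highest-precedence label among the applicable ones. Note that the full ordering is not reproduced in this paper; the list below is our own illustration, constrained only by the stated property that every old subtype precedes the med subtypes:

```python
# Illustrative only: the text states that all old subtypes precede
# their med counterparts in Nissim's (2003) precedence relation, but
# does not reproduce the full ordering; the order within each group
# here is our own guess.
PRECEDENCE = [
    "old/identity", "old/event", "old/general", "old/generic",
    "old/ident_generic", "old/relative",
    "med/general", "med/bound", "med/part", "med/situation",
    "med/event", "med/set", "med/poss", "med/func_value",
    "med/aggregation", "new",
]

def break_tie(candidates):
    """Pick the highest-precedence subtype among the applicable ones."""
    return min(candidates, key=PRECEDENCE.index)

# The second occurrence of "China" is both old/identity and
# med/general; precedence resolves it to old/identity.
assert break_tie(["med/general", "old/identity"]) == "old/identity"
```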
3 Dataset

We employ Nissim et al.'s (2004) dataset, which comprises 147 Switchboard dialogues. We partition them into a training set (117 dialogues) and a test set (30 dialogues). A total of 58,835 NPs are annotated with IS types and subtypes.[4] The distributions of NPs over the IS subtypes in the training set and the test set are shown in Table 1.

[4] Not all NPs have an IS type/subtype. For instance, a pleonastic "it" does not refer to any real-world entity and therefore does not have any IS, and so are nouns such as "course" in "of course", "accident" in "by accident", etc.

                    Train (%)       Test (%)
old/identity        10236 (20.1)    1258 (15.8)
old/event            1943 (3.8)      290 (3.6)
old/general          8216 (16.2)    1129 (14.2)
old/generic          2432 (4.8)      427 (5.4)
old/ident_generic    1730 (3.4)      404 (5.1)
old/relative         1241 (2.4)      193 (2.4)
med/general          2640 (5.2)      325 (4.1)
med/bound             529 (1.0)       74 (0.9)
med/part              885 (1.7)      120 (1.5)
med/situation        1109 (2.2)      244 (3.1)
med/event             351 (0.7)       67 (0.8)
med/set             10282 (20.2)    1771 (22.3)
med/poss             1318 (2.6)      220 (2.8)
med/func_value        224 (0.4)       31 (0.4)
med/aggregation       580 (1.1)      117 (1.5)
new                  7158 (14.1)    1293 (16.2)
total               50874 (100)     7961 (100)

Table 1: Distributions of NPs over IS subtypes. The corresponding percentages are parenthesized.
4 Rule-Based Approach

In this section, we describe our rule-based approach to fine-grained IS determination, where we manually design rules for assigning IS subtypes to NPs based on the subtype definitions in Section 2, Nissim's (2003) IS annotation guidelines, and our inspection of the IS annotations in the training set. The motivations behind having a rule-based approach are two-fold. First, it can serve as a baseline for fine-grained IS determination. Second, it can provide insight into how the available knowledge sources can be combined into prediction rules, which can potentially serve as "sophisticated" features for a learning-based approach.

As shown in Table 2, our ruleset is composed of 18 rules, which should be applied to an NP in the order in which they are listed. Rules 1–7 handle the assignment of old subtypes to NPs. For instance, Rule 1 identifies instances of old/general, which comprises the personal pronouns referring to the dialogue participants. Note that this and several other rules rely on coreference information, which we obtain from two sources: (1) chains generated automatically using the Stanford Deterministic Coreference Resolution System (Lee et al., 2011)[5], and (2) manually identified coreference chains taken directly from the annotated Switchboard dialogues. Reporting results using these two ways of obtaining chains facilitates the comparison of the IS determination results that we can realistically obtain using existing coreference technologies against those that we could obtain if we further improved existing coreference resolvers. Note that both sources provide identity coreference chains. Specifically, the gold chains were annotated for NPs belonging to old/identity and old/ident_generic. Hence, these chains can be used to distinguish between old/general NPs and old/ident_generic NPs, because the former are not part of a chain whereas the latter are. However, they cannot be used to distinguish between old/general entities and old/generic entities, since neither of them belongs to any chains. As a result, when gold chains are used, Rule 1 will classify all occurrences of "you" that are not part of a chain as old/general, regardless of whether the pronoun is generic. While the gold chains alone can distinguish old/general and old/ident_generic NPs, the Stanford chains cannot distinguish any of the old subtypes in the absence of other knowledge sources, since the resolver generates chains for all old NPs regardless of their subtypes. This implies that Rule 1 and several other rules are only a very crude approximation of the definition of the corresponding IS subtypes.

[5] The Stanford resolver is available from http://nlp.stanford.edu/software/corenlp.shtml.

The rules for the remaining old subtypes can be interpreted similarly. A few points deserve mention. First, many rules depend on the string of the NP under consideration (e.g., "they" in Rule 2 and "whatever" in Rule 4). The decision of which strings are chosen is based primarily on our inspection of the training data. Hence, these rules are partly data-driven. Second, these rules should be applied in the order in which they are shown. For instance, though not explicitly stated, Rule 3 is only applicable to the non-anaphoric "you" and "they" pronouns, since Rule 2 has already covered their anaphoric counterparts. Finally, Rule 7 uses non-anaphoricity as a test of old/event NPs. The reason is that these NPs have VP antecedents, but both the gold chains and the Stanford chains are computed over NPs only.
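The "apply the rules in order, first match wins" control structure described above can be sketched as a small cascade. This is our own minimal illustration: the three conditions shown are simplified stand-ins inspired by Rules 1, 9, and 18, not the paper's full rule set:

```python
# Minimal first-match rule cascade in the spirit of Table 2.
# The conditions below are deliberately simplified approximations.

def is_participant_pronoun(np, in_chain):
    # cf. Rule 1: "I"/"you" outside any coreference chain
    return np.lower() in ("i", "you") and not in_chain

def is_coordinated(np):
    # cf. Rule 9: NP contains "and" or "or"
    words = np.lower().split()
    return "and" in words or "or" in words

RULES = [
    (lambda np, ctx: is_participant_pronoun(np, ctx["in_chain"]),
     "old/general"),
    (lambda np, ctx: is_coordinated(np), "med/aggregation"),
    (lambda np, ctx: True, "new"),  # default, cf. Rule 18
]

def assign_subtype(np, ctx):
    """Apply the rules in order and return the first match."""
    for condition, subtype in RULES:
        if condition(np, ctx):
            return subtype
```

Because matching stops at the first applicable rule, the ordering of the list encodes the precedence among the rules, which is why the rules must be applied exactly in the order listed.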
1. if the NP is "I" or "you" and it is not part of a coreference chain, then subtype := old/general
2. if the NP is "you" or "they" and it is anaphoric, then subtype := old/ident_generic
3. if the NP is "you" or "they", then subtype := old/generic
4. if the NP is "whatever" or an indefinite pronoun prefixed by "some" or "any" (e.g., "somebody"), then subtype := old/generic
5. if the NP is an anaphoric pronoun other than "that", or its string is identical to that of a preceding NP, then subtype := old/identity
6. if the NP is "that" and it is coreferential with the immediately preceding word, then subtype := old/relative
7. if the NP is "it", "this" or "that", and it is not anaphoric, then subtype := old/event
8. if the NP is pronominal and is not anaphoric, then subtype := med/bound
9. if the NP contains "and" or "or", then subtype := med/aggregation
10. if the NP is a multi-word phrase that (1) begins with "so much", "something", "somebody", "someone", "anything", "one", or "different", or (2) has "another", "anyone", "other", "such", "that", "of" or "type" as neither its first nor last word, or (3) its head noun is also the head noun of a preceding NP, then subtype := med/set
11. if the NP contains a word that is a hyponym of the word "value" in WordNet, then subtype := med/func_value
12. if the NP is involved in a part-whole relation with a preceding NP based on information extracted from ReVerb's output, then subtype := med/part
13. if the NP is of the form "X's Y" or "poss-pro Y", where X and Y are NPs and poss-pro is a possessive pronoun, then subtype := med/poss
14. if the NP fills an argument of a FrameNet frame set up by a preceding NP or verb, then subtype := med/situation
15. if the head of the NP and one of the preceding verbs in the same sentence share the same WordNet hypernym that does not appear in one of the top five levels of the noun/verb hierarchy, then subtype := med/event
16. if the NP is a named entity (NE) or starts with "the", then subtype := med/general
17. if the NP appears in the training set, then subtype := its most frequent IS subtype in the training set
18. subtype := new

Table 2: Hand-crafted rules for assigning IS subtypes to NPs.

Rules 8-16 concern med subtypes. Apart from Rule 8 (med/bound), Rule 9 (med/aggregation), and Rule 11 (med/func value), which are arguably crude approximations of the definitions of the corresponding subtypes, the med rules are more complicated than their old counterparts, in part because of their reliance on the extraction of sophisticated knowledge. Below we describe the extraction process and the motivation behind the rules.

Rule 10 concerns med/set. The words and phrases listed in the rule, which are derived manually from the training data, provide suggestive evidence that the NP under consideration is a subset or a specific portion of an entity or concept mentioned earlier in the dialogue. Examples include "another bedroom", "different color", "somebody else", "any place", "one of them", and "most other cities". Condition 3 of the rule, which checks whether the head noun of the NP has been mentioned previously, is a good test for identity coreference; but since all the old entities have supposedly been identified by the preceding rules, it becomes a reasonable test for set-subset relations.
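Applied in listed order, the 18 rules of Table 2 form a first-match cascade: the first condition that holds determines the subtype, and Rule 18 is the default. The sketch below illustrates only this control flow; the three condition functions and the `np` dictionary fields (`string`, `in_chain`, `anaphoric`) are simplified, hypothetical stand-ins for the full conditions, not the paper's implementation.

```python
# Sketch of an ordered, first-match rule cascade in the spirit of Table 2.
# Each rule is a (condition, subtype) pair; conditions are tried in order.

def make_ruleset():
    # Heavily simplified stand-ins for the first three conditions of Table 2.
    return [
        (lambda np: np["string"] in ("I", "you") and not np["in_chain"],
         "old/general"),                                    # cf. Rule 1
        (lambda np: np["string"] in ("you", "they") and np["anaphoric"],
         "old/ident_generic"),                              # cf. Rule 2
        (lambda np: np["string"] in ("you", "they"),
         "old/generic"),                                    # cf. Rule 3
    ]

def assign_subtype(np, rules):
    for condition, subtype in rules:   # because rules fire in order,
        if condition(np):              # Rule 3 only ever sees the
            return subtype             # non-anaphoric "you"/"they"
    return "new"                       # cf. Rule 18: the default class

rules = make_ruleset()
print(assign_subtype({"string": "they", "in_chain": True,
                      "anaphoric": True}, rules))    # old/ident_generic
print(assign_subtype({"string": "they", "in_chain": False,
                      "anaphoric": False}, rules))   # old/generic
```

The ordering matters: an anaphoric "they" never reaches the third condition, which mirrors how Rule 2 shields Rule 3 in the ruleset.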
For convenience, we identify part-whole relations in Rule 12 based on the output produced by ReVerb (Fader et al., 2011), an open information extraction system. [Footnote 6: We use ReVerb ClueWeb09 Extractions 1.1, which is available from http://reverb.cs.washington.edu/reverb_clueweb_tuples-1.1.txt.gz.] The output contains, among other things, relation instances, each of which is represented as a triple <A, rel, B>, where rel is a relation and A and B are its arguments. To pre-process the output, we first identify all the triples that are instances of the part-whole relation using regular expressions. Next, we create clusters of relation arguments such that each pair of arguments in a cluster has a part-whole relation. This is easy: since part-whole is a transitive relation (i.e., <A, part, B> and <B, part, C> imply <A, part, C>), we cluster the arguments by taking the transitive closure of these relation instances. Then, given an NP NPi in the test set, we assign med/part to it if there is a preceding NP NPj such that the two NPs are in the same argument cluster.
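The clustering step just described (a transitive closure over part-whole argument pairs) can be sketched with a small union-find structure. The relation instances and NP strings below are toy examples of our own, not actual ReVerb output.

```python
# Sketch of the med/part test: cluster arguments that stand in a
# part-whole relation by taking the transitive closure (union-find),
# then flag an NP as med/part if a preceding NP is in the same cluster.

def build_clusters(part_whole_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in part_whole_pairs:           # <A, part-of, B> instances
        parent[find(a)] = find(b)           # union the two clusters
    return find

# Toy instances; <A,part,B> and <B,part,C> imply <A,part,C>.
find = build_clusters([("wheel", "car"), ("car", "convoy")])

def is_med_part(np, preceding_nps):
    return any(find(np) == find(prev) for prev in preceding_nps)

print(is_med_part("wheel", ["convoy"]))   # True: same argument cluster
print(is_med_part("wheel", ["house"]))    # False
```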
In Rule 14, we use FrameNet (Baker et al., 1998) to determine whether med/situation should be assigned to an NP, NPi. Specifically, we check whether it fills an argument of a frame set up by a preceding NP, NPj, or verb. To exemplify, let us assume that NPj is "capital punishment". We search for "punishment" in FrameNet to access the appropriate frame, which in this case is "rewards and punishments". This frame contains a list of arguments together with examples. If NPi is one of these arguments, we assign med/situation to NPi, since it is involved in a situation (described by a frame) that is set up by a preceding NP/verb.

In Rule 15, we use WordNet (Fellbaum, 1998) to determine whether med/event should be assigned to an NP, NPi, by checking whether NPi is related to an event, which is typically described by a verb. Specifically, we use WordNet to check whether there exists a verb, v, preceding NPi such that v and NPi have the same hypernym. If so, we assign NPi the subtype med/event. Note that we ensure that the hypernym they share does not appear in the top five levels of the WordNet noun and verb hierarchies, since we want the two words to be related via a concept that is not overly general.
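The Rule 15 test can be sketched as follows. To keep the example self-contained we use a hypothetical mini-hierarchy in place of WordNet: `HYPERNYMS` maps a word to (hypernym, depth) pairs of our own invention, and requiring depth >= 5 stands in for the exclusion of the top five levels of the real hierarchies.

```python
# Sketch of the Rule 15 med/event test with a toy stand-in for WordNet:
# an NP head and a preceding verb trigger med/event if they share a
# hypernym that sits below the top five levels of the hierarchy.

# Hypothetical mini-hierarchy: word -> set of (hypernym, depth) pairs,
# where depth counts levels from the root of the noun/verb hierarchy.
HYPERNYMS = {
    "negotiation": {("discussion", 7), ("communication", 4), ("entity", 0)},
    "negotiate":   {("discussion", 7), ("act", 1)},
    "walk":        {("motion", 6), ("act", 1)},
}

def is_med_event(np_head, preceding_verbs, min_depth=5):
    np_hyp = {h for h, d in HYPERNYMS.get(np_head, set()) if d >= min_depth}
    for verb in preceding_verbs:
        verb_hyp = {h for h, d in HYPERNYMS.get(verb, set()) if d >= min_depth}
        if np_hyp & verb_hyp:   # shared, sufficiently specific hypernym
            return True
    return False

print(is_med_event("negotiation", ["negotiate"]))  # True, via "discussion"
print(is_med_event("negotiation", ["walk"]))       # False
```

Note that the shared but overly general hypernyms "act" (depth 1) and "entity" (depth 0) are filtered out, mirroring the depth restriction described above.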
Rule 16 identifies instances of med/general. The majority of its members are generally known entities, whose identification is difficult as it requires world knowledge. Consequently, we apply this rule only after all other med rules are applied. As we can see, the rule assigns med/general to NPs that are named entities (NEs) and definite descriptions (specifically, those NPs that start with "the"). The reason is simple: most NEs are generally known, and definite descriptions are typically not new, so it seems reasonable to assign med/general to them given that the remaining (i.e., unlabeled) NPs are presumably either new or med/general.

Before Rule 18, which assigns an NP to the new class by default, we have a "memorization" rule that checks whether the NP under consideration appears in the training set (Rule 17). If so, we assign to it its most frequent subtype based on its occurrences in the training set. In essence, this heuristic rule can help classify some of the NPs that are somehow "missed" by the first 16 rules.

The ordering of these rules has a direct impact on the performance of the ruleset, so a natural question is: what criteria did we use to order the rules? We order them in such a way that they respect the total ordering on the subtypes imposed by Nissim's (2003) preference relation (see Section 3), except that we give med/general a lower priority than Nissim does, owing to the difficulty involved in identifying generally known entities, as noted above.

5 Learning-Based Approach

In this section, we describe our learning-based approach to fine-grained IS determination. Since we aim to automatically label an NP with its IS subtype, we create one training/test instance from each hand-annotated NP in the training/test set. Each instance is represented using five types of features, as described below.

Unigrams (119,704). We create one binary feature for each unigram appearing in the training set. Its value indicates the presence or absence of the unigram in the NP under consideration.

Markables (209,751). We create one binary feature for each markable (i.e., an NP having an IS subtype) appearing in the training set. Its value is 1 if and only if the markable has the same string as the NP under consideration.

Markable predictions (17). We create 17 binary features, 16 of which correspond to the 16 IS subtypes; the remaining one corresponds to a "dummy subtype". Specifically, if the NP under consideration appears in the training set, we use Rule 17 of our hand-crafted ruleset to determine the IS subtype it is most frequently associated with in the training set, and then set the value of the feature corresponding to this IS subtype to 1. If the NP does not appear in the training set, we set the value of the dummy-subtype feature to 1.
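A minimal sketch of the 17 markable-prediction features follows; the subtype list is taken from the subtypes named in this paper, but the lookup table is an illustrative stand-in for the actual training-set statistics.

```python
# Sketch of the 17 "markable prediction" features: a one-hot vector over
# the 16 IS subtypes plus a dummy slot for NPs unseen in training.

SUBTYPES = ["old/ident", "old/event", "old/general", "old/generic",
            "old/ident_generic", "old/relative", "med/general", "med/bound",
            "med/part", "med/situation", "med/event", "med/set", "med/poss",
            "med/func_value", "med/aggregation", "new"]

def markable_prediction_features(np_string, most_frequent_subtype):
    """most_frequent_subtype maps a training-set NP string to the IS
    subtype it is most often annotated with (Rule 17's lookup table)."""
    vec = [0] * (len(SUBTYPES) + 1)         # 16 subtypes + 1 dummy slot
    if np_string in most_frequent_subtype:
        vec[SUBTYPES.index(most_frequent_subtype[np_string])] = 1
    else:
        vec[-1] = 1                         # dummy-subtype feature
    return vec

lookup = {"they": "old/ident_generic"}      # toy stand-in for Rule 17 data
print(markable_prediction_features("they", lookup))      # one-hot at slot 4
print(markable_prediction_features("a castle", lookup))  # one-hot at dummy
```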
Rule conditions (17). As mentioned before, we can create features based on the hand-crafted rules in Section 4. To describe these features, let us introduce some notation. Let Rule i be denoted by Ai -> Bi, where Ai is the condition that must be satisfied before the rule can be applied and Bi is the IS subtype predicted by the rule. We could create one binary feature from each Ai and set its value to 1 if Ai is satisfied by the NP under consideration. These features, however, fail to capture a crucial aspect of the ruleset: the ordering of the rules. For instance, Rule i should be applied only if the conditions of the first i-1 rules are not satisfied by the NP, but such ordering is not encoded in these features. To address this problem, we capture rule-ordering information by defining the binary feature fi as ¬A1 ∧ ¬A2 ∧ ... ∧ ¬Ai−1 ∧ Ai, where 1 ≤ i ≤ 16. In addition, we define a feature, f18, for the default rule (Rule 18) in a similar fashion, but since it does not have any condition, we simply define f18 as ¬A1 ∧ ... ∧ ¬A16. The value of a feature in this feature group is 1 if and only if the NP under consideration satisfies the condition defined by the feature.
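The ordered conditions fi = ¬A1 ∧ ... ∧ ¬Ai−1 ∧ Ai can be computed in a single pass over the conditions. The three toy conditions below are hypothetical placeholders for A1..A16, so the sketch returns a 4-element vector rather than the 17-element one used in the paper.

```python
# Sketch of the "rule condition" features: f_i is 1 iff condition A_i
# holds and none of A_1 .. A_{i-1} does; the last slot is the default rule.

def rule_condition_features(np, conditions):
    """conditions = [A_1, ..., A_n]; returns [f_1, ..., f_n, f_default]."""
    feats = []
    fired = False                     # has any earlier condition held?
    for cond in conditions:
        feats.append(int(cond(np) and not fired))
        fired = fired or cond(np)
    feats.append(int(not fired))      # default rule: no condition of its own
    return feats

# Toy conditions standing in for A_1..A_3 (hypothetical).
conds = [lambda np: np == "I", lambda np: np == "they", lambda np: len(np) > 3]
print(rule_condition_features("they", conds))   # [0, 1, 0, 0]
print(rule_condition_features("car", conds))    # [0, 0, 0, 1]
```

For "they" the third condition also holds, but its feature stays 0 because an earlier condition already fired; this is exactly the ordering information the plain Ai features would miss.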
Note that we did not create any features from Rule 17 here, since we have already generated the "markables" and "markable prediction" features for it.

Rule predictions (17). None of the features fi defined above makes use of the predictions of our hand-crafted rules (i.e., the Bi's). To make use of these predictions, we define 17 binary features, one for each Bi, where i = 1, ..., 16, 18. Specifically, the value of the feature corresponding to Bi is 1 if and only if fi is 1, where fi is a "rule condition" feature as defined above.

Since IS subtype determination is a 16-class classification problem, we train a multi-class SVM classifier on the training instances using SVMmulticlass (Tsochantaridis et al., 2004), and use it to make predictions on the test instances. [Footnote 7: For all the experiments involving SVMmulticlass, we set C, the regularization parameter, to 500,000, since preliminary experiments indicate that preferring generalization to overfitting (by setting C to a small value) tends to yield poorer classification performance. The remaining learning parameters are set to their default values.]

6 Evaluation

Next, we evaluate the rule-based approach and the learning-based approach to determining the IS subtype of each hand-annotated NP in the test set.

Classification results. Table 3 shows the results of the two approaches. Specifically, row 1 shows their accuracy, which is defined as the percentage of correctly classified instances. For each approach, we present results that are generated based on gold coreference chains as well as on automatic chains computed by the Stanford resolver. As we can see, the rule-based approach achieves accuracies of 66.0% (gold coreference) and 57.4% (Stanford coreference), whereas the learning-based approach achieves accuracies of 86.4% (gold) and 78.7% (Stanford). In other words, the gold coreference results are better than the Stanford coreference results, and the learning-based results are better than the rule-based results. While perhaps neither of these results is surprising, we are pleasantly surprised by the extent to which the learned classifier outperforms the hand-crafted rules: accuracies increase by 20.4% and 21.3% when gold coreference and Stanford coreference are used, respectively. In other words, machine learning has "transformed" a ruleset that achieves mediocre performance into a system that achieves relatively high performance.

These results also suggest that coreference plays a crucial role in IS subtype determination: accuracies could increase by up to 7.7-8.6% if we solely improved coreference resolution performance. This is perhaps not surprising: IS and coreference can mutually benefit from each other.

To gain additional insight into the task, we also show in rows 2-17 of Table 3 the performance on each of the 16 subtypes, expressed in terms of recall (R), precision (P), and F-score (F). A few points deserve mention. First, in comparison to the rule-based approach, the learning-based approach achieves considerably better performance on almost all classes. One class of particular interest is the new class: as we can see in row 17, its F-score rises by about 30 points, and these gains are accompanied by a simultaneous rise in recall and precision, with recall increasing by about 40 points.
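The measures reported in Table 3 (overall accuracy plus per-subtype recall, precision, and F-score) can be computed as sketched below; the four-instance toy data is ours, not from the corpus.

```python
# Sketch of the evaluation measures of Table 3: accuracy over all
# instances, and recall/precision/F-score per IS subtype.

def evaluate(gold, pred):
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    scores = {}
    for cls in set(gold) | set(pred):
        tp = sum(g == p == cls for g, p in zip(gold, pred))
        r = tp / max(1, sum(g == cls for g in gold))      # recall
        p_ = tp / max(1, sum(p == cls for p in pred))     # precision
        f = 2 * p_ * r / (p_ + r) if p_ + r else 0.0      # F-score
        scores[cls] = (r, p_, f)
    return acc, scores

gold = ["new", "new", "old/ident", "med/set"]
pred = ["new", "old/ident", "old/ident", "med/set"]
acc, scores = evaluate(gold, pred)
print(round(acc, 2))        # 0.75
print(scores["old/ident"])  # recall 1.0, precision 0.5, F about 0.67
```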
    IS Subtype        | Rule-Based (Gold)  | Rule-Based (Stanford) | Learning (Gold)    | Learning (Stanford)
                      | R / P / F          | R / P / F             | R / P / F          | R / P / F
 1  Accuracy          | 66.0               | 57.4                  | 86.4               | 78.7
 2  old/ident         | 77.5 / 78.2 / 77.8 | 66.1 / 52.7 / 58.7    | 82.8 / 85.2 / 84.0 | 75.8 / 64.2 / 69.5
 3  old/event         | 98.6 / 50.4 / 66.7 | 71.3 / 43.2 / 53.8    | 98.3 / 87.9 / 92.8 |  2.4 / 31.8 /  4.5
 4  old/general       | 81.9 / 82.7 / 82.3 | 72.3 / 83.6 / 77.6    | 97.7 / 93.7 / 95.6 | 87.8 / 92.7 / 90.2
 5  old/generic       | 55.9 / 55.2 / 55.5 | 39.2 / 39.8 / 39.5    | 76.1 / 87.3 / 81.3 | 39.9 / 85.9 / 54.5
 6  old/ident generic | 48.7 / 77.7 / 59.9 | 27.2 / 51.8 / 35.7    | 57.1 / 87.5 / 69.1 | 47.2 / 44.8 / 46.0
 7  old/relative      | 55.0 / 69.2 / 61.3 | 55.1 / 63.4 / 59.0    | 98.0 / 63.0 / 76.7 | 99.0 / 37.5 / 54.4
 8  med/general       | 29.9 / 19.8 / 23.8 | 29.5 / 19.6 / 23.6    | 91.2 / 87.7 / 89.4 | 84.0 / 72.2 / 77.7
 9  med/bound         | 56.4 / 20.5 / 30.1 | 56.4 / 20.5 / 30.1    | 25.7 / 65.5 / 36.9 |  2.7 / 40.0 /  5.1
10  med/part          | 19.5 /100.0 / 32.7 | 19.5 /100.0 / 32.7    | 73.2 / 96.8 / 83.3 | 73.2 / 96.8 / 83.3
11  med/situation     | 28.7 /100.0 / 44.6 | 28.7 /100.0 / 44.6    | 68.4 / 95.4 / 79.7 | 68.0 / 97.7 / 80.2
12  med/event         | 10.5 /100.0 / 18.9 | 10.5 /100.0 / 18.9    | 46.3 /100.0 / 63.3 | 46.3 /100.0 / 63.3
13  med/set           | 82.9 / 61.8 / 70.8 | 78.0 / 59.4 / 67.4    | 90.4 / 87.8 / 89.1 | 88.4 / 86.0 / 87.2
14  med/poss          | 52.9 / 86.0 / 65.6 | 52.9 / 86.0 / 65.6    | 93.2 / 92.4 / 92.8 | 90.5 / 97.6 / 93.9
15  med/func value    | 81.3 / 74.3 / 77.6 | 81.3 / 74.3 / 77.6    | 88.1 / 85.9 / 87.0 | 88.1 / 85.9 / 87.0
16  med/aggregation   | 57.4 / 44.0 / 49.9 | 57.4 / 43.6 / 49.6    | 85.2 / 72.9 / 78.6 | 83.8 / 93.9 / 88.6
17  new               | 50.4 / 65.7 / 57.0 | 50.3 / 65.1 / 56.7    | 90.3 / 84.6 / 87.4 | 90.4 / 83.6 / 86.9

Table 3: IS subtype accuracies and F-scores. In each row, the strongest result, as well as those that are statistically indistinguishable from it according to the paired t-test (p < 0.05), are boldfaced.
Now, recall from the introduction that previous attempts at 3-class IS determination by Nissim and R&N achieved poor performance on the new class. We hypothesize that the use of shallow features in their approaches was responsible for the poor performance they observed, and that using our knowledge-rich feature set could improve performance on this class. We will test this hypothesis at the end of this section.

Other subtypes that are worth discussing are med/aggregation, med/func value, and med/poss. Recall that the rules we designed for these classes were only crude approximations or, perhaps more precisely, simplified versions of the definitions of the corresponding subtypes. For instance, to determine whether an NP belongs to med/aggregation, we simply look for occurrences of "and" and "or" (Rule 9), whereas its definition requires that not all of the NPs in the coordinated phrase are new. Despite the over-simplicity of these rules, machine learning has enabled the available features to be combined in such a way that high performance is achieved for these classes (see rows 14-16).

Also worth examining are those classes for which the hand-crafted rules rely on sophisticated knowledge sources. They include med/part, which relies on ReVerb; med/situation, which relies on FrameNet; and med/event, which relies on WordNet. As we can see from the rule-based results (rows 10-12), these knowledge sources have yielded rules that achieve perfect precision but low recall: 19.5% for part, 28.7% for situation, and 10.5% for event. Nevertheless, the learning algorithm has again discovered a profitable way to combine the available features, enabling the F-scores of these classes to increase by 35.1-50.6%.
While most classes are improved by machine learning, the same is not true for old/event and med/bound, whose F-scores are 4.5% (row 3) and 5.1% (row 9), respectively, when Stanford coreference is employed. This is perhaps not surprising. Recall that the multi-class SVM classifier was trained to maximize classification accuracy. Hence, if it encounters a class that is both difficult to learn and under-represented, it may well aim to achieve good performance on the easier-to-learn, well-represented classes at the expense of these hard-to-learn, under-represented classes.

Feature analysis. In an attempt to gain additional insight into the performance contribution of each of the five types of features used in the learning-based approach, we conduct feature ablation experiments. Results are shown in Table 4, where each row shows the accuracy of the classifier trained on all types of features except for the one shown in that row. For easy reference, the accuracy of the classifier trained on all types of features is shown in row 1 of the table. According to the paired t-test (p < 0.05), performance drops significantly whichever feature type is removed. This suggests that all five feature types contribute positively to overall accuracy. Also, the markables features are the least important in the presence of the other feature groups, whereas markable predictions and unigrams are the two most important feature groups.

Feature Type            | Gold Coref | Stanford Coref
All features            | 86.4       | 78.7
-rule predictions       | 77.5       | 70.0
-markable predictions   | 72.4       | 64.7
-rule conditions        | 81.1       | 71.0
-unigrams               | 74.4       | 58.6
-markables              | 83.2       | 75.5

Table 4: Accuracies of feature ablation experiments.
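The ablation protocol behind Table 4 can be sketched as a loop over feature groups. Here `train_and_score` is a hypothetical stand-in for retraining the SVM on the selected groups and returning test-set accuracy; the toy per-group contributions are chosen so that the fake scorer reproduces the gold-coreference column of Table 4, purely for illustration.

```python
# Sketch of a feature-ablation experiment: retrain with one feature
# group removed at a time and compare against the full model.

GROUPS = ["unigrams", "markables", "markable_predictions",
          "rule_conditions", "rule_predictions"]

def ablation(train_and_score):
    full = train_and_score(GROUPS)
    drops = {g: full - train_and_score([h for h in GROUPS if h != g])
             for g in GROUPS}
    return full, drops

# Toy scorer: pretend each group adds a fixed amount of accuracy, tuned
# here to mimic the gold-coreference column of Table 4 (hypothetical).
contrib = {"unigrams": 12.0, "markables": 3.2, "markable_predictions": 14.0,
           "rule_conditions": 5.3, "rule_predictions": 8.9}
fake_scorer = lambda groups: 43.0 + sum(contrib[g] for g in groups)

full, drops = ablation(fake_scorer)
print(round(full, 1))               # 86.4
print(max(drops, key=drops.get))    # markable_predictions
```

Under this toy scorer, removing the markable-prediction features hurts most (86.4 to 72.4), matching the ordering observed in Table 4.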
To get a better idea of the utility of each feature type, we conduct another experiment in which we train five classifiers, each of which employs exactly one type of features. The accuracies of these classifiers are shown in Table 5. As we can see, the markables features have the smallest contribution, whereas unigrams have the largest contribution. Somewhat interesting are the results of the classifiers trained on the rule conditions: the rules are far more effective when gold coreference is used. This can be attributed to the fact that the design of the rules was based in part on the definitions of the subtypes, which assume the availability of perfect coreference information.

Feature Type            | Gold Coref | Stanford Coref
rule predictions        | 49.1       | 45.2
markable predictions    | 39.7       | 39.7
rule conditions         | 58.1       | 28.9
unigrams                | 56.8       | 56.8
markables               | 10.4       | 10.4

Table 5: Accuracies of classifiers for each feature type.

Knowledge source analysis. To gain some insight into the extent to which a knowledge source or a rule contributes to the overall performance of the rule-based approach, we conduct ablation experiments: in each experiment, we measure the performance of the ruleset after removing a particular rule or knowledge source from it.
Specifically, rows 2-4 of Table 6 show the accuracies of the ruleset after removing the memorization rule (Rule 17), the rule that uses ReVerb's output (Rule 12), and the cue words used in Rules 4 and 10, respectively. For easy reference, the accuracy of the original ruleset is shown in row 1 of the table. According to the paired t-test (p < 0.05), performance drops significantly in all three ablation experiments. This suggests that the memorization rule, ReVerb, and the cue words all contribute positively to the accuracy of the ruleset.

                        | Gold Coref | Stanford Coref
All rules               | 66.0       | 57.4
-memorization           | 62.6       | 52.0
-ReVerb                 | 64.2       | 56.6
-cue words              | 63.8       | 54.0

Table 6: Accuracies of the simplified ruleset.

IS type results. We hypothesized earlier that the poor performance reported by Nissim and R&N on identifying new entities in their 3-class IS classification experiments (i.e., classifying an NP as old, med, or new) could be attributed to their sole reliance on lexico-syntactic features. To test this hypothesis, we (1) train a 3-class classifier using the five types of features we employed in our learning-based approach, computing the features based on the Stanford coreference chains, and (2) compare its results against those obtained via the lexico-syntactic approach of R&N on our test set. Results of these experiments, which are shown in Table 7, substantiate our hypothesis: when we replace R&N's features with ours, accuracy rises from 82.9% to 91.7%. These gains can be attributed to large improvements in identifying new and med entities, for which the F-scores increase by about 40 points and 10 points, respectively.

         | R&N's Features     | Our Features
IS Type  | R / P / F          | R / P / F
old      | 93.5 / 95.8 / 94.6 | 93.8 / 96.4 / 95.1
med      | 89.3 / 71.2 / 79.2 | 93.3 / 86.0 / 89.5
new      | 34.6 / 71.7 / 46.7 | 82.4 / 72.7 / 87.2
Accuracy | 82.9               | 91.7

Table 7: Accuracies on IS types.

7 Conclusions

We have examined the fine-grained IS determination task. Experiments on a set of Switchboard dialogues show that our learning-based approach, which uses features that include hand-crafted rules and their predictions, outperforms its rule-based counterpart by more than 20%, achieving an overall accuracy of 78.7% when relying on automatically computed coreference information. In addition, we have achieved state-of-the-art results on the 3-class IS determination task, in part due to our reliance on richer knowledge sources in comparison to prior work. To our knowledge, there has been little work on automatic IS subtype determination. We hope that our work can stimulate further research on this task.

Acknowledgments

We thank the three anonymous reviewers for their detailed and insightful comments on an earlier draft of the paper. This work was supported in part by NSF Grants IIS-0812261 and IIS-1147644.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, Volume 1, pages 86-90.

Sasha Calhoun, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387-419.

Miriam Eckert and Michael Strube. 2001. Dialogue acts, synchronising units and anaphora resolution. Journal of Semantics, 17(1):51-89.
Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1535-1545.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Caroline Gasperin and Ted Briscoe. 2008. Statistical anaphora resolution in biomedical texts. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 257-264.

Michael Götze, Thomas Weskott, Cornelia Endriss, Ines Fiedler, Stefan Hinterwimmer, Svetlana Petrova, Anne Schwarz, Stavros Skopeteas, and Ruben Stoel. 2007. Information structure. In Working Papers of the SFB632, Interdisciplinary Studies on Information Structure (ISIS). Potsdam: Universitätsverlag Potsdam.

Eva Hajičová. 1984. Topic and focus. In Contributions to Functional Syntax, Semantics, and Language Comprehension (LLSEE 16), pages 189-202. John Benjamins, Amsterdam.

Michael A. K. Halliday. 1976. Notes on transitivity and theme in English. Journal of Linguistics, 3(2):199-244.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28-34.

Malvina Nissim, Shipra Dingare, Jean Carletta, and Mark Steedman. 2004. An annotation scheme for information status in dialogue. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1023-1026.

Malvina Nissim. 2003. Annotation scheme for information status in dialogue. Available from http://www.stanford.edu/class/cs224u/guidelines-infostatus.pdf.

Malvina Nissim. 2006. Learning information status of discourse entities. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 94-102.

Ellen F. Prince. 1981. Toward a taxonomy of given-new information. In P. Cole, editor, Radical Pragmatics, pages 223-255. Academic Press, New York.

Ellen F. Prince. 1992. The ZPG letter: Subjects, definiteness, and information-status. In Discourse Description: Diverse Analysis of a Fund Raising Text, pages 295-325. John Benjamins, Philadelphia/Amsterdam.
Altaf Rahman and Vincent Ng. 2011. Learning the information status of noun phrases in spoken dialogues. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1069-1080.

Arndt Riester, David Lorenz, and Nina Seemann. 2010. A recursive annotation scheme for referential information status. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, pages 717-722.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, MA.

Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning, pages 104-112.

Enric Vallduví. 1992. The Informational Component. Garland, New York.


Composing extended top-down tree transducers*

Aurélie Lagoutte
École normale supérieure de Cachan, Département Informatique

[email protected]

Fabienne Braune and Daniel Quernheim and Andreas Maletti University of Stuttgart, Institute for Natural Language Processing {braunefe,daniel,maletti}@ims.uni-stuttgart.de Abstract RC C PREL C 7→ A composition procedure for linear and NP VP nondeleting extended top-down tree trans- that NP VP ducers is presented. It is demonstrated that C C the new procedure is more widely applica- NP VP 7→ NP VP ble than the existing methods. In general, the result of the composition is an extended VAUX VPART NP VAUX NP VPART top-down tree transducer that is no longer linear or nondeleting, but in a number of Figure 1: Word drop [top] and reordering [bottom]. cases these properties can easily be recov- ered by a post-processing step. The newswire reported yesterday that the Serbs have completed the negotiations. 1 Introduction Gestern [Yesterday] berichtete [reported] die [the] Nachrichtenagentur [newswire] die [the] Serben Tree-based translation models such as syn- [Serbs] h¨atten [would have] die [the] Verhandlungen chronous tree substitution grammars (Eisner, [negotiations] beendet [completed]. 2003; Shieber, 2004) or multi bottom-up tree transducers (Lilin, 1978; Engelfriet et al., 2009; The relation between them can be described Maletti, 2010; Maletti, 2011) are used for sev- (Yamada and Knight, 2001) by three operations: eral aspects of syntax-based machine transla- drop of the relative pronoun, movement of the tion (Knight and Graehl, 2005). Here we consider participle to end of the clause, and word-to-word the extended top-down tree transducer (XTOP), translation. Figure 1 shows the first two oper- which was studied in (Arnold and Dauchet, ations, and Figure 2 shows ln-XTOP rules per- 1982; Knight, 2007; Graehl et al., 2008; Graehl forming them. Let us now informally describe et al., 2009) and implemented in the toolkit the execution of an ln-XTOP on the top rule ρ T IBURON (May and Knight, 2006; May, 2010). of Figure 2. 
In general, ln-XTOPs process an in- Specifically, we investigate compositions of linear put tree from the root towards the leaves using and nondeleting XTOPs (ln-XTOP). Arnold and a set of rules and states. The state p in the left- Dauchet (1982) showed that ln-XTOPs compute hand side of ρ controls the particular operation of a class of transformations that is not closed under Figure 1 [top]. Once the operation has been per- composition, so we cannot compose two arbitrary formed, control is passed to states pNP and pVP , ln-XTOPs into a single ln-XTOP. However, we which use their own rules to process the remain- will show that ln-XTOPs can be composed into a ing input subtree governed by the variable below (not necessarily linear or nondeleting) XTOP. To them (see Figure 2). In the same fashion, an ln- illustrate the use of ln-XTOPs in machine transla- XTOP containing the bottom rule of Figure 2 re- tion, we consider the following English sentence orders the English verbal complex. together with a German reference translation: In this way we model the word drop by an ln- ∗ All authors were financially supported by the E MMY XTOP M and reordering by an ln-XTOP N . The N OETHER project MA / 4959 / 1-1 of the German Research syntactic properties of linearity and nondeletion Foundation (DFG). yield nice algorithmic properties, and the mod- 808 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 808–817, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics p δ (ε) C δ RC q (1) γ (3) → pNP pVP σ (2) q α γ PREL C (11) y1 y2 x1 α(21) q (22) γ (31) x1 γ that y1 y2 (221) p q C x2 p(311) x3 C qNP VP (3111) → x3 z1 VP z1 qVA qVP qNP z2 z3 z4 z2 z4 z3 Figure 3: Linear normalized tree t ∈ TΣ (Q(X)) [left] and t[α]2 [right] with var(t) = {x1 , x2 , x3 }. The posi- Figure 2: XTOP rules for the operations of Figure 1. tions are indicated in t as superscripts. 
Composition allows us to recombine those parts into one device modeling the whole translation. In particular, it gives all parts the chance to vote at the same time. This is especially important if pruning is used, because pruning might otherwise exclude candidates that score low in one part but well in others (May et al., 2010).

Because ln-XTOP is not closed under composition, the composition of M and N might be outside ln-XTOP. These cases have been identified by Arnold and Dauchet (1982) as infinitely "overlapping cuts", which occur when the right-hand sides of M and the left-hand sides of N overlap unboundedly. This can be purely syntactic (for a given ln-XTOP) or semantic (inherent in all ln-XTOPs for a given transformation).
In a first step we cut Despite the general impossibility, several strate- left-hand sides of rules of N into smaller pieces, gies have been developed: (i) Extension of the which might introduce non-linearity and deletion model (Maletti, 2010; Maletti, 2011), (ii) online into N . In certain cases, this can also intro- composition (May et al., 2010), and (iii) restric- duce finite look-ahead (Engelfriet, 1977; Graehl tion of the model, which we follow. Composi- et al., 2009). To compensate, we expand the rules tions of subclasses in which the XTOP N has at of M slightly. Section 4 explains those prepa- most one input symbol in its left-hand sides have rations. Next, we compose the prepared XTOPs already been studied in (Engelfriet, 1975; Baker, as usual and obtain a single XTOP computing the 1979; Maletti and Vogler, 2010). Such compo- composition of the transformations computed by sitions are implemented in the toolkit T IBURON. M and N (see Section 5). Finally, we apply a However, there are translation tasks in which the post-processing step to expand rules to reobtain used XTOPs do not fulfill this requirement. Sup- linearity and nondeletion. Clearly, this cannot be pose that we simply want to compose the rules of successful in all cases, but often removes the non- Figure 2, The bottom rule does not satisfy the re- linearity introduced in the pre-processing step. quirement that there is at most one input symbol in the left-hand side. 2 Preliminaries We will demonstrate how to compose two lin- ear and nondeleting XTOPs into a single XTOP, Our trees have labels taken from an alphabet Σ which might however no longer be linear or non- of symbols, and in addition, leaves might be deleting. However, when the syntactic form of labeled by elements of the countably infinite 809 σ σ qS S’ x1 γ α γ σ S θ 7→ θ ←[ α → qV qNP qNP δ δ x3 x1 VP x2 x1 x1 β β x2 β β x2 x2 x3 Figure 4: Substitution where θ(x1 ) = α, θ(x2 ) = x2 , t and θ(x3 ) = γ(δ(β, β, x2 )). 
[Figure 5: Rule and its use in a derivation step.]

set X = {x1, x2, …} of formal variables. Formally, for every V ⊆ X the set TΣ(V) of Σ-trees with V-leaves is the smallest set such that V ⊆ TΣ(V) and σ(t1, …, tk) ∈ TΣ(V) for all k ∈ ℕ, σ ∈ Σ, and t1, …, tk ∈ TΣ(V). To avoid excessive universal quantifications, we drop them if they are obvious from the context.

For each tree t ∈ TΣ(X) we identify nodes by positions. The root of t has position ε, and the position iw with i ∈ ℕ and w ∈ ℕ* addresses the position w in the i-th direct subtree at the root. The set of all positions in t is pos(t). We write t(w) for the label (taken from Σ ∪ X) of t at position w ∈ pos(t). Similarly, we use

• t|w to address the subtree of t that is rooted in position w, and
• t[u]w to represent the tree that is obtained from replacing the subtree t|w at w by u ∈ TΣ(X).

For a given set L ⊆ Σ ∪ X of labels, we let

posL(t) = {w ∈ pos(t) | t(w) ∈ L}

be the set of all positions whose label belongs to L. We also write posl(t) instead of pos{l}(t). The tree t ∈ TΣ(V) is linear if |posx(t)| ≤ 1 for every x ∈ X. Moreover,

var(t) = {x ∈ X | posx(t) ≠ ∅}

collects all variables that occur in t. If the variables occur in the order x1, x2, … in a pre-order traversal of the tree t, then t is normalized. Given a finite set Q, we write Q(T) with T ⊆ TΣ(X) for the set {q(t) | q ∈ Q, t ∈ T}. We will treat elements of Q(T) as special trees of TΣ∪Q(X). The previous notions are illustrated in Figure 3.

A substitution θ is a mapping θ: X → TΣ(X). When applied to a tree t ∈ TΣ(X), it will return the tree tθ, which is obtained from t by replacing all occurrences of x ∈ X (in parallel) by θ(x). This can be defined recursively by xθ = θ(x) for all x ∈ X and σ(t1, …, tk)θ = σ(t1θ, …, tkθ) for all σ ∈ Σ and t1, …, tk ∈ TΣ(X). The effect of a substitution is displayed in Figure 4. Two substitutions θ, θ′: X → TΣ(X) can be composed to form a substitution θθ′: X → TΣ(X) such that θθ′(x) = θ(x)θ′ for every x ∈ X.

Next, we define two notions of compatibility for trees. Let t, t′ ∈ TΣ(X) be two trees. If there exists a substitution θ such that t′ = tθ, then t′ is an instance of t. Note that this relation is not symmetric. A unifier θ for t and t′ is a substitution θ such that tθ = t′θ. The unifier θ is a most general unifier (short: mgu) for t and t′ if for every unifier θ″ for t and t′ there exists a substitution θ′ such that θθ′ = θ″. The set mgu(t, t′) is the set of all mgus for t and t′. Most general unifiers can be computed efficiently (Robinson, 1965; Martelli and Montanari, 1982), and all mgus for t and t′ are equal up to a variable renaming.

Example 1. Let t = σ(x1, γ(δ(β, β, x2))) and t′ = σ(α, x3). Then mgu(t, t′) contains θ such that θ(x1) = α and θ(x3) = γ(δ(β, β, x2)). Figure 4 illustrates the unification.

3 The model

The model discussed in this contribution is an extension of the classical top-down tree transducer, which was introduced by Rounds (1970) and Thatcher (1970). The extended top-down tree transducer with finite look-ahead, or just XTOPF, and its variations were studied in (Arnold and Dauchet, 1982; Knight and Graehl, 2005; Knight, 2007; Graehl et al., 2008; Graehl et al., 2009).

[Figure 6: Rule [left] and reversed rule [right].]
[Figure 7: Top rule of Figure 2 reversed.]

Formally, an extended top-down tree transducer with finite look-ahead (XTOPF) is a system M = (Q, Σ, ∆, I, R, c) where

• Q is a finite set of states,
• Σ and ∆ are alphabets of input and output symbols, respectively,
• I ⊆ Q is a set of initial states,
• R is a finite set of (rewrite) rules of the form ℓ → r where ℓ ∈ Q(TΣ(X)) is linear and r ∈ T∆(Q(var(ℓ))), and
• c: R × X → TΣ(X) assigns a look-ahead restriction to each rule and variable such that c(ρ, x) is linear for each ρ ∈ R and x ∈ X.

The XTOPF M is linear (respectively, nondeleting) if r is linear (respectively, var(r) = var(ℓ)) for every rule ℓ → r ∈ R. It has no look-ahead (or it is an XTOP) if c(ρ, x) ∈ X for all rules ρ ∈ R and x ∈ X. In this case, we drop the look-ahead component c from the description. A rule ℓ → r ∈ R is consuming (respectively, producing) if posΣ(ℓ) ≠ ∅ (respectively, pos∆(r) ≠ ∅). We let Lhs(M) = {l | ∃q, r: q(l) → r ∈ R}.

Let M = (Q, Σ, ∆, I, R, c) be an XTOPF. In order to facilitate composition, we define sentential forms more generally than immediately necessary. Let Σ′ and ∆′ be such that Σ ⊆ Σ′ and ∆ ⊆ ∆′. To keep the presentation simple, we assume that Q ∩ (Σ′ ∪ ∆′) = ∅. A sentential form of M (using Σ′ and ∆′) is a tree of SF(M) = T∆′(Q(TΣ′)). For every ξ, ζ ∈ SF(M), we write ξ ⇒M ζ if there exist a position w ∈ posQ(ξ), a rule ρ = ℓ → r ∈ R, and a substitution θ: X → TΣ′ such that θ(x) is an instance of c(ρ, x) for every x ∈ X and ξ = ξ[ℓθ]w and ζ = ξ[rθ]w. If the applicable rules are restricted to a certain subset R′ ⊆ R, then we also write ξ ⇒R′ ζ. Figure 5 illustrates a derivation step. The tree transformation computed by M is

τM = {(t, u) ∈ TΣ × T∆ | ∃q ∈ I: q(t) ⇒*M u}

where ⇒*M is the reflexive, transitive closure of ⇒M. It can easily be verified that the definition of τM is independent of the choice of Σ′ and ∆′. Moreover, it is known (Graehl et al., 2009) that each XTOPF can be transformed into an equivalent XTOP preserving both linearity and nondeletion. However, the notion of XTOPF will be convenient in our composition construction. A detailed exposition to XTOPs is presented by Arnold and Dauchet (1982) and Graehl et al. (2009).

A linear and nondeleting XTOP M with rules R can easily be reversed to obtain a linear and nondeleting XTOP M⁻¹ with rules R⁻¹, which computes the inverse transformation τM⁻¹ = (τM)⁻¹, by reversing all its rules. A (suitable) rule is reversed by exchanging the locations of the states. More precisely, given a rule q(l) → r ∈ R, we obtain the rule q(r′) → l′ of R⁻¹, where l′ = lθ and r′ is the unique tree such that there exists a substitution θ: X → Q(X) with θ(x) ∈ Q({x}) for every x ∈ X and r = r′θ. Figure 6 displays a rule and its corresponding reversed rule. The reversed form of the XTOP rule modeling the insertion operation in Figure 2 is displayed in Figure 7.

Finally, let us formally define composition. The XTOP M computes the tree transformation τM ⊆ TΣ × T∆. Given another XTOP N that computes a tree transformation τN ⊆ T∆ × TΓ, we might be interested in the tree transformation computed by the composition of M and N (i.e., running M first and then N). Formally, the composition τM ; τN of the tree transformations τM and τN is defined by

τM ; τN = {(s, u) | ∃t: (s, t) ∈ τM, (t, u) ∈ τN}

and we often also use the notion 'composition' for XTOPs with the expectation that the composition of M and N computes exactly τM ; τN.

4 Pre-processing

We want to compose two linear and nondeleting XTOPs M = (P, Σ, ∆, IM, RM) and N = (Q, ∆, Γ, IN, RN).

[Figure 8: Incompatible left-hand sides of Example 3.]
[Figure 9: Rules used in Example 5.]

Before we actually perform the composition, we will prepare M and N in two pre-processing steps. After these two steps, the composition is very simple. To avoid complications, we assume that (i) all rules of M are producing and (ii) all rules of N are consuming. Intuitively, for every ∆-labeled position w in a right-hand side r1 of M and any left-hand side l2 of N, we require (ignoring the states) that either (i) r1|w and l2 are not unifiable or (ii) r1|w is an instance of l2.
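The composition τM ; τN is ordinary relation composition. As a quick illustration (a hypothetical sketch, not code from the paper or from TIBURON), for finite transformations it can be computed directly from the defining set comprehension:

```python
def compose(tau_m, tau_n):
    """Relation composition: {(s, u) | exists t with (s, t) in tau_m and (t, u) in tau_n}."""
    return {(s, u) for (s, t) in tau_m for (t2, u) in tau_n if t == t2}

# Toy finite transformations; strings stand in for trees.
tau_m = {("s1", "t1"), ("s2", "t2")}
tau_n = {("t1", "u1"), ("t2", "u2"), ("t3", "u3")}
assert compose(tau_m, tau_n) == {("s1", "u1"), ("s2", "u2")}
```

Running M first and then N composes their transformations in this order; the goal of the construction is to realize exactly this relation with a single XTOP rather than by enumerating intermediate trees.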
For convenience, we also assume that the XTOPs Example 3. The XTOPs for the English-to- M and N only use variables of the disjoint sets German translation task in the Introduction are Y ⊆ X and Z ⊆ X, respectively. not compatible. This can be observed on the left-hand side l1 ∈ Lhs(M −1 ) of Figure 7 4.1 Compatibility and the left-hand side l2 ∈ Lhs(N ) of Fig- ure 2[bottom]. These two left-hand sides are il- In the existing composition results for subclasses lustrated in Figure 8. Between them there is an of XTOPs (Engelfriet, 1975; Baker, 1979; Maletti mgu such that θ(Y ) 6⊆ X (e.g., θ(y1 ) = z1 and and Vogler, 2010) the XTOP N has at most one θ(y2 ) = VP(z2 , z3 , z4 ) is such an mgu). input symbol in its left-hand sides. This restric- tion allows us to match rule applications of N to Theorem 4. There exists an XTOPF N 0 that is positions in the right-hand sides of M . Namely, equivalent to N and compatible with M . for each output symbol in a right-hand side of M , Proof. We achieve compatibility by cutting of- we can select a rule of N that can consume that fending rules of the XTOP N into smaller pieces. output symbol. To achieve a similar decompo- Unfortunately, both linearity and nondeletion sition strategy in our more general setup, we in- of N might be lost in the process. We first let troduce a compatibility requirement on right-hand N 0 = (Q, ∆, Γ, IN , RN , cN ) be the XTOPF such sides of M and left-hand sides of N . Roughly that cN (ρ, x) = x for every ρ ∈ RN and x ∈ X. speaking, we require that the left-hand sides of N If N 0 is compatible with M , then we are done. are small enough to completely process right- Otherwise, let l1 ∈ Lhs(M −1 ) be a left-hand side, hand sides of M . However, a comparison of q(l2 ) → r2 ∈ RN be a rule, and w ∈ pos∆ (l1 ) left- and right-hand sides is complicated by the be a position such that θ(y) ∈ / X for some fact that their shape is different (left-hand sides θ ∈ mgu(l1 |w , l2 ) and y ∈ Y . 
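The compatibility test hinges on computing most general unifiers of the tree fragments l1|w and l2. The following sketch uses my own toy representation, not the paper's: variables are Python strings and a k-ary symbol is a tuple whose first entry is the label; it also omits the occurs check, which is harmless for the linear patterns considered here. It reproduces Example 1 with a Robinson-style mgu computation:

```python
def substitute(t, theta):
    """Apply substitution theta (a dict from variables to trees) to tree t."""
    if isinstance(t, str):                       # variable leaf
        return theta.get(t, t)
    return (t[0],) + tuple(substitute(c, theta) for c in t[1:])

def unify(t1, t2):
    """Return a most general unifier of t1 and t2 as a dict, or None."""
    theta, stack = {}, [(t1, t2)]
    while stack:
        a, b = stack.pop()
        a, b = substitute(a, theta), substitute(b, theta)
        if a == b:
            continue
        if not isinstance(a, str) and isinstance(b, str):
            a, b = b, a                          # orient: variable on the left
        if isinstance(a, str):                   # bind variable a to b
            theta = {v: substitute(t, {a: b}) for v, t in theta.items()}
            theta[a] = b
        elif a[0] == b[0] and len(a) == len(b):
            stack.extend(zip(a[1:], b[1:]))      # decompose matching symbols
        else:
            return None                          # symbol clash: not unifiable
    return theta

# Example 1: t = σ(x1, γ(δ(β, β, x2))) and t' = σ(α, x3).
t = ("σ", "x1", ("γ", ("δ", ("β",), ("β",), "x2")))
t_prime = ("σ", ("α",), "x3")
theta = unify(t, t_prime)
assert theta == {"x1": ("α",), "x3": ("γ", ("δ", ("β",), ("β",), "x2"))}
assert substitute(t, theta) == substitute(t_prime, theta)
```

Since all mgus agree up to variable renaming, returning one representative (as here) suffices for the compatibility check.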
Let v ∈ posy (l1 |w ) have a state at the root, whereas right-hand sides be the unique position of y in l1 |w . have states in front of the variables). We avoid Now we have to distinguish two cases: (i) Ei- these complications by considering reversed rules ther var(l2 |v ) = ∅ and there is no leaf in r2 la- of M . Thus, an original right-hand side of M is beled by a symbol from Γ. In this case, we have now a left-hand side in the reversed rules and thus to introduce deletion and look-ahead into N 0 . We has the right format for a comparison. Recall that replace the old rule ρ = q(l2 ) → r2 by the new Lhs(N ) contains all left-hand sides of the rules rule ρ0 = q(l2 [z]v ) → r2 , where z ∈ X \ var(l2 ) of N , in which the state at the root was removed. is a variable that does not appear in l2 . In addition, Definition 2. The XTOP N is compatible to M we let cN (ρ0 , z) = l2 |v and cN (ρ0 , x) = cN (ρ, x) if θ(Y ) ⊆ X for all unifiers θ ∈ mgu(l1 |w , l2 ) for all x ∈ X \ {z}. between a subtree at a ∆-labeled position (ii) Otherwise, let V ⊆ var(l2 |v ) be a maximal w ∈ pos∆ (l1 ) in a left-hand side l1 ∈ Lhs(M −1 ) set such that there exists a minimal (with respect and a left-hand side l2 ∈ Lhs(N ). to the prefix order) position w0 ∈ pos(r2 ) with 812 Another rule of N q C q µ1 : → C qNP q0 δ σ z1 z → q1 q2 q3 z1 z z1 σ z1 z2 z3 q0 VP z2 z3 µ2 : VP → qVA qVP qNP Figure 10: Additional rule used in Example 5. z2 z3 z4 z2 z4 z3 var(r2 |w0 ) ⊆ var(l2 |v ) and var(r2 [β]w0 )∩V = ∅, Figure 11: Rules replacing the rule in Figure 7. where β ∈ Γ is arbitrary. Let z ∈ X \ var(l2 ) be a fresh variable, q 0 be a new state of N , and Example 5. Let us consider the rules illustrated V 0 = var(l2 |v ) \ V . We replace the rule in Figure 9. We might first note that y1 has to ρ = q(l2 ) → r2 of RN by be unified with β. 
Since β does not contain any ρ1 = q(l2 [z]v ) → trans(r2 )[q 0 (z)]w0 variables and the right-hand side of the rule of N does not contain any non-variable leaves, we are ρ2 = q 0 (l2 |v ) → r2 |w0 . in case (i) in the proof of Theorem 4. Conse- The look-ahead for z is trivial and other- quently, the displayed rule of N is replaced by a wise we simply copy the old look-ahead, so variant, in which β is replaced by a new variable z cN (ρ1 , z) = z and cN (ρ1 , x) = cN (ρ, x) for all with look-ahead β. x ∈ X \ {z}. Moreover, cN (ρ2 , x) = cN (ρ, x) Secondly, with this new rule there is an mgu, for all x ∈ X. The mapping ‘trans’ is given for in which y2 is mapped to σ(z1 , z2 ). Clearly, we t = γ(t1 , . . . , tk ) and q 00 (z 00 ) ∈ Q(Z) by are now in case (ii). Furthermore, we can select the set V = {z1 , z2 } and position w0 = . Cor- trans(t) = γ(trans(t1 ), . . . , trans(tk )) respondingly, the following two new rules for N ( replace the old rule: hl2 |v , q 00 , v 0 i(z) if z 00 ∈ V 0 trans(q 00 (z 00 )) = q 00 (z 00 ) otherwise, q(σ(z, z 0 )) → q 0 (z 0 ) where v 0 = posz 00 (l2 |v ). q 0 (σ(z1 , z2 )) → σ(q1 (z1 ), q2 (z2 )) , Finally, we collect all newly generated states of the form hl, q, vi in Ql and for every such where the look-ahead for z remains β. state with l = δ(l1 , . . . , lk ) and v = iw, let Figure 10 displays another rule of N . There is l0 = δ(z1 , . . . , zk ) and an mgu, in which y2 is mapped to σ(z2 , z3 ). Thus, we end up in case (ii) again and we can select the set V = {z2 } and position w0 = 2. Thus, we ( q(zi ) if w = ε hl, q, vi(l0 ) → replace the rule of Figure 10 by the new rules hli , q, wi(zi ) otherwise be a new rule of N without look-ahead. q(σ(z1 , z)) → δ(q1 (z1 ), q 0 (z), q3 (z)) (?) 0 Overall, we run the procedure until N 0 is com- q (σ(z2 , z3 )) → q2 (z2 ) patible with M . 
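The cutting step behind this construction is plain tree surgery: the offending subtree l2|v is replaced by a fresh variable z, and l2|v itself becomes the look-ahead restriction for z. A minimal sketch in a toy tuple representation (positions are tuples of 1-based child indices; all names are mine, not the paper's, and the sketch covers only the case where the detached subtree moves into the look-ahead, not the case that introduces new states):

```python
def subtree(t, pos):
    """t|pos: child i of a node (label, c1, ..., ck) is the tuple entry t[i]."""
    for i in pos:
        t = t[i]
    return t

def replace(t, pos, u):
    """t[u]pos: the tree obtained by replacing the subtree at pos with u."""
    if not pos:
        return u
    i = pos[0]
    return t[:i] + (replace(t[i], pos[1:], u),) + t[i + 1:]

def cut_lhs(l2, v, fresh="z"):
    """Cut the subtree at position v out of left-hand side l2; return the new
    left-hand side and the look-ahead restriction for the fresh variable."""
    return replace(l2, v, fresh), {fresh: subtree(l2, v)}

# Cutting δ(σ(z1, z2), β) at position 2 detaches the β into a look-ahead.
l2 = ("δ", ("σ", "z1", "z2"), ("β",))
new_l2, lookahead = cut_lhs(l2, (2,))
assert new_l2 == ("δ", ("σ", "z1", "z2"), "z")
assert lookahead == {"z": ("β",)}
```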
The procedure eventually ter- q3 (σ(z1 , z2 )) → q3 (z2 ) , minates since the left-hand sides of the newly added rules are always smaller than the replaced where q3 = hσ(z2 , z3 ), q3 , 2i. rules. Moreover, each step preserves the seman- Let us use the construction in the proof of The- tics of N 0 , which completes the proof. orem 4 to resolve the incompatibility (see Exam- We note that the look-ahead of N 0 after the con- ple 3) between the XTOPs presented in the Intro- struction used in the proof of Theorem 4 is either duction. Fortunately, the incompatibility can be trivial (i.e., a variable) or a ground tree (i.e., a tree resolved easily by cutting the rule of N (see Fig- without variables). Let us illustrate the construc- ure 7) into the rules of Figure 11. In this example, tion used in the proof of Theorem 4. linearity and nondeletion are preserved. 813 4.2 Local determinism q δ i s i p i q q0 ps i After the first pre-processing step, we have the → → ps → σ ps ρ ρ s0 ps original linear and nondeleting XTOP M and an XTOPF N 0 = (Q0 , ∆, Γ, IN , RN 0 , c ) that is N y1 y2 y1 y2 y2 y1 y1 equivalent to N and compatible with M . How- q q0 q ever, in the first pre-processing step we might i q i have introduced some non-linear (copying) rules ρs ρs ρs,s0 /ρ0s,s0 → ps → p → ps in N 0 (see rule (?) in Example 5), and it is known σ σ y1 y2 δ y1 that “nondeterminism [in M ] followed by copy- y1 y2 y1 y2 ing [in N 0 ]” is a feature that prevents composition y1 y2 y3 to work (Engelfriet, 1975; Baker, 1979). How- q0 σ q0 δ ever, our copying is very local and the copies are only used to project to different subtrees. ρ0s,s0 i i ρs,s0 i q q0 → → Nevertheless, during those projection steps, we ps0 pα p s0 ρ ρ δ δ need to make sure that the processing in M pro- y2 y3 y1 y2 y3 y2 y3 y3 y1 y2 y3 ceeds deterministically. We immediately note that all but one copy are processed by states of the Figure 12: Useful rules for the composition M 0 ; N 0 of form hl, q, vi ∈ Ql . 
These states basically pro- Example 8, where s, s0 ∈ {α, β} and ρ ∈ Pσ(z2 ,z3 ) . cess (part of) the tree l and project (with state q) to the subtree at position v. It is guaranteed that p(l) ⇒∗M 0 ξ ⇒M 0 r for some ξ that is not an each such subtree (indicated by v) is reached only instance of t. In other words, we construct each once. Thus, the copying is “resolved” once the rule of Rt by applying existing rules of RM in states of the form hl, q, vi are left. To keep the sequence to generate a (minimal) right-hand side presentation simple, we just add expanded rules that is an instance of t. We thus potentially make to M such that any rule that can produce a part of the right-hand sides of M bigger by joining sev- a tree l immediately produces the whole tree. A eral existing rules into a single rule. Note that similar strategy is used to handle the look-ahead this affects neither compatibility nor the seman- of N 0 . Any right-hand side of a rule of M that tics. In the second step, we add pure ε-rules produces part of a left-hand side of a rule of N 0 that allow us to change the state to one that we with look-ahead is expanded to produce the re- constructed in the previous step. For every new quired look-ahead immediately. state p¯ = p(l) → r, let base(¯ p) = p.S Then Let L ⊆ T∆ (Z) be the set of trees l such that 0 = R 0 = P ∪ RM M ∪ R L ∪ R E and P t∈L Pt • hl, q, vi appears as a state of Ql , or where • l = l2 θ for some ρ2 = q(l2 ) → r2 ∈ RN 0 [ 0 of N with non-trivial look-ahead (i.e., RL = Rt and Pt = {`(ε) | ` → r ∈ Rt } cN (ρ2 , z) ∈ / X for some z ∈ X), where t∈L [ θ(x) = cN (ρ2 , x) for every x ∈ X. RE = {base(¯ p)(x1 ) → p¯(x1 ) | p¯ ∈ Pt } . To keep the presentation uniform, we assume t∈L that for every l ∈ L, there exists a state of the Clearly, this does not change the semantics be- form hl, q, vi ∈ Q0 . If this is not already the cause each rule of RM 0 can be simulated by a case, then we can simply add useless states with- chain of rules of RM . 
Let us now do a full ex- out rules for them. In other words, we assume that ample for the pre-processing step. We consider a the first case applies to each l ∈ L. nondeterministic variant of the classical example Next, we add two sets of rules to RM , which by Arnold and Dauchet (1982). will not change the semantics but prove to be use- Example 6. Let M = (P, Σ, Σ, {p}, RM ) ful in the composition construction. First, for be the linear and nondeleting XTOP such that every tree t ∈ L, let Rt contain all the rules P = {p, pα , pβ }, Σ = {δ, σ, α, β, }, and p(l) → r, where p = p(l) → r is a new state RM contains the following rules with p ∈ P , minimal normalized tree l ∈ TΣ (X), and an instance r ∈ T∆ (P (X)) of t such that p(σ(y1 , y2 )) → σ(ps (y1 ), p(y2 )) (†) 814 p(δ(y1 , y2 , y3 )) → σ(ps (y1 ), σ(ps0 (y2 ), p(y3 ))) hq, pi C p(δ(y1 , y2 , y3 )) → σ(ps (y1 ), σ(ps0 (y2 ), pα (y3 ))) RC ps (s0 (y1 )) → s(ps (y1 )) → hqNP , pNP i hq 0 , pVP i ps () → PREL C x1 x2 that x1 x2 for every s, s0∈ {α, β}. Similarly, we let N = (Q, Σ, Σ, {q}, RN ) be the linear and non- Figure 13: Composed rule created from the rule of Fig- deleting XTOP such that Q = {q, i} and RN con- ure 7 and the rules of N 0 displayed in Figure 11. tains the following rules q(σ(z1 , z2 )) → σ(i(z1 ), i(z2 )) 5 Composition q(σ(z1 , σ(z2 , z3 ))) → δ(i(z1 ), i(z2 ), q(z3 )) (‡) Now we are ready for the actual composition. For i(s(z1 )) → s(i(z1 )) space efficiency reasons we reuse the notations i() → used in Section 4. Moreover, we identify trees of TΓ (Q0 (P 0 (X))) with trees of TΓ ((Q0 × P 0 )(X)). for all s ∈ {α, β}. It can easily be verified that In other words, when meeting a subtree q(p(x)) M and N meet our requirements. However, N is with q ∈ Q0 , p ∈ P 0 , and x ∈ X, then we also not yet compatible with M because an mgu be- view this equivalently as the tree hq, pi(x), which tween rules (†) of M and (‡) of N might map y2 could be part of a rule of our composed XTOP. to σ(z2 , z3 ). 
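A derivation with an XTOP without look-ahead can be simulated mechanically: find a state-labeled position, match a rule's left-hand side, and substitute the captured subtrees into its right-hand side. The sketch below is a toy interpreter over the same tuple trees, with an invented child-swapping rule set; it is purely illustrative and not the rules of this example or any TIBURON code:

```python
def match(pattern, tree, binding):
    """Match a linear left-hand side against a ground input tree."""
    if isinstance(pattern, str):                 # variable: capture the subtree
        binding[pattern] = tree
        return True
    if pattern[0] == tree[0] and len(pattern) == len(tree):
        return all(match(p, c, binding) for p, c in zip(pattern[1:], tree[1:]))
    return False

def derive(rules, state, tree):
    """All outputs derivable from state(tree); rules are (state, lhs, rhs)."""
    outputs = []
    for q, lhs, rhs in rules:
        binding = {}
        if q == state and match(lhs, tree, binding):
            outputs.extend(instantiate(rules, rhs, binding))
    return outputs

def instantiate(rules, rhs, binding):
    if rhs[0] == "state":                        # q'(x): recurse on captured subtree
        _, q, x = rhs
        return derive(rules, q, binding[x])
    parts = [[]]                                 # cartesian product over children
    for child in rhs[1:]:
        parts = [p + [c] for p in parts for c in instantiate(rules, child, binding)]
    return [(rhs[0],) + tuple(p) for p in parts]

# Invented rules: q swaps the children of every σ and copies the nullary α, β.
rules = [
    ("q", ("σ", "x1", "x2"), ("σ", ("state", "q", "x2"), ("state", "q", "x1"))),
    ("q", ("α",), ("α",)),
    ("q", ("β",), ("β",)),
]
assert derive(rules, "q", ("σ", ("α",), ("β",))) == [("σ", ("β",), ("α",))]
```

Because `derive` collects every applicable rule, nondeterministic rule sets simply yield several output trees, mirroring the relational semantics of τM.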
Thus, we decompose (‡) into However, not all combinations of states will be q(σ(z1 , z)) → δ(i(z1 ), q(z), q 0 (z)) allowed in our composed XTOP, so some combi- q 0 (σ(z2 , z3 )) → q(z3 ) nations will never yield valid rules. Generally, we construct a rule of M 0 ; N 0 by ap- q(σ(z1 , z2 )) → i(z1 ) plying a single rule of M 0 followed by any num- where q = hσ(z2 , z3 ), i, 1i. This newly obtained ber of pure ε-rules of RE , which can turn states XTOP N 0 is compatible with M . In addition, we base(p) into p. Then we apply any number of only have one special tree σ(z2 , z3 ) that occurs in rules of N 0 and try to obtain a sentential form that states of the form hl, q, vi. Thus, we need to com- has the required shape of a rule of M 0 ; N 0 . pute all minimal derivations whose output trees are instances of σ(z2 , z3 ). This is again simple Definition 7. Let M 0 = (P 0 , Σ, ∆, IM , RM0 ) and since the first three rule schemes ρs , ρs,s0 , and 0 0 0 N = (Q , ∆, Γ, IN , RN ) beS the XTOPs con- ρ0s,s0 of M create such instances, so we simply structed in Section 4, where S 0 l∈L Pl ⊆ P and create copies of them: 0 00 0 S l∈L Ql ⊆ Q . Let Q = Q \ l∈L Ql . We con- 0 0 struct the XTOP M ; N = (S, Σ, Γ, IN × IM , R) ρs (σ(y1 , y2 )) → σ(ps (y1 ), p(y2 )) ρs,s0 (δ(y1 , y2 , y3 )) → σ(ps (y1 ), σ(ps0 (y2 ), p(y3 ))) where ρ0s,s0 (δ(y1 , y2 , y3 )) → σ(ps (y1 ), σ(ps0 (y2 ), pα (y3 ))) [ S= (Ql × Pl ) ∪ (Q00 × P 0 ) for all s, s0 ∈ {α, β}. These are all the rules l∈L of Rσ(z2 ,z3 ) . In addition, we create the following and R contains all normalized rules ` → r (of the rules of RE : required shape) such that p(x1 ) → ρs (x1 ) p(x1 ) → ρs,s0 (x1 ) ` ⇒M 0 ξ ⇒∗RE ζ ⇒∗N 0 r p(x1 ) → ρ0s,s0 (x1 ) for all s, s0 ∈ {α, β}. for some ξ, ζ ∈ TΓ (Q0 (T∆ (P 0 (X)))). Especially after reading the example it might The required rule shape is given by the defi- seem useless to create the rule copies in Rl [in Ex- nition of an XTOP. Most importantly, we must ample 6 for l = σ(z2 , z3 )]. 
However, each such have that ` ∈ S(TΣ (X)), which we identify rule has a distinct state at the root of the left-hand with a certain subset of Q0 (P 0 (TΣ (X))), and side, which can be used to trigger only this rule. r ∈ TΓ (S(X)), which similarly corresponds to In this way, the state selects the next rule to apply, a subset of TΓ (Q0 (P 0 (X))). The states are sim- which yields the desired local determinism. ply combinations of the states of M 0 and N 0 , of 815 q q σ σ p p i i δ i i q σ → σ → ps ps0 i q q0 ps ps p y1 σ y1 y1 y2 y3 δ y1 y2 ps00 ρ0 ρ0 y2 y3 y2 y3 y4 y3 y4 y4 Figure 14: Successfully expanded rule from Exam- ple 9. Figure 15: Expanded rule that remains copying (see Example 9). which however the combinations of a state q ∈ Ql with a state p ∈ / Pl are forbidden. This reflects the 6 Post-processing intuition of the previous section. If we entered a Finally, we will compose rules again in an ef- special state of the form hl, q, vi, then we should fort to restore linearity (and nondeletion). Since use a corresponding state p ∈ Pl of M , which the composition of two linear and nondeleting only has rules producing instances of l. We note XTOPs cannot always be computed by a single that look-ahead of N 0 is checked normally in the XTOP (Arnold and Dauchet, 1982), this method derivation process. can fail to return such an XTOP. The presented Example 8. Now let us illustrate the composition method is not a characterization, which means it on Example 6. Let us start with rule (†) of M . might even fail to return a linear and nondelet- q(p(σ(x1 , x2 ))) ing XTOP although an equivalent linear and non- ⇒M 0 q(σ(ps (x1 ), p(x2 ))) deleting XTOP exists. However, in a significant number of examples, the recombination succeeds ⇒RE q(σ(ps (x1 ), ρs0 ,s00 (x2 ))) to rebuild a linear (and nondeleting) XTOP. ⇒N 0 δ(i(ps (x1 )), q(ρs0 ,s00 (x2 )), q 0 (ρs0 ,s00 (x2 ))) Let M 0 ; N 0 = (S, Σ, Γ, I, R) be the composed is a rule of M 0 ; N 0 for every s, s0 , s00 ∈ {α, β}. 
XTOP constructed in Section 5. We simply in- Note if we had not applied the RE -step, then we spect each non-linear rule (i.e., each rule with a would not have obtained a rule of M ; N (be- non-linear right-hand side) and expand it by all cause we would have obtained the state combina- rule options at the copied variables. Since the tion hq, pi instead of hq, ρs0 ,s00 i, and hq, pi is not a method is pretty standard and variants have al- state of M 0 ; N 0 ). Let us also construct a rule for ready been used in the pre-processing steps, we the state combination hq, ρs0 ,s00 i. only illustrate it on the rules of Figure 12. q(ρs0 ,s00 (δ(x1 , x2 , x3 ))) Example 9. The first (top row, left-most) rule of Figure 12 is non-linear in the variable y2 . Thus, ⇒M 0 q(σ(ps0 (x1 ), σ(ps00 (x2 ), p(x3 )))) we expand the calls hq, ρi(y2 ) and hq 0 , ρi(y2 ). If ⇒N 0 q 0 (ps0 (x1 )) ρ = ρs for some s ∈ {α, β}, then the next rules Finally, let us construct a rule for the state combi- are uniquely determined and we obtain the rule nation hq 00 , ρs0 ,s00 i. displayed in Figure 14. Here the expansion was successful and we could delete the original rule q 00 (ρs0 ,s00 (δ(x1 , x2 , x3 ))) for ρ = ρs and replace it by the displayed ex- ⇒M 0 q(σ(ps0 (x1 ), σ(ps00 (x2 ), p(x3 )))) panded rule. However, if ρ = ρ0s0 ,s00 , then we can ⇒RE q(σ(ps0 (x1 ), σ(ps00 (x2 ), ρs (x3 )))) also expand the rule to obtain the rule displayed in ⇒N 0 q(σ(ps00 (x2 ), ρs (x3 ))) Figure 15. It is still copying and we could repeat the process of expansion here, but we cannot get ⇒N 0 δ(q 0 (ps00 (x1 )), q(ρs (x2 )), q 00 (ρs (x2 ))) rid of all copying rules using this approach (as ex- for every s ∈ {α, β}. pected since there is no linear XTOP computing After having pre-processed the XTOPs in our the same tree transformation). introductory example, the devices M and N 0 can be composed into M ; N 0 . One rule of the com- posed XTOP is illustrated in Figure 13. 
References

André Arnold and Max Dauchet. 1982. Morphismes et bimorphismes d'arbres. Theoretical Computer Science, 20(1):33–93.

Brenda S. Baker. 1979. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL, pages 205–208. Association for Computational Linguistics.

Joost Engelfriet, Eric Lilin, and Andreas Maletti. 2009. Composition and decomposition of extended multi bottom-up tree transducers. Acta Informatica, 46(8):561–590.

Joost Engelfriet. 1975. Bottom-up and top-down tree transformations—A comparison. Mathematical Systems Theory, 9(3):198–231.

Joost Engelfriet. 1977. Top-down tree transducers with regular look-ahead. Mathematical Systems Theory, 10(1):289–303.

Jonathan Graehl, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.

Jonathan Graehl, Mark Hopkins, Kevin Knight, and Andreas Maletti. 2009. The power of extended top-down tree transducers. SIAM Journal on Computing, 39(2):410–430.

Kevin Knight and Jonathan Graehl. 2005. An overview of probabilistic tree transducers for natural language processing. In Proc. CICLing, volume 3406 of LNCS, pages 1–24. Springer.

Kevin Knight. 2007. Capturing practical natural language transformations. Machine Translation, 21(2):121–133.

Eric Lilin. 1978. Une généralisation des transducteurs d'états finis d'arbres: les S-transducteurs. Thèse 3ème cycle, Université de Lille.

Andreas Maletti and Heiko Vogler. 2010. Compositions of top-down tree transducers with ε-rules. In Proc. FSMNLP, volume 6062 of LNAI, pages 69–80. Springer.

Andreas Maletti. 2010. Why synchronous tree substitution grammars? In Proc. HLT-NAACL, pages 876–884. Association for Computational Linguistics.

Andreas Maletti. 2011. An alternative to synchronous tree substitution grammars. Natural Language Engineering, 17(2):221–242.

Alberto Martelli and Ugo Montanari. 1982. An efficient unification algorithm. ACM Transactions on Programming Languages and Systems, 4(2):258–282.

Jonathan May and Kevin Knight. 2006. Tiburon: A weighted tree automata toolkit. In Proc. CIAA, volume 4094 of LNCS, pages 102–113. Springer.

Jonathan May, Kevin Knight, and Heiko Vogler. 2010. Efficient inference through cascades of weighted tree transducers. In Proc. ACL, pages 1058–1066. Association for Computational Linguistics.

Jonathan May. 2010. Weighted Tree Automata and Transducers for Syntactic Natural Language Processing. Ph.D. thesis, University of Southern California, Los Angeles.

John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41.

William C. Rounds. 1970. Mappings and grammars on trees. Mathematical Systems Theory, 4(3):257–287.

Stuart M. Shieber. 2004. Synchronous grammars as tree transducers. In Proc. TAG+7, pages 88–95.

James W. Thatcher. 1970. Generalized² sequential machine maps. Journal of Computer and System Sciences, 4(4):339–367.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proc. ACL, pages 523–530. Association for Computational Linguistics.

Structural and Topical Dimensions in Multi-Task Patent Translation

Katharina Wäschle and Stefan Riezler
Department of Computational Linguistics
Heidelberg University, Germany
{waeschle,riezler}@cl.uni-heidelberg.de

Abstract

Patent translation is a complex problem due to the highly specialized technical vocabulary and the peculiar textual structure of patent documents. In this paper we analyze patents along the orthogonal dimensions of topic and textual structure. We view different patent classes and different patent text sections such as title, abstract, and claims as separate translation tasks, and investigate the influence of such tasks on machine translation performance. We study multi-task learning techniques that exploit commonalities between tasks by mixtures of translation models or by multi-task meta-parameter tuning. We find small but significant gains over task-specific training by techniques that model commonalities through shared parameters. A by-product of our work is a parallel patent corpus of 23 million German-English sentence pairs.

1 Introduction

Patents are an important tool for the protection of intellectual property and also play a significant role in business strategies in modern economies. Patent translation is an enabling technique for patent prior art search, which aims to detect a patent's novelty and thus needs to be cross-lingual for a multitude of languages. Patent translation is complicated by a highly specialized vocabulary, consisting of technical terms specific to the field of invention the patent relates to.

In this paper, we analyze patents with respect to the orthogonal dimensions of topic – the technical field covered by the patent – and structure – a patent's text sections – with respect to their influence on machine translation performance.

The topical dimension of patents is characterized by the International Patent Classification (IPC)1, which categorizes patents hierarchically into 8 sections, 120 classes, 600 subclasses, down to 70,000 subgroups at the leaf level. Table 1 shows the 8 top level sections.

A  Human Necessities
B  Performing Operations, Transporting
C  Chemistry, Metallurgy
D  Textiles, Paper
E  Fixed Constructions
F  Mechanical Engineering, Lighting, Heating, Weapons
G  Physics
H  Electricity

Table 1: IPC top level sections.

Orthogonal to the patent classification, patent documents can be sub-categorized along the dimension of textual structure. Article 78.1 of the European Patent Convention (EPC) lists all sections required in a patent document2:

"A European patent application shall contain:
Patents are writ- ten in a sophisticated legal jargon (“patentese”) (a) a request for the grant of a Euro- that is not found in everyday language and ex- pean patent; hibits a complex textual structure. Also, patents 1 http://www.wipo.int/classifications/ are often intentionally ambiguous or vague in or- ipc/en/ 2 der to maximize the coverage of the claims. Highlights by the authors. 818 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 818–828, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics (b) a description of the invention; adapting unsupervised generative modules such (c) one or more claims; as translation models or language models to new (d) any drawings referred to in the de- tasks. For example, transductive approaches have scription or the claims; used automatic translations of monolingual cor- (e) an abstract, pora for self-training modules of the generative SMT pipeline (Ueffing et al., 2007; Schwenk, and satisfy the requirements laid down 2008; Bertoldi and Federico, 2009). Other ap- in the Implementing Regulations.” proaches have extracted parallel data from similar The request for grant contains the patent title; thus or comparable corpora (Zhao et al., 2004; Snover a patent document comprises the textual elements et al., 2008). Several approaches have been pre- of title, description, claim, and abstract. sented that train separate translation and language We investigate whether it is worthwhile to treat models on task-specific subsets of the data and different values along the structural and topical combine them in different mixture models (Fos- dimensions as different tasks that are not com- ter and Kuhn, 2007; Koehn and Schroeder, 2007; pletely independent of each other but share some Foster et al., 2010). The latter kind of approach is commonalities, yet differ enough to counter a applied in our work to multiple patent tasks. 
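The mixture-model idea applied here amounts to a weighted linear combination of task-specific model scores. A deliberately simplified sketch (dictionary phrase tables with made-up probabilities, purely illustrative of the combination, not of any actual system):

```python
def mixture_prob(tables, weights, src, tgt):
    """Linear mixture: p(tgt | src) = sum over tasks d of lambda_d * p_d(tgt | src)."""
    return sum(w * table.get((src, tgt), 0.0) for table, w in zip(tables, weights))

# Toy task-specific phrase tables for two patent text sections.
claims_table = {("vorrichtung", "apparatus"): 0.8, ("vorrichtung", "device"): 0.2}
titles_table = {("vorrichtung", "apparatus"): 0.3, ("vorrichtung", "device"): 0.7}

p = mixture_prob([claims_table, titles_table], [0.6, 0.4], "vorrichtung", "device")
assert abs(p - 0.40) < 1e-12  # 0.6 * 0.2 + 0.4 * 0.7
```

Tuning the mixture weights per target task is what lets a shared model adapt to each section or IPC class without discarding the pooled evidence.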
simple pooling of data. For example, we con- Multi-task learning efforts in patent transla- sider different tasks such as patents from different tion have so far been restricted to experimental IPC classes, or along an orthogonal dimension, combinations of translation and language mod- patent documents of all IPC classes but consisting els from different sets of IPC sections. For ex- only of titles or only of claims. We ask whether ample, Utiyama and Isahara (2007) and Tinsley such tasks should be addressed as separate trans- et al. (2010) investigate translation and language lation tasks, or whether translation performance models trained on different sets of patent sections, can be improved by learning several tasks simul- with larger pools of parallel data improving re- taneously through shared models that are more so- sults. Ceaus¸u et al. (2011) find that language mod- phisticated than simple data pooling. Our goal is els always and translation model mostly benefit to learn a patent translation system that performs from larger pools of data from different sections. well across several different tasks, thus benefits Models trained on pooled patent data are used as from shared information, but is yet able to address baselines in our approach. the specifics of each task. The machine learning community has devel- One contribution of this paper is a thorough oped several different formalizations of the cen- analysis of the differences and similarities of mul- tral idea of trading off optimality of parameter tilingual patent data along the dimensions of tex- vectors for each task-specific model and close- tual structure and topic. The second contribution ness of these model parameters to the average pa- is the experimental investigation of the influence rameter vector across models. For example, start- of various such tasks on patent translation perfor- ing from a separate SVM for each task, Evgeniou mance. 
Starting from baseline models that are and Pontil (2004) present a regularization method trained on individual tasks or on data pooled from that trades off optimization of the task-specific pa- all tasks, we apply mixtures of translation mod- rameter vectors and the distance of each SVM to els and multi-task minimum error rate training to the average SVM. Equivalent formalizations re- multiple patent translation tasks. A by-product of place parameter regularization by Bayesian prior our research is a parallel patent corpus of over 23 distributions on the parameters (Finkel and Man- million sentence pairs. ning, 2009) or by augmentation of the feature space with domain independent features (Daum´e, 2 Related work 2007). Besides SVMs, several learning algo- Multi-task learning has mostly been discussed un- rithms have been extended to the multi-task sce- der the name of multi-domain adaptation in the nario in a parameter regularization setting, e.g., area of statistical machine translation (SMT). If perceptron-type algorithms (Dredze et al., 2010) we consider domains as tasks, domain adapta- or boosting (Chapelle et al., 2011). Further vari- tion is a special two-task case of multi-task learn- ants include different formalizations of norms for ing. Most previous work has concentrated on parameter regularization, e.g., `1,2 regularization 819 (Obozinski et al., 2010) or `1,∞ regularization pass alignment. This yields the parallel corpus (Quattoni et al., 2009), where only the features listed in table 2 with high input-output ratios for that are most important across all tasks are kept in claims, and much lower ratios for abstracts and the model. In our experiments, we apply parame- descriptions, showing that claims exhibit a nat- ter regularization for multi-task learning to mini- ural parallelism due to their structure, while ab- mum error rate training for patent translation. stracts and descriptions are considerably less par- allel. 
Removing duplicates and adding parallel ti- 3 Extraction of a parallel patent corpus tles results in a corpus of over 23 million parallel from comparable data sentence pairs. Our work on patent translation is based on the output de ratio en ratio MAREC3 patent data corpus. MAREC con- tains over 19 million patent applications and abstract 720,571 92.36% 76.81% granted patents in a standardized format from claims 8,346,863 97.82% 96.17% four patent organizations (European Patent Of- descr. 14,082,381 86.23% 82.67% fice (EP), World Intellectual Property Organiza- Table 2: Number of parallel sentences in output with tion (WO), United States Patent and Trademark input/output ratio of sentence aligner. Office (US), Japan Patent Office (JP)), from 1976 to 2008. The data for our experiments are ex- Differences between the text sections become tracted from the EP and WO collections which visible in an analysis of token to type ratios. Ta- contain patent documents that include translations ble 3 gives the average number of tokens com- of some of the patent text. To extract such parallel pared to the average type frequencies for a win- patent sections, we first determine the longest in- dow of 100,000 tokens from every subsection. It stance, if different kinds4 exist for a patent. We shows that titles contain considerably fewer to- assume titles to be sentence-aligned by default, kens than other sections, however, the disadvan- and define sections with a token ratio larger than tage is partially made up by a relatively large 0.7 as parallel. For the language pair German- amount of types, indicated by a lower average English we extracted a total of 2,101,107 parallel type frequency. titles, 291,716 parallel abstracts, and 735,667 par- allel claims sections. tokens types The lack of directly translated descriptions poses a serious limitation for patent translation, de en de en since this section constitutes the largest part of the title 6.5 8.0 2.9 4.8 document. 
It is possible to obtain comparable de- abstract 37.4 43.2 4.3 9.0 scriptions from related patents that have been filed claims 53.2 61.3 5.5 9.5 in different countries and are connected through description 27.5 35.5 4.0 7.0 the patent family id. We extracted 172,472 patents that were both filed with the USPTO and the EPO Table 3: Average number of tokens and average type and contain an English and a German description, frequencies in text sections. respectively. We reserved patent data published between For sentence alignment, we used the Gargan- 1979 and 2007 for training and documents pub- tua5 tool (Braune and Fraser, 2010) that fil- lished in 2008 for tuning and testing in SMT. ters a sentence-length based alignment with IBM For the dimension of text sections, we sampled Model-1 lexical word translation probabilities, es- 500,000 sentences – distributed across all IPC timated on parallel data obtained from the first- sections – for training and 2,000 sentences for 3 http://www.ir-facility.org/ each text section for development and testing. Be- prototypes/marec cause of a relatively high number of identical sen- 4 A patent kind code indicates the document stage in the tences in test and training set for titles, we re- filing process, e.g., A for applications and B for granted patents, with publication levels from 1-9. See http:// moved the overlap for this section. www.wipo.int/standards/en/part\_03.html. Table 4 shows the distribution of IPC sections 5 http://gargantua.sourceforge.net on claims, with the smallest class accounting for 820 around 300,000 parallel sentences. In order to ob- ison to the task-specific MAREC model, although tain similar amounts of training data for each task the former has been learned on more than three along the topical dimension, we sampled 300,000 times the amount of data. 
An analysis of the out- sentences from each IPC class for training, and put of both system shows that the Europarl model 2,000 sentences for each IPC class for develop- suffers from two problems: Firstly, there is an ob- ment and testing. vious out of vocabulary (OOV) problem of the Europarl model compared to the MAREC model. A 1,947,542 Secondly, the Europarl model suffers from incor- B 2,522,995 rect word sense disambiguation, as illustrated by C 2,263,375 the samples in table 6. D 299,742 E 353,910 source steuerbar leitet F 1,012,808 Europarl taxable is in charge of G 2,066,132 MAREC controllable guiding H 1,754,573 reference controllable guides Table 4: Distribution of IPC sections on claims. Table 6: Output of Europarl model on MAREC data. 4 Machine translation experiments Table 7 shows the results of the evaluation across text sections; we measured the perfor- 4.1 Individual task baselines mance of separately trained and tuned individual For our experiments we used the phrase-based, models on every section. The results allow some open-source SMT toolkit Moses6 (Koehn et al., conclusions about the textual characteristics of the 2007). For language modeling, we computed sections and indicate similarities. Naturally, ev- 5-gram models using IRSTLM7 (Federico et ery task is best translated with a model trained al., 2008) and queried the model with KenLM on the respective section, as the B LEU scores (Heafield, 2011). B LEU (Papineni et al., 2001) on the diagonal are the highest in every column. scores were computed up to 4-grams on lower- Accordingly, we are interested in the runner-up cased data. on each section, which is indicated in bold font. The results on abstracts suggest that this section Europarl-v6 MAREC bears the strongest resemblance to claims, since B LEU OOV B LEU OOV the model trained on claims achieves a respectable score. 
The abstract model seems to be the most abstract 0.1726 14.40% 0.3721 3.00% robust and varied model, yielding the runner-up claim 0.2301 15.80% 0.4711 4.20% score on all other sections. Claims are easiest to title 0.0964 26.00% 0.3228 9.20% translate, yielding the highest overall B LEU score of 0.4879. In contrast to that, all models score Table 5: B LEU scores and OOV rate for Europarl base- considerably lower on titles. line and MAREC model. Table 5 shows a first comparison of results of test Moses models trained on 500,000 parallel sen- train abstract claim title desc. tences from patent text sections balanced over IPC abstract 0.3737 0.4076 0.2681 0.2812 classes, against Moses trained on 1.7 Million sen- claim 0.3416 0.4879 0.2420 0.2623 tences of parliament proceedings from Europarl8 title 0.2839 0.3512 0.3196 0.1743 (Koehn, 2005). The best result on each section is desc. 0.32189 0.403 0.2342 0.3347 indicated in bold face. The Europarl model per- forms very poorly on all three sections in compar- Table 7: B LEU scores for 500k individual text section 6 http://statmt.org/moses/ models. 7 http://sourceforge.net/projects/ irstlm/ The cross-section evaluation on the IPC classes 8 http://www.statmt.org/europarl/ (table 8) shows similar patterns. Each section 821 is best translated with a model trained on data section B and C is trained on a data set composed from the same section. Note that best section of 150,000 sentences from each IPC section. The scores vary considerably, ranging from 0.5719 on pooled model for pairing data from abstracts and C to 0.4714 on H, indicating that higher-scoring claims is trained on data composed of 250,000 classes, such as C and A, are more homogeneous sentences from each text section. and therefore easier to translate. 
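The runner-up analysis over the cross-section grid in table 7 can be sketched in a few lines. The scores below are copied from the table; the helper function is illustrative and not part of the paper's tooling:

```python
# Cross-task BLEU grid from table 7: bleu[train][test].
bleu = {
    "abstract": {"abstract": 0.3737, "claim": 0.4076, "title": 0.2681, "desc": 0.2812},
    "claim":    {"abstract": 0.3416, "claim": 0.4879, "title": 0.2420, "desc": 0.2623},
    "title":    {"abstract": 0.2839, "claim": 0.3512, "title": 0.3196, "desc": 0.1743},
    "desc":     {"abstract": 0.3218, "claim": 0.4030, "title": 0.2342, "desc": 0.3347},
}

def runner_up(test_section):
    """Best off-diagonal model on a test section, read as a crude similarity signal."""
    candidates = {train: scores[test_section]
                  for train, scores in bleu.items() if train != test_section}
    return max(candidates, key=candidates.get)

for sec in bleu:
    print(sec, "->", runner_up(sec))
# -> "claim" is the runner-up on abstracts; "abstract" on all other sections.
```

The output reproduces the observations in the text: the claims model is the runner-up on abstracts, and the abstract model is the runner-up everywhere else.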
C, the Chem- Another approach to exploit commonalities be- istry section, presumably benefits from the fact tween tasks is to train separate language and trans- that the data contain chemical formulae, which lation models9 on the sentences from each task are language-independent and do not have to be and combine the models in the global log-linear translated. Again, for determining the relation- model of the SMT framework, following Fos- ship between the classes, we examine the best ter and Kuhn (2007) and Koehn and Schroeder runner-up on each section, considering the B LEU (2007). Model combination is accomplished by score, although asymmetrical, as a kind of mea- adding additional language model and translation sure of similarity between classes. We can es- model features to the log-linear model and tuning tablish symmetric relationships between sections the additional meta-parameters by standard mini- A and C, B and F as well as G and H, which mum error rate training (Bertoldi et al., 2009). means that the models are mutual runner-up on We try out mixture and pooling for all pairwise the other’s test section. combinations of the three structural sections, for The similarities of translation tasks estab- which we have high-quality data, i.e. abstract, lished in the previous section can be confirmed claims and title. Due to the large number of pos- by information-theoretic similarity measures that sible combinations of IPC sections, we limit the perform a pairwise comparison of the vocabulary experiments to pairs of similar sections, based on probability distribution of each task-specific cor- the A-distance measure. pus. This distribution is calculated on the basis of Table 10 lists the results for two combinations the 500 most frequent words in the union of two of data from different sections: a log-linear mix- corpora, normalized by vocabulary size. As met- ture of separately trained models and simple pool- ric we use the A-distance measure of Kifer et al. 
ing, i.e. concatenation, of the training data. Over- (2004). If A is the set of events on which the word all, the mixture models perform slightly better distributions of two corpora are defined, then the than the pooled models on the text sections, al- A-distance is the supremum of the difference of though the difference is significant only in two probabilities assigned to the same event. Low dis- cases. This is indicated by highlighting best re- tance means higher similarity. sults in bold face (with more than one result high- Table 9 shows the A-distance of corpora spe- lighted if the difference is not significant).10 cific to IPC classes. The most similar section or We investigate the same mixture and pooling sections – apart from the section itself on the di- techniques on the IPC sections we considered agonal – is indicated in bold face. The pairwise pairwise similar (see table 11). Somehow contra- similarity of A and C, B and F, G and H obtained dicting the former results, the mixture models per- by B LEU score is confirmed. Furthermore, a close form significantly worse than the pooled model on similarity between E and F is indicated. G and three sections. This might be the result of inade- H (electricity and physics, respectively) are very quate tuning, since most of the time the MERT similar to each other but not close to any other algorithm did not converge after the maximum section apart from B. number of iterations, due to the larger number of features when using several models. 4.2 Task pooling and mixture 9 Following Duh et al. (2010), we use the alignment One straightforward technique to exploit com- model trained on the pooled data set in the phrase extraction monalities between tasks is pooling data from phase of the separate models. Similarly, we use a globally separate tasks into a single training set. Instead of trained lexical reordering model. 
10 a trivial enlargement of training data by pooling, For assessing significance, we apply the approximate randomization method described in Riezler and Maxwell we train the pooled models on the same amount (2005). We consider pairwise differing results scoring a p- of sentences as the individual models. For in- value smaller than 0.05 as significant; the assessment is re- stance, the pooled model for the pairing of IPC peated three times and the average value is taken. 822 test train A B C D E F G H A 0.5349 0.4475 0.5472 0.4746 0.4438 0.4523 0.4318 0.4109 B 0.4846 0.4736 0.5161 0.4847 0.4578 0.4734 0.4396 0.4248 C 0.5047 0.4257 0.5719 0.462 0.4134 0.4249 0.409 0.3845 D 0.47 0.4387 0.5106 0.5167 0.4344 0.4435 0.407 0.3917 E 0.4486 0.4458 0.4681 0.4531 0.4771 0.4591 0.4073 0.4028 F 0.4595 0.4588 0.4761 0.4655 0.4517 0.4909 0.422 0.4188 G 0.4935 0.4489 0.5239 0.4629 0.4414 0.4565 0.4748 0.4532 H 0.4628 0.4484 0.4914 0.4621 0.4421 0.4616 0.4588 0.4714 Table 8: B LEU scores for 300k individual IPC section models. A B C D E F G H A 0 0.1303 0.1317 0.1311 0.188 0.186 0.164 0.1906 B 0.1302 0 0.2388 0.1242 0.0974 0.0875 0.1417 0.1514 C 0.1317 0.2388 0 0.1992 0.311 0.3068 0.2506 0.2825 D 0.1311 0.1242 0.1992 0 0.1811 0.1808 0.1876 0.201 E 0.188 0.0974 0.311 0.1811 0 0.0921 0.2058 0.2025 F 0.186 0.0875 0.3068 0.1808 0.0921 0 0.1824 0.1743 G 0.164 0.1417 0.2506 0.1876 0.2056 0.1824 0 0.064 H 0.1906 0.1514 0.2825 0.201 0.2025 0.1743 0.064 0 Table 9: Pairwise A-distance for 300k IPC training sets. train test pooling mixture train test pooling mixture abstract-claim abstract 0.3703 0.3704 A-C A 0.5271 0.5274 claim 0.4809 0.4834 C 0.5664 0.5632 claim-title claim 0.4799 0.4789 B-F B 0.4696 0.4354 title 0.3269 0.328 F 0.4859 0.4769 title-abstract title 0.3311 0.3275 G-H G 0.4735 0.4754 abstract 0.3643 0.366 H 0.4634 0.467 Table 10: Mixture and pooling on text sections. Table 11: Mixture and pooling on IPC sections. 
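The A-distance computation used above for task similarity can be sketched as follows. This is a simplified reading of Kifer et al. (2004) as paraphrased in the text: the event set is the 500 most frequent words in the union of the two corpora, and normalization here is by corpus length in tokens, which is one plausible interpretation of the description; the toy corpora stand in for the 300k training sets:

```python
from collections import Counter

def a_distance(corpus_a, corpus_b, top_k=500):
    """Supremum of probability differences over the most frequent words
    in the union of two token lists (cf. Kifer et al., 2004)."""
    freq_a, freq_b = Counter(corpus_a), Counter(corpus_b)
    events = [w for w, _ in (freq_a + freq_b).most_common(top_k)]
    total_a, total_b = len(corpus_a), len(corpus_b)
    return max(abs(freq_a[w] / total_a - freq_b[w] / total_b) for w in events)

# Toy example: identical corpora have distance 0; the shared word "the"
# contributes nothing, so the distance below is driven by unshared words.
chem = "the acid reacts with the base".split()
elec = "the current flows through the circuit".split()
print(a_distance(chem, chem))                 # 0.0
print(round(a_distance(chem, elec), 3))       # 0.167
```

Lower values indicate more similar vocabulary distributions, matching the reading of table 9 in the text.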
A comparison of the results for pooling and mixture with the respective results for individual models (tables 7 and 8) shows that replacing data from the same task by data from related tasks decreases translation performance in almost all cases. The exception is the title model, which benefits from pooling and mixing with both abstracts and claims due to their richer data structure.

4.3 Multi-task minimum error rate training

In contrast to task pooling and task mixtures, the specific setting addressed by multi-task minimum error rate training is one in which the generative SMT pipeline is not adaptable. Such situations arise if there are not enough data to train translation models or language models on the new tasks. However, we assume that there are enough parallel data available to perform meta-parameter tuning by minimum error rate training (MERT) (Och, 2003; Bertoldi et al., 2009) for each task.

A generic algorithm for multi-task learning can be motivated as follows: multi-task learning aims to take advantage of commonalities shared among tasks by learning several independent but related tasks together. Information is shared between tasks through a joint representation and introduces an inductive bias. Evgeniou and Pontil (2004) propose a regularization method that balances task-specific parameter vectors and their distance to the average. The learning objective is to minimize task-specific loss functions l_d across all tasks d with weight vectors w_d, while keeping each parameter vector close to the average (1/D) Σ_{d=1}^{D} w_d = w_avg. This is enforced by minimizing the norm (here the ℓ1-norm) of the difference of each task-specific weight vector to the average weight vector:

    min_{w_1,...,w_D}  Σ_{d=1}^{D} l_d(w_d) + λ Σ_{d=1}^{D} ||w_d − w_avg||_1        (1)

The MMERT algorithm is given in figure 1. The algorithm starts with initial weights w^(0). At each iteration step, the average of the parameter vectors from the previous iteration is computed. For each task d ∈ D, one iteration of standard MERT is called, continuing from weight vector w_d^(t−1) and minimizing translation loss function l_d on the data from task d. The individually tuned weight vectors returned by MERT are then moved towards the previously calculated average by adding or subtracting a penalty term λ for each weight component w_d^(t)[k]. If a weight moves beyond the average, it is clipped to the average value. The process is iterated until a stopping criterion is met, e.g. a threshold on the maximum change in the average weight vector. The parameter λ controls the influence of the regularization: a larger λ pulls the weights closer to the average, a smaller λ leaves more freedom to the individual tasks.

MMERT(w^(0), D, {l_d}_{d=1}^{D}):
  for t = 1, ..., T do
    w_avg^(t) = (1/D) Σ_{d=1}^{D} w_d^(t−1)
    for d = 1, ..., D parallel do
      w_d^(t) = MERT(w_d^(t−1), l_d)
      for k = 1, ..., K do
        if w_d^(t)[k] − w_avg^(t)[k] > 0 then
          w_d^(t)[k] = max(w_avg^(t)[k], w_d^(t)[k] − λ)
        else if w_d^(t)[k] − w_avg^(t)[k] < 0 then
          w_d^(t)[k] = min(w_avg^(t)[k], w_d^(t)[k] + λ)
        end if
      end for
    end for
  end for
  return w_1^(T), ..., w_D^(T), w_avg^(T)

Figure 1: Multi-task MERT.

The weight updates and the clipping strategy can be motivated in a framework of gradient descent optimization under ℓ1-regularization (Tsuruoka et al., 2009). Assuming MERT as algorithmic minimizer [11] of the loss function l_d in equation 1, the weight update towards the average follows from the subgradient of the ℓ1 regularizer. Since w_avg^(t) is taken as average over weights w_d^(t−1) from the step before, the term w_avg^(t) is constant with respect to w_d^(t), leading to the following subgradient (where sgn(x) = 1 if x > 0, sgn(x) = −1 if x < 0, and sgn(x) = 0 if x = 0):

    ∂/∂w_r^(t)[k]  λ Σ_{d=1}^{D} || w_d^(t) − (1/D) Σ_{s=1}^{D} w_s^(t−1) ||_1
        = λ sgn( w_r^(t)[k] − (1/D) Σ_{s=1}^{D} w_s^(t−1)[k] ).

Gradient descent minimization tells us to move in the opposite direction of the subgradient, thus motivating the addition or subtraction of the regularization penalty. Clipping is motivated by the desire to avoid oscillating parameter weights and in order to enforce parameter sharing.

[11] MERT as presented in Och (2003) is not a gradient-based optimization technique, thus MMERT is strictly speaking only "inspired" by gradient descent optimization.

Experimental results for multi-task MERT (MMERT) are reported for both dimensions of patent tasks. For the IPC sections we trained a pooled model on 1,000,000 sentences sampled from abstracts and claims from all sections. We did not balance the sections but kept their original distribution, reflecting a real-life task where the distribution of sections is unknown. We then extend this experiment to the structural dimension. Since we do not have an intuitive notion of a natural distribution for the text sections, we train a balanced pooled model on a corpus composed of 170,000 sentences each from abstracts, claims and titles, i.e. 510,000 sentences in total. For both dimensions, for each task, we sampled 2,000 parallel sentences for development, development-testing, and testing from patents that were published in different years than the training data.

We compare the multi-task experiments with two baselines. The first baseline is individual task learning, corresponding to standard separate MERT tuning on each section (individual). This results in separately learned weight vectors, one for each task, where no information has been shared between the tasks. The second baseline simulates the setting where the sections are not differentiated at all: we tune the model on a pooled development set of 2,000 sentences that combines the same amount of data from all sections (pooled). This yields a single joint weight vector for all tasks, optimized to perform well across all sections. Furthermore, we compare multi-task MERT tuning with two parameter averaging methods. The first method computes the arithmetic mean of the weight vectors returned by the individual baseline for each weight component, yielding a joint average vector for all tasks (average). The second method takes the last average vector computed during multi-task MERT tuning (MMERT-average). [12]

[12] The aspect of averaging found in all of our multi-task learning techniques effectively controls for optimizer instability as mentioned in Clark et al. (2011).

Tables 12 and 13 give the results for multi-task learning on text and IPC sections. The latter results have been presented earlier in Simianer et al. (2011); the former table extends the technique of multi-task MERT to the structural dimension of patent SMT tasks. In all experiments, the parameter λ was adjusted to 0.001 after evaluating different settings on a development set. The best result on each section is indicated in bold face; * indicates significance with respect to the individual baseline, + the same for the pooled baseline.

tuning test  individual  pooled   average    MMERT      MMERT-average
abstract     0.3721      0.362    0.3657*+   0.3719+    0.3685*+
claim        0.4711      0.4681   0.4749*+   0.475*+    0.4734*+
title        0.3228      0.3152   0.3326*+   0.3268*+   0.3325*+

Table 12: Multi-task tuning on text sections.

tuning test  individual  pooled   average    MMERT      MMERT-average
A            0.5187      0.5199   0.5213*+   0.5195     0.5196
B            0.4877      0.4885   0.4908*+   0.4911*+   0.4921*+
C            0.5214      0.5175   0.5199*+   0.5218+    0.5162*+
D            0.4724      0.4730   0.4733     0.4736     0.4734
E            0.4666      0.4661   0.4679*+   0.4669+    0.4685*+
F            0.4794      0.4801   0.4811*    0.4821*+   0.4830*+
G            0.4596      0.4576   0.4607+    0.4606+    0.4610*+
H            0.4573      0.4560   0.4578     0.4581+    0.4581+

Table 13: Multi-task tuning on IPC sections.

We observe statistically significant improvements of 0.5 to 1% BLEU over the individual baseline for claims and titles; for abstracts, the multi-task variant yields the same result as the baseline, while the averaging methods perform worse. Multi-task MERT yields the best result for claims; on titles, the simple average and the last MMERT average dominate. Pooled tuning always performs significantly worse than any other method, confirming that it is beneficial to differentiate between the text sections.

Similarly for IPC sections, small but statistically significant improvements over the individual and pooled baselines are achieved by multi-task tuning and averaging, excepting C and D. However, an advantage of multi-task tuning over averaging is hard to establish.

Note that the averaging techniques implicitly benefit from a larger tuning set. In order to ascertain that the improvements by averaging are not simply due to increasing the size of the tuning set, we ran a control experiment where we tuned the model on a pooled development set of 3 × 2,000 sentences for text sections and on a development set of 8 × 2,000 sentences for IPC sections. The results given in table 14 show that tuning on a pooled set of 6,000 sentences from text sections yields only minimal differences to tuning on 2,000 sentence pairs, such that the BLEU scores for the new pooled models are still significantly lower than the best results in table 12 (indicated by "<"). However, increasing the tuning set to 16,000 sentence pairs for IPC sections makes the pooled baseline perform as well as the best results in table 13, except for two cases (indicated by "<") (see table 15). This is due to the smaller differences between best and worst results for tuning on IPC sections compared to tuning on text sections, indicating that IPC sections are less well suited for multi-task tuning than the textual domains.

test      pooled-6k  significance
abstract  0.3628     <
claim     0.4696     <
title     0.3174     <

Table 14: Multi-task tuning on 6,000 sentences pooled from text sections. "<" denotes a statistically significant difference to the best result.

test  pooled-16k  significance
A     0.5177      <
B     0.4920
C     0.5133      <
D     0.4737
E     0.4685
F     0.4832
G     0.4608
H     0.4579

Table 15: Multi-task tuning on 16,000 sentences pooled from IPC sections. "<" denotes a statistically significant difference to the best result.

5 Conclusion

The most straightforward approach to improving machine translation performance on patents is to enlarge the training set to include all available data. This question has been investigated by Tinsley et al. (2010) and Utiyama and Isahara (2007). A caveat in this situation is that data need to be from the general patent domain, as shown by the inferior performance of a large Europarl-trained model compared to a small patent-trained model.

The goal of this paper is to analyze patent data along the topical dimension of IPC classes and along the structural dimension of textual sections. Instead of trying to beat a pooling baseline that simply increases the data size, our research goal is to investigate whether different subtasks along these dimensions share commonalities that can fruitfully be exploited by multi-task learning in machine translation. We thus aim to investigate the benefits of multi-task learning in realistic situations where a simple enlargement of training data is not possible.

Starting from baseline models that are trained on individual tasks or on data pooled from all tasks, we apply mixtures of translation models and multi-task MERT tuning to multiple patent translation tasks. We find small, but statistically significant improvements for multi-task MERT tuning and parameter averaging techniques. Improvements are more pronounced for multi-task learning on textual domains than on IPC domains. This might indicate that the IPC sections are less well delimited than the structural domains. Furthermore, this is owing to the limited expressiveness of a standard linear model including 14-20 features in tuning. The available features are very coarse and more likely to capture structural differences, such as sentence length, than the lexical differences that differentiate the semantic domains. We expect to see larger gains due to multi-task learning for discriminatively trained SMT models that involve very large numbers of features, especially when multi-task learning is done in a framework that combines parameter regularization with feature selection (Obozinski et al., 2010). In future work, we will explore a combination of large-scale discriminative training (Liang et al., 2006) with multi-task learning for SMT.

Acknowledgments

This work was supported in part by DFG grant "Cross-language Learning-to-Rank for Patent Retrieval".

References

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Athens, Greece.

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. 2009. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91:7-16.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING'10), Beijing, China.

Alexandru Ceauşu, John Tinsley, Jian Zhang, and Andy Way. 2011. Experiments on domain adaptation for patent machine translation in the PLuTO project. In Proceedings of the 15th Conference of the European Association for Machine Translation (EAMT 2011), Leuven, Belgium.

Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. 2011. Boosted multi-task learning. Machine Learning.

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, OR.

Hal Daumé. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic.

Mark Dredze, Alex Kulesza, and Koby Crammer. 2010. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79:123-149.

Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Analysis of translation model adaptation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'10), Paris, France.

Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia.

Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical bayesian domain adaptation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT'09), Boulder, CO.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

George Foster, Pierre Isabelle, and Roland Kuhn. 2010. Translating structured documents. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation (WMT'11), Edinburgh, UK.

Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Ontario, Canada.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, Phuket, Thailand.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the joint conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL'06), Sydney, Australia.

Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. 2010. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20:231-252.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Human Language Technology Conference and the 3rd Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03), Edmonton, Canada.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Yorktown Heights, NY.

Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. 2009. An efficient projection for ℓ1,∞ regularization. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), Montreal, Canada.

Stefan Riezler and John Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL-05 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, MI.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'08), Hawaii.

Patrick Simianer, Katharina Wäschle, and Stefan Riezler. 2011. Multi-task minimum error rate training for SMT. The Prague Bulletin of Mathematical Linguistics, 96:99-108.

Matthew Snover, Bonnie Dorr, and Richard Schwartz. 2008. Language and translation model adaptation using comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'08), Honolulu, Hawaii.

John Tinsley, Andy Way, and Paraic Sheridan. 2010. PLuTO: MT for online patent translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO.

Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for ℓ1-regularized log-linear models with cumulative penalty. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'09), Singapore.

Nicola Ueffing, Gholamreza Haffari, and Anoop Sarkar. 2007. Transductive learning for statistical machine translation.
In Proceedings of the 45th An- nual Meeting of the Association of Computational Linguistics (ACL’07), Prague, Czech Republic. Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-English patent parallel corpus. In Pro- ceedings of MT Summit XI, Copenhagen, Denmark. Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Pro- ceedings of the 20th International Conference on Computational Linguistics (COLING’04), Geneva, Switzerland. 828 Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Construction Grammar Remi van Trijp Sony Computer Science Laboratory Paris 6 Rue Amyot 75005 Paris (France)

[email protected]

Abstract

German case syncretism is often assumed to be the accidental by-product of historical development. This paper contradicts this claim and argues that the evolution of German case is driven by the need to optimize the cognitive effort and memory required for processing and interpretation. This hypothesis is supported by a novel kind of computational experiment that reconstructs and compares attested variations of the German definite article paradigm. The experiments show how the intricate interaction between those variations and the rest of the German 'linguistic landscape' may direct language change.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 829-839, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics

1 Introduction

In his 1880 essay The Awful German Language, Mark Twain famously complained that German is the most "slipshod and systemless, and so slippery and elusive to grasp" language of all. A brief look at the literature on the German case system seems to provide sufficient evidence for instantly agreeing with the American author. But what if the German case system were not the accidental by-product of diachronic changes, as is often assumed? Are there linguistic forces that are not yet fully appreciated in the field, but which may explain the German case paradigm?

This paper demonstrates that there are indeed such forces through a case study on German definite articles. The experiments 'reconstruct' deep language processing models for different variants of this paradigm, and show how the 'linguistic landscape' of German has allowed its speakers to reduce their definite article system without loss in efficiency for processing and interpretation.

2 The Problem of German Case

German articles, adjectives and nouns are marked for gender, number and case through morphological inflection, as illustrated for definite articles in Table 1.

Case   SG-M   SG-F   SG-N   PL
NOM    der    die    das    die
ACC    den    die    das    die
DAT    dem    der    dem    den
GEN    des    der    des    der

Table 1: German definite articles.

The system is notorious for its syncretism (i.e. the same form can be mapped onto different functions), a riddle that has fascinated many formal and historical linguists looking for explanations.

2.1 Historical Linguistics

Studies in historical linguistics and grammaticalization often propose the following three forces to explain syncretism (Heine and Kuteva, 2005, p. 148):

1. The formal distinction between case markers is lost through phonological changes.

2. One case takes over the functional domain of another case and replaces it.

3. A case marker disappears and its functions are usurped by another marker.

Syncretism is thus considered to be the accidental by-product of such forces, and German case syncretism is typically analyzed along these lines (Barðdal, 2009; Baerman, 2009, p. 229). However, these forces are not explanatory: they only describe what has happened, but not why.

Another problem for the 'syncretism by accident' hypothesis is the fact that the collapsing of case forms is not randomly distributed over the whole paradigm, as would be expected. Hawkins (2004, p. 78) observes that instead there is a systematic tendency for 'lower' cells in the paradigm (e.g. genitive; Table 1) to collapse before cells in 'higher' positions (e.g. nominative) do so.

2.2 Formal Linguistics

Many hidden effects of verbal linguistic theories can be uncovered through explicit formalizations. Unfortunately, formal linguists also typically distinguish between 'systematic' and 'non-systematic' syncretism when analyzing German case. For instance, in his review of a number of studies on German (a.o. Bierwisch, 1967; Blevins, 1995; Wiese, 1996; Wunderlich, 1997), Müller (2002) concludes that none of these approaches is able to rule out accidental syncretism.

There is however one major stone that has been left unturned by formal linguists: processing. Most formal theories, such as HPSG (Ginzburg and Sag, 2000), assume a strict division between 'competence' and 'performance' and therefore represent linguistic knowledge in a purely declarative, process-independent way (Sag and Wasow, 2011). While such an approach may be desirable from a 'mathematical' point of view, it puts the burden of efficient processing on the shoulders of computational linguists, who have to develop more intelligent interpreters.

One example of the gap between description and computational implementation is disjunctive feature representation, which became popular in feature-based grammar formalisms in the 1980s (Karttunen, 1984). Disjunctions allow an elegant notation for multiple feature values, as illustrated in example (1) for the German definite article die, which is either assigned nominative or accusative case, and which is either feminine-singular or plural. The feature structure (adopted from Karttunen, 1984, p. 30) represents disjunctions by enclosing the alternatives in curly brackets ({ }).

(1) die:
    AGREEMENT  { [ GENDER f, NUM sg ], [ NUM pl ] }
    CASE       { nom, acc }

However, it is a well-established fact that disjunctions are computationally expensive, which is illustrated in the top of Figure 1. This figure shows the search tree of a small grammar when parsing the utterance Die Kinder gaben der Lehrerin die Zeichnung ('the children gave the drawing to the (female) teacher'), which is unambiguous to German speakers. As can be seen in the figure, the search tree has to explore several branches before arriving at a valid solution. Most of the splits are caused by disjunctions. For example, when a determiner-noun construction specifies that the case features of the definite article die (nominative or accusative) and the noun Kinder ('children'; nominative, accusative or genitive) have to unify, the search tree splits into two hypotheses (a nominative and an accusative reading), even though for native speakers of German the syntactic context unambiguously points to a nominative reading (because it is the only noun phrase that agrees with the main verb).

It should be no surprise, then, that a lot of work has focused on processing disjunctions more efficiently (e.g. Carter, 1990; Ramsay, 1990). As observed by Flickinger (2000), however, most of these studies implicitly assume that the grammar representation has to remain unchanged. He then demonstrates through computational experiments how a different representation can directly impact efficiency, and argues that revisions of the grammar for efficiency should be discussed more thoroughly in the literature.

The impact of representation on processing is illustrated at the bottom of Figure 1, which shows the performance of a grammar that uses the same processing technique for handling the same utterance, but a different representation than the disjunctive grammar. As can be seen, the alternative grammar (whose technical details are disclosed further below) is able to parse the German definite articles without tears, and the resulting search tree arguably better reflects the actual processing performed by native speakers of German.

[Figure 1 - (a) Search with disjunctive feature representation; (b) Search with feature matrices. Search trees not reproduced.]

Figure 1: The representation of linguistic information has a direct impact on processing efficiency. The top figure shows the search tree when parsing the unambiguous utterance Die Kinder gaben der Lehrerin die Zeichnung ('The children gave the drawing to the (female) teacher') using disjunctive feature representation. The bottom figure shows the search tree using distinctive feature matrices. Labels in the boxes show the names of the applied constructions; boxes with a bold border are successful end nodes. Both grammars have been implemented in Fluid Construction Grammar (FCG; Steels, 2011, 2012a) and are processed using a standard depth-first search algorithm (Bleys et al., 2011) and general unification (without optimization for particular types or data structures; Steels and De Beule, 2006; De Beule, 2012). The utterance is assumed to be segmented into words. Interested readers can explore the figure through an interactive web demonstration at http://www.fcg-net.org/demos/design-patterns/07-feature-matrices/.

2.3 Alternative Hypothesis

The effect of processing-friendly representations on search suggests that answers to the unsolved problems concerning case syncretism have to be sought in performance. This paper therefore rejects the processing-independent approach and explores the alternative hypothesis, following Steels (2004, 2012b), that grammar evolves in order to optimize communicative success by dampening the search space in linguistic processing and reducing the cognitive effort needed for interpretation, while at the same time minimizing the resources required for doing so. More specifically, this paper explores the following claims:

1. The German definite article system can be processed as efficiently as its Old High German predecessor, which had less syncretism.

2. The presence of other grammatical structures has made it possible to reduce the definite article paradigm without increasing the cognitive effort needed for disambiguating the argument structures that underlie German utterances.

3. The decrease of the cue-reliability of case for disambiguation encourages the emergence of competing systems (such as word order).

The hypothesis is substantiated through computational experiments that reconstruct three different variants of the German definite article system (the current system; its Old High German predecessor, Wright, 1906; and the Texas German dialect system, Boas, 2009a,b) and compare their performance in terms of processing efficiency and cognitive effort in interpretation.

3 Operationalizing German Case

An adequate operationalization of German case requires a bidirectional grammar (for parsing and production) and easy access to linguistic processing data. All experiments reported in this paper have therefore been implemented in Fluid Construction Grammar (FCG; Steels, 2011, 2012a), a unification-based grammar formalism that comes equipped with an interactive web interface and monitoring tools (Loetzsch, 2012). A second advantage of FCG is that it features strong bidirectionality: the FCG-interpreter can achieve both parsing and production using the same linguistic inventory. Other feature structure platforms, such as the LKB system (Copestake, 2002), require a separate parser and generator for formalizing bidirectional grammars, which makes them less suited for substantiating the claims of this paper.

3.1 Distinctive Feature Matrix

German case has become the litmus test for demonstrating how well a feature-based grammar formalism copes with multifunctionality, especially since Ingria (1990) provocatively stated that unification is not the best technique for handling it. People have gone to great lengths to counter Ingria's claim, especially within the HPSG framework (e.g. Müller, 1999; Daniels, 2001; Sag, 2003), and various formalizations have been offered for German case (Heinz and Matiasek, 1994; Müller, 2001; Crysmann, 2005). However, these proposals either do not succeed in avoiding inefficient disjunctions or they require a complex double type hierarchy (Crysmann, 2005).

The experiments in this paper use a more straightforward solution, called a distinctive feature matrix, which is based on an idea that was first explored by Ingria (1990) and of which a variation has recently also been proposed for Lexical Functional Grammar (Dalrymple et al., 2009). Instead of treating case as a single-valued feature, it can be represented as an array of features, as shown for the definite article die (ignoring the genitive case for the time being):

(2) die:        CASE [ nom ?nom, acc ?acc, dat - ]

The case feature includes a paradigm of three cases (nom, acc and dat), whose values can either be '+' or '-', or be left unspecified through a variable (indicated by a question mark). The two variables ?nom and ?acc indicate that die can potentially be assigned nominative or accusative case; the value '-' for dative means that die cannot be assigned dative case. We can do the same for Kinder ('children'), which can be nominative or accusative, but not dative:

(3) Kinder:     CASE [ nom ?nom, acc ?acc, dat - ]

As demonstrated in Figure 1, disjunctive feature representation would cause a split in the search tree when unifying die and Kinder. Using a feature matrix, however, the choice between a nominative and an accusative reading can simply be postponed until enough information from the rest of the utterance is available. Unifying die and Kinder yields the following feature structure:

(4) die Kinder: CASE [ nom ?nom, acc ?acc, dat - ]

3.2 A Three-Dimensional Matrix

The German case paradigm is obviously more complex than the examples shown so far. Let's consider Table 1 again, but this time we replace every cell in the table by a variable. This leads to the following feature matrix for the German definite articles:

Case   SG-M    SG-F    SG-N    PL
?NOM   ?n-s-m  ?n-s-f  ?n-s-n  ?n-pl
?ACC   ?a-s-m  ?a-s-f  ?a-s-n  ?a-pl
?DAT   ?d-s-m  ?d-s-f  ?d-s-n  ?d-pl
?GEN   ?g-s-m  ?g-s-f  ?g-s-n  ?g-pl

Table 2: A distinctive feature matrix for German case.

Each cell in this matrix represents a specific feature bundle that collects the features case, number and gender. For example, the variable ?n-s-m stands for nominative singular masculine. Note that the cases themselves also have their own variable (?nom, ?acc, ?dat and ?gen). This allows us to single out a specific dimension of the matrix for constructions that only care about case distinctions, but abstract away from gender or number. Each linguistic item fills in as much information as possible in this case matrix. For example, Table 3 shows how the definite article die underspecifies its potential values and rules out all other options through '-'.

Case   SG-M   SG-F    SG-N   PL
?NOM   -      ?n-s-f  -      ?n-pl
?ACC   -      ?a-s-f  -      ?a-pl
-      -      -       -      -
-      -      -       -      -

Table 3: The feature matrix of die.
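The behaviour of these matrices under unification (examples 2-4) can be sketched in a few lines of Python. This is a toy illustration of the idea only, not the FCG implementation; the cell encoding, the variable naming scheme and the `unify` helper are invented for the sketch.

```python
# Toy sketch of distinctive feature matrices: a matrix maps each cell of
# the case paradigm to '-' (ruled out), '+' (asserted), or a variable
# name. Reusing a variable name inside one matrix expresses the
# 'variable equalities' that unification can exploit.

CELLS = [(c, g) for c in ("nom", "acc", "dat", "gen")
                for g in ("s-m", "s-f", "s-n", "pl")]

def unify(m1, m2):
    """Cell-wise unification of two feature matrices.

    Returns the unified matrix, or None on failure. A variable unifies
    with anything and is bound to it; '+' against '-' fails."""
    bindings = {}
    def resolve(v):
        while v in bindings:
            v = bindings[v]
        return v
    result = {}
    for cell in CELLS:
        a, b = resolve(m1.get(cell, "-")), resolve(m2.get(cell, "-"))
        if a == b:
            result[cell] = a
        elif a.startswith("?"):      # a is a variable: bind it to b
            bindings[a] = b
            result[cell] = b
        elif b.startswith("?"):      # b is a variable: bind it to a
            bindings[b] = a
            result[cell] = a
        else:
            return None              # '+' vs '-' clash: failure
    # apply bindings transitively so bound variables propagate
    return {cell: resolve(v) for cell, v in result.items()}

# die: feminine-singular or plural, nominative or accusative (cf. Table 3)
die = {("nom", "s-f"): "?n-s-f", ("nom", "pl"): "?n-pl",
       ("acc", "s-f"): "?a-s-f", ("acc", "pl"): "?a-pl"}

# Kinder: plural only, nominative, accusative or genitive
kinder = {("nom", "pl"): "?n-pl2", ("acc", "pl"): "?a-pl2",
          ("gen", "pl"): "?g-pl2"}

die_kinder = unify(die, kinder)
```

Unifying the two matrices rules out the singular readings of die and the genitive reading of Kinder, while the nominative-plural and accusative-plural cells stay variable, so no search branch is created until a later construction commits to one of them.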
The feature matrix of Kinder ('children'), which underspecifies for nominative, accusative and genitive, is shown in Table 4. Notice, however, that the same variable names are used both for the column that singles out the case dimension and for the column of the plural feature bundles.

Case    SG-M   SG-F   SG-N   PL
?n-pl   -      -      -      ?n-pl
?a-pl   -      -      -      ?a-pl
-       -      -      -      -
?g-pl   -      -      -      ?g-pl

Table 4: The feature matrix of Kinder ('children').

Unification of die and Kinder can exploit these variable 'equalities' for ruling out a singular value of the definite article. Likewise, the matrix of die rules out the genitive reading of Kinder, as illustrated in Table 5.

Case    SG-M   SG-F   SG-N   PL
?n-pl   -      -      -      ?n-pl
?a-pl   -      -      -      ?a-pl
-       -      -      -      -
-       -      -      -      -

Table 5: The feature matrix of die Kinder.

Argument structure constructions (Goldberg, 2006), such as the ditransitive, can then later assign either nominative or accusative case. The main advantage of feature matrices is that linguistic search only has to commit to specific feature values once sufficient information is available, so the search tree only splits when there is an actual ambiguity. Moreover, they can be handled using standard unification. Interested readers can consult van Trijp (2011) for a thorough description of the approach, as well as a discussion of how the FCG implementation differs from Ingria (1990) and Dalrymple et al. (2009).

4 Experiments

This section describes the experimental set-up and discusses the experimental results.

4.1 Three Paradigms

The experiments compare three different variants of the German definite article paradigm.

Standard German. The Standard German paradigm has been illustrated in Table 1 and its operationalization has been shown in section 3.2. The paradigm has been inherited without significant changes from Middle High German (1050-1350; Walshe, 1974) and features six different forms.

Old High German. The Old High German paradigm is the direct predecessor of the current paradigm of definite articles. It contained at least twelve distinct forms (depending on which variation is taken) that included gender distinctions in the plural (Wright, 1906, p. 67). It also included one definite article that marked the now extinct instrumental case, which is ignored in this paper. The variant of the Old High German paradigm that has been implemented in the experiments is summarized in Table 6.

Case   Singular              Plural
       M      F      N       M     F     N
NOM    dër    diu    daz̧     die   deo   diu
ACC    dën    die    daz̧     die   deo   diu
DAT    dëmu   dëru   dëmu    dēm   dēm   dēm
GEN    dës    dëra   dës     dëro  dëro  dëro

Table 6: The Old High German definite article system.

Texas German. The third variant is an American-German dialect called Texas German (Boas, 2009a,b), which evolved a two-way case distinction between nominative and oblique. This type of case system, in which the accusative and dative case have collapsed, is also a common evolution in the Low German dialects (Shrier, 1965). The implemented paradigm of Texas German is shown in Table 7.

Case      SG-M   SG-F   SG-N   PL
NOM       der    die    das    die
ACC/DAT   den    die    den    die

Table 7: The Texas German definite article system.

4.2 Production and Parsing Tasks

Each grammar is tested as to how efficiently it can produce and parse utterances in terms of cognitive effort and search (see section 4.3). There are three basic types of utterances:

1. Ditransitive: NOM - Verb - DAT - ACC

2. Transitive (a): NOM - Verb - ACC

3. Transitive (b): NOM - Verb - DAT

The argument roles are filled by noun phrases whose head nouns always have a distinct form for singular and plural (e.g. Mann vs. Männer; 'man' vs. 'men'), but that are unmarked for case. The combination of arguments is always unique along the dimensions of number and gender, which yields 216 unique utterance types for the ditransitive, as follows:

(5) NOM.S.M V DAT.S.M  ACC.S.M
    NOM.S.M V DAT.S.F  ACC.S.M
    NOM.S.M V DAT.S.N  ACC.S.M
    NOM.S.M V DAT.PL.M ACC.S.M
    etc.

In transitive utterances, there is an additional distinction based on animacy for noun phrases in the Object position of the utterance, which yields 72 types in the NOM-ACC configuration and 72 in the NOM-DAT configuration. Together, there are 360 unique utterance types. As can be gleaned from the utterance types, the genitive case is not considered in the experiments, as the genitive is not part of basic German argument structures and has almost disappeared in most dialects of German (Shrier, 1965).

In production, the grammar is presented with a meaning that needs to be verbalized into an utterance. In parsing, the produced utterance has to be analyzed back into a meaning. Every utterance is processed using a full search, that is, all branches and solutions are calculated.

4.3 Measuring Cognitive Effort

The experiments measure two kinds of cognitive effort: syntactic search and semantic ambiguity.

Search. The search measure counts the number of branches in the search process that reach an end node, which can either be a possible solution or a dead end (i.e. no constructions can be applied anymore). Duplicate nodes (for instance, nodes that use the same rules but in a different order) are not counted. The search measure is then used as a 'sanity check' to verify whether the three different paradigms can be processed with the same efficiency in terms of search tree length, as hypothesized by this paper. More specifically, the following conditions have to be met:

1. In production, there should only be one branch.

2. In parsing, search has to be equal to the semantic effort.

The single-branch constraint in production checks whether the definite articles are sufficiently distinct from one another. Since there is no ambiguity about which argument plays which role in the utterance, the grammar should only come up with one solution. In parsing, the number of branches has to correspond to 'real' semantic ambiguities and not create additional search, as argued in section 2.2.

Semantic Ambiguity. Semantic ambiguity equals the number of possible interpretations of an utterance. For instance, the utterance Der Hund beißt den Mann 'the dog bites the man' is unambiguous in Modern High German, since der Hund can only be nominative singular-masculine, and den Mann can only be accusative singular-masculine. There is thus only one possible interpretation, in which the dog is the biter and the man is being bitten, illustrated as follows using a logic-based meaning representation (also see Steels, 2004, for this operationalization of cognitive effort):

(6) Interpretation 1: Der Hund beißt den Mann.
    dog(?a), bite(?ev), man(?b),
    biter(?ev, ?x), ?a=?x,
    bitten(?ev, ?y), ?b=?y

However, an utterance such as die Katze beißt die Frau 'the cat bites the woman' is ambiguous because die has both a nominative and an accusative singular-feminine reading:

(7) a. Interpretation 1: Die Katze beißt die Frau.
       cat(?a), bite(?ev), woman(?b),
       biter(?ev, ?x), ?a=?x,
       bitten(?ev, ?y), ?b=?y

    b. Interpretation 2: Die Katze beißt die Frau.
       cat(?a), bite(?ev), woman(?b),
       biter(?ev, ?x), ?b=?x,
       bitten(?ev, ?y), ?a=?y

Here, German speakers are likely to use word order, intonation and world knowledge (i.e. cats are more likely to bite a person than the other way round) for disambiguating the utterance.

4.4 Experimental Parameters

The experiments (E1-E4) concern the cue-reliability of the definite articles for disambiguating event structure. In all experiments, the different grammars can exploit the case-number-gender information of definite articles, the gender and number specifications of nouns, and the syntactic valence of verbs. For instance, the noun form Frauen 'women' is specified as plural-feminine, and verbs like helfen 'to help' are specified to take a dative object, whereas verbs like finden 'to find' take an accusative object. Across the experiments, different combinations of additional grammatical cues become available or not:

Cue                      E1   E2   E3   E4
SV-agreement                  +         +
Selection restrictions             +    +

SV-agreement restricts the subject to singular or plural nouns, and semantic selection restrictions can disambiguate utterances in which, for example, the Agent-role has to be animate (e.g. in perception verbs such as sehen 'to see'). All other possible cues, such as word order, are ignored.

5 Results

5.1 Search

In all experiments, the constraints of the search measure were satisfied: every grammar only required one branch per utterance in production, and the number of branches in parsing never exceeded the number of possible interpretations. In terms of search length, more syncretism therefore does not automatically harm efficiency, provided that the grammar uses an adequate representation. Arguably, the smaller paradigms are even more efficient because they require fewer unifications to be performed.

5.2 Semantic Ambiguity

Now that it has been ascertained that more syncretism does not harm processing efficiency, we can compare the cue-reliability of the different paradigms for semantic interpretation.

Ambiguous Utterances. Figure 2 shows the number of ambiguous utterances in parsing (in %) per paradigm and per set-up. As can be seen, the Old High German paradigm (black) is the most reliable cue in Experiment 1 (E1; when SV-agreement and selection restrictions are ignored), with 35.56% of ambiguous utterances, as opposed to 55.56% for Modern High German (grey) and 77.78% for Texas German (white).

When SV-agreement is taken into account (E2), the difference between Old and Modern High German becomes smaller, with both paradigms offering a reliability of more than 70%, while Texas German still faces more than 70% of ambiguous utterances.

Ambiguity is reduced even more when using the semantic selection restrictions of the verb (set-up E3). Here, the difference between Old and Modern High German becomes trivial, with 4.44% and 6.94% of ambiguous utterances respectively. The difference with Texas German remains apparent, even though its ambiguity is cut by half.

In set-up E4 (case, SV-agreement and selection restrictions), the Old and Modern High German paradigms resolve almost all ambiguities, leaving little difference between them. Using the Texas German dialect, one utterance out of five remains ambiguous and requires additional grammatical cues or inferencing for semantic interpretation.

Paradigm              E1      E2      E3      E4
Old High German       35.56   22.22   4.44    2.78
Modern High German    55.56   28.89   6.94    3.61
Texas German          77.78   71.11   35.56   22.22

Figure 2: This chart shows the number of ambiguous utterances per paradigm per E(xperimental set-up) in %. (Bar chart values reproduced above as a table.)

Number of possible interpretations. Semantic ambiguity can also be measured by counting the number of possible interpretations per utterance. A non-ambiguous language would thus have 1 possible interpretation per utterance. The average number of interpretations per utterance (per paradigm and per set-up) is shown in Table 8.

Paradigm              E1     E2     E3     E4
Old High German       1.56   1.22   1.04   1.03
Modern High German    1.56   1.28   1.07   1.04
Texas German          2.84   2.39   1.36   1.22

Table 8: Average number of interpretations per utterance type.

The Old High German paradigm has the least semantic ambiguity throughout, except in Experiment 1 (E1). Here, Modern High German has the same average effort despite having more ambiguous utterances. This means that the Old High German paradigm provides better coverage in terms of construction types, but when ambiguity occurs, more possible interpretations exist.

6 Discussion

The experiments compare how well three different paradigms of definite articles perform if they are inserted into the grammar of Modern High German. The results show that, in isolation, Old High German offers the best cue-reliability for retrieving who's doing what to whom in events. However, when other grammatical cues are taken into account, it turns out that Modern High German achieves similar results with respect to syntactic search and semantic ambiguity, with a reduced paradigm (using only six instead of twelve forms).

As for the Texas German dialect, which has collapsed the accusative-dative distinction, the amount of ambiguity remains more than 20% using all available cues. One verifiable prediction of the experiments is therefore that this dialect should show an increase in alternative syntactic restrictions (such as word order) in order to make up for the lost case distinctions. Interestingly, such alternatives have been attested in Low German dialects that have evolved a similar two-way case system (Shrier, 1965). Modern High German, on the other hand, has already recruited word order for other purposes (such as information structure; Lenerz, 1977; Micelli, 2012), which may explain why the current paradigm has been able to survive since the Middle Ages.

Instead of an accidental by-product of phonological and morphological changes, then, a new picture emerges for explaining syncretism in Modern High German definite articles: German speakers have been able to reduce their case paradigm without loss in processing and interpretation efficiency. With cognitive effort as a selection criterion, subsequent generations of speakers found no linguistic pressures for maintaining particular distinctions such as gender in plural articles. Especially forms whose acoustic distinctions are harder to perceive are candidates for collapse if they are no longer functional for processing or interpretation. Other factors, such as frequency, may accelerate this evolution, as also argued by Barðdal (2009). For instance, there may be fewer benefits to upholding a case distinction for infrequent than for frequent forms.

If case syncretism is not randomly distributed over a grammatical paradigm, but rather functionally motivated, a new explanatory model is needed. One candidate is evolutionary linguistics (Steels, 2012b), a framework of cultural evolution in which populations of language users constantly shape and reshape their language in response to their communicative needs. The experiments reported here suggest that this dynamic shaping process is guided by the 'linguistic landscape' of a language. For instance, the presence of grammatical cues such as gender, number and SV-agreement may encourage paradigm reduction. However, reduction may be the start of a self-enforcing loop in which the decreasing cue-reliability of a paradigm may pressure language users into enforcing the alternatives to take on even more of the cognitive load of processing.

The intricate interactions between grammatical systems also require more sophisticated measures. experiments have demonstrated that Modern High
A promising extension of this paper could German achieves a similar performance as its Old lie in an information-theoretic approach to lan- High German predecessor using only half of the guage (Hale, 2003; Jaeger and Tily, 2011), which forms in its definite article paradigm. has recently explored a set of tools for assessing Instead of a series of historical accidents, the linguistic complexity, processing effort and un- German case system thus underwent a systematic certainty. Unfortunately, only little work has been and “performance-driven [...] morphological re- done on morphological paradigms so far (see e.g. structuring” (Hawkins, 2004, p. 79), in which lin- Ackerman et al., 2011), and the approach is typi- guistic pressures such as cognitive effort decided cally applied in stochastic or Probabilistic Context on the maintenance or loss of certain distinctions. Free Grammars, hence it remains unclear how the The case study makes clear that formal and com- assumptions of this field fit into models of deep putational models of deep language understand- language processing. ing have to reconsider their strict division between competence and performance if the goal is to ex- 7 Conclusions plain individual language development. This pa- per proposed that new tools and methodologies More than 130 years after Mark Twain’s com- should be sought in evolutionary linguistics. plaints, it seems that the German language is not that awful after all. Through a series of compu- Acknowledgements tational experiments, this paper has proposed a different explanation for German case syncretism This research has been conducted at the Sony that answers some of the unsolved riddles of pre- Computer Science Laboratory Paris. I would like vious studies. 
First, the experiments have shown to thank Luc Steels, director of Sony CSL Paris that an increase in syncretism does not necessar- and the VUB AI-Lab of the University of Brus- ily lead to an increase in the cognitive effort re- sels, for his support and feedback. I also thank quired for syntactic search, provided that the rep- Hans Boas, Jóhanna Barðdal, Peter Hanappe, resentation of the grammar is processing-friendly. Manfred Hild and the anonymous reviewers for Secondly, by comparing cue-reliability of differ- helping to improve this article. All errors remain ent paradigms for semantic disambiguation, the of course my own. 837 References minacy, and likeness of case. In Stefan Müller, editor, Proceedings of the 12th International Farrell Ackerman, James P. Blevins, and Robert Conference on Head-Driven Phrase Structure Malouf. Parts and wholes: Implicative patterns Grammar, pages 91–107, Stanford, 2005. CSLI in inflectional paradigms. In J.P. Blevins and Publications. J. Blevins, editors, Analogy in Grammar: Form and Acquisition, pages 54–81. Oxford Univer- Mary Dalrymple, Tracy Holloway King, and sity Press, Oxford, 2011. Louisa Sadler. Indeterminacy by underspecifi- cation. Journal of Linguistics, 45:31–68, 2009. Matthew Baerman. Case syncretism. In An- drej Malchukov and Andrew Spencer, editors, Michael Daniels. On a type-based analysis of fea- The Oxford Handbook of Case, chapter 14, ture neutrality and the coordination of unlikes. pages 219–230. Oxford University Press, Ox- In Proceedings of the 8th International Confer- ford, 2009. ence on HPSG, pages 137–147, Stanford, 2001. CSLI. J. Barðdal. The development of case in germanic. In J. Barðdal and S. Chelliah, editors, The Role Joachim De Beule. A formal deconstruction of of Semantics and Pragmatics in the Develop- Fluid Construction Grammar. In Luc Steels, ed- ment of Case, pages 123–159. John Benjamins, itor, Computational Issues in Fluid Construc- Amsterdam, 2009. tion Grammar. 
Springer Verlag, Berlin, 2012.

Manfred Bierwisch. Syntactic features in morphology: General problems of so-called pronominal inflection in German. In To Honour Roman Jakobson, pages 239–270. Mouton De Gruyter, Berlin, 1967.

James Blevins. Syncretism and paradigmatic opposition. Linguistics and Philosophy, 18:113–152, 1995.

Joris Bleys, Kevin Stadler, and Joachim De Beule. Search in linguistic processing. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Hans C. Boas. Case loss in Texas German: The influence of semantic and pragmatic factors. In J. Barðdal and S. Chelliah, editors, The Role of Semantics and Pragmatics in the Development of Case, pages 347–373. John Benjamins, Amsterdam, 2009a.

Hans C. Boas. The Life and Death of Texas German, volume 93 of Publication of the American Dialect Society. Duke University Press, Durham, 2009b.

David Carter. Efficient disjunctive unification for bottom-up parsing. In Proceedings of the 13th Conference on Computational Linguistics, pages 70–75. ACL, 1990.

Ann Copestake. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, 2002.

Berthold Crysmann. Syncretism in German: A unified approach to underspecification, indeterminacy, and likeness of case. In Stefan Müller, editor, Proceedings of the 12th International Conference on Head-Driven Phrase Structure Grammar, pages 91–107, Stanford, 2005. CSLI Publications.

Daniel P. Flickinger. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28, 2000.

Jonathan Ginzburg and Ivan A. Sag. Interrogative Investigations: the Form, the Meaning, and Use of English Interrogatives. CSLI Publications, Stanford, 2000.

Adele E. Goldberg. Constructions At Work: The Nature of Generalization in Language. Oxford University Press, Oxford, 2006.

John T. Hale. The information conveyed by words in sentences. Journal of Psycholinguistic Research, 32(2):101–123, 2003.

John A. Hawkins. Efficiency and Complexity in Grammars. Oxford University Press, Oxford, 2004.

Bernd Heine and Tania Kuteva. Language Contact and Grammatical Change. Cambridge University Press, Cambridge, 2005.

Wolfgang Heinz and Johannes Matiasek. Argument structure and case assignment in German. In John Nerbonne, Klaus Netter, and Carl Pollard, editors, German in Head-Driven Phrase Structure Grammar, volume 46 of CSLI Lecture Notes, pages 199–236. CSLI Publications, Stanford, 1994.

R.J.P. Ingria. The limits of unification. In Proceedings of the 28th Annual Meeting of the ACL, pages 194–204, 1990.

T. Florian Jaeger and Harry Tily. On language 'utility': Processing complexity and communicative efficiency. WIREs: Cognitive Science, 2(3):323–335, 2011.

L. Karttunen. Features and values. In Proceedings of the 10th International Conference on Computational Linguistics, Stanford, 1984.

Jürgen Lenerz. Zur Abfolge nominaler Satzglieder im Deutschen. Narr, Tübingen, 1977.

Martin Loetzsch. Tools for grammar engineering. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Vanessa Micelli. Field topology and information structure: A case study for German constituent order. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Gereon Müller. Remarks on nominal inflection in German. In Ingrid Kaufmann and Barbara Stiebels, editors, More than Words: A Festschrift for Dieter Wunderlich, pages 113–145. Akademie Verlag, Berlin, 2002.

Stefan Müller. An HPSG-analysis for free relative clauses in German. Grammars, 2(1):53–105, 1999.

Stefan Müller. Case in German – towards an HPSG analysis. In Tibor Kiss and Detmar Meurers, editors, Constraint-Based Approaches to Germanic Syntax. CSLI, Stanford, 2001.

Allan Ramsay. Disjunction without tears. Computational Linguistics, 16(3):171–174, 1990.

Ivan A. Sag. Coordination and underspecification. In Jong-Bok Kim and Stephen Wechsler, editors, Proceedings of the Ninth International Conference on HPSG, Stanford, 2003. CSLI.

Ivan A. Sag and Thomas Wasow. Performance-compatible competence grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell, Oxford, 2011.

Martha Shrier. Case systems in German dialects. Language, 41(3):420–438, 1965.

Luc Steels. Constructivist development of grounded construction grammars. In Walter Daelemans, editor, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 9–19, Barcelona, 2004.

Luc Steels, editor. Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Luc Steels, editor. Computational Issues in Fluid Construction Grammar. Springer, Berlin, 2012a.

Luc Steels. Self-organization and selection in cultural language evolution. In Luc Steels, editor, Experiments in Cultural Language Evolution. John Benjamins, Amsterdam, 2012b.

Luc Steels and Joachim De Beule. Unify and merge in Fluid Construction Grammar. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, LNAI 4211, pages 197–223, Berlin, 2006. Springer.

Remi van Trijp. Feature matrices and agreement: A case study for German case. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

M. Walshe. A Middle High German Reader: With Grammar, Notes and Glossary. Oxford University Press, Oxford, 1974.

Bernd Wiese. Iconicity and syncretism. On pronominal inflection in Modern German. In Robin Sackmann, editor, Theoretical Linguistics and Grammatical Description, pages 323–344. John Benjamins, Amsterdam, 1996.

Joseph Wright. An Old High German Primer. Clarendon Press, Oxford, 2nd edition, 1906.

Dieter Wunderlich. Der unterspezifizierte Artikel. In Christa Dürscheid, Karl Heinz Ramers, and Monika Schwarz, editors, Sprache im Fokus, pages 47–55. Niemeyer, Tübingen, 1997.

Managing Uncertainty in Semantic Tagging

Silvie Cinková, Martin Holub and Vincent Kříž
Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
{cinkova|holub}@ufal.mff.cuni.cz

[email protected]

Abstract

Low interannotator agreement (IAA) is a well-known issue in manual semantic tagging (sense tagging). IAA correlates with the granularity of word senses, and they both correlate with the amount of information they give as well as with its reliability. We compare different approaches to semantic tagging in WordNet, FrameNet, PropBank and OntoNotes with a small tagged data sample based on the Corpus Pattern Analysis to present the reliable information gain (RG), a measure used to optimize the semantic granularity of a sense inventory with respect to its reliability indicated by the IAA in the given data set. RG can also be used as feedback for lexicographers, and as a supporting component of automatic semantic classifiers, especially when dealing with a very fine-grained set of semantic categories.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 840–850, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

1 Introduction

The term semantic tagging is used in two divergent areas: 1) recognizing objects of semantic importance, such as entities, events and polarity, often tailored to a restricted domain, or 2) relating occurrences of words in a corpus to a lexicon and selecting the most appropriate semantic categories (such as synsets, semantic frames, word senses, semantic patterns or framesets).

We are concerned with the second case, which seeks to make lexical semantics tractable for computers. Lexical semantics, as opposed to propositional semantics, focuses on the meaning of lexical items. The disciplines that focus on lexical semantics are lexicology and lexicography rather than logic. By semantic tagging we mean a process of assigning semantic categories to target words in given contexts. This process can be either manual or automatic.

Traditionally, semantic tagging relies on the tacit assumption that various uses of polysemous words can be sorted into discrete senses; understanding or using an unfamiliar word would then be like looking it up in a dictionary. When building a dictionary entry for a given word, the lexicographer sorts a number of its occurrences into discrete senses present (or emerging) in his/her mental lexicon, which is supposed to be shared by all speakers of the same language. The assumed common mental representation of a word's meaning should make it easy for other humans to assign random occurrences of the word to one of the pre-defined senses (Fellbaum et al., 1997).

This assumption seems to be falsified by the interannotator agreement (IAA, sometimes ITA) constantly reported much lower in semantic than in morphological or syntactic annotation, as well as by the general divergence of opinion on which value of which IAA measure indicates a reliable annotation. In some projects (e.g. OntoNotes (Hovy et al., 2006)), the percentage of agreements between two annotators is used, but a number of more complex measures are available (for a comprehensive survey see Artstein and Poesio (2008)). Consequently, using different measures for IAA makes the reported IAA values incomparable across different projects.

Even skilled lexicographers have trouble selecting one discrete sense for a concordance (Krishnamurthy and Nicholls, 2000); moreover, when the tagging performance of lexicographers and ordinary annotators (students) was compared, the experiment showed that the mental representations of a word's semantics differ for each group (Fellbaum et al., 1997; cf. Jorgensen, 1990). Lexicographers are trained in considering subtle differences among various uses of a word, which ordinary language users do not reflect. Identifying a semantic difference between uses of a word and deciding whether a difference is important enough to constitute a separate sense means presenting a word with a certain degree of semantic granularity. Intuitively, the finer the granularity of a word entry is, the more opportunities for interannotator disagreement there are and the lower IAA can be expected. Brown et al. (2010) proved this hypothesis experimentally. Also, the annotators are less confident in their decisions when they have many options to choose from (Fellbaum et al. (1998) reported a drop in subjective annotator confidence in words with 8+ senses).

Despite all the known issues in semantic tagging, the major lexical resources (WordNet (Fellbaum, 1998), FrameNet (Ruppenhofer et al., 2010), PropBank (Palmer et al., 2005) and the word-sense part of OntoNotes (Weischedel et al., 2011)) are still maintained and their annotation schemes are adopted for creating new manually annotated data (e.g. MASC, the Manually Annotated Subcorpus (Ide et al., 2008)). Moreover, these resources are not only used in WSD and semantic labeling, but also in research directions that in their turn do not rely on the idea of an inventory of discrete senses any more, e.g. in distributional semantics (Erk, 2010) and recognizing textual entailment (e.g. Zanzotto et al. (2009) and Aharon et al. (2010)).

It is a remarkable fact that, to the best of our knowledge, there is no measure that would relate granularity, reliability of the annotation (derived from IAA) and the resulting information gain. Therefore it is impossible to say where the optimum for granularity and IAA lies.

2 Approaches to semantic tagging

2.1 Semantic tagging vs. morphological or syntactic analysis

Manual semantic tagging is in many respects similar to morphological tagging and syntactic analysis: human annotators are trained to sort certain elements occurring in a running text according to a reference source. There is, nevertheless, a substantial difference: whereas morphologically or syntactically annotated data exist separately from the reference (tagset, annotation guide, annotation scheme), a semantically tagged resource can be regarded both as a corpus of texts disambiguated according to an attached inventory of semantic categories and as a lexicon with links to example concordances for each semantic category. So, in semantically tagged resources, the data and the reference are intertwined. Such double-faced semantic resources have also been called semantic concordances (Miller et al., 1993a). For instance, one of the earlier versions of WordNet, the largest lexical resource for English, was used in the semantic concordance SemCor (Miller et al., 1993b). More recent lexical resources have been built as semantic concordances from the very beginning (PropBank (Palmer et al., 2005), OntoNotes word senses (Weischedel et al., 2011)).

In morphological or syntactic annotation, the tagset or inventory of constituents is given beforehand and is supposed to hold for all tokens/sentences contained in the corpus. Problematic and theory-dependent issues are few and mostly well-known in advance. Therefore they can be reflected by a few additional conventions in the annotation manual (e.g. where to draw the line between particles and prepositions or between adjectives and verbs in past participles (Santorini, 1990), or where to attach a prepositional phrase following a noun phrase and how to treat specific "financialspeak" structures (Bies et al., 1995)). Even in difficult cases, there are hardly more than two options of interpretation. Data manually annotated for morphology or surface syntax are reliable enough to train syntactic parsers with an accuracy above 80% (e.g. Zhang and Clark (2011); McDonald et al. (2006)).

On the other hand, semantic tagging actually employs a different tagset for each word lemma. Even within the same part of speech, individual words require individual descriptions. Possible similarities among them come into relief ex post rather than being imposable on the lexicographers from the beginning. When assigning senses to concordances, the annotator often has to select among more than two relevant options. These two aspects make achieving good IAA much harder than in morphology and syntax tasks. In addition, while a linguistically educated annotator can have roughly the same idea of parts of speech as the author of the tagset, there is no chance that two humans (not even two professional lexicographers) would create identical entries for e.g. a polysemous verb. Any human evaluation of complete entries would be subjective. The maximum to be achieved is that the entry reflects the corpus data in a reasonably granular way on which annotators can still reach reasonable IAA.

2.2 Major existing semantic resources

The granularity vs.
IAA equilibrium is of great concern in creating lexical resources as well as in applications dealing with semantic tasks. When WordNet (Fellbaum, 1998) was created, both IAA and subjective confidence measurements served as an informal feedback to lexicographers (Fellbaum et al., 1998, p. 200). In general, WordNet has been considered a resource too fine-grained for most annotations (and applications). Navigli (2006) developed a method of reducing the granularity of WordNet by mapping the synsets to senses in a more coarse-grained dictionary. A manual, more coarse-grained grouping of WordNet senses has been performed in OntoNotes (Weischedel et al., 2011). The OntoNotes 90% solution (Hovy et al., 2006) actually means such a degree of granularity that enables a 90% IAA. OntoNotes is a reaction to the traditionally poor IAA in WordNet-annotated corpora, caused by the high granularity of senses. The quality of semantic concordances is maintained by numerous iterations between lexicographers and annotators. The categories 'right'-'wrong' have been, for the purpose of the annotated linguistic resource, defined by the IAA score, which is, in OntoNotes, calculated as the percentage of agreements between two annotators.

Two other, somewhat different, lexical resources have to be mentioned to complete the picture: FrameNet (Ruppenhofer et al., 2010) and PropBank (Palmer et al., 2005). While WordNet and OntoNotes pair words and word senses in a way comparable to printed lexicons, FrameNet is primarily an inventory of semantic frames and PropBank focuses on the argument structure of verbs and nouns (NomBank (Meyers et al., 2008), a related project capturing the argument structure of nouns, was later integrated in OntoNotes).

In FrameNet corpora, content words are associated with the particular semantic frames that they evoke (e.g. charm would relate to the Aesthetics frame), and their collocates in relevant syntactic positions (arguments of verbs, head nouns of adjectives, etc.) would be assigned the corresponding frame-element labels (e.g. in their dazzling charm, their would be The Entity for which a particular gradable Attribute is appropriate and under consideration, and dazzling would be Degree). Neither IAA nor granularity seem to be an issue in FrameNet. We have not succeeded in finding a report on IAA in the original FrameNet annotation, except one measurement in progress in the annotation of the Manually Annotated Subcorpus of English (Ide et al., 2008).[1]

[1] Checked on the project web www.anc.org/MASC/Home, 2011-10-29.

PropBank is a valency (argument structure) lexicon. The current resource lists and labels arguments and obligatory modifiers typical of each (very coarse) word sense (called frameset). Two core criteria for distinguishing among framesets are the semantic roles of the arguments along with the syntactic alternations that the verb can undergo with that particular argument set. To keep granularity low, this lexicon, among other things, usually does not make special framesets for metaphoric uses. The overall IAA measured on verbs was 94% (Palmer et al., 2005).

2.3 Semantic Pattern Recognition

From corpus-based lexicography to semantic patterns

The modern, corpus-based lexicology of the 1990s (Sinclair, 1991; Fillmore and Atkins, 1994) has had a great impact on lexicography. There is a general consensus that dictionary definitions need to be supported by corpus examples. Cf. Fellbaum (2001):

"For polysemous words, dictionaries [. . . ] do not say enough about the range of possible contexts that differentiate the senses. [. . . ] On the other hand, texts or corpora [. . . ] are not explicit about the word's meaning. When we first encounter a new word in a text, we can usually form only a vague idea of its meaning; checking a dictionary will clarify the meaning. But the more contexts we encounter for a word, the harder it is to match them against only one dictionary sense."

The lexical description in modern English monolingual dictionaries (Sinclair et al., 1987; Rundell, 2002) explicitly emphasizes contextual clues, such as typical collocates and the syntactic surroundings of the given lexical item, rather than relying on very detailed definitions. In other words, in modern corpus-based lexicography the sense definitions are obtained as syntactico-semantic abstractions of manually clustered corpus concordances: in classical dictionaries as well as in semantic concordances.

Nevertheless, the word senses, even when obtained by a collective mind of lexicographers and annotators, are naturally hard-wired and tailored to the annotated corpus. They may be too fine-grained or too coarse-grained for automatic processing of different corpora (e.g. a restricted-domain corpus). Kilgarriff (1997, p. 115) shows (with the handbag example) that there is no reason to expect the same set of word senses to be relevant for different tasks and that the corpus dictates the word senses; 'word sense' was therefore not found to be sufficiently well-defined to be a workable basic unit of meaning (p. 116). On the other hand, even non-experts seem to agree reasonably well when judging the similarity of use of a word in different contexts (Rumshisky et al., 2009). Erk et al. (2009) showed promising annotation results with a scheme that allowed the annotators graded judgments of similarity between two words or between a word and its definition.

Verbs are the most challenging part of speech. We see two major causes: vagueness and coercion. We neglect ambiguity, since it has proved to be rare in our experience.

CPA and PDEV

Our current work focuses on English verbs. It has been inspired by the manual Corpus Pattern Analysis method (CPA) (Hanks, forthcoming) and its implementation, the Pattern Dictionary of English Verbs (PDEV) (Hanks and Pustejovsky, 2005). PDEV is a semantic concordance built on yet a different principle than FrameNet, WordNet, PropBank or OntoNotes. The manually extracted patterns of frequent and normal verb uses are, roughly speaking, intuitively similar uses of a verb that express, in a syntactically similar form, a similar event in which similar participants (e.g. humans, artifacts, institutions, other events) are involved. Two patterns can be semantically so tightly related that they could appear together under one sense in a traditional dictionary. The patterns are not senses but syntactico-semantically characterized prototypes (see the example verb submit in Table 1). Concordances that match these prototypes well are called norms in Hanks (forthcoming). Concordances that match with a reservation (metaphorical uses, argument mismatch, etc.) are called exploitations. The PDEV corpus annotation indicates the norm-exploitation status for each concordance.

No. Pattern / Implicature
1  Pattern: [[Human 1 | Institution 1] ˆ [Human 1 | Institution 1 = Competitor]] submit [[Plan | Document | Speech Act | Proposition | {complaint | demand | request | claim | application | proposal | report | resignation | information | plea | petition | memorandum | budget | amendment | programme | . . . }] ˆ [Artifact | Artwork | Service | Activity | {design | tender | bid | entry 1 | dance | . . . }]] (({to} Human 2 | Institution 2 = authority) ˆ ({to} Human 2 | Institution 2 = referee)) ({for} {approval | discussion | arbitration | inspection | designation | assessment | funding | taxation | . . . })
   Implicature: [[Human 1 | Institution 1]] presents [[Plan | Document]] to [[Human 2 | Institution 2]] for {approval | discussion | arbitration | inspection | designation | assessment | taxation | . . . }
2  Pattern: [Human | Institution] submit [THAT-CL | QUOTE]
   Implicature: [[Human | Institution]] respectfully expresses {that [CLAUSE]} and invites listeners or readers to accept that {that [CLAUSE]} is true
4  Pattern: [Human 1 | Institution 1] submit (Self) ({to} Human 2 | Institution 2)
   Implicature: [[Human 1 | Institution 1]] acknowledges the superior force of [[Human 2 | Institution 2]] and puts [[Self]] in the power of [[Human 2 | Institution 2]]
5  Pattern: [Human 1] submit (Self) [[{to} Eventuality = Unpleasant] ˆ [{to} Rule]]
   Implicature: [[Human 1]] accepts [[Rule | Eventuality = Unpleasant]] without complaining
6  Pattern (passive): [Human | Institution] submit [Anything] [{to} Eventuality]
   Implicature: [[Human 1 | Institution 1]] exposes [[Anything]] to [[Eventuality]]

Table 1: Example of patterns defined for the verb submit.

Compared to other semantic concordances, the granularity of PDEV is high and thus discouraging in terms of expected IAA. However, selecting among patterns does not really mean disambiguating a concordance but rather determining to which pattern it is most similar, a task easier for humans than WSD is. This principle seems particularly promising for verbs as words expressing events, which resist the traditional word sense disambiguation the most.

A novel approach to semantic tagging

We present semantic pattern recognition as a novel approach to semantic tagging, which is different from the traditional word-sense assignment tasks. We adopt the central idea of CPA that words do not have fixed senses but that regular patterns can be identified in the corpus that activate different conversational implicatures from the meaning potential of the given verb. Our method draws on a hard-wired, fine-grained inventory of semantic categories manually extracted from corpus data. This inventory represents the maximum semantic granularity that humans are able to recognize in normal and frequent uses of a verb in a balanced corpus. We thoroughly analyze the interannotator agreement to find out which of the semantic categories are useful in the sense of information gain. Our goal is a dynamic optimization of semantic granularity with respect to given data and target application.

Like Passonneau et al. (2010), we are convinced that IAA is specific to each respective word and reflects its inherent semantic properties as well as the specificity of contexts the given word occurs in, even within the same balanced corpus. We accept as a matter of fact that interannotator confusion is inevitable in semantic tagging. However, the amount of uncertainty of the 'right' tag differs a lot, and should be quantified. For that purpose we developed the reliable information gain measure presented in Section 3.2.

CPA Verb Validation Sample

The original PDEV had never been tested with respect to IAA. Each entry had been based on concordances annotated solely by the author of that particular entry. The annotation instructions had been transmitted only orally. The data had been evolving along with the method, which implied inconsistencies. We put down an annotation manual (a momentary snapshot of the theory) and trained three annotators accordingly. For practical annotation we use the infrastructure developed at Masaryk University in Brno (Horák et al., 2008), which was also used for the original PDEV development. After initial IAA experiments with the original PDEV, we decided to select 30 verb entries from PDEV along with the annotated concordances. We made a new semantic concordance sample (Cinková et al., 2012) for the validation of the annotation scheme. We refer to this new collection[2] as VPS-30-En (Verb Pattern Sample, 30 English verbs).

We slightly revised some entries and updated the reference samples (usually 250 concordances per verb). The annotators were given the entries as well as the reference sample annotated by the lexicographer and a test sample of 50 concordances for annotation. We measured IAA, using Fleiss's kappa,[3] and analyzed the interannotator confusion manually. IAA varied from verb to verb, mostly reaching safely above 0.6. When the IAA was low and the type of confusion indicated a problem in the entry, the entry was revised. Then the lexicographer revised the original reference sample along with the first 50-concordance sample. The annotators got back the revised entry, the newly revised reference sample and an entirely new 50-concordance annotation batch. The final multiple 50-concordance sample went through one more additional procedure, the adjudication: first, the lexicographer compared the three annotations and eliminated evident errors. Then the lexicographer selected one value for each concordance to remain in the resulting one-value-per-concordance gold standard data and recorded it into the gold standard set. The adjudication protocol has been kept for further experiments. All values except the marked errors are regarded as equally acceptable for this type of experiment.

In the end, we get for each verb:

• an entry, which is an inventory of semantic categories (patterns)
• 300+ manually annotated concordances (single values)
• out of which 50 are manually annotated and adjudicated concordances (multiple values without evident errors).

[2] This new lexical resource, including the complete documentation, is publicly available at http://ufal.mff.cuni.cz/spr.
[3] Fleiss's kappa (Fleiss, 1971) is a generalization of Scott's π statistic (Scott, 1955). In contrast to Cohen's kappa (Cohen, 1960), Fleiss's kappa evaluates agreement between multiple raters. However, Fleiss's kappa is not a generalization of Cohen's kappa, which is a different, yet related, statistical measure. Sometimes, the terminology about kappas is confusing in the literature. For a detailed explanation refer e.g. to (Artstein and Poesio, 2008).

3 Tagging confusion analysis

3.1 Formal model of tagging confusion

To formally describe the semantic tagging task, we assume a target word and a (randomly selected) corpus sample of its occurrences. The tagged sample is S = {s1, . . . , sr}, where each

Properties: the ACM is symmetric, and for any i ≠ j the number Cij says how many times a pair of annotators disagreed on the two tags ti and tj, while Cii is the frequency of agreements on ti; the sum in the i-th row, Σj Cij, is the total frequency of assigned sets {t, t′} that contain ti.

An example of an ACM is given in Table 2. The corresponding confusion matrices are shown in Table 3.

       1   1.a    2    4    5
1     85     8    2    0    0
1.a    8     1    2    0    0
2      2     2   34    0    0
4      0     0    0    4    8
5      0     0    0    8    6

Table 2: Aggregated Confusion Matrix.

Our approach to exact tagging confusion analysis is based on probability and information theory.
instance si is an occurrence of the target word Assigning semantic tags by annotators is viewed with its context, and r is the sample size. as a random process. We define (categorical) ran- For multiple annotation we need a set of m an- dom variable T1 as the outcome of one annota- notators A = {A1 , . . . , Am } who choose from tor; its values are single member sets {t}, and we a given set of semantic categories represented have mr observations to compute their probabil- by a set of n semantic tags T = {t1 , . . . , tn }. ities. The probability that an annotator will use Generally, if we admitted assigning more tags to ti is denoted by p1 (ti ) = Pr(T1 = {ti }) and is one word occurrence, annotators could assign any practically computed as the relative frequency of subset of T to an instance. In our experiments, ti among all mr assigned tags. Formally, however, annotators were allowed to assign just one tag to each tagged instance. Therefore each m r 1 XX annotator is described as a function that assigns a p1 (ti ) = |Ak (sj ) ∩ {ti }|. mr single member set to each instance Ai (s) = {t}, k=1 j=1 where s ∈ S, t ∈ T . When a pair of annotators The outcome of two annotators (they both tag tag an instance s, they produce a set of one or two the same instance) is described by random vari- different tags {t, t0 } = Ai (s) ∪ Aj (s). able T2 ; its values are single or double member Detailed information about interannotator sets {t, t0 }, and we have m 2 r observations to (dis)agreement on a given sample S is rep- compute their probabilities. In contrast to p1 , the resented by a set of m 2 symmetric matrices Ak Al probability that ti will be used by a pair of anno- Cij = |{s ∈ S | Ak (s) ∪ Al (s) = {ti , tj }}|, tators is denoted by p2 (ti ) = Pr(T2 ⊇ {ti }), and for 1 ≤ k < l ≤ m, and i, j ∈ {1, . . . , n}. 
is computed as the relative frequency of assigned Note that each of those matrices can be easily sets {t, t0 } containing ti among all m 2 r observa- computed as C Ak Al = C + C T − In C, where tions: C is a conventional confusion matrix representing 1 X ? the agreement between annotators Ak and Al , p2 (ti ) = m Cik . 2 r k and In is a unit matrix. Definition: Aggregated Confusion Matrix (ACM) We also need the conditional probability that an annotator will use ti given that another annotator has used tj . For convenience, we use the nota- X C? = C Ak Al . 1≤k<l≤m tion p2 (ti | tj ) = Pr(T2 ⊇ {ti } | T2 ⊇ {tj }). 845 A1 vs. A2 A1 vs. A3 A2 vs. A3 1 1.a 2 4 5 1 1.a 2 4 5 1 1.a 2 4 5 1 29 1 1 0 0 1 29 2 0 0 0 1 27 2 0 0 0 1.a 0 1 0 0 0 1.a 1 0 0 0 0 1.a 2 0 1 0 0 2 0 1 11 0 0 2 0 0 12 0 0 2 1 0 11 0 0 4 0 0 0 2 0 4 0 0 0 1 1 4 0 0 0 1 4 5 0 0 0 3 1 5 0 0 0 0 4 5 0 0 0 0 1 Table 3: Example of all confusion matrices for the target word submit and three annotators. Obviously, it can be computed as only on the probability of that tag, and would be defined as I(tj ) = − log p1 (tj ). However, intu- Pr(T2 = {ti , tj }) p2 (ti | tj ) = itively one can say that a good measure of use- Pr(T2 ⊇ {tj }) fulness of a particular tag should also take into Cij? ? Cij consideration the expected tagging confusion re- = m = P ? . 2 r · p2 (tj ) k Cjk lated to the tag. Therefore, to exactly measure usefulness of the tag tj we propose to compare and measure similarity of the distribution p1 (ti ) Definition: Confusion Probability Matrix (CPM) and the distribution p2 (ti | tj ), i = 1, . . . , n. ? Cij How much information do we gain when an an- p Cji = p2 (ti | tj ) = P ? . notator assigns the tag tj to an instance? When k Cjk the tag tj has once been assigned to an instance Properties: The sum in any row is 1. 
The j-th by an annotator, one would naturally expect that row of CPM contains probabilities of assigning ti another annotator will probably tend to assign the given that another annotator has chosen tj for the same tag tj to the same instance. Formally, things same instance. Thus, the j-th row of CPM de- make good sense if p2 (tj | tj ) > p1 (tj ) and if scribes expected tagging confusion related to the p2 (ti | tj ) < p1 (ti ) for any i different from j. tag tj . If p2 (tj | tj ) = 100 %, then there is full con- An example is given in Table 3 (all confusion sensus about assigning tj among annotators; then matrices for three annotators), in Table 2 (the and only then the measure of usefulness of the tag corresponding ACM), and in Table 4 (the corre- tj should be maximal and should have the value sponding CPM). of − log p1 (tj ). Otherwise, the value of useful- ness should be smaller. This is our motivation to 1 1.a 2 4 5 define a quantity of reliable information gain ob- 1 0.895 0.084 0.021 0.000 0.000 tained from semantic tags as follows: 1.a 0.727 0.091 0.182 0.000 0.000 Definition: Reliable Gain (RG) from the tag tj is 2 0.053 0.053 0.895 0.000 0.000 4 0.000 0.000 0.000 0.333 0.667 X p2 (tk |tj ) RG(tj ) = −(−1)δkj p2 (tk |tj ) log . 5 0.000 0.000 0.000 0.571 0.429 p1 (tk ) k Table 4: Example of Confusion Probability Matrix. Properties: RG is similar to the well known Kullback-Leibler divergence (or information gain). If p2 (ti | tj ) = p1 (ti ) for all i = 1, . . . , n, 3.2 Semantic granularity optimization then RG(tj ) = 0. If p2 (tj | tj ) = 100 %, then Now, having a detailed analysis of expected tag- and only then RG(tj ) = − log p1 (tj ), which ging confusion described in CPM, we are able to is the maximum. If p2 (ti | tj ) < p1 (ti ) for compare usefulness of different semantic tags us- all i different from j, the greater difference in ing a measure of the information content associ- probabilities, the bigger (and positive) RG(tj ). 
ated with them (in the information theory sense). And vice versa, the inequality p2 (ti | tj ) > p1 (ti ) Traditionally, the amount of self-information con- for all i different from j implies a negative value tained in a tag (as a probabilistic event) depends of RG(tj ). 846 Definition: Average Reliable Gain (ARG) from 3.3 Classifier evaluation with respect to the tagset {t1 , . . . , tn } is computed as an expected expected tagging confusion value of RG(tj ): An automatic classifier is considered to be a func- X tion c that—the same way as annotators— assigns ARG = p1 (tj )RG(tj ) tags to instances s ∈ S, so that c(s) = {t}, j t ∈ T . The traditional way to evaluate the ac- curacy of an automatic classifier means to com- Properties: ARG has its maximum value if the pare its output with the correct semantic tags on CPM is a unit matrix, which is the case of the a Gold Standard (GS) dataset. Within our formal absolute agreement among all annotators. Then framework, we can imagine that we have a “gold” ARG has the value of the entropy of the p1 distri- annotator Ag , so that the GS dataset is represented bution: ARGmax = H(p1 (t1 ), . . . , p1 (tn )). by Ag (s1 ), . . . , Ag (sr ). Then the classic accuracy 1 Pr Merging tags with poor RG score can be computed as r i=1 |Ag (si )∩c(si )|. The main motivation for developing the ARG However, that approach does not take into con- value was the optimization of the tagset granular- sideration the fact that some semantic tags are ity. We use a semi-greedy algorithm that searches quite confusing even for human annotators. In our for an “optimal” tagset. The optimization process opinion, automatic classifier should not be penal- starts with the fine-grained list of CPA semantic ized for mistakes that would be made even by hu- categories and then the algorithm merges some mans. So we propose a more complex evaluation tags in order to maximize the ARG value. 
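The quantities defined in this section (ACM, CPM, p_1, RG, ARG) can be checked with a short script. The following is an illustrative sketch, not the authors' code: it rebuilds the ACM and CPM from the pairwise matrices of Table 3. The paper does not state the logarithm base used for the RG values in Table 5; we use the natural logarithm here, so the RG magnitudes differ from Table 5, although their signs agree.

```python
from math import log

TAGS = ["1", "1.a", "2", "4", "5"]
M, R = 3, 50  # three annotators, 50 concordances

# Pairwise confusion matrices for 'submit' (Table 3); rows index the first
# annotator's tag, columns the second annotator's tag.
C12 = [[29, 1, 1, 0, 0], [0, 1, 0, 0, 0], [0, 1, 11, 0, 0],
       [0, 0, 0, 2, 0], [0, 0, 0, 3, 1]]
C13 = [[29, 2, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 12, 0, 0],
       [0, 0, 0, 1, 1], [0, 0, 0, 0, 4]]
C23 = [[27, 2, 0, 0, 0], [2, 0, 1, 0, 0], [1, 0, 11, 0, 0],
       [0, 0, 0, 1, 4], [0, 0, 0, 0, 1]]
PAIRS = [C12, C13, C23]
N = len(TAGS)

def symmetrize(C):
    # C + C^T - I_n∘C: entry (i, j) now counts how often this pair of
    # annotators produced the unordered tag set {t_i, t_j}.
    return [[C[i][j] + C[j][i] - (C[i][i] if i == j else 0)
             for j in range(N)] for i in range(N)]

# Aggregated Confusion Matrix: sum over all annotator pairs.
ACM = [[sum(symmetrize(C)[i][j] for C in PAIRS) for j in range(N)]
       for i in range(N)]
assert ACM[0][:3] == [85, 8, 2] and ACM[3][4] == 8       # matches Table 2

# Confusion Probability Matrix: row j holds p2(t_i | t_j) = C*_ji / sum_k C*_jk.
CPM = [[c / sum(row) for c in row] for row in ACM]
assert round(CPM[0][0], 3) == 0.895 and round(CPM[3][4], 3) == 0.667  # Table 4

# p1(t_i): relative frequency of t_i among all m*r assigned tags.  Each
# annotator takes part in m-1 pairs, so divide the pooled counts by m-1.
freq = [(sum(sum(C[i]) for C in PAIRS) +
         sum(sum(C[k][i] for k in range(N)) for C in PAIRS)) / (M - 1)
        for i in range(N)]
assert freq == [90, 6, 36, 8, 10]       # matches the f column of Table 5
p1 = [f / (M * R) for f in freq]

def reliable_gain(j):
    # RG(t_j) = -sum_k (-1)^{delta_kj} p2(t_k|t_j) log(p2(t_k|t_j)/p1(t_k)),
    # with the usual convention 0 * log 0 = 0.
    rg = 0.0
    for k in range(N):
        p2kj = CPM[j][k]
        if p2kj > 0.0:
            term = p2kj * log(p2kj / p1[k])
            rg += term if k == j else -term
    return rg

RG = [reliable_gain(j) for j in range(N)]
ARG = sum(p1[j] * RG[j] for j in range(N))
# Signs agree with Table 5: tags 1 and 2 carry positive reliable gain,
# tags 1.a, 4 and 5 a negative one.
assert RG[0] > 0 and RG[2] > 0 and RG[1] < 0 and RG[3] < 0 and RG[4] < 0
```

A greedy merge step would then try unions of two tags (here, e.g., 1 with 1.a, and 4 with 5), rebuild ACM and CPM over the coarser tagset, and keep a merge whenever ARG increases.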
An example is given in Table 5. Tables 6 and 7 show the ACM and the CPM after merging. The examples relate to the verb submit already shown in Tables 1, 2, 3 and 4.

Original tagset          Optimal merge
Tag    f     RG          Tag      f     RG
1      90   +0.300       1 + 1.a  96   +0.425
1.a     6   −0.001
2      36   +0.447       2        36   +0.473
4       8   −0.071       4 + 5    18   +0.367
5      10   −0.054

Table 5: Frequency and Reliable Gain of tags.

      1   2   4
1    94   4   0
2     4  34   0
4     0   0  18

Table 6: Aggregated Confusion Matrix after merging.

      1      2      4
1   0.959  0.041  0.000
2   0.105  0.895  0.000
4   0.000  0.000  1.000

Table 7: Confusion Probability Matrix after merging.

score using the knowledge of the expected tagging confusion stored in CPM.

Definition: Classifier evaluation Score with respect to tagging confusion is defined as the proportion Score(c) = S(c) / S_max, where

S(c) = (α/r) Σ_{i=1}^{r} |A_g(s_i) ∩ c(s_i)| + ((1−α)/r) Σ_{i=1}^{r} p_2(c(s_i) | A_g(s_i)),

S_max = α + ((1−α)/r) Σ_{i=1}^{r} p_2(A_g(s_i) | A_g(s_i)).

           α = 1        α = 0.5       α = 0
Verb     rank Score   rank Score   rank Score
halt       1  0.84      2  0.90      4  0.81
submit     2  0.83      1  0.90      1  0.84
ally       3  0.82      3  0.89      5  0.76
cry        4  0.79      4  0.88      2  0.82
arrive     5  0.74      5  0.85      3  0.81
plough     6  0.70      6  0.81      6  0.72
deny       7  0.62      7  0.74      7  0.66
cool       8  0.58      8  0.69      8  0.53
yield      9  0.55      9  0.67      9  0.52

Table 8: Evaluation with different α values.

Table 8 gives an illustration of the fact that using different α values one can get different results when comparing tagging accuracy for different words (a classifier based on a bag-of-words approach was used). The same holds true for comparison of different classifiers.

3.4 Related work

In their extensive survey article, Artstein and Poesio (2008) state that word sense tagging is one of the hardest annotation tasks. They assume that making distinctions between semantic categories must rely on a dictionary. The problem is that annotators often cannot consistently make the fine-grained distinctions proposed by trained lexicographers, which is particularly serious for verbs, because verbs generally tend to be polysemous rather than homonymous.

A few approaches have been suggested in the literature that address the problem of the fine-grained semantic distinctions by (automatically) measuring sense distinguishability. Diab (2004) computes sense perplexity using the entropy function as a characteristic of training data. She also compares the sense distributions to obtain sense distributional correlation, which can serve as a "very good direct indicator of performance ratio", especially together with sense context confusability (another indicator observed in the training data). Resnik and Yarowsky (1999) introduced the communicative/semantic distance between the predicted sense and the "correct" sense. Then they use it for an evaluation metric that provides partial credit for incorrectly classified instances. Cohn (2003) introduces the concept of (non-uniform) misclassification costs. He makes use of the communicative/semantic distance and proposes a metric for evaluating word sense disambiguation performance using the Receiver Operating Characteristics curve that takes the misclassification costs into account. Bruce and Wiebe (1998) analyze the agreement among human judges for the purpose of formulating a refined and more reliable set of sense tags. Their method is based on statistical analysis of interannotator confusion matrices. An extended study is given in (Bruce and Wiebe, 1999).

4 Conclusion

The usefulness of a semantic resource depends on two aspects:
• reliability of the annotation
• information gain from the annotation.
In practice, a semantic resource typically emphasizes one aspect: OntoNotes, e.g., guarantees reliability, whereas the WordNet-annotated corpora seek to convey as much semantic nuance as possible. To the best of our knowledge, there has been no exact measure for the optimization, and the usefulness of a given resource can only be assessed when it is finished and used in applications. We propose the reliable information gain, a measure based on information theory and on the analysis of interannotator confusion matrices for each word entry, that can be continually applied during the creation of a semantic resource, and that provides automatic feedback about the granularity of the used tagset. Moreover, the computed information about the amount of expected tagging confusion is also used in the evaluation of automatic classifiers.

Acknowledgments

This work has been supported by the Czech Science Foundation projects GK103/12/G084 and P406/2010/0875 and partly by the project EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic). We thank our friends from Masaryk University in Brno for providing the annotation infrastructure and for their permanent technical support. We thank Patrick Hanks for his CPA method, for the original PDEV development, and for numerous discussions about the semantics of English verbs. We also thank three anonymous reviewers for their valuable comments.

References

Roni Ben Aharon, Idan Szpektor, and Ido Dagan. 2010. Generating entailment rules from FrameNet. In Proceedings of the ACL 2010 Conference Short Papers, pages 241–246, Uppsala, Sweden.
Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, December.
Christiane Fellbaum, Joachim Grabowski, and Shari Landes. 1997. Analysis of a hand-tagging task. In Proceedings of the ACL/SIGLEX Workshop, Somerset, NJ.
Christiane Fellbaum, J. Grabowski, and S. Landes. 1998. Performance and confidence in a semantic annotation task. In WordNet: An Electronic Lexical Database, pages 217–238. The MIT Press, Cambridge (Mass.).
Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style. Technical report, University of Pennsylvania.
Susan Windisch Brown, Travis Rood, and Martha Palmer. 2010. Number or nuance: Which factors restrict reliable word sense annotation? In LREC, pages 3237–3243. European Language Resources Association (ELRA).
Rebecca F. Bruce and Janyce M. Wiebe. 1998. Word-sense distinguishability and inter-coder agreement. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP '98), pages 53–60, Granada, Spain, June.
Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study of manual tagging. Natural Language Engineering, 5(2):187–205.
Silvie Cinková, Martin Holub, Adam Rambousek, and Lenka Smejkalová. 2012. A database of semantic clusters of verb usages. In Proceedings of the LREC 2012 International Conference on Language Resources and Evaluation. To appear.
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Trevor Cohn. 2003. Performance metrics for word sense disambiguation. In Proceedings of the Australasian Language Technology Workshop 2003, pages 86–93, Melbourne, Australia, December.
Mona T. Diab. 2004. Relieving the data acquisition bottleneck in word sense disambiguation. In Proceedings of the 42nd Annual Meeting of the ACL, pages 303–310, Barcelona, Spain. Association for Computational Linguistics.
Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 10–18, Suntec, Singapore, August. Association for Computational Linguistics.
Katrin Erk. 2010. What is word meaning, really? (And how can distributional models help us describe it?). In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 17–26, Uppsala, Sweden, July. Association for Computational Linguistics.
Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, and Susanne Wolf. 2001. Manual and automatic semantic annotation with WordNet.
Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Charles J. Fillmore and B. T. S. Atkins. 1994. Starting where the dictionaries stop: The challenge for computational lexicography. In Computational Approaches to the Lexicon, pages 349–393. Oxford University Press.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382.
Patrick Hanks and James Pustejovsky. 2005. A pattern dictionary for natural language processing. Revue Française de linguistique appliquée, 10(2).
Patrick Hanks. forthcoming. Lexical Analysis: Norms and Exploitations. MIT Press.
Aleš Horák, Adam Rambousek, and Piek Vossen. 2008. A distributed database system for developing ontological and lexical resources in harmony. In 9th International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15. Springer, Berlin.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 57–60, Stroudsburg, PA, USA. Association for Computational Linguistics.
Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, and Rebecca Passonneau. 2008. MASC: The Manually Annotated Sub-Corpus of American English. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 28–30. European Language Resources Association (ELRA).
Julia Jorgensen. 1990. The psycholinguistic reality of word senses. Journal of Psycholinguistic Research, (19):167–190.
Adam Kilgarriff. 1997. "I don't believe in word senses". Computers and the Humanities, 31(2):91–113.
Ramesh Krishnamurthy and Diane Nicholls. 2000. Peeling an onion: The lexicographer's experience of manual sense tagging. Computers and the Humanities, 34:85–97.
Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X '06), pages 216–220. Association for Computational Linguistics.
Adam Meyers, Ruth Reeves, and Catherine Macleod. 2008. NomBank v 1.0.
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993a. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology.
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993b. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology.
Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 105–112, Sydney, Australia.
Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: A corpus annotated with semantic roles. Computational Linguistics Journal, 31(1).
Rebecca J. Passonneau, Ansaf Salleb-Aouissi, Vikas Bhardwaj, and Nancy Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In LREC Proceedings, pages 3244–3249, Valletta, Malta.
Philip Resnik and David Yarowsky. 1999. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113–133.
Anna Rumshisky, M. Verhagen, and J. Moszkowicz. 2009. The holy grail of sense definition: Creating a sense-disambiguated corpus from scratch. Pisa, Italy.
Michael Rundell. 2002. Macmillan English Dictionary for Advanced Learners. Macmillan Education.
Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended Theory and Practice. ICSI, University of Berkeley, September.
Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. University of Pennsylvania, 3rd Revision, 2nd Printing, (MS-CIS-90-47):33.
William A. Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325.
John Sinclair, Patrick Hanks, et al. 1987. Collins COBUILD English Dictionary for Advanced Learners, 4th edition published in 2003. HarperCollins Publishers 1987, 1995, 2001, 2003; and Collins A–Z Thesaurus, 1st edition first published in 1995. HarperCollins Publishers 1995.
John Sinclair. 1991. Corpus, Concordance, Collocation. Describing English Language. Oxford University Press.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2011. OntoNotes release 4.0.
Fabio Massimo Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2009. A machine learning approach to textual entailment recognition. Natural Language Engineering, 15(4):551–582.
Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.

Parallel and Nested Decomposition for Factoid Questions

Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and Adam Lally
IBM T. J.
Watson Research Center Yorktown Heights, NY 10598, USA {adityakal,siddharth,bran,jencc,alally}@us.ibm.com Abstract Largely a legacy of the nature of TREC questions (Voorhees, 2002), this tactic works in most cases Typically, automatic Question Answering where the assumption holds that a question is fo- (QA) approaches use the question in its en- tirety in the search for potential answers. cused upon a single fact, and support for it may We argue that decomposing complex fac- be found in a single resource. toid questions into separate facts about their Our work deals with more complex factoid answers is beneficial to QA, since an an- questions, specifically ones containing multiple swer candidate with support coming from multiple independent facts is more likely facts related to the correct answer. Because such to be the correct one. We broadly cate- facts may be independent of each other, they may gorize decomposable questions as parallel well reside in different resources—and thus out- or nested, and we present a novel ques- side of the scope of a single-shot search query. tion decomposition framework for enhanc- (2) Which company has origins dating back to the ing the ability of single-shot QA systems 1870s and became the first U.S. company to to answer complex factoid questions. Es- have 1 million stockholders? sential to the framework are components for decomposition recognition, question re- Example (2) shows a question with two facts writing, and candidate answer synthesis about its answer (a company): its origins date and re-ranking. We discuss the inter- back to the 1870s, and it became the first in U.S. play among these, with particular empha- sis on decomposition recognition, a pro- to have 1 million stockholders. We turn to ques- cess which, we argue, can be sufficiently in- tion decomposition to leverage the separate facts formed by lexico-syntactic features alone. 
within the question, using them to garner support We validate our decomposition approach by for the correct answer from independent sources implementing the framework on top of a of evidence. Our hypothesis is that the more in- state-of-the-art QA system, showing a sta- dependent facts support an answer candidate, the tistically significant improvement over its accuracy. more likely it is to be the correct answer. We focus here on decomposition applied to im- proving the quality of QA over a broad set of 1 Introduction factoid questions. In contrast to most work on Question Answering (QA) systems for factoid decomposition to date, which tends to appeal to questions typically adopt a “single-shot” ap- discourse and/or semantic properties of the ques- proach for the task. Single-shot QA implicitly as- tion (Section 2), we exploit the notion of a fact to sumes that the question contains a single nugget view decomposition as circumscribed largely by of information (as in Example (1)). the syntactic shape of questions. Facts are entity- (1) In which city are the headquarters of GE relationship expressions, where the relation may located? be an N-ary predicate. Most informative, and thus To answer the question, these approaches attempt useful, facts are those that contain at least one to locate the factual information (the location of named entity (including temporal or locative ex- GE’s headquarters) in their underlying resources. pressions). 851 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 851–860, Avignon, France, April 23 - 27 2012. 2012 c Association for Computational Linguistics The particular relationship between indepen- 2005), lists (Hartrumpf, 2008) or lists of sets (Lin dent facts in any given question leads us to catego- and Liu, 2008), and so forth. rize “decomposable” questions broadly into two In the literature, we find descriptions of pro- types: parallel and nested. 
Examples (2) above cesses like local decomposition and meronymy and (3) below are parallel decomposable: sub- decomposition (Hartrumpf, 2008), semantic de- questions can be evaluated independently of one composition using knowledge templates (Katz et another. In contrast, nested questions require their al., 2005), question refocusing (Hartrumpf, 2008; decompositions to be processed in sequence, with Katz et al., 2005), and textual entailment (Laca- the answer to an ‘inner’ sub-question plugged tusu et al., 2006) to connect, through semantics into the ‘outer’. In Example (4), the inner sub- and discourse, the original question with its nu- question is marked in brackets; its answer, “cir- merous decompositions. In general, such pro- rhosis”, then leads to an outer question “In the cesses are not limited to using only lexical mate- treatment of cirrhosis, which drug reduces portal rial explicitly present in the question: a constraint venous blood inflow”, the answer to which is also we place upon our decomposition algorithms in the answer to the original question. order to retain the ability to do open-domain QA. (3) Which 2011 tax form do I fill if I need to do Closer to our strategy are notions like the syn- itemized deductions and I have an IRA rollover tactic decomposition of Katz et al. (2005), and the from 2010? temporal/spatial analysis of Saquete et al. (2004) (4) In the treatment of [a condition that causes and Hartrumpf (2008). Still, our approach differs bleeding esophageal varices], which drug in at least two significant ways. We offer a prin- reduces portal venous blood inflow? cipled solution to the problem of the final combi- Questions like these are found in domains such as nation and ranking of candidate answers returned medical, legal, etc., as they tend to arise in more from multiple decompositions, by means of train- dynamic QA system setting. 
Independently of do- ing a model to weigh the effects of decomposition main and type, however, they share a common recognition rules. We also note that spatial and characteristic: if a search query is constructed temporal decomposition are just special cases of from all the facts collectively describing the an- solving nested decomposable questions. swer, it is likely to ‘flood’ the system with noise, The closest similarity our fact-based decompo- and confuse the identification of potential answer- sition has with an established approach is with the bearing passages. The notion of decomposition notion of asking additional questions in order to thus goes hand in hand with that of recursively derive constraints on candidate answers (Prager applying a QA system to the individual facts (sub- et al., 2004). However, the additional questions questions), followed by suitable re-composition there are generated through knowledge of the do- of the candidate answer lists for the sub-questions. main, making that technique hard to apply in an This paper presents a novel decomposition ap- open domain setting. In contrast, we developed proach for such questions. We discuss the partic- a domain-independent approach to question de- ular strategies for recognizing and typing decom- composition, in which we use the question con- posable questions, and the subsequent processing text alone in generating queriable constraints. of sub-questions, and their candidate answer lists, in ways which can improve the performance of an 3 Fact-based Decomposition existing state-of-the-art QA system. Enhancing a single-shot QA system with a ca- 2 Related work pability for incremental solving of decomposable questions requires recognizing that a question is A variety of approaches to QA cite ‘decomposi- decomposable, and engaging in a staged process- tion’, in the context of addressing question com- ing of its sub-question parts. Whether parallel or plexity. 
Whether parallel or nested, the system needs to identify the multiple facts, and configure itself as appropriate. Figure 1 shows our fact-based decomposition "meta"-framework ("meta", as it builds on top of an existing QA system). It comprises four main components, as illustrated in the figure.

[Figure 1: Fact-based decomposition framework. The input Question flows through the recognizers to the base QA system; the resulting Ranked Candidates lists are merged into a Final Answer List.]

Decomposition Recognizers analyze the input question and identify decomposable parts, using a set of predominantly lexico-syntactic cues (Section 4). Question Rewriters re-write the sub-questions found by the recognizer, retaining key contextual information (Section 5.1). The Underlying QA System generates, for any factoid question, a ranked list of answer candidates, each with a confidence corresponding to the probability of the answer being correct. Answer Synthesis and Re-ranking is a placeholder for the particular process which tries to combine the ranked candidate answers obtained for the original question with solutions for the decomposed facts, into a uniform ranked answer list. In general, different combination functions may be appropriate for different types of decomposable questions. Thus, for the classes of parallel and nested questions, our decomposition strategies (described in Sections 5.2 and 5.3) defer to an Answer Merger. Other combination functions may be required for, e.g., selecting or aggregating over lists; cf. Hartrumpf's operational decomposition (2008), or Lin and Liu's multi-focus questions (2008); see also the special-questions solving techniques of Prager et al. (submitted).

We use a particular QA system (Ferrucci et al., 2010) as base. However, any system can be plugged into our meta-framework, as long as it can solve factoid questions by providing answers with confidences reflecting correctness probability, and it maintains context/topic information for the question separately from its main content. Parallel and nested processing are distinct: note the two different pathways in the figure, multiple parallel facts submitted to the base QA system vs. inner-outer sub-question pairs processed via a feedback loop. The base system is invoked on the full question, and on its decompositions.

4 Decomposition Recognizers

The primary goal in decomposing questions is to identify facts involving the entity being asked for (henceforth the focus), simpler than the full question and solvable independently (Section 1). Most question decomposition work (Section 2) tends to defer to semantic, discourse, and other domain-specific information; in contrast, we recognize decomposable questions primarily on the basis of their syntactic shape. This is important for our claim that the decomposition framework outlined in Section 3 is generally applicable to multiple QA tasks and system configurations.

In our work, we use a dataset of factoid question/answer pairs from Jeopardy! (http://www.jeopardy.com), a popular TV quiz show in the US. The data is particularly challenging, not least for the broad domain it covers and the complex language used. In addition to making for an excellent test-bed for open-domain QA, the data offers a wide choice of questions which require decomposing.

4.1 Decomposition Patterns

Our analysis of complex decomposable questions highlights numerous syntactic cues that are reliable indicators for decomposition, and it is predominantly such cues that we exploit for driving the recognition and typing of decomposable questions. A set of recognition patterns can be formulated in terms of fine-grained lexico-syntactic information, expressed over the predicate-argument structure (PAS) of the syntactic parse of the question. We identify three major categories of configurationally-based patterns: independent subtrees, composable units, and segments with qualifiers. These are general, in the sense that they capture relationships between configurational properties of a question and its status with respect to decomposability. The specific rules implementing the patterns may, or may not, have to be modified if, for instance, there is a style change, or a shift in the syntactic analysis framework of the base QA system to a different parser; such implementation details do not affect our analysis of syntactically-cued decomposition recognition.

Table 1: Decomposition Rule Sets

Independent Subtrees
(1.P) Parallel
  clause:         Its original name meant "bitter water" and it was made palatable to Europeans after the Spaniards added sugar
                  -> Fact #1: Its original name meant "bitter water"
                  -> Fact #2: It was made palatable to Europeans after the Spaniards added sugar
  complementary:  "American Prometheus" is a biography of this physicist who died in 1967
                  -> Fact #1: this physicist who died in 1967
                  -> Fact #2: "American Prometheus" is a biography of this physicist
(1.N) Nested
  coincidental:   When "60 Minutes" premiered, this man was U.S. President
                  -> Inner Fact: When "60 Minutes" premiered
                  -> Outer Fact: When this man was president
  based-on:       A controversial 1979 war film was based on a 1902 work by this author
                  -> Inner Fact: A controversial 1979 war film
                  -> Outer Fact: film was based on a work by this author
  named-for:      Article of clothing named for an old character who dressed in loose trousers in commedia dell'arte
                  -> Inner Fact: an old character who dressed in loose trousers in commedia dell'arte
                  -> Outer Fact: Article of clothing named for character

Composable Units
(2.P) Parallel
  verb-args:      He launched his lecturing career in 1866 with a talk later titled "Our fellow savages of the Sandwich Islands"
                  -> Fact #1: He launched his lecturing career in 1866
  focus-mod:      "The Mute" was the working title of this 1940 novel by a female author
                  -> Fact #1: this 1940 novel by a female author
  triple:         His rise began when he upset Robert M. La Follette, Jr. in a 1946 Senate primary
                  -> Fact #1: he upset Robert M. La Follette, Jr.
(2.N) Nested
  explicit-link:  To honor his work, this man's daughter took the name Maria Celeste when she became a nun in 1616
                  -> Inner Fact: To honor his work, [this] daughter took the name Maria Celeste, when ...
                  -> Outer Fact: this man's daughter
  descriptive-np: The word for this congressional job comes from a fox-hunting term for someone who keeps the hunting dogs from straying
                  -> Inner Fact: a fox-hunting term for someone who keeps the hunting dogs from straying
                  -> Outer Fact: The word for this congressional job comes from term

Segments with Qualifiers
(3.P) Parallel
  qualifier:      Winning in 1965 and 1966, he was the first man to win the Masters golf tournament in 2 consecutive years
                  -> Fact #1: he was the first man to win the Masters golf tournament in 2 consecutive years
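To make the syntactically-cued recognition concrete, the following toy sketch mimics two of the rule sets above with surface cues. This is purely our own illustration: the actual rules operate over the ESG predicate-argument structure, not regular expressions.

```python
# A toy illustration of syntactically-cued decomposition recognition
# (the real rules use the ESG parse; the regex cues below merely mimic
# the clause and explicit-link rule sets for demonstration).
import re

def recognize(question):
    """Return (decomposition_type, parts) for a toy declarative question."""
    # Nested cue (explicit-link style): the focus is itself a modifier,
    # e.g. the possessive in "this man's daughter".
    if re.search(r"\bthis \w+'s\b", question):
        return ("nested", [question])  # inner/outer split happens later
    # Parallel cue (clause style): top-level coordination of two clauses
    # that each characterize the same entity (the focus).
    parts = [p.strip() for p in re.split(r"\band\b", question) if p.strip()]
    if len(parts) == 2 and all(
            re.search(r"\b(its?|he|she|this)\b", p, re.I) for p in parts):
        return ("parallel", parts)
    return ("atomic", [question])

print(recognize('Its original name meant "bitter water" and it was made '
                'palatable to Europeans after the Spaniards added sugar'))
```

Even this toy version shows the two decisions the recognizers must make: whether a question decomposes at all, and whether its sub-questions behave in parallel or in a nested fashion.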
Table 1 shows example decompositions within the pattern categories; note that within a category there are typically rule sets for both parallel and nested decomposition types. (In the data we use, questions are posed in a declarative format, with stylized marking of the question focus; this should not detract from referring to them as 'questions'.)

Independent Subtrees. A good source of independent sub-questions within a question is clauses likely to capture a unique piece of information about the answer, distinct from the rest of the question. Relative or subordinate clauses (not in a superlative or ordinal context; see Segments with Qualifiers below) are examples of independent subtrees, and are indicative of parallel decomposition. PAS configurations that connect such subtrees to the focus are generally good indicators of a sub-question: cues to "break off" a subtree from the question as a decomposable fact. For example, an independent fact (in brackets) is identified within the larger question "The name of [this character, first introduced in 1894], comes from the Hindi for 'bear'". This category also includes rules using conjunctions as decomposition points (at various levels of the syntactic parse), as in Example (3) earlier (Section 1).

Parallel decomposition of this type is captured in two rule sets, clause and complementary, which differ primarily in that the 'complementary' rules attempt to derive two separate sub-questions, while the 'clause' rules attempt to locate independent sub-questions within the original question. The examples in Table 1/row (1.P) illustrate this distinction.

For nested decomposition, we have three rule sets: coincidental, based-on, and named-for. These use lexical cues to detect specific semantic relations within the question that could indicate nestedness. For instance, the 'coincidental' rules identify sub-questions resolving a temporal link with the focus of the original question. The 'based-on' and 'named-for' rules detect sub-questions where the answer to the original question is based on, or named for, the answer to the inner sub-question (Table 1/row (1.N)). Note that in different domains, different relations may correlate with nestedness, for instance disease-causes-symptom in a medical setting; cf. Example (4) in Section 1. The general pattern would still apply, even if different rules are needed to implement it.

Configurational information is used to determine whether a question exhibits a parallel or nested decomposition profile. Thus the syntactic contour of Example (3) shows that two clauses characterize the same entity (the focus): a clear indicator that the sub-questions are parallel. Conversely, "A controversial 1979 war film was based on a 1902 work by this author" exhibits a very different set of configurational properties. There are two underspecified entities (including the focus), both characterized as head-plus-modifiers syntactic units; however, there is no 'sharing' of the separate characterizations (facts) via a common head. This indicates nestedness: the inner sub-question is the one around the underspecified, but non-focus, element ("a controversial 1979 war film"); the outer is "[film] was based on a 1902 work by this author".

Another cue for nested questions is a subtree labeled by a temporal subordinate conjunction, or a subordinate clause, away from the focus-enclosing top level of the question and itself underspecified. Such analysis motivates the question "When "60 Minutes" premiered, this man was U.S. president" to be solved first for the temporal expression, "When did "60 Minutes" premiere?", followed by "Who was U.S. President in 1968?".

Composable Units. An alternate strategy for identifying sub-questions is to "compose" a fact by combining elements from the question. In contrast to the previous category, the Composable Units rules combine separate parts of the PAS into a fact. For instance, a sub-question can be created by associating the focus head with its premodifiers and postmodifiers. If the premodifiers and postmodifiers are sufficiently specific, we obtain reasonably independent sub-questions, with parallel-decomposable behavior.

Three parallel decomposition rule sets are defined in this category: verb-args, focus-mod, and triple (see Table 1/row (2.P)). The 'verb-args' rules "compose" a fact from the verb and its arguments (subject, object, PP complements). The 'focus-mod' rules combine the head of the focus NP with its modifiers to generate a sub-question. Similar to 'verb-args' are the 'triple' rules, which create less constrained sub-questions, in that the composition always links only two of the arguments to the underlying predicate, e.g. subject-verb-object or subject-verb-complement.

Here also, a particular configuration around the focus may indicate a question requiring nested processing. For the nested case, the Composable Units category has two rule sets: explicit-link and descriptive-np (Table 1/row (2.N)).

In contrast to questions where modifiers of the focus can be cues for parallel decomposition (i.e. the 'focus-mod' rules above), the 'explicit-link' rules detect nested decomposition, signaled by the focus itself being a modifier. For example, in "To honor his work, this man's daughter took the name Maria Celeste when she became a nun in 1616", the focus ("this man") is a determiner to an underspecified node ("daughter"). Traversing the tree without descending to the level of the focus "carves out" an inner sub-question itself focused on that underspecified node ("daughter"): see Table 1/row (2.N).
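The "composition" of facts from predicate-argument structure can be illustrated with a toy flat PAS. The dict representation and function names below are our own simplified stand-ins for the parser output, not the system's actual data structures:

```python
# A simplified illustration of composing facts from a predicate-argument
# structure, in the spirit of the verb-args and triple rule sets. The
# flat dict PAS is a toy stand-in for ESG output.

def verb_args_fact(pas):
    """Compose one fact from the verb and all of its arguments."""
    slots = [pas.get("subject"), pas["verb"], pas.get("object")] + pas.get("pp", [])
    return " ".join(s for s in slots if s)

def triple_facts(pas):
    """Compose less constrained facts linking only two arguments
    to the verb (subject-verb-object, subject-verb-complement)."""
    facts = []
    if pas.get("subject") and pas.get("object"):
        facts.append(" ".join([pas["subject"], pas["verb"], pas["object"]]))
    for comp in pas.get("pp", []):
        if pas.get("subject"):
            facts.append(" ".join([pas["subject"], pas["verb"], comp]))
    return facts

pas = {"subject": "he", "verb": "launched", "object": "his lecturing career",
       "pp": ["in 1866"]}
print(verb_args_fact(pas))  # the full verb-args fact
print(triple_facts(pas))    # the less constrained triple facts
```

The contrast between the two functions mirrors the priority noted in Section 4.2: the verb-args composition keeps all arguments together, while the triples are deliberately looser and hence noisier.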
The 'descriptive-np' rule set finds 'parenthetical' descriptions of underspecified nouns in the primary question, as in e.g. "This arboreally named area was made famous by [a prince in the region noted for impaling enemies on stakes]": the nested-decomposable nature of this question is captured in the descriptive phrase (in square brackets) functioning as an inner sub-question.

Segments with Qualifiers. This category of rules covers cases where the modifier of the focus is a relative qualifier, such as "the first", "only", or "the westernmost". In such cases, information from another clause is usually required to "complete" the relative qualifier: consider e.g. the incomplete "the third man" vs. the fact "the third man ... to climb Mt. Everest". To deal with these cases, rules in this category combine the characteristics of the Composable Units rules with those of the Independent Subtrees rules: we "compose" the relative qualifier, the focus (along with its modifiers), and the attached supporting clause subtree. As illustrated in row (3.P) of Table 1, for parallel decomposition this rule set covers sub-questions expressed as superlatives. We do not have any rules of this type for the nested case.

4.2 Decomposition Filters

All three pattern categories above rely only on a syntactic analysis of the question; this is delivered by the English Slot Grammar (ESG) parser (McCord, 1989). When rules fire, they also identify the question segments proposed as sub-questions. Not surprisingly, the rules over-generate; to mitigate this, we apply several heuristic filters to the proposed sub-questions. The filters discard sub-questions that do not contain a named entity, a quoted string, or a time or date expression (these are detected by the ESG parser). Additionally, we discard sub-questions that almost completely overlap the entire question, or a sub-question from a prior rule. A partial priority order is imposed on rule application, based on intuitions about how informative the facts generated by a rule are; this order is reflected on a per-type basis in Table 1: e.g., within type (2.P) we prefer verb-args to triple, since the latter tends to produce less constrained facts than the former.

5 Using Decomposition

In essence, decomposition recognition informs two processes. According to the question type, parallel or nested, the appropriate pathway in the framework (Figure 1) needs to be instantiated; before sub-questions are submitted to the base QA system, they may need augmentation to facilitate the recursive system invocation. The answer sets obtained from sub-question processing then need to be analyzed and rationalized, to determine the final answer to the original question.

5.1 Question Re-Writing

For parallel decomposition, the goal is to solve the original question Q by solving sub-questions independently and combining the results appropriately. For example, consider the Jeopardy! question

(5) HISTORIC PEOPLE: The life story of this man who died in 1801 was chronicled in an A&E Biography DVD titled "Triumph and treason"

(Jeopardy! questions also contain category information, which further contextualizes the search for the answer.) We get two decompositions:

Q1: This man who died in 1801
Q2: The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and treason"

Submitting sub-questions unmodified to the base QA system raises at least two problems. Sub-questions are often much shorter than the original question, and in many cases no longer have a unique answer. Moreover, some of the information from the original question that was dropped in a sub-question may be a relevant contextual cue which the QA system needs to come up with the correct answer. Q1 above illustrates both problems: it does not have a unique answer, and it suffers from a recall problem (the correct answer is not in the candidate answer list of the base system when it considers this sub-question alone).

Our solution is to insert contextual information into the sub-questions. In a two-step process for a sub-question Qi, we obtain the set of all named entities and nouns (ignoring stopwords) in the original question text outside of Qi, and we insert these keywords into the original question category. In Jeopardy! questions, the category field is the context/topic information which the underlying QA system needs in order to use the decomposition framework, as stated in Section 3. In general, a QA system may derive such information in a variety of ways, e.g. by exploiting the problem description in a technical-assistance QA setting, or a patient's medical history in a medical QA setting.
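A minimal sketch of this re-writing step follows, with toy stand-ins for named-entity and noun detection (the real system uses NER and parser output; the capitalization and quoting heuristics below are purely illustrative):

```python
# Sketch of contextual re-writing: collect keywords from the part of the
# original question outside the sub-question, and append them, in
# parentheses, to the category. The keyword heuristics here are toy
# stand-ins for proper NE/noun detection.
import re

STOPWORDS = {"the", "of", "this", "who", "was", "in", "a", "an", "and",
             "by", "titled"}

def context_keywords(question, sub_question):
    outside = question.replace(sub_question, " ")
    # Toy stand-in for NE/noun detection: quoted strings, capitalized
    # tokens, and numeric tokens such as years.
    quoted = re.findall(r'"[^"]+"', outside)
    rest = re.sub(r'"[^"]+"', " ", outside)
    tokens = [t for t in re.findall(r"[A-Za-z0-9&]+", rest)
              if t.lower() not in STOPWORDS and (t[0].isupper() or t.isdigit())]
    return quoted + tokens

def rewrite(category, question, sub_question):
    keywords = context_keywords(question, sub_question)
    if keywords:
        return f"{category} ({' '.join(keywords)}): {sub_question}"
    return f"{category}: {sub_question}"

q = ('The life story of this man who died in 1801 was chronicled '
     'in an A&E Biography DVD titled "Triumph and treason"')
print(rewrite("HISTORIC PEOPLE", q, "this man who died in 1801"))
```

Run on Example (5), this produces a rewritten sub-question of the same shape as (5-1) below: the category keeps its original terms, with the added context terms separated in parentheses.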
Rule-verb-args Features corresponding to the rules The re-written Q1 /Q2 for Example (5) are: Rule-clause sets used in parallel decomposition – (5-1) H ISTORIC P EOPLE (A&E B IOGRAPHY DVD Rule-qualifier each feature takes a numeric value, “T RIUMPH AND T REASON ”): This man who Rule-focus-mod which is the confidence of the QA Rule-complementary system on a fact identified by the cor- died in 1801 Rule-triple responding rule set (5-2) H ISTORIC P EOPLE (1801): The life story of this man was chronicled in an A&E Biography Table 2: Features in Parallel Re-ranking Model DVD titled “Triumph and treason” The keywords are inserted in parentheses, to en- Finally, if the sub-questions are not of a good sure a clear separation between the original cat- quality (e.g. due to a bad parse), we need a fall- egory terms and the context terms added. Other back to the original question, which implies that systems may need a different re-writing tactic. the confidence for the candidate answer for the The above re-writing technique is used for both entire question should also be considered when parallel and nested decomposable questions. For making a final decision. Consequently, we use a the nested case, there is an additional re-writing machine-learning model to combine information step that needs to be done – after solving the in- across sub-question answer confidences, with fea- ner question, we need to substitute its answer into tures capturing the above information (Table 2). the outer when solving for it. Thus the first ex- In case a candidate answer is not in the answer ample in Table 1/row(1.N) would have its inner list of the full question or any of the decomposed focus “When ‘60 Minutes’ premiered” replaced sub-questions, the corresponding feature value is with “In 1968” creating the outer question ”In set to missing. If a rule generates multiple sub- 1968, this man was U.S. 
President” whose solu- questions, its corresponding feature value for the tion is the answer to the original question. candidate answer is set to the sum of the confi- dences obtained for that answer across all sub- 5.2 Answer Re-Ranking: Parallel questions. The model is trained using Weka’s The base QA system will process the re-written (Witten and Frank, 2000) logistic regression al- category/sub-question pairs, and will produce a gorithm with instance weighting. set of ranked candidate lists with confidences. These need to be combined into a final answer list 5.3 Answer Re-Ranking: Nested for the original question, accounting for informa- Nested questions decompose into inner/outer tion across all sub-question candidate lists. question pairs. The task is to solve the inner ques- One way to produce a final score for each can- tion first, substitute the answer obtained, based on didate answer is simply to take the product of its confidence, into the outer, and solve that for the the scores returned by the QA system for each final answer. This is contingent upon selecting an- of the sub-questions. This assumes that the sub- swers to the inner question which might profitably questions are typically independent and that the be plugged into the outer; substituting incorrect QA system produces a confidence which corre- answers will only lead to noisy final answers, with sponds to the probability of the answer being cor- negative impact to overall accuracy. rect. However, even if the sub-questions are inde- We rely on the ability of the underlying QA sys- pendent, question re-writing breaks this assump- tem to produce meaningful confidences for its an- tion as it brings information from the remain- swers, and only consider the top answer to the in- der of the question into the sub-question context. ner question for substitution into the outer—if its Also, the sub-questions are generated by decom- confidence exceeds some threshold. 
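The nested pathway just described (solve the inner question, substitute its top answer if sufficiently confident, then solve the outer question and compare against the full-question answer) can be sketched as follows; the threshold value and all names are illustrative, not the authors' settings:

```python
# Sketch of the nested re-ranking heuristic: take the top inner answer
# only if its confidence clears a threshold, substitute it into the
# outer question, and keep the decomposition's answer only if the
# product of inner and outer confidences beats the top answer to the
# full question. Threshold and names are illustrative.

def solve_nested(inner_cands, solve_outer, full_cands, threshold=0.5):
    """inner_cands/full_cands: ranked lists of (answer, confidence).
    solve_outer: callable that substitutes the inner answer into the
    outer question and returns its top (answer, confidence)."""
    best_full = max(full_cands, key=lambda ac: ac[1])
    if not inner_cands:
        return best_full
    inner_ans, inner_conf = max(inner_cands, key=lambda ac: ac[1])
    if inner_conf < threshold:
        return best_full                       # inner answer too unreliable
    outer_ans, outer_conf = solve_outer(inner_ans)  # substitute and solve
    aggregate = inner_conf * outer_conf
    return (outer_ans, aggregate) if aggregate > best_full[1] else best_full

# Toy run: the inner question resolves confidently, so the outer
# question's answer (with the aggregated confidence) wins.
print(solve_nested([("1968", 0.9)],
                   lambda a: ("CandidateY", 0.8),
                   [("CandidateX", 0.3)]))
```

The fallback to the full-question answer mirrors the parallel case: a bad inner answer should never displace a confident single-shot answer.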
Finally, the answers to the outer question need to be related to the full-question answer list, to produce the final ranked answers. For answer re-ranking, we use the following heuristic selection strategy: we compute the aggregate confidence of the answer obtained through decomposition as the product of the inner-question answer confidence and the outer-question answer confidence, and compare this value with the confidence of the top answer to the entire question, selecting the higher-confidence one as our final answer. Note that this re-ranking is different from the one used in parallel decomposition, where we combine results from multiple sub-questions into a single confidence.

6 Evaluation

6.1 Evaluation Data

As we discuss question decomposition in the context of Jeopardy! data (Section 4), our test set contains only Final Jeopardy! (FJ) questions. These are often long and complex, with multiple facts or constraints that need to be satisfied. Also, they are typically much harder to answer than regular Jeopardy! questions, both for humans and for our base QA system. The test set comprises close to 3,000 FJ questions, broken into 1,138 for training, 517 for development, and 1,269 for testing (as blind data).

6.2 Experiments

The decomposition rules (Section 4) and re-ranking parameters (Sections 5.2, 5.3) were defined and tuned on the development set. The final re-ranking model was trained over the training set, using the features described in Section 5.2 and logistic regression with instance re-weighting. The results of applying the decomposition rules followed by the re-ranking model to the 1,269 test questions are shown in Table 3. The baseline is the performance of the underlying QA system used in our meta-framework, without any decomposition components, applied to the same test set.

We evaluated separately the impact of our question re-writing strategy (Section 5.1), which carries contextual information from the original question into the sub-questions. For this purpose, we altered our algorithm to issue the sub-question text as-is, using the original category, and re-trained and evaluated the resulting model on the test set. These results are also shown in Table 3.

Table 3: Evaluating Decomposition

  System   End-to-End Accuracy   Decomposable Q Accuracy
  PB       635/1269 (50.05%)     339/598 (56.68%)
  PD-QR    634/1269 (49.96%)     338/598 (56.52%)
  PD+QR    643/1269 (50.66%)     347/598 (58.02%)
  NB       635/1269 (50.05%)     129/255 (50.58%)
  ND+QR    640/1269 (50.43%)     134/255 (52.54%)

In the table, PB refers to the Parallel Baseline and NB to the Nested Baseline; both are results from running the underlying QA system without any decomposition capabilities. PD and ND refer to the Parallel and Nested Decomposition systems respectively, and QR refers to question re-writing. Separate experiments determined end-to-end accuracy for the different system configurations, with respect to the entire test set, and accuracy over the decomposable-question subsets of the test set.

We do not offer a separate analysis of decomposition recognition. Manual creation of a decomposition gold standard is highly non-trivial, largely due to the numerous alternative ways to decompose a question and to synthesize unique facts from its segments. Indeed, this is precisely the motivation for weighting the decomposition rules in a trained re-ranking model (Section 5.2). Given this, we are interested only in measuring the impact of decomposition on end-to-end QA performance.

6.3 Discussion of Results

The results in Table 3 show that the parallel decomposition rules were able to decompose a large fraction of the test set (598 out of 1,269 questions: 47%). Interestingly, the performance of the baseline QA system on the decomposable set was 56.7%, over 6% higher than its performance on the entire test set. One reason for this result could be that parallel decomposable questions typically contain much more information about the answer (more than one fact or constraint that the answer must satisfy), and the system in some cases is able to exploit this redundancy, such as when one fact is strongly associated with the correct answer and there is evidence supporting this in the sources.

A different, important result is that using the decomposition algorithm without question re-writing showed no impact over the baseline. This highlights the importance of contextual information for QA.
On the other hand, when using re-writing to maintain context (Section 5.1), our parallel decomposition algorithm achieved a gain of 1.4% on the parallel decomposable question set, which translated to an end-to-end gain of 0.6%.

Separately, the table shows that roughly a fifth of the entire test set (255 out of 1,269 questions) was recognized as nested decomposable. Again, interestingly, the performance of the baseline QA system on the nested decomposable set was roughly the same as its overall performance (and much lower than on the parallel decomposable cases). The likely explanation is that nested questions require solving for an inner fact first, and it is the answer to this fact which often provides the missing information required to find the correct answer: this makes nested questions much harder to solve than parallel decomposable ones, with their multiple independent facts. Our nested decomposition algorithm, using the heuristic re-ranking approach (Section 5.3), achieved a gain of 2% on the nested decomposable question set, which translated to an end-to-end gain of 0.4%.

The aggregate impact of parallel and nested decomposition was a 1.5% gain in accuracy on the decomposable set, and a 1% gain in end-to-end system accuracy (in our case, the questions classified as parallel or nested form disjoint sets).

To put these results in perspective, we emphasize that the baseline QA system represents the state of the art in solving Jeopardy! questions. The FJ questions which exclusively comprise our test data are known to be harder than regular Jeopardy! questions: qualified Jeopardy! players' accuracy on this kind of question is 48% (calculated over historical games data, from J-archive, http://www.j-archive.com), and the underlying QA system has an accuracy close to 51% on previously unseen FJ questions. A gain of 1% end-to-end on such questions therefore represents a strong improvement. Also, using McNemar's test (McNemar, 1947), we found the net end-to-end impact to be statistically significant at the 99% confidence level.

Finally, we note that our error analysis of the test questions shows a wide variety of reasons for failure beyond question decomposition. Further improving the system on this test set would require advances beyond deciding whether to take a single-shot or decomposable approach to questions, which is beyond the scope of this paper.

7 Conclusion

In this paper, we presented a general-purpose decomposition framework for answering complex factoid questions, which consists of three components: 1) a decomposition recognizer, which identifies the subparts of a decomposable question; 2) a question re-writer, which composes new sub-questions from the identified subparts, taking into account context from the original question; and 3) an answer synthesis and re-ranking component, which synthesizes and ranks final answers based on candidate answers to the sub-questions. Additionally, the framework leverages an underlying factoid QA system for producing answers to the sub-questions. Any QA system that can associate confidence scores with its answers, and can distinguish between the question and the context in which the question should be interpreted, can be adopted in this decomposition framework.

We applied our decomposition framework to two broad classes of complex factoid questions: parallel and nested decomposition questions. These are distinguished by how the identified sub-questions relate to each other, which in turn affects how the candidate answers to the sub-questions are combined to form the final answers. In order to maintain generality and facilitate domain adaptation, the rule-based patterns for decomposition recognition leverage syntactic characteristics of the question that are indicative of sub-question boundaries. To optimally leverage these patterns, a machine learning model was trained to properly weigh the possibly overlapping, and occasionally conflicting, patterns.

We demonstrated the impact of our question decomposition approach on a state-of-the-art factoid QA system. On a test set of 1,269 Final Jeopardy! questions, 47% of the questions were found to be parallel decomposable and 20% nested decomposable. Overall, the system achieved a statistically significant gain of 1.5% in accuracy on these questions, further increasing the system's lead over human Jeopardy! players' performance.

Given that factoid (and, often, complex) questions are typically found in several real-world domains (e.g. medical, legal, technical support), we expect our decomposition framework to have broad impact, in both open- and specialized-domain QA.

References

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59-79, Fall.

S. Hartrumpf. 2008. Semantic Decomposition for Question Answering. In Proceedings of the 18th European Conference on Artificial Intelligence, pages 313-317, Patras, Greece, July.

B. Katz, G. Borchardt, and S. Felshin. 2005. Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Sources. In Proceedings of the AAAI Workshop on Inference for Textual Question Answering, pages 35-41, Pittsburgh, PA, July.

F. Lacatusu, A. Hickl, and S. Harabagiu. 2006. The Impact of Question Decomposition on the Quality of Answer Summaries. In Proceedings of the Fifth Language Resources and Evaluation Conference, pages 1147-1152, Genoa, Italy, May.

C.J. Lin and R.R. Liu. 2008. An Analysis of Multi-Focus Questions. In Proceedings of the SIGIR 2008 Workshop on Focused Retrieval, pages 30-36, Singapore, July.

M. McCord. 1989. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Proceedings of the International Symposium on Natural Language and Logic, pages 118-145, Hamburg, Germany, May.

Q. McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153-157.

J. Prager, E. Brown, and J. Chu-Carroll. Special Questions and Techniques. Submitted to IBM Journal of Research and Development, Special Issue on DeepQA.

J. Prager, J. Chu-Carroll, and K. Czuba. 2004. Question Answering by Constraint Satisfaction: QA-by-Dossier with Constraints.
In Proceedings of the 42nd Annual Meeting of the Association for Com- putational Linguistics, pages 574–581, Barcelona, Spain, July. E. Saquete, P. Mart´ınez-Barco, R. Mu˜noz, and J. Vicedo. 2004. Splitting Complex Temporal Questions for Question Answering Systems. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 566–573, Barcelona, Spain, July. R. Soricut and E. Brill. 2004. Automatic Question Answering: Beyond the Factoid. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 57–64, Boston, MA, May. 860 Author Index Aizawa, Akiko, 686 Clark, Stephen, 736 Alfonseca, Enrique, 214 Cook, Paul, 591 Alkuhlani, Sarah, 675 Cooke, Martin, 1 Andersson, Evelina, 44 Costa, Francisco, 266 Andr´es-Ferrer, Jes´us, 152 Daume III, Hal, 204, 747 Baeza-Yates, Ricardo, 706 Delort, Jean-Yves, 214 Baldwin, Timothy, 591 D´etrez, Gr´egoire, 645 Balle, Borja, 409 Dinarelli, Marco, 174 Baroni, Marco, 23 Dinu, Liviu P., 524 Barzilay, Regina, 397 Do, Ngoc-Quynh, 23 Battersby, Stuart, 482 Dodge, Jesse, 747 Baumann, Timo, 514 Dolan, William B., 306 Bell, Peter, 471 Dr¨ager, Markus, 757 Berg, Alex, 747 Dzikovska, Myroslava O., 471 Berg, Tamara, 747 Bernardi, Raffaella, 23 Eckle-Kohler, Judith, 550, 580 Bethard, Steven, 336 Elsner, Micha, 634 Bhowmick, Rishav, 162 Fan, James, 185 Bisazza, Arianna, 439 Farkas, Rich´ard, 55 Blackwood, Graeme, 736 Faruqui, Manaal, 623 Boguraev, Branimir, 851 Federico, Marcello, 439 Bohnet, Bernd, 77 Feng, Vanessa Wei, 315 Bouamor, Houda, 716 Fernandez Monsalve, Irene, 398 Boves, Lou, 561 Fern´andez-Gonz´alez, Daniel, 66 Branco, Ant´onio, 266 Ferschke, Oliver, 777 Braune, Fabienne, 808 Figueroa, Alejandro, 99 Bronner, Amit, 356 Frank, Stefan L., 398 Buß, Okko, 514 Fraser, Alexander, 664, 726 Cahill, Aoife, 664, 767 Gasc´o, Guillem, 152 Callison-Burch, Chris, 130 Georgiev, Georgi, 492 Can, Burcu, 654 Glaser, Andrea, 276 Cap, 
Fabienne, 664 Gliozzo, Alfio, 185 Carreras, Xavier, 409 Gojun, Anita, 726 Casacuberta, Francisco, 152, 245 Goldwater, Sharon, 234 Chambers, Nathanael, 603 Gollub, Tim, 570 Chebotar, Yevgen, 777 G´omez-Rodr´ıguez, Carlos, 66 Cheung, Jackie Chi Kit, 33, 696 Gonz´alez-Rubio, Jes´us, 245 Chowdhury, Md. Faisal Mahbub, 420 Goyal, Amit, 747 Christensen, Janara, 503 Grishman, Ralph, 194 Chrupala, Grzegorz, 613 Gurevych, Iryna, 550, 580, 777 Chu-Carroll, Jennifer, 851 Gweon, Gahgene, 787 Cinkov´a, Silvie, 840 861 Habash, Nizar, 675 Min, Bonan, 194 Han, Xufeng, 747 Mitchell, Margaret, 747 Hanamoto, Atsushi, 430 Mitkov, Ruslan, 706 Hartmann, Silvana, 580 Miyao, Yusuke, 686 Henrich, Verena, 387 Moens, Marie-Francine, 336, 449 Hinrichs, Erhard, 387 Mohit, Behrang, 162 Hirst, Graeme, 315 Monz, Christof, 2, 109, 356 Holub, Martin, 840 Mooney, Raymond, 602 Hoppe, Dennis, 570 Moore, Johanna D., 471 Hovy, Dirk, 185 Mostow, Jack, 377 Huang, Ruihong, 286 Nakov, Preslav, 492 Irvine, Ann, 130 Newman, David, 591 Isard, Amy, 471 Ng, Vincent, 798 Niculae, Vlad, 524 Jagarlamudi, Jagadeesh, 204 Nikoulina, Vassilina, 109 Jain, Mahaveer, 787 Nivre, Joakim, 44 Jang, Hyeju, 377 Jans, Bram, 336 Oflazer, Kemal, 162 Joachims, Thorsten, 224 Ordan, Noam, 255 Ortiz-Mart´ınez, Daniel, 245 Kaisser, Michael, 88 Osenova, Petya, 492 Kalyanpur, Aditya, 851 Klakow, Dietrich, 325 Pado, Sebastian, 623 Klementiev, Alexandre, 12, 130 Pasca, Marius, 503 Koller, Alexander, 757 Patwardhan, Siddharth, 185, 851 Kovachev, Bogomil, 109 Peldszus, Andreas, 514 Kr´ızˇ , Vincent, 840 Penn, Gerald, 33, 696 Kuhn, Jonas, 77, 767 Penstein Ros´e, Carolyn, 787 Kwiatkowski, Tom, 234 Powers, David Martin Ward, 345 Purver, Matthew, 482 Lagos, Nikolaos, 109 Lagoutte, Aurelie, 808 Qu, Zhonghua, 367 Lally, Adam, 851 Quattoni, Ariadna, 409 Lau, Jey Han, 591 Quernheim, Daniel, 808 Lavelli, Alberto, 420 Lembersky, Gennadi, 255 Rahman, Altaf, 798 Liu, Ting, 296 Raj, Bhiksha, 787 Liu, Yang, 367 Ranta, Aarne, 645 Luque, Franco M., 409 
Rello, Luz, 706 Riezler, Stefan, 818 Maletti, Andreas, 808 Riloff, Ellen, 286 Manandhar, Suresh, 654 Rocha, Martha-Alicia, 152 Marchetti-Bowick, Micol, 603 Rosset, Sophie, 174 Martzoukos, Spyros, 2 Matsubayashi, Yuichiroh, 686 Sanchis-Trilles, Germ´an, 152 Matsuzaki, Takuya, 430 Schlangen, David, 514 Matuschek, Michael, 580 Schmid, Helmut, 55 Max, Aur´elien, 716 Schneider, Nathan, 162 McCarthy, Diana, 591 Sch¨utze, Hinrich, 276 McDonough, John, 787 Sennrich, Rico, 539 Mensch, Alyssa, 747 Shan, Chung-chieh, 23 Meyer, Christian M., 580 Shivaswamy, Pannaga, 224 Simov, Kiril, 492 Sipos, Ruben, 224 Smith, Noah A., 162 Sokolov, Artem, 120 Steedman, Mark, 234 Stein, Benno, 570 Stratos, Karl, 747 Strik, Helmer, 561 Strzalkowski, Tomek, 296 Sulea, Octavia-Maria, 524 Tiedemann, J¨org, 141 Titov, Ivan, 12 Tsarfaty, Reut, 44 Tsujii, Jun’ichi, 430 Udupa, Raghavendra, 204 van Cranenburgh, Andreas, 460 van den Bosch, Antal, 561 van Trijp, Remi, 829 Verberne, Suzan, 561 Vigliocco, Gabriella, 398 Vilnat, Anne, 716 Vincze, Veronika, 55 Vodolazova, Tatiana, 387 Volkova, Svitlana, 306 Vuli´c, Ivan, 336, 449 Waeschle, Katharina, 818 Weller, Marion, 664 Welty, Christopher, 185 Wiegand, Michael, 325 Wilson, Theresa, 306 Wintner, Shuly, 255 Wirth, Christian, 580 Wisniewski, Guillaume, 120 Yamaguchi, Kota, 747 Yarowsky, David, 130 Yvon, Francois, 120 Zarrieß, Sina, 767 Zesch, Torsten, 529 Zettlemoyer, Luke, 234 Zhang, Yue, 736 Zhikov, Valentin, 492