IJCNLP 2011
Proceedings of NEWS 2011
2011 Named Entities Workshop
November 12, 2011
Shangri-La Hotel, Chiang Mai, Thailand

We wish to thank our sponsors

Gold Sponsors: www.google.com; www.baidu.com; The Office of Naval Research (ONR); The Asian Office of Aerospace Research and Development (AOARD); Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
Silver Sponsors: Microsoft Corporation
Bronze Sponsors: Chinese and Oriental Languages Information Processing Society (COLIPS)
Supporter: Thailand Convention and Exhibition Bureau (TCEB)

Organizers: Asian Federation of Natural Language Processing (AFNLP); National Electronics and Computer Technology Center (NECTEC), Thailand; Sirindhorn International Institute of Technology (SIIT), Thailand; Rajamangala University of Technology Lanna (RMUTL), Thailand; Chiang Mai University (CMU), Thailand; Maejo University, Thailand

© 2011 Asian Federation of Natural Language Processing

Preface

The workshop series Named Entities Workshop (NEWS) focuses on research on all aspects of named entities, such as identifying and analyzing named entities, and mining, translating and transliterating named entities. The first of the NEWS workshops (NEWS 2009) was held as a part of the ACL-IJCNLP 2009 conference in Singapore, and the second one, NEWS 2010, was held as an ACL 2010 workshop in Uppsala, Sweden. The current edition, NEWS 2011, was held as an IJCNLP 2011 workshop in Chiang Mai, Thailand. The purpose of the NEWS workshop is to bring together researchers across the world interested in identification, analysis, extraction, mining and transformation of named entities in monolingual or multilingual natural language text corpora.
The workshop scope includes many interesting specific research areas pertaining to named entities, such as orthographic and phonetic characteristics, corpus analysis, unsupervised and supervised named entity extraction in monolingual or multilingual corpora, transliteration modeling, and evaluation methodologies, to name a few. For this year's edition, 8 research papers were submitted, each of which was reviewed by at least 3 reviewers from the program committee. 5 papers were chosen for publication, covering the main research areas, from named entity identification and classification to machine transliteration and transliteration mining from comparable corpora and wikis. All accepted research papers are published in the workshop proceedings. Following the tradition of the NEWS workshop series, NEWS 2011 continued the machine transliteration shared task this year as well. The shared task was first introduced in NEWS 2009 and continued in NEWS 2010. In NEWS 2011, leveraging the success of NEWS 2009 and NEWS 2010, we significantly expanded the hand-crafted parallel named entity corpora to include 14 different language pairs from 11 language families, and made them available as the common dataset for the shared task. We published the details of the shared task and the training and development data several months ahead of the conference, which attracted an overwhelming response from the research community. In total, 10 international teams participated from around the globe. The approaches ranged from traditional methods (such as phrasal SMT-based approaches and Conditional Random Fields) to newer approaches (such as Non-Parametric Bayesian Co-segmentation, the Multi-to-Multi Joint Source Channel Model, and Leveraging Transliterations from Multiple Languages), in addition to several teams resorting to model/system combinations for result re-ranking.
A report of the shared task that summarizes all submissions, together with the original whitepaper, is also included in the proceedings and will be presented at the workshop. The participants in the shared task were asked to submit short system papers (4 content pages each) describing their approaches, and each such paper was reviewed by at least three members of the program committee to help improve its quality. All 10 system papers were finally accepted for publication in the workshop proceedings. It is heartening for us to report that the previous years' NEWS datasets are regularly requested throughout the year outside the NEWS shared tasks, for calibration of new approaches by groups that had not previously participated in the shared tasks. We expect this trend to continue, establishing the NEWS parallel names corpora as a standard dataset, and the NEWS metrics as standard measures, for future machine transliteration research. We hope that NEWS 2011 will provide an exciting and productive forum for researchers working in this area. We wish to thank all the researchers for their submissions and their enthusiastic participation in the transliteration shared tasks. We wish to express our gratitude to the CJK Institute, the Institute for Infocomm Research, Microsoft Research India, the Thailand National Electronics and Computer Technology Centre, and the Royal Melbourne Institute of Technology (RMIT)/Sarvnaz Karimi for preparing the data released as a part of the shared tasks. Finally, we thank all the program committee members for reviewing the submissions in spite of the tight schedule.
Workshop Chairs

Haizhou Li, Institute for Infocomm Research, Singapore
A Kumaran, Microsoft Research, India
Min Zhang, Institute for Infocomm Research, Singapore

12 November 2011, Chiang Mai, Thailand

Organizers:

Workshop Co-Chair: Haizhou Li, Institute for Infocomm Research, Singapore
Workshop Co-Chair: A Kumaran, Microsoft Research, India
Workshop Co-Chair: Min Zhang, Institute for Infocomm Research, Singapore

Program Committee:

Kalika Bali, Microsoft Research, India
Rafael Banchs, Institute for Infocomm Research, Singapore
Sivaji Bandyopadhyay, Jadavpur University, India
Pushpak Bhattacharyya, IIT-Bombay, India
Monojit Choudhury, Microsoft Research, India
Marta Ruiz Costa-jussa, UPC, Spain
Gregory Grefenstette, Exalead, France
Guohong Fu, Heilongjiang University, China
Sarvnaz Karimi, NICTA and the University of Melbourne, Australia
Mitesh Khapra, IIT-Bombay, India
Greg Kondrak, University of Alberta, Canada
Olivia Kwong, City University, Hong Kong
Ming Liu, Institute for Infocomm Research, Singapore
Jong-Hoon Oh, NICT, Japan
Yan Qu, Advertising.com, USA
Sudeshna Sarkar, IIT-Kharagpur, India
Keh-Yih Su, Behavior Design Corporation, Taiwan
Raghavendra Udupa, Microsoft Research, India
Vasudeva Varma, IIIT-Hyderabad, India
Haifeng Wang, Baidu.com, China
Chai Wutiwiwatchai, NECTEC, Thailand
Chengqing Zong, Institute of Automation, CAS, China

Table of Contents

Report of NEWS 2011 Machine Transliteration Shared Task
    Min Zhang, Haizhou Li, A Kumaran and Ming Liu ... 1

Whitepaper of NEWS 2011 Shared Task on Machine Transliteration
    Min Zhang, A Kumaran and Haizhou Li ... 14

Integrating Models Derived from non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
    Andrew Finch, Paul Dixon and Eiichiro Sumita ... 23

Simple Discriminative Training for Machine Transliteration
    Canasai Kruengkrai, Thatsanee Charoenporn and Virach Sornlertlamvanich ... 28

English-Korean Named Entity Transliteration Using Statistical Substring-based and Rule-based Approaches
    Yu-Chun Wang and Richard Tzong-Han Tsai ... 32

Leveraging Transliterations from Multiple Languages
    Aditya Bhargava, Bradley Hauer and Grzegorz Kondrak ... 36

Comparative Evaluation of Spanish Segmentation Strategies for Spanish-Chinese Transliteration
    Rafael E. Banchs ... 41

Using Features from a Bilingual Alignment Model in Transliteration Mining
    Takaaki Fukunishi, Andrew Finch, Seiichi Yamamoto and Eiichiro Sumita ... 49

Product Name Identification and Classification in Thai Economic News
    Nattadaporn Lertcheva and Wirote Aroonmanakun ... 58

Mining Multi-word Named Entity Equivalents from Comparable Corpora
    Abhijit Bhole, Goutham Tholpadi and Raghavendra Udupa ... 65

An Unsupervised Alignment Model for Sequence Labeling: Application to Name Transliteration
    Najmeh Mousavi Nejad and Shahram Khadivi ... 73

Forward-backward Machine Transliteration between English and Chinese Based on Combined CRFs
    Ying Qin and GuoHua Chen ... 82

English-to-Chinese Machine Transliteration using Accessor Variety Features of Source Graphemes
    Mike Tian-Jian Jiang, Chan-Hung Kuo and Wen-Lian Hsu ... 86

The Amirkabir Machine Transliteration System for NEWS 2011: Farsi-to-English Task
    Najmeh Mousavi Nejad, Shahram Khadivi and Kaveh Taghipour ... 91

English-Chinese Personal Name Transliteration by Syllable-Based Maximum Matching
    Oi Yee Kwong ... 96

Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model
    Yu Chen, Rui Wang and Yi Zhang ... 101

Named Entity Transliteration Generation Leveraging Statistical Machine Translation Technology
    Pradeep Dasigi and Mona Diab ... 106

Conference Program

November 12, 2011

8:30-10:00   Session 1
8:30-8:40    Opening Remarks by Haizhou Li, A Kumaran, Min Zhang and Ming Liu
8:40-9:00    Integrating Models Derived from non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System (Andrew Finch, Paul Dixon and Eiichiro Sumita)
9:00-9:20    Simple Discriminative Training for Machine Transliteration (Canasai Kruengkrai, Thatsanee Charoenporn and Virach Sornlertlamvanich)
9:20-9:40    English-Korean Named Entity Transliteration Using Statistical Substring-based and Rule-based Approaches (Yu-Chun Wang and Richard Tzong-Han Tsai)
9:40-10:00   Leveraging Transliterations from Multiple Languages (Aditya Bhargava, Bradley Hauer and Grzegorz Kondrak)

10:00-10:30  Morning Break

10:00-12:00  Session 2
10:00-10:30  Comparative Evaluation of Spanish Segmentation Strategies for Spanish-Chinese Transliteration (Rafael E. Banchs)
10:30-11:00  Using Features from a Bilingual Alignment Model in Transliteration Mining (Takaaki Fukunishi, Andrew Finch, Seiichi Yamamoto and Eiichiro Sumita)
11:00-11:30  Product Name Identification and Classification in Thai Economic News (Nattadaporn Lertcheva and Wirote Aroonmanakun)
11:30-12:00  Mining Multi-word Named Entity Equivalents from Comparable Corpora (Abhijit Bhole, Goutham Tholpadi and Raghavendra Udupa)

12:00-14:00  Lunch Break

14:00-15:30  Session 3
14:00-14:30  An Unsupervised Alignment Model for Sequence Labeling: Application to Name Transliteration (Najmeh Mousavi Nejad and Shahram Khadivi)
14:30-14:50  Forward-backward Machine Transliteration between English and Chinese Based on Combined CRFs (Ying Qin and GuoHua Chen)
14:50-15:10  English-to-Chinese Machine Transliteration using Accessor Variety Features of Source Graphemes (Mike Tian-Jian Jiang, Chan-Hung Kuo and Wen-Lian Hsu)
15:10-15:30  The Amirkabir Machine Transliteration System for NEWS 2011: Farsi-to-English Task (Najmeh Mousavi Nejad, Shahram Khadivi and Kaveh Taghipour)

15:30-16:00  Afternoon Break

16:00-17:00  Session 4
16:00-16:20  English-Chinese Personal Name Transliteration by Syllable-Based Maximum Matching (Oi Yee Kwong)
16:20-16:40  Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model (Yu Chen, Rui Wang and Yi Zhang)
16:40-17:00  Named Entity Transliteration Generation Leveraging Statistical Machine Translation Technology (Pradeep Dasigi and Mona Diab)

17:00-17:10  Closing

Report of NEWS 2011 Machine Transliteration Shared Task

Min Zhang†, Haizhou Li†, A Kumaran‡ and Ming Liu†
† Institute for Infocomm Research, A*STAR, Singapore 138632
{mzhang,hli,mliu}@i2r.a-star.edu.sg
‡ Multilingual Systems Research, Microsoft Research India
[email protected]

Abstract

This report documents the Machine Transliteration Shared Task conducted as a part of the Named Entities Workshop (NEWS 2011), an IJCNLP 2011 workshop. The shared task features machine transliteration of proper names from English to 11 languages and from 3 languages to English. In total, 14 tasks are provided. 10 teams from 7 different countries participated in the evaluations. Finally, 73 standard and 4 non-standard runs were submitted, in which diverse transliteration methodologies are explored and reported on the evaluation data. We report the results with 4 performance metrics. We believe that the shared task has successfully achieved its objective by providing a common benchmarking platform for the research community to evaluate the state-of-the-art technologies that benefit future research and development.

1 Introduction

Names play a significant role in many Natural Language Processing (NLP) and Information Retrieval (IR) systems. They are important in Cross Lingual Information Retrieval (CLIR) and Machine Translation (MT), as the system performance has been shown to positively correlate with the correct conversion of names between the languages in several studies (Demner-Fushman and Oard, 2002; Mandl and Womser-Hacker, 2005; Hermjakob et al., 2008; Udupa et al., 2009). The traditional source for name equivalence, bilingual dictionaries, whether handcrafted or statistical, offers only limited support because new names always emerge.

All of the above point to the critical need for robust Machine Transliteration technology and systems. Much research effort has been made to address the transliteration issue in the research community (Knight and Graehl, 1998; Meng et al., 2001; Li et al., 2004; Zelenko and Aone, 2006; Sproat et al., 2006; Sherif and Kondrak, 2007; Hermjakob et al., 2008; Al-Onaizan and Knight, 2002; Goldwasser and Roth, 2008; Goldberg and Elhadad, 2008; Klementiev and Roth, 2006; Oh and Choi, 2002; Virga and Khudanpur, 2003; Wan and Verspoor, 1998; Kang and Choi, 2000; Gao et al., 2004; Li et al., 2009b; Li et al., 2009a). These previous works fall into three categories: grapheme-based, phoneme-based and hybrid methods. The grapheme-based method (Li et al., 2004) treats transliteration as a direct orthographic mapping and only uses orthography-related features, while the phoneme-based method (Knight and Graehl, 1998) makes use of phonetic correspondence to generate the transliteration. The hybrid method refers to the combination of several different models or knowledge sources to support the transliteration generation.

The first machine transliteration shared task (Li et al., 2009b; Li et al., 2009a) was held in NEWS 2009 at ACL-IJCNLP 2009. It was the first time that common benchmarking data in diverse language pairs was provided for the evaluation of state-of-the-art techniques. While the focus of the 2009 shared task was on establishing the quality metrics and on baselining the transliteration quality based on those metrics, the 2010 shared task (Li et al., 2010a; Li et al., 2010b) expanded the scope of the transliteration generation task to about a dozen languages, and explored the quality depending on the direction of transliteration between the languages. NEWS 2011 was a continued effort of NEWS 2010 and NEWS 2009.

The rest of the report is organised as follows. Section 2 outlines the machine transliteration task and the corpora used, and Section 3 discusses the metrics chosen for evaluation, along with the rationale for choosing them. Sections 4 and 5 present the participation in the shared task and the results with their analysis, respectively. Section 6 concludes the report.

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 1–13, Chiang Mai, Thailand, November 12, 2011.

2 Transliteration Shared Task

In this section, we outline the definition and the description of the shared task.

2.1 "Transliteration": A definition

There exist several terms that are used interchangeably in the contemporary research literature for the conversion of names between two languages, such as transliteration, transcription, and sometimes Romanisation, especially if Latin scripts are used for target strings (Halpern, 2007). Our aim is not only at capturing the name conversion process from a source to a target language, but also at its practical utility for downstream applications, such as CLIR and MT. Therefore, we adopted the same definition of transliteration as during the NEWS 2009 workshop (Li et al., 2009a) to narrow down "transliteration" to three specific requirements for the task, as follows: "Transliteration is the conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography), such that the target language name is: (i) phonemically equivalent to the source name, (ii) conforms to the phonology of the target language and (iii) matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language."

In NEWS 2011, we introduce three back-transliteration tasks. We define back-transliteration as a process of restoring transliterated words to their original languages. For example, NEWS 2011 offers tasks to convert western names written in Chinese and Thai into their original English spellings, and romanized Japanese names into their original Kanji writings.

2.2 Shared Task Description

Following the tradition in NEWS 2010, the shared task at NEWS 2011 is specified as the development of machine transliteration systems in one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs have been made available to all registered participants for developing a transliteration system for that specific language pair using any approach that they find appropriate.

At evaluation time, a standard hand-crafted test set consisting of between 500 and 3,000 source names (approximately 5-10% of the training data size) has been released, on which the participants are required to produce a ranked list of transliteration candidates in the target language for each source name. The system output is tested against a reference set (which may include multiple correct transliterations for some source names), and the performance of a system is captured in multiple metrics (defined in Section 3), each designed to capture a specific performance dimension.

For every language pair, each participant is required to submit at least one run (designated as a "standard" run) that uses only the data provided by the NEWS workshop organisers in that language pair, and no other data or linguistic resources. This standard run ensures parity between systems and enables meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants are allowed to submit more "standard" runs, up to 4 in total. If more than one "standard" run is submitted, it is required to name one of them as a "primary" run, which is used to compare results across different systems. In addition, up to 4 "non-standard" runs could be submitted for every language pair using either data beyond that provided by the shared task organisers or linguistic resources in a specific language, or both. This essentially may enable any participant to demonstrate the limits of performance of their system in a given language pair.

The shared task timelines provide adequate time for development, testing (approximately 1 month after the release of the training data) and the final result submission (7 days after the release of the test data).

2.3 Shared Task Corpora

We considered two specific constraints in selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety among the languages in terms of linguistic diversity, orthography and geography. Clearly, the ability to procure and distribute reasonably large (approximately 10K paired names for training and testing together) hand-crafted corpora consisting primarily of paired names is critical for this process. At the end of the planning stage and after discussion with the data providers, we have chosen the set of 14 tasks shown in Table 1 (Li et al., 2004; Kumaran and Kellner, 2007; MSRI, 2009; CJKI, 2010).

Task ID  Name origin  Source script  Target script      Data Owner                       Train  Dev   Test
EnCh     Western      English        Chinese            Institute for Infocomm Research  37K    2.8K  2K
ChEn     Western      Chinese        English            Institute for Infocomm Research  28K    2.7K  2K
EnKo     Western      English        Korean Hangul      CJK Institute                    7K     1K    1K
EnJa     Western      English        Japanese Katakana  CJK Institute                    26K    2K    3K
JnJk     Japanese     English        Japanese Kanji     CJK Institute                    10K    2K    3K
ArEn     Arabic       Arabic         English            CJK Institute                    27K    2.5K  2.5K
EnHi     Mixed        English        Hindi              Microsoft Research India         12K    1K    2K
EnTa     Mixed        English        Tamil              Microsoft Research India         10K    1K    2K
EnKa     Mixed        English        Kannada            Microsoft Research India         10K    1K    2K
EnBa     Mixed        English        Bangla             Microsoft Research India         13K    1K    2K
EnTh     Western      English        Thai               NECTEC                           27K    2K    2K
ThEn     Western      Thai           English            NECTEC                           25K    2K    2K
EnPe     Western      English        Persian            Sarvnaz Karimi/RMIT              10K    2K    1K
EnHe     Western      English        Hebrew             Microsoft Research India         9.5K   1K    2K

Table 1: Source and target languages for the shared task on transliteration.

NEWS 2011 leverages the success of NEWS 2010 by utilizing the training and development data of NEWS 2010 as the training data of NEWS 2011, and the test data of NEWS 2010 as the development data of NEWS 2011. NEWS 2011 provides entirely new test data across all 14 tasks for evaluation. In addition to the 12 tasks inherited from NEWS 2010, NEWS 2011 is augmented with 2 new tasks in two new languages (Persian, Hebrew).

The names given in the training sets for the Chinese, Japanese, Korean, Thai, Persian and Hebrew languages are Western names and their respective transliterations; the Japanese Name (in English) → Japanese Kanji data set consists only of native Japanese names; the Arabic data set consists only of native Arabic names. The Indic data sets (Hindi, Tamil, Kannada, Bangla) consist of a mix of Indian and Western names.

For all of the tasks chosen, we have been able to procure paired names data between the source and the target scripts and were able to make them available to the participants. For some language pairs, such as English-Chinese and English-Thai, there are both transliteration and back-transliteration tasks. Most of the tasks are just one-way transliteration, although the Indian data sets contained a mixture of names of both Indian and Western origins. The language of origin of the names for each task is indicated in the first column of Table 1.

Finally, it should be noted here that the corpora procured and released for NEWS 2011 represent perhaps the most diverse and largest corpora to be used for any common transliteration tasks today.

3 Evaluation Metrics and Rationale

The participants have been asked to submit results of up to four standard and four non-standard runs. One standard run must be named as the primary submission and is used for the performance summary. Each run contains a ranked list of up to 10 candidate transliterations for each source name. The submitted results are compared to the ground truth (reference transliterations) using 4 evaluation metrics capturing different aspects of transliteration performance. As in NEWS 2010, we have dropped the two MAP metrics used in NEWS 2009 because they do not offer additional information over MAP_ref. Since a name may have multiple correct transliterations, all these alternatives are treated equally in the evaluation; that is, any of these alternatives is considered a correct transliteration, and all candidates matching any of the reference transliterations are accepted as correct ones.

The following notation is further assumed:

    N       : total number of names (source words) in the test set
    n_i     : number of reference transliterations for the i-th name in the test set (n_i ≥ 1)
    r_{i,j} : j-th reference transliteration for the i-th name in the test set
    c_{i,k} : k-th candidate transliteration (system output) for the i-th name in the test set (1 ≤ k ≤ 10)
    K_i     : number of candidate transliterations produced by a transliteration system

3.1 Word Accuracy in Top-1 (ACC)

Also known as Word Error Rate, it measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system. ACC = 1 means that all top candidates are correct transliterations, i.e. they match one of the references, and ACC = 0 means that none of the top candidates are correct.

    ACC = (1/N) Σ_{i=1}^{N} { 1 if ∃ r_{i,j} : r_{i,j} = c_{i,1}; 0 otherwise }    (1)

3.2 Fuzziness in Top-1 (Mean F-score)

The mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. The F-score for each source word is a function of Precision and Recall, and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references.

Precision and Recall are calculated based on the length of the Longest Common Subsequence (LCS) between a candidate and a reference:

    LCS(c, r) = (1/2) (|c| + |r| − ED(c, r))    (2)

where ED is the edit distance and |x| is the length of x. For example, the longest common subsequence between "abcd" and "afcde" is "acd" and its length is 3. The best matching reference, that is, the reference for which the edit distance is minimal, is taken for the calculation. If the best matching reference is given by

    r_{i,m} = argmin_j ( ED(c_{i,1}, r_{i,j}) )    (3)

then Recall, Precision and F-score for the i-th word are calculated as

    R_i = LCS(c_{i,1}, r_{i,m}) / |r_{i,m}|    (4)
    P_i = LCS(c_{i,1}, r_{i,m}) / |c_{i,1}|    (5)
    F_i = 2 (R_i × P_i) / (R_i + P_i)    (6)

• The length is computed in distinct Unicode characters.
• No distinction is made between different character types of a language (e.g., vowels vs. consonants vs. combining diaereses, etc.).

3.3 Mean Reciprocal Rank (MRR)

Measures the traditional MRR for any right answer produced by the system, from among the candidates. 1/MRR tells approximately the average rank of the correct transliteration. MRR closer to 1 implies that the correct answer is mostly produced close to the top of the n-best lists.

    RR_i = { min_j (1/j) if ∃ r_{i,j}, c_{i,k} : r_{i,j} = c_{i,k}; 0 otherwise }    (7)
    MRR = (1/N) Σ_{i=1}^{N} RR_i    (8)

3.4 MAP_ref

Measures tightly the precision in the n-best candidates for the i-th source name, for which reference transliterations are available. If all of the references are produced, then the MAP is 1. Let us denote the number of correct candidates for the i-th source word in the k-best list as num(i, k). MAP_ref is then given by

    MAP_ref = (1/N) Σ_{i=1}^{N} (1/n_i) ( Σ_{k=1}^{n_i} num(i, k) )    (9)

4 Participation in Shared Task

10 teams from 7 countries and regions (Canada, Hong Kong/Mainland China, Iran, Germany, USA, Japan, Thailand) submitted their transliteration results.

Two teams participated in all or almost all tasks, while others participated in 1 to 4 tasks. Each language pair has attracted on average around 4 teams. The details are shown in Table 3.

Team ID  Organisation                                       Tasks entered
1        Amirkabir University of Technology                 1
2        NICT                                               14
3        Beijing Foreign Studies University                 2
4        DFKI GmbH                                          2
5        City University of Hong Kong                       2
6        NECTEC                                             10
7        University of Alberta                              3
8        Yuan Ze University and National Taiwan University  1
9        National Tsing Hua University                      2
10       Columbia University                                3

Table 3: Participation of teams in different tasks.

Teams are required to submit at least one standard run for every task they participated in. In total, we received 73 standard and 4 non-standard runs. Table 2 shows the number of standard and non-standard runs submitted for each task. It is clear that the most "popular" task is the transliteration from English to Chinese, attempted by 7 participants. The next most popular is back-transliteration from Chinese to English, attempted by 6 participants. This is somewhat different from NEWS 2010, where the two most popular tasks were English to Hindi and English to Thai transliteration.

Language pair code  EnCh  ChEn  EnTh  ThEn  EnHi  EnTa  EnKa
Standard runs       15    13    4     4     9     4     4
Non-standard runs   0     0     0     0     1     0     0

Language pair code  EnJa  EnKo  JnJk  ArEn  EnBa  EnPe  EnHe
Standard runs       2     2     1     3     3     6     3
Non-standard runs   0     3     0     0     0     0     0

Table 2: Number of runs submitted for each task. The number of participants coincides with the number of standard runs submitted.

5 Task Results and Analysis

5.1 Standard runs

All the results are presented numerically in Tables 4–17, for all evaluation metrics. These are the official evaluation results published for this edition of the transliteration shared task.

The methodologies used in the ten submitted system papers are summarized as follows. Finch et al. (2011) employ a non-Parametric Bayesian method to co-segment bilingual named entities for model training and report very good performance. This system is based on phrase-based statistical machine transliteration (SMT) (Finch and Sumita, 2008), an approach initially developed for machine translation (Koehn et al., 2003), where the SMT system's log-linear model is augmented with a set of features specifically suited to the task of transliteration. In particular, the model utilizes a feature based on a joint source-channel model, and a feature based on a maximum entropy model that predicts target grapheme sequences using the local context of graphemes and grapheme sequences in both source and target languages.

Jiang et al. (2011) extensively explore the use of accessor variety (a similarity measure) of the source graphemes as a feature under the CRF framework for machine transliteration and report promising results. Kruengkrai et al. (2011) study discriminative training based on the Margin Infused Relaxed Algorithm with simple character alignments under the SMT framework for machine transliteration. They report very impressive results. Bhargava et al. (2011) attempt to improve transliteration performance by leveraging transliterations from multiple languages. Dasigi and Diab (2011) adopt the approach of phrase-based statistical machine transliteration (Finch and Sumita, 2008). Chen et al. (2011) extend the joint source-channel model (Li et al., 2004) on the transliteration task into a multi-to-multi joint source-channel model, which allows alignments between substrings of arbitrary lengths in both source and target strings. Qin and Chen (2011) adopt the approach of Conditional Random Fields (CRF) (Lafferty et al., 2001).

Kwong (2011) presents a transliteration system with a syllable-based Backward Maximum Matching method. The system uses the Onset First Principle to syllabify English names and align them with Chinese names. The bilingual lexicon containing aligned segments of various syllable lengths subsequently allows direct transliteration by chunks. Wang and Tsai (2011) adopt the substring-based transliteration approach, which groups the characters of a named entity in both the source and target languages into substrings and then formulates the transliteration as a sequential tagging problem, tagging the substrings in the source language with the substrings in the target language. The CRF algorithm is then used to deal with this tagging problem. They also construct a rule-based transliteration method for comparison. Nejad et al. (2011) report three systems for transliteration: the first system is a maximum entropy model with a newly proposed alignment algorithm. The second system is the Sequitur g2p tool, an open-source grapheme-to-phoneme converter. The third system is Moses, a phrase-based statistical machine translation system. In addition, several new features are introduced to enhance the overall accuracy of the maximum entropy model. Their results show that the combination of the maximum entropy system with the Sequitur g2p tool and Moses leads to a considerable improvement over the individual systems.

5.2 Non-standard runs

For the non-standard runs, we pose no restrictions on the use of data or other linguistic resources. The purpose of non-standard runs is to see how good personal name transliteration can be, for a given language pair. In NEWS 2011, the approaches used in non-standard runs are typical and may be summarised as follows:

• with supplemental transliteration data from other languages of the NEWS 2011 data (Bhargava et al., 2011). Significant performance improvement is reported with this additional knowledge.

• with English phonemic information from the CMU Pronouncing Dictionary v0.7a (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) (Das et al., 2010). However, performance drops considerably when using the English phonemic information.

6 Conclusions and Future Plans

The Machine Transliteration Shared Task in NEWS 2011 shows that the community has a continued interest in this area. This report summarizes the results of the shared task. Again, we are pleased to report a comprehensive calibration and baselining of machine transliteration approaches, as most state-of-the-art machine transliteration techniques are represented in the shared task. In addition to the techniques most popular in NEWS 2010, such as Phrase-Based Machine Transliteration (Koehn et al., 2003), system combination and re-ranking, we are delighted to see that several new techniques have been proposed and explored with promising results reported, including Non-Parametric Bayesian Co-segmentation (Finch et al., 2011), the Multi-to-Multi Joint Source Channel Model (Chen et al., 2011), Leveraging Transliterations from Multiple Languages (Bhargava et al., 2011) and discriminative training based on the Margin Infused Relaxed Algorithm (Kruengkrai et al., 2011). As the standard runs are limited by the use of the corpus, most of the systems are implemented under the direct orthographic mapping (DOM) framework (Li et al., 2004).
While the standard runs allow us 6 to conduct meaningful comparison across differ- References ent algorithms, we recognise that the non-standard Yaser Al-Onaizan and Kevin Knight. 2002. Machine runs open up more opportunities for exploiting a transliteration of names in arabic text. In Proc. variety of additional linguistic corpora. ACL-2002 Workshop: Computational Apporaches to Encouraged by the success of the NEWS work- Semitic Languages, Philadelphia, PA, USA. shop series, we would like to continue this event Aditya Bhargava, Bradley Hauer, and Grzegorz Kon- in the future conference to promote the machine drak. 2011. Leveraging transliterations from multi- transliteration research and development. ple languages. In Proc. Named Entities Workshop at IJCNLP 2011. Acknowledgements Yu Chen, Rui Wang, and Yi Zhang. 2011. Statisti- The organisers of the NEWS 2011 Shared Task cal machine transliteration with multi-to-multi joint would like to thank the Institute for Infocomm source channel model. In Proc. Named Entities Workshop at IJCNLP 2011. Research (Singapore), Microsoft Research In- dia, CJK Institute (Japan), National Electronics CJKI. 2010. CJK Institute. http://www.cjk.org/. and Computer Technology Center (Thailand) and Amitava Das, Tanik Saikh, Tapabrata Mondal, Asif Ek- Sarvnaz Karim / RMIT for providing the corpora bal, and Sivaji Bandyopadhyay. 2010. English to and technical support. Without those, the Shared Indian languages machine transliteration system at Task would not be possible. We thank those par- NEWS 2010. In Proc. Named Entities Workshop at ticipants who identified errors in the data and sent ACL 2010. us the errata. We also want to thank the members Pradeep Dasigi and Mona Diab. 2011. Named entity of programme committee for their invaluable com- transliteration using a statistical machine translation ments that improve the quality of the shared task framework. In Proc. Named Entities Workshop at papers. 
Finally, we wish to thank all the partici- IJCNLP 2011. pants for their active participation that have made D. Demner-Fushman and D. W. Oard. 2002. The ef- this first machine transliteration shared task a com- fect of bilingual term list size on dictionary-based prehensive one. cross-language information retrieval. In Proc. 36-th Hawaii Int’l. Conf. System Sciences, volume 4, page 108.2. Andrew Finch and Eiichiro Sumita. 2008. Phrase- based machine transliteration. In Proc. 3rd Int’l. Joint Conf NLP, volume 1, Hyderabad, India, Jan- uary. Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric bayesian co-segmentation into a statistical machine transliteration system. In Proc. Named Entities Workshop at IJCNLP 2011. Wei Gao, Kam-Fai Wong, and Wai Lam. 2004. Phoneme-based transliteration of foreign names for OOV problem. In Proc. IJCNLP, pages 374–381, Sanya, Hainan, China. Yoav Goldberg and Michael Elhadad. 2008. Identifica- tion of transliterated foreign words in Hebrew script. In Proc. CICLing, volume LNCS 4919, pages 466– 477. Dan Goldwasser and Dan Roth. 2008. Translitera- tion as constrained optimization. In Proc. EMNLP, pages 353–362. Jack Halpern. 2007. The challenges and pitfalls of Arabic romanization and arabization. In Proc. Workshop on Comp. Approaches to Arabic Script- based Lang. 7 Ulf Hermjakob, Kevin Knight, and Hal Daum´e. 2008. Haizhou Li, A Kumaran, Min Zhang, and Vladimir Name translation in statistical machine translation: Pervouchine. 2010a. Report of news 2010 translit- Learning when to transliterate. In Proc. ACL, eration generation shared task. In Proc. Named En- Columbus, OH, USA, June. tities Workshop at ACL 2010. Mike Tian-Jian Jiang, Chan-Hung Kuo, and Wen-Lian Haizhou Li, A Kumaran, Min Zhang, and Vladimir Hsu. 2011. English-to-chinese machine translit- Pervouchine. 2010b. Whitepaper of news 2010 eration using accessor variety features of source shared task on transliteration generation. 
In Proc. graphemes. In Proc. Named Entities Workshop at Named Entities Workshop at ACL 2010. IJCNLP 2011. T. Mandl and C. Womser-Hacker. 2005. The effect of Byung-Ju Kang and Key-Sun Choi. 2000. named entities on effectiveness in cross-language in- English-Korean automatic transliteration/back- formation retrieval evaluation. In Proc. ACM Symp. transliteration system and character alignment. In Applied Comp., pages 1059–1064. Proc. ACL, pages 17–18, Hong Kong. Helen M. Meng, Wai-Kit Lo, Berlin Chen, and Karen Tang. 2001. Generate phonetic cognates to han- Alexandre Klementiev and Dan Roth. 2006. Weakly dle name entities in English-Chinese cross-language supervised named entity transliteration and discov- spoken document retrieval. In Proc. ASRU. ery from multilingual comparable corpora. In Proc. 21st Int’l Conf Computational Linguistics and 44th MSRI. 2009. Microsoft Research India. Annual Meeting of ACL, pages 817–824, Sydney, http://research.microsoft.com/india. Australia, July. Najmeh Mousavi Nejad, Shahram Khadivi, and Kaveh Kevin Knight and Jonathan Graehl. 1998. Machine Taghipour. 2011. The machine transliteration sys- transliteration. Computational Linguistics, 24(4). tem description for news 2011. In Proc. Named En- tities Workshop at IJCNLP 2011. P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. HLT-NAACL. Jong-Hoon Oh and Key-Sun Choi. 2002. An English- Korean transliteration model using pronunciation Canasai Kruengkrai, Thatsanee Charoenporn, and and contextual rules. In Proc. COLING 2002, Virach Sornlertlamvanich. 2011. Simple discrim- Taipei, Taiwan. inative training for machine transliteration. In Proc. Named Entities Workshop at IJCNLP 2011. Ying Qin and GuoHua Chen. 2011. Forward- backward machine transliteration between english A Kumaran and T. Kellner. 2007. A generic frame- and chinese based on combined crfs. In Proc. work for machine transliteration. In Proc. SIGIR, Named Entities Workshop at IJCNLP 2011. 
pages 721–722. Tarek Sherif and Grzegorz Kondrak. 2007. Substring- Oi Yee Kwong. 2011. English-chinese personal name based transliteration. In Proc. 45th Annual Meeting transliteration by syllable-based maximum match- of the ACL, pages 944–951, Prague, Czech Repub- ing. In Proc. Named Entities Workshop at IJCNLP lic, June. 2011. Richard Sproat, Tao Tao, and ChengXiang Zhai. 2006. Named entity transliteration with comparable cor- J. Lafferty, A. McCallum, and F. Pereira. 2001. Con- pora. In Proc. 21st Int’l Conf Computational Lin- ditional random fields: Probabilistic models for seg- guistics and 44th Annual Meeting of ACL, pages 73– menting and labeling sequence data. In Proc. Int’l. 80, Sydney, Australia. Conf. Machine Learning, pages 282–289. Raghavendra Udupa, K. Saravanan, Anton Bakalov, Haizhou Li, Min Zhang, and Jian Su. 2004. A joint and Abhijit Bhole. 2009. “They are out there, if source-channel model for machine transliteration. you know where to look”: Mining transliterations In Proc. 42nd ACL Annual Meeting, pages 159–166, of OOV query terms for cross-language informa- Barcelona, Spain. tion retrieval. In LNCS: Advances in Information Retrieval, volume 5478, pages 437–448. Springer Haizhou Li, A Kumaran, Vladimir Pervouchine, and Berlin / Heidelberg. Min Zhang. 2009a. Report of NEWS 2009 machine transliteration shared task. In Proc. Named Entities Paola Virga and Sanjeev Khudanpur. 2003. Translit- Workshop at ACL 2009. eration of proper names in cross-lingual information retrieval. In Proc. ACL MLNER, Sapporo, Japan. Haizhou Li, A Kumaran, Min Zhang, and Vladimir Pervouchine. 2009b. ACL-IJCNLP 2009 Named Stephen Wan and Cornelia Maria Verspoor. 1998. Au- Entities Workshop — Shared Task on Translitera- tomatic English-Chinese name transliteration for de- tion. In Proc. Named Entities Workshop at ACL velopment of multilingual resources. In Proc. COL- 2009. ING, pages 1352–1356. 8 Yu-Chun Wang and Richard Tzong-Han Tsai. 2011. 
English-korean named entity transliteration us- ing statistical substring-based and rule-based ap- proaches. In Proc. Named Entities Workshop at IJC- NLP 2011. Dmitry Zelenko and Chinatsu Aone. 2006. Discrimi- native methods for transliteration. In Proc. EMNLP, pages 612–617, Sydney, Australia, July. 9 Team ID ACC F -score MRR MAPref Organisation Primary runs 2 0.3485 0.700095 0.462495 0.341924 NICT 6 0.342 0.701729 0.40574 0.331184 NECTEC 7 0.3405 0.691719 0.4203 0.331469 University of Alberta 9 0.3265 0.688231 0.423711 0.318296 National Tsing Hua University 4 0.3195 0.673834 0.396812 0.308382 DFKI GmbH 3 0.308 0.666474 0.337148 0.305857 Beijing Foreign Studies Univer- sity 5 0.3055 0.672302 0.377732 0.296502 City University of Hong Kong Non-primary standard runs 6 0.328 0.695756 0.392008 0.318354 NECTEC 3 0.308 0.666474 0.337148 0.305857 Beijing Foreign Studies Univer- sity 9 0.3035 0.675249 0.383354 0.293095 National Tsing Hua University 7 0.2875 0.661642 0.2875 0.27303 University of Alberta 5 0.2855 0.659605 0.349497 0.276169 City University of Hong Kong 4 0.26 0.638255 0.340081 0.250505 DFKI GmbH 9 0.2025 0.610451 0.282637 0.195431 National Tsing Hua University 9 0 0.124144 0.000063 0 National Tsing Hua University Table 4: Runs submitted for English to Chinese task. 
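Several of the systems summarized above build on the joint source-channel idea (Li et al., 2004), in which a transliteration pair is modelled as a sequence of aligned (source-substring, target-substring) units scored by an n-gram model. The following toy sketch is not any participant's system: the three "aligned" training pairs and the substring lexicon are invented for illustration, and a bigram model with add-one smoothing stands in for the full framework.

```python
from collections import defaultdict
from math import log

# Hypothetical aligned training pairs: each name pair is a sequence of
# (source-substring, target-substring) "transliteration units". Real systems
# learn these alignments from data; these three are invented for illustration.
TRAIN = [
    [("ti", "ti"), ("mo", "mo"), ("thy", "ti")],
    [("to", "to"), ("ny", "ni")],
    [("mo", "mo"), ("ni", "ni"), ("ca", "ka")],
]
BOS = ("<s>", "<s>")  # sentinel context for the first unit

def train_bigrams(pairs):
    """Count bigram transitions over aligned units (the joint model)."""
    bi, ctx = defaultdict(int), defaultdict(int)
    for units in pairs:
        prev = BOS
        for u in units:
            bi[(prev, u)] += 1
            ctx[prev] += 1
            prev = u
    return bi, ctx

def transliterate(source, lexicon, bi, ctx, vocab):
    """Best-scoring segmentation of `source` into known source substrings,
    scored by the smoothed unit-bigram model (dynamic programming)."""
    best = {0: (0.0, "", BOS)}  # prefix length -> (logprob, output, last unit)
    for i in range(1, len(source) + 1):
        for j in range(max(0, i - 4), i):  # source substrings up to length 4
            if j not in best:
                continue
            for tgt in lexicon.get(source[j:i], ()):
                lp0, out, prev = best[j]
                u = (source[j:i], tgt)
                lp = lp0 + log((bi[(prev, u)] + 1) / (ctx[prev] + vocab))
                if i not in best or lp > best[i][0]:
                    best[i] = (lp, out + tgt, u)
    return best.get(len(source), (None, None, None))[1]

bi, ctx = train_bigrams(TRAIN)
lexicon = defaultdict(set)
for units in TRAIN:
    for s, t in units:
        lexicon[s].add(t)
vocab = sum(len(v) for v in lexicon.values())
print(transliterate("tony", lexicon, bi, ctx, vocab))  # → toni
```

The same dynamic programme extends naturally to n-best output (keep k hypotheses per prefix instead of one), which is what the shared task's ranked-list evaluation expects.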
Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
3        0.166814  0.764739  0.201932  0.166703  Beijing Foreign Studies University
5        0.154898  0.765737  0.215209  0.155119  City University of Hong Kong
2        0.144748  0.764534  0.242493  0.144417  NICT
4        0.132833  0.745695  0.210143  0.132723  DFKI GmbH
6        0.131068  0.729656  0.19266   0.131178  NECTEC
9        0.000883  0.014535  0.00248   0.000883  National Tsing Hua University
Non-primary standard runs
5        0.153575  0.756761  0.205823  0.153685  City University of Hong Kong
6        0.121359  0.726054  0.176186  0.121139  NECTEC
6        0.120035  0.713803  0.184312  0.119925  NECTEC
4        0.117387  0.730918  0.176915  0.117277  DFKI GmbH
6        0.113416  0.713676  0.169103  0.113305  NECTEC
3        0.097087  0.692511  0.127462  0.096867  Beijing Foreign Studies University
9        0         0.010269  0.000412  0         National Tsing Hua University

Table 5: Runs submitted for Chinese to English back-transliteration task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
6        0.3545    0.85371   0.450846  0.350021  NECTEC
2        0.338     0.85323   0.443537  0.335972  NICT
Non-primary standard runs
6        0.3545    0.857262  0.457232  0.350625  NECTEC
6        0.354     0.855659  0.456143  0.349931  NECTEC

Table 6: Runs submitted for English to Thai task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.29641   0.845061  0.427258  0.296617  NICT
6        0.28359   0.840587  0.401574  0.282973  NECTEC
Non-primary standard runs
6        0.282564  0.841174  0.400137  0.280754  NECTEC
6        0.280513  0.839531  0.397005  0.278251  NECTEC

Table 7: Runs submitted for Thai to English back-transliteration task.
Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.478     0.879438  0.591206  0.4765    NICT
7        0.471     0.878619  0.571162  0.46975   University of Alberta
6        0.436     0.870378  0.53784   0.435     NECTEC
10       0.387     0.859914  0.51587   0.38675   Columbia University
Non-primary standard runs
7        0.493     0.883611  0.581677  0.492     University of Alberta
7        0.457     0.877803  0.551577  0.45475   University of Alberta
6        0.42      0.866161  0.518392  0.41875   NECTEC
6        0.417     0.867697  0.522927  0.41575   NECTEC
10       0.386     0.859778  0.515204  0.38575   Columbia University
Non-standard runs
7        0.521     0.896287  0.606057  0.5205    University of Alberta

Table 8: Runs submitted for English to Hindi task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.441     0.900489  0.577195  0.44      NICT
6        0.432     0.895693  0.55284   0.4305    NECTEC
Non-primary standard runs
6        0.42      0.890297  0.521162  0.4185    NECTEC
6        0.409     0.890383  0.511919  0.4075    NECTEC

Table 9: Runs submitted for English to Tamil task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.419     0.885498  0.539931  0.41725   NICT
6        0.398     0.877997  0.501557  0.396722  NECTEC
Non-primary standard runs
6        0.378     0.871573  0.469133  0.375861  NECTEC
6        0.371     0.869731  0.46439   0.368333  NECTEC

Table 10: Runs submitted for English to Kannada task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
7        0.434711  0.815425  0.434711  0.434435  University of Alberta
2        0.393939  0.802719  0.535614  0.393939  NICT

Table 11: Runs submitted for English to Japanese Katakana task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
8        0.430213  0.711027  0.430213  0.422824  Yuan Ze University and National Taiwan University
2        0.356322  0.68032   0.461892  0.352627  NICT
Non-standard runs
8        0.331691  0.653147  0.331691  0.325123  Yuan Ze University and National Taiwan University
8        0.331691  0.653147  0.466886  0.331691  Yuan Ze University and National Taiwan University
8        0.215107  0.474405  0.215107  0.208949  Yuan Ze University and National Taiwan University

Table 12: Runs submitted for English to Korean task.
Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.45359   0.640551  0.568179  0.45359   NICT

Table 13: Runs submitted for English to Japanese Kanji back-transliteration task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
10       0.525502  0.928104  0.628327  0.386179  Columbia University
2        0.447063  0.910865  0.550146  0.351398  NICT
Non-primary standard runs
10       0.518547  0.926968  0.61153   0.382576  Columbia University

Table 14: Runs submitted for Arabic to English task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
2        0.478     0.89183   0.596738  0.4765    NICT
6        0.455     0.886901  0.556766  0.453     NECTEC
Non-primary standard runs
6        0.456     0.884593  0.554751  0.4545    NECTEC

Table 15: Runs submitted for English to Bengali (Bangla) task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
1        0.872     0.979153  0.912697  0.869435  Amirkabir University of Technology
6        0.6435    0.942838  0.744343  0.629047  NECTEC
2        0.6145    0.93794   0.741716  0.603994  NICT
10       0.6055    0.933434  0.696681  0.589026  Columbia University
Non-primary standard runs
6        0.642     0.943011  0.747032  0.626604  NECTEC
10       0.6045    0.933263  0.696521  0.588117  Columbia University

Table 16: Runs submitted for English to Persian task.

Team ID  ACC       F-score   MRR       MAP_ref   Organisation
Primary runs
6        0.602     0.931385  0.701797  0.602     NECTEC
2        0.6       0.928666  0.715443  0.6       NICT
Non-primary standard runs
6        0.601     0.929689  0.697298  0.601     NECTEC

Table 17: Runs submitted for English to Hebrew task.

Whitepaper of NEWS 2011 Shared Task on Machine Transliteration*

Min Zhang†, A Kumaran‡, Haizhou Li†
† Institute for Infocomm Research, A*STAR, Singapore 138632
{mzhang,hli}@i2r.a-star.edu.sg
‡ Multilingual Systems Research, Microsoft Research India
[email protected]Abstract use other data than those provided by the NEWS 2011 workshop; such runs would be evaluated and Transliteration is defined as phonetic reported separately. translation of names across languages. Transliteration of Named Entities (NEs) 2 Important Dates is necessary in many applications, such as machine translation, corpus alignment, Research paper submission deadline 6 July 2011 cross-language IR, information extraction Shared task and automatic lexicon acquisition. All Registration opens 1 April 2011 such systems call for high-performance Registration closes 31 May 2011 transliteration, which is the focus of Training/Development data release 20 April 2011 shared task in the NEWS 2011 workshop. Test data release 13 June 2011 The objective of the shared task is to pro- Results Submission Due 20 June 2011 mote machine transliteration research by Results Announcement 30 June 2011 providing a common benchmarking plat- Task (short) Papers Due 6 July 2011 form for the community to evaluate the For all submissions state-of-the-art technologies. Acceptance Notification 6 Aug 2011 Camera-Ready Copy Deadline 19 Aug 2011 Workshop Date 12 Nov 2011 1 Task Description The task is to develop machine transliteration sys- tem in one or more of the specified language pairs 3 Participation being considered for the task. Each language pair 1. Registration (1 April 2011) consists of a source and a target language. The (a) NEWS Shared Task opens for registra- training and development data sets released for tion. each language pair are to be used for developing a transliteration system in whatever way that the (b) Prospective participants are to register to participants find appropriate. At the evaluation the NEWS Workshop homepage. time, a test set of source names only would be 2. Training & Development Data (20 April released, on which the participants are expected 2011) to produce a ranked list of transliteration candi- dates in another language (i.e. 
n-best translitera- (a) Registered participants are to obtain tions), and this will be evaluated using common training and development data from the metrics. For every language pair the participants Shared Task organiser and/or the desig- must submit at least one run that uses only the nated copyright owners of databases. data provided by the NEWS workshop organisers (b) All registered participants are required in a given language pair (designated as “standard” to participate in the evaluation of at least run, primary submission). Users may submit more one language pair, submit the results and “stanrard” runs. They may also submit several a short paper and attend the workshop at “non-standard” runs for each language pair that IJCNLP 2011. ∗ http://translit.i2r.a-star.edu.sg/news2011/ 3. Evaluation Script (20 April 2011) 14 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 14–22, Chiang Mai, Thailand, November 12, 2011. (a) A sample test set and expected user out- are transliterations of the given source put format are to be released. name. (b) An evaluation script, which runs on the 5. Results (30 June 2011) above two, is to be released. (c) The participants must make sure that (a) On 30 June 2011, the evaluation results their output is produced in a way that would be announced and will be made the evaluation script may run and pro- available on the Workshop website. duce the expected output. (b) Note that only the scores (in respective (d) The same script (with held out test data metrics) of the participating systems on and the user outputs) would be used for each language pairs would be published, final evaluation. and no explicit ranking of the participat- ing systems would be published. 4. 
Test data (13 June 2011) (c) Note that this is a shared evaluation task (a) The test data would be released on 13 and not a competition; the results are June 2011, and the participants have a meant to be used to evaluate systems on maximum of 7 days to submit their re- common data set with common metrics, sults in the expected format. and not to rank the participating sys- (b) One “standard” run must be submit- tems. While the participants can cite the ted from every group on a given lan- performance of their systems (scores on guage pair. Additional “standard” runs metrics) from the workshop report, they may be submitted, up to 4 “standard” should not use any ranking information runs in total. However, the partici- in their publications. pants must indicate one of the submit- (d) Furthermore, all participants should ted “standard” runs as the “primary sub- agree not to reveal identities of other mission”. The primary submission will participants in any of their publications be used for the performance summary. unless you get permission from the other In addition to the “standard” runs, more respective participants. By default, all “non-standard” runs may be submitted. participants remain anonymous in pub- In total, maximum 8 runs (up to 4 “stan- lished results, unless they indicate oth- dard” runs plus up to 4 “non-standard” erwise at the time of uploading their re- runs) can be submitted from each group sults. Note that the results of all systems on a registered language pair. The defi- will be published, but the identities of nition of “standard” and “non-standard” those participants that choose not to dis- runs is in Section 5. close their identity to other participants (c) Any runs that are “non-standard” must will be masked. As a result, in this case, be tagged as such. your organisation name will still appear (d) The test set is a list of names in source in the web site as one of participants, but language only. 
Every group will pro- it will not be linked explicitly to your re- duce and submit a ranked list of translit- sults. eration candidates in another language 6. Short Papers on Task (6 July 2011) for each given name in the test set. Please note that this shared task is a (a) Each submitting site is required to sub- “transliteration generation” task, i.e., mit a 4-page system paper (short paper) given a name in a source language one for its submissions, including their ap- is supposed to generate one or more proach, data used and the results on ei- transliterations in a target language. It ther test set or development set or by n- is not the task of “transliteration discov- fold cross validation on training set. ery”, i.e., given a name in the source lan- (b) The review of the system papers will be guage and a set of names in the target done to improve paper quality and read- language evaluate how to find the ap- ability and make sure the authors’ ideas propriate names from the target set that and methods can be understood by the 15 workshop participants. We are aiming English → Kannada at accepting all system papers, and se- Tokyo → ಟೋಕ್ಯೋ lected ones will be presented orally in Arabic → Arabic name in English the NEWS 2011 workshop. $#"! → Khalid (c) All registered participants are required to register and attend the workshop to introduce your work. 5 Standard Databases (d) All paper submission and review will be Training Data (Parallel) managed electronically through https:// Paired names between source and target lan- www.softconf.com/ijcnlp2011/NEWS. guages; size 5K – 32K. Training Data is used for training a basic 4 Language Pairs transliteration system. The tasks are to transliterate personal names or Development Data (Parallel) place names from a source to a target language as Paired names between source and target lan- summarised in Table 1. NEWS 2011 Shared Task guages; size 2K – 6K. 
offers 14 evaluation subtasks, among them ChEn Development Data is in addition to the Train- and ThEn are the back-transliteration of EnCh and ing data, which is used for system fine-tuning EnTh tasks respectively. NEWS 2011 releases of parameters in case of need. Participants training, development and testing data for each of are allowed to use it as part of training data. the language pairs. NEWS 2011 continues some language pairs that were evaluated in NEWS 2010. Testing Data In such cases, the training and development data in Source names only; size 2K – 3K. the release of NEWS 2011 may overlap with those This is a held-out set, which would be used in NEWS 2010. However, the test data in NEWS for evaluating the quality of the translitera- 2011 are entirely new. tions. The names given in the training sets for Chi- 1. Participants will need to obtain licenses from nese, Japanese, Korean, Thai and Persian lan- the respective copyright owners and/or agree guages are Western names and their respective to the terms and conditions of use that are transliterations; the Japanese Name (in English) given on the downloading website (Li et al., → Japanese Kanji data set consists only of native 2004; MSRI, 2010; CJKI, 2010). NEWS Japanese names; the Arabic data set consists only 2011 will provide the contact details of each of native Arabic names. The Indic data set (Hindi, individual database. The data would be pro- Tamil, Kannada, Bangla) consists of a mix of In- vided in Unicode UTF-8 encoding, in XML dian and Western names. format; the results are expected to be sub- Examples of transliteration: mitted in UTF-8 encoding in XML format. The XML formats details are available in Ap- English → Chinese Timothy → « pendix A. 2. The data are provided in 3 sets as described English → Japanese Katakana Harrington → ÏêóÈó above. 3. Name pairs are distributed as-is, as provided English → Korean Hangul by the respective creators. 
Bennett → 베넷 (a) While the databases are mostly man- Japanese name in English → Japanese Kanji ually checked, there may be still in- Akihiro → Ë consistency (that is, non-standard usage, region-specific usage, errors, etc.) or in- English → Hindi completeness (that is, not all right varia- San Francisco → सैन फ्रान्सिस्को tions may be covered). English → Tamil (b) The participants may use any method to London → லண்டன் further clean up the data provided. 16 Data Size Name origin Source script Target script Data Owner Task ID Train Dev Test Western English Chinese Institute for Infocomm Research 37K 2.8K 2K EnCh Western Chinese English Institute for Infocomm Research 28K 2.7K 2.2K ChEn Western English Korean Hangul CJK Institute 7K 1K 609 EnKo Western English Japanese Katakana CJK Institute 26K 2K 1.8K EnJa Japanese English Japanese Kanji CJK Institute 10K 2K 571 JnJk Arabic Arabic English CJK Institute 27K 2.5K 2.6K ArEn Mixed English Hindi Microsoft Research India 12K 1K 1K EnHi Mixed English Tamil Microsoft Research India 10K 1K 1K EnTa Mixed English Kannada Microsoft Research India 10K 1K 1K EnKa Mixed English Bangla Microsoft Research India 13K 1K 1K EnBa Western English Thai NECTEC 27K 2K 2K EnTh Western Thai English NECTEC 25K 2K 1.9K ThEn Western English Persian Sarvnaz Karimi / RMIT 10K 2K 2K EnPe Western English Hebrew Microsoft Research India 9.5K 1K 1K EnHe Table 1: Source and target languages for the shared task on transliteration. i. If they are cleaned up manually, we 6 Paper Format appeal that such data be provided back to the organisers for redistri- bution to all the participating groups Paper submissions to NEWS 2011 should follow in that language pair; such sharing the IJCNLP 2011 paper submission policy, includ- benefits all participants, and further ing paper format, blind review policy and title and ensures that the evaluation provides author format convention. 
Full papers (research normalisation with respect to data paper) are in two-column format without exceed- quality. ing eight (8) pages of content plus two (2) extra ii. If automatic cleanup were used, page for references and short papers (task paper) such cleanup would be considered a are also in two-column format without exceeding part of the system fielded, and hence four (4) pages content plus two (2) extra page for not required to be shared with all references. Submission must conform to the of- participants. ficial IJCNLP 2011 style guidelines. For details, please refer to the IJCNLP 2011 website2 . 4. Standard Runs We expect that the partici- pants to use only the data (parallel names) provided by the Shared Task for translitera- 7 Evaluation Metrics tion task for a “standard” run to ensure a fair evaluation. One such run (using only the data provided by the shared task) is mandatory for We plan to measure the quality of the translitera- all participants for a given language pair that tion task using the following 4 metrics. We accept they participate in. up to 10 output candidates in a ranked list for each input entry. 5. Non-standard Runs If more data (either par- allel names data or monolingual data) were Since a given source name may have multiple used, then all such runs using extra data must correct target transliterations, all these alternatives be marked as “non-standard”. For such “non- are treated equally in the evaluation. That is, any standard” runs, it is required to disclose the of these alternatives are considered as a correct size and characteristics of the data used in the transliteration, and the first correct transliteration system paper. in the ranked list is accepted as a correct hit. 6. A participant may submit a maximum of 8 The following notation is further assumed: runs for a given language pair (including the mandatory 1 “standard” run marked as “pri- mary submission”). 
N : total number of names (source words) in the test set
n_i : number of reference transliterations for the i-th name in the test set (n_i >= 1)
r_{i,j} : j-th reference transliteration for the i-th name in the test set
c_{i,k} : k-th candidate transliteration (system output) for the i-th name in the test set (1 <= k <= 10)
K_i : number of candidate transliterations produced by a transliteration system

1. Word Accuracy in Top-1 (ACC). Also known as Word Error Rate, it measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system. ACC = 1 means that all top candidates are correct transliterations, i.e. they match one of the references, and ACC = 0 means that none of the top candidates are correct.

    ACC = (1/N) * sum_{i=1..N} [ 1 if there exists r_{i,j} such that r_{i,j} = c_{i,1}; 0 otherwise ]    (1)

2. Fuzziness in Top-1 (Mean F-score). The mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. The F-score for each source word is a function of Precision and Recall, and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references.

Precision and Recall are calculated based on the length of the Longest Common Subsequence between a candidate and a reference:

    LCS(c, r) = (|c| + |r| - ED(c, r)) / 2    (2)

where ED is the edit distance and |x| is the length of x. For example, the longest common subsequence between "abcd" and "afcde" is "acd" and its length is 3. The best matching reference, that is, the reference for which the edit distance is minimal, is taken for calculation. If the best matching reference is given by

    r_{i,m} = arg min_j ED(c_{i,1}, r_{i,j})    (3)

then Recall, Precision and F-score for the i-th word are calculated as

    R_i = LCS(c_{i,1}, r_{i,m}) / |r_{i,m}|    (4)
    P_i = LCS(c_{i,1}, r_{i,m}) / |c_{i,1}|    (5)
    F_i = 2 * R_i * P_i / (R_i + P_i)    (6)

• The length is computed in distinct Unicode characters.
• No distinction is made between different character types of a language (e.g., vowels vs. consonants vs. combining diereses, etc.).

3. Mean Reciprocal Rank (MRR). Measures the traditional MRR for any right answer produced by the system, from among the candidates. 1/MRR tells approximately the average rank of the correct transliteration. MRR closer to 1 implies that the correct answer is mostly produced close to the top of the n-best lists.

    RR_i = 1 / min{ k : there exists r_{i,j} such that r_{i,j} = c_{i,k} } if such a k exists; 0 otherwise    (7)
    MRR = (1/N) * sum_{i=1..N} RR_i    (8)

4. MAP_ref. Tightly measures the precision in the n-best candidates for the i-th source name, for which reference transliterations are available. If all of the references are produced, then the MAP is 1. Let num(i, k) denote the number of correct candidates for the i-th source word in the k-best list. MAP_ref is then given by

    MAP_ref = (1/N) * sum_{i=1..N} (1/n_i) * sum_{k=1..n_i} num(i, k)    (9)

8 Contact Us

If you have any questions about this shared task and the database, please email:

Mr. Ming Liu
Institute for Infocomm Research (I2R), A*STAR
1 Fusionopolis Way
#08-05 South Tower, Connexis
Singapore 138632
[email protected]18 Dr. Min Zhang References Institute for Infocomm Research (I2 R), [CJKI2010] CJKI. 2010. CJK Institute. A*STAR http://www.cjk.org/. 1 Fusionopolis Way [Li et al.2004] Haizhou Li, Min Zhang, and Jian Su. #08-05 South Tower, Connexis 2004. A joint source-channel model for machine Singapore 138632 transliteration. In Proc. 42nd ACL Annual Meeting,
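The four metrics of Section 7 can be made concrete with a short, unofficial Python sketch (function and variable names are ours, not part of the shared-task tooling). It computes the LCS directly by dynamic programming rather than via Equation (2), and approximates "closest reference" by maximum LCS overlap rather than minimum edit distance:

```python
def lcs_len(c, r):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cc in enumerate(c, 1):
        for j, rc in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cc == rc else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(c)][len(r)]

def f_score(cand, refs):
    """Fuzziness in Top-1: F-score of the top candidate vs. its closest reference
    (closest approximated here by maximum LCS overlap)."""
    best = max(refs, key=lambda r: lcs_len(cand, r))
    l = lcs_len(cand, best)
    if l == 0:
        return 0.0
    recall, precision = l / len(best), l / len(cand)
    return 2 * recall * precision / (recall + precision)

def evaluate(candidates, references):
    """candidates: one ranked n-best list per test name; references: one reference list per name."""
    N = len(references)
    acc = mrr = mf = map_ref = 0.0
    for cands, refs in zip(candidates, references):
        acc += 1.0 if cands and cands[0] in refs else 0.0          # Eq. (1)
        mf += f_score(cands[0], refs) if cands else 0.0            # Eqs. (4)-(6)
        rank = next((k for k, c in enumerate(cands, 1) if c in refs), 0)
        mrr += 1.0 / rank if rank else 0.0                         # Eqs. (7)-(8)
        n = len(refs)                                              # Eq. (9)
        map_ref += sum(len([c for c in cands[:k] if c in refs]) for k in range(1, n + 1)) / n
    return {"ACC": acc / N, "F": mf / N, "MRR": mrr / N, "MAP_ref": map_ref / N}
```

For a perfectly transliterated name, all four scores are 1; for a top candidate that only partially overlaps its reference, ACC is 0 while the mean F-score remains positive.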
A Training/Development Data

• File Naming Conventions:
    NEWS11_train_XXYY_nnnn.xml
    NEWS11_dev_XXYY_nnnn.xml
    NEWS11_test_XXYY_nnnn.xml
  – XX: source language
  – YY: target language
  – nnnn: size of the parallel/monolingual names ("25K", "10000", etc.)

• File Formats: All data will be made available in XML format (Figure 1).

• Data Encoding Formats: The data will be in UTF-8 encoded files without a byte-order mark, in the XML format specified.

B Submission of Results

• File Naming Conventions: You may give your files any name you like. During online submission you will be asked to indicate whether the submission belongs to a "standard" or "non-standard" run and, if it is a "standard" run, whether it is the primary submission.

• File Formats: All results must be submitted in XML format (Figure 2).

• Data Encoding Formats: The results are expected to be submitted in UTF-8 encoded files without a byte-order mark only, in the XML format specified.

<?xml version="1.0" encoding="UTF-8"?>
<TransliterationCorpus
    CorpusID = "NEWS2011-Train-EnHi-25K"
    SourceLang = "English"
    TargetLang = "Hindi"
    CorpusType = "Train|Dev"
    CorpusSize = "25000"
    CorpusFormat = "UTF8">
  <Name ID="1">
    <SourceName>eeeeee1</SourceName>
    <TargetName ID="1">hhhhhh1_1</TargetName>
    <TargetName ID="2">hhhhhh1_2</TargetName>
    ...
    <TargetName ID="n">hhhhhh1_n</TargetName>
  </Name>
  <Name ID="2">
    <SourceName>eeeeee2</SourceName>
    <TargetName ID="1">hhhhhh2_1</TargetName>
    <TargetName ID="2">hhhhhh2_2</TargetName>
    ...
    <TargetName ID="m">hhhhhh2_m</TargetName>
  </Name>
  ...
  <!-- rest of the names to follow -->
  ...
</TransliterationCorpus>

Figure 1: Example file: NEWS2011_Train_EnHi_25K.xml

<?xml version="1.0" encoding="UTF-8"?>
<TransliterationTaskResults
    SourceLang = "English"
    TargetLang = "Hindi"
    GroupID = "Trans University"
    RunID = "1"
    RunType = "Standard"
    Comments = "HMM Run with params: alpha=0.8 beta=1.25">
  <Name ID="1">
    <SourceName>eeeeee1</SourceName>
    <TargetName ID="1">hhhhhh11</TargetName>
    <TargetName ID="2">hhhhhh12</TargetName>
    <TargetName ID="3">hhhhhh13</TargetName>
    ...
    <TargetName ID="10">hhhhhh110</TargetName>
    <!-- Participants to provide their top 10 candidate transliterations -->
  </Name>
  <Name ID="2">
    <SourceName>eeeeee2</SourceName>
    <TargetName ID="1">hhhhhh21</TargetName>
    <TargetName ID="2">hhhhhh22</TargetName>
    <TargetName ID="3">hhhhhh23</TargetName>
    ...
    <TargetName ID="10">hhhhhh210</TargetName>
    <!-- Participants to provide their top 10 candidate transliterations -->
  </Name>
  ...
  <!-- All names in test corpus to follow -->
  ...
</TransliterationTaskResults>

Figure 2: Example file: NEWS2011_EnHi_TUniv_01_StdRunHMMBased.xml

Integrating Models Derived from non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System

Andrew Finch, Paul Dixon, Eiichiro Sumita
NICT, 3-5 Hikaridai, Keihanna Science City, 619-0289 Japan
[email protected] [email protected] [email protected]

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 23–27, Chiang Mai, Thailand, November 12, 2011.

Abstract

The system presented in this paper is based upon a phrase-based statistical machine transliteration (SMT) framework. The SMT system's log-linear model is augmented with a set of features specifically suited to the task of transliteration. In particular, our model utilizes a feature based on a joint source-channel model, and a feature based on a maximum entropy model that predicts target grapheme sequences using the local context of graphemes and grapheme sequences in both source and target languages. The segmentation for our approach was performed using a non-parametric Bayesian co-segmentation model, and in this paper we present experiments comparing the effectiveness of this segmentation relative to the publicly available state-of-the-art m2m alignment tool. In all our experiments we took a strictly language-independent approach: each of the language pairs was processed automatically with no special treatment.

1 Introduction

In the NEWS 2010 workshop, (Finch and Sumita, 2010b) reported that the performance of a phrase-based statistical machine transliteration system (Finch and Sumita, 2008; Rama and Gali, 2009) could be improved significantly by combining it with a model based on the n-gram context of source-target grapheme sequence pairs: a joint source-channel model similar to that of (Li et al., 2004). Their system integrated the two approaches by using a re-scoring step at the end of the decoding process. Our system goes one step further and integrates a joint source-channel model directly into the SMT decoder, allowing its probabilities to be taken into account within a single search process, in a similar manner to (Banchs et al., 2005).

2 System Description

2.1 Bayesian Co-segmentation

The typical method of deriving a translation model for machine translation is to use GIZA++ (Och and Ney, 2003) to perform word alignment, followed by a set of heuristics for phrase-pair extraction; a commonly used set of heuristics is known as grow-diag-final-and. This type of approach was taken by (Finch and Sumita, 2010b; Rama and Gali, 2009) to train their models.

An alternative approach is to use a non-parametric Bayesian technique to co-segment both source and target in a single step (Finch and Sumita, 2010a; Huang et al., 2011). This approach has the advantage of being symmetric with respect to the source and target languages; furthermore, Bayesian techniques tend to give rise to models with few parameters that do not overfit the data in the way traditional maximum likelihood training does. In experiments on an English-Japanese transliteration task, (Finch and Sumita, 2010a) showed that a Bayesian approach offered higher performance than using GIZA++ together with heuristic phrase-pair extraction. Their approach unfortunately required a simple set of agglomeration heuristics in order to get good performance from the system. Similarly, (Huang et al., 2011) show that their Bayesian system is able to outperform a baseline based on EM alignment, by removing the need to align to a single grapheme in one language to avoid over-fitting.

In our approach, we adopt the same Bayesian co-segmentation (bilingual alignment) framework as (Finch and Sumita, 2010a), and replace the agglomeration heuristics by incorporating a joint source-channel model directly into the decoder as an additional feature. Our motivation is simply that the phrase-based translation model lacks contextual information; in the experiments of (Finch and Sumita, 2010a), the model gained this contextual information implicitly through the use of agglomerated phrases. In other words,
the longer phrases carried their own built-in context with them. In our model these contextual dependencies are made explicit and modeled directly by the joint source-channel model.

The termination condition for our Bayesian co-segmentation algorithm was set based on pilot experiments, which showed very little gain in system performance after iteration 10, and no loss in performance from continuing the training. We arbitrarily chose iteration 30 as the final iteration in all our experiments.

2.2 Phrase-based SMT Models

The decoding was performed using a specially modified version of the CLEOPATRA decoder (Finch et al., 2007), an in-house multi-stack phrase-based decoder that operates on the same principles as the MOSES decoder (Koehn et al., 2007). The system we used in this shared task is a log-linear combination of 5 different models; the following sections describe each of these models in detail. Due to the small size of many of the data sets in the shared tasks, we used all of the data to build the models for the final systems.

2.2.1 Joint source-channel model

The joint source-channel model was trained from the Viterbi co-segmentation arising from the final iteration of the Bayesian segmentation process, run on the training data (for the model used in parameter tuning) and on the training data added to the development data (for the model used to decode the test data). We used the MIT language modeling toolkit (Hsu and Glass, 2008) with modified Kneser-Ney smoothing to build this model. In all experiments we used a language model of order 5.

2.2.2 Target language model

The target language model was trained from the target side of the training data (for the model used in parameter tuning), and from the training data added to the development data (for the model used to decode the test data). We used the MIT language modeling toolkit with Kneser-Ney smoothing to build this model. In all experiments we used a language model of order 5.

2.2.3 Insertion penalty models

Both the grapheme-based and grapheme-sequence-based insertion penalty models are simple models that add a constant value to their score each time a grapheme (or grapheme sequence) is added to the target hypothesis. These models control the tendency of both the joint source-channel model and the target language model to generate derivations that are too short.

2.2.4 Maximum-entropy model

In a typical phrase-based SMT system, the translation model contains a context-independent probability of the target grapheme sequence (phrase) given the source. Our system replaces this with a more sophisticated maximum entropy model that takes the local context of source and target graphemes and grapheme sequences into account. The features can be partitioned into two classes: grapheme-based features and grapheme-sequence-based features. In both cases we use a context of 2 to the left and right for the source, and 2 to the left for the target. Sequence begin and end markers are added to both source and target and are used in the context. The features used in the ME model consist of all possible bigrams of contiguous elements in the context. We do not mix features at the grapheme level and the grapheme sequence level; so, for example, a grapheme sequence bigram can only consist of grapheme sequences (including sequences of length 1).

2.3 Parameter Tuning

The exponential log-linear model weights of our system are set by tuning the system on development data using the MERT procedure (Och, 2003), by means of the publicly available ZMERT toolkit (Zaidan, 2009) (http://www.cs.jhu.edu/~ozaidan/zmert/). The systems reported in this paper used a metric based on the word-level F-score, an official evaluation metric for the shared tasks, which measures the relationship of the longest common subsequence of the transliteration pair to the lengths of both the source and target sequences.

2.4 Official Results

The official scores for our system are given in Table 1. Some of the data tracks benefit from language-dependent treatment (for example, in Korean it is advantageous to decompose the characters), and in these tracks our language-independent approach was not competitive. Our system typically gave a strong relative performance on those tracks with larger amounts of training data.

Table 1: The evaluation results on the 2011 shared task for our system, in terms of the official F-score and Top-1 accuracy metrics.

             En-Ch   Ch-En   En-Th   Th-En   En-Hi   En-Ta   En-Ka
    Acc.     0.348   0.145   0.338   0.296   0.478   0.441   0.419
    F-score  0.700   0.765   0.853   0.854   0.879   0.900   0.885

             En-Ja   En-Ko   Jn-Jk   Ar-En   En-Ba   En-Pe   En-He
    Acc.     0.394   0.356   0.454   0.447   0.478   0.615   0.600
    F-score  0.803   0.680   0.641   0.911   0.892   0.938   0.929

3 Segmentation Experiments

A novel feature of our system is the Bayesian co-segmentation approach used to bilingually segment the data, yielding the training data from which the models in our system are trained. It has been shown (Finch and Sumita, 2010a) that in transliteration this Bayesian approach can give rise to a smaller and more useful phrase table than one derived using GIZA++ for alignment together with the grow-diag-final-and heuristics, which have been shown to be effective for transliteration (Rama and Gali, 2009). In these experiments we compare the Bayesian segmenter to a similar state-of-the-art segmentation tool that is capable of many-to-many alignments: the publicly available m2m alignment tool (http://code.google.com/p/m2m-aligner/) (Jiampojamarn et al., 2007), which is trained using the EM algorithm and is based on the principles set out in (Ristad and Yianilos, 1998).

We used a system similar to that entered in the shared task, but without the maximum entropy model. The experiments were run in the same way using the same script, the only difference being the choice of aligner. We used data from the 2009 NEWS workshop for our experiments, and evaluated using the F-score metric used for the shared task evaluation. The aligners were run with their default settings, and with the same limits on source and target segment size. It may have been possible to obtain better performance from the aligners by adjusting specific parameters, but no attempt was made to do this. The results are shown in Table 2.

In all experiments the Bayesian segmenter gave the best performance, and the largest improvements were on language pairs with large grapheme set sizes on the target side. The grapheme set size is shown in the 'Target Types' column of Table 2. The source grapheme set sizes were very similar and small (around 27) for all experiments, as the source language was either English or, in the case of Jn-Jk, a romanized form of Japanese. Looking at the n-gram statistics in Table 2, for languages with large grapheme sets the number of unigrams in the Bayesian model is less than half that used by the m2m model. Learning a compact model is one of the signature characteristics of the Bayesian model we use: adding a new parameter to the model is extremely costly, and the algorithm will therefore strongly prefer to learn a model in which the parameters are re-used.

Initially we considered the hypothesis that the difference in performance between the two approaches came from differences in the sparseness of the language models. Surprisingly, however, the numbers of bigrams and trigrams in the joint language models are quite similar. Another explanation is that the smaller number of unigrams indicates that the segmentation is more self-consistent and therefore makes the generation task less ambiguous. This is supported by the development set perplexity. On the Jn-Jk task, where the differences between the systems are largest, we found that a joint language model trained on the Bayesian segmentation had 1-, 2-, and 3-gram perplexities of 218.3, 88.4 and 87.5 respectively, whereas the corresponding m2m model's perplexities were 321.8, 120.5 and 119.3. The number of segments used to segment the corpus was the same for both systems in this experiment.

Table 2: System performance in terms of F-score using the alternative segmentation schemes, together with statistics on the number of parameters in the models derived from the segmentations.

    Language  Target  m2m      Bayesian  |  m2m                        |  Bayesian
    Pair      Types   F-score  F-score   |  1-grams  2-grams  3-grams  |  1-grams  2-grams  3-grams
    En-Ch     372     0.858    0.880     |  9379     44003    75513    |  4706     38647    72905
    En-Hi     84      0.874    0.884     |  3114     15209    30195    |  1867     20218    34657
    En-Ko     687     0.623    0.651     |  4337     11891    14112    |  2968     11233    14729
    En-Ru     66      0.919    0.922     |  1638     6351     14869    |  1105     12607    23250
    En-Ta     64      0.885    0.892     |  2852     14696    27869    |  1561     17195    30244
    Jn-Jk     1514    0.669    0.767     |  7942     27286    38365    |  3532     22717    37560

Table 3 gives an example from the data of the differences in segmentation consistency. The Bayesian segmentation is strongly self-consistent: the source sequence 'ara' has been segmented identically, as a single unit, in all cases. The m2m system also shows self-consistency, but uses a few different strategies to segment the start of the sequence. Interestingly, the Bayesian method in this example has segmented according to the correct linguistic readings of the kanji. We investigate this further in the next section.

Table 3: Example segmentations from the m2m segmenter and the Bayesian segmenter, taken from a long contiguous section of the training set where the two techniques disagree on the segmentation.

    m2m                     Bayesian
    arad→荒 a→田            ara→荒 da→田
    ar→新 ae→江             ara→新 e→江
    ar→荒 ahori→堀          ara→荒 hori→堀
    ar→新 ai→井             ara→新 i→井
    ar→新 ai→居             ara→新 i→居
    ar→荒 ai→井             ara→荒 i→井
    ar→荒 ai→居             ara→荒 i→居
    araj→荒 ima→島          ara→荒 jima→島
    arak→新 i→木            ara→新 ki→木
    arak→荒 i→木            ara→荒 ki→木
    ar→荒 akid→木 a→田      ara→荒 ki→木 da→田
    ar→荒 ao→尾             ara→荒 o→尾
    ar→荒 ao→生             ara→荒 o→生
    ar→荒 aoka→岡           ara→荒 oka→岡
    arasa→荒 wa→沢          ara→荒 sawa→沢
    ar→荒 aseki→関          ara→荒 seki→関

3.1 Linguistic Agreement

In this experiment, we attempt to assess the ability of each segmentation scheme to discover the underlying linguistic segmentation of the data. We took a random sample of 100 word pairs from the Japanese romaji to Japanese kanji training corpus. The segmentation of this sample under both systems was then labeled as either 'correct' or 'incorrect' by a human judge, using a Japanese name-reading dictionary as a reference. We found that the Bayesian segmentation agreed with the human segmentation in 96% of the test cases, whereas the m2m system agreed in only 42% of cases.

4 Conclusion

The system entered in this year's shared task is built within a statistical machine translation framework, augmented with features specifically suited to transliteration. In particular, a joint source-channel model and a maximum entropy model were integrated into the decoder to enhance the translation model of the SMT system by contributing local contextual information. Our system uses a novel Bayesian co-segmentation technique to perform a many-to-many source-target sequence alignment of the corpus, and the models of our system are trained directly from this co-segmentation. We have shown that this technique is very effective for producing training data for a joint source-channel model, and that it is able to accurately induce the linguistic segmentation of Japanese names, building a compact model based on a self-consistent segmentation of the data. In the future we would like to develop more sophisticated Bayesian models, and to investigate methods for identifying and dealing with different source languages. We would also like to measure the utility of training the language model component of our system independently on large amounts of monolingual data, which is often much more readily available than aligned bilingual corpora.

Acknowledgements

For the English-Japanese, English-Korean and Arabic-English datasets, the reader is referred to the CJK website: http://www.cjk.org. The English-Hindi, English-Tamil, English-Kannada, and English-Bangla data sets originated from the work of (Kumaran and Kellner, 2007).

References

Rafael E. Banchs, Josep Maria Crego, Adria Degispert, Patrik Lambert, Marta Ruiz, and Jose A. R. Fonollosa. 2005. Bilingual n-gram statistical machine translation. In Proc. of Machine Translation Summit X, pages 275–282.

Bo-June (Paul) Hsu and James Glass. 2008. Iterative language model estimation: Efficient data structure and algorithms. In Proc. Interspeech.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 159, Morristown, NJ, USA. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration. In Proc. 3rd International Joint Conference on NLP, volume 1, Hyderabad, India.

Andrew Finch and Eiichiro Sumita. 2010a. A Bayesian model of bilingual segmentation for transliteration. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the Seventh International Workshop on Spoken Language Translation (IWSLT), pages 259–266.

Andrew Finch and Eiichiro Sumita. 2010b. Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In Proceedings of the 2010 Named Entities Workshop, NEWS '10, pages 48–52, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andrew Finch, Etienne Denoual, Hideo Okuma, Michael Paul, Hirofumi Yamamoto, Keiji Yasuda, Ruiqiang Zhang, and Eiichiro Sumita. 2007. The NICT/ATR speech translation system for IWSLT 2007. In Proceedings of the IWSLT, Trento, Italy.

Franz J. Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of the ACL.

Taraka Rama and Karthik Gali. 2009. Modeling machine transliteration as a phrase based statistical machine translation problem. In NEWS '09: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 124–127, Morristown, NJ, USA. Association for Computational Linguistics.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532, May.

Omar F. Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.
Yun Huang, Min Zhang, and Chew Lim Tan. 2011. Nonparametric Bayesian machine transliteration with synchronous adaptor grammars. In ACL (Short Papers), pages 534–539.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, April. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007: Proceedings of Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June.

A. Kumaran and Tobias Kellner. 2007. A generic framework for machine transliteration. In SIGIR '07, pages 721–722.

Simple Discriminative Training for Machine Transliteration

Canasai Kruengkrai, Thatsanee Charoenporn, Virach Sornlertlamvanich
National Electronics and Computer Technology Center
Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand
{canasai.kruengkrai,thatsanee.charoenporn,virach.sornlertlamvanich}@nectec.or.th

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 28–31, Chiang Mai, Thailand, November 12, 2011.

Abstract

In this paper, we describe our system used in the NEWS 2011 machine transliteration shared task. Our system consists of two main components: simple strategies for generating training examples based on character alignment, and discriminative training based on the Margin Infused Relaxed Algorithm. We submitted results for 10 language pairs on standard runs. Our system achieves the best performance for English-to-Thai and English-to-Hebrew.

1 Introduction

We aim to develop a machine transliteration system that performs well on any given language pair without much effort in pre- and post-processing or parameter tuning. To compare the performance of our system against state-of-the-art approaches, we participated in the machine transliteration shared task conducted as a part of the Named Entities Workshop (NEWS 2011), an IJCNLP 2011 workshop. Specifically, we focus on standard runs, where only the corpus (containing parallel names) provided by the shared task is used for training. We submitted results for 10 language pairs.

2 Background

2.1 Motivation

As discussed in (Li et al., 2004), machine transliteration can be viewed as two levels of decoding: (1) segmenting the source language character string into transliteration units, and (2) relating the source language transliteration units to units in the target language by resolving different combinations of alignments and unit mappings. A transliteration unit could be one or more characters. Typically, the source and target language transliteration units are not given in the training corpus.

The process of machine transliteration is very similar to that of phrase-based statistical machine translation (SMT) (Koehn et al., 2003). As a result, a number of previous studies directly applied phrase-based SMT techniques to machine transliteration (Finch and Sumita, 2009; Rama and Gali, 2009; Finch and Sumita, 2010; Avinesh and Parikh, 2010). However, unlike word alignment in phrase-based SMT, character alignment in machine transliteration tends to be monotonic: reordering of target language characters rarely occurs, though it is still possible in some language pairs.

After alignment, the target language transliteration units can be considered as tags (or labels) of the source language transliteration units. Consequently, some previous studies viewed machine transliteration simply as a sequence labeling problem (Aramaki and Abekawa, 2009; Shishtla et al., 2009). With this problem setting, the system can apply any powerful discriminative training algorithm (e.g., Conditional Random Fields (CRFs) (Lafferty, 2001)) incorporating rich features. Our system follows this research direction, but we pay more attention to how appropriate transliteration units are extracted, and we train our model using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2005; McDonald, 2006).

2.2 Problem Setting

Here, we formulate the process of machine transliteration based on discriminative learning. Given a character string x in the source language, we need to find the most likely character string ŷ out of all possible character strings in the target language. We express this process by:

    ŷ = \arg\max_{y \in Y} s(x, y; w) ,        (1)

where s denotes a discriminant function over a pair of a source language character string x and a hypothesized target language character string y, given a parameter vector w.

Figure 1: Ideal alignment.

Figure 2: The source language character string x is longer than the target language character string y. The aligner maps two source language characters to a single target language character.

Figure 3: The aligner cannot map x2 to any target language character. Based on the information from the previous alignment, we align x2 to y1.

Figure 4: The aligner cannot map y4 to any source language character. Based on the information from the previous alignment, we align y4 to x3.
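The decoding rule in Equation (1) can be sketched in a few lines of Python. This is our own illustrative sketch, not the authors' implementation: the feature map `features` (per-character source-target pair counts) and the explicit candidate set passed to `decode` are stand-ins for the richer context features and dynamic-programming search described later in the paper.

```python
from collections import Counter

def features(x, y):
    """Toy feature map f(x, y): counts of aligned (source, target) character pairs.
    zip truncates to the shorter string; a real system would use segmented units."""
    return Counter(zip(x, y))

def score(x, y, w):
    """Discriminant function s(x, y; w) = <w, f(x, y)> as a sparse dot product."""
    return sum(w.get(feat, 0.0) * cnt for feat, cnt in features(x, y).items())

def decode(x, candidates, w):
    """Equation (1): return the highest-scoring hypothesis from a candidate set."""
    return max(candidates, key=lambda y: score(x, y, w))
```

With weights that reward the mappings a→A and b→B, `decode("ab", ["BB", "AB", "AA"], w)` selects "AB"; learning such weights from the training examples is the role of MIRA in Section 4.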
To handle 3 Strategies for Generating Training this case, we associate the position-of-character Examples (POC) tags with the target language character. In this section, we describe how to generate train- Our POC tags includes {B, I}, indicating the be- ing examples from a parallel name corpus. Our ginning and the intermediate positions, respec- training example construction is based on charac- tively. Our training example becomes (hx1 , B-y1 i, ter alignment. hx2 , I-y1 i, hx3 , B-y2 i, hx4 , B-y3 i, hx5 , B-y4 i). At the first step, we can apply any word align- In practice, the aligner often yields incom- ment tool commonly used in SMT. Given a train- plete alignments. Some target language characters ing corpus containing parallel name pairs, we use could not be aligned to source language charac- the aligner to obtain initial character alignments. ters, and vice versa. To handle this case, we use Figure 1 shows an ideal alignment example be- simple heuristics by looking at neighboring align- tween the source language character string x and ments. We find unaligned characters in both the the target language character string y. Now, as- source and target character strings. If the previous sume that we have only one parallel name pair. alignment is already established, we expand it to Thus, our training example can be directly written the empty alignment. If the previous alignment is as (hx1 , y1 i, hx2 , y2 i, . . . , hx5 , y5 i). not available (e.g., the unaligned character occurs Unfortunately, the lengths of parallel name at the beginning position), we instead use the in- pairs in the training corpus are typically unequal. formation from the next alignment. The source language character string x could be Figure 3 shows an example when the aligner shorter or longer than the target language char- cannot map x2 to any target language character. acter string y. Figure 2 shows an example when Based on our heuristics, we align x2 to y1 . 
As a result, our training example is identical to that of Figure 2. Figure 4 shows another example where the aligner cannot map y4 to any source language character. In this case, we align y4 to x3. Now, a single source language character is associated with two target language characters, i.e., x3 → {y3, y4}. As a result, we merge y3 and y4 into a single transliteration unit y3y4. Our training example becomes (⟨x1, B-y1⟩, ⟨x2, B-y2⟩, ⟨x3, B-y3y4⟩, ⟨x4, B-y5⟩, ⟨x5, B-y6⟩).

Note that character reordering can be found in the alignments. Figure 5 shows an example where reordering occurs in the target language characters. To be able to perform the monotone search in decoding, we merge y4 and y5 into a single transliteration unit y4y5. Our training example becomes (⟨x1, B-y1⟩, ⟨x2, B-y2⟩, ⟨x3, B-y3⟩, ⟨x4, B-y4y5⟩, ⟨x5, I-y4y5⟩).

[Figure 5: Reordering occurs in the target language characters. y4 and y5 are first merged into a single transliteration unit y4y5, and x4 and x5 are then aligned to B-y4y5 and I-y4y5, respectively.]

Figure 6 shows another possible character reordering. We use the same scheme as in the previous example. Thus, our training example becomes (⟨x1, B-y1⟩, ⟨x2, B-y2⟩, ⟨x3, B-y4y5⟩, ⟨x4, I-y4y5⟩, ⟨x5, I-y4y5⟩). To summarize, we examine whether reordering occurs in the target language characters. If so, we merge those target language characters until the alignments become monotonic.

[Figure 6: Another possible character reordering.]

4 Learning and Decoding

The goal of our model is to learn a mapping from source language character strings x ∈ X to target language character strings y ∈ Y, based on training examples of source-target language name pairs D = {(x_t, y_t)}_{t=1}^{T}.

In our model, we apply a generalized version of MIRA (Crammer et al., 2005; McDonald, 2006) that can incorporate k-best decoding in the update procedure. From Equation (1), the linear discriminant function s becomes the dot product between a feature function f of the source language character string x and the target language character string y, and a corresponding weight vector w:

    s(x, y; w) = ⟨w, f(x, y)⟩ .    (2)

In each iteration, MIRA updates the weight vector w while keeping the norm of the change in the weight vector as small as possible. Within this framework, we can formulate the optimization problem as follows (McDonald, 2006):

    w^(i+1) = argmin_w ‖w − w^(i)‖    (3)
    s.t. ∀ŷ ∈ best_k(x_t; w^(i)) :
         s(x_t, y_t; w) − s(x_t, ŷ; w) ≥ L(y_t, ŷ) ,

where best_k(x_t; w^(i)) denotes the set of top k-best outputs given the weight vector w^(i). We generate best_k(x_t; w^(i)) using a dynamic programming search (Nagata, 1994). We measure L(y_t, ŷ) using the zero-one loss function. Our basic features operate over a window of ±4 source language characters and over target language character bigrams.

5 Development and Final Results

In development, we were interested in how the quality of alignment affects the performance of transliteration, because errors in alignment inevitably propagate to the learning phase. We used two popular alignment tools: GIZA++ (Och and Ney, 2003) and BerkeleyAligner (Liang et al., 2006). With their default parameter settings, GIZA++ yields better performance than BerkeleyAligner on all development data sets. As a result, our submitted primary runs on the test data sets are based on the alignments produced by GIZA++.

1 http://code.google.com/p/giza-pp
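A minimal sketch of a k-best large-margin update in the spirit of Equation (3) is given below. This is our own illustrative code, not the authors' implementation: `features` and the k-best list stand in for the feature function f and the dynamic-programming decoder, and each violated constraint is closed with the standard single-constraint MIRA step rather than a joint quadratic program over all k constraints.

```python
def mira_update(w, x, y_gold, kbest, features, loss):
    """One MIRA-style update for a training pair (x, y_gold).

    For each k-best candidate whose margin constraint
    s(x, y_gold) - s(x, y_hat) >= L(y_gold, y_hat) is violated,
    take the smallest weight change closing that gap.

    w        : sparse weight vector (dict feature -> weight), updated in place
    kbest    : list of candidate outputs for x (from a k-best decoder)
    features : f(x, y) -> dict of feature counts
    loss     : L(y_gold, y_hat), e.g. the zero-one loss
    """
    f_gold = features(x, y_gold)
    for y_hat in kbest:
        f_hat = features(x, y_hat)
        diff = {k: f_gold.get(k, 0.0) - f_hat.get(k, 0.0)
                for k in set(f_gold) | set(f_hat)}
        margin = sum(w.get(k, 0.0) * v for k, v in diff.items())
        violation = loss(y_gold, y_hat) - margin
        norm2 = sum(v * v for v in diff.values())
        if violation > 0 and norm2 > 0:
            tau = violation / norm2  # smallest step satisfying the constraint
            for k, v in diff.items():
                w[k] = w.get(k, 0.0) + tau * v
```

With the zero-one loss of the paper, `loss` returns 0 for the gold output and 1 for any other candidate, so the gold output (if present in the k-best list) never triggers an update against itself.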
2 http://code.google.com/p/berkeleyaligner

Our learning algorithm has two tunable parameters: the number of training iterations N and the number of top k-best outputs k. We heuristically set N = 10 and k = 5 for all experiments.

Final results showing the "standard run" performance of our system on the test data sets are given in Table 1. Evaluation metrics include word accuracy in top-1 (ACC), fuzziness in top-1 (F-score), mean reciprocal rank (MRR), and MAPref, described in more detail in (Zhang et al., 2011). The table shows the scores of our primary runs, and the last column indicates our rank when our scores are compared with those of the other participants.

Language Pair  ACC    F-score  MRR    MAPref  Rank (# of all primary runs)
En→Ch          0.342  0.702    0.406  0.331   2 (7)
Ch→En          0.131  0.730    0.193  0.131   5 (6)
En→Th          0.354  0.854    0.451  0.350   1 (2)
Th→En          0.284  0.841    0.402  0.283   2 (2)
En→Hi          0.436  0.870    0.538  0.435   3 (4)
En→Ta          0.432  0.896    0.553  0.430   2 (2)
En→Ka          0.398  0.878    0.502  0.397   2 (2)
En→Ba          0.455  0.887    0.557  0.453   2 (2)
En→Pe          0.643  0.943    0.744  0.629   2 (4)
En→He          0.602  0.931    0.702  0.602   1 (2)

Table 1: Final results showing the "standard run" performance of our system on the test data sets. Language acronyms: En = English, Ch = Chinese, Th = Thai, Hi = Hindi, Ta = Tamil, Ka = Kannada, Ba = Bengali (Bangla), Pe = Persian, and He = Hebrew.

Our system performs reasonably well across language pairs, except for Chinese-to-English back-transliteration. We achieve the best performance for English-to-Thai and English-to-Hebrew, and the second-best performance (in the cases where more than two primary runs were submitted) for English-to-Chinese and English-to-Persian.

References

Eiji Aramaki and Takeshi Abekawa. 2009. Fast decoding and easy implementation: transliteration as sequential labeling. In Proceedings of the 2009 Named Entities Workshop, pages 65–68.

P. V. S. Avinesh and Ankur Parikh. 2010. Phrase-based transliteration system with simple heuristics. In Proceedings of the 2010 Named Entities Workshop, pages 81–84.

Koby Crammer, Ryan McDonald, and Fernando Pereira. 2005. Scalable large-margin online learning for structured classification. In NIPS Workshop on Learning With Structured Outputs.

Andrew Finch and Eiichiro Sumita. 2009. Transliteration by bidirectional statistical machine translation. In Proceedings of the 2009 Named Entities Workshop, pages 52–56.

Andrew Finch and Eiichiro Sumita. 2010. Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In Proceedings of the 2010 Named Entities Workshop, pages 48–52.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL, pages 48–54.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In Proceedings of ACL, pages 159–166.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of HLT-NAACL, pages 104–111.

Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.

Masaki Nagata. 1994. A stochastic Japanese morphological analyzer using a forward-DP backward-A* n-best search algorithm. In Proceedings of COLING, pages 201–207.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Taraka Rama and Karthik Gali. 2009. Modeling machine transliteration as a phrase based statistical machine translation problem. In Proceedings of the 2009 Named Entities Workshop, pages 124–127.

Praneeth Shishtla, V. Surya Ganesh, Sethuramalingam Subramaniam, and Vasudeva Varma. 2009. A language-independent transliteration schema using character aligned models at NEWS 2009. In Proceedings of the 2009 Named Entities Workshop, pages 40–43.

Min Zhang, A Kumaran, and Haizhou Li. 2011. Whitepaper of NEWS 2011 shared task on machine transliteration. http://translit.i2r.a-star.edu.sg/news2011.

English-Korean Named Entity Transliteration Using Statistical Substring-based and Rule-based Approaches

Yu-Chun Wang
Department of Computer Science and Information Engineering
National Taiwan University, Taiwan

Richard Tzong-Han Tsai
Department of Computer Science and Engineering
Yuan Ze University, Taiwan
[email protected]  [email protected]

Abstract

This paper describes our approach to English-Korean transliteration in the NEWS 2011 Shared Task on Machine Transliteration. We adopt the substring-based transliteration approach, which groups the characters of a named entity in both the source and target languages into substrings and then formulates transliteration as a sequential tagging problem: substrings in the source language are tagged with substrings in the target language. The CRF algorithm is used to deal with this tagging problem. We also construct a rule-based transliteration method for comparison. Our standard and non-standard runs achieve 0.43 and 0.332 in top-1 accuracy, respectively, and were ranked the best for the English-Korean pair.

1 Introduction

Named entity translation plays an important role in machine translation, cross-language information retrieval, and question answering. However, named entities such as person names or organization names are generated every day and do not often appear in dictionaries, since bilingual dictionaries cannot update their contents frequently. Most named entity translation is based on transliteration, a method that maps phonemes or graphemes from the source language into the target language. Therefore, it is necessary to construct a named entity transliteration system.

For English-Korean named entity transliteration, we adopt the substring-based transliteration proposed by Reddy and Waxmonsky (2009) with conditional random fields (CRF). The method treats transliteration as a sequential labeling task in which substring tokens in the source language are tagged with the substring tokens in the target language using CRF. Since the Korean writing system, Hangul, is alphabetic, we consider the sequential labeling method suitable for English-Korean transliteration. In addition, we also apply a rule-based method with a pronouncing dictionary for comparison.

2 Our Approach

We employ three different approaches to the transliteration: grapheme substring-based, phoneme substring-based, and rule-based methods. The grapheme and phoneme substring-based methods are both based on substring-based transliteration with CRF; the difference is whether the substrings are composed of English characters or of English phonemes. The details of each method are described in the following subsections.

2.1 Substring-based Approach

The substring-based approach comprises the following steps:

1. Pre-processing
2. Substring alignment
3. CRF training
4. Substring segmentation and transliteration

2.1.1 Pre-processing

The Korean writing system, namely Hangul, is alphabetical. However, unlike Western writing systems with Latin alphabets, the Korean alphabet is composed into syllabic blocks. For transliteration from other languages into Korean, one syllabic block mainly contains two or three letters, drawn from 14 leading consonants, 10 vowels, and 7 tailing consonants. For instance, the syllabic block "한" (han) is composed of three letters: a leading consonant "ㅎ" (h), a vowel "ㅏ" (a), and a tailing consonant "ㄴ" (n).

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 32–35, Chiang Mai, Thailand, November 12, 2011.

Thus, in order to deal with the Korean training data, we have to decompose Korean syllabic blocks into letters before performing training. The Korean letters in syllabic blocks correspond almost perfectly to their phonological forms. However, the actual pronunciation of some consonant letters may vary with the position in the syllabic block. For example, the letter "ㅅ" is pronounced as [s] in the leading consonant position, but as [t] in the tailing consonant position. We do not distinguish these pronunciation differences and treat such letters as the same tokens. For convenient processing, we convert the Korean letters into Roman symbols with the Revised Romanization of Korean proposed by the South Korean government.

2.1.2 Substring alignment

Unlike Korean, English orthography might not reflect its actual phonological forms, which makes trivial one-to-one character alignment between English and Korean impractical. English may use several characters for one phoneme that is presented as one letter in Korean, such as "ch" for "ㅊ" and "oo" for "ㅜ". In contrast, English sometimes uses a single character for a diphthong or consonant cluster, which is presented as several letters in Korean. For example, the letter "x" in the English named entity "Texas" corresponds to two letters, "ㄱ" and "ㅅ", in Korean. Besides, some English letters in a word might not be pronounced at all, like the "k" in the English word "knight".

Furthermore, due to Korean phonology, Korean may insert the specific vowel "ㅡ" [W] between English consonant clusters or after the last burst stop consonant of a syllable. For instance, the English named entity "Snell" is transliterated as "스넬" /sW nel/ and "Albert" is transliterated as "앨버트" /æl b@ th W/.

In order to deal with these complex orthography problems, we adopt a substring-based method that groups characters into substrings. English words are segmented into several substrings, and each substring maps to a substring in the target language, Korean.

To create training sets of substrings, we use the GIZA++ toolkit (Och and Ney, 2003) to align all the named entity pairs in the training data. The GIZA++ toolkit performs one-to-many alignments, which means that a single symbol in the source language may be aligned to at least one symbol in the target language. To obtain many-to-many substring alignments, we run GIZA++ on the data in both directions, from source language to target language and from target language to source language. The final bidirectional alignment result is the union of the alignments in both directions. Inserted characters (aligned to NULL by GIZA++) in the alignment results are merged with the preceding character into the same substring. For example, the bidirectional alignment result of the English word "KNOX" to the Korean word "nok sW" (녹스) is [KN → n, O → o, X → k, null → s, null → W]. The null → s and null → W mappings are merged into the previous alignment to generate X → ksW. Finally, we get the one-to-one alignment [KN → n, O → o, X → ksW].

After the processing of the bidirectional alignments, we transform the training data into one-to-one substring mapping pairs. These substring pairs are used as the token set for the CRF training. A few pairs in the training data cannot be aligned one-to-one, such as "THAILAND" to /th a i/ (타이), because they are not actual transliterations. We drop these pairs from the training data because our CRF formulation can handle one-to-one alignments only.

In addition, since Korean has a phonological writing system, for the non-standard runs we also adopt phonemic information for the English named entities. The English word pronunciations are obtained from the CMU Pronouncing Dictionary v0.7a¹, which provides the phonemic representations of English pronunciations as sequences of phoneme symbols. For instance, the English word KNOX is segmented and tagged as the phonemic representation < N AA K S >. Since the CMU Pronouncing Dictionary does not cover all the pronunciation information for the named entities in the training data, we also apply the LOGIOS Lexicon Tool² to generate the phonemic representations of all the named entities not in the CMU Pronouncing Dictionary. After obtaining the phonemic representations of all the English named entities in the training data, we formulate the sequence of phoneme symbols of each English named entity as a string and apply the substring alignment method mentioned earlier to get the mappings from English phoneme symbols to Korean letters. For the previous example, the phoneme symbols < N AA K S > from the English named entity KNOX are aligned to the letters of its corresponding Korean word "nok sW" as [N → n, AA → o, K → k, S → sW]. We call the substring alignment based on the English phonemic representation the "phoneme substring-based" method (our non-standard run), and the substring alignment based on the English orthography the "grapheme substring-based" method (our standard run).

1 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
2 http://www.speech.cs.cmu.edu/tools/lextool.html

2.1.3 CRF training

With the transformed substring training data, we now use CRF to train a sequential model with the substrings as the basic tokens. We adopt the CRF++ open-source toolkit (Kudo, 2005).
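The merging of NULL-aligned target characters described in Section 2.1.2 can be sketched as follows. This is our own illustrative code, not the authors' implementation; the alignment is assumed to be a list of (source substring, target letters) pairs in order, with None marking a GIZA++ NULL link:

```python
def merge_null_alignments(alignment):
    """Merge target letters aligned to NULL into the preceding pair.

    `alignment` is a list of (source, target) pairs in order; pairs whose
    source is None are absorbed into the previous pair's target side,
    yielding a one-to-one substring alignment.
    """
    merged = []
    for src, tgt in alignment:
        if src is None and merged:
            prev_src, prev_tgt = merged[-1]
            merged[-1] = (prev_src, prev_tgt + tgt)
        else:
            merged.append((src, tgt))
    return merged

# The paper's KNOX example: [KN → n, O → o, X → k, null → s, null → W]
pairs = [("KN", "n"), ("O", "o"), ("X", "k"), (None, "s"), (None, "W")]
# merge_null_alignments(pairs) → [("KN", "n"), ("O", "o"), ("X", "ksW")]
```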
We train our CRF models with unigram, bigram, and trigram features over the input substrings in the source language. The features are the following:

• Unigram: s−1, s0, and s1
• Bigram: s−1 s0
• Trigram: s−2 s−1 s0, s−1 s0 s1, and s0 s1 s2

where the current substring is s0 and si denotes the substring at position i relative to the current substring.

2.1.4 Substring segmentation and transliteration

Because our method is based on the substrings from the transformed training data, we have to segment unseen English named entities into substrings before applying CRF testing with our model. For example, we have to segment the English named entity "SHASHI" into the four substrings < SH A SH I >. Since the substrings used to train the CRF model are generated by the bidirectional alignments from the training data, we also use CRF to train another model for substring segmentation of English named entities.

We adopt a segmentation approach motivated by Chinese segmentation (Tsai et al., 2006), which treats segmentation as a tagging problem: a character in a sentence is tagged with the B class if it is the first character of a Chinese word, or with the I class if it is inside a Chinese word but is not the first character. Thus, we collect all the substring results from the bidirectional alignments and tag each character of the English named entities in the training data as B class (the first character of a substring) or I class (not the first character of a substring) to create training data for the CRF substring segmentation model. Since each character should belong to exactly one substring, we need only the B and I classes in the tag set.

After an English named entity is segmented into substrings, it can be passed as input to the CRF model we trained in Section 2.1.3 to produce the transliteration result.

The transliteration result predicted by the CRF model is a romanized representation of Korean letters. Therefore, the romanized representation sequences must be converted back into Korean syllabic blocks. Because the position information of each Korean letter in the syllabic blocks (leading consonant, vowel, and tailing consonant, mentioned in Section 2.1.1) is not retained during training, we have to organize the sequential letters into blocks based on Korean orthography. Korean orthographic rules are applied to combine the letters into syllabic blocks. For example, the sequential Korean letters "ㅁ, ㅏ, ㄱ, ㅅ, ㅣ" (m, a, k, s, i) are combined into the two syllabic blocks "막시" (mak-si), placing "k" in the tailing consonant position of the first syllable and "s" in the leading consonant position of the second syllable, because consonant clusters are not allowed in a Korean syllabic block. Besides, between successive vowel letters, the zero consonant letter "ㅇ" is inserted, as required by Korean orthography.

2.2 Rule-based Approach

We also construct a rule-based transliteration system. According to the "외래어 표기법" (Korean writing method of loanwords)³ standardized by the National Institute of Korean Language, we build a transliteration mapping table from the International Phonetic Alphabet (IPA) to Korean letters. The phonemic representations of the English named entities in the test set are first extracted with the CMU Pronouncing Dictionary and the LOGIOS Lexicon Tool. Then, each phoneme symbol is transliterated into the corresponding Korean letter based on the transliteration mapping table. The results generated by the mapping table need to be composed into Korean syllabic blocks; we use the same technique described in Section 2.1.4 to produce the final results of the rule-based method.

3 http://www.korean.go.kr/09_new/dic/rule/rule_foreign_0101.jsp

3 Results

Table 1 shows the final results of our transliteration approaches on the test data. We construct four runs as follows:

• Grapheme substring-based: CRF model with the substring training set based on English orthography.
• Phoneme substring-based: CRF model with the substring training set based on English phonemic representations.
• Rule-based: transliteration mapping table from English phonemes to Korean letters.
• Mixed: union of the results of the previous three runs, in the order phoneme substring-based, grapheme substring-based, and rule-based.

Run                       Accuracy  Mean F-score  MRR    MAPref
Grapheme substring-based  0.430     0.711         0.430  0.423
Phoneme substring-based   0.332     0.653         0.332  0.325
Rule-based                0.215     0.474         0.215  0.209
Mixed                     0.332     0.653         0.467  0.332

Table 1: Final results on the test data.

The results show that the grapheme-based approach performs better than the others on all four evaluation metrics. The rule-based run does not perform well because the rules from the Korean writing method of loanwords may not be sufficient to cover most possible cases of transliteration in detail. The result of the phoneme substring-based approach is also not as good as that of the grapheme substring-based one. This might be due to two reasons: first, Korean transliteration is sometimes based on the orthography rather than the actual pronunciation; second, the pronunciations from the LOGIOS Lexicon Tool may not be accurate enough to yield the correct phonemic forms. Both the phoneme substring-based and rule-based approaches suffer from these problems. The performance of the mixed run, which merges the results of the above three runs, shows that the joint result of different methods can help cover more possible transliterations.

4 Conclusion

In this paper, we adopt the substring-based transliteration approach with a CRF model for English-Korean named entity transliteration. The characters in the source and target languages are aligned bidirectionally and then grouped into substrings to generate the substring mappings from the source language to the target language. The transliteration is then formulated as a sequential tagging problem that tags the substrings in the source language with the substrings in the target language; the CRF algorithm is used to deal with this tagging problem. For English substring generation, we create two types of substrings: one based on the English orthography, and the other based on the phonemic symbols from the CMU Pronouncing Dictionary. In addition, we also construct a rule-based transliteration system based on the Korean writing method of loanwords from the National Institute of Korean Language. In the evaluation, the substring-based method based on the English orthography performs better than the other runs.

For future work, we plan to add more phonetic features for the CRF training and to integrate the CRF-based statistical method and the rule-based method to improve transliteration performance. We also plan to apply re-ranking techniques using web data to obtain better transliteration results.

References

Taku Kudo. 2005. CRF++: Yet another CRF toolkit. Available at http://chasen.org/~taku/software/crf++/.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Sravana Reddy and Sonjia Waxmonsky. 2009. Substring-based transliteration with conditional random fields. In Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP, pages 92–95.

Richard Tzong-Han Tsai, Hsieh-Chuan Hung, Cheng-Lung Sung, Hong-Jie Dai, and Wen-Lian Hsu. 2006. On closed task of Chinese word segmentation: An improved CRF model coupled with character clustering and automatically generated template matching. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 134–137.

Leveraging Transliterations from Multiple Languages

Aditya Bhargava, Bradley Hauer, and Grzegorz Kondrak
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada, T6G 2E8
{ab31,bmhauer,gkondrak}@ualberta.ca

Abstract
While past research on machine transliteration has focused on single transliteration tasks, there exists a variety of supplemental transliterations available in other languages. Given an input for English-to-Hindi transliteration, for example, transliterations from other languages such as Japanese or Hebrew may be helpful in the transliteration process. In this paper, we propose the application of such supplemental transliterations to English-to-Hindi machine transliteration via an SVM re-ranking method with features based on n-gram alignments as well as system and alignment scores. This method achieves a relative improvement of over 10% over the base system used on its own. We further apply this method to system combination, demonstrating just under 5% relative improvement.

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 36–40, Chiang Mai, Thailand, November 12, 2011.

1 Introduction

The focus of significant previous work in machine transliteration, including that presented at past NEWS Shared Tasks (Li et al., 2009; Kumaran et al., 2010b), has been on single transliteration tasks in isolation from other languages. This is despite the fact that the various languages provided represent a significant quantity of potentially useful data that is being ignored. In this NEWS 2011 Shared Task submission, we present a method that beneficially applies supplemental transliterations from other languages to English-to-Hindi transliteration.

In practice, this is a realistic situation in which transliterations from other languages can help. For example, Wikipedia contains articles on the guitarist John Petrucci in English and Japanese, but not in Hindi. If we wanted to automatically generate a stub (skeleton) article in Hindi, we would need to transliterate his name into Hindi. Since a Japanese version already exists, we could extract from it additional information to help with the transliteration process. Importantly, since our article is about an American guitarist, we would explicitly want to start with the English (original) version of the name and treat the other languages as extra data, rather than vice versa.

In order to effectively incorporate the other-language data, we apply SVM re-ranking in a manner that has previously been shown to provide significant improvement for grapheme-to-phoneme conversion (Bhargava and Kondrak, 2011). This method is flexible enough to incorporate multiple languages; it employs features based on character alignments between potential outputs and existing transliterations from other languages, as well as the scores of these alignments, which serve as a measure of similarity. We apply this approach on top of the same DIRECTL+ system as submitted last year (Jiampojamarn et al., 2010b) for English-to-Hindi machine transliteration. Compared to the base DIRECTL+ performance, we are able to achieve significantly better results, with a relative performance increase of over 10%. We also achieve improvements without supplemental transliterations by simply applying the same approach with another system's output as extra data. We furthermore experiment with romanization of the Hindi data as well as different alignment length settings for English-to-Chinese transliteration. This paper presents the methods, methodology, and results for the above experiments.

2 Leveraging multiple transliterations

Bhargava and Kondrak (2011) present a method for applying transliterations to grapheme-to-phoneme conversion. Here, we apply this method verbatim to machine transliteration. The method is based on SVM re-ranking applied over n-best output lists generated by a base system. Intuitively, we have
an existing base transliteration system that, for a given input, provides a set of n scored outputs, with the correct output not always appearing in the top position. In order to help bring the correct output to the top, we turn to existing transliterations of the input from other languages. In order to leverage a variety of features and transliterations from all available languages, SVM re-ranking is applied to this task.

For each output, a feature vector is constructed. Given alignments between the input and output, for example, binary indicator features based on grouping input and output n-grams in the style of DIRECTL+ (Jiampojamarn et al., 2010a) are constructed. The base system's score for the output is included as well, along with the differences between the given output's score and the scores of the other outputs in the list. This feature construction process is then repeated, replacing the input with an available transliteration, for each available transliteration language. The score in this latter case is used as a measure of how "similar" a candidate output is to a "reference" transliteration from another language. We refer to these other transliterations as supplemental transliterations. While the score features provide a global measure of similarity, the n-gram features allow weights to be learned for character combinations between the candidate output and supplemental transliterations; this provides very fine-grained features that can explicitly use certain characters in supplemental transliterations to help determine the quality of a candidate output.

There are, however, some practicalities that must be considered. Bhargava and Kondrak (2011) note the importance of applying multiple languages; they found it difficult to achieve significant improvements using transliterations from one language only. This is due in part to noise in the data (which has been observed in some of the NEWS Shared Task data (Jiampojamarn et al., 2009)) as well as to differing conventions for various transliteration "schemes". These issues are handled implicitly in two ways: (1) the granularity of the n-gram features allows certain character combinations in the transliteration to be learned as positive or negative indicators of a candidate output's quality, or to be ignored altogether; and (2) the use of multiple transliterations helps smooth out some of the noise. While we do not examine these methods here for brevity's sake, Bhargava and Kondrak (2011) show the effectiveness of the granular n-gram features vs. the score features, as well as the importance of applying multiple transliteration languages.

3 Alignment of training data

Practically, we must consider how to generate the alignments between the candidate output transliterations and the supplemental transliterations for the n-gram features, as well as how to generate the similarity scores. M2M-ALIGNER (Jiampojamarn et al., 2007) addresses both of these. M2M-ALIGNER is an unsupervised character alignment system, meaning that it can learn to align data given sufficient training data consisting of unaligned input-output pairs. Once trained, M2M-ALIGNER will then produce an alignment for a new pair as well as an alignment score. Because the algorithm is a many-to-many extension of the unsupervised edit distance algorithm, we can see that the alignment score should represent some notion of script-agnostic similarity.

Since we will be applying M2M-ALIGNER between candidate output transliterations and supplemental transliterations for a variety of supplemental languages, we will need to build several alignment models, each built from separate training data. The majority of the task data are English-source, so for any entry in one language corpus we can easily find corresponding transliterations in other language corpora. In other words, to generate training data for M2M-ALIGNER between the target transliteration language and a supplemental language, we need only intersect the two corpora on the basis of the common English input.

Table 1 shows the amount of overlap between the test data for the different English-source languages and the combined training and development data for the other English-source languages. Note that the Chinese- and Korean-target corpora show very high coverage; however, we focus on English-to-Hindi transliteration as it enables us to more closely examine the outputs based on our own linguistic familiarities. The use of other corpora here requires that these results be submitted as a non-standard run. Note that, because there is not complete coverage for the English-to-Hindi test data, we simply submit the base system's results as-is in cases where no transliteration is available from other languages.

Language  Test set  Overlap
EnBa      1,000     498
EnCh      2,000     2,000
EnHe      1,000     525
EnHi      1,000     889
EnJa      1,815     734
EnKa      1,000     883
EnKo      609       608
EnPe      2,000     1,049
EnTa      1,000     884
EnTh      2,000     1,564

Table 1: The number of entries in the test data (per language) that have at least one supplemental transliteration available from another language corpus.

Language  Type      System     Acc.
EnHi      Standard  DTL        47.1
EnHi      Standard  DTL+Rom.   45.7
EnHi      Standard  DTL+SEQ    49.3
EnHi      Non-Std.  DTL+Supp.  52.1
EnCh      Standard  DTL 3-1    34.1
EnCh      Standard  DTL 7-1    28.7
EnJa      Standard  DTL        43.5

Table 2: Word accuracy (%) for the various submitted runs. DTL is generic DIRECTL+; DTL+Rom. is DIRECTL+ trained on romanized data; DTL+SEQ is DIRECTL+ re-ranked with SEQUITUR outputs; and DTL+Supp. is DIRECTL+ re-ranked with supplemental transliteration data from other languages.
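The corpus-intersection step described in Section 3 can be sketched as follows. This is our own illustrative code with a hypothetical corpus layout (dictionaries keyed on the common English input; the romanized toy entries are invented for illustration), pairing two English-keyed corpora into training pairs for an aligner:

```python
def intersect_corpora(corpus_a, corpus_b):
    """Build aligner training pairs from two English-keyed corpora.

    Each corpus maps an English source name to its transliteration in
    that corpus's target language; the intersection pairs the two
    transliterations of every shared English name.
    """
    shared = corpus_a.keys() & corpus_b.keys()
    return [(corpus_a[en], corpus_b[en]) for en in sorted(shared)]

# Hypothetical toy corpora keyed on the common English input.
hindi = {"KNOX": "naks", "SNELL": "snel"}
hebrew = {"KNOX": "nqs", "ALBERT": "albrt"}
# intersect_corpora(hindi, hebrew) → [("naks", "nqs")]
```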
4 Base systems

Our principal base system for generating the n-best output lists is DirecTL+, which has produced excellent results in the NEWS 2010 Shared Task on Transliteration (Jiampojamarn et al., 2010b). For re-ranking, note that training a re-ranker requires training data where the base system scores are representative of unseen data, so that the re-ranker does not simply learn to follow the base system; we therefore split the training data into ten folds and perform a sort of cross-validation with DirecTL+. This provides us with usable training data for re-ranking. We tune the SVM's hyperparameter based on performance on the provided development data, and use the best DirecTL+ settings established in the NEWS 2010 Shared Task (Jiampojamarn et al., 2010b). Armed with optimal parameter settings, we combine the training and development data into a single set used to train our final DirecTL+ system. We also repeat the cross-validation process for training the re-ranker.

We also apply the SVM re-ranking approach to system combination. In this case, we additionally train another system, Sequitur (Bisani and Ney, 2008), for English-to-Hindi transliteration. During test time, we feed the input into both DirecTL+ and Sequitur, and use the top Sequitur output as supplemental data. We expect that Sequitur will sometimes provide a correct answer where DirecTL+ does not; the hope is that the SVM re-ranking approach will be able to learn when this is the case, based on the n-gram and score features.

5 Hindi romanization

In addition to the above re-ranking approach, we experimented with a romanization method for the Hindi data. Since consonant characters in the Devanagari alphabet have vowels included by default, we romanize the text in order to provide DirecTL+ with direct individual control over the consonant and vowel components of the Hindi characters. The default vowel is changed by means of diacritic-like characters, which in turn delete the default vowel; this requires a context-sensitive (but still rule-based) romanization method, which we construct manually. We then train DirecTL+ on the romanized data; during testing, we take the romanized output and convert it back into Devanagari Unicode characters, again using a manually constructed context-sensitive rule-based converter.

6 Results

Table 2 shows that SVM re-ranking significantly improves the English-to-Hindi transliteration accuracy in comparison with the base system. Leveraging all of the English-source transliteration corpora as supplemental data yields an increase of over 10%. When applied using Sequitur's output as "supplemental" data, we see almost a 5% (relative) increase in word accuracy.

In contrast, our Hindi romanization approach decreases the accuracy. This differs from the results of the successful application of romanization to Japanese (Jiampojamarn et al., 2010b), demonstrating that it is not always possible to transfer an idea from one language to another.

The English-to-Chinese results, which use only the base DirecTL+ system, demonstrate the importance of the alignment length parameter setting. DirecTL+ requires aligned data for input, and the maximum length of the alignments will have an effect on what DirecTL+ learns to produce. We submitted both 3-to-1 and 7-to-1 alignments because they gave similar results during development, and both were better than other tested possibilities.
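The kind of context-sensitive romanization described in section 5 can be sketched as follows. The character tables below cover only a few Devanagari letters for illustration; the actual hand-built rule set used in the experiments is the authors' own and is not reproduced here.

```python
# Toy sketch of context-sensitive Devanagari romanization: each consonant
# carries an inherent 'a' unless a dependent vowel sign (matra) follows,
# or a virama deletes the inherent vowel. Maps are a small illustrative subset.
CONSONANTS = {'क': 'k', 'म': 'm', 'ल': 'l', 'न': 'n', 'द': 'd', 'र': 'r', 'स': 's'}
VOWEL_SIGNS = {'ा': 'aa', 'ि': 'i', 'ी': 'ii', 'ु': 'u', 'ू': 'uu', 'े': 'e', 'ो': 'o'}
VOWELS = {'अ': 'a', 'आ': 'aa', 'इ': 'i', 'ई': 'ii', 'उ': 'u', 'ए': 'e', 'ओ': 'o'}
VIRAMA = '\u094d'  # U+094D: suppresses the inherent vowel

def romanize(word):
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ''
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # emit the inherent vowel unless a matra overrides it
            # or a virama deletes it
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append('a')
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        # the virama itself emits nothing
    return ''.join(out)
```

The reverse converter would apply the same context sensitivity in the other direction, re-attaching matras and inserting viramas where the romanized string lacks a vowel after a consonant.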
In the final results, we see a substantial difference between the two alignment settings. We hypothesize that the complexity of English-to-Chinese mappings is better captured by the alignments that map longer sequences of English letters to single Chinese characters, […] making it difficult to generalize to new data.

Finally, we observe very good overall accuracy in the English-to-Japanese results (which also use only the base DirecTL+), which further confirms the effectiveness of DirecTL+ when applied to machine transliteration.

7 Previous work

There are three lines of research relevant to the work we have presented in this paper: (1) DirecTL+ and Sequitur for machine transliteration; (2) applying multiple languages; and (3) system combination.

For the NEWS 2009 and 2010 Shared Tasks, the discriminative DirecTL+ system, which incorporates many-to-many alignments, online max-margin training and a phrasal decoder, was shown to function well as a general string transduction tool; while originally designed for grapheme-to-phoneme conversion, it produced excellent results for machine transliteration (Jiampojamarn et al., 2009; Jiampojamarn et al., 2010b), leading us to re-use it here. Finch and Sumita (2010) also submitted a top-performing system that was based in part on Sequitur, a generative system based on joint n-gram modelling (Bisani and Ney, 2008).

In this paper, we applied multiple transliteration languages to a single transliteration task. While our method is based on SVM re-ranking with features similar to those used in the base system (Bhargava and Kondrak, 2011), there have been other explorations into incorporating other language data, particularly when data are scarce. Zhang et al. (2010), for example, apply a pivoting approach to machine transliteration, and similarly Khapra et al. (2010) propose to transliterate through "bridge" languages. Along similar lines, Kumaran et al. (2010a) find increases in accuracy using a linear-combination-of-scores system that combined the outputs of a direct transliteration system with a system that transliterated through a third language. For statistical machine translation, Cohn and Lapata (2007) also explore the use of a third language.

Finally, we also touched briefly on system combination: we applied the SVM re-ranking method to combining the outputs of DirecTL+ and Sequitur, in particular treating DirecTL+ as the base system and using Sequitur's best outputs to re-rank DirecTL+'s output lists. Finch and Sumita (2010), in contrast, combine Sequitur's output with that of a phrase-based statistical machine translation system, achieving excellent results. Where our approach is based on SVM re-ranking, theirs merged the outputs of the two systems together and then used a linear combination of the system scores to re-rank the combined list.

8 Conclusion

In this paper, we described our submission to the NEWS 2011 Shared Task on machine transliteration. Our focus was on incorporating supplemental data, using a method based on SVM re-ranking, with features derived from n-gram alignments and alignment scores. We demonstrated improvements of over 10% when applying other transliteration data to English-to-Hindi machine transliteration, and of just under 5% when applying another system's outputs in a similar manner. We also found that the romanization of Hindi characters brings about a decrease in performance, and that the alignment length parameter in the DirecTL+ system has a critical effect on the results.

Acknowledgements

We are grateful to Ying Xu for examining our initial Chinese results. This research was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Aditya Bhargava and Grzegorz Kondrak. 2011. How do you pronounce your name? Improving G2P with transliterations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June. Association for Computational Linguistics.

A. Kumaran, Mitesh M. Khapra, and Pushpak Bhattacharyya. 2010a. Compositional machine transliteration. ACM Transactions on Asian Language
Information Processing (TALIP), 9(4):13:1–29, December.

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451, May.

A. Kumaran, Mitesh M. Khapra, and Haizhou Li. 2010b. Report of NEWS 2010 Transliteration Mining Shared Task. In Proceedings of the 2010 Named Entities Workshop, pages 21–28, Uppsala, Sweden, July. Association for Computational Linguistics.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728–735, Prague, Czech Republic, June. Association for Computational Linguistics.

Haizhou Li, A. Kumaran, Vladimir Pervouchine, and Min Zhang. 2009. Report of NEWS 2009 Machine Transliteration Shared Task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 1–18, Suntec, Singapore, August. Association for Computational Linguistics.

Andrew Finch and Eiichiro Sumita. 2010. Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In Proceedings of the 2010 Named Entities Workshop, pages 48–52, Uppsala, Sweden, July. Association for Computational Linguistics.

Min Zhang, Xiangyu Duan, Vladimir Pervouchine, and Haizhou Li. 2010. Machine transliteration: Leveraging on third languages. In Coling 2010: Posters, pages 1444–1452, Beijing, China, August. Coling 2010 Organizing Committee.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York, USA, April. Association for Computational Linguistics.

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, and Grzegorz Kondrak. 2009. DirecTL: a language independent approach to transliteration. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 28–31, Suntec, Singapore, August. Association for Computational Linguistics.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010a. Integrating joint n-gram features into a discriminative training framework. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 697–700, Los Angeles, California, USA, June. Association for Computational Linguistics.

Sittichai Jiampojamarn, Kenneth Dwyer, Shane Bergsma, Aditya Bhargava, Qing Dou, Mi-Young Kim, and Grzegorz Kondrak. 2010b. Transliteration generation and mining with limited training resources. In Proceedings of the 2010 Named Entities Workshop, pages 39–47, Uppsala, Sweden, July. Association for Computational Linguistics.

Mitesh M. Khapra, A. Kumaran, and Pushpak Bhattacharyya. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 420–428, Los Angeles, California, June. Association for Computational Linguistics.

Comparative Evaluation of Spanish Segmentation Strategies for Spanish-Chinese Transliteration

Rafael E. Banchs
Human Language Technology Department, Institute for Infocomm Research
1 Fusionopolis Way, #21-01 Connexis South, Singapore 138632
[email protected]

Abstract

This work presents a comparative evaluation among three different Spanish segmentation strategies for Spanish-Chinese transliteration. The transliteration task is implemented by means of Statistical Machine Translation, using Chinese characters and Spanish sub-word segments as the textual units to be translated. Three different Spanish segmentation strategies are evaluated: character-based, syllabic-based and a proposed sub-syllabic segmentation scheme. Experimental results show that syllabic-based segmentation is the most effective strategy for Spanish-to-Chinese transliteration, while the proposed sub-syllabic segmentation is the most effective scheme in the case of Chinese-to-Spanish transliteration.

1 Introduction

Transliteration can be defined as the process of transcribing a word from one language to another by using the characters of the latter's alphabet. This actually constitutes a "phonetic translation of names across languages" (Zhang et al., 2011). Transliteration is typically used to construct appropriate translations for words that either do not have specific equivalents or are nonexistent in the target language, such as, for instance, names of people, institutions or geographical locations.

Although they are conceptually similar tasks, technically speaking, translation and transliteration exhibit some important differences. For instance, while translation mainly operates at the word level, transliteration does so at the sub-word level. Perhaps the most important difference is the fact that in the transliteration task, reordering of units is not required. As in the case of translation, transliteration results are not necessarily unique, i.e. one word might have different valid transliterations.

The transliteration task can be approached from either a rule-based or a statistical perspective, but in any case, the problem can be theoretically grounded on Finite-state Automata Theory (Knight, 2009). Several different approaches to transliteration have been proposed in the literature (Arbabi et al., 1994; Divay and Vitale, 1997; Knight and Graehl, 1998; Al-Onaizan and Knight, 2002; Li et al., 2004; Tao et al., 2006; Yoon et al., 2007; Jansche and Sproat, 2009), covering specific transliteration tasks between English and a large variety of languages such as Japanese (Knight and Graehl, 1998), French (Divay and Vitale, 1997), Arabic (Arbabi et al., 1994; Al-Onaizan and Knight, 2002), Chinese (Ren et al., 2009; Kwong, 2009), Hindi (Chinnakotla and Damani, 2009; Das et al., 2009; Haque et al., 2009), Tamil (Vijayanand, 2009) and Korean (Hong et al., 2009), among others.

Nevertheless, despite the large body of research on automatic transliteration, and to the best of our knowledge, no research efforts have been reported in this area for the specific case of Spanish and Chinese. Accordingly, the main objective of this work is twofold: first, to create an experimental dataset for transliteration between Chinese and Spanish; and, second, to report some research results on transliteration tasks between these two languages.

The remainder of the paper is structured as follows. First, in section 2, the main technical issue evaluated in this work, the segmentation of Spanish words into sub-word units, is introduced and motivated. Then, in section 3, the selected SMT-based approach for Chinese-Spanish transliteration is described. In section 4, the creation of an experimental dataset for Chinese-Spanish transliteration is described in detail. In section 5, experimental results are presented and discussed. Finally, in section 6, main conclusions and future research ideas are provided.

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 41–48, Chiang Mai, Thailand, November 12, 2011.
2 Spanish Word Segmentation

The concept of isochronism in language was first introduced by Pike (1945). Three types of rhythmic patterns can be distinguished: stress-timed, syllable-timed and mora-timed. Although this theory has not been fully accepted, there is some accepted empirical evidence that both Spanish (Pamies Bertran, 1999) and Chinese (Lin and Wang, 2007) belong to the syllable-timed rhythmic group.

In the case of Chinese, syllabic segmentation is naturally induced by the basic association between the characters and their corresponding sounds. On the contrary, in the case of Spanish, as well as many other Western languages, syllabic segmentation is a phonetic property that does not exhibit a direct or explicit association with orthographic properties of the language.

Accordingly, syllabic segmentation, or syllabification, constitutes a problem of interest in some natural language processing applications. This problem can be addressed by means of either rule-based or data-driven approaches (Adsett et al., 2009). Syllabification algorithms based on finite-state transducers have been proposed for languages such as English and German (Kiraz and Mobius, 1998). For the purposes of the present work, we implemented our own rule-based syllabic segmentation algorithm for Spanish by following the work of Cuayahuitl (2004).

Three different strategies for Spanish word segmentation are studied in this work with the objective of determining the most appropriate segmentation scheme for Chinese-Spanish transliteration. These three strategies are: character segmentation (the simple division of a word into characters), syllabic segmentation (the division of a word according to Spanish syllabic phonetic units) and an intermediate segmentation to be referred to as sub-syllabic segmentation. The rest of this section is devoted to motivating and explaining this latter segmentation scheme.

The main motivation for the proposed sub-syllabic segmentation of Spanish words is the observed fact that, although they agree in most cases, syllabifications can often differ between Spanish and Chinese transliterated names. Consider, for instance, the examples presented in Figure 1. The first two examples illustrate cases in which the Chinese name contains fewer syllables than the corresponding Spanish name. On the other hand, the last three examples illustrate cases in which the Chinese name contains more syllables than the corresponding Spanish name.

Figure 1. Some examples of Chinese-Spanish name transliterations

A detailed analysis of the syllabic length ratios between Chinese and Spanish names in our experimental dataset (more details on the dataset are provided in section 4) reveals that the most common situation is that both the Chinese and Spanish names have the same number of syllables. This occurs in about 75% of the cases. Of the remaining 25% of cases, about 15% (and 10%) correspond to cases in which the Chinese versions of the names contain more (and fewer) syllables than their corresponding Spanish versions.

Further analysis shows that some clear patterns for sub-syllabic segmentation can be observed in those cases of Chinese transliterations containing more syllables than their corresponding Spanish versions, which is not the case for the opposite situation. Some of these patterns include the segmentation of Spanish diphthongs such as ue into u-e, which will generate the more appropriate segmentation sa-mu-el for the fourth example in Figure 1; the separation of some multiple consonant constructions such as br into b-r, which will provide the more appropriate segmentation a-b-ra-ham; and the separation of some ending consonants such as as into a-s, which will generate e-li-a-s. This sub-syllabic segmentation strategy is expected to improve the performance of the transliteration task, as it both reduces the vocabulary size of Spanish syllabic units and improves syllable correspondences between Chinese and Spanish. The complete set and sequence of rules implemented for sub-syllabic segmentation is presented in Figure 2.

Figure 2. Rules and their sequence of application for sub-syllabic segmentation

Notice that the proposed sub-syllabic segmentation strategy only addresses those cases in which the Chinese versions of the names contain more syllables than their corresponding Spanish versions. Addressing the opposite case would instead require the definition of rules for merging consecutive Spanish syllables. We have not considered this case for two reasons: first, according to our exploratory analysis of the data, there do not seem to be clear patterns for syllabic merging; and, second, a merging strategy would lead to an increase in the vocabulary of Spanish syllabic units, which is not desirable in terms of the resulting transliteration model's sparseness.

Notice also that those cases in which the Chinese versions of the names contain fewer syllables than their corresponding Spanish versions are basically unaddressed by our proposed segmentation strategy. This, however, should not constitute a problem in the case of Spanish-to-Chinese transliteration, as the transliteration model should just be required to learn how to discard some Spanish syllables. On the other hand, this certainly poses a problem for the case of Chinese-to-Spanish transliteration, as the transliteration model must be able to generate Spanish syllables from no Chinese correspondents. However, we still expect an overall gain, as the former case is more common than the latter.

3 Transliteration Approach

For implementing the transliteration system, we have used the Phrase-Based Statistical Machine Translation approach, which has been proven to be a good strategy for transliteration (Noeman, 2009; Jia et al., 2009). Within this approach, transliteration is performed as a machine translation task over substring units of both the source and the target languages. More specifically, we use the MOSES toolkit (Koehn et al., 2007).

Although several parameters can be varied in order to study their effect on the overall transliteration performance, we focus our study on three specific parameters, which we consider could have the largest incidence on quality, as well as make an important difference, for both transliteration directions under consideration: Spanish-to-Chinese and Chinese-to-Spanish.

The first parameter of interest is substring segmentation. While we only consider Chinese characters as substring units for Chinese, in the case of Spanish we consider three different types of substring units according to the three segmentation schemes described in the previous section. More specifically, characters, syllables and the proposed sub-syllabic units are considered for Spanish.

The other two parameters to be considered for evaluation purposes are the order of the target language model and the alignment strategy used for phrase extraction. In the case of the target language model, four different orders are compared, namely: 1-gram, 2-gram, 3-gram and 4-gram; and in the case of the alignment strategy, three different methods are compared, namely: source-to-target, target-to-source and grow-diag-final-and (Koehn et al., 2007).

Accordingly, our experimental work involves the construction of 72 different transliteration systems, considering 2 transliteration directions, 3 Spanish segmentation schemes, 4 target language model orders, and 3 alignment strategies. In each of these transliteration systems, the standard set of phrase-based features is used, which includes the forward and backward relative frequencies and lexical models, as well as the target language and phrase-length penalty models.

As the evaluation metric for assessing transliteration quality we use the BLEU score (Papineni et al., 2001). In the case of Spanish-to-Chinese transliterations, BLEU is computed at the Chinese character level. Similarly, and in order to make results among all three different Spanish segmentation schemes comparable, in the case of Chinese-to-Spanish transliterations, BLEU is computed at the character level too.

Finally, each of the implemented systems is tuned by means of the minimum error rate training procedure (Och, 2003), in which the BLEU score is maximized over a development dataset. Final system scores are computed over a test dataset, which is transliterated using the tuned parameters. More details on the datasets are provided in the following section.

4 Dataset Construction

As no named entity dataset is available for transliteration purposes between Spanish and Chinese, the first objective of this work was the creation of such a dataset. Despite the fact that Chinese and Spanish are the most spoken native languages in the world, the amount of bilingual resources for this specific language pair happens to be very scarce (Costa-jussa et al., 2011).
According to this, we used one of the few bilingual resources that are available, the Holy Bible (Table 1 presents the basic statistics of this dataset), for constructing an experimental dataset for transliteration research purposes.

Language   Sentences   Words     Vocab.
Chinese    29,887      781,113   28,178
Spanish    29,887      848,776   13,126

Table 1. Basic statistics of the Bible dataset

In this section we present a description of the procedure followed for creating the dataset, as well as the basic statistics and characteristics of the constructed dataset. The construction of the experimental dataset for transliteration can be summarized according to the following steps:

• A list of named entities was extracted from the Spanish side of the dataset. This extraction was conducted by using a standard labeling approach based on Conditional Random Fields (Lafferty et al., 2001). From this step, a list of 1,608 Spanish names was collected.

• A reduced list of named entities was generated by manually filtering the original list. In this process, some errors derived from the first automatic step were removed, as well as any valid named entity not belonging to the two basic categories of persons and places. In this second step, the list was reduced to 948 names.

• The corresponding Chinese versions of the names were extracted from the Chinese side of the dataset. This was done automatically by aligning both corpora at the word level (Och and Ney, 2000), and using the alignment links to identify the corresponding transliteration candidates for each Spanish name in the list.

• The automatically extracted list of corresponding Chinese names was manually cleaned. Because of the noisy nature of the alignment process, in several cases either more than one Chinese word was assigned to the same Spanish name or an erroneous Chinese word was selected. After this second filtering process, the final bilingual list of 841 names was obtained.

For the preparation of the experimental dataset, each side of the resulting corpus was segmented as follows: Chinese data was segmented at the character level, and Spanish data was segmented following the three segmentation schemes described in section 2: character-based, syllable-based and sub-syllabic. Two additional normalization processes were applied to the Spanish dataset: lowercasing and stress mark elimination. The total number of substring units and their vocabulary for each of the constructed versions of the dataset are presented in Table 2.

Dataset      Names   Substrings   Vocab.
Chinese      841     2,190        314
Spa (char)   841     4,766        24
Spa (sub)    841     3,005        108
Spa (syl)    841     2,165        491

Table 2. Names, substring units and vocabulary of substring units for each constructed dataset

As seen from the table, the three Spanish word segmentations to be studied exhibit significantly different properties in terms of the total amount of running substrings and the vocabulary size of substring units. Indeed, the proposed sub-syllabic segmentation strategy represents an intermediate compromise, in both substrings and vocabulary, between the character-based segmentation and the syllabic-based segmentation.

In order to be able to use the generated dataset under the statistical machine translation framework described in section 3, the resulting bilingual dataset of 841 names was finally split into three subsets: train (with 691 names), development (with 50 names) and test (with 100 names). Although a random sampling strategy was used for splitting the original corpus into the three experimental subsets, special attention was paid to not including in the development and test subsets any name that would have produced out-of-vocabulary substrings.

5 Experimental Results

In this section we present and discuss the experimental results corresponding to all 72 implemented transliteration systems. All experiments were conducted over the experimental datasets described in section 4, following the procedure described in section 3.

          src-2-trg   trg-2-src   g-d-f-a
1-gram    15.36       16.09       14.35
2-gram    18.98       21.87       19.43
3-gram    15.33       23.35       18.83
4-gram    18.19       24.05       19.85

Table 3a. BLEU scores for Spanish-to-Chinese systems with Spanish character segmentation
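The out-of-vocabulary-aware split described at the end of section 4 can be sketched as follows. The resampling loop and the character-level segmenter are assumptions for illustration; the paper does not detail the exact procedure used.

```python
import random

def split_dataset(pairs, segment, n_dev=50, n_test=100, seed=0):
    """Randomly split (Spanish, Chinese) name pairs into train/dev/test,
    re-drawing until no dev/test name contains a Spanish substring unit
    absent from the training vocabulary (no out-of-vocabulary units).
    'segment' is any segmenter, e.g. list for character-level units."""
    rng = random.Random(seed)
    pairs = list(pairs)
    while True:
        rng.shuffle(pairs)
        dev = pairs[:n_dev]
        test = pairs[n_dev:n_dev + n_test]
        train = pairs[n_dev + n_test:]
        vocab = set()
        for spanish, _ in train:
            vocab.update(segment(spanish))
        # accept the draw only if every dev/test unit is covered by training
        if all(set(segment(s)) <= vocab for s, _ in dev + test):
            return train, dev, test
```

With the paper's figures, `pairs` would hold the 841 cleaned name pairs and the defaults yield the 691/50/100 split; a syllabic or sub-syllabic segmenter could be passed in place of character-level `list`.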
Although we will focus our analysis on aggregated scores computed over different subsets of experiments, Tables 3a through 3f present individual system scores for all of the 72 implemented transliteration systems.

As seen from the tables, although individual results could by themselves exhibit some degree of noise, due to the random variability derived from both the dataset selection and the tuning processes, some clear and interesting trends can be observed from the results. For instance, notice how the best scores tend to be associated with language models of orders 3 and 4.

Similarly, it can be derived from the tables that the grow-diag-final-and alignment strategy tends to be the best alignment strategy only in those cases where the Spanish syllabic segmentation is used. Alternatively, it can be observed that in the other two cases, i.e. when Spanish character and sub-syllabic segmentations are used, the target-to-source alignment strategy is more beneficial for the Spanish-to-Chinese transliteration direction, while the source-to-target alignment strategy happens to be more beneficial for the Chinese-to-Spanish direction.

          src-2-trg   trg-2-src   g-d-f-a
1-gram    20.20       16.72       15.96
2-gram    15.58       22.85       15.37
3-gram    20.49       21.93       19.30
4-gram    21.80       21.72       19.17

Table 3b. BLEU scores for Spanish-to-Chinese systems with Spanish sub-syllabic segmentation

          src-2-trg   trg-2-src   g-d-f-a
1-gram    23.42       23.02       23.79
2-gram    25.27       24.28       31.98
3-gram    31.26       22.14       35.98
4-gram    30.83       24.41       35.48

Table 3c. BLEU scores for Spanish-to-Chinese systems with Spanish syllabic segmentation

          src-2-trg   trg-2-src   g-d-f-a
1-gram    38.38       33.96       35.58
2-gram    37.94       35.34       35.99
3-gram    35.41       39.34       37.21
4-gram    39.11       39.52       38.78

Table 3d. BLEU scores for Chinese-to-Spanish systems with Spanish character segmentation

          src-2-trg   trg-2-src   g-d-f-a
1-gram    40.17       36.53       39.94
2-gram    42.21       42.15       38.78
3-gram    39.67       43.03       40.89
4-gram    40.70       36.45       39.88

Table 3e. BLEU scores for Chinese-to-Spanish systems with Spanish sub-syllabic segmentation

          src-2-trg   trg-2-src   g-d-f-a
1-gram    37.50       30.74       37.77
2-gram    38.86       36.89       41.38
3-gram    38.66       37.20       40.83
4-gram    39.26       37.20       40.38

Table 3f. BLEU scores for Chinese-to-Spanish systems with Spanish syllabic segmentation

In order to get a better grasp of the general trends in transliteration quality along the dimensions of each of the experimental parameters under consideration, let us now look at the aggregated results along each individual parameter variation. In this sense, Figures 3a, 3b and 3c summarize transliteration quality variations with respect to n-gram order, alignment strategy and Spanish segmentation, respectively.

Let us first consider Figure 3a. This figure shows the relative variations of transliteration quality with respect to n-gram order. These values have been computed by aggregating all system scores along the alignment strategy and Spanish segmentation dimensions for each of the two transliteration directions under consideration. Additionally, the resulting scores have been normalized with respect to the unigram case. As seen from the figure, the n-gram order has a more critical incidence in the case of Spanish-to-Chinese transliteration than in the opposite transliteration direction.

In the case of Figure 3b, aggregation has been conducted along the n-gram orders and the Spanish segmentations. In this case, the resulting scores have been normalized with respect to the average score value for each transliteration direction. While grow-diag-final-and is the best alignment strategy for the Spanish-to-Chinese case, source-to-target alignments also happen to be a good strategy in the Chinese-to-Spanish case.
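These aggregations can be reproduced from the individual scores; the sketch below uses the Table 3a values and a plain arithmetic mean, which is an assumption, since the exact aggregation function is not specified in the text.

```python
# BLEU scores from Table 3a (Spanish-to-Chinese, character segmentation),
# indexed by (n-gram order, alignment strategy).
table_3a = {
    (1, "src-2-trg"): 15.36, (1, "trg-2-src"): 16.09, (1, "g-d-f-a"): 14.35,
    (2, "src-2-trg"): 18.98, (2, "trg-2-src"): 21.87, (2, "g-d-f-a"): 19.43,
    (3, "src-2-trg"): 15.33, (3, "trg-2-src"): 23.35, (3, "g-d-f-a"): 18.83,
    (4, "src-2-trg"): 18.19, (4, "trg-2-src"): 24.05, (4, "g-d-f-a"): 19.85,
}

def aggregate_by_order(scores):
    """Average BLEU over the alignment-strategy dimension for each n-gram order."""
    totals, counts = {}, {}
    for (order, _), bleu in scores.items():
        totals[order] = totals.get(order, 0.0) + bleu
        counts[order] = counts.get(order, 0) + 1
    return {order: totals[order] / counts[order] for order in totals}

def normalize_to_unigram(aggregated):
    """Express each aggregated score relative to the 1-gram case, as in Figure 3a."""
    base = aggregated[1]
    return {order: score / base for order, score in aggregated.items()}
```

The same pattern, aggregating along the other two dimensions and normalizing with respect to the per-direction average, yields the quantities plotted in Figures 3b and 3c.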
Notice, however, that relative variation of scores in Fig- ure 3b is actually very low (about 2%), which suggests that the alignment strategy has a low incidence on transliteration quality for the tasks Figure 3a. Transliteration quality variations in under consideration. terms of n-gram order Finally, let us consider Figure 3c, where the relative variations of transliteration quality with respect to the selected Spanish segmentation method are depicted. In this cases system scores have been aggregated along both the n-gram or- der and the alignment strategy dimensions, and normalized with respect to average scores at each transliteration direction. Notice from the figure how syllabic segmentation is clearly the best op- tion in the Spanish-to-Chinese transliteration di- rection, while the proposed sub-syllabic segmen- tation constitutes the best alternative in the Chi- nese-to-Spanish direction. This latter interesting result can be explained in terms of the mapping functions required to Figure 3b. Transliteration quality variations in map the corresponding substring units from one terms of alignment strategy language into the other, as the larger the source vocabulary the better the mapping function is. So, in the case of the Spanish-to-Chinese task, the syllabic segmentation must provide a better mapping as it allows for a vocabulary reduction mapping, as can be verified from the vocabulary column in Table 2. On the other hand, in the Chinese-to-Spanish task the proposed method for sub-syllabic segmentation is the one providing a vocabulary reduction (as can be verified from the vocabulary column in Table 2) that allows for a better mapping function. 6 Conclusions and Future Research Figure 3c. 
Transliteration quality variations in In this work, we have presented a comparative terms of Spanish segmentation method evaluation among three different Spanish seg- mentation strategies for Spanish-Chinese trans- It is evident, from Figure 3a, that the translit- literation, as well as two other important parame- eration tasks does not benefits from n-gram or- ters of the transliteration system implementation: ders larger than 2 in the Chinese-to-Spanish di- target language model order and alignment strat- rection, while it certainly does in the Spanish-to- egy for bilingual unit extraction. The translitera- Chinese case. This result can be explained by the tion task was implemented by means of Statisti- larger character vocabulary size of Chinese when cal Machine Translation, using Chinese charac- compared to Spanish segmentations. ters and Spanish sub-word segments as the tex- 46 tual units to be translated. The three different and English-Kannada Transliteration Tasks at Spanish segmentation strategies evaluated were: NEWS 2009, In Proceedings of ACL-IJCNLP 2009 character-based, syllabic-based and a proposed Named Entities Workshop (NEWS 2009), pages sub-syllabic segmentation scheme. Experimental 44-47, Singapore. results shown that syllabic-based segmentation, Marta R. Costa-jussa, Carlos A. Henriquez, and Ra- along with a language model of order 4 and the fael E. Banchs, 2011, Evaluating Indirect Strategies grow-diag-final-and alignment method, consti- for Chinese-Spanish statistical machine translation tutes the most effective strategy for Spanish-to- with English as pivot language, In Proceedings of Chinese transliteration, while the proposed sub- the 27th Conference of the Spanish Society for Natural Language Processing, Huelva, Spain. syllabic segmentation, along with a language model of order 2 and the source-to-target align- Heriberto Cuayahuitl, 2004, A Syllabification Algo- ment method, constitutes the most effective rithm for Spanish, in A. 
Gelbukh (Ed.): CICLing strategy for Chinese-to-Spanish transliteration. 2004, LNCS 2945, pages 412-415, Springer. As an additional contribution, and due to the Amitava Das, Asif Ekbal, Tapabrata Mondal, and lack of dataset for Chinese-Spanish translitera- Sivaji Bandyopadhyay, 2009, English to Hindi tion research, we have constructed an experimen- Machine Transliteration System at NEWS 2009, In tal parallel corpus containing a total of 841 Proceedings of ACL-IJCNLP 2009 Named Entities named entities in both Chinese and Spanish. Workshop (NEWS 2009), pages 80-83, Singapore. As future research work, we intend to expand Michel Divay and Anthony J. Vitale, 1997, Algo- the experimental dataset, as well as to continue rithms for grapheme-phoneme translation for Eng- evaluating the specific peculiarities of both Chi- lish and French: Applications, Computational Lin- nese-to-Spanish and Spanish-to-Chinese translit- guistics, 23(4):495-524. eration tasks. A comprehensive manual evalua- Rejwanul Haque, Sandipan Dandapat, Ankit Kumar tion on the experimental results described here Srivastava, Sudip Kumar Naskar, and Andy Way, should be conducted in order to identify both, 2009, English-Hindi Transliteration Using Con- possible improvements to the proposed Spanish text-Informed PB-SMT: the DCU System for sub-syllabic segmentation method and some ad- NEWS 2009, In Proceedings of ACL-IJCNLP 2009 ditional strategies for improving the performance Named Entities Workshop (NEWS 2009), pages of transliteration quality between Chinese and 104-107, Singapore. Spanish. Gumwon Hong, Min-Jeong Kim, Do-Gil Lee, and Hae-Chang Rim, 2009, A Hybrid Approach to Acknowledgments English-Korean Name Transliteration, In Proceed- The author would like to thank the Institute for ings of ACL-IJCNLP 2009 Named Entities Work- Infocomm Research (I2R) and the Agency for shop (NEWS 2009), pages 108-111, Singapore. 
Science, Technology And Research (A*STAR) Martin Jansche and Richard Sproat, 2009, Named for their support and permission to publish this Entity Transcription with Pair n-Gram Models, In work. Proceedings of ACL-IJCNLP 2009 Named Entities Workshop (NEWS 2009), pages 32-35, Singapore. References Yuxiang Jia, Danqing Zhu, Shiwen Yu, 2009, A Connie R. Adsett, Yannick Marchand, and Vlado Ke- Noisy Channel Model for Grapheme-based Ma- selj, 2009, Syllabification rules versus data-driven chine Transliteration, In Proceedings of ACL- methods in a language with low syllabic complex- IJCNLP 2009 Named Entities Workshop (NEWS ity: The case of Italian, Computer Speech and Lan- 2009), pages 88-91, Singapore. guages, 23(4): 444-463. G. A. Kiraz and B. Mobius, 1998, Multilingual syl- Yaser Al-Onaizan and Kevin Knight, 2002, Machine labification using weighted finite-state transducers, Transliteration of names in Arabic text,In Proceed- In Proceedings of the 3rd ESCA Workshop on ings of the ACL Workshop on Computational Ap- Speech Synthesis, Jenolan Caves, Australia. proaches to Semitic Languages, Philadelphia, PA. Keving Knight and Jonathan Graehl, 1998, Machine Mansur Arbabi, Scott M. Fischthal, Vincent C. Transliteration, Computational Linguistics, 24(4): Cheng, and Elizabeth Bart, 1994, Algorithms for 599-612. Arabic name transliteration, IBM Journal of Re- Kevin Knight, 2009, Automata for Transliteration and search and Development, 38(2):183-193. Machine Translation, In Proceedings of ACL- Manoj Kumar Chinnakotla, and Om P. Damani, 2009, IJCNLP 2009 Named Entities Workshop (NEWS Experiences with English-Hindi, English-Tamil 2009), page 27, Singapore. 47 P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, Federico, N. Bertoldi, B. Cowan, W. Shen, C. 2001, BLEU: a method for automatic evaluation of Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, machine translation, IBM Research Report RC- and E. 
Herbst, 2007, MOSES: Open source toolkit 22176. for statistical machine translation, In Proceedings Kenneth L. Pike, 1945, Step-by-step procedure for of the 45th ACL Annual Meeting, pages 177-180, marking limited intonation with its related features Prague, Czech Republic. of pause, stress and rhythm, in Charles C. Fries Oi Y. Kwong, 2009, Graphemic Approximation of (Ed.), Teaching and Learning English as a Foreign Phonological Context for English-Chinese Trans- Language, pages 62-74, Publication of the English literation, In Proceedings of ACL-IJCNLP 2009 Language Institute, University of Michigan, Ann Named Entities Workshop (NEWS 2009), pages Arbor. 186-193, Singapore. Feiliang Ren, Muhua Zhu, Huizhen Wang, and J. Lafferty, A. McCallum, and F. Pereira, 2001, Con- Jingbo Zhu, 2009, Chinese-English Organization ditional Random Fields: Probabilistic models for Name Translation Based on Correlative Expansion, segmenting and labeling sequence data, In Pro- In Proceedings of ACL-IJCNLP 2009 Named Enti- ceedings of the International Conference on Ma- ties Workshop (NEWS 2009), pages 143-151, Sin- chine Learning, pages 282-289. gapore. Haizhou Li, Min Zhang, and Jian Su, 2004, A joint Tao Tao, Su-Youn Yoon, Andrew Fister, Richard source-channel model for machine transliteration, Sproat and Cheng-Xiang Zhai, 2006, Unsupervised In Proceedings of the 42nd ACL Annual Meeting, Named Entity Transliteration Using Temporal and pages 159-166, Barcelona, Spain. Phonetic Correlation, In Proceedings of Empirical Methods in Natural Language Processing, pages Hua Lin and Qian Wang, 2007, Mandarin Rhythm: 22-23, Sydney, Australia. An Acoustic Study, Journal of Chinese Language and Computing, 17(3): 127-140. 
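As a concrete illustration of the three Spanish segmentation granularities compared in this paper, the sketch below segments a Spanish name at the character, syllable, and sub-syllable level. It is an illustrative approximation only: the syllabifier is a crude vowel-group splitter rather than a linguistically complete Spanish syllabification algorithm, and the onset/rhyme split is an assumed stand-in for the proposed sub-syllabic scheme, not the paper's exact method.

```python
import re

VOWELS = "aeiouáéíóúü"

def char_units(name):
    """Character segmentation: every letter is a transliteration unit."""
    return [c for c in name.lower() if c.strip()]

def syllable_units(name):
    """Very rough Spanish syllabification: a syllable is taken to be any
    leading consonants plus the following vowel group; trailing coda
    consonants are attached to the last syllable. Illustrative only."""
    word = name.lower()
    sylls = re.findall(rf"[^{VOWELS}\W\d]*[{VOWELS}]+", word)
    consumed = sum(len(s) for s in sylls)
    if sylls and consumed < len(word):
        sylls[-1] += word[consumed:]  # attach a final coda such as -s or -n
    return sylls

def sub_syllable_units(name):
    """One plausible sub-syllabic split (an assumption, not the paper's
    exact scheme): separate each syllable's onset from its rhyme."""
    units = []
    for syl in syllable_units(name):
        m = re.match(rf"([^{VOWELS}]*)(.*)", syl)
        units.extend(part for part in m.groups() if part)
    return units
```

For a name like "carlos" this yields three vocabularies of increasing coarseness: characters, then sub-syllabic units such as the onset "rl" and rhyme "os", then whole syllable-like chunks; the choice of granularity is exactly the trade-off the vocabulary column of Table 2 quantifies.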
Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 49–57, Chiang Mai, Thailand, November 12, 2011.
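The paper excerpted next mines transliteration pairs with a nonparametric Bayesian bilingual co-segmentation model in which segment pairs are drawn from a Dirichlet process. As orientation, here is a minimal sketch of the Chinese-restaurant-process predictive probability p(pair | others) = (N(pair) + α·G0(pair)) / (N + α) that such a model implies once the random distribution G is integrated out; the toy pair inventory and the uniform base measure G0 are assumptions for illustration.

```python
from collections import Counter

def crp_predictive(pair, other_pairs, alpha, g0):
    """Predictive probability of a (source, target) segment pair under a
    Dirichlet process with concentration alpha and base measure g0:
        p(pair | others) = (N(pair) + alpha * g0(pair)) / (N + alpha)
    where N(pair) counts the pair among the other pairs and N is the
    total number of other pairs."""
    counts = Counter(other_pairs)
    n = len(other_pairs)
    return (counts[pair] + alpha * g0(pair)) / (n + alpha)

# Toy corpus of already-segmented bilingual pairs (hypothetical data).
observed = [("do", "ド"), ("do", "ド"), ("a", "ア"), ("ryu", "リュー")]

# Assumed base measure: uniform over a small inventory of possible pairs.
INVENTORY_SIZE = 100
def g0(pair):
    return 1.0 / INVENTORY_SIZE

p_seen = crp_predictive(("do", "ド"), observed, alpha=1.0, g0=g0)
p_new = crp_predictive(("zz", "ッ"), observed, alpha=1.0, g0=g0)
# A frequently reused pair receives a much higher predictive probability
# than an unseen pair, which only gets the smoothed alpha * g0 mass.
```

This rich-get-richer behaviour is what lets the co-segmentation favour segmentations that reuse a compact set of bilingual units.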
The generation of a bilingual segment pair (s_k, t_k) follows a Dirichlet process model:

  G | α, G0 ∼ DP(α, G0)
  (s_k, t_k) | G ∼ G

where G0 is the base measure and the concentration parameter α > 0 controls how far G is allowed to vary from G0. Integrating out G gives the predictive probability of a pair (s_k, t_k) given all of the other pairs (s_−k, t_−k):

  p((s_k, t_k) | (s_−k, t_−k)) = (N((s_k, t_k)) + α G0((s_k, t_k))) / (N + α)

where N((s_k, t_k)) is the number of occurrences of the pair among the other pairs in the sample and N is the total number of such pairs.

  Japanese character sequence:  アン | ド | リュー  (a | do | riyuu)
  English character sequence:   an | d | roid
  Model score:                  0.034 | 0.012 | 10e-12

[Figure: mining workflow. Japanese and English Wikipedia document titles connected by interlanguage links (e.g. マイケル/Michael, ジャクソン/Jackson) are bilingually co-segmented (e.g. マイ|mi ケ|cha ル|el with segment scores −4.6, −7.3, −5.1); segment features, seed examples, and a threshold are used to train an SVM that separates good pairs from bad pairs. Test pairs are a randomly sampled subset.]

[Table: SVM features f1-f4, defined in terms of the co-segmentation log probability (logprob), the number of segments (numsegs), the minimum segment probability (minprob), and the lengths |s|, |t|, |sbad|, |tbad|.]

[Figure: precision, recall, and f-score as a function of the SVM classification threshold, for scores based on the average log probability of the segments and on the log probability of the least likely segment a priori.]

[Figure: precision-recall curves for En-Ar, En-Ch, En-Hi, En-Ru, En-Ta, and En-Ja, comparing the proposed method against lcsr, random, and baseline approaches.]

Product Name Identification and Classification in Thai Economic News

Nattadaporn Lertcheva and Wirote Aroonmanakun
Department of Linguistics, Chulalongkorn University
[email protected] [email protected]of the analysis will be presented in sections 4 and Abstract 5 followed by the conclusion. 2 Background Knowledge The purpose of this research is to analyze the Unlike a person name, an organization name, or patterns of the product names used in Thai a location name, which is normally used to refer economic news and to find clues that could be to one unique referent, a product name is used to used to identify the product names’ boundaries refer to many referents categorized under the and their categories. It is found that the same product. Product names are a kind of patterns of Thai product names are quite proper name because each is created to refer to a varied. Thirty two patterns are found in this certain product produced by a company. This study. While some clues like collocation and section describes the definition of product names the context of names can be used for and product categories used in this study. identifying product names, many of them cannot be identified by these means. This Although product named entity recognition has indicates that the task of product named entity been analyzed in Lertcheva and Aroonmanakun recognition is an interesting task for Thai (2009) which focused on linguistic analysis of language processing. the product names for solving product name identification, this paper furthers the study by 1 Introduction analyzing product names in detail using a larger corpus. Moreover, we will propose the pattern of Most named entity recognition research has been product names and describe the components used focused on person, location, and organization to classify different types of product. names. 
Though other proper names, such as biomedical names and product names, are 2.1 Definition of Product Names important in language processing, only a little To distinguish one product from the same research has been done on recognizing these products produced by other companies, names in Thai such as Lertcheva and trademarks or brand names are usually used in Aroonmanakun (2009). Since different types of the product names. However, previous research names have different patterns and characteristics, used the terms “product names”with different basic linguistic knowledge of the names is meanings. For example, Liu et al. (2005) defined needed for imposing any rules or features for any a product name as a name consisting of a trade rule-based or statistical-based named entity mark and product type, e.g. Nokia 3310. Nilsson recognition systems. This paper presents basic and Malmgren (2005) defined a product name as knowledge of Thai product names. A corpus of a term under brand names. In other words, a Thai economic news is used in analyzing product brand name consists of a trademark, a product names. Patterns and variations of their forms in name, and a service name. Trademarks have a texts are analyzed. In this paper, background broader scope than product names or service information of product names and relevant names. For example, Volvo is regarded as a research will be presented first. Then, the corpus trademark while Volvo C70 is considered a and annotation used in marking Thai product product name. Boonpaisarnsatit (2005) used the names will be described in section 3. The results 58 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 58–64, Chiang Mai, Thailand, November 12, 2011. term “product names” differently from Liu et al. 24. Farming products (2005) and Nilsson and Malmgren (2005). What 25. Cleaning products is called “product name” in Boonpaisarnsatit 26. Miscellaneous (2005) is actually a generic noun indicating a category of product. 
He referred to “brand 3 Corpus and Annotation names” as the combination of product name and To reveal patterns of product names in Thai, a trademark. For example, รถยนต์|โตโยต้า is analyzed corpus of Thai economic news is used. The as consisting of a product name รถยนต์ -‘car’ and a corpus size is 178,474 words, in which 2,463 trademark โตโยต้า -‘Toyota’. The use of a generic product names are found.1 Since the language noun when referring to a product is a used in the headlines usually has different style characteristic of referring to products in Thai. In from the body text, in this study, we analyze only this study, we use the term product name as the product names found in the body text of the defined in Lertcheva and Aroonmanakun (2009) news. TEI annotation style is used in marking up which is a linguistic expression consisting of a product names. A product name is tagged by generic noun, a brand name indicator, a brand using<productNametype=“Product’s_Category” name, a product type indicator, and a product >…</productName>. The annotation of the type. components in product names is as follows. 1. <genericNoun>.....</genericNoun> is used 2.2 Product Categories for tagging words used to describe the type of In product named entity recognition, the task product. For example, โทรศัพท์ มือถือ|โนเกีย consists of includes not only identifying the boundary but a compound noun, โทรศัพท์มือถือ-‘mobile phone’, also the type of the product. However, there is no and a brand name “Nokia”. Although the corpus standard classification of product category. In is collected from Thai economic news, generic this paper, we use the classification listed by the nouns are not always written in Thai script. Even Department of Export Promotion, Ministry of though English names can be transliterated using Commerce of Thailand and Wikipedia as a basis Thai script, they are often written in English. 
For of classification and divide the products into 26 example, the product name “LCD TV รุ่ นAN-LT categories as follows: 322 DU" begins with a generic noun in English 1. Foods “LCD TV” followed by a product type indicator 2. Medical devices in Thai รุ่ น-‘model’ and then the product type in 3. Pharmaceutical 4. Cosmetic and spa products English “AN-LT 322 DU”. Generic nouns can be 5. Eyewear brands a simple word, a compound, or a phrase e.g.อาหาร 6. Electrical products and parts / ทะเลแช่แข็ง-‘frozen sea food’. Electronics 2. <brandIndicator>.....</brandIndicator> is 7. Automotive / auto parts and accessories used to mark a brand indicator, or a word 8. Building materials and hardware items indicating the brand name. Brand indicators 9. Chemicals and plastic resins found in the corpus are limited to words like ตรา- 10. Printing products, paper and packaging ‘brand’, ยีห่ อ้ -‘brand’, ตระกูล-‘family’, เครื่ องหมายการค้า- 11. Machinery and equipment ‘trademark’, ชื่อ-‘name’, ผลิตภัณฑ์-‘product’, and แบ 12. Gems and jewelry รนด์-‘brand’. Brand indicators can be preceded by 13. Watches/Clocks 14. Bags/Footwear/Leather Products some prepositions like ภายใต้-‘under’, e.g. ภายใต้| 15. Textiles, garments and fashion ผลิตภัณฑ์ = ‘under’+‘product’, or it can be modified accessories by an adjective like ใหม่-‘new’, e.g. แบรนด์|ใหม่= 16. Sporting goods ‘brand’+‘new’. 17. Furniture and parts 3. <brandName>.....</brandName> is used to 18. Gift and decorative items/handicrafts mark the brand name of the product. The brand 19. Household products name is normally a trademark named for the 20. Home textiles products. Brand names are sometimes found 21. Toys and games 22. Stationery/Office supplies and 1 Equipment The corpus can be downloaded from 23. Tobacco http://pioneer.chula.ac.th/~awirote/Data- Nattadaporn.zip 59 written in English, such as, เครื่ องสําอาง|DHC = a Example: A + (B) + C = A + B + C or A + C generic noun ‘cosmetic’ + a brand name ‘DHC’ 2. 
[…] indicates that the component is 4. <proIndicator>..... </proIndicator> is the required in the pattern. markup for the product type indicator used to Example: [+brand] means that a brand identify the product type. Product type indicators indicator must be present in this pattern and must found in the corpus are รุ่ น-‘type’, ซี รีย-์ ‘series’, be the word ‘brand’. สู ตร-‘formula’, กลิ่น-‘scent’, รส/รสชาติ-‘taste’, ชนิด- 3. {…} indicates that at least one element in the braces must be present. ‘type’,ครอบครัวตระกูล-‘family’. These product type Example: {A + B} + C = A + C or B + C or A + indicators sometimes can be modified by an B+C adjective, such as รุ่ น+ใหม่ = ‘type’+‘new’. 4. | is used for marking the selection of only 5. <productType>.....</productType> is for one choice. tagging product subtype under the same brand Example: A|B + D = A + D or B + D name. It is found that either common nouns or From the 32 patterns found in the 2,463 proper nouns can be used as a product type. In product names, we can categorize them into 4 food product names, a common noun related to groups as follows: taste is likely to be used indicating its subtype, 1. Head Structures e.g. มาม่า+รส+ต้ มยํากุ้ง– ‘Mama’+‘taste’+‘spicy This pattern consists of one component, brand lemongrass with shrimp’. For technology name or product type, functioning as the head products, a proper noun is usually used to word. From all the product names, the pattern identify the subtype, e.g. the name ยาริ ส-‘Yaris’ is with the brand name as head is found in 39.26% used to indicate a specific model of the car, โต of the product names while the pattern with product type as head is found in 4.06% of the โยต้า+ยาริ ส–‘Toyota’+ ‘Yaris’ product names. Brand name 4 Product Name Identification This pattern is found when the product Product names in Thai consist of five name is used continuously in the text or in an components as stated in the previous section. illustration sentence. 
For example, <product However, the patterns can be varied. To identify Name type="cosSpa” ID=“P03”><brandName> a product name, its patterns and contextual clues จุยซ์บิวตี้</brandName></productName> is a name have to be examined. In this study, we found 32 consisting of only the brand name “Juice patterns of product names. These patterns can be Beauty”. categorized into 4 groups, head only, head- Product type initial, head-centre, and head-final (section 4.1). This pattern is found when the product Then, a study of context clues for identifying name is continuously referred to in the text. The product names is presented in section 4.2. product type can be either a common noun or a proper noun. For example, <product Nametype 4.1 Pattern of Product Names =“Elec”><productType>ธิ งค์แพดเอ็กซ์ 100อี </product Of the 32 patterns, brand name and product type Type></productName>has a proper name as the are the core part of the product name. A brand product type, “ThinkPad X 100E.” In the name is used to distinguish the product from the example, <productNametype="food"><product same one produced by other companies. A Type>หมูสับ</productType></productName>, the product type is usually used to differentiate product type is a common noun referring to similar products under the same brand name. “minced pork”. This pattern, in which only a Every pattern of product name would have the common word functions as the product type, is brand name as its core part. If the brand name is acceptable only if the same product is previously omitted, the product type would be used as the referred to using a product name pattern core part of the product name. These two containing a brand name. This is because, unlike components are essential in uniquely identifying a proper noun, a common noun by itself cannot the product. Therefore, ‘head’ in this paper refers specify what the product is. For example, we can to a brand name or a product type. 
use the product type “Jazz” without mentioning a The symbols used in the pattern of product brand name because the reader can understand names are described as follows. what we are referring to. In contrast, we cannot 1. (…) indicates the component that can be use a common word likeหมูสับ – “minced pork” as omitted in the product name. 60 the product name when first introduced in the </brandIndicator><brandName>อาร์ ทรี text since the readers cannot understand what the </brandName><genericNoun>ชาพร้ อมดื่ม product is. </genericNoun></productName> 2. Head-Initial Structures This example consists of a brand indicator This is the pattern in which the head is located “brand”, a brand name “Artea” and a generic at the beginning. This pattern consists of 4 sub- noun “tea”. patterns which account for 10.19% of the product Generic noun + brand name indicator | names. product type indicator + brand name + product Brand name + brand name indicator type [+brand] + generic noun Example: <productName type="cosSpa"> Example: <productName <genericNoun>ยาสีฟัน</genericNoun> type="gems"><brandName>ดามิอานิ <brandIndicator>ยี่ห้อ</brandIndicator> </brandName><brandIndicator>แบรนด์ <brandName>ฟลูโอคารี ล</brandName> </brandIndicator><genericNoun>เครื่ องประดับ <productType>40 พลัส</productType> </genericNoun></productName> This example consists of a brand name </productName> “Damiani”, a brand indicator “brand” and a This example consists of ageneric noun generic noun “jewelry”. “toothpaste”, a brand indicator “brand”, a brand Brand name + generic noun + product name “Fluocaril” and a product type “40 plus”. 
type indicator + product type Generic noun + brand name + product Example: <productName type=“Elec” ID=“P02”> type + generic noun <brandName> แบล็กเบอร์ รี่ </brand Name> Example:<productName type=“Auto”> <genericNoun>รถ</genericNoun><brandName> <proIndicator>รุ่น</proIndicator><productType> เชฟโรเลต</brandName><productType>โคโลราโด โบลด์</productType> </productName> </productType><genericNoun>ปิ คอัพอเมริ กนั พันธุ์แกร่ ง This example consists of a brand name “Blackberry”, a product type indicator “type” </genericNoun></productName> and a product type “Bold”. This example consists of a generic noun “car”, a Brand name + product type + (generic brand name “Chevrolet”, a product type noun) “Colorado” and a generic noun “American pick- Example: <productName type="food"> up”. <brandName>ไวตามิ ้ลค์</brandName> Brand name indicator + brand name + brand name indicator + generic noun <productType>โลว์ชกู าร์ </productType> Example: <productName type="fashion"> </productName> <brandIndicator>ไฟติ ้งแบรนด์ชื่อ</brandIndicator> This example consists of a brand name <brandName>จีแอนด์จี</brandName> (Guy&Girl) “Vitamilk” and a product type “Low sugar”. Product type + product type indicator <brandIndicator>แบรนด์</brandIndicator> Example: <productName type=“Elec”> <genericNoun>ชุดชันใน</genericNoun> ้ <productType>ธิงค์แพด</productType> </productName> <proIndicator>ซีรีส์</proIndicator> This example consists of a brand indicator </productName> “fighting brand”, a brand name “G&G”, a brand This example consists of a product type indicator “brand” and a generic noun “ThinkPad” and a product type indicator “underwear”. “series”. Generic noun + (brand name indicator) + 3. Head-Centre Structures brand name + (product type) + product type This is the pattern in which the head is located indicator + product type at the centre of the structure. 
This pattern Example: <productName type=“Auto”> consists of 5 sub-patterns which account for <genericNoun>รถ</genericNoun> 5.08% of the product names. <brandName>ฮอนด้ า</brandName> Generic noun | brand name indicator + <productType>ซิตี ้</productType><proIndicator> brand name + generic noun รุ่น</proIndicator><productType>ปี 2008 Example: <productName type="food" ID=“P02”><brandIndicator>แบรนด์ </productType></productName> 61 This example consists of a generic noun “car”, a structure can be used without causing any brand name “Honda”, a product type “City” a confusion because normally the product is product type indicator “type” and a product type previously referred to in the text. The preference “year 2008” for the head-final structure conforms to the 4. Head-Final Structures structure of a proper name in Thai, in which a Besides the pattern head only structure, this is proper name is preceded by a common noun the most commonly used structure in product indicating its class, e.g. โรงเรี ยน|สวนกุหลาบ= school+ names. The pattern has the head located at the ‘Suankularp’, วัด|บัวขวัญ= temple+‘Buakhwan’, etc. final part of the structure. This pattern consists of Therefore, readers will perceive the kind of 4 sub-patterns which account for 41.41% of the product before the name of products. e.g., ปลาราด product names. พริ ก|ตรา|ปลายิม้ = fish with a chili sauce + a brand (generic noun) + brand name indicator + brand name indicator ‘brand’ + a brand name ‘PlaYim’ Example: <productName type=“Elec”> 4.2 Clues for Identifying Product Names <genericNoun>โทรศัพท์เคลื่อนที่</genericNoun> To find contextual clues that would be useful in <brandIndicator>ภายใต้ แบรนด์</brandIndicator> identifying product names, words collocated with <brandName>แบล็กเบอร์ รี่</brandName> the product names and specific sentence patterns </productName> are examined as follows: This example consists of a generic noun “mobile 1. 
phone”, a brand indicator “under brand” and a brand name “Blackberry”.

(generic noun) + brand name indicator + generic noun + brand name indicator + brand name
Example: <productName type="food" ID="P01"><genericNoun>ข้าวสารบรรจุถุง</genericNoun><brandIndicator>ภายใต้แบรนด์</brandIndicator><genericNoun>ข้าว</genericNoun><brandIndicator>ตรา</brandIndicator><brandName>ฉัตร</brandName></productName>
This example consists of a generic noun “a bag of rice”, a brand indicator “under brand”, a generic noun “rice”, a brand indicator “brand” and a brand name “Chut”.

(brand name indicator [+brand]) + generic noun + brand name
Example: <productName type="food"><brandIndicator>แบรนด์</brandIndicator><genericNoun>น้ำผลไม้</genericNoun><brandName>แบรี่</brandName></productName>
This example consists of a brand indicator “brand”, a generic noun “juice” and a brand name “Berri”.

Generic noun + product type indicator + product type
Example: <productName type="Auto"><proIndicator>รุ่น</proIndicator><productType>ซีรีส์ 7 ซีดาน</productType></productName>
This example consists of a product type indicator “type” and a product type “Series 7 Sedan”.

Thai product names tend to be used with head-only and head-final structures respectively.

Word collocations

This section emphasizes the study of words collocated with product names. A preliminary observation shows that some words located in front of product names tend to have a meaning related to products, such as ‘seller’, ‘buyer’, ‘importer’, ‘sell’, ‘produce’, ‘import’, etc. To determine the efficacy of these words as indicators of product names, we analyzed the occurrence of every word found in front of a product name within a span of four words. Words occurring in the corpus fewer than 6 times were excluded. Then the percentage of how often each word collocated with product names was calculated and sorted. In this study, words with more than 50% co-occurrence with a product name are considered useful. Only three words are found with this criterion: ผู้ผลิต ‘a producer’, แนะนำ ‘introduce’ and ผู้แทนจำหน่าย ‘a dealer’. When the span is set to three words before the product name, only two words are found useful, namely แนะนำ ‘introduce’ and ผู้แทนจำหน่าย ‘a dealer’.

Although a preliminary observation intuitively indicates a close relation between the product name and its collocations, the result does not confirm that observation, because the co-occurrence percentages for most of the collocates are lower than 50%.

2. Illustration sentences

A sentence pattern that is found to be useful for identifying a product name is the sentence with illustration. In this pattern, product names are found as a list of illustrations after the words ได้แก่ ‘for example’ and เช่น ‘such as’. The last product name usually comes after the conjunction และ ‘and’. In the example ผู้จัดหาเสื้อผ้า|แบรนด์|เช่น|ลีวายส์|และ|แรงเลอร์ (clothing dealer + brand + such as + Levi’s + and + Wrangler), two product names are listed after the word เช่น ‘such as’.

5 Product Category Identification

The task of product named entity recognition includes not only identifying product name boundaries but also product categories. In this section, we describe the criteria used for identifying product categories. Of 2,463 product names, we found that only 1,603 (65%) can be assigned to a product category by considering either the components in the product name or contextual clues.

1. Components in the product name

Of those 1,603 names, the product categories of 1,172 can be determined by considering the components within the product names. The useful components are generic nouns, brand names, and product types.

Generic noun
Product categories can often be determined directly from the generic noun in the product name. For example, วิทยุ|โซนี่ (a radio + the brand name ‘Sony’) is categorized as ‘Electrical products’ because ‘radio’ is a subclass of electrical products. Similarly, น้ำดื่ม|สิงห์ (drinking water + the brand name ‘Singha’) is categorized as ‘Foods’ because ‘drinking water’ is a subtype of food.

Brand name
For some names, a part of the brand name is useful in identifying the category. For example, the brand name ไวตามิลค์ (Vitamilk) is categorized as ‘Foods’ because the word ‘milk’ occurs within the brand name. Likewise, ไอโฟน (iPhone) is categorized as ‘Electronic products’ because of the word ‘phone’, and the brand name เนสท์กาแฟ (Nescafé) is categorized as ‘Foods’ because the word ‘café’ in Thai (กาแฟ) means coffee.

Product type
In some cases, the product category can be inferred from the product type. For example, มาม่า|รส|หมูสับ (the brand name ‘Mama’ + the product type indicator ‘taste’ + the product type ‘minced pork’) can be categorized as ‘Foods’ because of the product type ‘minced pork’.

2. Contextual clues

When the components in a product name cannot be used to identify its category, a contextual clue, taken from a previous mention of the product name in the text, is used. We found that the categories of 431 product names can be identified by referring back to the same product names previously presented in the text. If a product is referred to more than once in a text, its category is usually identified from the components inside the first mention of the name. When the same product is referred to again in a reduced form, its category can be inferred from the previous mention.

In sum, based on the analysis of 2,463 product names, we found that categories can be identified for only 65% of them, either by analyzing the components inside the product name (1,172 names) or by referring to a previous mention of the product name (431 names). The remaining 860 product names (35%) cannot be assigned to categories by these means; background knowledge is needed, typically for well-known products such as โค้ก ‘Coke’, แพนทีน ‘Pantene’, etc. Thus, product category identification is not an easy task.

6 Conclusion

This study concerns both product name and product category identification. A linguistic analysis of Thai product names was carried out to reveal patterns of product names and clues that would be useful for product named entity recognition in Thai.

Though there is some preference for the head-only and head-final structures in Thai product names, the patterns of Thai product names are quite varied, and there is no explicit clue for identifying a product name. Using collocates alone seems insufficient for identifying product names.

For product category identification, some internal clues can be found in the components of the product names. Keeping track of products referred to in the discourse can also help in identifying the category when a name is used in a reduced form. However, the categories of a number of product names cannot be identified by these means.

Thai product named entity recognition is therefore not an easy task, and further research on this topic is needed. A general named entity recognition model should be implemented to verify whether the models that have been used in Thai named entity recognition can resolve this problem. We think that a named entity recognizer that uses both word forms and part-of-speech sequences should suffice for identifying product name boundaries. Product category identification, if needed, should be implemented separately, by keeping track of previously found product names and creating semantic relations between the product names and contextual words.
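The two-step procedure of Section 5 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the clue lists and helper names are invented for the example, standing in for the generic-noun, brand-name, and product-type clues described above, with a backoff to a previous mention of the same product.

```python
# Illustrative sketch of two-step product category identification:
# (1) look for category clues inside the product name's components,
# (2) otherwise back off to an earlier, fuller mention of the product.
# The clue lists here are hypothetical examples, not the paper's resources.

CATEGORY_CLUES = {
    "Foods": ["water", "rice", "milk", "cafe"],
    "Electrical products": ["radio", "phone", "tv"],
}

def category_from_components(components):
    """Check generic nouns, brand names and product types for category clues."""
    for text in components:
        for category, clues in CATEGORY_CLUES.items():
            if any(clue in text.lower() for clue in clues):
                return category
    return None

def identify_category(components, previous_mentions):
    """Component clues first; otherwise back off to a previous mention."""
    category = category_from_components(components)
    if category is not None:
        return category
    # A reduced form inherits the category inferred for the first full mention.
    for mention in previous_mentions:
        category = category_from_components(mention)
        if category is not None:
            return category
    return None  # background knowledge would be needed (e.g. 'Coke', 'Pantene')
```

For instance, `identify_category(["iPhone"], [])` resolves via the component clue "phone", while a reduced form like `["Singha"]` only resolves if an earlier mention such as `["drinking water", "Singha"]` is available.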
Acknowledgments

This research is supported by the Thailand Research Fund (TRF) under grant no. MSG53Z0007, and partially supported by the Chulalongkorn University Centenary Academic Development Project.

Mining Multi-word Named Entity Equivalents from Comparable Corpora

Abhijit Bhole                    Goutham Tholpadi               Raghavendra Udupa
Microsoft Research India         Indian Institute of Science    Microsoft Research India
Bangalore, India                 Bangalore, India               Bangalore, India
[email protected] [email protected] [email protected]Abstract general problem of MWNE equivalents from com- parable corpora. Named entity (NE) equivalents are useful In the MWNE equivalents mining problem, in many multilingual tasks including MT, a NE in the source language could be related transliteration, cross-language IR, etc. to a NE in the target language by, not just Recently, several works have addressed transliteration, but a combination of translitera- the problem of mining NE equivalents tion, translation, acronyms, deletion/addition of from comparable corpora. These methods terms, etc. To give an example, Figure 1 shows usually focus only on single-word NE a pair of comparable articles in English and equivalents whereas, in practice, most Hindi. ‘Sachin Tendulkar’ and ‘.sa;a;.ca;na .tea;nqu +.l+.k+.=’ NEs are multi-word. In this work, we are MWNE equivalents, and both words have present a generative model for extracting been transliterated. Another example is the pair equivalents of multi-word NEs (MWNEs) ‘Siddhivinayak Temple Trust’ and ‘;a;sa;a:;Ädâ ; a;va;na;a;ya;k from a comparable corpus, given a NE ma;a;nd: / = siddhivinayak mandir’. Here, the tagger in only one of the languages. We first word has been transliterated, the second one show that our method is highly effective translated, and the third omitted in Hindi. The task on three language pairs, and provide a is to (a) identify these MWNEs as equivalents, detailed error analysis for one of them. (b) infer the word correspondence between the MWNE equivalents, and (c) identify the type of 1 Introduction correspondence (transliteration, translation, etc.). NEs are important for many applications in natu- Such NE equivalents would not be mined cor- ral language processing and information retrieval. rectly by the previously mentioned approaches as In particular, NE equivalents, i.e. 
the same NE they would mine only the pair (Siddhivinayak, expressed in multiple languages, are used in sev- ;a;sa;a:;Ädâ ; a;va;na;a;ya;k). In practice, most NEs are multi- eral cross-language tasks such as machine trans- word and hence it makes sense to address the prob- lation, machine transliteration, cross-language in- lem of mining MWNE equivalents. formation retrieval, cross-language news aggrega- To the best of our knowledge, this is the first tion, etc. Recently, the problem of automatically work on mining MWNEs in a language-neutral constructing a table of NE equivalents in multi- manner. ple languages has received considerable attention In this work, we make the following contribu- from the research community. One approach to tions: solving this problem is to leverage the abundantly • We perform an empirical study of MWNE available comparable corpora in many different occurrences, and the issues involved in min- languages of the world (Udupa et al., 2008; Udupa ing (Section 2). et al., 2009a; Udupa et al., 2009b). While consid- erable progress has been made in improving both • We define a two-tier generative model for recall and precision of mining of NE equivalents MWNE equivalents in a comparable corpus from comparable corpora, most approaches in the (Section 4). literature are applicable only to single-word NEs, and particularly to transliterations (e.g. Tendulkar • We propose a modified Viterbi algorithm and .tea;nqu +.l+.k+.=). In this work, we consider the more for identifying MWNE equivalents, and 65 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 65–72, Chiang Mai, Thailand, November 12, 2011. Mumbai, July 29: Sachin Tendulkar will make his Bollywood debut with a cameo role in a film about the miracles of Lord Ganesh. Tendulkar, widely regarded as one of the world's best batsmen, will play himself in Vighnaharta Shri Siddhivinayak," a film about the god, who is sometimes referred to as Siddhivinayak. 
"He will play a small role, as himself, either in a song sequence or in an actual scene," said Rajiv Sanghvi, whose company is handling the film's production. Tendulkar's office confirmed the cricketer would be shooting for the film after he returns from Sri Lanka where India is touring at the moment. Tendulkar, a devotee of Ganesh, had offered to be a part of the project and will not be charging for the role. The film is being produced by the Siddhivinayak Temple Trust, which looks after a famous temple dedicated to Ganesh in Mumbai. [ अपनी बल्ऱेबाजी से दनु नया भर के क्रिकेटप्रेममयों को अपना दीवाना बनाने वाऱे ]/O [ सचिन तें डुऱकर ]/[ Sachin Tendulkar ] [ अब ]/O [ बॉऱीवुड ]/[ Bollywood ] [ में पदापपण करने जा रहे हैं और गणपनि पर बनने वाऱी एक क्रिल्म में वह नजर आएॉगे ]/O [ गणपनि के परमभक्ि ]/O [ सचिन ]/[ Sachin ] [ ' ]/O [ ववध्नहतता ससविववनतयक ]/[ Vighnaharta Shri Siddhivinayak ] [ ' क्रिल्म में एक सॊक्षऺप्ि भूममका ननभाएॉगे ]/O [ क्रिल्म का ननमापण ]/O [ ससविववनतयक मंदिर ]/[ Siddhivinayak Temple Trust ] [ न्यास कर रहा है , जो मॊबु ई के प्रभादे वी इऱाके में स्थिि इस मशहूर मॊददर की दे खरे ख करिा है ]/O [ न्यास के प्रमख ु ]/O [ सभ ु तष मतयेकर ]/[ Subhash Mayekar ] [ ने कहा ]/O [ सचिन ]/[ Sachin ] [ कई साऱ से ननयममि रूप से इस मॊददर में आ मीडिया की खबरों के अनस ु ार क्रिल्म के ननमापण से जड़ ु ी कॊपनी के प्रमख ु ]/O [ रतजीव संघवी ]/[ Rajiv Sanghvi ] [ ने कहा ]/O [ सचिन ]/[ Sachin ] [ की इसमें सॊक्षऺप्ि भूममका होगी ]/O [ वह ]/O [ सचिन तें डुऱकर ]/[ Sachin Tendulkar ] [ के रूप में ही नजर आएॉगे ]/O Figure 1: An example of MWNE mining. for inferring correspondence information E.g. (Mahatma Gandhi, ma;h;a;tma;a ga;<a;Da;a) where (Section 4.3). (Mahatma, ma;h;a;tma;a) and (Gandhi, ga;<a;Da;a) are transliterations. • We evaluate the method on three language pairs (involving English (En), Arabic (Ar), 2. At least one word in the Hindi MWNE is Hindi (Hi) and Tamil (Ta)) (Section 6). 
a translation of some word in the English MWNE while the remaining words are In our method, we assume the existence of the fol- transliterations. E.g. (New Delhi, na;IR ; a;d; +:a lowing linguistic resources: a NE tagger, a transla- nai dillee) where (New, na;IR ) is a trans- tion model, a transliteration model, and a language lation and (Delhi, ; a;d; +:a) is a transliteration. model. We show good mining performance for En-Hi and En-Ta. We perform error analysis for 3. MWNEs contain abbreviations (initials). E.g. En-Ar, and identify sources of error (Section 6.5). (M. K. Gandhi, O;;ma. :ke. ga;<a;Da;a) where (M, O;;ma) and (K, :ke) are initials. 2 Empirical Study of Multi Word NE Equivalents 4. One-to-one correspondence between the words in the English and Hindi MWNEs. To understand the various issues in mining E.g. (New Delhi, na;IR ; a;d; +:a) MWNE equivalents from comparable corpora, we took a random sample of 100 comparable 5. One-to-many correspondence between the En-Hi news article pairs from the Indian news words in the English and Hindi MWNEs. portal WebDunia 1 . The English articles had 682 E.g. (Card, :pra;Za;~ta;a :pa:a prashasti unique NEs of which 252 (37%) were person patr). names, 130 (19%) were location names, and 300 (44%) were organization names. A substantial 6. Many-to-one correspondence between the percentage of the names comprised of more than words in the two MWNEs. E.g. (Air force, one word: locations 25%, person names 96%, and va;a;yua;sea;na;a vayusena). organizations 98%. For each English MWNE, we 7. Sequential correspondence between words in manually identified its equivalent (if any) in the the two MWNEs. E.g. (High Court, o+.a;ta;ma comparable Hindi article. We observed that the nya;a;ya;a;l+.ya ucchatam nyayalay) where MWNEs studied usually conformed to one/some (High, o+.a;ta;ma) and (Court, nya;a;ya;a;l+.ya) are of the following characteristics: equivalents. 1. Each word in the Hindi MWNE is a translit- 8. 
8. Non-sequential correspondence between words in the two MWNEs. E.g. (Battle Honour Gurais, गुरैस युद्ध सम्मान gurais yuddha sammaan) where the correspondence is (Battle, युद्ध), (Honour, सम्मान) and (Gurais, गुरैस).

9. Some words in the English MWNE do not have an equivalent in the Hindi MWNE. E.g. (Department of Telecommunication, दूरसंचार विभाग doorsanchaar vibhaag) where ‘of’ does not have a counterpart in the Hindi MWNE.

10. Acronym transliteration by transliterating each character separately. E.g. (IRRC, आई आर आर सी ai aar aar si) and (RBC, आर बी सी aar bi si).

11. Acronym transliteration by transliterating the acronym as a whole. E.g. (SAARC, सार्क saark) and (TRAI, ट्राई traai).

Our study revealed that each of the above characteristics is statistically important. Nearly 37% of location names and 77% of organization names involved both transliteration and translation. 12% of person names, 30% of location names and 45% of organization names had either one-to-many or many-to-one correspondence between words. 36% of organization names had non-sequential correspondence between words. These statistics clearly indicate that MWNEs need special treatment, and any non-trivial MWNE equivalent mining technique must take into account the characteristics described above.

3 Problem Description

Given a pair of comparable documents in different languages, we wish to extract a set of pairs of MWNEs, one in each language, that are equivalent to each other. We are given a NE tagger in one of the languages, dubbed the source language, while the other language is called the target language (denoted with subscripts s and t). We are given a document pair (d_s, d_t) and the NEs in d_s, i.e. {N_i}_{i=1}^m, and we want to find all possible NEs in d_t which are equivalent to some N_i. The problem now reduces to finding sequences of words in d_t that are equivalent to some N_i's.

In the example in Figure 1, {N_i}_{i=1}^m = {(Sachin, Tendulkar), (Lord, Ganesh), (Siddhivinayak, Temple, Trust), ...}. We want to extract the set {(Sachin Tendulkar, सचिन तेंडुलकर), (Siddhivinayak Temple Trust, सिद्धिविनायक मंदिर), ...}.

4 Mining algorithm

4.1 Key idea

We model the problem of finding NE equivalents in the target sentence T using source NEs as a generative model. Each word t in the target sentence is hypothesized to be either part of a NE, or generated from a target language model (LM). Thus, in the generative model, the source NEs N's plus the target language model constitute the set of hidden states. The t's are the observations. We want to align states and observations, i.e. determine which state generated which observation, and choose the alignment that maximizes the probability of the observations. The probability of generating a target word t from a source NE state N depends on

• whether N is itself multi-word; if so, each word in N acts as a substate and can generate t.

• the context (the words preceding t in T); note that the length of the context window for t depends on the length of the source NE generating t, and is not a fixed parameter.

• the relationship (transliteration or translation) between the state/substate and the target word.²

Dynamic programming (DP) approaches are usually used to compute the best alignment, but this fails here as the context size varies for each NE. Hence, we posit the generative model at two levels:

1. A sentence-level generative model (SGeM), where each word in the target sentence is generated either by the target LM or by one of the source NEs.

2. A generative model for the NE (NEGeM), where each word in the target NE is generated by one of the substates of the source NE.

This is illustrated by the example in Figure 2. The portions ‘मंगलवार को’ and ‘के छात्रों ने अपने’ of the Hindi sentence are generated by the language model. ‘साउथेम्पटन यूनिवर्सिटी’ is generated by the English NE ‘University of Southampton’. Note that without using the language model, ‘के’ would have been incorrectly aligned with ‘of’. Another example is ‘एम के गांधी ...’, which is equivalent to the NE “M. K. Gandhi”. Here, ‘के’ is likely to be a part of the NE. The language model not only reduces false positives but also disambiguates NE boundaries.

Figure 2: Generation of a Hindi sentence from an English NE.

4.2 Generative Model

SGeM  Let T = t_1 ... t_n be the target sentence and N = {N_i}_{i=0}^m be the hidden states (as before), where N_0 is the target LM state. In the SGeM, we want to predict the hidden state used to produce the next target term t_i. Let a_i = j if t_i is generated by N_j. We find an alignment A = a_1 ... a_n which maximizes

P(T, A | N) = ∏_{i=1}^{n} P(a_i | a_1^{i−1}, N_{a_i}) · P(t_i | t_1^{i−1}, N_{a_i})    (1)

By choosing which source NE generates each target term, this model also controls the length of the target NE equivalent to a source NE.

Let t_{k_i} ... t_{i−1} be the context for t_i (all these terms are aligned to N_{a_i}). Then

P(t_i | t_1^{i−1}, N_{a_i}) = P(t_i | t_{k_i}^{i−1}, N_{a_i})

Controlling target NE length  In the SGeM, P(a_i | a_1^{i−1}, N_{a_i}) is the probability that N_{a_i} will generate t_i. To compute this, we first note that, for a given term t_i, either a_i = a_{i+1}, i.e. N_{a_i} continues to generate beyond t_i, or a_i ≠ a_{i+1}, i.e. N_{a_i} terminates at t_i. The probability of continuation depends on the length L of N_{a_i} and the length l of the target NE generated so far by N_{a_i}. Based on empirical observations, we defined a function f(l, L) as

f(l, L) = 0        for l ∉ {L−2, ..., L+2}
        = 1 − ε    for l ∈ {L−2, L−1, L}
        = ε        for l ∈ {L+1, L+2}

where f(l, L) is the probability of continuation, 1 − f(l, L) is the probability of termination, and ε is a very small number. We now define

P(a_i | a_1^{i−1}, N) = p_NE                               if a_{i−1} = 0
                      = f(i − k_i, l_{a_i})                if a_{i−1} ≠ 0, k_i < i
                      = 1 − f(i − k_{i−1}, l_{a_{i−1}})    if a_{i−1} ≠ 0, k_i = i

where the probabilities on the right are for beginning a NE, continuing a NE, and terminating a previous NE, respectively.

NEGeM  To model the generation of the target term t_i given the context t_{k_i}^{i−1} and (the substates of) the source NE N_j, we let N_j = n_{j1}, ..., n_{jL_j}, where n_{jp} is a substate. The internal alignment B = b_{k_i}, ..., b_i is defined such that b_p = s if t_p is generated by n_{js}. We get

P(t_i | t_{k_i}^{i−1}, N_{a_i}) = Σ_B ∏_{p=k_i}^{i} P(b_p | b_{p+1}^{i}) · P(t_p | n_{a_i b_p})    (2)

To model the relationship between the source and target terms, we introduce variables in a fashion similar to the introduction of B in (2). Let R = r_{k_i}, ..., r_i, where r_p ∈ {transliteration, translation, acronym, none}, such that t_p and n_{j b_p} have the relationship r_p. Then³

P(t_j | n_{a_i b_j}, r_j) = m_tlat · P_tlat(t_j | n_{a_i b_j})^{r_tlat}    if r_j = translation
                          = m_tlit · P_tlit(t_j | n_{a_i b_j})^{r_tlit}    if r_j = transliteration
                          = δ[t_j ≡ n_{a_i b_j}]                           if r_j = acronym
                          = P_lm(t_j)                                      if r_j = none

The four probability terms on the right are obtained, respectively, from a translation model, a transliteration model⁴, an acronym model⁵, and a language model.

²We also use another relationship for letters in acronyms that are transliterated.
³δ[x] = 1 if condition x is true.
⁴A character-level extended HMM described in (Udupa et al., 2009a).
⁵A mapping from source language alphabets to target language transliterations of the alphabets.
We find an alignment A = a1 . . . an which i.e. Nai terminates at ti . The probability of maximizes continuation depends on the length L of Nai and the length l of the target NE generated so far by P (T, A |N ) = Nai . Based on empirical observations, we defined n a function f (l, L) as ∏ ( ) ( i−1 ) P ai ai−1 1 , Nai P ti t1 , Nai (1) i=1 f (l, L) = 0 for l ∈ / {L − 2, L + 2} = 1 − for l ∈ {L − 1, L} By choosing which source NE generates each tar- get term, this model also controls the length of the = for l ∈ {L + 1, L + 2} target NE equivalent to a source NE. where f (l, L) is the probability of continuation, Let tki . . . ti−1 be the context for ti (all these and 1 − f (l, L) is the probability of termination. terms are aligned to Nai ). Then is a very small number. We now define ( ) ( ) i−1 ( ) P ti ti−1 1 , N a i = P t i t ki , N a i P ai a1i−1 , N = pN E if ai−1 = 0 NEGeM To model the generation of the target term ti given the context ti−1 and (the substates of) = f (i − ki , lai ) if ai−1 6= 0, ki < i 1 ( ) the source NE Nj , we let Nj = nj1 , . . . , njLj = 1 − f i − ki−1 , lai−1 if ai−1 6= 0, ki = i where njp is a substate. The internal alignment B = bki , . . . , bi is defined such that bp = s if tp is where the probabilities on the right are for begin- generated by njs . We get ning an NE, continuing an NE, and terminating a ( ) previous NE, respectively. P ti ti−1 ki , N ai = 3 δ [x] = 1 if condition x is true 4 ∑ i ∏ A character-level extended HMM described in (Udupa et ( ) ( ) al., 2009a). P bp bip+1 P tp nai bp (2) 5 A mapping from source language alphabets to target lan- B p=ki guage transliterations of the alphabets. 68 4.3 Modified Viterbi algorithm • Similarly, for a probability p given by We use the dynamic programming framework to the translation model, we calculate do the maximization in (1). 
For each target term ti , p0tlat = mtlat prtlat where rtlat ∈ R, for each source NE Nj , the subproblem is to find mtlat ∈ (0, +∞) the best alignment a1 . . . ai such that ai+1 6= ai In our experiments, we found that transliteration i.e. ti is the last term in the equivalent of Nj . probabilities were quite low compared to the oth- subproblem [i, j] = ers, followed by the translation probabilities. So, ( i−1 ) ( i−1 ) we used the following procedure to tune these pa- max P ai = j 6= ai+1 a1 , Nj P ti t1 , Nj ai1 rameters use a small hand-annotated set of docu- ment pairs. Let l be the length of the target NE ending at ti , based on the alignment so far. The first probability 1. Initially set pN E = +∞, and all other pa- term becomes rameters to zero. ( i ) P ai−l−1 6= ai−l = j 6= ai+1 |Nj 2. Tune rtlit to find as many of the transliter- = α × f (l, Lj ) (1 − f (l + 1, Lj )) ations as possible. Then, use mtlit to fine- This is non-zero only for certain values of tune it to improve precision without losing l, for which we can construct the solution to too much on recall. subproblem [i, j] using solutions for i = l. 3. Next, tune rtlat to find as many of the trans- Denote k = i − l, then lations as possible. Then, use mtlat to fine- subproblem [i, j] = tune it to improve precision without losing ( ) too much on recall. max subproblem [k − 1, j] × negem tpk , Ni j6=i 4. The system is now finding as many NEs as where the procedure negem computes the proba- possible, but it is also finding noise. Keep bility that a given sequence of target words is an lowering pN E to allow the language model equivalent of the given source NE. This procedure LM to absorb more and more noise. Do this solves a second (independent) DP problem (for until NEs also begin to get absorbed by LM. the NEGeM), constructed in a similar fashion. 
It also models conditions such as “If a target term is 6 Empirical Evaluation a transliteration, it cannot map to more than one source substate.” In this section, we study the overall precision and The output of the system is a set of MWNE recall of our algorithm for three different language pairs. For each pair, we also give the internal pairs. English (En) is the source language, and alignment between the words of the two NEs. Hindi (Hi), Tamil (Ta) and Arabic (Ar) are the tar- get languages. Hindi belongs to the Indo-Aryan 5 Parameter Tuning family, Tamil belongs to Dravidian family, and The MWNE model has five user-set parameters. Arabic belongs to the Semitic family of languages. These need to be tuned appropriately in order to be The results show that the method is applicable for able to compare probabilities from different mod- a wide spectrum of languages. els. In the following, we describe the parameters 6.1 Linguistic Resources and a systematic way to go about tuning them. Models We need four models (translation, • pN E ∈ (0, +∞) specifies how likely are we transliteration, language, and acronym) in order to to find an NE in a target sentence run the proposed algorithm. For a language pair, • Given a probability p returned by the we learnt these models using the following kinds transliteration model, the probability of data, which was available to us: value used for comparisons p0tlit is cal- • A set of pairs of NEs that are transliterations, culated as p0tlit = mtlit prtlit where to train the transliteration model rtlit ∈ R, mtlit ∈ (0, +∞). rtlit is tuned to boost/suppress p; mtlit is also used similarly, • A set of parallel sentences, to learn a transla- but to get more fine-grained control. tion model 69 Lang. Translit. Word Monolin. identifies transliterations in the target document. 
6 Empirical Evaluation

In this section, we study the overall precision and recall of our algorithm for three different language pairs. For each pair, English (En) is the source language, and Hindi (Hi), Tamil (Ta) and Arabic (Ar) are the target languages. Hindi belongs to the Indo-Aryan family, Tamil belongs to the Dravidian family, and Arabic belongs to the Semitic family of languages. The results show that the method is applicable for a wide spectrum of languages.

6.1 Linguistic Resources

Models  We need four models (translation, transliteration, language, and acronym) in order to run the proposed algorithm. For a language pair, we learnt these models using the following kinds of data, which were available to us:

• A set of pairs of NEs that are transliterations, to train the transliteration model

• A set of parallel sentences, to learn a translation model

• A monolingual corpus in the target language, to train a language model

• A dictionary mapping English alphabets to their transliterations in the target language

One can get an idea of the scale of the linguistic resources used from Table 1.

Lang.   Translit. pairs   Word pairs   Monolin. corpus
En-Hi   15K               634K         23M words
En-Ta   17K               509K         27M words
En-Ar   30K               8.2M         47M words
(1K = 1 thousand, 1M = 1 million)

Table 1: Training data for the models.

Source language NER  The Stanford NER tool (Finkel et al., 2005) was used for obtaining a list of English NEs from the source document.

6.2 Corpus for MWNE mining

For each language pair, a set of comparable article pairs is required. The article pairs for En-Hi and En-Ta were obtained from news websites⁶, where the article correspondence was obtained using a method described in (Udupa et al., 2009b). En-Ar article pairs were extracted from Wikipedia using inter-language links.

Preprocessing  The Stanford NER tags each word in the source document as a person, location, organization or other. A continuous sequence of identical tags was treated as a single MWNE. Completely capitalized NEs were treated as acronyms. For each acronym (e.g. “FIFA”), both the acronym version (“FIFA”) and the abbreviation version (“F I F A”) were included in the list of source NEs. Each target document was sentence-separated and tokenized using simple rules based on the presence of newlines, punctuation, and blank spaces. If a word can be constructed by concatenating strings from the acronym model, it is treated as an acronym, and the acronym strings are separated out (e.g. ‘एमके’ emke is changed to ‘एम के’ em ke).

6.3 Experimental Setup

Annotation  Given an article pair, a human annotator looks through the list of source NEs, and identifies transliterations in the target document. For MWNEs, the annotator also marks which word in the source corresponds to each word in the target MWNE. This constitutes gold standard data that can be used to measure performance. 120 article pairs were annotated for En-Hi, 120 for En-Ta, and 36 for En-Ar.

Evaluation  The NEs mined from one article pair are compared with the gold standard for that pair, and one of three possible judgements is made:

• Fully matched (if it fully matches some annotated NE, both source and target).

• Partially matched (if the source NEs match, and the mined target NE is a subset of the gold target NE).

• Incorrect match (in all other cases).

The algorithm is agnostic of the type of the NE (Person, Organization, etc.), so reporting the precision and recall for each NE type does not provide much insight into the performance of the method. Instead, we report at different levels of match (full or partial), and for different categories of MWNEs: single-word transliteration equivalents (SW), multi-word transliteration equivalents, including acronyms (MW-Translit), and multi-word NEs having at least one translation equivalent (MW-Mixed). We compute the numbers for each article pair and then average over all pairs.

Parameter Tuning  Parameter tuning was done following the procedure described in Section 5. For En-Hi and En-Ta, the following values were used: p_NE = 1, m_tlit = 100, r_tlit = 7, m_tlat = 1, r_tlat = 1. For En-Ar, m_tlit = 1 and r_tlit = 14 were used, the other parameters remaining the same. For the tuning exercise, 40 annotated article pairs were used for En-Hi, 40 pairs for En-Ta, and 26 pairs for En-Ar.

6.4 Results and Analysis

We evaluated the algorithm on 80 article pairs for En-Hi, 80 pairs for En-Ta, and 11 pairs for En-Ar. The results are given in Table 2.

⁶En-Hi from Webdunia, En-Ta from The New Indian Express.
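The three-way judgement described in Section 6.3 can be sketched as follows. This is an illustrative sketch, not the authors' evaluation script; representing NEs as whitespace-separated strings and testing the partial case with a word-set subset check are assumptions based on the description above.

```python
# Sketch of the evaluation judgement: a mined (source NE, target NE) pair
# is compared against the gold annotations for the same article pair.

def judge(mined_src, mined_tgt, gold_pairs):
    """Return 'full', 'partial', or 'incorrect'."""
    for gold_src, gold_tgt in gold_pairs:
        if mined_src == gold_src:
            if mined_tgt == gold_tgt:
                return "full"
            # Source NEs match and the mined target NE is a subset of gold.
            if set(mined_tgt.split()) <= set(gold_tgt.split()):
                return "partial"
    return "incorrect"
```

Per-article precision and recall at each level of match can then be averaged over all article pairs, as the paper does.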
(full) (part.) E.g. The target document contained اﻟﺠﻤﻬﻮرﻳﻪ En-Hi 0.84 0.86 0.89 0.89 al-jamhooriyah “republic”; the dictionary En-Ta 0.78 0.80 0.61 0.63 contained اﻟﺠﻤﻬﻮرﻳﺎت al-jamhooriyat, En-Ar 0.42 0.44 0.63 0.66 which has a different suffix, and hence was not En-Ar* 0.43 0.44 0.60 0.62 found. * including the data used for tuning Transliteration model The non-uniform usage Table 2: Precision and recall of the system of diacritics and affixes (across training and test data) as mentioned above affected the perfor- Category En-Hi En-Ta En-Ar mance of transliteration too. E.g. The model is SW 0.90 0.82 0.69 trained on data where the ’ ’الprefix usually occurs MW - Translit 0.91 0.64 0.63 in the Arabic NE, but not in the English NE. As MW - Mixed 0.77 0.40 0.66 a result, it maps the ‘new’ in ‘new york’ to اﻟﻨﻴﻮ al-nyoo. The annotator had mapped ‘new’ to Table 3: Category-wise recall of the system ﻧﻴﻮnyoo (i.e. without the prefix), causing the evaluation program to mark the system’s output language models to disambiguate NE boundaries. as a false positive. (The false negatives are mostly due to limitations of transliteration model and the dictionary.) The Generative Model Some errors occurred due to precision is relatively low in Arabic, even when deficiencies in the generative model. The model we include the tuning data. This suggests that the requires every word in the source NE to be mapped problem is not because of incorrect parameter val- to a unique word in the target NE. This causes ues. The error analysis for Arabic is discussed in problems when there are function words in the Section 6.5. source NE, or when two source words are mapped We also report recall of the system for various to the same target word. E.g. ‘yale school of man- categories of NEs in Table 3.7 Note that the MW agement’ corresponds to the 3-word NE ‘اﻻداره cases and the SW case are mutually exclusive. ’ﻣﺪرﺳﻪ ﻳﻴﻞwhere ’of’ has no Arabic counterpart. 
'al azhar' corresponds to the single word اﻻزﻫﺮ al-azhar (which could be split as ال ازﻫﺮ al azhar, but this is never done in practice).

6.5 Error Analysis for Arabic

The system performed relatively poorly in Arabic compared with the other languages. Detailed error analysis revealed the following sources of error.

Source NER. The text of the English articles automatically extracted from Wikipedia was not very clean compared with the newswire text used for En-Hi and En-Ta. As a result, the source NER wrongly identified many words as NEs, which were then mapped to words on the target side, affecting precision. E.g. words such as "best" and "foxe" were marked as NEs, and words with similar meaning or sound were found in the target. But since the annotator had ignored these words, the evaluation marked them as false positives.

Translation model. Many words were ignored by the translation model because of the presence of diacritics or affixes (e.g. 'ال' al in Arabic is frequently prefixed to words; also, in Arabic, different sources of text may have different conventions).

Footnote 7: Since we cannot determine the category of false positives, we do not report the precision here.

7 Related work

Automatic learning of translation lexicons has been studied in many works. Pirkola et al. (2003) suggest learning transformation rules from dictionaries and applying the rules to find cross-lingual spelling variants. Several works (Fung, 1995; Al-Onaizan and Knight, 2001; Koehn and Knight, 2002; Rapp, 1999) suggest approaches to learn translation lexicons from monolingual corpora. Apart from single-word approaches, some works (Munteanu and Marcu, 2006; Quirk et al., 2007) focus on mining parallel sentences and fragments from 'near parallel' corpora.

On the other hand, out-of-vocabulary words are transliterated to the target language, and approaches have been suggested for automatically learning transliteration equivalents. Klementiev and Roth (2006) proposed the use of similarity of temporal distributions for identifying NEs from comparable corpora. Tao et al. (2006) used phonetic mappings for mining NEs from comparable corpora, but their approach requires language-specific knowledge, which limits it to specific languages. Udupa et al. (2008; 2009b) proposed a language-independent technique for mining single-word NE transliteration equivalents from comparable corpora. In this work, we extend this approach to mining multi-word NE equivalents from comparable corpora.

Pascale Fung. 1995. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics, pages 236-243.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 817-824, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9-16, Morristown, NJ, USA. Association for Computational Linguistics.

8 Conclusion

Through an empirical study, we motivated the importance and non-triviality of mining multi-word NE equivalents in comparable corpora. We proposed a two-tier generative model for mining such equivalents, which is independent of the length of the NE. We developed a variant of the Viterbi algorithm for finding the best alignment in our generative model.
We evaluated our approach on three language pairs, and discussed the error analysis for English-Arabic.

Currently, unigram approaches are popular for most tasks in NLP, CLIR, MT, topic modeling, etc. Phrase-based approaches are limited by their efficiency and complexity, and also show limited improvement. We hope that this work will motivate researchers to explore principled methods that make use of NE phrases to significantly improve the state of the art in these areas. The two-tier generative model is applicable to any problem where the context of an observed variable does not depend on a fixed number of past observed variables.

References

Yaser Al-Onaizan and Kevin Knight. 2001. Translating named entities using monolingual and bilingual resources. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 400-408, Morristown, NJ, USA. Association for Computational Linguistics.

Chris Quirk, Raghavendra Udupa, and Arul Menezes. 2007. Generative models of noisy translations with applications to parallel fragments extraction. In MT Summit XI, pages 377-284. European Association for Machine Translation.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363-370, Morristown, NJ, USA. Association for Computational Linguistics.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 81-88, Morristown, NJ, USA. Association for Computational Linguistics.

Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, and Kalervo Järvelin. 2003. Fuzzy translation of cross-lingual spelling variants. In SIGIR '03, pages 345-352, New York, NY, USA. ACM.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In ACL, pages 519-526.

Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2008. Mining named entity transliteration equivalents from comparable corpora. In CIKM '08, pages 1423-1424. ACM.

Raghavendra Udupa, K. Saravanan, Anton Bakalov, and Abhijit Bhole. 2009a. "They are out there, if you know where to look": Mining transliterations of OOV query terms for cross-language information retrieval. In ECIR, volume 5478, pages 437-448. Springer.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2009b. MINT: A method for effective and scalable mining of named entity transliterations from large comparable corpora. In EACL, pages 799-807.

An Unsupervised Alignment Model for Sequence Labeling: Application to Name Transliteration

Najmeh Mousavi Nejad
Department of Engineering, Islamic Azad University, Science & Research Branch, Punak, Ashrafi Isfahani, Tehran, Iran

Shahram Khadivi
Department of Computer Engineering, Amirkabir University of Technology, 424 Hafez Ave, Tehran, Iran 15875-4413
[email protected] [email protected]Abstract transformation rules to generate the target name. Obviously the alignment process highly affects In this paper a new sequence alignment the results. There are some alignment tools model is proposed for name transliteration which produce alignments from a bilingual systems. In addition, several new features are corpus such as GIZA++ (Och and Ney, 2003). introduced to enhance the overall accuracy in Previous studies can be divided into two a name transliteration system. Discriminative categories according to their alignment process: methods are used to train the model. Using this model, we achieve improvements on the those which apply alignment tools or predefined transliteration accuracy in comparison with algorithms in their transliteration process and the state-of-the-art alignment models. The 1- those that propose new algorithms for aligning best name accuracy is also improved using a word pairs. name selection method from the 10-best list There has been an exploration on several based on the contents of the web. This alignment methods for letter to phoneme method leads to a relative improvement of alignment (Jiampojamarn and Kondrak, 2010). 54% over 1-best transliteration. The M2M-aligner, ALINE which performs phonetic experiments are conducted on an English- alignment, constraint-based alignment and Persian name transliteration task. Integer Programming were investigated. The Furthermore, we reproduce the past studies results under the same conditions. system was evaluated on several data sets such as Experiments conducting on English to Combilex, English Celex, CMUDict, NETTalk, Persian transliteration show that new features OALD and French Brulex. provide a relative improvement of 5% over Furthermore transliteration based on phonetic previous published results. scoring has been studied using phonetic features (Yoon et al., 2007). 
1 Introduction

Transliteration is a phonetic translation that finds the phonetic equivalent in the target language for a given source-language word. The quality of name transliteration plays an important role in a variety of applications such as machine translation, since proper nouns are usually not in the dictionary and new ones (e.g. scientific terms) are introduced every day.

The transliteration process consists of a training stage and a testing stage. In the training stage the model learns segment alignments and produces transformation rules with a probability assigned to each of them. In the test stage it uses these transformation rules to generate the target name.

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 73-81, Chiang Mai, Thailand, November 12, 2011.

This method was evaluated for four languages (Arabic, Chinese, Hindi and Korean) and one source language (English). The name pairs were aligned using a standard string alignment algorithm based on Kruskal.

Substring-based transliteration was investigated applying GIZA++ for aligning name pairs and using the open-source CRF++ software package for training the model (Reddy and Waxmonsky, 2009). The model was tested from English to three languages: Hindi, Kannada and Tamil.

English-Japanese transliteration was performed using a maximum entropy model (Goto et al., 2003). First, the likelihood of a particular choice of letter chunking into English conversion units is calculated, and the English word is divided into conversion units that are partial English character strings. Second, each English conversion unit is converted into a partial Japanese character string, called katakana.

The proposed language-independent alignment method performs similarly to GIZA++ in Top-1 for English-Persian transliteration and improves the accuracy and MRR (Mean Reciprocal Rank) in Top-5 and Top-10. For reverse transliteration (Persian to English), the new
alignment shows a significant improvement over the GIZA++ outcome. Furthermore, an approach based on name frequencies in web contents is applied to choose one name from the 10 best possible transliterations. Since the dominant language of the web is English, these experiments were performed for Persian-to-English transliteration and not English-to-Persian.

The rest of this paper is organized as follows. The feature set is described in Sec. 2. The proposed alignment method is described in Sec. 3. In Sec. 4 our experimental study is described. Choosing one name from the 10 best transliterations is described in Sec. 5, and the conclusion is given in Sec. 6.

In this process the English and Japanese contextual information is considered simultaneously, to calculate the plausibility of conversion from each English conversion unit to various Japanese candidate units using a single probability model.

There are a few researches which do not use alignment in the transliteration process. For example, in recent years two discriminative methods corresponding to local and global modeling approaches were proposed (Zelenko and Aone, 2006). These methods do not require alignment of names in different languages, and the features for discriminative training are extracted directly from the names themselves. An experimental evaluation of these methods for name transliteration was performed from three languages (Arabic, Korean, and Russian) into English.

The language pair we perform our tests on is Persian-English and vice versa. There have been a few researches on the Persian language (Karimi et al., 2007), and the quality of transliterated names has been improved in past studies.

2 Feature Set

Maximum entropy models use features for maximizing log likelihood; consequently, defining proper features has a high impact on the final results. We define two types of features, which are binary-valued vectors. For both types of features (consonant-vowel and n-gram),
the current context (current letter) and two past and two future contexts (neighboring letters) are used. We choose a window with a length of 5, since experiments show that a shorter or longer window would degrade the results.

However, the proposed method is language specific and the algorithm is designed for the Persian language. The best general language-independent model in the mentioned paper is CV-MODEL3. To compare with our new method, we have reproduced its results under similar conditions: in both systems the same corpus was used, and both experiments use 10-fold cross-validation.

In this paper, the OpenNLP maximum entropy package (available at http://incubator.apache.org/opennlp/) is used for training the model. We define new features for discriminative training. Moreover, a new approach for aligning name pairs is proposed. In the case studies, we investigate the effect of each feature by adding it to and removing it from the training process. As a result, the best combination of features is obtained for the English-Persian language pair. In addition, we compare our proposed alignment method to GIZA++. Our main concern is finding an alignment model for transliteration. We have found that the most common word alignment tool for transliteration alignment is GIZA++ (Hong et al., 2009; Karimi et al., 2007; Reddy and Waxmonsky, 2009).

2.1 Consonant-Vowel Features

Every language has a set of consonant and vowel letters. The consonant letters can be divided into different groups based on their types (Table 1).

Table 1. Six groups of consonants
Plosive (stop):        p, b, t, d, k, g, q
Fricative:             f, v, s, z, x, h
Plosive-Fricative:     j, c
Flap (tap):            r
Nasal:                 m, n
Lateral approximant:   l, y

Most combinations of consonant-vowel features were tested for the English-Persian language pair. We have found the following consonant-vowel features to be the most effective ones for generating the current target letter (tn).
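Table 1's grouping can be encoded directly as the consonant-vowel class features described above, taken over a window of length 5. A minimal sketch (the English vowel set and the fallback class for consonants not listed in Table 1 are our own assumptions):

```python
# Consonant groups from Table 1.
GROUPS = {
    "plosive": "pbtdkgq",
    "fricative": "fvszxh",
    "plosive-fricative": "jc",
    "flap": "r",
    "nasal": "mn",
    "lateral-approximant": "ly",
}
LETTER_CLASS = {ch: g for g, letters in GROUPS.items() for ch in letters}

def cv_class(ch):
    """Class label of one letter: a consonant group, 'V' for vowels,
    or a generic 'C' for consonants Table 1 does not list (e.g. 'w')."""
    if ch in "aeiou":
        return "V"
    return LETTER_CLASS.get(ch, "C")

def window_classes(word, i, size=2):
    """CV classes of the letter at position i plus `size` past and future
    letters (window length 5 for size=2), '#'-padded at the boundaries."""
    padded = "#" * size + word + "#" * size
    return [c if c == "#" else cv_class(c) for c in padded[i:i + 2 * size + 1]]
```

For example, `window_classes("abrams", 0)` yields the classes of the padded window `##abr`.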
si is used to represent the source name characters, and ti represents the target name characters. CV is an abbreviation for consonant-vowel.

We have defined three types of CV features. CV-TYPE1 is a set of basic features used to reproduce the results of past studies; these features consist of ... and .... To achieve better results, some new features are presented, called CV-TYPE2, an augmented set of features including ... to .... Finally, to track the effect of the new consonant grouping strategy, CV-TYPE3 is defined, which is similar to CV-TYPE2 except that the consonant letters are divided according to Table 1. Table 1 can be used for categorizing the letters of any language as well, by replacing each English letter with its corresponding letter in the target language. These features improve transliteration but are still not sufficient; therefore we need n-gram features.

In our model, the following set of features has been used: f1: f2: f3: f4: f5: f6: f7: f8: The best sequence of the above features varies from one language pair to another. We report the best combination for the English-Persian language pair in Sec. 4.

3 The Proposed Alignment Method

The features explained in the previous section are extracted from the aligned names. In other words, the alignments of the source and target names must be produced first. Our proposed alignment method is based on a two-dimensional Cartesian coordinate system. The horizontal axis is labeled with the source name and the vertical axis with the target name (or vice versa). A line is drawn from the coordinate (0,0) to the point with coordinate (source_name_length, target_name_length). We mark the cell in each column of the alignment matrix which has the least distance to the line (Figure 1). Considering Figure 1, the following alignments are achieved: (a, ) , (b, ) , (r, ) , (a, ) , (m, ) , (s, )

2.2 N-gram Features

In the n-gram features for the source name, two past and two future contexts are used (a window with a length of 5).
For the target name, however, only two past contexts are used (because we do not have the future context yet). Since maximum entropy is used for training, the whole approach for the target name can be considered a Maximum Entropy Markov Model (MEMM), a simple extension of ME classification that is useful for modeling sequences, as it takes the previous classification decision into account. For the source name, on the other hand, the future letters are known and are used for feature extraction, so the MEMM concept cannot be carried over to the source name.

Using S to denote the source name and T to denote the target name, the n-gram features for each name can be summarized as: ...

For any language pair, all combinations of si and ti can be used to define a feature.

Figure 1. Alignment matrix of (abrams, )

The name pair in Figure 1 has a simple alignment. For more complex alignments, some fixed points are needed in order to draw the lines. These fixed points are coordinates of segments that are known to always be alignments of each other. For instance, in English-Persian, " " is always aligned to "b" or "bb". If there exists any fixed point in the name pair, one line is drawn
We (g, ) , (i, ) , (bb, ) , (o, ) , (n, ) , (i, ) introduce FPA algorithm which is an unsupervised approach that adopts the concept of i EM training. In the expectation step the training name pairs are aligned using current model and n in the maximization step the most probable o alignments are added to the fixed points set. The algorithm is as follows: b 1. An initial and inaccurate alignment is b considered, assuming just one line in the alignment matrix. i 2. The discriminative model learns the mapping between source and target names g using maximum entropy. 3. Using the trained model (ME) and extracting the most probable mappings, an Figure 2. Alignment matrix of (gibboni, ) initial set of fixed points are nominated. This process is repeated until the algorithm These fixed points help us to perform the converges. alignment process more accurately. The more accurate they are, the better the final results are. A brief sketch of FPA algorithm is presented Finding fixed points is difficult for some in Figure 3. In line 2 we initialize the fixed points language pairs, especially for the ones about with an empty set. Line 3 shows the convergence which we have no knowledge. Based on the fact of the algorithm. It means when the fixed points that our goal is to design a language independent set do not change, the final set is found. In line 6 transliteration system, an automatic way to find name pairs with equal lengths are only the fixed points is of interest. considered. The corresponding consonant-vowel We investigate two approaches for finding the sequences of the name pairs are generated. If the fixed points. In the first one, Moses, a statistical CV sequences are exactly similar to each other, machine translation system is used to define the the name pair is included in the training stage. fixed points. 
Moses trains translation models for any language pair automatically (Koehn et al., 2007). In the translation process it produces a phrase table, which contains source and target phrases of different lengths together with the conditional probabilities of those phrases. If each letter in transliteration is considered as a word and each name as a sentence, Moses can be used to find the fixed points automatically.

To produce the phrase table, Moses should be run on a bilingual corpus; any corpus containing name pairs can be used. Then the phrase table is parsed and the phrases with the maximum probabilities are extracted. The length of the phrases is usually between 1 and 3, since for most natural languages the maximum length of the phonemes corresponding to a grapheme is a digraph (two letters) or at most a trigraph (three letters).

Although the whole training data can be used in the first iteration, this condition produces a reasonable result, with the advantage of ignoring a large amount of training data and saving time in the first iteration. Lines 11 to 21 show the process of updating the fixed points set. In line 14, forcedAlignment means using the current ME model to transliterate the source name under the condition that the produced transliteration must be the same as the target name; this condition guarantees the convergence of the algorithm. Suppose the source name length is J and the target name length is I; then the decoding process is as follows:

1. For each letter of the source name, choose the top N transformation rules with the highest probabilities which lead to producing the
Do beam search to find the best path in the here the condition for generated transliteration tree. (Best path is the highest multiplication (forcing algorithm to produce target name) is of edges probability). meaningless. So any transliteration can be added 4. Update set A: to Top-N results. !"#$ % & ' () * + #$,, &- ) . / 0 / 0 / A good method for finding the fixed points 1/ 234(5 generates a set similar to other methods. For !"#$ % & ' () * + #$, &-- ) . 2 0 2 0 2 6 example both approaches introduced in this 1/ 234(5 section, lead to similar results. That’s why we We change the value of N between 1 and 5. present only the second approach results in the Results show that there is no significant experiment section. From another point of view, improvement after N = 3 (N > 3). Also time it is sufficient to find the fixed points set for each complexity and memory usage increases language pair only once. Because the fixed exponentially. Therefore the best value for N is points set which is found by a proper corpus, is 3. Line 17 and 18 are final steps in producing very similar to the set produced by a different fixed points set. |k| is the number of distinct corpus on the same language pair. Therefore if segments in the best path set and 7#&89 :$;9 ) is the more than one set are produced using different corpora, the intersection of these sets is probability of &89 <$;9 transformation rule. Once considered as the final fixed points set for other the probabilities are calculated, they are transliteration tasks regarding that language pair. compared to a predefined threshold. 
If they are bigger than the threshold, they are added to the fixed points set.

1:  Algorithm FPA
2:  fixedPoints = {}; oldFixedPoints = {}
3:  while (fixedPoints != oldFixedPoints) {
4:      oldFixedPoints = fixedPoints;
5:      if (first iteration) {
6:          fixedPoints = updateFixedPoints(names_with_equal_CV_sequence)
7:      } else {
8:          fixedPoints = updateFixedPoints(whole_training_corpus)
9:      }
10: }
11: Function updateFixedPoints(training_data) {
12:     bestPathEdges = {};
13:     for (all name pairs) do {
14:         A = forcedAlignment(sourceName, targetName, currentModel)
15:     }
16:     for (all segment pairs in A) do {
17:         p(sk -> tk) = count(sk, tk) / sum over t' of count(sk, t')
18:         p(tk -> sk) = count(sk, tk) / sum over s' of count(s', tk)
19:     }
20:     if (p > threshold) { add transformation rule to the fixedPoints }
21: }

Figure 3. Sketch of the FPA algorithm

We present a list of the most common fixed points for English-to-Persian transliteration, sorted in descending order of the probability values:

{ ( mm) , ( dd) , ( bb) , ( wh) , ( rr) , ( x) , ( kn) , ( nn) , ( ff) , ( tt) , ( pp) , ( ll) , ( h) , ( n) , ( r) , ( d) , ( g) , ( b) , ( t) , ( sh) , ( p) , ( l) , ( m) , ( j) , ( ph) , ( ss) , ( z) , ( w) , ( q) , ( f) , ( v) , ( y) , ( s) , ( k) }

CV-TYPE1, CV-TYPE2 and CV-TYPE3 are explained in Sec. 2.1. The best word accuracy in Table 2 is 58.4%. Comparing word accuracies, it can be concluded that for English-Persian transliteration the following features are the most effective ones: f1: f2: f3: f5: f7:

There is a study on statistical machine translation which combines discriminative training and Expectation-Maximization (Fraser and Marcu, 2006). The proposed EMD algorithm uses discriminative training to control the contributions of sub-models; furthermore, EM is applied to estimate the parameters of the sub-models.

As we can see, does not help in better transliteration: written Persian omits short vowels, and only long vowels appear in texts, so it is completely irrelevant for generating the current Persian letter.
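Lines 17-20 of the FPA sketch (Figure 3) estimate, for each segment pair seen on the best alignment paths, its transformation probability in both directions and keep only the confident pairs. A minimal sketch of that maximization step (requiring both conditionals to exceed the threshold is our reading of the figure; 0.9 is the threshold value tuned in the text):

```python
from collections import Counter

def update_fixed_points(best_path_segments, threshold=0.9):
    """One maximization step of FPA: count each (source_segment,
    target_segment) pair over the best paths, turn the counts into the two
    conditional probabilities p(t|s) and p(s|t), and return the pairs whose
    probabilities both exceed the threshold."""
    pair_counts = Counter(best_path_segments)
    src_totals, tgt_totals = Counter(), Counter()
    for (s, t), c in pair_counts.items():
        src_totals[s] += c
        tgt_totals[t] += c
    fixed = set()
    for (s, t), c in pair_counts.items():
        if c / src_totals[s] > threshold and c / tgt_totals[t] > threshold:
            fixed.add((s, t))
    return fixed
```

The expectation step would re-align the training pairs with the current ME model (forcedAlignment) and feed the resulting segment pairs back in, until the set stops changing.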
Using f2 and f3 simultaneously improves the results much more than f4, f5 or f6 alone, since each of them has the power of a bigram feature and together they provide trigram features.

Experimental results for English-to-Persian transliteration show that CV-TYPE3 has the best word accuracy among all the consonant-vowel grouping strategies; therefore, we use this type of consonant-vowel features for the reverse direction as well. Furthermore, the English-Persian experiments imply that n-gram features with a distance of two letters are not useful for Persian names; this is due to the nature of the Persian language. This fact reduces the number of experiments, since it removes f4, f5 and f6 from the n-gram features. Table 3 shows the effect of several feature combinations on mean word accuracy in Top-1 for the Persian-English transliteration task. The best word accuracy in Table 3 is 20.6%; therefore, the following features result in the best performance: f1: f2: f3: f7: f8:

In contrast to their method, we generate the fixed points set by Expectation-Maximization, and no parameter estimation is done during EM. The new fixed points set, updated in the EM step, improves the alignment quality and consequently causes the model to re-estimate its parameters.

4 Case Studies

Two types of experiments have been performed: one for the effectiveness of different features, and the other for the effectiveness of the alignment process. A corpus consisting of 16760 word pairs has been used. These words are names of geographical places, people and companies. This is the same corpus on which the experiments of previous studies were performed (Karimi et al., 2007). Each name has only one transliteration. Many words of different language origins (such as Arabic, French, and Dutch) were included in the corpus. This corpus is referred to as B+. The experiments apply 10-fold cross-validation, in which the whole corpus is partitioned into 10 disjoint segments.
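The 10-fold partitioning just described can be sketched as follows (the round-robin split is our own choice; any partition into 10 disjoint segments works):

```python
def k_fold_splits(pairs, k=10):
    """Partition the corpus into k disjoint segments; fold i uses segment i
    for testing and the remaining k-1 segments for training."""
    folds = [pairs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test
```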
This type of experiment is an alternative method for controlling over-fitting.

4.1 Effectiveness of Features

All combinations of f1 to f8 for the English-Persian language pair were tested. Table 2 shows mean word accuracy over 10 folds for English-Persian transliteration. The first row in Table 2 shows the reproduction of the CV-MODEL3 results using some basic features. Extending the CV-TYPE1 features to CV-TYPE2 improves the accuracy (second row). Similarly, applying the new grouping of consonant letters (CV-TYPE3) leads to a relative improvement of 1% over CV-TYPE2 (third row).

Persian-English transliteration is more difficult than English-Persian, because moving from the language with the smaller alphabet to the one with the larger alphabet increases the ambiguity. Using web page contents improves the transliteration; the strategy is explained in Sec. 5.

Table 2. The effect of several feature combinations on mean word accuracy in Top-1 for English-Persian transliteration (columns: f1-f8 and CV-TYPE1/2/3; "E" marks an active feature; the rightmost value is the mean WA)
E E E E E | 55.3
E E E E E | 56.7
E E E E E | 57.2
E E E E | 57.3
E E E E | 57.3
E E E E E E E E E | 57.3
E E E E E E | 57.4
E E E E E | 57.5
E E E E E E | 58.0
E E E E E E E | 58.2
E E E E E | 58.4
E E E E E E | 58.4

Table 3. The effect of several feature combinations on mean word accuracy in Top-1 for Persian-English transliteration (columns: f1 f2 f3 f7 f8 and CV-TYPE3; "E" marks an active feature; the rightmost value is the WA)
E E E | 17.1
E E E E E | 19.3
E E E E E | 19.8
E E E E E | 20.5
E E E E E E | 20.6

The most effective features, found in the previous section, are included in the training stage. These combinations are specifically appropriate for the English-Persian language pair; for other languages, if the best combination is not known, all the features f1 to f8 should be included in the feature extraction. For each fold, word accuracy and MRR are computed.

4.2 Effectiveness of Alignment

The results of the proposed alignment (FixedPointsAlign) are compared to the GIZA++ alignment. The settings of the important GIZA++ parameters are as follows: five iterations each for the IBM1 model and the HMM, and three iterations each for the IBM3 and IBM4 models.
We checked the GIZA++ output for name pairs and discovered that the alignments are always monotone, except for rare cases; that is why it has been used in past studies as well (Hong et al., 2009; Karimi et al., 2007; Reddy and Waxmonsky, 2009). The approaches using GIZA++ utilize symmetrized alignments in both directions. All of the experiments are done on the B+ corpus, using 10-fold cross-validation. The results are compared to CV-MODEL3 (Karimi et al., 2007).

Table 4 and Table 5 show mean word accuracy and mean MRR in Top-1, Top-5 and Top-10 for English to Persian. Persian-to-English results are presented in Table 6 and Table 7. The transliteration systems that use GIZA++ for alignment differ from each other in the transliteration generation process, since GIZA++ has a unique strategy for aligning sentences or name pairs. CV-MODEL3 is a language-independent model which uses GIZA++ for aligning name pairs. Since our experimental conditions are exactly the same as the CV-MODEL3 experimental conditions, the results are comparable.
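The MRR reported in Tables 5 and 7 is the mean, over test names, of the reciprocal rank of the reference transliteration in the n-best list (0 when it is absent). A minimal sketch:

```python
def mean_reciprocal_rank(nbest_lists, references, topk=10):
    """Mean reciprocal rank over a test set.  `nbest_lists[i]` is the ranked
    candidate list for test name i, `references[i]` its gold transliteration."""
    total = 0.0
    for hyps, ref in zip(nbest_lists, references):
        for rank, hyp in enumerate(hyps[:topk], start=1):
            if hyp == ref:
                total += 1.0 / rank
                break
    return total / len(references)
```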
The outcomes lead us to the names, the word accuracy is equal to one for that conclusion that although GIZA++ provides good name pair. But an efficient transliteration system results in English to Persian transliteration, it should produce only the correct ones. does not produce a reasonable result in the A large corpus containing several names can reverse direction. This is due to parameters be considered as a reference to choose one name setting. Unlike our proposed alignment, GIZA++ from possible transliterations. First the unigram alignment is highly dependent to its parameters. probability of each transliteration is calculated. Then the transliteration with the max probability N-Best SLA CV- GIZA++ FPA is chosen as the final result. Since the dominant MODEL3 language in the web is English, it is the best Top-1 50.7 55.3 58.4 58.4 corpus for Persian to English transliteration. As a result, in this section the experiments were Top-5 77.0 84.5 86.8 88.7 performed for Persian-to-English transliteration Top-10 84.0 89.5 90.8 92.6 and not English-to-Persian. We calculate the probabilities of Top-10 for Table 4. Mean word accuracy of 10-fold on B+ each source name and the one with the maximum corpus for English to Persian transliteration probability is chosen as the final transliteration. A test file consisting of 1676 name pairs was produced. We extract 10% of the train file N-Best GIZA++ FPA randomly to generate the test file. The word accuracy of this approach is 32.1% and the Top-1 58.4 58.4 accuracy for the same test and train files, Top-5 70.2 70.9 generating only one transliteration (Top-1) is 20.8%. It means that this approach leads to a Top-10 70.7 71.5 relative improvement of 54% over Top-1 results. Table 5. Mean MRR of 10-fold on B+ Corpus 6 Conclusions for English to Persian. In this paper, we presented a language- independent alignment method for transliteration. Discriminative training is used in our system. 
The proposed method has improved transliteration generation compared to GIZA++. Furthermore, we defined a number of new features for the training stage. For Persian-to-English transliteration, web page contents are used to choose one name from the 10-best hypothesis list; this approach leads to a relative improvement of 54% over simple Top-1 transliteration.

Table 6. Mean word accuracy of 10-fold on the B+ corpus for Persian-to-English transliteration
N-Best   SLA    CV-MODEL3   GIZA++   FPA
Top-1    19.4   17.6        14.6     20.6
Top-5    41.6   36.2        32.7     44.9
Top-10   50.4   46.0        38.4     53.2

References

Reddy, S., Waxmonsky, S., Substring-based Transliteration with Conditional Random Fields, Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 92-95, Suntec, Singapore, 7 August 2009.

Fraser, A., Marcu, D., Semi-Supervised Training for Statistical Word Alignment, Proceedings of ACL-2006, pp. 769-776, Sydney, Australia.

Goto, I., Kato, N., Uratani, N., Ehara, T., Transliteration Considering Context Information Based on the Maximum Entropy Method, In Proc. of the IXth MT Summit, 2003.

Yoon, S., Kim, K., Sproat, R., "Multilingual Transliteration Using Feature based Phonetic Method", Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 112-119, Prague, Czech Republic, June 2007, Association for Computational Linguistics.

Hong, G., Kim, M., Lee, D., Rim, H., A Hybrid Approach to English-Korean Name Transliteration, Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 92-95, Suntec, Singapore, 7 August 2009.

Zelenko, D., Aone, C., Discriminative Methods for Transliteration, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 612-617, Sydney, July 2006.

Jiampojamarn, S., Kondrak, G., Letter-Phoneme Alignment: An Exploration, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780-788, Uppsala, Sweden, 11-16 July 2010.

Och, F. J., Ney,
H., Discriminative Training and Maximum Entropy Models for Statistical Machine Translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 295-302. Josef, F., Ney. H., A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics, vol.29 (2003), pp. 19- 51 Karimi, S., Scholer, F., Turpin, A., Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back- Transliteration, The 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), pages 648-655, Prague, Czech Republic, June 2007. Karimi, S., Machine Transliteration of Proper Names between English and Persian, A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy, BEng. (Hons.), MSc. Koehn, P., Hoang, H., Birch A., CallisonBurch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer C., Bojar, O., Constantin, A., Herbst E., 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL’07) – Companion Volume, June. Microsoft Web N-gram Services available at http://research.microsoft.com/en-us/collaboration/ focus/cs/web-ngram.aspx 81 Forward-backward Machine Transliteration between English and Chinese Based on Combined CRFs Ying Qin Guohua Chen Department of Computer Science, National Research Centre for Foreign Language Beijing Foreign Studies University Education, Beijing Foreign Studies University
[email protected] [email protected]In previous researches, syllable segmentation Abstract and alignment were done in terms of single syl- lables in training a transliteration model. (Yang The paper proposes a forward-backward translitera- et al., 2009; Yang et al., 2010; Aramaki and tion system between English and Chinese for the Abekawwa, 2009; Li et al., 2004). Sometimes, shared task of NEWS2011. Combined recognizers however, it is hard to split an English word and based on Conditional Random Fields (CRF) are ap- align each component with a single Chinese plied to transliterating between source and target lan- character, which is always monosyllabic. For guages. Huge amounts of features and long training time are the motivations for decomposing the task instance, when TAX is transliterated into 塔克斯 into several recognizers. To prepare the training data, (Ta Ke Si) in Chinese, no syllable is mapped segmentation and alignment are carried out in terms onto the characters 克 and 斯 , for X is pro- of not only syllables and single Chinese characters, as nounced as two phonemes rather than a syllable. was the case previously, but also phoneme strings and In this paper, we try to do syllable segmentation corresponding character strings. For transliterating and alignment on a larger unit, that is, phoneme from English to Chinese, our combined system strings. achieved Accuracy in Top-1 0.312, compared with the best performance in NEWS2011, which was 0.348. Conditional Random Fields (CRF) was suc- For backward transliteration, our system achieved cessfully applied in transliteration of NEWS2009 top-1 accuracy 0.167, which is better than others in and NEWS2010 (Li et al. 2009; Li et al. 2010). NEWS2011. Transliteration was viewed as a task of two-stage labeling (Yang et al. 2009; Yang et al., 2010; 1 Introduction Aramaki and Abekawwa, 2009). 
Syllable segmentation was done at the first stage, and then target strings were assigned to each chunk at the next stage. The huge number of features in the second stage made model training time-consuming: thirteen hours on an 8-core server were expended to train the CRF model in the work done by Yang et al. (2010).

The surge of new named entities is a great challenge for machine translation, cross-language IR, cross-language IE and so on. Transliteration, mostly used for translating personal and location names, is a way of translating source names into the target language with approximate phonetic equivalents (Li et al., 2004), while backward transliteration traces the foreign names back (Guo and Wang, 2004). Phonetic-based and spelling-based approaches are popularly applied in machine transliteration (Karimi et al., 2011). Recently, direct orthographical mapping (DOM) between two languages, a kind of spelling-based transliteration approach, has outperformed phonetic-based methods. Most systems in NEWS2009 and NEWS2010 utilized this approach for automatic transliteration (Li et al., 2009; Li et al., 2010).

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 82–85, Chiang Mai, Thailand, November 12, 2011.

To reduce training time and the need for high-specification hardware, we adopt a combined CRF transliteration system, dividing the training data into several pools, each of which is used to train a recognizer that predicts the target characters. The final transliteration results are arranged according to the probabilities of all CRF outputs.

In the following, Section 2 describes how segmentation and alignment are done on the unit of phoneme strings. Section 3 explains how the forward-backward transliteration system between English and Chinese is built. Performances of the
system on all the metrics of NEWS2011 are listed in Section 4, which is followed by discussions. The last section is the conclusion.

2 Segmentation and Alignment

Lack of gold-standard syllable segmentation and alignment data is an obstacle to transliteration model training. Yang et al. (2009) applied an N-gram joint source-channel model and the EM algorithm, while Aramaki and Abekawa (2009) made use of the word alignment tool GIZA++ to obtain a syllable segmentation and alignment corpus from the given training data. Neither of them reported how precise their alignments were. Yang et al. (2010) proposed a joint optimization method to reduce the propagation of alignment errors.

Pinyin is the romanized pronunciation of Chinese characters. Due to the nature of pinyin, there are many similarities between English orthography and Chinese pinyin. Of the 24 English consonants, 17 have almost the same pronunciation in pinyin. Since English orthography has a close relationship with phonetic symbols, we believe that consonants in pinyin can also provide clues for syllable segmentation and alignment. In the following example, the consonant sequence in English is the same as that in pinyin.

Therefore we can do syllable segmentation with the help of the pronunciations of Chinese characters. Segmentation is carried out from the second character, for there is no need to split at the initial letter of a string. However, not all mappings between spelling and phoneme are covered by this approach. The following cases are unsolvable:

Case 1: there is no corresponding consonant. For instance, ARAD 阿拉德 (A La De).
Case 2: several letters occupy one phoneme. For instance, BAECK 贝克 (Bei Ke).
Case 3: duplicate letters cause ambiguity. For instance, ANNADALE 安娜代尔 (An Na Dai Er).
Case 4: consonants are sometimes mismatched. For instance, ACQUARELLI 阿奎雷利 (A Kui Lei Li).
Case 5: there are inconsistencies complicating the situation. For instance, ADDINGTON 阿丁顿 (A Ding Dun).

Therefore pinyin-based segmentation is only treated as a preliminary result.

To deal with Case 1, we take a two-step matching (strict matching and then loose matching) between the consonant in pinyin and the English word. If the same consonant is not available, strings of a similar pronunciation are sought. For instance, the consonant in pinyin Fu is f; if there is no letter f in the English transliteration, v, ph and gh are adopted for segmentation.

We apply transformation rules to optimize the syllable alignment result. The rules are induced manually by observing segmentation errors. We believe gold alignment training corpora are the foundation of good performance no matter which algorithm is applied.

We also find that some chunks in English correspond to the same Chinese strings in most transliterations. Some such chunks are given in Table 1 as examples. We keep the alignment between these chunks and the corresponding Chinese character strings, calling it phoneme-string-based alignment.

SKIN 斯金    SKI 斯基    SCO 斯科
MACA 麦考   MACA 麦卡   MACC 麦克
MACKI 麦金  X 克斯      SKEW 斯丘

Table 1. Alignment of English chunks and corresponding Chinese character strings

The alignment of phoneme strings has advantages over single-phoneme alignment. Since each English syllable string may be mapped onto several possible Chinese characters, there will be fewer choices if the alignment is based on phoneme strings when an English syllable sequence is finally transliterated into Chinese character strings. For example, s can be mapped onto the Chinese characters 斯 (Si), 丝 (Si) and 思 (Si), and ky can be mapped onto 基 (Ji), 吉 (Ji) and 季 (Ji), but sky is usually transliterated into 斯基 (Si Ji), with no other sequences serving as alternatives. Therefore, we think phoneme-string alignment is better than single-phoneme alignment. The following is an example of alignment based on phoneme strings.

As to backward transliteration, segmentation and alignment are also based on phoneme strings.
Following are two columns of aligned data for CRF model training:

哈 HA
克斯 X

3 Forward and Backward Transliteration System

CRF is a discriminative model and makes a globally optimal prediction according to conditional probability (Lafferty et al., 2001). When applying CRF to transliteration, the task is treated as labeling source words with target-language strings. Similar to previous works (Yang et al., 2010; Aramaki and Abekawa, 2009), we build a two-stage CRF transliteration system between English and Chinese. The first-stage CRF decoder splits the source words into several chunks. Outputs of the first stage are then sent to the second CRF to label which target characters they are transliterated into. The final transliteration of the source word is the sequence of all the target characters.

For training the CRF chunker with the given corpora segmented and aligned, each character is labeled with the BI scheme, that is, B for the beginning character of a chunk and I for characters in any other position. For example, in the English to Chinese training data, ABBE is segmented and aligned as follows:

A 阿
BBE 贝

The two-column data for training the CRF chunker is:

A B
B B
B I
E I

If some errors occur in some pools but not in all, a correct prediction can still be made by the CRFs trained on the correct pools. The combined CRF recognizers are used for both forward and backward transliteration at the second stage. The workflow of our transliteration system is depicted in Figure 1.

Figure 1. Workflow of Transliteration System

4 Performances and Discussion

We use the open-source CRF++ toolkit to build the two-stage CRF transliteration system with all the given data of NEWS2011.

4.1 Performances

The window size is set to 3, the same as in the experiment by Aramaki and Abekawa (2009).
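As an aside, the BI-labeled two-column chunker format shown above can be generated mechanically from the segmented source side; a short sketch (an illustration under the BI scheme described here, not the authors' code):

```python
def bi_label(chunks):
    """Convert a list of source-side chunks into (character, tag) rows:
    B marks a chunk-initial character, I marks every other character."""
    rows = []
    for chunk in chunks:
        for i, ch in enumerate(chunk):
            rows.append((ch, "B" if i == 0 else "I"))
    return rows

# ABBE segmented as A | BBE (aligned to 阿 | 贝).
for ch, tag in bi_label(["A", "BBE"]):
    print(ch, tag)  # A B / B B / B I / E I
```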
Though a larger window is propitious to providing more contextual information, there are too many features for training the second-stage CRF, so we have to reduce the window size. In the second stage of CRF training, the window size is 2; that is, the features used are C-2, C-1, C0, C1, C2, C-1C0, C0C1, C-2C-1C0 and C0C1C2, where C0 denotes the current chunk. Still, the time it takes to train a model on a normal PC is intolerably long. (Using the same parameter settings of the CRF learner as Aramaki and Abekawa (2009), the training time on a PC (2.3GHz, 4GB) with the NEWS2011 data (37753 English names) reaches 4800 hours.)

Even though the training data aligned on phoneme strings are checked manually, errors still remain here and there. To reduce the risk of local errors in segmentation and alignment, we divide the training data randomly and evenly into several pools. The size of the pools is set simply according to the capability of our PCs.

The number of recognizers may affect the performance of the whole system. To suit the capacity of our PC, we train 10 forward and 20 backward recognizers. We also train another forward transliteration system consisting of 20 recognizers for comparison. Due to time limits, we did not try other numbers of recognizers for backward and forward transliteration during NEWS2011. Because the test data of NEWS2011 are reserved for future use, we cannot try other numbers to build transliteration systems for comparison.

Table 2 shows the common evaluation of our transliteration system between English (E) and Chinese (C). We can see that the performance of E->C transliteration varies only slightly with the number of combined recognizers on all evaluation metrics. The performance of backward transliteration is lower than that of the forward direction on ACC but is better on Mean F-score.

(The CRF++ toolkit is available at http://crfpp.sourceforge.net/.)

Acknowledgments

The work is supported by the National Social Science Fund (No. 10CYY024).
Thanks are due to the support of the National Research Centre for Foreign Language Education, BFSU. We are grateful to all the anonymous reviewers.

CRFs       ACC    Mean F   MRR    MAPref
E->C (10)  0.312  0.669    0.339  0.310
E->C (20)  0.308  0.666    0.337  0.306
C->E (20)  0.167  0.765    0.202  0.167

Table 2. Performance of Combined Transliteration System

4.2 Discussions

• Granularity of syllable segmentation and alignment

Preprocessing the training data with phoneme-string alignment is our approach to improving transliteration between English and Chinese. In backward transliteration, our system is better than the others in the shared task of NEWS2011. Can we assume that larger-granularity alignment is better than smaller-granularity alignment? Which granularity is optimal?

• Number of CRF recognizers

With more data, the time it takes to train a CRF-based model increases sharply. We train transliteration models with the same algorithm but different portions of the data and then combine the results of all recognizers. In this way, training time is reduced. However, we can see from the test results that the performance of transliteration varies with the number of recognizers. How does the combined system compare with a single system? Which number of combinations is best? We will need to explore these questions with more data.

5 Conclusion

Two-stage CRFs are applied to transliterating between English and Chinese. We try to improve the performance in two directions: one is training data processing, in which the data are segmented and aligned based on phoneme strings; the other is system building, in which several models are trained on different parts of the data and their outputs are combined. The final results of the transliteration are arranged in sequential order in accordance with the probabilities from all the recognizers.

In future work, we will focus on gold-standard data and methods of combination to further improve the performance of the forward-backward transliteration system.

References

Eiji Aramaki and Takeshi Abekawa. 2009. Fast decoding and easy implementation: Transliteration as a sequential labeling. Proceedings of the ACL-IJCNLP Named Entities Workshop Shared Task. 65-68.

Yuqing Guo, Haifeng Wang. 2004. Chinese-to-English Backward Machine Transliteration. Companion Volume to Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04). 17-20.

Sarvnaz Karimi, Falk Scholer and Andrew Turpin. 2011. Machine Transliteration Survey. ACM Computing Surveys, 43(4): 1–57.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML01).

Chun-Jen Lee and Jason S. Chang. 2003. Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts Using a Statistical Machine Transliteration Model. Proceedings of HLT-NAACL. 96-103.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. Proceedings of the 42nd ACL Annual Meeting. 159–166.

Haizhou Li, A Kumaran, Vladimir Pervouchine and Min Zhang. 2009. Report of NEWS 2009 Machine Transliteration Shared Task. Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP. 1–18.

Haizhou Li, A Kumaran, Min Zhang and Vladimir Pervouchine. 2010. Report of NEWS 2010 Transliteration Generation Shared Task. Proceedings of the 2010 Named Entities Workshop, ACL 2010. 1–11.

Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura and Sadaoki Furui. 2009. Combining a Two-step Conditional Random Field Model and a Joint Source Channel Model for Machine Transliteration. Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP. 72–75.

Dong Yang, Paul Dixon and Sadaoki Furui. 2010.
Jointly optimizing a two-step conditional random field model for machine transliteration and its fast decoding algorithm. Proceedings of the ACL 2010 Conference Short Papers. 275–280.

English-to-Chinese Machine Transliteration using Accessor Variety Features of Source Graphemes

Mike Tian-Jian Jiang
Department of Computer Science, National Tsing Hua University
Institute of Information Science, Academia Sinica
[email protected]Chan-Hung Kuo Wen-Lian Hsu Institute of Information Science, Institute of Information Science, Academia Sinica Academia Sinica
[email protected] [email protected]obtain the bilingual orthographical correspond- Abstract ence directly to reduce the possible errors intro- duced in multiple conversions. The hybrid ap- This work presents a grapheme-based ap- proach attempts to utilize both phoneme and proach of English-to-Chinese (E2C) translit- grapheme information for transliteration. Oh and eration, which consists of many-to-many Choi (2006) proposed a strategy to include both (M2M) alignment and conditional random phoneme and grapheme features in a single fields (CRF) using accessor variety (AV) as learning process. an additional feature to approximate local context of source graphemes. Experiment re- This work presents a grapheme-based ap- sults show that the AV of a given English proach of English-to-Chinese (E2C) translitera- named entity generally improves effectiveness tion using many-to-many alignment (M2M- of E2C transliteration. aligner) (Jiampojamarn et al., 2007) and condi- tional random fields (CRF) (Lafferty et al., 2001) 1 Introduction with additional features of accessor variety (AV) (Feng et al., 2004). The remainder of this article Transliteration is a subfield of computation lin- is organized as follows. Section 2 briefly intro- guistics, and is defined as the phonetic transla- duces related works involving M2M-aligner, tion of names across languages. Transliteration CRF, and AV. The concept of this work for of named entities is essential in numerous appli- transliteration using M2M-aligner, CRF, and AV cations, such as machine translation, corpus are explained in Section 3. Section 4 describes alignment, cross-language information retrieval, the experiment results and discussion. Finally, information extraction, and automatic lexicon the conclusion is presented in Section 5. acquisition. The transliteration modeling ap- proaches can be classified as phoneme-based, 2 Related Works grapheme-based, and a hybrid of phoneme and grapheme. 
2.1 CRF-based Transliteration Numerous studies focus on the phoneme- Yang et al. (2009) proposed a two-step CRF based approach (Knight and Graehl, 1998; Virga model for direct orthographical mapping (DOM) and Khudanpur, 2003). Suppose that E is an machine transliteration, in which the first CRF English name and C is its Chinese transliteration, segments a source word into chunks and the se- the phoneme-based approach first converts E cond CRF maps the chunks to a word in the tar- into an intermediate phonemic representation p, get language. Reddy and Waxmonsky (2009) and then converts p into its Chinese counterpart presented a phrase-based translation system that C. The idea is to transform both the source and characters are grouped into substrings to be target names into comparable phonemes so that mapped atomically into the target language, the phonetic similarity between the two names which showed how substring representation can can be measured easily. The grapheme-based be incorporated into a CRF model with local approach, which treats the transliteration as a context and phonemic information. Shishtla et al. statistical machine translation problem under (2009) adopted a statistical transliteration tech- monotonic constraint, has also attracted much nique that consists of alignment model of GI- attention (Li et al., 2004). This approach aims to ZA++ (Och and Ney, 2003) and CRF model. 86 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 86–90, Chiang Mai, Thailand, November 12, 2011. The approach of this work is similar to the Cohen et al., 2002; Huang and Powers, 2003; technique of Shishtla et al., yet this work focus- Tanaka-Ishii, 2005; Jin and Tanaka-Ishii, 2006; es on the additional AV feature of CRF and uses Cohen et al., 2007). The basic idea behind these M2M-aligner, which will be described in Sec- measurements is closely related to one particular tion 2.2, instead of GIZA++. 
perspective of n-gram and information theory of cross entropy or perplexity. Zhao and Kit (2007) observed that AV and BE both assume that the border of a potential word is located where the uncertainty of successive characters increases, where AV and BE can be regarded as the discrete and continuous versions, respectively, of the fundamental work of Harris (1970); they then chose to adopt AV as the additional feature for CRF-based Chinese Word Segmentation (CWS). The AV of a string s is defined as:

AV(s) = min{Lav(s), Rav(s)}    (1)

In Eq. (1), Lav(s) and Rav(s) are defined as the numbers of distinct preceding and succeeding characters, except when the adjacent character is absent due to a sentence boundary, in which case the pseudo-character of the beginning or end of a sentence is counted indistinctly. Feng et al. (2004) also developed more heuristic rules to remove strings that contain known words or adhesive characters. For the strict meaning of unsupervised features and for simplicity, this study does not include those additional rules.

The necessity of AV rests primarily on the demand for semi-supervised learning. Since AV can be extracted from large corpora without any manual segmentation or annotation, hidden variables underlying frequent surface patterns of languages may be captured via an inexpensive and unsupervised algorithm such as a suffix array. Unsupervised feature selection with AV or similar features has generally improved the effectiveness of supervised CWS on cross-domain and unlabeled data (Jiang et al., 2010), and this work consequently considers that the AV of un-segmented English names from the training, development, and test data might help enhance E2C transliteration.

2.2 M2M-Aligner

Jiampojamarn et al. (2007) argued that previous work has generally assumed one-to-one alignment for simplicity, but letter strings and phoneme strings are not typically of the same length, so null phonemes or null letters must be introduced to make one-to-one alignments possible. Furthermore, two letters frequently combine to produce a single phoneme (double letters), and a single letter can sometimes produce two phonemes (double phonemes). For example, the English word "ABERT" with its Chinese transliteration "阿贝特", which Jiampojamarn et al. referred to as "phonemes", is aligned as:

A  BE  RT
|   |   |
阿  贝  特

The letters "BE" are an example of the double-letter problem, mapping to the single phoneme "贝". These alignments provide more accurate grapheme-to-phoneme relationships for a phoneme prediction model. Hence the M2M-aligner handles alignments between substrings of various lengths and is based on the expectation maximization (EM) algorithm. For more details of the algorithm, readers are encouraged to explore the previous works of Ristad and Yianilos (1998) and Jiampojamarn et al. (2007).

Despite the ambiguity between Chinese transliteration and phoneme, the above opinion of Jiampojamarn et al. indicates a particular problem of E2C transliteration: the training data, comprising pairs of names written in source and target scripts, lack explicit grapheme-level alignment. This work uses M2M-aligner as an unsupervised method for generating alignments of the training data, which provides hypotheses of DOM without null graphemes.

3 Transliteration using EM and CRF

3.1 CRF Alignment Labeling

In this work, M2M-aligner first maximizes the

2.3 Accessor Variety

Feng et al. (2004) proposed accessor variety (AV) to measure how likely a character substring is a Chinese word.
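The AV measure of Eq. (1) can be illustrated with a few lines of code. This is a simplified sketch over a toy corpus, not part of the paper; '<' and '>' stand in for the begin- and end-of-sentence pseudo-characters mentioned above:

```python
def accessor_variety(s, corpus):
    """AV(s) = min(Lav(s), Rav(s)): the numbers of distinct characters
    observed immediately before and after occurrences of s."""
    left, right = set(), set()
    for sentence in corpus:
        start = sentence.find(s)
        while start != -1:
            end = start + len(s)
            # A missing neighbor at a sentence boundary counts as a
            # pseudo-character.
            left.add(sentence[start - 1] if start > 0 else "<")
            right.add(sentence[end] if end < len(sentence) else ">")
            start = sentence.find(s, start + 1)
    return min(len(left), len(right))

corpus = ["XABY", "ZABY", "AB", "QABR"]
# Left accessors of "AB": {X, Z, <, Q}; right accessors: {Y, >, R}.
print(accessor_variety("AB", corpus))  # 3
```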
Another similar measurement for English and Chinese words, called boundary entropy or branching entropy (BE), was used in several works (Tung and Lee, 1994; Chang and Su, 1997; Cohen and Adams, 2001;

probability of the observed source-target word pairs using the EM algorithm, and subsequently sets the grapheme alignments via maximum a posteriori estimation. CRF is then conditioned on the grapheme alignments to produce globally optimal solutions. However, the performance of the EM algorithm is frequently affected by its initialization. To obtain better alignment results from M2M-aligner, this work empirically sets the "maxX" parameter, the maximum size of sub-alignments on the source side, to 8, and the "maxY" parameter, the maximum size of sub-alignments on the target side, to 1 (denoted X8Y1 for short), since one well-known a priori property of Chinese is that almost all Chinese characters are monosyllabic, which reflects the "double phoneme" situation mentioned in Section 2.2. Notably, this work follows the definition of grapheme described by Oh and Choi (2005) to prevent confusion among phoneme, grapheme, character, and letter: graphemes refer to the basic units (or the smallest contrastive units) of a written language; for example, English has 26 graphemes or letters or characters, Korean has 24, and German has 30. Table 1 is an example of M2M-aligner results.

Source    Target   M2M-Aligner Result
ABBADIE   阿巴迪    A:B|B:A|D:I:E| 阿|巴|迪|

Table 1. An Example of M2M Alignment

With aligned training data, a transliteration model can then be trained by CRF to generate names in the target language from names in the source language. This work uses Wapiti (Lavergne et al., 2010) as the CRF toolkit. Table 2 is an example of training data for CRF alignment labeling, where the tags B and I indicate whether the grapheme is in the starting position of a sub-alignment.

This work tests several combinations of conventional CRF features, along with their abbreviated notations for E2C transliteration, as shown in Table 3, where Ci represents the input grapheme bound individually to the prediction label at the current position i. Take Table 2 as an example: if the current position is at the label "B迪", the features generated by C-1, C0 and C1 are "A", "D" and "I" respectively. Note that a prediction label may either comprise a positioning tag and a Chinese grapheme, or just be the positioning tag itself.

Context features
  1UB: C-1, C0, C1; C-1C0, C0C1
  2UB: C-2, C-1, C0, C1, C2; C-2C-1, C-1C0, C0C1, C1C2
  3UB: C-3, C-2, C-1, C0, C1, C2, C3; C-3C-2, C-2C-1, C-1C0, C0C1, C1C2, C2C3
Positioning tag of prediction label
  PBI: B, I
  PBIE: B, I, E
Chinese grapheme of prediction label
  GB: on B only
  GBI: on B and I

Table 3. Conventional CRF Features

3.2 CRF with AV

This work extends the work of Zhao and Kit (2008) into a unified representation for AV features of English graphemes. The representation accommodates both the position of a string and the string's likelihood ranking by logarithm. Formally, the ranking function for a string s with a score x counted by AV is defined as:

f(s) = r, if 2^r <= x < 2^(r+1)    (2)

The logarithmic ranking mechanism in Eq. (2) is inspired by Zipf's law, to alleviate the potential data sparseness of infrequent strings. The rank r and the corresponding positions of a string are then concatenated as feature tokens. To provide readers with a clearer picture of the appearance of feature tokens, a sample representation for AV is presented and explained in Table 4. For example, considering strings with two graphemes, one of the strings, "AB", is ranked r = 3; therefore, the column of di-grapheme feature tokens has "A" denoted as 3B and "B" denoted as 3E.
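Eq. (2) amounts to taking the integer part of log2 of the AV score. A small sketch of how such rank-plus-position tokens (e.g., 3B, 3E) could be formed; the helper names are illustrative, not from the paper:

```python
import math

def av_rank(x):
    """Rank r of an AV score x per Eq. (2): the r with 2**r <= x < 2**(r+1)."""
    return int(math.log2(x)) if x > 0 else 0

def feature_token(x, position_tag):
    """Concatenate the rank with a position tag (such as B for begin,
    E for end, or S for a single character) into a token like '3B'."""
    return f"{av_rank(x)}{position_tag}"

# A di-grapheme string with an AV score in [8, 16) is ranked r = 3,
# so its first and last graphemes yield the tokens 3B and 3E.
print(feature_token(9, "B"), feature_token(9, "E"))  # 3B 3E
```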
If another di-grapheme string, "BA", competes with "AD" at the position of "A" with a higher rank of r = 4, then 4B is selected as the feature representation of the token at that position. Notably, when the string "AD" conflicts with the string "DI" at the position of "D" with the same rank of r = 4, the corresponding position with the ranking of the leftmost string, which is 4E in this case, is applied arbitrarily.

Character  Label
A          B阿
B          I
B          B巴
A          I
D          B迪
I          I
E          I

Table 2. Example of a CRF labeling format for E2C transliteration

Input  1 char  2 char  3 char  4 char  5 char  Label
A      7S      3B      2B      0B      1B      B阿
B      5S      3E      2B      0B      1B      I
B      5S      3B      2B      0B      1B      B巴
A      7S      4B      2B      1B      1B      I
D      7S      4E      3B      1B1     1E      B迪
I      5S      4E      3B1     1B2     0E      I
E      7S      3E      3E      1E      0E      I

Table 4. Example of AV features

"COMMONWEALTH OF THE BAHAMAS" and "巴哈马联邦", and this phenomenon is noted as "semi-semantic transliteration" for convenience. In fact, the M2M parameter "maxX" of this work has been designed for these phrasal structures, being relatively larger and less symmetrical to the parameter "maxY" than in previous works, which usually set both X and Y to 2 as default values. Since the M2M and the CRF models might over-fit the development set, phrasal structures and semi-semantic transliterations that only appeared in the development set probably became noise with respect to the test set. To analyze semi-semantic transliterations, the NEWS-2011 Chinese-to-English (C2E) back-transliteration corpus was acquired, and the corresponding standard runs have been submitted, owing to the policy of the NEWS shared task.

4 Results and Discussions

4.1 E2C Transliteration Results

In the interest of brevity, only the 3rd and the 4th standard runs, which exceed 0.3 in terms of top-1 accuracy (ACC), are listed in Table 5. Numerous pilot-test models have been trained using both the training set and the development set, and then evaluated on the development set for
optimizing CRF feature combinations, as shown in Table 6.

The C2E experiments, however, encountered a serious problem with the space complexity of CRF L-BFGS training; therefore the submitted results are actually incomplete and erroneous, since C2E transliteration using the proposed approach produces too many labels and features to train a CRF model with the whole training set. In the authors' experience, even a workstation with 24GB of memory is insufficient for such training. Notably, a similar hardware constraint forces the 4th standard run of E2C, which is the primary one, to regress to the simpler Chinese grapheme labeling strategy, namely GB, while introducing deeper contexts and more specific positioning tags, to trade off efficiency in the CRF training phases.

ID  Configuration              ACC    Mean F-score
4   X8Y1, 3UB, PBIE, GB, AV    0.327  0.688
3   X8Y1, 2UB, PBI, GBI, AV    0.303  0.675

Table 5. Selected E2C standard runs

4.2 Error Analysis and Discussions

Based on observations of the pilot tests, there is a clear trend that AV features improve performance significantly. However, improvements on the test set are not as good as expected. After carefully investigating the NEWS-2011 data, one particular phenomenon has been noticed: only the development set contains phrasal named entities. Furthermore, some E2C word pairs are not pure transliterations and are aligned in very different character lengths, such as the word pair of

Configuration              ACC    Mean F-score
X8Y1, 1UB, PBI, GB         0.001  0.151
X8Y1, 1UB, PBI, GB, AV     0.000  0.078
X8Y1, 2UB, PBI, GB         0.001  0.122
X8Y1, 2UB, PBI, GB, AV     0.000  0.064
X8Y1, 3UB, PBI, GB, AV     0.569  0.860
X8Y1, 1UB, PBI, GBI        0.454  0.762
X8Y1, 1UB, PBI, GBI, AV    0.547  0.813
X8Y1, 2UB, PBI, GBI        0.547  0.814
X8Y1, 2UB, PBI, GBI, AV    0.753  0.910
X8Y1, 1UB, PBIE, GB        0.182  0.586
X8Y1, 1UB, PBIE, GB, AV    0.273  0.656
X8Y1, 2UB, PBIE, GB        0.347  0.708
X8Y1, 2UB, PBIE, GB, AV    0.483  0.800
X8Y1, 3UB, PBIE, GB        0.449  0.771
X8Y1, 3UB, PBIE, GB, AV    0.597  0.857

Table 6. Selected E2C pilot tests

5 Conclusion and Future Work

This work proposes to use the AV of source graphemes for E2C transliteration. Experiments indicate that the AV features generally improve performance in terms of ACC. Recommended future investigations would be features of target graphemes or source-channel models (Li et al., 2004) that are efficient and capable of recognizing semi-semantic transliteration.

Acknowledgements

This research was supported in part by the National Science Council under grant NSC 100-2631-S-001-001, and the research center for Humanities and Social Sciences under grant IIS-50-23. Wallace Academic Editing service is appreciated for their editorial assistance. The authors would like to thank anonymous reviewers for their constructive criticisms.

References

Jing-Shin Chang and Keh-Yih Su. 1997. An unsupervised iterative method for Chinese new lexicon extraction. Computational Linguistics and Chinese Language Processing, 2(2):97-148.

Paul Cohen and Niall Adams. 2001. An Algorithm for Segmenting Categorical Time Series into Meaningful Episodes. Advances in Intelligent Data Analysis, 198-207.

Haizhou Li, Min Zhang and Jian Su. 2004. A Joint Source Channel Model for Machine Transliteration. Proceedings of the 42nd ACL, 159-166.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

J. H. Oh and K. S. Choi. 2006. An Ensemble of Transliteration Models for Information Retrieval. Information Processing and Management, 42:980-1002.

Paul Cohen, Niall Adams and Brent Heeringa. 2007.
Voting Experts: An Unsupervised Algorithm for Eric Sven Ristad and Peter N. Yianilos. 1998. Learn- Segmenting Sequences. Intelligent Data Analysis, ing String Edit Distance. IEEE Transactions on 11(6):607-625. Pattern Recognition and Machine Intelligence, 20(5):522-532. Paul R Cohen, B Heeringa and Niall M Adams. 2002. An Unsupervised Algorithm for Segmenting Sravana Reddy and Sonjia Waxmonsky. 2009. Sub- Categorical Timeseries into Episodes. Proceed- string-based transliteration with conditional ran- ings of the ESF Exploratory Workshop on Pattern dom fields. Proceedings of the 2009 Named Enti- Detection and Discovery, 49-62. ties Workshop, 92-95. Haodi Feng, Kang Chen, Xiaotie Deng, and Wiemin Praneeth Shishtla, V. Surya Ganesh, Sethuramalin- Zheng. 2004. Accessor Variety Criteria for Chi- gam Subramaniam and Vasudeva Varma. 2009. A nese Word Extraction. Computational Linguistics, language-independent transliteration schema using 30(1):75-93. character aligned models at NEWS 2009. Pro- ceedings of the 2009 Named Entities Workshop, Zellig Sabbetai Harris. 1970. Morpheme boundaries 40-43. within words. Papers in Structural and Transfor- mational Linguistics, 68-77. Kumiko Tanaka-Ishii. 2005. Entropy as an Indicator of Context Boundaries: An Experiment Using a Jin Hu Huang and David Powers. 2003. Chinese Web Search Engine. Proceedings of International Word Segmentation based on contextual entropy. Joint Conference on Natural Language Pro- Proceedings of the 17th Asian Pacific Conference cessing , 93-105. on Language, Information and Computation, 152- 158. Cheng-Huang Tung and His-Jian Lee. 1994. Identifi- cation of unknown words from corpus. Computa- Sittichai Jiampojamarn, Grzegorz Kondrak and Tarek tional Proceedings of Chinese and Oriental Lan- Sherif. 2007. Applying Many-to-Many Align- guages, 131-145. ments and Hidden Markov Models to Letter-to- Phoneme Conversion. Proceedings of the Annual P. Virga and S. Khudanpur. 2003. 
Transliteration of Conference of the North American Chapter of the Proper Names in Cross-lingual Information Re- Association for Computational Linguistics, 372- trieval. In the Proceedings of the ACL Workshop 379. on Multi-lingual Named Entity Recognition. Tian-Jian Jiang, Shih-Hung Liu, Cheng-Lung Sung Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku and Wen-Lian Hsu. 2010. Term Contributed Oonishi, Masanobu Nakamura, Sadaoki Furui. Boundary Tagging by Conditional Random Fields 2009. Combining a two-step conditional random for SIGHAN 2010 Chinese Word Segmentation field model and a joint source channel model for Bakeoff. Proceeding of the First CIPS-SIGHAN machine transliteration. Proceedings of the 2009 Joint Conference on Chinese Language Pro- Named Entities Workshop, 72-75. cessing, 266-269. Hai Zhao and Chunyu Kit. 2007. Incorporating K. Knight and J. Graehl. 1998. Machine Translitera- Global Information into Supervised Learning for tion. Computational Linguistics, 24(4):599-612. Chinese Word Segmentation. Proceedings of the 10th Conference of the Pacific Association for John Lafferty, Andrew McCallum, Fernando Pereira. Computation Linguistics, 66-74. 2001. Conditional Random Fields Probabilistic Models for Segmenting and Labeling Sequence Hai Zhao and Chunyu Kit. 2008. Unsupervised Seg- Data. Proceedings of ICML, 591-598. mentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Enti- Thomas Lavergne, Oliver Cappé and François Yvon. ty Recognition. Proceedings of the Sixth SIGHAN 2010. Practical Very Large Scale CRFs. Proceed- Workshop on Chinese Language Processing. ings the 48th ACL, 504-513. 
The Amirkabir Machine Transliteration System for NEWS 2011: Farsi-to-English Task

Najmeh Mousavi Nejad
Department of Engineering, Islamic Azad University, Science & Research Branch, Punak, Ashrafi Isfahani, Tehran, Iran

Shahram Khadivi
Department of Computer Engineering, Amirkabir University of Technology, 424 Hafez Ave, Tehran, Iran 15875-4413

Kaveh Taghipour
Department of Computer Engineering, Amirkabir University of Technology, 424 Hafez Ave, Tehran, Iran 15875-4413
Abstract

In this paper we describe the statistical machine transliteration system of Amirkabir University of Technology, developed for the NEWS 2011 shared task. This year we participated in the English-to-Persian language pair. We use three systems for transliteration: the first is a maximum entropy model with a newly proposed alignment algorithm; the second is the Sequitur g2p tool, an open-source grapheme-to-phoneme converter; the third is Moses, a phrase-based statistical machine translation system. In addition, several new features are introduced to enhance the overall accuracy of the maximum entropy model. The results show that combining our maximum entropy system with the Sequitur g2p tool and Moses leads to a considerable improvement over each individual system.

1 Introduction

This paper describes the statistical machine transliteration system used for participation in the NEWS 2011 shared task workshop. We participated in the English-to-Persian task and used three different systems for transliteration generation. Our training and test data is the English-to-Persian set from the NEWS 2011 Name Transliteration Shared Task (Zhang et al., 2011). We use the OpenNLP maximum entropy package to train our system. We define new features for discriminative training. Moreover, a new approach for aligning name pairs is proposed.

2 The Transliteration Process

Our maximum entropy transliteration system has the following steps:

1. Preprocessing
2. Alignment of name pairs
3. Definition of proper features for aligned names
4. Training the model to produce feature weights

2.1 Preprocessing

Preprocessing plays an important role in many NLP applications. The amount and kind of processing done depends on the nature of the language. Since some letters in Persian have more than one Unicode representation (for example " "), we run a normalization tool on the training set to unify the letters.
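Such a normalization reduces to a small mapping table. A minimal sketch follows, assuming two of the well-known Arabic-versus-Persian codepoint pairs (yeh and kaf) for illustration; the paper's actual tool presumably covers more letters:

```python
# Canonicalize Persian letters that have several Unicode variants.
# The two pairs below (yeh and kaf) are illustrative assumptions,
# not the paper's full mapping.
NORMALIZATION = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

def normalize(text):
    """Replace each variant codepoint with its canonical Persian form."""
    return "".join(NORMALIZATION.get(ch, ch) for ch in text)
```

Applying this once over the training set guarantees that identical letters always carry identical codepoints, so alignment and feature extraction never treat two spellings of the same name as distinct.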
There has been little research on the Persian language (Karimi et al., 2007). The quality of transliterated names was improved in these past studies; however, the proposed method is language-specific and the algorithm is designed for Persian. We present two combined transliteration systems. The first is a combination of a maximum entropy model, together with our proposed alignment algorithm, and the Sequitur g2p tool. The second is a combination of our maximum entropy system and Moses.

Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 91–95, Chiang Mai, Thailand, November 12, 2011.

2.2 Alignment of Name Pairs

The features for maximum entropy training are extracted from aligned names. Our proposed alignment method uses a two-dimensional Cartesian coordinate system: the horizontal axis is labeled with the source name and the vertical axis with the target name (or vice versa). A line is drawn from the coordinate (0,0) to the point (source_name_length, target_name_length), and we mark the cell in each column of the alignment matrix that has the least distance to the line. A single line is not enough for a name pair and is only suitable for names of equal length. For more complex alignments, some fixed points are needed in order to draw the lines. In Figure 1, (bb, ) and (n, ) are fixed points, and the following alignments are achieved: (g, ), (i, ), (bb, ), (o, ), (n, ), (i, ). Once the probabilities of candidate transformation rules are calculated, they are compared to a predefined threshold (in our case, 0.9).

2.3 Definition of Proper Features for Aligned Names

We define two types of features: consonant-vowel and n-gram. For both types, the current context (letter), the two past contexts, and the two future contexts are used. We choose a window of size 5, since a smaller or larger window degraded the results.

2.3.1 Consonant-Vowel Features

Every language has a set of consonant and vowel letters.
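The column-marking step of the Section 2.2 alignment can be sketched geometrically. This is our own simplified rendering, approximating "the cell with the least distance to the line" by the cell the line passes through at each column's centre:

```python
def diagonal_alignment(src_len, tgt_len):
    """Mark, for each source-letter column, the target-letter cell
    closest to the line drawn from (0, 0) to (src_len, tgt_len).
    A sketch of the paper's geometric alignment, not its exact code."""
    slope = tgt_len / src_len
    marks = []
    for col in range(src_len):
        x = col + 0.5                      # centre of this column
        row = min(int(slope * x), tgt_len - 1)
        marks.append((col, row))
    return marks
```

For names of equal length the marks fall on the diagonal; for unequal lengths several columns share one row, which is exactly why the paper introduces fixed points for the more complex alignments.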
The consonant letters can be divided into g i b b o n i different groups based on their types (Table 1). Figure 1.Alignment matrix of (gibboni, ) Plosive (stop) p,b,t,d,k,g,q Based on the fact that our goal is to design a Fricative f,v,s,z,x,h language independent transliteration system, an Plosive-Fricative j,c automatic way to find the fixed points is of interest. We introduce FPA algorithm (Fixed Flap (tap) r Points Alignment) which is an unsupervised Nasal m,n approach that adopts the concept of EM training. Lateral approximant l,y In the expectation step the training name pairs Table 1. Six group of consonants are aligned using the current model and in the Most combinations of consonant-vowel maximization step the most probable alignments features were tested for English to Persian are added to the fixed point set. A brief sketch of transliteration. We have found the following FPA algorithm is presented in Figure 2. Line 5 to consonant-vowel features are the most effective 11 shows the process of updating the fixed points ones for generating current target letter (tn). Si is set. In line 7 forcedAlignment means using the used to represent the source name characters and current ME model to transliterate source name ti represents the target name characters. CV is an with the condition that the produced abbreviation for consonant- vowel. Note that the transliterations should be the same as the target consonant letters are divided according to Table name. This condition guarantees the convergence of the algorithm. Line 9 is the last step in producing fixed point set. |k| is the number of The consonant-vowel features improve distinct segments in the best path set and transliteration, but still are not sufficient. Therefore we need n-gram features. 
is the probability of the 1: while( fixedPoints != oldFixedPoints) { 2: oldFixedPoints = fixedPoints; 3: fixedPoints = updateFixedPoints(whole_training_corpus) 4: } 5: Function updateFixedPoints(training_data){ 6: for( all name pairs) do 7: A = forcedAlignment(sourceName, targetName, currentModel) 8: for (all segment pairs in A) do 9: , ! 10: if (p > threshold) { add transformation rule to the fixedPoints} 11: } Figure 2. Sketch of the FPA algorithm 92 2.3.2 N-gram Features feature file for each line in the training set. In other words all alignments of the first Persian In n-gram features for source name, two past and transliteration are added to the feature file. For two future contexts are used (a window with a other variants only the alignments which were size of 5). For the target name however, only two not seen in the previous Persian transliterations, past contexts are used (since we don’t have are added to the file. future context yet). Approach 3: we assign an equal weight to each Using S to demonstrate the source name and T Persian transliteration of an English name. For to demonstrate the target name, the n-gram example if an English name has 4 Persian features for each name can be summarized as: transliteration, the value of each name weight "#$% "#$& "# "#'& "#'% will be 0.25. (#$% (#$& ) ) ) Approach 4: only one Persian name is selected for training. The selection process uses the For any language pair, all combinations of si previous model to estimate the best Persian and ti can be used to define a feature. We tested transliteration. almost any combination of above features for The best word accuracy in Table 2 belongs to English to Persian transliteration. The results the last row. So in the rest of the paper we use show that (#$% does not help in better approach 2 for the first phase and approach 1 for transliteration. Because written Persian omits the second phase. short vowels, and only long vowels appear in texts. 
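The source window and target history just described can be extracted with a small feature function. The feature names and the "#" padding below are our own illustration, not the paper's exact templates:

```python
def ngram_features(source, target_history, n, pad="#"):
    """Features for predicting target letter t_n (0-indexed): the five
    source letters s_{n-2}..s_{n+2} and the two previous target
    letters t_{n-2}, t_{n-1}, padded at the boundaries."""
    s = pad * 2 + source + pad * 2        # pad so the window always exists
    feats = {f"s[{k}]": s[n + 2 + k] for k in range(-2, 3)}
    for k in (-2, -1):
        idx = n + k
        feats[f"t[{k}]"] = target_history[idx] if idx >= 0 else pad
    return feats
```

During decoding the target history grows one letter at a time, which is why only past target contexts are available as features.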
So t_{n-2} is completely irrelevant for generating the current Persian letter, but the other contexts lead to a better transliteration. The details of the FPA algorithm and the feature selection strategies are explained in our research paper accepted by NEWS 2011.

2.4 Training the Model and Producing Feature Weights

As mentioned earlier, we use the OpenNLP maximum entropy package in the training stage. The features extracted in the previous section are the inputs to the maximum entropy model. After a number of iterations, ME builds the model and produces the feature weights, which are then used in the test stage.

Some names in the workshop dataset have more than one transliteration. Several experiments were done to study the effect of this multi-transliteration data on our system; Table 2 shows the results. The phases and approaches in the table are defined as follows.

Phase 1: updating the fixed-point set.
Phase 2: finding feature weights.

Approach 1: each Persian variant and the corresponding English name is considered one name pair. So if a line in the training file has one English name and 5 Persian transliterations, we will have 5 name pairs for that line. This approach causes many similar alignments to be added to the feature file for a single line in the training file.

Approach 2: similar to Approach 1, except that we add only distinct alignments to the feature file for each line in the training set. In other words, all alignments of the first Persian transliteration are added to the feature file; for the other variants, only the alignments not seen in the previous Persian transliterations are added.

Approach 3: we assign an equal weight to each Persian transliteration of an English name. For example, if an English name has 4 Persian transliterations, each receives a weight of 0.25.

Approach 4: only one Persian name is selected for training. The selection process uses the previous model to estimate the best Persian transliteration.

Phase 1     Phase 2     WA    CA
Approach 1  Approach 2  65.7  82.4
Approach 1  Approach 1  66.8  82.5
Approach 3  Approach 1  66.8  82.5
Approach 4  Approach 2  67.2  82.7
Approach 2  Approach 2  67.3  82.7
Approach 4  Approach 1  68.2  82.9
Approach 2  Approach 1  68.3  82.9
Table 2. The effect of the multi-transliteration dataset on word accuracy (WA) and character accuracy (CA) in Top-1, tested on the development set

The best word accuracy in Table 2 belongs to the last row, so in the rest of the paper we use Approach 2 for the first phase and Approach 1 for the second phase.

3 System Combination

System combination is the method of combining stand-alone systems to achieve a better result. We have three separate transliteration systems, each of which generates reasonable output. The first system is the ME model along with our new alignment approach. The second is the open-source Sequitur G2P, a grapheme-to-phoneme conversion tool (Bisani and Ney, 2008); considering the transliteration direction, the names in the source language are regarded as graphemes and the names in the target language as phonemes. The third system is Moses, a phrase-based statistical machine translation system. In order to obtain an accurate transliteration system with a phrase-based statistical translation model, Moses is trained with an unconstrained phrase length. Having no limit on the maximum phrase length is feasible for transliteration, since there are far fewer phrase pairs than in translation; it enables the model to learn all proper phrases and also to act as a translation memory. In addition, the decoder is not permitted to reorder the phrases (the distortion limit is set to zero). Moreover, the beam threshold, hypothesis stack size, and translation table limit are set for maximum performance.

The released training and development datasets overlap, and some names in the training set are repeated in the development set. Therefore a memory-based approach improves the results considerably: if a test name is observed in the training set, its transliterations are put on top of the N-best list. With this memory-based approach, the Top-1 accuracy of the fourth system is 86.4 and that of the fifth system is 86.0.

The final combined system should produce 10 candidates for each name in the test data.
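Both combined systems rely on re-ranking candidates by the linear interpolation of Equation 3.1 below, P_final = λ * P1 + (1 - λ) * P2. The sketch that follows is our own minimal rendering of that re-ranking; candidate names and scores are illustrative:

```python
def combine(p1_scores, p2_scores, lam, topk=10):
    """Linear interpolation of two systems' candidate probabilities:
    P_final = lam * P1 + (1 - lam) * P2; return the top-k candidates.
    p1_scores: {candidate: P1} from the first system;
    p2_scores: {candidate: P2}, e.g. from forced alignment."""
    final = {c: lam * p1 + (1.0 - lam) * p2_scores.get(c, 0.0)
             for c, p1 in p1_scores.items()}
    return sorted(final, key=final.get, reverse=True)[:topk]
```

The interpolation weight λ is not learned here; as the paper describes, it is tuned on held-out data for each combined system.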
To achieve this goal, the first combined system, which combines Sequitur g2p and MEM with FPA, takes the following steps. First, g2p produces 50 candidates for each name, ranked by the probability the model assigns to them (P1); for N test names we thus have N*50 name pairs. We then apply forcedAlignment, described in Section 2.2, to each pair. This process produces another probability for each pair (P2), which is the product of the best path edges in the search tree (see Figure 2 for further details). We then use a linear combination of P1 and P2; the final probability for each pair is:

P_final = λ * P1 + (1 - λ) * P2    (3.1)

Once λ is found, the 10 transliterations with the highest P_final are returned as the final transliterations.

The second combined system combines Moses and MEM with FPA. The process is similar to that of the first combined system; the difference is the value of λ. The values of λ for each combined system are reported in the next section.

4 Results

We report our results on the development data provided by the NEWS 2011 task. For the development runs, we use the training set for training and the development set for testing. The best feature combinations, found in Section 2.3, are included in the training stage. We split the development data into two halves: the first half is used for tuning λ and the second half for system evaluation. Table 3 shows word accuracy in Top-1 and MRR in Top-10 for the five systems. The value of λ is set to 0.57 for the fourth system and 0.7 for the fifth.

ID  System              WA    MRR   F-Score  MAPref
1   MEM with FPA        66.5  77.5  94.6     65.5
2   Sequitur G2P        67.7  79.5  95.0     66.9
3   Moses               67.5  78.8  93.8     66.5
4   1 combined with 2   70.0  81.0  95.2     69.2
5   1 combined with 3   68.2  79.7  94.9     67.1
Table 3. Results on the second half of the development set (in %)

5 Conclusions

In this paper, we presented a language-independent alignment method for transliteration. Discriminative training is used in our system, and a number of new features are defined in the training stage. Furthermore, a grapheme-to-phoneme tool is recommended for the transliteration task, treating one side as graphemes and the other as phonemes. Additionally, a phrase-based statistical translation model is configured for maximum transliteration accuracy and used as one of the independent components of the system combination process. The results showed that the combination of the three systems improves overall accuracy.

Acknowledgments

This work has been partially supported by the Iranian Research Institute for ICT (ex. ITRC) under grant 500/1141. The authors wish to thank the anonymous reviewers for their criticisms and suggestions.

References

Bisani, M., Ney, H. 2008. Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, doi:10.1016/j.specom.2008.01.002.

Fraser, A., Marcu, D. 2006. Semi-Supervised Training for Statistical Word Alignment. Proceedings of ACL 2006, pp. 769-776, Sydney, Australia.

Goto, I., Kato, N., Uratani, N., Ehara, T. 2003. Transliteration Considering Context Information Based on the Maximum Entropy Method. In Proceedings of the IXth MT Summit.

Jiampojamarn, S., Kondrak, G. 2010. Letter-Phoneme Alignment: An Exploration. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780-788, Uppsala, Sweden, 11-16 July 2010.

Och, F. J., Ney, H. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 295-302.

Zelenko, D., Aone, C. 2006. Discriminative Methods for Transliteration. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 612-617, Sydney, July 2006.

Zhang, M., Kumaran, A., Li, H. 2011. Whitepaper of NEWS 2011 Shared Task on Machine Transliteration. In Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011.
Karimi, S., Scholer, F., Turpin, A. 2007. Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration. The 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), pages 648-655, Prague, Czech Republic, June 2007.

Karimi, S. Machine Transliteration of Proper Names between English and Persian. A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Companion Volume, June.

OpenNLP Maximum Entropy Package. Available at http://incubator.apache.org/opennlp/

Yoon, S., Kim, K., Sproat, R. 2007. Multilingual Transliteration Using Feature based Phonetic Method. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 112-119, Prague, Czech Republic, June 2007.

English-Chinese Personal Name Transliteration by Syllable-Based Maximum Matching

Oi Yee Kwong
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
Abstract

This paper reports on our participation in the NEWS 2011 shared task on transliteration generation with a syllable-based Backward Maximum Matching system. The system uses the Onset First Principle to syllabify English names and align them with Chinese names. The bilingual lexicon containing aligned segments of various syllable lengths subsequently allows direct transliteration by chunks. The official results suggest that our system could potentially be improved with a re-ranking module, while its performance on Chinese-to-English back transliteration reached the state of the art.

1 Introduction

This paper describes our system participating in two tracks of the NEWS 2011 shared task on transliteration generation: English-to-Chinese transliteration (EnCh) and Chinese-to-English back transliteration (ChEn).

2 Related Work

The reports of the shared task in NEWS 2009 (Li et al., 2009) and NEWS 2010 (Li et al., 2010) highlighted two particularly popular approaches to transliteration generation among the participating systems. One is phrase-based statistical machine transliteration (e.g. Song et al., 2010; Finch and Sumita, 2010); the other is Conditional Random Fields, which treats the task as one of sequence labelling (e.g. Shishtla et al., 2009). Besides these popular methods, Huang et al. (2011), for instance, used a non-parametric Bayesian learning approach for English-to-Chinese transliteration in a recent study.

Regarding the basic unit of transliteration, traditional systems are mostly phoneme-based (e.g. Knight and Graehl, 1998). Li et al. (2004) suggested a grapheme-based Joint Source-Channel Model within the Direct Orthographic Mapping framework. Models based on characters (e.g. Shishtla et al., 2009), syllables (e.g. Wutiwiwatchai and Thangthai, 2010), as well as hybrid units (e.g. Oh and Choi, 2005), are also seen.
In Our system is essentially a syllable-based addition to phonetic features, others like tem- Backward Maximum Matching (BMM) system, poral, semantic, and tonal features have also which works bi-directionally for EnCh and ChEn. been found useful in transliteration (e.g. Tao et The Onset First Principle in phonology was used al., 2006; Li et al., 2007; Kwong, 2009a). to syllabify English names and align them with the Chinese renditions. A bilingual lexicon con- 3 Datasets taining segment pairs of various syllable lengths was then produced from the aligned names. This The transliteration data provided by the shared lexicon was subsequently used in transliteration, task organiser are mostly based on name pairs during which a source name was first syllabified from Xinhua News Agency (1992). For EnCh, and then segmented using BMM with syllables there are 37,753 English-Chinese name pairs in as the basic units. Target candidates were gener- the training set, 2,802 pairs in the development ated by looking up the bilingual lexicon and set, and another 2,000 English names in the test ranked by unigram probabilities. set. For ChEn, there are 28,678 Chinese-English We will briefly review related work in Section name pairs in the training set, 2,719 pairs in the 2, and introduce the datasets used in this study in development set, and another 2,266 Chinese Section 3. The system will be described and its names in the test set. The Chinese translitera- performance reported in Section 4, followed by tions basically correspond to Mandarin Chinese future work and conclusion in Section 5. pronunciations of the English names, as used by the media in Mainland China. 96 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 96–100, Chiang Mai, Thailand, November 12, 2011. In the current study, we focused entirely on tan3) when they appear as part of different personal name transliteration. The small propor- names. 
So based on the concept of translation tion of place names in the data was not handled. memory, if a larger chunk can be matched, trans- Most of them contain multiple English words or literation becomes easier and less uncertain. In otherwise are not entirely phonemically rendered this way, the context embedding a syllable is in- in Chinese (e.g. Africa 非洲, transcribed as fei1 corporated, and it might also reduce error propa- zhou1 in Hanyu Pinyin). They are better dealt gation in the pipeline during syllabification and with by a specific lookup table of place names, phoneme mapping. but since we only participated in the standard With the above linguistic and practical consid- runs and did not use any external resources, erations, a syllable-based Maximum Matching those cases were practically ignored. approach is thus adopted, and the following sub- All English names are in upper case letters, sections explain the steps involved. and all occurrences of “X” were replaced by “KS” before processing to facilitate subsequent English Chinese Hanyu Pinyin syllabification, as a single letter “X” in an Eng- JACOB 雅各布 ya3 ge4 bu4 lish word often corresponds to the consonant JACOBS 雅各布斯 ya3 ge4 bu4 si1 cluster /ks/ when pronounced. JACOBSEN 雅各布森 ya3 ge4 bu4 sen1 JACOBSTEIN 雅各布斯坦 ya3 ge4 bu4 si1 tan3 4 System Description ARENSTEIN 阿伦斯坦 a4 lun2 si1 tan3 BARTENSTEIN 巴滕斯坦 ba1 teng2 si1 tan3 Our system is motivated linguistically and for DUBERSTEIN 杜伯斯坦 du4 bo2 si1 tan3 practical reasons. On the one hand, translitera- Table 1. Examples of Transliteration by Chunks tion is to render a source name in a phonemically similar way in a target language, and syllable is 4.1 Syllabification an important concept in pronunciation. Accord- The English names in the training data and de- ing to Ladefoged (2006), for alphabetic writing velopment data were first syllabified with the systems, syllables are systematically split into Onset First Principle. According to Katamba their components. 
A syllable is composed of an (1989), the principle suggests that syllable-initial optional onset containing consonants and a man- consonants are first maximised to the extent con- datory rhyme. The rhyme comprises a mandato- sistent with the syllable structure conditions of ry nucleus containing vowels and an optional the language in question, followed by the maxi- coda containing consonants. English has com- misation of syllable-final consonants. plex onsets and codas, whereas Mandarin Chi- In English, written symbols do not necessarily nese has simple onsets and only allows nasal bear a one-to-one relationship with phonological consonants in the coda. According to Dobro- segments. So in practice, with reference to volsky and Katamba (1996), native speakers of common phonics patterns, we drew up a list of any language intuitively know that certain words possible onsets containing graphemic units that come from other languages sound unusual which may correspond to simple phonemes (e.g. and they often adjust the segment sequences of “CH”, “TH”) or complex onsets (e.g. “PL”, these words to conform to the pronunciation re- “STR”) to be used in syllabification. quirements of their own language. These intui- During syllabification, all vowels were first tions are based on a tacit knowledge of the per- marked as nucleus (N). The longest acceptable missible syllable structures of the speaker’s own consonant sequences on the left of the vowels language. Hence, the complex onset in the Eng- were then marked as onset (O), and finally all lish syllable “STEIN” (as in Figure 1) violates remaining consonants were marked as coda (C). the onset constraints in Chinese and is therefore From left to right, syllables are marked for each resolved into two Chinese syllables as “斯坦” longest matching chain of ONC, ON, NC, or N. (si1 tan3). Hence syllable is apparently the The top half of Figure 1 illustrates these steps. proper basic unit for machine transliteration. 
On the other hand, during transliteration, people tend not to re-invent the wheel for a similar chunk of syllables in the source name. The examples in Table 1 illustrate this observation. As seen, "JACOB" is consistently rendered as "雅各布" (ya3 ge4 bu4) and "STEIN" as "斯坦" (si1 tan3).

Subsequently, the syllable chain was subject to sub-syllabification, considering the difference in phonotactics between English and Chinese. In particular, Chinese syllables have no complex onsets and only allow nasal consonants for codas. So if the syllabification step produces fewer English syllables than Chinese syllables, the sub-syllabification process will try to expand the English syllables, with the number of syllables checked after each expansion. At any point, if the English syllables outnumber the Chinese ones, the sub-syllabification process will try to contract the English syllables.

The expansion process will thus follow the order of precedence below:
(1) From left to right, split up complex onsets. For example, "STEIN" is split up into "S/TEIN".
(2) From right to left, split up complex codas, or separate the coda from the nucleus if the coda is not available in the target language. For example, "COB" is sub-syllabified as "CO/B".
(3) From right to left, separate liquids and glides ("L", "R", "W") from the nucleus if the Chinese rendition has "尔" (er3) or "夫" (fu1) in it. For example, with the pair "MINKOWSKI" and "明科夫斯基" (ming2 ke1 fu1 si1 ji1), initial syllabification produces three syllables, "MIN/KOW/SKI". During sub-syllabification, "SKI" will be split into "S/KI" with (1) above, but the English side is still one syllable short. So "KOW" will be split into "KO/W" in the next expansion.
(4) From left to right, expand diphthongs as necessary. For example, diphthongs like "IA" will be split up as in "A/ME/LI/A".

The contraction process will follow the order of precedence below:
(1) Contract the name-initial "M/C", if any, with the following syllable.
(2) From right to left, contract nasals, liquids and glides followed by "E" with the previous syllable. For example, "AALLIBONE" for "阿利本" (a4 li4 ben3) will be initially syllabified as "AA/LLI/BO/NE", which will then be contracted to "AA/LLI/BONE".

The middle part of Figure 1 illustrates the sub-syllabification process.

4.2 Alignment

Upon syllabification and sub-syllabification, if the number of English syllables equals the number of Chinese syllables, alignment can be done directly in a one-to-one manner. Otherwise, some heuristics would be used to attempt some complex alignments. As long as Chinese syllables still outnumber English syllables, the next English syllable with four or more letters or starting with two different consonants will absorb two Chinese syllables, assuming such long segments are actually pronounced as two syllables. For example, "A/L/THOU/SE" does not have enough syllables to align with its Chinese rendition "奥尔特豪斯" (ao4 er3 te4 hao2 si1), so "THOU" will be forced to take up two Chinese syllables "特豪" (te4 hao2). At any point, if the remaining Chinese syllables fall short of the English syllables, the rest will be aligned as a whole without further breaking into syllables. For example, "YON/GE" will simply be aligned with the Chinese name "扬" (yang2). The bottom part of Figure 1 shows the alignment step.

[Figure 1. Syllabification and Alignment: "JACOBSTEIN" is syllabified as JA/COB/STEIN via ONC labelling, sub-syllabified into five units, and aligned one-to-one with 雅/各/布/斯/坦.]

4.3 Lexicon Production

Based on the aligned names, segment pairs of various syllable lengths were extracted to produce a bilingual lexicon as follows:

For i = 1 to n (# of aligned segment pairs)
  For j = i to n
    Extract segment-i to segment-j
  Next j
Next i

Hence for the aligned name in Figure 1, the following segment pairs will enter into the lexicon: JA/雅 (ya3), JACO/雅各 (ya3 ge4), JACOB/雅各布 (ya3 ge4 bu4), JACOBS/雅各布斯 (ya3 ge4 bu4 si1), JACOBSTEIN/雅各布斯坦 (ya3 ge4 bu4 si1 tan3), CO/各 (ge4), COB/各布 (ge4 bu4), COBS/各布斯 (ge4 bu4 si1), COBSTEIN/各布斯坦 (ge4 bu4 si1 tan3), B/布 (bu4), BS/布斯 (bu4 si1), BSTEIN/布斯坦 (bu4 si1 tan3), S/斯 (si1), STEIN/斯坦 (si1 tan3), and TEIN/坦 (tan3). Note that we use "segment pairs" instead of "syllable pairs" here, as the alignment may involve one or more syllables on either side.

4.4 Backward Maximum Matching

During transliteration, an English source name was first syllabified using the syllabification and sub-syllabification procedures described above, except the contraction part. The name was then segmented using Backward Maximum Matching with the lexicon. The matching was syllable-based, unless even the shortest syllable could not be matched with the lexicon; in that case the syllable would be matched as a string of characters. The same procedures were applied to EnCh and ChEn, as the lexicon contains bilingual segment pairs and can be looked up bi-directionally. Maximum Matching can be done with the English segments or Chinese segments accordingly. Chinese source names do not need particular syllabification, as Chinese characters are syllabic.

4.5 Candidate Generation and Ranking

With the segmented source name, target candidates were generated by looking up the lexicon for each segment and its rendition(s) in the target language. In the current study, the candidates were simply ranked by unigram probabilities. Figure 2 shows an example of Maximum Matching and candidate generation.

[Figure 2. Max Matching and Candidate Generation: "MARKSTEIN" is syllabified as MARK/STEIN, sub-syllabified into five units, segmented by syllable-based Maximum Matching as 马克/斯坦, and candidates such as 马克斯坦, 马克施泰因, 马克恩斯坦 and 马克茨坦 are generated.]

4.6 Official Results

Table 2 and Table 3 show the official results for the two standard runs we submitted and the best system in EnCh and ChEn respectively. The first run used segment pairs with frequency two or above, and the second run used those with frequency five or above. The evaluation metrics follow the definitions in the whitepaper of the shared task (Zhang et al., 2011).

Table 2. Official EnCh Results on Test Data
Metric | Run 1 | Run 2 | Best
ACC | 0.305 | 0.285 | 0.348
Mean F-score | 0.672 | 0.660 | 0.700
MRR | 0.378 | 0.349 | 0.462
MAPref | 0.297 | 0.276 | 0.342

Table 3. Official ChEn Results on Test Data
Metric | Run 1 | Run 2 | Best
ACC | 0.155 | 0.154 | 0.167
Mean F-score | 0.766 | 0.757 | 0.765
MRR | 0.215 | 0.206 | 0.202
MAPref | 0.155 | 0.154 | 0.167

The performance of our system on EnCh is in the mid-range, and re-ranking with n-gram features is apparently important. For instance, VE/夫 (fu1) is more frequent than VE/维 (wei2), but the former is often restricted to the end of a name. This would not be realised for now, unless a longer segment can be matched; e.g. "VELO" could only be matched on single syllables, so "夫洛" (fu1 luo4) came before "维洛" (wei2 luo4), but "VELASCO" found a longer match with "维拉斯科" (wei2 la1 si1 ke1) as the first candidate. This suggests that Maximum Matching is useful, but re-ranking is needed for better performance.

ChEn is apparently more difficult, and scores are lower in general. Nevertheless, our system came in the top three, giving even better Mean F-score and MRR than the system with the best ACC. The more severe graphemic ambiguity for ChEn may make it a more difficult task. According to Kwong (2009b), on average one English segment (syllable) has 1.7 Chinese renditions, but one Chinese character can be mapped to 10 different English segments. Another major problem for ChEn is unseen characters and the spelling conventions of English or other European languages. For example, "云" (yun2) was not found in the training and development data, and therefore "云格" (yun2 ge2) could not be properly back-transliterated. Also, some candidates end up with triple consonants, which are obviously not acceptable in English and should be avoided.

5 Future Work and Conclusion

Thus the performance of our approach on EnCh has room for improvement, possibly with a re-ranking module, while that on ChEn is close to the state of the art. Forward and Backward Maximum Matching could potentially be used together to better handle overlapping ambiguity, so as not to miss other possible candidates.

Acknowledgements

The work described in this paper was substantially supported by a grant from City University of Hong Kong (Project No. 7008004).

References

Dobrovolsky, M. and Katamba, F. (1996) Phonology: the function and patterning of sounds. In W. O'Grady, M. Dobrovolsky and F. Katamba (Eds.), Contemporary Linguistics: An Introduction. Essex: Addison Wesley Longman Limited.
Finch, A. and Sumita, E. (2010) Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In Proceedings of NEWS 2010, Uppsala, Sweden.
Huang, Y., Zhang, M. and Tan, C.L. (2011) Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars. In Proceedings of ACL-HLT 2011: Short Papers, Portland, Oregon, pp.534-539.
Katamba, F. (1989) An Introduction to Phonology. Essex: Longman Group UK Limited.
Knight, K. and Graehl, J. (1998) Machine Transliteration. Computational Linguistics, 24(4):599-612.
Kwong, O.Y. (2009a) Homophones and Tonal Patterns in English-Chinese Transliteration. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Singapore, pp.21-24.
Kwong, O.Y. (2009b) Graphemic Approximation of Phonological Context for English-Chinese Transliteration. In Proceedings of NEWS 2009, Singapore, pp.186-193.
Ladefoged, P. (2006) A Course in Phonetics. Thomson Wadsworth.
Li, H., Zhang, M. and Su, J. (2004) A Joint Source-Channel Model for Machine Transliteration. In Proceedings of ACL 2004, Barcelona, Spain, pp.159-166.
Li, H., Sim, K.C., Kuo, J-S. and Dong, M. (2007) Semantic Transliteration of Personal Names. In Proceedings of ACL 2007, Prague, Czech Republic, pp.120-127.
Li, H., Kumaran, A., Pervouchine, V. and Zhang, M. (2009) Report of NEWS 2009 Machine Transliteration Shared Task. In Proceedings of NEWS 2009, Singapore.
Li, H., Kumaran, A., Zhang, M. and Pervouchine, V. (2010) Report of NEWS 2010 Transliteration Generation Shared Task. In Proceedings of NEWS 2010, Uppsala, Sweden.
Oh, J-H. and Choi, K-S. (2005) An Ensemble of Grapheme and Phoneme for Machine Transliteration. In R. Dale et al. (Eds.), Natural Language Processing – IJCNLP 2005. Springer, LNAI Vol. 3651, pp.451-461.
Shishtla, P., Ganesh, V.S., Sethuramalingam, S. and Varma, V. (2009) A language-independent transliteration schema using character aligned models. In Proceedings of NEWS 2009, Singapore.
Song, Y., Kit, C. and Zhao, H. (2010) Reranking with multiple features for better transliteration. In Proceedings of NEWS 2010, Uppsala, Sweden.
Tao, T., Yoon, S-Y., Fister, A., Sproat, R. and Zhai, C. (2006) Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation. In Proceedings of EMNLP 2006, Sydney, Australia, pp.250-257.
Wutiwiwatchai, C. and Thangthai, A. (2010) Syllable-based Thai-English Machine Transliteration. In Proceedings of NEWS 2010, Uppsala, Sweden, pp.66-70.
Xinhua News Agency. (1992) Chinese Transliteration of Foreign Personal Names. The Commercial Press.
Zhang, M., Kumaran, A. and Li, H. (2011) Whitepaper of NEWS 2011 Shared Task on Machine Transliteration. In Proceedings of NEWS 2011, Chiang Mai, Thailand.
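The segment-pair extraction loop of Section 4.3 above enumerates every contiguous span of aligned units on both sides. A minimal sketch in Python (an illustration, not the authors' code):

```python
def extract_segment_pairs(en_units, zh_units):
    """Enumerate all contiguous spans of aligned unit pairs (Section 4.3).

    en_units and zh_units are the one-to-one aligned sub-syllable units,
    e.g. ["JA", "CO", "B", "S", "TEIN"] and ["雅", "各", "布", "斯", "坦"].
    """
    assert len(en_units) == len(zh_units)
    n = len(en_units)
    pairs = []
    for i in range(n):            # For i = 1 to n
        for j in range(i, n):     #   For j = i to n
            pairs.append(("".join(en_units[i:j + 1]),
                          "".join(zh_units[i:j + 1])))
    return pairs
```

For the aligned name in Figure 1 of the preceding paper, this yields the fifteen pairs listed there, from JA/雅 up to JACOBSTEIN/雅各布斯坦.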
Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model

Yu Chen, Rui Wang, Yi Zhang
Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
{firstname.lastname}@dfki.de

(Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 101–105, Chiang Mai, Thailand, November 12, 2011.)

Abstract

This paper describes DFKI's participation in the NEWS 2011 shared task on machine transliteration. Our primary system participated in the evaluation for the English-Chinese and Chinese-English language pairs. We extended the joint source-channel model for the transliteration task into a multi-to-multi joint source-channel model, which allows alignments between substrings of arbitrary lengths in both source and target strings. When the model is integrated into a modified phrase-based statistical machine translation system, around 20% improvement is observed. The primary system achieved 0.320 on English-Chinese and 0.133 on Chinese-English in terms of top-1 accuracy.

1 Introduction

Machine transliteration has drawn a lot of attention in previous years. In particular, the previous two shared tasks (Li et al., 2009; Li et al., 2010) attracted more than 30 participants. This year's task focuses only on the transliteration generation task. As our first attempt in this area, we participated in the English-to-Chinese transliteration (En-Ch) and Chinese-to-English back-transliteration (Ch-En) tasks.

For En-Ch and Ch-En transliteration, there was a discussion on whether to use the intermediate phonemic interpretation, i.e., Pinyin. Li et al. (2004) showed empirically that by skipping the intermediate phonemic interpretation (denoted as grapheme-based methods), the transliteration error rate was reduced significantly, since the mapping between Pinyin and Chinese characters was not trivial. Oh et al. (2009) had a more generalized version of Li et al. (2004)'s system as well as other previous work (e.g., Knight and Graehl (1998), denoted as phoneme-based methods), and showed that incorporating Pinyin as one of the features did finally help transliteration performance. Li et al. (2007) included two other useful features, the language of origin and the gender association. As this is our first participation in this shared task, instead of searching for the "best" setting, we aim first at a basic but extensible architecture.

2 Systems

Transliteration can be viewed as a special case of the translation task, namely translation at the character level. State-of-the-art statistical machine translation systems were reported as being able to deliver satisfactory results for the transliteration task without additional knowledge of the languages (Knight and Graehl, 1998). However, general statistical machine translation systems do not consider the key features of the transliteration task, which, on the other hand, have been emphasized by the joint source-channel models.

Our primary system is a standard phrase-based statistical machine translation (PBSMT) system with a modification based on the multi-to-multi joint source-channel model. We hope the combination could benefit from the simplicity of a joint source-channel model without losing the flexibility of the PBSMT system.

2.1 Phrase-based SMT

The basic architecture of a phrase-based SMT system is an instance of the noisy-channel approach (Brown et al., 1993). In the context of transliteration, the term "phrase" in phrase-based SMT refers to a sequence of characters chosen by its statistical rather than any grammatical properties. The transliteration of a name s in the source language into a name t in the target language is modeled as:

  arg max_t P(t|s) = arg max_t P(t) P(s|t)   (1)

The system involves a phrase table, a list of character sequences identified in a source name together with potential transliterations. These sequences derived from the source names may overlap and may also have several correspondences in the target language. The process of searching for the target names starts with selecting a subset of the entries in the table. The members of the selected subset must then be arranged in a specific order to give a translation. These operations are determined by statistical properties of the target language enshrined in the so-called language model.

The segments in the source name and their counterparts in the target language should always be in exactly the same order, which is clearly not the case for general machine translation tasks. In addition to ordering, there are many other strict rules, such that the transliteration task is relatively more deterministic than the translation process. For instance, although it is common that many Chinese characters have the same pronunciation, only a small set of Chinese characters can be used in transliterated Western names. Accordingly, for each source name there is only a limited set of candidate transliterations, unlike the infinite target set for the general translation task.

It is critical to take these characteristics into account when utilizing an SMT system for transliteration. First, the distortion model, one of the major components in a standard PBSMT system, is redundant for transliteration. Including the unnecessary model expands the search space and makes it more difficult to find the good candidates. Second, the word alignment model (Och and Ney, 2004) in a PBSMT system also assumes flexible ordering of correspondences to some extent. This could introduce additional noise to the translation models if applied directly to transliteration tasks without any modifications.

2.2 M2M Joint Source-Channel Model

The joint source-channel machine transliteration model (Li et al., 2004) calculates the n-gram transliteration probability. More specifically, for a source name s, a target transliteration t, and an alignment α between the source and the target, we have the transliteration probability defined as:

  P(s, t, α) = ∏_{k=1}^{K} P(<e,c>_k | <e,c>_{k−n+1}^{k−1})   (1')

where <e,c>_k is the k-th aligned pair of translation units. Therefore, forward and backward transliteration can be uniformly obtained by (2) and (3):

  t = argmax_{t,α} P(s, t, α)   (2)
  s = argmax_{s,α} P(s, t, α)   (3)

The alignment statistics can be obtained with an Expectation-Maximization procedure over the training corpus.

For English-Chinese bidirectional transliteration, Li et al. (2004) assumed that each Chinese character aligns with a sequence of one or more letters in English. This assumption drastically reduces the number of possible alignments. For an English source s and a Chinese target t, the number of possible alignments under this assumption is

  C(|s|−1, |t|−1) = (|s|−1)! / ((|t|−1)! (|s|−|t|)!)

While the assumption holds true in most cases, several obvious limitations arise. First, it is assumed that the source string is at least as long as the target, which is not necessarily true. Second, and more importantly, in some cases multiple Chinese characters should align with one single English letter (for example 'X'), and in others, multiple Chinese characters constitute one single transliteration unit. Therefore, instead of adopting the "one Chinese character per unit" assumption, we allow alignments between substrings of arbitrary lengths in both the source and the target. We call this a Multi-to-Multi Joint Source-Channel model (M2M-JSC). This constitutes a much larger model, with more possible transliteration units on the Chinese side. To simplify the calculation, we use the 1-gram model for the calculation of the transliteration probability, and hope that the larger transliteration units compensate for the Markovian effect of mutual dependencies between alignment pairs.
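Under the 1-gram simplification above, the score of one segmentation alignment is just the product of the unit-pair probabilities, and decoding can search over all ways of co-segmenting the two strings. A toy sketch follows; the probability table and values are made up for illustration, and this is not the authors' implementation:

```python
from functools import lru_cache

# Hypothetical unit-pair probabilities P(<e, c>); toy values for illustration.
PAIR_PROB = {
    ("AHL", "阿尔"): 0.4, ("BERG", "伯格"): 0.5,
    ("A", "阿"): 0.3, ("HL", "尔"): 0.1, ("B", "伯"): 0.2, ("ERG", "格"): 0.1,
}

def best_alignment(src, tgt):
    """1-gram M2M-JSC: maximize the product of pair probabilities over all
    co-segmentations of src and tgt into the same number of substrings."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Best (score, segmentation) for the suffixes src[i:], tgt[j:].
        if i == len(src) and j == len(tgt):
            return 1.0, ()
        best = (0.0, ())
        for di in range(1, len(src) - i + 1):
            for dj in range(1, len(tgt) - j + 1):
                pair = (src[i:i + di], tgt[j:j + dj])
                p = PAIR_PROB.get(pair, 0.0)
                if p > 0.0:
                    rest_p, rest_seg = solve(i + di, j + dj)
                    cand = (p * rest_p, (pair,) + rest_seg)
                    if cand[0] > best[0]:
                        best = cand
        return best

    return solve(0, 0)
```

For "AHLBERG"/"阿尔伯格", this picks AHL/阿尔 + BERG/伯格 (score 0.2) over the finer A/阿 + HL/尔 + B/伯 + ERG/格 segmentation, mirroring the multi-to-multi alignment shown for this name in Table 1 below.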
We use a similar Expectation-Maximization procedure to train the model on the corpus. One slight variation from Li et al. (2004) is that instead of choosing a random segmentation in the initialization step, we generate all possible multi-to-multi alignment hypotheses and normalize the counts by the number of hypotheses of each transliteration pair. The segmentation alignment obtained is significantly different from that of the original joint source-channel model. Table 1 shows some examples of the M2M-JSC alignment.

Table 1: Examples of M2M Joint Source-Channel Alignment Result
English | Chinese
A/JA/X | 埃/甲/克斯
A/BA/STE/NIA | 阿/巴/斯蒂/尼亚
AHL/BERG | 阿尔/伯格

2.3 Combined system

In order to benefit from both previously described components, the M2M-JSC model is integrated into the PBSMT system as a substitute for the translation model. Figure 1 illustrates the structure of the combined system.

[Figure 1: Phrase-based Transliteration System with Joint Source Channel Model — parallel transliterated names feed the joint source-channel model, which produces the phrase table; together with the language model, the SMT decoder maps a source name to a target name.]

The M2M-JSC model is first applied to the training set to divide each source name, in parallel with the corresponding target name, into the same number of segments. These segments are then considered as words that are one-to-one aligned. The PBSMT system takes multiple segments, namely phrases, as translation units. The phrase extraction follows the heuristic that starts with the given word alignment and expands to the adjacent alignment points (Koehn et al., 2003). The translation probabilities of the extracted phrases are estimated accordingly. As the last step, we split all the segments in the translation model into characters to allow more straightforward integration into the original PBSMT system, which relies on character-based inputs.

3 Experiment setup

3.1 Preprocessing

We worked with the English data only in the uppercase form provided in the training set. The names are tokenized into characters, but we did not perform any further phonetic mapping for either language, as the phonetic mapping requires additional knowledge that was not available in the training data.

Even though it is possible to combine the training sets for both English-to-Chinese and Chinese-to-English, we restricted ourselves to the set designated for the particular direction. In other words, the Chinese-to-English training set was not included for training any of the components of our English-to-Chinese system, and vice versa.

3.2 SMT system for transliteration

3.2.1 Statistical models

Our system consists of the following major statistical components:

• An n-gram language model;
• A translation model, including two phrase translation probabilities (both directions), two lexical weightings (both directions) induced from word translation probabilities, and a phrase penalty; this model is further decomposed into phrases;
• A word penalty used to penalize longer hypotheses.

The n-gram language model is estimated using the SRILM toolkit (Stolcke, 2002). The translation model is built from the character alignments given by the M2M-JSC model, and we did not construct any distortion models.

3.2.2 Moses decoder

We used the open-source SMT decoder Moses (Koehn et al., 2007). Moses allows a log-linear model to combine various models and implements an efficient beam search algorithm that quickly finds the best translation among the large number of hypotheses. In order to adapt the SMT decoder to the transliteration task, we not only supplied the decoder with no reordering models, but also constrained the decoder in a monotone manner by setting the distortion limit to 0.
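A configuration fragment expressing a setup like the one described above might look as follows. This is a sketch under assumptions (placeholder paths, and section syntax that varies between Moses releases), not the authors' actual configuration:

```ini
# Illustrative moses.ini fragment for monotone, character-level decoding.
# Paths are placeholders; exact section syntax differs across Moses releases.

[ttable-file]
0 0 0 5 /path/to/phrase-table.gz

[lmodel-file]
0 0 5 /path/to/character.lm

# No reordering model is supplied, and reordering is disabled outright:
[distortion-limit]
0
```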
3.2.3 Parameter tuning

The system integrates all the models into a more complex discriminative model in a log-linear formulation. The weights for the individual models can be optimized on development data so that the system outputs are as close as possible to the correct candidates. Minimum error rate training (MERT) (Och, 2003) is one of the common methods for balancing features on different bases. We used Z-MERT (Zaidan, 2009) to search for the set of feature weights that maximizes the official F-score evaluation metric on the development set.

Moreover, we extracted a small development set of 500 names randomly from the official development set. The rest of the official development set served as a development test set, so we could run additional experiments on the provided data apart from our submission. The feature weights we used for our submission were obtained from the complete development set.

4 Results

We participated in the English-to-Chinese and Chinese-to-English transliteration tasks in NEWS 2011. Table 2 lists the official evaluation scores for our submissions to these two tracks. Our contrast system is the stand-alone M2M-JSC system. It is clear that the final combined system has outperformed the M2M-JSC system by around 20% for both directions.

Table 2: Official results
Task | System | ACC | Mean F | MRR | MAPref
English-to-Chinese | M2M-JSC+PBSMT | 0.320 | 0.674 | 0.397 | 0.308
English-to-Chinese | M2M-JSC | 0.260 | 0.638 | 0.340 | 0.251
Chinese-to-English | M2M-JSC+PBSMT | 0.133 | 0.746 | 0.210 | 0.133
Chinese-to-English | M2M-JSC | 0.117 | 0.731 | 0.177 | 0.117

We notice that there is a group of multi-word names in the development set that are particularly difficult for our system to transliterate correctly. Most of these names consist of parts that should be translated by their meanings instead of transliterated by their phonemes, for example, "DEMOCRATIC AND POPULAR REPUBLIC OF ALGERIA". To handle such cases, we would need to include additional recognition and translation modules that clearly require knowledge beyond the provided training data set.

5 Conclusion

We successfully participated in this year's En-Ch and Ch-En machine transliteration shared tasks. We extended the original joint source-channel model proposed by Li et al. (2004) by allowing more possible transliteration units than single characters (in Chinese) and single letters (in English). When the M2M-JSC model is integrated into a modified phrase-based SMT system, around 20% improvement is observed. In the future, we will further explore the M2M-JSC model with richer feature sets as well as the integration of other SMT approaches.

Acknowledgments

The first two authors were supported by the EuroMatrixPlus project (IST-231720) funded by the European Community under the Seventh Framework Programme for Research and Technological Development. The last author was supported by the German Excellence Cluster on Multimodal Computing and Interaction.

References

Peter Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.
Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic, June.
Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In the 42nd Annual Meeting of the Association for Computational Linguistics, pages 160–167, Barcelona, Spain, July.
Haizhou Li, Khe Chai Sim, Jin-Shea Kuo, and Minghui Dong. 2007. Semantic transliteration of personal names. In the 45th Annual Meeting of the Association for Computational Linguistics, pages 120–127, Prague, Czech Republic, June.
Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang. 2009. Report of NEWS 2009 machine transliteration shared task. In Proceedings of the 2009 Named Entities Workshop, pages 1–18, Singapore, August.
Haizhou Li, A Kumaran, Min Zhang, and Vladimir Pervouchine. 2010. Report of NEWS 2010 transliteration generation shared task. In Proceedings of the 2010 Named Entities Workshop, pages 1–11, Uppsala, Sweden, July.
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Morristown, NJ, USA.
Jong-Hoon Oh, Kiyotaka Uchimoto, and Kentaro Torisawa. 2009. Can Chinese phonemes improve machine transliteration?: A comparative study of English-to-Chinese transliteration models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 658–667, Singapore, August.
Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In the 7th International Conference on Spoken Language Processing (ICSLP) 2002, Denver, Colorado.
Omar F. Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.

Named Entity Transliteration Generation Leveraging Statistical Machine Translation Technology

Pradeep Dasigi, Computer Science Department, Columbia University in the City of New York
Mona Diab, Center for Computational Learning Systems, Columbia University in the City of New York
[email protected] [email protected]Abstract In this paper, we address the problem of gen- erating valid transliterations for proper names in Automatically identifying that different one language into some phonetic transcription orthographic variants of names are refer- (transliteration) in another language. The prob- ring to the same name is a significant chal- lem is not so bad if the two languages are pho- lenge for processing natural language pro- netically close, share a script, and there exists an cessing since they typically constitute the orthographic standard. However, if the two lan- bulk of the out-of-vocabulary tokens. The guages use different orthographic scripts and pos- problem is exacerbated when the name is sess different phonetic inventories, we are faced foreign. In this paper we address the prob- with a much more complex situation. lem of generating valid orthographic vari- We attempt to solve the problem for the latter ants for proper names, namely transliterat- case, namely for language pairs that are distant and ing proper names in different scripts. We that possess significantly different phonetic inven- attempt to solve the problem for three dif- tories. We target three language pairs: English → ferent language pairs: English → Hindi, Hindi, English → Persian, and Arabic → English. English → Persian, and Arabic → English. English uses the Latin script, Arabic uses Arabic We adopt a unified approach to the prob- script, Persian uses an extended Arabic script to lem. We frame the problem from a statis- account for 6 extra sounds over Arabic, and Hindi tical Machine Translation perspective. We uses Devanagari. We adopt a unified approach to further post edit the output applying lin- the problem for the three language pairs. We lever- guistically informed rules particular to the age a statistical Machine Translation framework language pair and re-rank the output using to address the problem. We apply linguistic ex- machine learning methods. 
pansion rules that are tailored for each language pair and transliteration direction. We view this as 1 Introduction a generation problem, and we apply some post hoc In a world of pervasive online media and glob- filtering techniques to re-rank the output. alization, we are flooded with streams of events 2 Linguistic Background where participants come from all over the world and they spell things in a myriad of ways espe- Hindi, Persian, Arabic, and English pertain to dif- cially where there are no orthographic standards. ferent language families but more importantly for The problem is exacerbated for proper names es- the task at hand, they have different phonetic in- pecially when they are foreign. There are no stan- ventories. There are shared cognates between dard spellings for such names. Accordingly ortho- Hindi, Arabic and Persian due to historical rea- graphic variants are rampant. People typically rely sons, however their sound repositories are signif- on some form of phonetic transcription or what icantly different from each other and in turn dif- is referred to as transliteration. Humans have no ferent from English. For instance, the /p/ and /v/ issue identifying variants of names as the same, sounds in Persian do not exist in Arabic, the voice- however for automatic algorithms in general and less uvular plosive /q/ and the pharyngeal /h/ in Natural Language Processing (NLP) in particular, Arabic have no real equivalents in English, the as- proper name variants constitute a large portion of pirated /b/ and /t/ in Hindi do not exist in English the out of vocabulary (OOV) phenomenon. nor in Arabic or Persian for that matter. Such dis- 106 Proceedings of the 2011 Named Entities Workshop, IJCNLP 2011, pages 106–111, Chiang Mai, Thailand, November 12, 2011. tinctions in the sound inventories result in variable eration direction. 
transcriptions, especially for a proper name in Hindi that contains one of those aspirated letters, or one with the /q/ in Arabic. For example, the Arabic name qAfy[1] has a myriad of spelling variants such as Kazafi, Qazafi, Kaddafi, Qadafy, Gaddafy, Gadaffy, etc. This is partly a result of the lack of the phonetic sound in the inventory of English, but also due to the fact that different dialects of Arabic pronounce the /q/ sound differently, which affects its foreign (in this case English) transliteration: in Egyptian Arabic, the /q/ sound is pronounced as a glottal stop, while in the Gulf it is pronounced as a /g/.

The problem is further compounded for languages such as Arabic and Persian, which have underspecified orthographies. In both languages, the short vowels and certain other phonetic markers such as consonantal gemination are underspecified in the surface orthography, except when the genre of the text is liturgical, as in the Quran or the Bible, or in pedagogical materials for language learners. However, the majority of text written in both languages lacks short vowels, which are typically expressed as diacritics. For instance, the name mHmd in Arabic, as is evident in the transliteration, is expressed using only the consonants; it corresponds to Muhammad/Mohamed/Mohamad, etc., in English. We note the presence of the short vowels 'a, u, o' in the English transliterations, as well as the gemination of the medial letter 'm'.

Different considerations apply depending on the transliteration direction: transliterating Arabic names into English is different from transliterating English names into Arabic. For instance, a name transliterated from English into Arabic should lead to a smaller set of variants than an Arabic name transliterated into English, due to the underspecification of vowels inherent in the orthography of Arabic. The name Bloomberg can be spelled blwmbyrj/blmbrj/blwmbrj, while a name such as AbdAllTyf would warrant at least the following variants in English: Abdel lateef, Abdallattif, Abdellatyff, Abd Allatif, Abd Allattyf, etc. Accordingly, in our algorithms we model each language pair specifically, bearing in mind the particularities of the transliteration direction.

3 Related Work

Automatic transliteration has been well studied and various statistical approaches have been tried, starting from the seminal work of Knight and Graehl (1998). The noisy channel model was used by Jia et al. (2009), where the problem was dealt with in a manner similar to that of Statistical Machine Translation (SMT). Further, it has been modeled as a phrase-based SMT problem in (Finch and Sumita, 2009; Finch and Sumita, 2010; Hong et al., 2009; Noeman, 2009). Finch and Sumita (2009) reported an accuracy of 0.788, an F-score of 0.969 and a Mean Reciprocal Rank of 0.788 on the English → Hindi test data in NEWS 2009. El-Kahky et al. (2011) modeled character-sequence-level alignments as bipartite graphs, and used graph reinforcement and link re-weighting to improve transliteration mining. They addressed two problems that arise from data sparsity: data coverage and erroneous translation probabilities due to ambiguous mappings. Varadarajan and Rao (2009) used Hidden Markov Models to derive substring alignments from training data and learn a weighted Finite State Transducer from these alignments. They reported an accuracy of 0.398, an F-score of 0.855 and an MRR of 0.515 on the English → Hindi test data in NEWS 2009. Noeman and Madkour (2010) proposed a language-independent technique for transliteration. They used Giza++ (Och and Ney, 2000) to model initial alignments; a Finite State Automaton (FSA) built from those alignments is then used to generate transliterations at an edit distance of at most k from the source word. Their best performing system had an F-measure of 0.915 on the English to Arabic transliteration task in NEWS 2010. In general, most of this work builds an initial alignment and then uses statistical techniques in some form to generate better transliterations, and is hence language independent. Our work differs in that it takes a more linguistically informed approach towards generating better transliterations, customizing the solutions per language pair and transliteration direction.

4 Approach and Experimental Design

In our basic approach, we model the problem as a noisy channel problem. We leverage Phrase-Based Statistical Machine Translation (SMT) technology (Zens et al., 2002). Our statistical transliteration system is implemented using Moses (Koehn et al., 2007). Each name is represented as a sentence for training, tuning and decoding. A name can be composite, comprising multiple name units, such as Michael Jackson corresponding to mAykyl jAkswn in Arabic. Each character is treated as a separate token by the system, and name boundaries are marked using special characters.

[1] We use the Arabic Buckwalter transliteration scheme to express Arabic script throughout the paper; see www.qamus.org.
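The character-level input representation just described can be sketched as follows. This is a minimal illustration (the helper name is ours; '#' is the boundary marker used here for composite names):

```python
def to_char_sentence(name, boundary="#"):
    """Render a (possibly composite) name as a space-separated
    character sequence, marking name-unit boundaries with '#'."""
    tokens = []
    for i, unit in enumerate(name.split()):
        if i > 0:
            tokens.append(boundary)  # separate name units
        tokens.extend(list(unit))
    return " ".join(tokens)

# English side and its Buckwalter-transliterated Arabic counterpart
print(to_char_sentence("michael jackson"))  # m i c h a e l # j a c k s o n
print(to_char_sentence("mAykyl jAkswn"))    # m A y k y l # j A k s w n
```

Running the helper over both sides of a name pair yields the parallel "sentences" that a phrase-based pipeline such as Giza++/Moses consumes for alignment and training.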
Accordingly, the sentence pair for the name Michael Jackson and its Arabic counterpart is represented as follows to the SMT system for training and tuning: m i c h a e l # j a c k s o n corresponding to m A y k y l # j A k s w n. Giza++ (Och and Ney, 2000) is used for building alignments between name pairs. For all the language pairs, the language scripts are represented in UTF-8 encoding.

We further improve the output of the MT system by applying language-specific post-processing techniques. The following subsections describe those techniques for each language pair. All the techniques (except that of Section 4.3.1) essentially expand the output given by our SMT system. Since the methods of expansion yield large numbers of output candidates, a filtering technique is used to distinguish the correct transliterations from the incorrect ones. We build a binary classifier that labels each candidate transliteration as correct or incorrect. We employ two features in training: first, a language model (LM) log probability for each name, estimated from the target side of the training corpus, to ensure that the generated candidate is a fluent target name; second, the string edit distance of each candidate from its nearest name obtained from direct mapping. The second feature measures how much the candidate has changed due to expansion. The filtering classifier is applied to the expanded data. The training data for the classifier is synthetically generated by expanding the candidates according to the linguistic rules: we label the original training data as correct and the expanded data as incorrect. To make sure that incorrect expansions do not overwhelm correct transliterations, we remove some incorrect candidates from the classifier's training data.

4.1 English-Hindi

4.1.1 Short vs long vowels

Hindi clearly distinguishes between short and long vowels; however, English transliterations are not necessarily consistent in faithfully expressing that distinction. For example, the English transliterations of the names amandip and parijat both have the letter 'i', but in Hindi script it represents a long vowel in the first case and a short vowel in the second. Similarly, the 'a' sounds are short in the first word and long in the second. Accordingly, the SMT output is augmented by expanding short vowels to long vowels and vice versa.

4.1.2 Initial vs medial vowels

Like other Indian scripts, vowels in Devanagari are written as diacritic symbols when they follow a consonant, and in independent form otherwise. So, when the SMT system is trained, vowels in English are aligned to both forms, and some candidates have incorrect forms of vowels. As a post-processing step, those errors are automatically corrected: diacritic symbols that occur at the beginning of names are replaced with independent vowel forms, and independent vowel forms that occur after consonants are replaced with diacritic symbols.

4.2 English-Persian

4.2.1 Vowel interchange rule

It has been observed from the output of the MT system that a common mistake is confusing the long vowels 'A' and 'w', and 'A' and 'y'. To deal with this problem, the output is augmented by adding new candidates that have an 'A' sound replaced with 'w' or 'y' and vice versa.

4.2.2 Words beginning with A

In many cases where the source word begins with the letter 'A', that sound is not transliterated by the SMT system; the transliterated candidate begins with the sound of the consonant following that letter. This is probably because the sound corresponding to the letter is dropped in cases where it occurs in name-medial positions, which is more common with words of Persian origin. Although a good language model takes into account the position of the letter in the name as well, some lower-ranked candidates in the output have this error. To deal with this, 'A' is prepended in those cases where the source word begins with 'A' and the output candidate does not begin with a vowel.

4.3 Arabic-English

4.3.1 Direct Mapping

A direct mapping of Arabic letters to their equivalent sounds in English is performed; for example, an 'm' is transliterated as an 'm'.
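A toy sketch of such a direct mapping follows: each Arabic (Buckwalter) letter maps to one or more candidate English strings, and every combination is generated as a candidate. The mapping fragment here is an illustrative assumption, not the paper's full hand-crafted table:

```python
from itertools import product

# Illustrative fragment of a Buckwalter-to-English sound mapping;
# letters with no simple English equivalent get several correspondents.
CHAR_MAP = {
    "m": ["m"],
    "g": ["gh", "g", "q"],  # ghain
    "r": ["r"],
    "b": ["b"],
    "y": ["y", "i"],
}

def direct_map_candidates(name):
    """Generate every combination of per-letter correspondents;
    unmapped letters pass through unchanged."""
    choices = [CHAR_MAP.get(ch, [ch]) for ch in name]
    return ["".join(parts) for parts in product(*choices)]

print(direct_map_candidates("mgrby"))
# ['mghrby', 'mghrbi', 'mgrby', 'mgrbi', 'mqrby', 'mqrbi']
```

The vowel expansion of Section 4.3.2 would then epenthesize short vowels between consecutive consonants to produce fuller forms such as maghrebi.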
However, some of the letters are tricky since they have no simple orthographic equivalents in English, such as the Arabic 'ain ('E') sound and the Arabic ghain ('g') sound. In these cases we opted for multiple correspondents: in the former ('E') case, we expanded to a possible null or an A sound, and for the 'g' sound we expanded to the possibilities gh, g, q. We also noted in the development and training data the existence of some dialectal replacements, indicating that the transliterations should also reflect dialectal variants, i.e. the transliteration is not constrained only to the Modern Standard Arabic (MSA) sound inventory. Hence we allowed for dialectal expansions: for example, the Arabic letter thaal ('*') was mapped to th, z, d, and the letter thaa ('v') was expanded to th, s. This mapping was devised by a native Arabic speaker. All possible sequences of sounds in English for a given Arabic name are treated as its transliteration candidates.[2] Accordingly, a name such as mgrby is translated directly as maghrebi, magrebi, magreby, maghrabi, etc.

4.3.2 Vowel Expansion

Arabic, like Persian, is underspecified for short vowels in its orthography; hence two names such as zamar and zumur will be spelled the same way, appearing as zmr in Arabic. We therefore expand the names by placing a short vowel between any two consecutive consonants, while maintaining a vowelless version for every expansion spot. We do not epenthesize a vowel at name boundaries where a name is composite and contains multiple names, such as Abw-MAzn. We also use rules such as: if two consonants are preceded by a long vowel A, w, y, then we should expect to expand with one of the 5 vowels of English.

4.3.3 Composite Names and their Internal Boundaries

In the case of composite names that have subparts, we applied the following rules:

- If a candidate has a subpart that begins with bn, only the vowels i or e are used between the two consonants. bin or ben, meaning 'son of', is frequent in Arabic names, and hence other vowels are not likely to occur between these two specific consonants.

- One common problem in this language pair is recognizing that a name may be segmented into parts when written in English; for example, AbumAzn may be transliterated in English as Abu Mazen or Abu-Mazen. To tackle this, if a candidate begins with a pattern such as Abw, AbA, Abn, ibn, or bin, a space or a hyphen is introduced after the first portion of the name.

5 Experimental Results

The official task training data was used directly for training. The official task development data was split into two equal parts, with half the data used for tuning the system and the other half for initial testing (Dev). We report results of our systems on both the Dev and the official shared task Test data. Details of the data used, their sizes and sources can be found in the Task Organizer's Whitepaper (TOW) (Zhang et al., 2011).

Table 1 contains the results of our system on English-Hindi. The metrics used, Accuracy, F-score, MRR and MAP, are described in detail in the TOW. The first set of results is the SMT output containing the top translation candidate for each source name (H-1best SMT[Dev]). H-Nbest SMT[Dev] corresponds to the output containing the 10 top-ranked transliterations per source name. H-SMT+exp[Dev] and H-SMT+exp[Test] illustrate the results after application of the two expansion rules described in Section 4.1 to the Dev and Test data, respectively. The results clearly indicate that yielding more candidates results in better performance, i.e. returning N-best results is better than returning only the top result, improving the overall Accuracy, F-score, MRR and MAP for the system as a whole. Moreover, applying our devised linguistic expansion rules significantly improves the quality of the transliterations on the Dev set on all metrics except MAP: H-SMT+exp[Dev] outperforms H-Nbest SMT[Dev]. We note a significant drop in accuracy between the Dev and Test data; however, we see an improvement in the MAP metric.[3]

Table 1 also shows three sets of results for the English-Persian task. The first set is the 10-best results from the SMT system (P-SMT[Dev]), without any expansion. P-SMT+exp[Dev] and P-SMT+exp[Test] correspond to the output on Dev and Test, respectively, as expanded using the rules described in Section 4.2.

Condition          Acc.   F-score  MRR    MAP
H-1best SMT[Dev]   0.340  0.850    0.340  0.340
H-Nbest SMT[Dev]   0.631  0.937    0.631  0.393
H-SMT+exp[Dev]     0.718  0.951    0.718  0.316
H-SMT+exp[Test]    0.387  0.860    0.516  0.387
P-SMT[Dev]         0.575  0.920    0.587  0.481
P-SMT+exp[Dev]     0.710  0.953    0.725  0.339
P-SMT+exp[Test]    0.606  0.933    0.697  0.589

Table 1: English-Hindi and English-Persian results

Condition                        Acc.   F-score  MRR    MAP
1.  DirectMap[Dev]               0.018  0.763    0.045  0.022
2.  DirectMap+vow-exp[Dev]       0.065  0.805    0.139  0.065
3.  10-best[Dev]                 0.194  0.835    0.330  0.189
4.  10-best+vow-exp[Dev]         0.226  0.847    0.361  0.188
5.  40-best[Dev]                 0.363  0.897    0.507  0.286
6.  40-best+vow-exp[Dev]         0.396  0.904    0.535  0.299
7.  40-best+vow-exp+filt[Dev]    0.375  0.898    0.512  0.288
8.  150-best[Dev]                0.559  0.941    0.677  0.426
9.  150-best+vow-exp[Dev]        0.590  0.946    0.702  0.442
10. 150-best+vow-exp+filt[Dev]   0.546  0.936    0.657  0.413
11. 150-best+vow-exp[Test]       0.526  0.928    0.628  0.386
12. 150-best+vow-exp+filt[Test]  0.519  0.927    0.612  0.383

Table 2: Arabic-English transliteration results

[2] A full listing of the transliteration mapping is available upon request.
[3] We do not have access to the Test data key answers for any of the language pairs to perform error analysis.
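Two of the reported metrics can be computed from ranked candidate lists as follows; Accuracy here is top-1 exact match and MRR is the mean reciprocal rank of the first correct candidate. This is a simplified sketch assuming a single reference per name (the shared-task evaluation described in the TOW allows multiple references):

```python
def accuracy_and_mrr(nbest_lists, references):
    """nbest_lists[i] is the ranked candidate list for source name i;
    references[i] is the single gold transliteration."""
    top1_hits, rr_sum = 0, 0.0
    for cands, ref in zip(nbest_lists, references):
        if cands and cands[0] == ref:
            top1_hits += 1  # top-1 exact match
        for rank, cand in enumerate(cands, start=1):
            if cand == ref:
                rr_sum += 1.0 / rank  # reciprocal rank of first hit
                break
    n = len(references)
    return top1_hits / n, rr_sum / n

acc, mrr = accuracy_and_mrr(
    [["kazafi", "qazafi"], ["mazen", "mazin"]],
    ["qazafi", "mazen"],
)
print(acc, mrr)  # 0.5 0.75
```

This also makes explicit why N-best conditions can only raise MRR relative to 1-best: extra candidates can add reciprocal-rank credit for names whose top candidate was wrong.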
Clearly, these rules significantly improve the quality of the transliterations on the Dev set for all metrics. We note a trend similar to the English-Hindi results, with a significant drop in Accuracy, F-score and MRR between the Dev and Test data; however, we again see an improvement in the MAP metric.

For Arabic-English, Table 2 illustrates the results of the different conditions: 1. the direct mapping described in Section 4.3.1 on Dev; 2. DirectMap with vowel expansion on Dev (DirectMap+vow-exp[Dev]); conditions 3, 5 and 8 are SMT N-best conditions on the Dev data; conditions 4, 6, 9 and 11 are N-best results with vowel expansion for the Dev and Test data; finally, conditions 7, 10 and 12 present the results after applying filtering to the output of the expanded SMT system for both Dev and Test data. We use three thresholds for N in the N-best conditions: 10, 40 and 150.

The Direct Map results are the worst performing conditions; however, we do note a relative improvement from DirectMap to DirectMap+vow-exp across all 4 metrics, indicating that vowel expansion is a good move for this language pair. Using SMT for transliteration improves significantly over Direct Mapping, as illustrated by the relative improvement of condition 3 (10-best[Dev]) over condition 2 (DirectMap+vow-exp[Dev]). Increasing the number of returned N-best results from 10 to 40 and subsequently to 150 shows significant improvement, comparing conditions 3, 5 and 8. Further applying vowel expansion shows consistent improvement in performance in conditions 4, 6 and 9. We additionally applied filtering to the resulting output; this did not yield improvements in the results, as illustrated in conditions 7 (40-best+vow-exp+filt[Dev]) and 10 (150-best+vow-exp+filt[Dev]). Nevertheless, filtering helped prune, in smart ways, the hundreds of outputs generated by the vowel expansion step. In fact, we note that on the Test data the difference between conditions 11 (150-best+vow-exp[Test]) and 12 (150-best+vow-exp+filt[Test]) is not that significant, though 11 yields higher results.

6 Discussion

The impact of each approach taken for English-Arabic transliteration can be seen from the example of >bAbTyn. When the direct mapping technique is used, one of the best transliterations is Ababtyn. When expansions are applied, it becomes Aba Batyn. The SMT system produces Ababatin, and after expansion it becomes Abaa Bateen, which is in the reference list, although not in the first few ranks. Filtering this list reduced its size from 39 to 5 and removed incorrect names like Ababwotyn and Ababoutyn.

The English-Hindi system has specific limitations. Words like Gertrude and Canada are generally not transliterated correctly into Hindi. This may be because of the high number of names of Indian origin in the training data. Hindi names almost always have a one-to-one letter-to-sound matching, and the same holds when they are transliterated to English. So a word of foreign origin, with letters that do not take their most common pronunciation, is a challenge for this approach. This may be resolved by filtering out words that are not of Indian origin and treating them separately.

7 Conclusions and Future Directions

We showed that phrase-based SMT systems can be useful for the problem of NE transliteration, and that with the application of linguistic rules as a post-processing step the performance can be significantly improved. For the English-Persian and English-Hindi tasks, direct application of such rules improved the performance of the systems significantly. The Arabic-English task, however, proved to be a different and more complex problem, because the transliteration direction is from a highly underspecified orthography (Arabic) to a more phonetically specified one. We showed that this problem can be handled by a vowel expansion technique applied to the SMT output. Applying a filtering technique using a classifier proved to be an effective method of eliminating incorrect candidates in the expanded output without significantly affecting the performance of the system. In the future, we plan to apply these approaches to larger data sets and more language pairs in various transliteration directions.

References

Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-Based Statistical Machine Translation. In KI '02: Proceedings of the 25th Annual German Conference on AI: Advances in Artificial Intelligence.

Andrew Finch and Eiichiro Sumita. 2009. Transliteration by Bidirectional Statistical Machine Translation. In Named Entities Workshop 2009, ACL-IJCNLP 2009.

Andrew Finch and Eiichiro Sumita. 2010. Transliteration using a Phrase-based Statistical Machine Translation System to Re-score the Output of a Joint Multigram Model. In Named Entities Workshop 2010, ACL 2010.

Gumwon Hong, Min-Jeong Kim, Do-Gil Lee, and Hae-Chang Rim. 2009. A Hybrid Approach for English-Korean Name Transliteration. In Named Entities Workshop 2009, ACL-IJCNLP 2009.

Min Zhang, A Kumaran, and Haizhou Li. 2011. NEWS 2011 Shared Task on Machine Transliteration Whitepaper. http://translit.i2r.a-star.edu.sg/news2011/news2011whitepaper.pdf

Ali El-Kahky, Kareem Darwish, Ahmed Saad Aldein, Mohamed Abd El-Wahab, Ahmed Hefny, and Waleed Ammar. 2011. Improved Transliteration Mining Using Graph Reinforcement. In Empirical Methods in Natural Language Processing (EMNLP) 2011.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics, 24(4), December 1998.

Sara Noeman and Amgad Madkour. 2010. Language Independent Transliteration Mining System Using Finite State Automata Framework. In Named Entities Workshop 2010.

Franz Josef Och and Hermann Ney. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 440-447, Hong Kong, China, October 2000.

Jia Yuxiang, Zhu Danqing, and Yu Shiwen. 2009. A Noisy Channel Model for Grapheme-based Machine Transliteration. In Named Entities Workshop 2009, ACL-IJCNLP 2009.

Sara Noeman. 2009. Language Independent Transliteration System Using Phrase Based SMT Approach on Substrings. In Named Entities Workshop 2009, ACL-IJCNLP 2009.

Balakrishnan Varadarajan and Delip Rao. 2009. ε-extension Hidden Markov Models and Weighted Transducers for Machine Transliteration. In Named Entities Workshop 2009.