Statistical and Neural Machine Translation
This website contains resources for research in statistical and neural machine
translation, i.e. the translation of text from one human language to
another by a computer that learned how to translate from vast amounts
of translated text.
Events
Conference on machine translation:
2022
2021
2020
2019
2018
2017
2016
Workshop on machine translation:
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
Workshop on building and using parallel text 2015
Machine Translation Marathon:
2022
2019
2018
2017
2016
2015
2014
2013
2012
2011b
2011a
2010
2009
2008
2007
Machine Translation Marathon of the Americas:
2022
2019
2018
2017
2016
2015
Resources
Textbook: Neural Machine Translation
(2020)
Textbook: Statistical Machine Translation
(2010)
Moses
statistical machine translation toolkit
Machine Translation Research Survey Wiki
European Parliament Proceedings Parallel Corpus (Europarl)
1 Billion Word Language Model Benchmark
News Commentary
N-gram counts and language models from the CommonCrawl (2014)
SIGIR 2020 Tutorial: Searching the Web for Cross-lingual Web Data
Data for "On the Impact of Various Types of Noise on Neural Machine Translation" (2018)
Early Release of Parallel Data of Paracrawl (2016)
Benchmark data for "Paracrawl: Web-Scale Acquisition of Parallel Corpora" (2020)
Code and data for "Simulated Multiple Reference Training (SMRT) Improves Low-Resource Machine Translation" (2020)
Parallel Named Entity Corpus for "XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment" (2021)
Data for "Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings" (2017)
Data for experiments on context-aware neural machine translation (2018)
CC-100: Monolingual data used to train XLM-R extracted from CommonCrawl (2020)
CC-Matrix
Translation Service Containers for the European Language Grid
Monolingual News Crawl used for WMT
Monolingual News Discussions used for WMT 2020
Data for "PMIndia - A Collection of Parallel Corpora of Languages of India" (2020)
PRISM: Data for "Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing" (2020)
Wikititles used for WMT
University of Edinburgh's models from WMT
2020
2019
2017
2016
Data resources for WMT
2022
2021
2020
2019
2018
2017
2016
2015
2013
CC-Aligned: A Massive Collection of Cross-lingual Web-Document Pairs (2020)
Resources for the paper "When Does Unsupervised Machine Translation Work?" (Marchisio et al., 2020)
Wiki of the Machine Translation Research Group at Johns Hopkins University
External Historic Links: Introduction to Statistical MT Research
The Mathematics of Statistical Machine Translation
by Brown, Della Pietra, Della Pietra, and Mercer
Statistical MT Handbook
by Kevin Knight
SMT Tutorial (2003)
by Kevin Knight and Philipp Koehn
ESSLLI Summer Course on SMT (2005), day 1
by Chris Callison-Burch and Philipp Koehn
MT Archive
by John Hutchins, electronic repository and bibliography of articles, books and papers on topics in machine translation and computer-based translation tools
External Historic Software
Giza++, a training tool for IBM Models 1-5 (version for gcc-4)
Moses, a complete SMT system
UCAM-SMT, the Cambridge Statistical Machine Translation system
Phrasal, a toolkit for phrase-based SMT
cdec, a decoder for syntax-based SMT
Joshua, a decoder for syntax-based SMT
Jane, a decoder for syntax-based SMT
Pharaoh, a decoder for phrase-based SMT
Rewrite, a decoder for IBM Model 4
BLEU scoring tool for machine translation evaluation
External Parallel Corpora
OPUS ... the open parallel corpus
LDC
Linguistic Data Consortium
Canadian Hansards
Global Voices
Acquis Communautaire
ELRA
maintained by Philipp Koehn