Translation Task - ACL 2019 Fourth Conference on Machine Translation (WMT19)
Shared Task: Machine Translation of News
The recurring translation task of the WMT workshops focuses on news text and (mostly) European language pairs. For this year the language pairs are:
Chinese-English
Czech-English (English-to-Czech only this year)
Finnish-English
German-English
Gujarati-English
Kazakh-English
Lithuanian-English
Russian-English
German-Czech (only into Czech, only unsupervised MT / without parallel data)
French-German (NEW; topic: EU elections)
We provide parallel corpora for all languages as training data, together with additional resources for download.
GOALS
The goals of the shared translation task are:
To investigate the applicability of current MT techniques when translating into languages other than English
To examine special challenges in translating between European languages, including word order differences and morphology
To investigate the translation of low-resource, morphologically rich languages
To create publicly available corpora for machine translation and machine translation evaluation
To generate up-to-date performance numbers in order to provide a basis of comparison in future research
To offer newcomers a smooth start with hands-on experience in state-of-the-art statistical machine translation methods
To investigate the usefulness of multilingual and third language resources
To compare unsupervised MT in a controlled environment
To assess the effectiveness of document-level approaches
We hope that both beginners and established research groups will participate in this task.
IMPORTANT DATES
Release of training data for shared tasks: by 31 January 2019
Test suite source texts due: 24 March 2019
Test data released: 8 April 2019
Translation submission deadline: 16 April 2019 (10am UK time)
Translated test suites returned to test suite authors: 26 April 2019
Start of manual evaluation: 29 April 2019
End of manual evaluation: 27 May 2019
TASK DESCRIPTION
We provide training data for all language pairs, and a common
framework. The task is to improve current methods. We encourage
broad participation -- if you feel that your method is interesting
but not state-of-the-art, please participate in order to disseminate
it and to measure progress.
Participants will use their systems to translate a test set of unseen
sentences in the source language. The translation quality is measured by
a manual evaluation and various automatic evaluation metrics.
Participants agree to contribute about eight hours of work to the
manual evaluation per system submission.
You may participate in any or all of the language pairs listed above.
For all language pairs we will test translation in both directions. To
have a common framework that allows for comparable results, and also to
lower the barrier to entry, we provide a common training set, and a
pre-processed version (TBC). You are not limited to this training set, and you
are not limited to the training set provided for your target language pair. This means
that multilingual systems are allowed, and classed as constrained as long as they
use only data released for WMT19 (or older WMT Hindi-English and Turkish-English corpora,
as listed below).
If you use additional training data (not provided by the WMT19 organisers)
or existing translation systems, you
must flag that your system uses additional data. We will distinguish
system submissions that used the provided training data (constrained)
from submissions that used significant additional data resources. Note
that basic linguistic tools such as taggers, parsers, or morphological
analyzers are allowed in the constrained condition.
Your submission report should highlight in which ways your own methods
and data differ from the standard task. You should make it clear which
tools you used, and which training sets you used.
The following two aspects of the task are new for 2019: Unsupervised learning and Document-level MT.
Unsupervised learning
For 2019, we also have an unsupervised subtrack:
German to Czech
translations, using monolingual German and Czech training data only, as well as last year's parallel dev and test sets for bootstrapping. The training data should come from the constrained monolingual sets of WMT news translation data.
No German-Czech parallel data is provided, and the participants cannot use any monolingual or parallel data for other languages and language pairs (thus zero-shot, transfer-learning and pivoting-based systems will be treated as part of the general news translation track).
Document-level MT
In 2019, we are particularly interested in approaches that consider the whole document. We invite submissions of such approaches for English to German and Czech, and for Chinese to English. We will perform document-level human evaluation for these pairs.
For English to German, we will be releasing as much of the training data as possible with document boundaries intact.
For English to Czech,
CzEng 1.7
(unchanged from last year) already offers cross-sentential context for most of its "domains". No complete documents are available, but all sentences in a "block" (i.e. those with the same "-bNUM-" number in the ID, e.g. subtitlesM-b15-00train-f000001-s*) form a consecutive sequence in the original text. Sometimes the block is very short (just 1 sentence), and it is always limited to 13 or 15 sentences. No context information is available for the domains "techdoc", "navajo" and "tweets". The best context-aware domains are "news", "eu", "subtitles*" (well, subtitles) and "fiction".
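As a sketch, the "-bNUM-" markers described above can be used to recover context blocks from consecutive sentence IDs; the exact ID layout here is an assumption based on the single example given:

```python
import re
from itertools import groupby

def block_key(sent_id):
    """Extract the '-bNUM-' block number that marks a consecutive
    span of sentences in the original text (None if absent)."""
    m = re.search(r"-b(\d+)-", sent_id)
    return int(m.group(1)) if m else None

# Hypothetical IDs following the pattern of the example above.
ids = [
    "subtitlesM-b15-00train-f000001-s1",
    "subtitlesM-b15-00train-f000001-s2",
    "subtitlesM-b16-00train-f000002-s1",
]

# Consecutive sentences sharing a block number form one context block.
blocks = [list(group) for _, group in groupby(ids, key=block_key)]
```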
Additional Test Suites Linked to News Translation Task
At no additional burden to the News Translation Task participants
(aside from having to translate much larger input data), we will again collectively
provide a deeper analysis of various qualities of the translations.
See the corresponding section of Findings 2018 for inspiration.
See
WMT19 Test Suites Google Document
for more details.
System developers may want to learn in advance what their systems will be tested on.
Everyone is welcome to contribute additional test suites.
Authors of additional test suites will be invited to report on their evaluation method
and its results in a separate paper.
DATA
LICENSING OF DATA
The data released for the WMT19 news translation task can be freely used for research purposes; we just ask that you cite the WMT19 shared
task overview paper and respect any additional citation requirements on the individual data sets. For other uses
of the data, you should consult with the original owners of the data sets.
TRAINING DATA
We aim to use publicly available sources of data wherever possible. Our main sources of training data are
the
Europarl corpus
, the
UN corpus
, the news-commentary corpus and the
ParaCrawl
corpus. We also release a monolingual
News Crawl
corpus. Other language-specific corpora will be made available.
We have added suitable additional training data to some of the language pairs.
You may also use the following monolingual corpora released by the LDC:
LDC2011T07: English Gigaword Fifth Edition
LDC2009T13: English Gigaword Fourth Edition
LDC2007T07: English Gigaword Third Edition
LDC2009T27: Chinese Gigaword Fourth Edition
Note that the released data is not tokenized and includes sentences of
any length (including empty sentences). All data is in Unicode (UTF-8)
format. The following Moses tools allow the processing of the training data
into tokenized format:
Tokenizer: tokenizer.perl
Detokenizer: detokenizer.perl
Lowercaser: lowercase.perl
SGML Wrapper: wrap-xml.perl
These tools are available in the Moses git repository.
DEVELOPMENT DATA
To evaluate your system during development, we suggest using the
2018 test set. The data is provided in raw text format and in an
SGML format that suits the NIST scoring tool. We also release other
dev and test sets from previous years. For the new language pairs, we release dev sets in January,
prepared in the same way as the test sets.
Dev and test sets from 2008 through 2018 are available for CS-EN, DE-EN, FI-EN, GU-EN, KK-EN, LT-EN, RU-EN, ZH-EN and FR-DE; coverage varies by language pair and year.
The 2019 test sets will be created from a sample of online newspapers from September-November 2018. For the established languages
(i.e. English to/from Chinese, Czech, German, Finnish and Russian) the English-X and X-English test sets will be distinct, and only
consist of documents created originally in the source language. For the new languages (i.e. English to/from
Gujarati, Kazakh and Lithuanian) the test sets include 50% English-X translation and 50% X-English translation. In recent
previous tasks, all the test data was created using the latter method.
We have released development data for the tasks that are new this year. It is created in the same way as the test set and
included in the development tarball.
The news-test2011 set has three additional Czech translations that you may want to use. You can download them
from Charles University.
Parallel data:
(Availability varies across CS-EN, DE-EN, FI-EN, GU-EN, KK-EN, LT-EN, RU-EN, ZH-EN and FR-DE.)

Europarl v9: New. Re-extracted to include document boundaries. (europarl-v7 for fr-de.)
ParaCrawl v3: New version for 2019 (except en-ru). Please use the bicleaner filtered version.
Common Crawl corpus: Same as last year; new for fr-de.
News Commentary v14: Updated, and now with document boundaries. NB: for the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.
CzEng 1.7: Register and download CzEng 1.7. (Cross-sentential context available for some domains.)
Yandex Corpus: ru-en.
Wiki Titles v1: New release for 2019.
UN Parallel Corpus V1.0: Register and download.
Rapid corpus of EU press releases: This is part of the Tilde Model Corpus.
Document-split Rapid corpus: New. A recrawled version of the Rapid corpus, with document boundaries intact. Also prepared by Tilde.
CWMT Corpus
The only gu-en corpus listed above is Wiki Titles. In addition, we propose the following data sets, and
specifically encourage unconstrained submissions (i.e. bring your own data).
The
Bible Corpus
, extracted from data available
here
Localisation data extracted from OPUS, consisting mainly of open-source software localisation data.
The
Emille Corpus
, available from
ELRA, free for academic use. The Emille corpus is not actually parallel, but does contain some parallel text.
Some
small corpora
which seem to be only available to Indian citizens
Crowd-sourced
bilingual dictionaries
collected by Ellie Pavlick and collaborators
The
HindEnCorp
created by CUNI, or the
larger one
created by IIT Bombay for the Workshop on Asian Language Translation shared task. If pivoting through Hindi is feasible, then these would be useful.
A parallel corpus extracted from Wikipedia
and contributed by Alexander Molchanov of PROMT.
A crawled corpus
produced for this task. It is very noisy, but contains some parallel data. A
cleaned version
is also available, cleaned using language detection and simple length heuristics. We recommend that you either use the cleaned version or apply your own cleaning to the raw version.
You can use the Wikipedia en and gu dumps as a comparable corpus.
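A minimal sketch of the kind of length-based cleaning mentioned above (the released cleaned version also used language detection; the thresholds here are illustrative assumptions, not the organisers' actual settings):

```python
def keep_pair(src, tgt, min_len=1, max_len=100, max_ratio=2.5):
    """Keep a sentence pair only if both sides have a sane token count
    and the source/target length ratio is plausible for a translation."""
    s_len, t_len = len(src.split()), len(tgt.split())
    if not (min_len <= s_len <= max_len and min_len <= t_len <= max_len):
        return False
    return max(s_len, t_len) / min(s_len, t_len) <= max_ratio

# Toy examples: one plausible pair, one with a wildly mismatched length.
pairs = [
    ("this looks like a plausible pair", "a target of similar length here"),
    ("short", "a very long target sentence that is clearly not a faithful translation of the tiny source"),
]
cleaned = [p for p in pairs if keep_pair(*p)]
```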
Additional training data for Kazakh-English
In addition to the Wiki Titles and news-commentary corpora above, we provide:
An
English-Kazakh crawled corpus
of about 100k sentences, prepared by Bagdat Myrzakhmetov of Nazarbayev University. The corpus is distributed as a tsv file with the original URLs included, as well as an alignment score.
A crawled Russian-Kazakh corpus
of about 5M sentences, also prepared by Bagdat Myrzakhmetov.
An additional English-Kazakh crawled corpus
of about 500k sentences, also prepared by Bagdat Myrzakhmetov.
NB: this was not part of the task training data.
We created a -wmt19 version of the news-commentary corpus, which has the dev set removed.
You may also use any of the data previously released for the English-Turkish task.
Monolingual training data:
(Availability varies across cs, de, en, fi, gu, kk, lt, ru, zh and fr.)

News crawl: Updated. Large corpora of crawled news, collected since 2007. Versions up to 2017 are as before, except that they are re-filtered and re-shuffled. For de and en, document-split versions are available.
News discussions: Updated. Corpora crawled from the comment sections of online newspapers. Available in English and French.
Europarl: Monolingual version of the European Parliament crawl. Superset of the parallel version. (europarl-v7 for fr.)
News Commentary: Updated. Monolingual text from the news-commentary crawl. Superset of the parallel version. Use v14. NB: for the kk-en task, we include part of this data in the dev set, and have created -wmt19 versions of the corpora, which have the dev set removed.
Common Crawl: Deduplicated, with development and evaluation sentences removed. English was updated 31 January 2016 to remove bad UTF-8. Downloads can be verified with SHA512 checksums. More English is available for unconstrained participants.
Wiki dumps: New. Monolingual text from Wikipedia, extracted using WikiExtractor.
Development sets
Test sets
Test sets (including additional test suites)
PREPROCESSED DATA
We will provide preprocessed versions of all training and development data (by mid-February). These are preprocessed with standard Moses tools and ready for use in MT training. This preprocessed data
is distributed with the intention that it will be useful as a standard data set for future research.
The preprocessed data can be obtained here.
TEST SET SUBMISSION
To submit your results, please first convert them into SGML format as
required by the NIST BLEU scorer, and then upload them to the website
matrix.statmt.org.
For Chinese output, you should submit unsegmented text, since our primary measure is human evaluation. For automatic scoring (in the matrix)
we use BLEU4 computed on characters, scoring with v1.3 of the NIST scorer only. A script to convert a Chinese SGM file to characters can be found
here
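The conversion amounts to splitting the Chinese text into space-separated characters so that word-level BLEU4 effectively scores at the character level. A rough sketch of the idea (the organisers' script operates on full SGM files and may differ in detail):

```python
def to_characters(text):
    """Drop existing whitespace and emit one character per token,
    so a word-level BLEU scorer sees characters as 'words'."""
    return " ".join(ch for ch in text if not ch.isspace())

# A toy hypothesis with mixed segmentation.
hyp = "机器 翻译"
chars = to_characters(hyp)
```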
SGML Format
Each submitted file has to be in a format that is used by standard
scoring scripts such as NIST BLEU or TER.
This format is similar to the one used in the source test set files that
were released, except for:
The first line is the opening tstset tag, with trglang set to either
en, de, fr, es, cs or ru. Important: srclang is
always
set to "any".
Each document tag also has to include the system name, e.g. sysid="uedin".
The closing tag (last line) is the matching closing tstset tag.
The script
wrap-xml.perl
makes the conversion
of an output file in one-segment-per-line format into the required SGML
file very easy:
Format:
wrap-xml.perl LANGUAGE SRC_SGML_FILE SYSTEM_NAME < IN > OUT
Example:
wrap-xml.perl en newstest2019-src.de.sgm Google < decoder-output > decoder-output.sgm
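For illustration only, the rough shape of the SGML that the scorer expects looks like the sketch below; attribute details such as setid and docid are assumptions here, so check the actual output of wrap-xml.perl for the authoritative format:

```python
def wrap_sgml(segments, trglang="en", sysid="uedin", docid="doc.1",
              setid="newstest2019"):
    """Wrap one-segment-per-line output in NIST-style SGML (a sketch)."""
    lines = ['<tstset trglang="%s" setid="%s" srclang="any">' % (trglang, setid)]
    lines.append('<doc sysid="%s" docid="%s">' % (sysid, docid))
    for i, seg in enumerate(segments, start=1):
        lines.append('<seg id="%d">%s</seg>' % (i, seg))
    lines += ["</doc>", "</tstset>"]
    return "\n".join(lines)

sgml = wrap_sgml(["First translated sentence.", "Second one."])
```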
Upload to Website
Upload happens in three easy steps:
Go to the website
matrix.statmt.org
Create an account under the menu item Account -> Create Account.
Go to Account -> upload/edit content, and follow the link "Submit a system run"
select as test set "newstest2019" and the language pair you are submitting
select "create new system"
click "continue"
on the next page, upload your file and add some description
You can use the matrix to list all your systems, and edit the metadata. This is important since after the test week ends, you need
to decide which are your primary systems (that get included in the human evaluation, and the overview paper) and to ensure that
you are happy with the system naming.
To access your system list, log in and select Account -> my current systems. You should see a list of all your systems, along with their
metadata, and an edit button. Some instructions are included on this screen.
EVALUATION
Evaluation will be done both automatically and by human judgement.
Manual Scoring: We will collect subjective judgments about translation
quality from human annotators. If you participate in the shared task,
we ask you to perform a defined amount of evaluation per language pair
submitted. The amount of manual evaluation will be approximately 8 hours.
As in previous years, we expect the translated submissions to be in
recased, detokenized, XML format, just as in most other translation
campaigns (NIST, TC-Star).
ACKNOWLEDGEMENTS
This task would not have been possible without the sponsorship of test sets from Microsoft, Yandex, Tilde, LinguaCustodia,
the University of Helsinki, Charles University Prague and Le Mans University, and funding
from the European Union's Horizon 2020 research
and innovation programme under grant agreement
825299 (GOURMET) and from the EU CHIST-ERA M2CR project.