WMT16 Quality Estimation Shared Task Training and Development Data

Training and development data for the WMT16 QE task. Test data will be published as a separate item.

This shared task will build on its previous four editions to further examine automatic methods for estimating the quality of machine translation output at run-time, without relying on reference translations. We include word-level, sentence-level and document-level estimation. The sentence- and word-level tasks will explore a large dataset produced from post-editions by professional translators (as opposed to the crowdsourced translations used in the previous year). For the first time, the data will be domain-specific (IT domain). The document-level task will use, for the first time, entire documents, which have been indirectly human-annotated for quality in two ways: through reading comprehension tests and through a two-stage post-editing exercise. Our tasks have the following goals:

  • To advance work on sentence- and word-level quality estimation by providing domain-specific, larger and professionally annotated datasets.
  • To study the utility of detailed information logged during post-editing (time, keystrokes, actual edits) for different levels of prediction.
  • To analyse the effectiveness of different types of quality labels provided by humans for longer texts in document-level prediction.

This year's shared task provides new training and test datasets for all tasks, and allows participants to explore any additional data and resources deemed relevant. An in-house MT system was used to produce translations for the sentence- and word-level tasks, and multiple MT systems were used to produce translations for the document-level task. Therefore, MT system-dependent information will be made available where possible.

Identifier
PID http://hdl.handle.net/11372/LRT-1646
Related Identifier http://hdl.handle.net/11372/LRT-1631
Related Identifier http://hdl.handle.net/11372/LRT-1974
Related Identifier http://www.statmt.org/wmt16/quality-estimation-task.html
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11372/LRT-1646
Provenance
Creator Specia, Lucia; Logacheva, Varvara; Scarton, Carolina
Publisher University of Sheffield
Publication Year 2016
Rights AGREEMENT ON THE USE OF DATA IN QT21; https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21; PUB
OpenAccess true
Contact University of Sheffield
Representation
Language English; German
Resource Type corpus
Format application/x-gzip; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline Linguistics