WMT16 Tuning Shared Task Models (Czech-to-English)

Dataset

This item contains the models to be tuned in the WMT16 Tuning shared task for Czech-to-English translation.

The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. Before training, the data is tokenized with the Moses tokenizer and lowercased, and sentences longer than 60 words or shorter than 4 words are removed. Word alignment is done with fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
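The lowercasing and length filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function names are hypothetical, the input is assumed to be already tokenized (so splitting on whitespace yields the words), and applying the limits to both sides of each sentence pair is an assumption, since the description does not say how the two languages are handled.

```python
MIN_WORDS = 4   # sentences shorter than this are dropped
MAX_WORDS = 60  # sentences longer than this are dropped


def keep_pair(src, tgt, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Return True if both sides fall within the length limits.

    Assumption: the limits apply to the source and the target side alike.
    """
    return all(min_words <= len(side.split()) <= max_words for side in (src, tgt))


def preprocess(pairs):
    """Lowercase already-tokenized sentence pairs and drop those outside the limits."""
    for src, tgt in pairs:
        src, tgt = src.lower(), tgt.lower()
        if keep_pair(src, tgt):
            yield src, tgt
```

For example, `list(preprocess([("Ano .", "Yes .")]))` yields an empty list, because both sides have fewer than 4 tokens.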

Two 5-gram language models are trained with KenLM: one on the CzEng English data only, and the other on all English monolingual data available for WMT except Common Crawl.

Also included are two lexicalized bidirectional reordering models, one word-based and one hierarchical, both using msd (monotone, swap, discontinuous) orientation conditioned on both source and target and trained on the processed CzEng data.
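To make the msd (monotone, swap, discontinuous) orientation classes concrete, the sketch below classifies a phrase pair relative to the preceding target word, in the spirit of a word-based model. It is a simplified illustration under stated assumptions, not the Moses extraction code: the function name and the representation of the alignment as a set of (source index, target index) pairs are hypothetical, and only the orientation with respect to the previous phrase is shown, whereas a bidirectional model also considers the following phrase.

```python
def orientation(alignment, src_start, src_end, tgt_start):
    """Classify a phrase pair's orientation relative to the previous target word.

    alignment: set of (source_index, target_index) word-alignment points.
    The phrase covers source words src_start..src_end (inclusive) and begins
    at target word tgt_start. All indices are 0-based.
    """
    # Assumption: a phrase at the very start of both sentences counts as monotone.
    at_start = src_start == 0 and tgt_start == 0
    # Alignment point diagonally before the phrase (previous source, previous target).
    top_left = at_start or (src_start - 1, tgt_start - 1) in alignment
    # Alignment point linking the previous target word to the source word after the phrase.
    top_right = (src_end + 1, tgt_start - 1) in alignment
    if top_left and not top_right:
        return "monotone"
    if top_right and not top_left:
        return "swap"
    return "discontinuous"
```

With the alignment `{(0, 0), (1, 2), (2, 1)}`, the single-word phrase at source position 1 and target position 2 is classified as a swap, because the previous target word is aligned to the source word that follows the phrase.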

Identifier
PID	http://hdl.handle.net/11372/LRT-1671
Related Identifier	http://www.statmt.org/wmt16/tuning-task/
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11372/LRT-1671

Provenance
Creator	Kamran, Amir; Jawaid, Bushra; Bojar, Ondřej; Stanojevic, Milos
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL); University of Amsterdam, ILLC
Publication Year	2016
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/645452
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Czech; English
Resource Type	corpus
Format	application/x-gzip; application/octet-stream; text/plain; charset=utf-8; downloadable_files_count: 5
Discipline	Linguistics