-
EPIC-EuroParl-UdS: A GPT-2 and NMT Surprisal-Annotated Corpus for Translation...
EPIC-EuroParl-UdS is a bidirectional document- and sentence-aligned English–German corpus of European Parliament debates (up to mid-July 2018). It includes the official written... -
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple,... -
X-SRL Dataset and mBERT Word Aligner
This code contains a method to automatically align words from parallel sentences by using multilingual BERT pre-trained embeddings. This can be used to transfer source... -
A Gold Standard Word Alignment for English-Swedish
A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from English and Swedish versions of Europarl v. 2.... -
A Gold Standard Word Alignment for English-Swedish (2015-10-12)
A Gold Standard Word Alignment for English-Swedish (GES) is a resource containing 1164 manually word-aligned sentence pairs from English and Swedish versions of Europarl v. 2. -
Czech-English Manual Word Alignment
Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.
