Dataset - B2FIND

Parallel sense-annotated corpus ELEXIS-WSD 2.0

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 2.0 contains subcorpora with...

Parallel sense-annotated corpus ELEXIS-WSD 1.3

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10...

Slovenian translation corpus Spook 1.1

The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The...

The MultiplEYE Text Corpus Data and Materials

Data and materials for the 39 language versions of the MultiplEYE Text Corpus pertaining to Kaspere, Bondar, Nisioi, Stegenwallner-Schütz et al. (2026). Text Corpus: Towards a...

Lithuanian-English Parallel Cybersecurity Corpus – DVITAS

Lithuanian-English Parallel Cybersecurity Corpus consists of official cybersecurity documents of the Republic of Lithuania and their English translations, dating from 2014 to...

English-Lithuanian Parallel Migration Corpus

English-Lithuanian Parallel Migration Corpus includes original English texts and their Lithuanian translations, aligned at the sentence level. The texts are drawn from EU legal...

Parallel Corpus (EN-LT-DA) of General Data Protection Regulation (ELEXIS)

Trilingual parallel corpus on general data protection regulation. The size of the corpus is 54,468 words in English, 42,566 words in Lithuanian, and 47,740 words in Danish. The...

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Czech and English abstracts of ÚFAL papers

This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...

Synthetic part of CzEng 2.0

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...

IDENTICv1.0

IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide...

Hunglish Corpus

Bilingual written general; 2 million sentences

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

EnTam is a sentence-aligned English-Tamil bilingual corpus that we have collected from publicly available websites for NLP research involving Tamil. The standard set...

CsEnVi Pairwise Parallel Corpora

CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:...

LongEval Test Collection

The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

Czech-Slovak Parallel Corpus

Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...

IDENTICv1.0-raw

Raw Text

WMT 13 Test Set

We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...

Multilingual corpus of juridical texts

International conventions and treaties arranged as a parallel corpus aligned at the paragraph level

OdiEnCorp 2.0

We have collected English-Odia parallel data for the purposes of NLP research of the Odia language. The data for the parallel corpus was extracted from existing parallel...

91 datasets found