Dataset - B2FIND

Turkish-English parallel corpus MaCoCu-tr-en 1.0

The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...

Bulgarian-English parallel corpus MaCoCu-bg-en 1.0

The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other...

Finnish-English parallel corpus fienWaC 1.0

The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...

Parallel Corpus (EN-LT) of EUR-Lex Documents That Include Terms with the Adje...

Bilingual parallel corpus of the EU English documents containing terms with the adjective 'green' and their Lithuanian translations. The size of the corpus is 4,447,683 words in...

Post-edited and error annotated machine translation corpus PE²rr 1.0

The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their...

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS

English-Lithuanian parallel corpus DVITAS includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. The corpus was...

TED-ELH Parallel Corpus

The corpus contains parallel, aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It consists of spoken-language data.

English-French-Lithuanian Parallel Corpus of EU Financial Documents

The corpus comprises 154 EU legislative documents (English documents and their translations into French and Lithuanian) related to various financial issues and enacted in...

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

EnTam is a sentence-aligned English-Tamil bilingual corpus that we collected from publicly available websites for NLP research involving Tamil. The standard set...

CsEnVi Pairwise Parallel Corpora

CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:...

UFAL Parallel Corpus of North Levantine 1.0

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the...

English-Urdu Religious Parallel Corpus

The English-Urdu parallel corpus is a collection of religious texts (Quran, Bible) in English and Urdu with sentence alignments. The corpus can be used for experiments with...

OdiEnCorp 2.0

We have collected English-Odia parallel data for the purposes of NLP research on the Odia language. The data for the parallel corpus was extracted from existing parallel...

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Czech-English Parallel Corpus 1.0 (CzEng 1.0)

CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), freely available for...

Czech-Slovak Parallel Corpus

Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...

PAWS

PAWS is a multi-lingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...

KonText Web Demo

An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš...

Czech and English abstracts of ÚFAL papers (2022-11-11)

This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics,...

LongEval Train Collection

The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

84 datasets found