85 datasets found

Keywords: parallel corpus

  • UFAL Parallel Corpus of North Levantine 1.0

    This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the...
  • CzEng 0.7

    CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual...
  • FAUST 0.5

    Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test...
  • KonText Web Demo

    An interactive web demo for querying selected ÚFAL and LINDAT corpora. LINDAT/CLARIN KonText is a fork of ÚČNK KonText (https://github.com/czcorpus/kontext, maintained by Tomáš...
  • ParaCrawl Corpus version 1.0

    The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of...
  • Additional German-Czech reference translations of the WMT'11 test set

    Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation...
  • LongEval Train Collection

    The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...
  • Hindi Visual Genome 1.0

    Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...
  • Czech-English Manual Word Alignment

    Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.
  • PAWS

    PAWS is a multi-lingual parallel treebank with coreference annotation. It consists of English texts from the Wall Street Journal translated into Czech, Russian and Polish. In...
  • Covert translation: popular science

    Translation corpora of original texts with translations and comparable texts from the genre popular scientific prose. Translation and comparable corpus with authentic...
  • Covert translation: Business Communication (old)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication. CLARIN Metadata summary for Covert translation:...
  • Covert translation: Business Communication (old)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication.
  • Covert translation: popular science

    Translation corpora of original texts with translations and comparable texts from the genre popular scientific prose. Translation and comparable corpus with authentic...
  • Covert translation: Business Communication (new)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication. Translation and comparable corpus with...
  • Covert translation: Business Communication (new)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication. Translation and comparable corpus with authentic...
  • Icelandic-English parallel corpus MaCoCu-is-en 2.0

    The Icelandic-English parallel corpus MaCoCu-is-en 2.0 was built by crawling the “.is” internet top-level domain in 2021, extending the crawl dynamically to other domains as...
  • Turkish-English parallel corpus MaCoCu-tr-en 1.0

    The Turkish-English parallel corpus MaCoCu-tr-en 1.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other...
  • Finnish-English parallel corpus fienWaC 1.0

    The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor...
  • TED-ELH Parallel Corpus (ELEXIS)

    The corpus contains parallel, aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data. See also: http://hdl.handle.net/20.500.11821/34
You can also access this registry using the API (see API Docs).