500 datasets found

None: application/x-gzip

  • UMC 0.1: Czech-Russian-English Multilingual Corpus

    UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English with automatic pairwise sentence alignments. The primary aim of...
  • Universal Dependencies 1.0

    Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...
  • Automatic Paraphrases of Czech Reference Sentences for WMT11, 13 and 14

    This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years...
  • Vystadial 2013 – scripts

    Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....
  • Package of word embeddings of Czech from a large corpus

    This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9...
  • Deep Universal Dependencies 2.5

    Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...
  • Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)

    (German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...
  • Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

    (German version: see below.) The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...
  • Czech-Slovak Parallel Corpus

    Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...
  • Open SDP 1.2

    The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...
  • Diakorp v6: diachronic corpus of Czech

    Diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not...
  • MTMonkey

    MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre-...
  • Universal Segmentations 1.0 (UniSegments 1.0)

    Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...
  • IDENTICv1.0-raw

    Raw Text
  • DiscoMT 2016 Shared Task on Cross-lingual Pronoun Prediction

    Files for the DiscoMT 2016 shared task on cross-lingual pronoun prediction
  • MUSCIMarker

    MUSCIMarker is an open-source tool for annotating visual objects and their relationships in binary images. It is implemented in Python, known to run on Windows, Linux and OS X,...
  • SYN2006PUB: corpus of Czech newspapers

    Corpus of contemporary Czech newspapers and magazines comprising 300 million words. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and...
  • Tigrinya Web Corpus

    Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • Universal Derivations v1.1

    Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...
  • onion

    onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool has been implemented in Python, licensed under New BSD License and...
You can also access this registry using the API (see API Docs).