Dataset - B2FIND

UMC 0.1: Czech-Russian-English Multilingual Corpus

UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English languages with automatic pairwise sentence alignments. The primary aim of...

Universal Dependencies 1.0

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Automatic Paraphrases of Czech Reference Sentences for WMT11, 13 and 14

This dataset contains automatic paraphrases of Czech official reference translations for the Workshop on Statistical Machine Translation shared task. The data covers the years...

Vystadial 2013 – scripts

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

Package of word embeddings of Czech from a large corpus

This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the currently most extensive corpus of Czech, namely SYN v9...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2013 – VERSION 1)

german version see below The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

german version see below The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the...

Czech-Slovak Parallel Corpus

Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...

Open SDP 1.2

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Diakorp v6: diachronic corpus of Czech

Diachronic corpus of Czech sized 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th-20th century period. The texts are transcribed, not...

MTMonkey

MTMonkey is a web service which handles and distributes JSON-encoded HTTP requests for machine translation (MT) among multiple machines running an MT system, including text pre-...

Universal Segmentations 1.0 (UniSegments 1.0)

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...

IDENTICv1.0-raw

Raw Text

DiscoMT 2016 Shared Task on Cross-lingual Pronoun Prediction

Files for the DiscoMT 2016 shared task on cross-lingual pronoun prediction

MUSCIMarker

MUSCIMarker is an open-source tool for annotating visual objects and their relationships in binary images. It is implemented in Python, known to run on Windows, Linux and OS X,...

SYN2006PUB: corpus of Czech newspapers

Corpus of contemporary Czech newspapers and magazines sized 300 MW. It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and...

Tigrinya Web Corpus

Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Universal Derivations v1.1

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

onion

onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool has been implemented in Python, licensed under New BSD License and...

500 datasets found