CLARIN - Repositories

Czech Legal Text Treebank

The Czech Legal Text Treebank (CLTT) is a collection of 1133 manually annotated dependency trees. CLTT consists of two legal documents: The Accounting Act (563/1991 Coll., as...

EdUKate translation software 1

This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool...

Czech OOV Inflection Dataset

Czech OOV Inflection Dataset is a Czech inflection dataset of nouns, focused on evaluation in out-of-vocabulary (OOV) conditions. It consists of two parts: a standard...

VPS-30-En

VPS-30-En is a small lexical resource that contains the following 30 English verbs: access, ally, arrive, breathe, claim, cool, crush, cry, deny, enlarge, enlist, forge,...

Coreference in Universal Dependencies 1.0 (CorefUD 1.0)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

VALLEX 2.5

The Valency Lexicon of Czech Verbs, Version 2.5 (VALLEX 2.5), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description...

Treebanks for Unified Taxonomy of Deep Syntactic Relations

The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on...

AKCES 2

Corpus AKCES 2 consists of transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition Corpora). It is the same data...

Czech Court Decisions Dataset

We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the...

Czech Sociological Review 1993-2016

Selected research articles and essays published in Czech Sociological Review from 1993 to 2016. Originally Czech, non-translated material only. 522 documents in total. In terms...

czes

First version of the very large Czech corpus Czes created with a new set of tools. It comprises 465,102,710 tokens.

CoNLL-based Extended Czech Named Entity Corpus 1.0

This is the Czech Named Entity Corpus 1.0 transformed into the CoNLL format. The original corpus can be downloaded from: http://hdl.handle.net/11858/00-097C-0000-0023-1B04-C. The...

VIADAT-STAT

A VIADAT module; the purpose of VIADAT-STAT is statistical analysis of recordings stored by the platform. Developed in cooperation with ÚSD AV ČR and NFA.

SLäNDa

SLäNDa, the Swedish literature corpus of narrative and dialogue, is a corpus made up of eight Swedish literary novels from the late 19th and early 20th centuries, manually...

COSTRA 1.0: A Dataset of Complex Sentence Transformations

COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard...

WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, bas...

Czech translation of WordSim353. The Czech translations of the English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to...

MorfFlex SK 170914

Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma...

EMMT (Eyetracked Multi-Modal Translation)

Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains...

Derinet 2.2

DeriNet is a lexical network which models derivational and compositional relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges...

WMT17 De-En APE Shared Task Data

Training and development data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist of German-English...

1,494 datasets found