CLARIN - Repositories

Visible Vowels

This program enables the user to plot vowels in the F1/F2 space for multiple points in the vowel interval, e.g. at 20%, 50% and 80%.

ESIC 1.0 -- Europarl Simultaneous Interpreting Corpus

ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and...

IWPT 2021 Shared Task Data and System Outputs

This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal...

Prague Dependency Treebank 3.5

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied...

DeriNet 2.0

DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational or...

Slovak MorphoDiTa Models 170914

Slovak models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex SK...

VIADAT (2019-12-31)

This component integrates other VIADAT modules; together with VIADAT-REPO this composes the Virtual Assistant for accessing historical audiovisual data. The zip archive contains...

ORAL2013: balanced corpus of informal spoken Czech (transcriptions)

ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Prague Czech-English Dependency Treebank 2.0 - Russian translation

Prague Czech-English Dependency Treebank - Russian translation (PCEDT-R) is a project of translating a subset of Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) to...

EvaLatin 2020 models for UDPipe 2 (2020-08-31)

POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at...

WMT18 APE Shared Task: En-DE NMT Train and Dev Data

Training and development data for the WMT 2018 Automatic Post-Editing task. They consist of English-German triplets (source, target and post-edit) belonging to the information...

Universal Derivations v0.5

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

MorfFlex CZ 2.1 (2024-12-23)

MorfFlex CZ 2.1 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex CZ 2.1 is a part of the...

Deep Universal Dependencies 2.6

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional...

CoNLL 2018 Shared Task System Outputs

Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.

Italian Function Words

This dictionary is a curated list of Italian function words in a JSON Lines text file, particularly useful for tasks such as POS tagging or syntactic parsing. It contains...

Lingua::Interset 2.026

Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21...

IWPT 2020 Shared Task Data and System Outputs

This package contains data used in the IWPT 2020 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal...

Czech Legal Text Treebank 2.0

The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as the CLTT 1.0. These texts come from the legal domain and they are manually syntactically annotated. The...

Vystadial 2013 – Czech data

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

1,494 datasets found