CLARIN - Repositories

Optimal reference translation of English-Czech WMT2020

We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations...

ParaDi 2.0

ParaDi 2.0 is a dictionary of single-verb paraphrases of Czech verbal multiword expressions: light verb constructions and idiomatic verb constructions. Moreover, it provides...

Vystadial 2013 – scripts

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

czTenTen12 v9 subcorpus of problematic phenomena

czTenTen12 v9 subcorpus containing problematic features (interlingual homographs, foreign proper names, named entities)

STYX 1.0

STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank. The criterion for including sentences in STYX was their suitability for practicing Czech...

Arabic Enclitics Lexicon

An XML-based file containing all Arabic enclitics

Indonesian web corpus (idWac)

Indonesian text corpus from the web. Crawled by SpiderLing in 2017. Filtered by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd...

ForFun 1.0

ForFun is a database of linguistic forms and their syntactic functions built with the use of the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs

Dictionary of single verb paraphrases of Czech light verb constructions.

MADED

Moroccan Dialect Electronic Dictionary (MDED) is an electronic lexicon containing almost 15000 MSA entries. They are written in Arabic letters and translated to Moroccan Arabic...

EvaldioData 1.0

EvaldioData 1.0 is a language corpus of spoken performances by non-native speakers of Czech. It includes recordings capturing the oral part of the Czech Language Certificate...

Restaurant Reviews CZ ABSA corpus v2

Restaurant Reviews CZ ABSA - 2.15k reviews with their related target and category. The work is described in the paper: https://doi.org/10.13053/CyS-20-3-2469

Optimal Reference Translations from English to Czech

This corpus contains annotations of translation quality from English to Czech in seven categories at both the segment and document level. There are 20 documents in total, each with...

EduPo: Analysis and Generation of Czech Poetry, v0.5

A suite of tools for analysis and generation of Czech poetry. This is a snapshot of the public GitHub repository at https://github.com/ufal/edupo -- the beta version of the tool...

Continuous Rating; Supplementary materials

Data collected in the Continuous Rating evaluation study: Continuous Rating scores and questionnaires.

Open SDP 1.2

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ...

Diakorp v6: diachronic corpus of Czech

Diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not...

HamleDT 2.0

HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a...

1,492 datasets found