CLARIN - Repositories

Netgraph

Netgraph is a graphically oriented client-server application for searching in linguistically annotated treebanks. The query language of Netgraph is simple and intuitive, yet...

LINDAT Translation service

Source code of the LINDAT Translation service frontend. The service provides a UI and a simple REST API that accesses machine translation models served by TensorFlow Serving....

Manually Classified Errors in En->Sk Translation

Manual classification of errors of English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from WMT 2011 test...

MSTperl parser (2015-05-19)

MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html). MST parser (Maximum Spanning Tree parser)...

Coreference in Universal Dependencies 0.1 (CorefUD 0.1)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

APE Shared Task WMT17: Human Post-edits Test Data EN-DE

Human post-edited test sentences for the WMT 2017 Automatic post-editing task. This consists of 2,000 German sentences belonging to the IT domain and already tokenized. Source...

Coreference in Universal Dependencies 0.2 (CorefUD 0.2)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

Czech Named Entity Corpus 1.1

Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, tmt format is replaced...

VIADAT-ANNOTATE (2019-12-31)

A VIADAT module; VIADAT-ANNOTATE is an interactive annotation environment. Developed in cooperation with ÚSD AV ČR and NFA.

Czech WordNet 1.9 PDT

A slightly modified version of the Czech WordNet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet":...

NomVallex I.

The NomVallex I. lexicon describes valency of Czech deverbal nouns belonging to three semantic classes, i.e. Communication (dotaz 'question'), Mental Action (plán 'plan') and...

SYN2009PUB: corpus of Czech newspapers

Corpus of contemporary Czech newspapers and magazines sized 700 MW. It contains various titles published between 1995 and 2007. The corpus is lemmatized and morphologically tagged...

Manual Re-evaluation of Translation Quality of WMT 2018 English-Czech systems

This data set contains four types of manual annotation of translation quality, focusing on the comparison of human and machine translation quality (aka human-parity). The...

PML Tree Query

System for querying annotated treebanks in PML format. The querying uses its own query language with graphical representation. It has two different implementations (SQL and Perl)...

SynSemClass 3.5

The SynSemClass 3.5 synonym verb lexicon investigates semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English and German-English language...

AKCES 2 ver. 2

Corpus AKCES 2 ver. 2 consists of full, unabridged transcripts of recordings of classes at Czech elementary and secondary schools (AKCES/CLAC - Czech Language Acquisition...

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data...

Czech and English abstracts of ÚFAL papers (2022-11-11)

This is a parallel corpus of Czech and mostly English abstracts of scientific papers and presentations published by authors from the Institute of Formal and Applied Linguistics,...

Prague Czech-English Dependency Treebank 2.0 Coref

The Prague Czech-English Dependency Treebank 2.0 Coref (PCEDT 2.0 Coref) is a parallel treebank building upon the original PCEDT 2.0 release and enriching it with the extended...

Engineering job ads corpus

The corpus presented consists of job ads in Spanish related to Engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic...

1,494 datasets found