CLARIN - Repositories

The ACL RD-TEC 2.0

The ACL RD-TEC 2.0 has been developed with the aim of providing a benchmark for the evaluation of methods for terminology extraction and classification as well as entity...

APE Shared Task WMT17: Human Post-edits Test Data DE-EN

Human post-edited test sentences for the WMT 2017 Automatic post-editing task. The set consists of 2,000 already tokenized English sentences from the IT domain. Source...

LatinISE corpus (version 5)

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre,...

Arabic Special verbs Lexicon

An XML-based file containing Arabic Stop-words respecting verbs syntax

VIADAT-ANALYZE

A VIADAT module; VIADAT-ANALYZE is an interactive environment that enables the end user to work with stored, annotated and indexed audio recordings. Allowing visualization and...

sholva-0.6

Semantic net `sholva' contains more than 150 000 records for which there was sufficient agreement among annotators. Individual words are labeled in the following categories:...

MorfoCzech 1.1

A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann...

GECCC Grammar Error Correction Corpus for Czech (2022-09-28)

Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website...

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data...

Amharic Web Corpus

Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015 and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on Amharic WIC...

SYN v9: large corpus of written Czech

Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata including detailed...

Arabic characters lexicon

An XML-based file containing all Arabic characters (letters, vowels and punctuation). Each character is described with a description, different displays (isolated, at the...

Czech Models (CNEC) for NameTag

Czech models for NameTag, providing recognition of named entities. The models are trained on the Czech Named Entity Corpus 2.0 and 1.1.

CzEng 0.7

CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual...

Semantic Difference Keywords - word2vec embeddings

Embeddings from the word2vec model described in "From Diachronic to Contextual Lexical Semantic Change: Introducing Semantic Difference Keywords (SDKs) for Discourse Studies". Full...

English-Czech parallel song lyrics

English–Czech parallel corpus of song lyrics, aligned section by section. The songs are sourced from musical films. The dataset is provided in JSON format with the following...

WMT17 En-De APE Shared Task Data

Training data for the WMT 2017 Automatic post-editing task (the same used for the Sentence-level Quality Estimation task). They consist of 11,000 English-German triplets...

Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data...

CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)

The corpipe23-corefud1.1-231206 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is...

1,494 datasets found