CLARIN - Repositories

LMF Arabic characters lexicon

An LMF-conformant XML-based file containing all Arabic characters (letters, vowels and punctuation marks). Each character is described with a description, different displays (isolated,...

CzEngClass 0.3

The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language...

Italian Content Words v3

This resource is the third version of the Italian morphological dictionary for content words (http://hdl.handle.net/11372/LRT-2630), encoded in a JSON Lines format. Compared to...

Arabic WordNet ontology

This improved version is an extension of the original Arabic Wordnet (http://globalwordnet.org/arabic-wordnet/awn-browser/); it was enriched with new verbs, nouns including the...

Special Nouns Lexicon

An XML-based file containing Arabic stop-words that follow noun syntax: particle nouns, signal nouns, separated pronouns and connected nouns. Citation: Driss Namly, Yasser...

VIADAT-ANALYZE (2019-12-31)

A VIADAT module; VIADAT-ANALYZE is an interactive environment that enables the end user to work with stored, annotated and indexed audio recordings. Allowing visualization and...

OdiEnCorp 1.0

We have collected English-Odia parallel and monolingual data from the available public websites for NLP research in Odia. The parallel corpus consists of English-Odia...

"Al wassit" Arabic dictionary

An XML-based file containing the electronic version of the "Al wassit" dictionary, an Arabic monolingual dictionary compiled by the Academy of the Arabic Language in Cairo.

Vystadial 2013 – English data

Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....

Italian Function Words v3

This dictionary is the third version of 11372/LRT-2288, a curated list of Italian function words in a JSON Lines format text file, particularly useful for tasks such as part of...

VPS-GradeUp (2016-10-10)

VPS-GradeUp is a collection of triple manual annotations of 29 English verbs based on the Pattern Dictionary of English Verbs (PDEV) and comprising the following lemmas:...

GECCC Grammar Error Correction Corpus for Czech

Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website...

Prague Dependency Treebank 2.5

The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see...

Test Data EN-DE MT_NMT APE Shared Task WMT18

Test data for the WMT 2018 Automatic post-editing task. They consist of English-German pairs (source and target) belonging to the information technology domain and already...

HetWiK: Heterogene Widerstandskulturen

The representative full-text digitized HetWiK corpus is composed of 140 manually annotated texts of the German Resistance between 1933 and 1945. This includes both well-known...

Czech Multiword Expressions

The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains...

VIADAT-ANNOTATE

A VIADAT module; VIADAT-ANNOTATE is an interactive annotation environment. Developed in cooperation with ÚSD AV ČR and NFA.

Tamil Dependency Treebank v0.1

Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB.v0.1 contains 600 sentences enriched with...

DeriNet 2.3

DeriNet is a lexical network modeling derivational and compositional relations in Czech. The nodes of the network represent Czech lexemes, while the edges capture...

Victor

Victor is a web page cleaning tool. It is aimed at removing menus, ads, footers, headers, etc. from HTML web pages, so that only the main web page content remains. Victor is based on...

1,494 datasets found