CLARIN - Repositories

Lacandon corpus

Documentation of the Lacandon project (DoBeS project)

Natural Language Toolkit

Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks. NLTK includes the...

CLaRK System - XML-based system for Corpora Development

The CLaRK System incorporates several technologies: XML technology; Unicode; Cascaded Regular Grammars; Constraints over XML Documents. On the basis of these technologies...

Czech HS Contracts Dataset (CHSC) 1.0

The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal....

The Swedish Parole corpus

Mixed-genre corpus (press, fiction, popular science, public information); approx. 19 million words; POS tags (in CWB format)

Deltacorpus

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

CzEngClass 0.2

The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language...

≠Akhoe Hai//om

Documentation of the ≠Akhoe Hai//om project (DoBeS project)

Oromo web corpus

Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Cercador OBNEO

Search engine of the BOBNEO data bank, a database of neologisms present in the mass media in Spanish and Catalan, written and oral, from 1992.

Deep Universal Dependencies 2.4

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional...

Czech and English abstracts of ÚFAL papers

This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...

LiFR-Lite (2021-11-05)

Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...

BulTreeBank Frequency List

100 000 most frequent Cyrillic tokens in the BulTreeBank text archive, UTF-16 list of token-frequency pairs

Synthetic part of CzEng 2.0

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...

L2 Acquisition Heide Wegener

Language Acquisition corpus

Turkish Natural Language Processing Pipeline

This is a state-of-the-art pipeline of Turkish NLP tools (sentence splitting, tokenisation, normalisation, deasciification, vowelisation, spelling correction, morphological...

The Atlas of Place Names (Paikannimikartasto)

Finnish place names

Copenhagen Dependency Treebanks versions 1-3

Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the...

Lexico-Semantic Annotation of PDT using Czech WordNet

This dataset contains annotation of PDT using the Czech WordNet ontology (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3). Data is stored in PML format. This is a stand-off...

4,938 datasets found