-
Lacandon corpus
Documentation of the Lacandon project (DoBeS project) -
Natural Language Toolkit
Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks. NLTK includes the... -
CLaRK System - XML-based system for Corpora Development
The CLaRK System incorporates several technologies: XML technology; Unicode; Cascaded Regular Grammars; Constraints over XML Documents. On the basis of these technologies... -
Czech HS Contracts Dataset (CHSC) 1.0
The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (2021) by A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal.... -
The Swedish Parole corpus
Mixed-genre corpus (press, fiction, popular science, public information); approx. 19 million words; POS tags (in CWB format) -
Deltacorpus
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger... -
CzEngClass 0.2
The CzEngClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language... -
≠Akhoe Hai//om
Documentation of the ≠Akhoe Hai//om project (DoBeS project) -
Oromo web corpus
Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated. -
Cercador OBNEO
Search engine of the BOBNEO data bank, a database of neologisms present in the mass media in Spanish and Catalan, written and oral, from 1992. -
Deep Universal Dependencies 2.4
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional... -
Czech and English abstracts of ÚFAL papers
This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles... -
LiFR-Lite (2021-11-05)
Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features... -
BulTreeBank Frequency List
The 100,000 most frequent Cyrillic tokens in the BulTreeBank text archive, as a UTF-16 list of token-frequency pairs -
Synthetic part of CzEng 2.0
CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for... -
L2 Acquisition Heide Wegener
Language Acquisition corpus -
Turkish Natural Language Processing Pipeline
This is a state-of-the-art pipeline of Turkish NLP tools (sentence splitting, tokenisation, normalisation, deasciification, vowelisation, spelling correction, morphological... -
The Atlas of Place Names (Paikannimikartasto)
Finnish place names -
Copenhagen Dependency Treebanks versions 1-3
Parallel treebanks with annotation of syntax, discourse, coreference, morphology, and semantics. Version 3 also includes the Danish Dependency Treebank (version 1) and the... -
Lexico-Semantic Annotation of PDT using Czech WordNet
This dataset contains an annotation of PDT using the Czech WordNet ontology (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3). Data is stored in PML format. This is a stand-off...
