-
One-million Corpus of Croatian Literary Language
written; reference corpus; general; diachornic; monolingual -
ParCzech PS7 2.0
The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between... -
Open SDP
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data... -
Språkbanken (Swedish Language Bank)
Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,... -
YouTube-ASL Clip Keypoint Dataset
The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos.... -
tfidf
It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in... -
Teop
Documentation of the Teop project (DoBeS project) -
ConsILR - Consortium for the Romanian Language: Resources & Tools
Resources and tools developed for Romanian -
Brockhaus' Kleines Konversations-Lexikon
Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts -
Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus)
The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young,... -
SALDO
SALDO (Swedish Associative Thesaurus version 2) is an extensive lexicon resource for modern Swedish written language created for the purpose of language technology research and... -
Narrangansett dictionary
The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA. -
Wikipedia-Lexikon
Präsentation der am häufigsten gesuchten (20.000) Stichwörter der Wikipedia -
Romanian Explanatory Dictionary
292.792 definitions -
Lacandon corpus
Documentation of the Lacandon project (DoBeS project) -
Natural Language Toolkit
Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks. NLTK includes the... -
CLaRK System - XML-based system for Corpora Development
The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies... -
Czech HS Contracts Dataset (CHSC) 1.0
Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal.... -
The Swedish Parole corpus
mixed-genre (press, fiction, pop science, public information); appr. 19 MW; POS tags (in CWB format) -
Deltacorpus
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...