CLARIN - Repositories

One-million Corpus of Croatian Literary Language

written; reference corpus; general; diachornic; monolingual

ParCzech PS7 2.0

The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...

Open SDP

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

Språkbanken (Swedish Language Bank)

Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,...

YouTube-ASL Clip Keypoint Dataset

The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos....

tfidf

It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in...

Teop

Documentation of the Teop project (DoBeS project)

ConsILR - Consortium for the Romanian Language: Resources & Tools

Resources and tools developed for Romanian

Brockhaus' Kleines Konversations-Lexikon

Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts

Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus)

The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young,...

SALDO

SALDO (Swedish Associative Thesaurus version 2) is an extensive lexicon resource for modern Swedish written language created for the purpose of language technology research and...

Narrangansett dictionary

The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA.

Wikipedia-Lexikon

Präsentation der am häufigsten gesuchten (20.000) Stichwörter der Wikipedia

Romanian Explanatory Dictionary

292.792 definitions

Lacandon corpus

Documentation of the Lacandon project (DoBeS project)

Natural Language Toolkit

Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks. NLTK includes the...

CLaRK System - XML-based system for Corpora Development

The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies...

Czech HS Contracts Dataset (CHSC) 1.0

Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal....

The Swedish Parole corpus

mixed-genre (press, fiction, pop science, public information); appr. 19 MW; POS tags (in CWB format)

Deltacorpus

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

4,793 datasets found