4,793 datasets found

Filter Results
  • One-million Corpus of Croatian Literary Language

    written; reference corpus; general; diachornic; monolingual
  • ParCzech PS7 2.0

    The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...
  • Open SDP

    The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...
  • Språkbanken (Swedish Language Bank)

    Mainly written Swedish corpora (all time periods except Runic Swedish; various genres, including learner corpora) and lexicons; some non-Swedish corpora (Faroese, Old Icelandic,...
  • YouTube-ASL Clip Keypoint Dataset

    The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos....
  • tfidf

    It calculates the Term Frequency and the Inverse Document Frequency of a word in a given corpus (a statistical measure used to evaluate how important a word is to a document in...
  • Teop

    Documentation of the Teop project (DoBeS project)
  • ConsILR - Consortium for the Romanian Language: Resources & Tools

    Resources and tools developed for Romanian
  • Brockhaus' Kleines Konversations-Lexikon

    Aufl. 1911; Fokus auf Politik, Wirtschaft, Kultur und Technik zu Beginn des 20. Jahrhunderts
  • Nottinghamer Korpus Deutscher YouTube-Sprache (The NottDeuYTSch Corpus)

    The NottDeuYTSch corpus contains over 33 million words taken from approximately 3 million YouTube comments from videos published between 2008 to 2018 targeted at a young,...
  • SALDO

    SALDO (Swedish Associative Thesaurus version 2) is an extensive lexicon resource for modern Swedish written language created for the purpose of language technology research and...
  • Narrangansett dictionary

    The Narrangansett corpus contains a cultural linguistic dictionary and grammatical information on the Narrangansett language, an extinct language of the USA.
  • Wikipedia-Lexikon

    Präsentation der am häufigsten gesuchten (20.000) Stichwörter der Wikipedia
  • Romanian Explanatory Dictionary

    292.792 definitions
  • Lacandon corpus

    Documentation of the Lacandon project (DoBeS project)
  • Natural Language Toolkit

    Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks. NLTK includes the...
  • CLaRK System - XML-based system for Corpora Development

    The CLaRK System incorporates several technologies: - XML technology - Unicode - Cascaded Regular Grammars; - Constraints over XML Documents On the basis of these technologies...
  • Czech HS Contracts Dataset (CHSC) 1.0

    Czech Contracts dataset was created as a part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal....
  • The Swedish Parole corpus

    mixed-genre (press, fiction, pop science, public information); appr. 19 MW; POS tags (in CWB format)
  • Deltacorpus

    Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...