4,938 datasets found

Repositories: CLARIN

  • QT21 Data

    Post-editing and MQM annotations produced by the QT21 project. As described in @InProceedings{specia-etal_MTSummit:2017, author = {Specia, Lucia and Kim Harris and...
  • KER - Keyword Extractor

    KER is a keyword extractor that was designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm with the idf tables trained on texts from...
  • POPPINS

    Document classifier
  • Czech Lexico-Semantic Database 0.1

    A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital...
  • IFA speech corpus

    Spoken corpus containing speech from 4 male and 4 female speakers; 50,000 words segmented at the phoneme level.
  • Korpus 90

    written, general language; 22 million tokens
  • The National Certificates corpus

    NC test results, background information, and speaking and writing performances in 9 foreign/second languages. A web-based database (HTML files).
  • Hunglish Corpus

    Bilingual written general; 2 million sentences
  • LongEval Click-Model Relevance Judgements (Qrels)

    The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three...
  • Terminal-based CoNLL-file viewer, v2

    A simple way of browsing CoNLL format files in your terminal. Fast and text-based. To open a CoNLL file, simply run: ./view_conll sample.conll The output is piped through less,...
  • Universal Dependencies 1.3

    Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...
  • Individual Textual Profiles of Hillary Clinton and Donald Trump

    This corpus consists of full transcriptions of both Democratic and Republican 2016 presidential candidate debates, with a special focus on the idiolects of Hillary Clinton and...
  • Project Gutenberg

    Free electronic books available to download or browse online; German-language literature makes up only a...
  • LAC Nqeq Corpus

    Language and Cognition corpus
  • Plant names in Dutch dialect (PLAND)

    Plant names in Dutch dialect
  • Gesprächanalytisches Informationssystem (GAIS)

    Web-based information system for the scientific community (news, events, persons, job market, mailing list, database on research projects and corpora, bibliography, glossary and...
  • Test Data EN-DE APE Shared Task WMT17

    Test data for the WMT 2017 Automatic Post-Editing task (the same data used for the Sentence-level Quality Estimation task). They consist of 2,000 English-German pairs (source and...
  • TITUS Laz

    ca. 900 tokens
  • MEd

    MEd is an annotation tool in which linearly-structured annotations of text or audio data can be created and edited. The tool supports multiple stacked layers of annotations that...
  • TITUS Prakrit

    ca. 7,000 tokens; linked with relational database; XML encoding in progress