88 datasets found

Keywords: corpus

  • The Multilingual Emotional Football Corpus (MEmoFC)

    The Multilingual Emotional Football Corpus (MEmoFC) has been manually collected from English, German, and Dutch websites of individual football clubs to investigate the way...
  • Korpus 2

    Korpus 2
  • KAMOKO: KAsseler MOrgenstern KOrpus

    KAMOKO is a structured and commented French learner corpus. It addresses the central structures of the French language from a linguistic perspective (18 different courses). The...
  • OdiEnCorp 2.0

    We have collected English-Odia parallel data for NLP research on the Odia language. The data for the parallel corpus was extracted from existing parallel...
  • Large-Scale Colloquial Persian 0.5

    "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a...
  • Hindi Visual Genome 1.0

    Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...
  • Czech Text Document Corpus v 2.0

    Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of the text...
  • Czech Court Decisions Dataset

    We present the Czech Court Decisions Dataset (CCDD) -- a dataset of 300 manually annotated court decisions published by The Supreme Court of the Czech Republic and the...
  • CEHugeWebCorpus

    This corpus was originally created for performance testing (server infrastructure CorpusExplorer - see: diskurslinguistik.net / diskursmonitor.de). It includes the filtered...
  • Indonesian web corpus (idWac)

    Indonesian text corpus from the web. Crawled by SpiderLing in 2017. Filtered by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd...
  • Khresmoi Summary Translation Test Data 2.0

    This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech,...
  • Khresmoi Query Translation Test Data 2.0

    This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and...
  • NAFIS Arabic Stemming Gold Standard Corpus

    Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of texts, selected to be representative of...
  • Czech Malach Cross-lingual Speech Retrieval Test Collection

    The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...
  • SYN v4: large corpus of written Czech

    Corpus of contemporary written (printed) Czech sized 3.6 GW (i.e. 4.3 billion tokens). It covers mostly the period of 1990–2014 and it is a traditional corpus (as opposed to the...
  • CsEnVi Pairwise Parallel Corpora

    CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:...
  • Balaxan Corpus of Kurmanji

    Balaxan is the first speech corpus of Kurmanji Kurdish with 58 utterances by speakers of Kurmanji. Utterances are divided into 4 categories based on their sentence structures:...
  • Czech Legal Text Treebank

    The Czech Legal Text Treebank (CLTT) is a collection of 1133 manually annotated dependency trees. CLTT consists of two legal documents: The Accounting Act (563/1991 Coll., as...
  • Tamil Dependency Treebank v0.1

    Tamil Dependency Treebank version 0.1 (TamilTB.v0.1) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB.v0.1 contains 600 sentences enriched with...
  • Czech-English Parallel Corpus 1.0 (CzEng 1.0)

    CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), freely available for...