78 datasets found

Keywords: manual annotation

  • Corpus of term-annotated texts RSDO5 1.0

    The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...
  • Croatian Twitter training corpus ReLDI-NormTag-hr 1.0

    ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...
  • Q-CAT Corpus Annotation Tool 1.5

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The...
  • Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

    ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Training corpus ssj500k 1.4

    The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named...
  • Corpus of comma placement Vejica 1.3

    A collection of sentences demonstrating and correcting comma usage. The sentences come from five sources: - KUST: a Slovene learner corpus,...
  • xLiMe Twitter Corpus XTC 1.0.1

    The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...
  • Dataset of Slovene idiomatic expressions SloIE

    SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an...
  • Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

    ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...
  • Q-CAT Corpus Annotation Tool 1.4

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The...
  • Training corpus jos1M 1.2

    The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...
  • Choice of plausible alternatives dataset in Serbian COPA-SR

    The COPA-SR dataset (Choice of plausible alternatives in Serbian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by following the...
  • MULTEXT-East "1984" annotated corpus 4.0

    The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original...
  • Training corpus SUK 1.1

    The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...
  • CMC training corpus Janes-Norm 3.0

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...
  • Training corpus hr500k 1.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
  • Sentiment Annotated Dataset of Croatian News

    We present a collection of sentiment annotations for news articles (article links) in Croatian. A set of 2025 news articles was gathered from 24sata, one of the leading...
  • Tweet code-switching corpus Janes-Preklop 1.0

    Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...
  • CMC training corpus Janes-Tag 1.2

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • Terminology identification dataset KAS-term 1.0

    The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4, extracted via morphosyntactic patterns and the...