68 datasets found

Keywords: manual annotation

  • CMC training corpus Janes-Tag 1.2

    Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
  • CMC training corpus Janes-Norm 1.2

    Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
  • CMC shortening corpus Janes-Kratko 1.0

    Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...
  • CMC training corpus Janes-Syn 1.0

    Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene...
  • Reference corpus of historical Slovene goo300k 1.2

    goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...
  • Q-CAT Corpus Annotation Tool 1.5

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The...
  • Croatian linguistic training corpus hr500k 2.0

    The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...
  • Q-CAT Corpus Annotation Tool 1.4

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The...
  • Q-CAT Corpus Annotation Tool 1.3

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
  • Q-CAT Corpus Annotation Tool 1.2

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
  • Q-CAT Corpus Annotation Tool 1.1

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
  • Q-CAT Corpus Annotation Tool 1.0

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these...
  • Corpus of comma placement Vejica 1.3

    A collection of sentences demonstrating and correcting comma usage. The sentences come from five sources: - KUST: a Slovene learner corpus,...
  • Training corpus ssj500k 2.0

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....
  • xLiMe Twitter Corpus XTC 1.0.1

    The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...
  • Dataset of normalised Slovene text KonvNormSl 1.0

    Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content....
  • Corpus of comma placement Vejica 1.0

    A collection of sentences demonstrating and correcting comma usage. The sentences come from four sources: - KUST: a Slovene learner corpus,...
  • Training corpus jos1M 1.1

    The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...
  • Training corpus ssj500k 1.3

    The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...
  • Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1

    The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th-19th century, together with over 50...
You can also access this registry using the API (see API Docs).