Dataset - B2FIND

Training corpus ssj500k 2.1

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Lexicon of historical Slovene imp25k 1.1

The imp25k lexicon of historical Slovene was created automatically from the goo300k and foo3M annotated corpora and contains attested and manually verified word forms and their...

Multilingual comparable corpora of parliamentary debates ParlaMint 1.0

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020, with each corpus being about...

Training corpus ssj500k 2.2

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Ukrainian parliamentary corpus ParlaMint-UA 4.0.1

The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 is an extended version of the ParlaMint-UA 4.0 corpus (available as a collection of plain texts along with TSV metadata of...

CMC shortening corpus Janes-Kratko 1.0

Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...

Corpus of term-annotated texts RSDO5 1.0

The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.0

ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Corpus of 1968 Slovenian literature Maj68 3.0

The Maj68 corpus contains 1,521 texts (about a million words) by 198 known authors published between 1964 and 1972 in the periodicals "Tribuna", "Problemi" and "Problemi....

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Collection of Slovenian paremiological units Pregovori 1.0

This corpus collects and annotates the extensive and highly valuable diachronic collection of Slovenian proverbs, 50 years and more in the making at the ZRC SAZU Institute of...

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Japanese-Slovene learner's dictionary jaSlo 3.1

The jaSlo dictionary is primarily intended for Slovene students learning Japanese. For each entry, it contains the Japanese headword (kanji, hiragana or katakana, and romaji),...

Training corpus ssj500k 1.4

The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named...

Spoken corpus Gos VideoLectures 4.1 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training...

Written corpus ccKres 1.0

The ccKres corpus consists of 9,376 documents, each containing information about the source (e.g. newspapers, magazines), year of publication, text type (fiction, newspaper), the...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Training corpus jos1M 1.2

The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en.ana 4.0 is the English machine translation of the ParlaMint.ana 4.0 (http://hdl.handle.net/11356/1860) set of corpora of parliamentary debates across Europe. The...

Blog post and comment corpus Janes-Blog 1.0

Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts...

141 datasets found