Dataset - B2FIND

Cretan Institutional Inscriptions

The database Cretan Institutional Inscriptions was created as part of the PhD research project in Ancient Heritage Studies Kretikai Politeiai: Cretan Institutions from VII to I...

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

The "Arcticae horulae" dictionary of German borrowings in Slovenian

The "Arcticae horulae" dictionary of German borrowings in Slovenian was a project of continuous development, from a private amateur collection of German borrowings in the...

Spoken corpus Gos 2.1 (transcriptions)

The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

Multilingual comparable corpora of parliamentary debates ParlaMint 2.0

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

CMC training corpus Janes-Tag 1.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Serbian Twitter training corpus ReLDI-NormTag-sr 1.0

ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Corpus of texts by Hijacint Repič in "Cvetje z vertov sv. Frančiška" CVET 1.0

The CVET corpus contains 230 texts (around 175 thousand words) of varying length, published in the religious journal "Cvetje z vertov sv. Frančiška" between 1887 and 1916, when...

CMC training corpus Janes-Norm 1.0

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Wikipedia talk corpus Janes-Wiki 1.0

Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...

Slovenian parliamentary corpus (1990-1992) SlovParl 2.0

The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after...

Corpus of 1968 Slovenian literature Maj68 1.0

The Maj68 corpus contains 874 texts published between 1964 and 1972 in the periodicals "Tribuna", "Problemi" and "Problemi. Literatura." The texts contain complete bibliographical...

Corpus of academic Slovene KAS 1.0

The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.7 billion tokens)...

Training corpus ssj500k 1.3

The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...

Multilingual comparable corpora of parliamentary debates ParlaMint 2.1

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20...

CMC training corpus Janes-Tag 2.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Training corpus jos1M 1.1

The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

Spoken corpus Gos VideoLectures 3.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

141 datasets found