Dataset - B2FIND

MULTEXT-East "1984" annotated corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

Training corpus SUK 1.1

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

Multilingual comparable corpora of parliamentary debates ParlaMint 4.0

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Corpus of academic Slovene KAS 2.0

The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens)...

CMC training corpus Janes-Norm 3.0

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,...

Slovenian parliamentary corpus (1990-2022) siParl 3.0

The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Training corpus hr500k 1.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

Slovenian parliamentary corpus (1990-2018) siParl 2.0

The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en 3.0 comprises linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488) which were...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

Spoken corpus Gos VideoLectures 4.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 4.2 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training...

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Corpus of Slovenian school texts SBSJ 1.0

The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus, which includes 428 short school texts written primarily by primary-school students from 1st...

Slovenian parliamentary corpus (1990-2022) siParl 4.0

The siParl 4.0 corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of...

Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

The corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939 (Zbirka stenografskih beležk, zapisnikov sej...

English-Slovene term candidates KAS-biterm 1.0

KAS-biterm is an automatically generated glossary of English terms with their translations into Slovene. The pairs, possibly with their English and Slovene acronyms, were...

Corpus of Academic Slovene (BSc/BA theses) KAS-dipl 1.0

The KAS-dipl corpus of Slovene BSc/BA theses consists of almost 65,000 texts (3.5 million pages or 1.1 billion tokens) written between 2000 and 2018 and gathered from the digital...

141 datasets found