-
Czech Proofreading Rules
The collection describes proofreading errors in Czech covered by Opravidlo 1.0. It consists of: - the grammar rules applicable via the SET Czech syntactic parser - description... -
Content-based annotation of page images from the (archaeological) historical ...
This dataset employs a comprehensive 11-label classification scheme to categorize scanned images of document pages. The types are based on their content and presentation format.... -
StarwarsNER French Italian Corpus - sample
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. It... -
KIParla - KIPasti transcripts
The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIPasti corpus was compiled... -
KIParla - ParlaTO transcripts
The ParlaTO corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaTO corpus was... -
KIParla - ParlaBO transcripts
The ParlaBO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled within the... -
Corpus of Slovenian historical legal texts SI-IUS 1.0
The SI-IUS collection of older law texts is meant to be used both as a digital library and as a language corpus. For the former, each text has been carefully annotated in TEI... -
Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0
SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries. It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et... -
Spoken corpora of parliamentary debates ParlaSpeech 3.0
The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus... -
Parallel Corpus (EN-LT-DA) of General Data Protection Regulation (ELEXIS)
Trilingual parallel corpus on general data protection regulation. The size of the corpus is 54,468 words in English, 42,566 words in Lithuanian, and 47,740 words in Danish. The... -
Developmental corpus Šolar 3.0
The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15),... -
Developmental corpus Šolar 2.0
The Developmental corpus Šolar 2.0 consists of 5,485 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school... -
KIParla - KIP transcripts
The KIP corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIP corpus was compiled within... -
Diccionario de Arquitectura_ES-Dictionnaire d'Architecture_FR
The Diccionario de Arquitectura_ES – Dictionnaire d’Architecture_FR is a bilingual Spanish–French lexical resource focused on contemporary architecture. It has been designed as... -
Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dat...
This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main... -
The "Mobile languages" corpus MoJezik 1.0 (audio)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
The "Mobile languages" corpus MoJezik 1.0 (transcription)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
The YouTube Corpus of Singapore English Podcasts
The YouTube Corpus of Singapore English Podcasts (YCSEP) contains transcripts from 620 hours of over 1,300 podcast episodes by Singapore-based content creators. The dataset,... -
Pilot corpus of student academic texts KOŠ 1.0
The Pilot corpus of student academic texts KOŠ 1.0 consists of authentic texts written by undergraduate students (approx. age 19–23 years) as part of their coursework at two...
