-
Frequency word list (Nynorsk) from words in Norwegian newspaper language
Frequency word list from Norsk aviskorpus (Nynorsk). The frequency word list is a list of the 10,000 most frequent words in Norsk aviskorpus, sorted by frequency from highest to lowest.... -
Frequency word list (Bokmål) from words in Norwegian newspaper language
Frequency word list from Norsk aviskorpus (Bokmål). The frequency word list is a list of the 10,000 most frequent words in Norsk aviskorpus, sorted by frequency from highest to lowest.... -
Monitor corpus of Slovene Trendi 2025-08
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-08 covers the period from January... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of... -
Monitor corpus of Slovene Trendi 2025-07
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-07 covers the period from January... -
The "Mobile languages" corpus MoJezik 1.0 (audio)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
The "Mobile languages" corpus MoJezik 1.0 (transcription)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
Comparable corpus of parliamentary debates ParlaMint-IL 1.0
The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of... -
Domain-Specific Languages for the GreekSchools project
The repository hosts the Context-Free Grammars for the Domain-Specific Languages developed within the GreekSchools project. The repository includes diplomatic and literary DSLs... -
GreekSchools Public Editions
The GitHub repository archive hosting the XML documents for the open access critical edition of the 885222-GreekSchools ERC project. GreekSchools XML Data for PHerc. 327... -
Women’s Empowerment – Inner and Outer Communication (Pilot Corpus)
The submitted data consists of the Women’s Empowerment Pilot Corpus, a curated collection of 30 short texts and dialogue excerpts documenting the communicative journey of... -
Oral History Resource: Lithuanian Testimonies of Siberian Deportations
The oral history resource includes: (1) Audio recordings (recorded in 2009-2010) of personal narratives by siblings Pranas Šuminskas and Vladislava Šuminskaitė about their... -
Lists of Slovene accentuated units SNES 1.0
SNES (Stalno naglašene enote iz Sloleksa; Constantly accentuated units from Sloleks) is a dataset containing Slovene final accentuated word parts (i.e., the ending part of an... -
The corpus of older Slovenian narrative prose PriLit 1.0
The PriLit corpus contains 37 texts of older Slovenian narrative prose by 12 authors. One text, Sreča v nesreči (Fortune in Misfortune) by Janez Cigler (first published in... -
Semantic lexicon of Slovene sloWNet 3.1
sloWNet is the Slovene WordNet developed in the expand approach: it contains the complete Princeton WordNet 3.0 and over 70,000 Slovene literals. These literals have been added... -
Monitor corpus of Slovene Trendi 2025-06
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-06 covers the period from January... -
Dataset for primary stress identification in Croatian and related languages a...
The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data... -
Spoken corpora of parliamentary debates ParlaSpeech 3.0
The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus... -
Slovenian Day of Resistance X & news corpus
The dataset contains social media posts from X and traditional media articles from online news sources related to the Slovenian commemorations of the Day of Resistance. We used...