CLARIN - Repositories

Monitor corpus of Slovene Trendi 2025-08

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-08 covers the period from January...

Monitor corpus of Slovene Trendi 2025-07

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-07 covers the period from January...

Comparable corpus of parliamentary debates ParlaMint-IL 1.0

The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of...

Domain-Specific Languages for the GreekSchools project

The repository hosts the Context-Free Grammars for the Domain-Specific Languages developed within the GreekSchools project. The repository includes diplomatic and literary DSLs...

GreekSchools Public Editions

The GitHub repository archive hosting the XML documents for the open access critical edition of the 885222-GreekSchools ERC project. GreekSchools XML Data for PHerc. 327...

Women’s Empowerment – Inner and Outer Communication (Pilot Corpus)

The submitted data consists of the Women’s Empowerment Pilot Corpus, a curated collection of 30 short texts and dialogue excerpts documenting the communicative journey of...

Oral History Resource: Lithuanian Testimonies of Siberian Deportations

The oral history resource includes: (1) Audio recordings (recorded in 2009-2010) of personal narratives by siblings Pranas Šuminskas and Vladislava Šuminskaitė about their...

The corpus of older Slovenian narrative prose PriLit 1.0

The PriLit corpus contains 37 texts of older Slovenian narrative prose by 12 authors. One text, Sreča v nesreči (Fortune in Misfortune) by Janez Cigler (first published in...

Semantic lexicon of Slovene sloWNet 3.1

sloWNet is the Slovene WordNet developed using the expand approach: it contains the complete Princeton WordNet 3.0 and over 70,000 Slovene literals. These literals have been added...

Monitor corpus of Slovene Trendi 2025-06

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-06 covers the period from January...

Dataset for primary stress identification in Croatian and related languages a...

The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data...

Slovenian Day of Resistance X & news corpus

The dataset contains social media posts from X and traditional media articles from online news sources related to the Slovenian commemorations of the Day of Resistance. We used...

Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0

The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and the beginning of the 20th century (1771-1914). The...

Uniform Meaning Representation 2.1 (Czech and Latin)

Czech and Latin UMR data, both manually annotated and programmatically converted from manually annotated tectogrammatical data.

Desam v2.0

DESAM is a morphologically annotated Czech corpus that has been manually disambiguated. Each token is annotated for lemma, part-of-speech and all grammatical categories using the...

Carniolan Provincial Assembly corpus Kranjska 1.0

The corpus contains meeting proceedings of the Carniolan Provincial Assembly from 1861 to 1913 (Obravnave deželnega zbora kranjskega / Bericht über die Verhandlungen des...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint-en.ana 5.0 is the English machine translation of the ParlaMint.ana 5.0 (http://hdl.handle.net/11356/2005) set of corpora of parliamentary debates across Europe. The...

Multilingual comparable corpora of parliamentary debates ParlaMint 4.1

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

4,938 datasets found