-
Pilot corpus of student academic texts KOŠ 1.0
The Pilot corpus of student academic texts KOŠ 1.0 consists of authentic texts written by undergraduate students (approx. age 19–23 years) as part of their coursework at two... -
ASR model evaluator
Docker image with an ASR evaluation tool that supports WER calculation on punctuated and capitalised transcripts. The UI allows uploading the reference and predicted... -
Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710
Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0,... -
MIXPAR Database: Version 1.0 (September 2025)
MIXPAR: A Database of Mixed Perfective Auxiliation in Italo-Romance (v1.0). This is the first public release (v1.0) of the MIXPAR database, a large-scale dataset documenting... -
Lists of Slovene accentuated units SNES 1.0
SNES (Stalno naglašene enote iz Sloleksa; Constantly accentuated units from Sloleks) is a dataset containing Slovene final accentuated word parts (i.e., the ending part of an... -
Business English learner speech corpus SAPS
SAPS is a specialized speech corpus which contains business meeting simulations in English between undergraduate students of Languages for Business and Economics at the School... -
Lexical Dataset of Czech nene- Constructions
A database of attested words with the double negative prefix nene- in Czech, serving primarily the purposes of a bachelor's thesis at ÚČJTK FFUK (Dvojitá negace nene- (typ nenedostal, nenepatrný) v češtině;... -
Stereotypes and Discourse Connectors in Czech
The purpose of the dataset is to test three variables: (i) the effect of argument order in Ale-constructions (But-constructions) “A, ale B” (“A, but B”): positive A, but... -
Treebanks for Unified Taxonomy of Deep Syntactic Relations
The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on... -
LegISTyr test set
LegISTyr is a machine translation test set for evaluating the quality of legal terminology translation from Italian to South Tyrolean German, a minor standard variety of German.... -
Frekvensordliste (nynorsk) frå ord i norsk avisspråk
Frequency word list from Norsk aviskorpus (Nynorsk). The frequency word list is a list of the 10,000 most frequent words in Norsk aviskorpus, sorted by frequency from highest to lowest.... -
Frekvensordliste (bokmål) fra ord i norsk avisspråk
Frequency word list from Norsk aviskorpus (Bokmål). The frequency word list is a list of the 10,000 most frequent words in Norsk aviskorpus, sorted by frequency from highest to lowest....
Monitor corpus of Slovene Trendi 2025-08
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-08 covers the period from January... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of... -
Monitor corpus of Slovene Trendi 2025-07
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-07 covers the period from January... -
The "Mobile languages" corpus MoJezik 1.0 (audio)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
The "Mobile languages" corpus MoJezik 1.0 (transcription)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno... -
Comparable corpus of parliamentary debates ParlaMint-IL 1.0
The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain... -
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of... -
Domain-Specific Languages for the GreekSchools project
The repository hosts the Context-Free Grammars for the Domain-Specific Languages developed within the GreekSchools project. The repository includes diplomatic and literary DSLs...