CLARIN - Repositories

Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dat...

This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main...

The YouTube Corpus of Singapore English Podcasts

The YouTube Corpus of Singapore English Podcasts (YCSEP) contains transcripts of over 1,300 podcast episodes (620 hours in total) by Singapore-based content creators. The dataset,...

Pilot corpus of student academic texts KOŠ 1.0

The Pilot corpus of student academic texts KOŠ 1.0 consists of authentic texts written by undergraduate students (approx. age 19–23 years) as part of their coursework at two...

ASR model evaluator

Docker image with an ASR evaluation tool that supports WER calculation on punctuated and capitalised transcripts. The UI allows uploading the reference and predicted...

Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0,...

Lists of Slovene accentuated units SNES 1.0

SNES (Stalno naglašene enote iz Sloleksa; Constantly accentuated units from Sloleks) is a dataset containing Slovene final accentuated word parts (i.e., the ending part of an...

Treebanks for Unified Taxonomy of Deep Syntactic Relations

The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of...

Domain-Specific Languages for the GreekSchools project

The repository hosts the Context-Free Grammars for the Domain-Specific Languages developed within the GreekSchools project. The repository includes diplomatic and literary DSLs...

GreekSchools Public Editions

The GitHub repository archive hosting the XML documents for the open access critical edition of the 885222-GreekSchools ERC project. GreekSchools XML Data for PHerc. 327...

Semantic lexicon of Slovene sloWNet 3.1

sloWNet is the Slovene WordNet developed using the expand approach: it contains the complete Princeton WordNet 3.0 and over 70,000 Slovene literals. These literals have been added...

Uniform Meaning Representation 2.1 (Czech and Latin)

Czech and Latin UMR data, both manually annotated and programmatically converted from manually annotated tectogrammatical data.

Desam v2.0

DESAM is a Czech morphologically annotated corpus that has been manually disambiguated. Each token is annotated for lemma, part-of-speech and all grammatical categories using the...

MorfoCzech

A dictionary of morphologically segmented word forms in Czech. Rules of manual segmentation are described in Pelegrinová, K., Mačutek, J., Čech, R. (2021). The Menzerath-Altmann...

CzeSL Grammatical Error Correction Dataset (CzeSL-GEC)

CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and...

FAUST cs-en 0.5

This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....

Arabic Phonetic Rules

This XML file describes the Arabic phonetic constraints (rules) resulting from the analysis of the lexicons (Taj Alarous, Al ain, Lisan Al arab, Alwassit and...

LiFR-Lite

Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...

Universal Dependencies 1.2 Models for Parsito

Parsing models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use these models, you need the Parsito binary,...

Open SDP

The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data...

1,492 datasets found