CLARIN - Repositories

Suulise keele korpus (Corpus of Spoken Estonian)

The Department of Estonian Language initiated the corpus of spoken Estonian in 1997. The corpus is compiled by the research group of Spoken Estonian (Tiit Hennoste, Airi...

Free Galician morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Datasets and R scripts for modelling Czech translation counterparts of Romanc...

This repository contains the datasets and code used in the study “Predicting translation counterparts in causative constructions.” The datasets consist of annotated examples of...

Free morphological analyzer Majka

Majka is a fast tool that assigns a lemma (base form) and all possible grammatical tags to each word form in the input. It can also be used for word form generation or...

Automatically Annotated Corpora with Stanza and UDPipe for Czech, English, an...

This resource contains six automatically annotated corpora derived from the Leipzig Corpora Collection, covering three languages: Czech, English, and Greek. For each language,...

Free Catalan morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Free Polish morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Trans-related Online Corpus (TROC)

The resource includes three corpora. 0. Raw corpus of YouTube and Reddit comments (see readme.txt). 1. Stance-annotated corpus of contra-trans and pro-trans comments...

Lexicon of Lithuanian Basketball Slang Terms

The lexicon was compiled by crowdsourcing, using the dictionary-editing system LEXONOMY. It was created as a study project by a group of students in the...

Eesti keele spontaanse kõne foneetiline korpus v.1.0.0 Phonetic Corpus of Es...

The aim of the corpus is to compile a large number of high-quality recordings of spontaneous Estonian and to segment them phonetically on different levels. The project started in autumn...

Mulgi sõnaraamat

The large Mulgi online dictionary presents Mulgi dialect words from all five parishes of Mulgimaa: Halliste, Helme, Karksi, Paistu, and Tarvastu. The dictionary contains over 20,000...

Slovenian legal natural language inference dataset SLawNLI

SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI...

Slovenian translation corpus Spook 1.1

The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The...

Slovene morphological segmentation and word formation dataset KOBOS

This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in...

Verbs annotated for morphemic structure in Czech, English, German, Spanish v2

A sample of verb lemmas in four languages: Czech (19,040 lemmas), English (9,969 lemmas), German (27,158 lemmas), Spanish (11,768 lemmas). Each verb lemma is annotated for its...

Lexical Dataset of Czech nene- Constructions (2026-02-28)

A database of attested words with the double negative prefix nene- in Czech, serving primarily the purposes of a bachelor's thesis at ÚČJTK FFUK (Dvojitá negace nene- (typ nenedostal, nenepatrný) v češtině;...

HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech

The HeCz corpus comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech, with each headline being accompanied by a yes–no comprehension question,...

Free Welsh morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Free Russian morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

Free Asturian morphological database for Majka

Data for assigning lemmata and tags to analyzed word forms for Majka. Majka is a free morphological analyzer that can be downloaded from https://nlp.fi.muni.cz/ma/ alongside...

4,930 datasets found