CLARIN - Repositories

Projekt_ZDH_transkripce

Text written in Kurrent, transcribed with Transkribus and then finished by hand.

English (L2) Learner Corpus

The NEC corpus samples used in the study comprise 433 examination responses (essays) written in L2 English on two topics, namely, The importance of volunteering for young...

English (L2) Learner Corpus

This data was used to analyze learner writing reported in Juknevičienė and Šeškauskienė (2014). It consists of two resources: a corpus of learner writing and a summative results...

Monitor corpus of Slovene Trendi 2026-02

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 60 publishers. Trendi 2026-02 covers the period from January...

Collection of Slovenian legal texts COLESLAW 1.0

COLESLAW 1.0 is a large-scale collection of Slovenian legal texts compiled from authoritative public sources. The corpus covers legislative, judicial, and governmental legal...

A multilingual benchmark for evaluating metalinguistic knowledge WALS-Bench 1.0

This is a large-scale multilingual benchmark for evaluating metalinguistic knowledge (i.e. explicit knowledge about the structure of languages) in large language models using...

Monitor corpus of Slovene Trendi 2026-01

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 60 publishers. Trendi 2026-01 covers the period from January...

South Slavic web corpus collection CLASSLA-web 2.0

The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian,...

Training corpus of spoken Slovenian ROG 1.1

Training corpus of spoken Slovenian ROG 1.1 is an improved version of the ROG 1.0 corpus (http://hdl.handle.net/11356/1992). The main differences between the original and the...

Training corpus of spoken Slovenian ROG 1.0

Training corpus of spoken Slovenian ROG 1.0 is the main resource for Slovenian language to train and evaluate technologies aimed at processing speech or speech transcripts, such...

Macedonian web corpus CLASSLA-web.mk 1.0

The Macedonian web corpus CLASSLA-web.mk 1.0 is based on the MaCoCu-mk 2.0 web corpus crawl (http://hdl.handle.net/11356/1801), which was additionally cleaned and enriched with...

Serbian web corpus CLASSLA-web.sr 1.0

The Serbian web corpus CLASSLA-web.sr 1.0 is based on the MaCoCu-sr 1.0 web corpus crawl (http://hdl.handle.net/11356/1807), which was additionally cleaned and enriched with...

Montenegrin web corpus CLASSLA-web.cnr 1.0

The Montenegrin web corpus CLASSLA-web.cnr 1.0 is based on the MaCoCu-cnr 1.0 web corpus crawl (http://hdl.handle.net/11356/1809), which was additionally cleaned and enriched...

Croatian web corpus CLASSLA-web.hr 1.0

The Croatian web corpus CLASSLA-web.hr 1.0 is based on the MaCoCu-hr 2.0 web corpus crawl (http://hdl.handle.net/11356/1806), which was additionally cleaned and enriched with...

Bulgarian web corpus CLASSLA-web.bg 1.0

The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with...

Bosnian web corpus CLASSLA-web.bs 1.0

The Bosnian web corpus CLASSLA-web.bs 1.0 is based on the MaCoCu-bs 1.0 web corpus crawl (http://hdl.handle.net/11356/1808), which was additionally cleaned and enriched with...

Slovenian web corpus CLASSLA-web.sl 1.0

The Slovenian web corpus CLASSLA-web.sl 1.0 is based on the Slovenian MaCoCu-sl 2.0 web corpus crawl (http://hdl.handle.net/11356/1795), which was additionally cleaned and...

SYN v14: large corpus of written Czech

Corpus of contemporary written (printed) Czech sized almost 5.5 GW (i.e. 6.6 billion tokens). It covers mostly the 1990–2024 period and features rich metadata including detailed...

Code and data accompanying the SynSemClass paper @ LREC 2026

Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of...

SYN2025: representative corpus of written Czech

Representative corpus of contemporary written (printed) Czech sized 100 MW. It was created as a representation of printed language from 2020–2024 containing a wide range of text...

4,930 datasets found