CLARIN - Repositories

Macedonian web corpus CLASSLA-web.mk 1.0

The Macedonian web corpus CLASSLA-web.mk 1.0 is based on the MaCoCu-mk 2.0 web corpus crawl (http://hdl.handle.net/11356/1801), which was additionally cleaned and enriched with...

Serbian web corpus CLASSLA-web.sr 1.0

The Serbian web corpus CLASSLA-web.sr 1.0 is based on the MaCoCu-sr 1.0 web corpus crawl (http://hdl.handle.net/11356/1807), which was additionally cleaned and enriched with...

Montenegrin web corpus CLASSLA-web.cnr 1.0

The Montenegrin web corpus CLASSLA-web.cnr 1.0 is based on the MaCoCu-cnr 1.0 web corpus crawl (http://hdl.handle.net/11356/1809), which was additionally cleaned and enriched...

Croatian web corpus CLASSLA-web.hr 1.0

The Croatian web corpus CLASSLA-web.hr 1.0 is based on the MaCoCu-hr 2.0 web corpus crawl (http://hdl.handle.net/11356/1806), which was additionally cleaned and enriched with...

Bulgarian web corpus CLASSLA-web.bg 1.0

The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with...

Bosnian web corpus CLASSLA-web.bs 1.0

The Bosnian web corpus CLASSLA-web.bs 1.0 is based on the MaCoCu-bs 1.0 web corpus crawl (http://hdl.handle.net/11356/1808), which was additionally cleaned and enriched with...

Slovenian web corpus CLASSLA-web.sl 1.0

The Slovenian web corpus CLASSLA-web.sl 1.0 is based on the Slovenian MaCoCu-sl 2.0 web corpus crawl (http://hdl.handle.net/11356/1795), which was additionally cleaned and...

SYN v14: large corpus of written Czech

Corpus of contemporary written (printed) Czech containing almost 5.5 billion words (6.6 billion tokens). It covers mostly the 1990-2024 period and features rich metadata including detailed...

Code and data accompanying the SynSemClass paper @ LREC 2026

Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of...

SYN2025: representative corpus of written Czech

Representative corpus of contemporary written (printed) Czech containing 100 million words. It was created as a representation of printed language from 2020–2024, containing a wide range of text...

SALSA - The SAarbrücken Lexical Semantics Annotation and Analysis Project

The SALSA corpus is based on the TIGER corpus. The TIGER corpus (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the...

epic-uds

This dataset has no description

Multilingual training dataset for CAP policy topic classification ParlaCAP-train

The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated...

Thesaurus of Modern Slovene 2.2

Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. The current version 2.2 contains 102,068 keywords and 362,464...

Comprehensive Slovenian-Hungarian Dictionary 3.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...

Collocations Dictionary of Modern Slovene KSSS 2.2

The database of the Collocations Dictionary of Modern Slovene 2.2 contains 4,425,942 collocations in 78,046 entries. Collocations occur in 81 different syntactic relations....

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Termset is an instruction-following dataset containing 975,060 prompt-response units in Slovene from the medical domain. It focuses on medical terms, with...

Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0

The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-MED-Anatomy is an instruction-following dataset containing 711,805 prompt-response units in Slovene (with English and Latin terminology). The units form a...

Comprehensive Slovenian-Hungarian Dictionary 2.0

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University...
