CLARIN - Repositories

Multilingual dataset of COVID tweets for relation-level metaphor analysis TCM...

TCMeta is a dataset of noun phrase constructions from COVID-related tweets, annotated for relation-level metaphor. It contains 2,138 Slovene and 2,221 English instances in...

Slovene learner corpus KOST 1.0

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 6,311 texts (just over 1 million words) written by adult speakers for whom...

LOCOLE (Longitudinal Corpus of Learner English)

Information about LOCOLE: This corpus comprises essays written by university students of English Philology over the course of one academic year. The essays were collected four...

Corpus of Transcriptions - part 1

The first part of the Corpus of Transcriptions contains phonemic transcriptions of a short passage from Lecumberri and Maidment (2000, p. 78) performed by the undergraduate...

Eesti-inglise paralleelkorpus Estonian-English parallel corpus

Estonian-English parallel corpus. More info at http://www.cl.ut.ee/korpused/paralleel/index.php?lang=en Annotated and sentence-aligned parallel text corpus; contains: 1. Estonian...

EKI veamärgendatud E2 õppijakorpus (versioon 2) EKI error-annotated Estonian...

The materials of the error-annotated corpus are based on the EMMA learner language corpus and contain data from the level tests of the Education and Youth Board (Haridus- ja Noorteamet; 7th grade, 504 texts), basic school final examinations...

Eesti keele ühendkorpus 2023 (annoteerimata) Estonian National Corpus 2023 (...

Estonian corpus of written texts. Consists of the Estonian Reference Corpus (90s–2008), Contemporary and old literature, Estonian Web (2013, 2017, 2019, 2021, 2023), Timestamped...

English-Lithuanian Parallel Migration Corpus

English-Lithuanian Parallel Migration Corpus includes original English texts and their Lithuanian translations, aligned at the sentence level. The texts are drawn from EU legal...

LegISTyr test set

LegISTyr is a machine translation test set for evaluating the quality of legal terminology translation from Italian to South Tyrolean German, a minor standard variety of German....

SFU Opinion and Comments Corpus (SOCC) for NoSketch Engine

The SFU Opinion and Comments Corpus (SOCC) is a corpus for the analysis of online news comments. It contains opinionated articles and comments. It was tagged using TreeTagger...

Czech Proofreading Rules

The collection describes proofreading errors in Czech covered by Opravidlo 1.0. It consists of:
- the grammar rules applicable via the SET Czech syntactic parser
- description...

Content-based annotation of page images from the (archaeological) historical ...

This dataset employs a comprehensive 11-label classification scheme to categorize scanned images of document pages. The types are based on their content and presentation format....

StarwarsNER French Italian Corpus - sample

The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. It...

KIParla - KIPasti transcripts

The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled...

KIParla - ParlaTO transcripts

The ParlaTO corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaTO corpus was...

KIParla - ParlaBO transcripts

The ParlaBO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled within the...

Corpus of Slovenian historical legal texts SI-IUS 1.0

The SI-IUS collection of older law texts is meant to be used both as a digital library and as a language corpus. For the former, each text has been carefully annotated in TEI...

Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0

SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries. It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et...

Spoken corpora of parliamentary debates ParlaSpeech 3.0

The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus...

4,938 datasets found