CLARIN - Repositories

Deltacorpus

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

Oromo web corpus

Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.

Deep Universal Dependencies 2.4

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional...

LiFR-Lite (2021-11-05)

Corpus of Czech educational texts for readability studies, with paraphrases, measured reading comprehension, and a multi-annotator subjective rating of selected text features...

Lexico-Semantic Annotation of PDT using Czech WordNet

This dataset contains annotation of PDT using the Czech WordNet ontology (http://hdl.handle.net/11858/00-097C-0000-0001-4880-3). Data is stored in PML format. This is a stand-off...

AKCES 1

Corpus AKCES 1 includes texts written in Czech by youth (native speakers); it is the same data as the corpus SKRIPT 2012.

Broken plural list

An LMF-conformant XML-based file containing a comprehensive Arabic broken plural list. The file contains 12,249 singular words with their corresponding broken plurals (BPs).

Debiasing Algorithm through Model Adaptation

Debiasing Algorithm through Model Adaptation (DAMA) is based on guarding stereotypical gender signals and model editing. DAMA is performed on specific modules prone to convey...

Addressed Arabic Phonetic Rules

This XML file describes the Arabic phonetic constraints to be applied to an Arabic root. The first rule category lists the letters that may not occur in the same root,...

Czech Models for Korektor 2

The Czech models for Korektor 2 created by Michal Richter, 02 Feb 2013. The models can either perform spellchecking and grammarchecking, or only generate diacritical marks.

Mapping Czech Verbal Valency to PropBank Argument Labels: LREC2024 - verifica...

Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocessed by the algorithm described in the...

IDENTICv1.0

IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide...

Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...

NomVallex 2.0

NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on...

Morphological Analyzer for Shipibo-Konibo

This tool is the first morphological analyzer ever for this language. The analyzer is an FST that produces all possible segmentations and tagging sequences in a word-by-word...

jusText

jusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool has been implemented in Python, licensed under New BSD...

PDT-Vallex: Czech Valency lexicon linked to treebanks 4.0 (PDT-Vallex 4.0)

The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague...

Saxophone Trills Dataset

This is the audio data of saxophone trills, used for difficulty estimation in the paper "Modeling the difficulty of saxophone music" by Šimon Libřický and Jan Hajič jr., ISMIR...

SynSemClass3.0

The SynSemClass synonym verb lexicon is a result of a project investigating semantic ‘equivalence’ of verb senses and their valency behavior in parallel Czech-English language...

LongEval Click-Model Relevance Judgements (Qrels)

The collection comprises the relevance judgments used in the 2023 LongEval Information Retrieval Lab (https://clef-longeval.github.io/), organized at CLEF. It consists of three...

1,492 datasets found