CLARIN - Repositories

Multilingual static embeddings for Verbal Multiword Expressions trained on PA...

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi,...

AKCES 1

The AKCES 1 corpus contains texts written in Czech by young native speakers; it contains the same data as the SKRIPT 2012 corpus.

Broken plural list

An LMF-conformant XML file containing a comprehensive list of Arabic broken plurals (BPs). The file contains 12,249 singular words with their corresponding BPs.

WMT16 Tuning Shared Task Models (English-to-Czech)

This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre (http://ufal.mff.cuni.cz/czeng/czeng16pre) corpus is used for the training...

Debiasing Algorithm through Model Adaptation

Debiasing Algorithm through Model Adaptation (DAMA) is based on guarding stereotypical gender signals and model editing. DAMA is performed on specific modules prone to convey...

Addressed Arabic Phonetic Rules

This XML file describes the Arabic phonetic constraints to be applied to Arabic roots. The first rule category lists the letters that may not occur in the same root,...

STYX

The STYX system is an electronic exercise book for practising Czech morphology and syntax, consisting of more than 11,000 sentences.

Fairytale child

Fairytale Child is a simple chatbot trying to simulate a curious child. It asks the user to tell a fairy tale, often interrupting to ask for details and clarifications. However,...

ParCzech PS7 1.0

The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic...

Czech Models for Korektor 2

Czech models for Korektor 2, created by Michal Richter, 02 Feb 2013. The models can either perform spell checking and grammar checking, or only generate diacritical marks.

Mapping Czech Verbal Valency to PropBank Argument Labels: LREC2024 - verifica...

Mapping table for the article Hajič et al., 2024: Mapping Czech Verbal Valency to PropBank Argument Labels, in LREC-COLING 2024, as preprocessed by the algorithm described in the...

IDENTICv1.0

IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is bilingual, with Indonesian text paired with English. The aim of this work is to build and provide...

Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...

ViCoS: Visualizing Conceptual Spaces

Extension of LEXUS to define semantic relations between lexical entries within or across lexica

NomVallex 2.0

NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on...

Digital Listing of the Dictionary of Karelian (Karjalan kielen sanakirjan hak...

94 532 words, XML (data), HTML (interface)

Propädeutische Grammatik (ProGr@mm)

Web Information & Learning System; contains reusable e-learning resources in XML format

TITUS Tokharian B (West)

ca. 200,000 tokens; linked with relational database; XML-encoding in progress

Electronic text corpus of Sumerian literature (ETCSL)

A selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia and date to the late third and early second millennia BCE. The corpus...

DOESTE v0.5

DOESTE v0.5 is a set of developmental corpora of texts written by Brazilian and Portuguese school-age children and adolescents. It is a work in progress. The texts written by...

4,938 datasets found