Dataset - B2FIND

Replication Data for: “Emotions in Context: A Semantic Field Viewed through G...

The analysis is based on the grammatical profiles (relative distribution of morphological cases in corpus data) of 65 high-frequency (>1000 attestations in Czech National...

DeriVallex 1.0

DeriVallex 1.0 is a valency lexicon of automatically generated valency frames of Czech noun and adjectival derivatives the valency of which exhibits systemic correspondences...

Human Label Variation in Coreference (Hlava Cor)

Human Label Variation in Coreference (Hlava COR) is a collection of commented multiple annotations (three annotators) of coreferential relations in Czech, i.e. the annotation of...

Datasets and R scripts for modelling Czech translation counterparts of Romanc...

This repository contains the datasets and code used in the study “Predicting translation counterparts in causative constructions.” The datasets consist of annotated examples of...

Multi-Dimensional Analysis of Czech

Original data for a general-purpose multi-dimensional analysis model of register variation in Czech. This post contains a CSV data set of 137 linguistic features measured on...

Czech word and MWE lists

This post contains word and MWE (multi-word expression) lists used for the operationalization of some of the linguistic features in the multi-dimensional analysis (MDA) of Czech...

Replication data for: V-temporal adverbials in Slavic

The database includes 271 Russian examples and their equivalents in Ukrainian, Belarusian, Polish and Czech. The data were culled from the ParaSol parallel corpus (see...

Parent-child conversations about motion events (Russian, Russian-German, Czech)

The dataset contains transcripts of parent-child communication over picture stimuli depicting motion events. The transcripts are partly coded and transcribed for the purpose of...

Data from the project Sociolinguistic analysis of the use of prothetic /v/ in...

Data from the project Sociolinguistic analysis of the use of prothetic /v/ in Czech. Altogether, 28 893 tokens of words which may contain prothetic v- taken from sociolinguistic...

Metonymy in Word-Formation: Russian, Czech, and Norwegian

Publication abstract: A foundational goal of cognitive linguistics is to explain linguistic phenomena in terms of general cognitive strategies rather than postulating an...

Czech Models (MorfFlex CZ 2.0 + PDT-C 1.0) for MorphoDiTa 220710

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ 2.0,...

Stereotypes and Discourse Connectors in Czech

The purpose of the dataset is to test three variables: (i) the effect of argument order in Ale-constructions (But-constructions) “A, ale B” (“A, but B”): positive A, but...

Possessive Pronoun Preference

The contribution includes the data frames and the R script (Markdown file) belonging to the paper "Morphological and Pragmatic Conditioning of Reflexivity in Possessive...

Czech HS Contracts Dataset (CHSC) 1.0

The Czech Contracts dataset was created as part of the thesis Low-resource Text Classification (2021), A. Szabó, MFF UK. Contracts are obtained from the Hlídač Státu web portal....

Czech Models for Korektor 2

Czech models for Korektor 2, created by Michal Richter, 02 Feb 2013. The models can either perform spell checking and grammar checking, or only generate diacritical marks.

NomVallex 2.0

NomVallex 2.0 is a manually annotated valency lexicon of Czech nouns and adjectives, created in the theoretical framework of the Functional Generative Description and based on...

Czech Lexico-Semantic Database 0.1

A lexicographical project, whose aim is to digitize and align two Czech onomasiological dictionaries (Haller 1969–77; Klégr 2007) in order to create an integrated digital...

Czech Models (MorfFlex CZ + PDT) for MorphoDiTa

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ and...

RobeCzech Base

RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that...

FERNET-C5

FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data, a Czech counterpart of the English C4...

59 datasets found