Dataset - B2FIND

Corpus of contemporary blogs

At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To create enough training data for machine learning, annotators...

Quality and Efficiency of Manual Annotation: Data from the Pre-annotation Bia...

Input data, individual experimental annotations, and a complete and detailed overview of the measured results related to the experiment described in the referenced paper.

PDT-Vallex: Czech Valency lexicon linked to treebanks 4.5 (PDT-Vallex 4.5)

The valency lexicon PDT-Vallex 4.5 is a part of the PDT-C 2.0 release https://hdl.handle.net/11234/1-5813. It is a slightly modified version of PDT-Vallex 4.0 from 2020 (as a...

KUKY1.0

KUKY is a curated selection of 224 Czech administrative and legal documents for readability research, stored in two JSON files. The documents come partly from public databases...

Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...

PDT-Vallex: Czech Valency lexicon linked to treebanks

The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague...

Quickstart: Annotation in the EXMARaLDA Partitur Editor

A quickstart introduction to annotation in the EXMARaLDA Partitur Editor

Die Erstellung von Fachgebärdenlexika am Institut für Deutsche Gebärdensprach...

Detailed description of how six corpus-based LSP dictionaries for German – German Sign Language (DGS) were produced, including elicitation methods, annotation and...

Transkriptionskonventionen im Vergleich

Synopsis of transcription conventions used in six international sign language research projects including annotation tool and tiers in transcripts, divided into conventional...

PiRATE: a Pipeline to Retrieve and Annotate Transposable Elements

To date, genome assembly of non-model organisms is usually not at the chromosomal level and is highly fragmented. This fragmentation is recognized to be, in part, the result of a bad...

Analyzing Dataset Annotation Quality Management in the Wild

This is the accompanying data for the paper "Analyzing Dataset Annotation Quality Management in the Wild". Data quality is crucial for training accurate, unbiased, and...

Lessons Learned from a Citizen Science Project for Natural Language Processing

This is the accompanying data for our paper "Lessons Learned from a Citizen Science Project for Natural Language Processing". Many Natural Language Processing (NLP) systems use...

Opinion role extractor

System for the Extraction of Subjective Expressions, Sentiment Sources and Sentiment Targets from German Text

Converter for content-to-head style syntactic dependencies

A set of Python scripts that convert function-head style encodings in dependency treebanks into a content-head style encoding (as used in the UD treebanks) and vice versa (for...

The MSC Data Set

From this page you can download resources we created for modal sense classification as reported in Zhou et al. (2015), Marasović et al. (2016) and Marasović and Frank (2015)...

DeModify

deModify consists of 3631 instances, each with three annotations obtained through CrowdFlower. An instance is a short story in which a modifier is annotated with respect to its...

tweeDe

A German UD Twitter treebank, with >12,000 tokens from 519 tweets, annotated in the Universal Dependencies framework

German causal language annotations and lexicon (verbs, nouns, prepositions) (DE)

Annotations of causal verbs, nouns and prepositions in context, and a lexicon file for causal verbs, nouns and prepositions.

Establishment of the Infrastructure to Automatically Analyse other Datasets

This deliverable D9.6 documents the installation of a tool chain for processing sign language data external to the project, mostly meant to be run on a high performance...

Tools for Harmonizing Available Annotations to a Common Format

This deliverable D6.6 provides tools for harmonizing available annotations to a common interchange format. It defines the interchange format and provides example implementations...

44 datasets found