-
Difficulty Prediction for Language Tests
This collection includes various resources for predicting the difficulty of language proficiency tests. -
German Relatedness Datasets
The datasets on this page were obtained by asking human subjects to assign a similarity or relatedness judgment to a number of German word pairs. The datasets have been used to... -
Quality Flaw Prediction in Wikipedia
Dataset to extract reliable training instances from Wikipedia -
Hierarchy Identification
This page lists the data sets and experiments presented in the paper "Hierarchy Identification for Automatically Generating Table-of-Contents". -
CLEVR-Hans7
A compositionally complex data set for investigating confounders and explainability. -
Domain-specific context-sensitive semantic verb relations
This is a data set of semantic verb relations in English from the domain of everyday educational topics. The data set consists of 12403 pairs of propositions which have been... -
Wikipedia Edit Category Corpus
For the corpus itself, please refer to/cite: Johannes Daxenberger and Iryna Gurevych (2012). "A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia... -
Sense Similarity
Sense and Similarity: A Study of Sense-level Similarity Measures -
EUR-Lex Dataset
The EUR-Lex text collection is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and... -
Turk Bootstrap Word Sense Inventory (TWSI) 2.0
Turk Bootstrap Word Sense Inventory (TWSI) 2.0. This lexical resource, created by a crowdsourcing process using Amazon Mechanical Turk (http://www.mturk.com), encompasses a... -
German-English Modality Verbclasses
This is a semantic classification of more than 600 German lexical verbs and their English translation introduced in the paper: Judith Eckle-Kohler. Verbs Taking Clausal and... -
Text Reuse Annotations
Text Reuse Detection Using a Composition of Text Similarity Measures -
Wikipedia Article Feedback
The corpus lists article IDs of biographies of living and dead people, rated as above average or below average along four categories (trustworthy, objective, well written,... -
OSS-Net trained models
Trained OSS-Net models from the paper "OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data". -
Predictive Whittle Networks for Time Series
Dataset for the paper "Predictive Whittle Networks for Time Series". Use with the code at: https://github.com/ml-research/PWN -
Re-rating Studies
A Reflective View on Text Similarity -
Spelling Difficulty Prediction
Extracted spelling errors from various corpora. -
BWS Argument Similarity Corpus
The BWS Argument Similarity Corpus includes 3,400 sentence pairs across 8 controversial topics, with 425 argument pairs per topic. Each argument pair was annotated via... -
Wikipedia Text Segmentation
For corpus generation, we extracted top-level sections of featured articles and concatenated their textual contents to a pure-text corpus file. The content of a section is... -
UKP Sentential Argument Mining Corpus
The UKP Sentential Argument Mining Corpus includes 25,492 sentences over eight controversial topics. Each sentence was annotated via crowdsourcing as either a supporting...