Dataset - B2FIND

Opinion Mining Corpus on German Tweets about the Covid-19 Pandemic

The UKP Covid-19 Twitter Corpus includes 2,785 tweets annotated by student annotators and 200 expert-annotated tweets in German. Each tweet was annotated as either a supporting...

Dataset for color terms, 2012

This dataset comprises adjective-noun phrases with color terms.

PeerQA-XT

The rapid growth of scientific publications makes it increasingly difficult for researchers to keep up with new findings. Scientific question answering (QA) systems aim to...

English WordNet 2020: Improving and Extending a WordNet for English using an Op...

The Princeton WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project, English WordNet, has arisen to...

Stock Values and Earnings Call Transcripts: a Sentiment Analysis Dataset

The dataset reports a collection of earnings call transcripts, the related stock prices, and the sector index. In terms of volume, there is a total of 188 transcripts, 11970...

A dataset containing job descriptions suitable for NLP and NN processing.

We describe a dataset that contains job descriptions published on a popular online website in the information and technology sector. As the website focuses mainly on United Kingdom...

Enriching plWordNet with morphology

In the paper, we present the process of adding morphological information to the Polish WordNet (plWordNet). We describe the reasons for this connection and the intuitions behind...

Wordnet – a Basic Resource for Natural Language Processing: the Case of plWor...

This paper presents a wide range of wordnet applications, using plWordNet, a wordnet of Polish, as an example. Wordnets are large lexical-semantic databases...

Combining text and vision in compound semantics: Towards a cognitively plausi...

In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli...

Scrambled text: training Language Models to correct OCR errors using syntheti...

This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition...

NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers

NCSE v2.0 Dataset Repository. This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th...

Extracted and NER-ed Pi Newspaper Articles

JSONL records for each issue of digitised Pi (student periodical from UCL Special Collections) at UCL*. The issues are grouped into folders by publication date. *Disclaimer: The...

AMR parse quality prediction [Source Code]

Accuracy prediction for AMR parsing predicts 33 accuracy metrics for a given sentence and its (automatic) AMR parse. Abstract (Opitz and Frank, 2019): Semantic proto-role...

NLP in Diagnostic Texts from Nephropathology [Research Data]

This data set contains all annotated topic word tables from the work "NLP in Diagnostic Texts from Nephropathology", as well as all pre-processed and tf-idf-vectorized text...

DZ Interset

DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset...

MSTperl parser (2015-05-19)

MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html). MST parser (Maximum Spanning Tree parser)...

OpenLegalData (2022 - Corpus)

OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of...

CorpusExplorer

Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks...

MSTperl parser

MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html). MST parser (Maximum Spanning Tree parser)...

XML NLP Pipeline

The XML NLP Pipeline is a Java command line application that integrates the Stanford CoreNLP pipeline (Manning et al. 2014) in an XML-based processing pipeline. It uses a...

50 datasets found