Dataset - B2FIND

Author profiling resources

The zip file contains all the resources generated during the development of the thesis. On the one hand, there is the code, with which the set of features can be extracted just as...

MARD: Multimodal Album Reviews Dataset

MARD contains texts and accompanying metadata originally obtained from a much larger dataset of Amazon customer reviews, which have been enriched with music metadata from...

How2Sign: a large-scale multimodal dataset for continuous American Sign Language

How2Sign consists of a parallel corpus of 80 hours of sign language videos (collected with multi-view RGB and depth sensor data) with corresponding speech transcriptions and...

Replication Data for: Sign language translation for instructional videos

This repo contains the I3D data used for the paper "Sign Language Translation from Instructional Videos". Together with the data, and weights of models, we also provide the .tsv...

Data on Google News coverage in Brazil, Colombia, Mexico, Portugal and Spain

This dataset contains the set of records extracted from the main pages of some versions of Google News (Brazil, Colombia, Mexico, Portugal, Spain). The data were extracted using...

Expert-Annotated Reddit Posts on Six Classes of Psychological Abuse

This dataset accompanies the papers "Decoding Psychological Abuse: A Comparative Study of Natural Language Processing (NLP) Classifiers Using Reddit Data" and "The Use of...

Exploring Gender differences with Natural Language Processing: Language Chara...

This dataset contains the R code for the analysis and results of the study "Exploring Gender differences with Natural Language Processing: Language characteristics of male and...

RESPONSE: Dataset for Commonsense Reasoning about Disaster Management

This dataset contains 1789 data instances of problem-identification, missing-resource, and time-dependent question-answer pairs for disaster management.

PitVQA: A Dataset of Visual Question Answering in Pituitary Surgery

PitVQA dataset comprises 25 videos of endoscopic pituitary surgeries from the National Hospital of Neurology and Neurosurgery in London, United Kingdom, similar to the dataset...

Scrambled text: training Language Models to correct OCR errors using syntheti...

This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition...

NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers

NCSE v2.0 Dataset Repository. This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th...

Extracted and NER-ed Pi Newspaper Articles

JSONL records for each issue of digitised Pi (student periodical from UCL Special Collections) at UCL*. The issues are grouped into folders by publication date. *Disclaimer: The...

GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsuper...

Paper: "GenWiki: A Dataset of 1.3 Million Content-Sharing Text and Graphs for Unsupervised Graph-to-Text Generation" (COLING 2020) by Zhijing Jin, Qipeng Guo, Xipeng Qiu, and...

CLadder: Assessing Causal Reasoning in Language Models

Paper: "CLadder: Assessing Causal Reasoning in Language Models" (NeurIPS 2023) by Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin,...

Supplementary data for Corr2Cause: "Can Large Language Models Infer Causation...

Paper: "Can Large Language Models Infer Causation from Correlation?" (2023) by Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab,...

Supplementary Model Files for "Tasty Burgers, Soggy Fries: Probing Aspect Rob...

Paper: "Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis" (EMNLP 2020) by Xiaoyu Xing, Zhijing Jin, Di Jin, Bingning Wang, Qi Zhang, and...