VQA-MHUG


We present VQA-MHUG, a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA), collected using a high-speed eye tracker. To the best of our knowledge, this is the first resource to provide real human gaze data over both a textual question and the corresponding image; as such, it allows researchers to jointly study human and machine attention. Our corpus encompasses task-specific gaze on a subset of the benchmark dataset VQAv2 val2.

We use our dataset to analyse the similarity between human attention and the neural attentive strategies learned by five state-of-the-art VQA models: Modulated Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorised Bilinear Pooling Network (MFB). While prior work has focused on the image modality, our analyses show, for the first time, that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points to a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, including but potentially extending beyond VQA.
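To make the idea of "correlation with human attention on text" concrete, the following is a minimal sketch that compares per-token human gaze weights with per-token model attention weights using Spearman's rank correlation. The choice of Spearman's coefficient, the function names, and the toy weights are assumptions for illustration only; this description does not specify the measure or the data layout, so consult the README.md and the cited paper for the actual procedure.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(human, model):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(human) != len(model) or len(human) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rh, rm = average_ranks(human), average_ranks(model)
    mh, mm = mean(rh), mean(rm)
    cov = sum((a - mh) * (b - mm) for a, b in zip(rh, rm))
    var_h = sum((a - mh) ** 2 for a in rh)
    var_m = sum((b - mm) ** 2 for b in rm)
    if var_h == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_h * var_m) ** 0.5


# Hypothetical weights over a four-token question (not taken from the dataset).
human_gaze = [0.1, 0.5, 0.3, 0.1]
model_attention = [0.2, 0.6, 0.1, 0.1]
rho = spearman(human_gaze, model_attention)
```

A value of rho near 1 would indicate that the model ranks question tokens much as human readers do; values near 0 indicate no monotonic relationship.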

More information can be found in the README.md.

Identifier
DOI https://doi.org/10.18419/darus-4428
Related Identifier IsCitedBy https://doi.org/10.18653/v1/2021.conll-1.3
Metadata Access https://darus.uni-stuttgart.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18419/darus-4428
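The Metadata Access link above follows the OAI-PMH GetRecord request pattern. As a minimal sketch, the same URL can be assembled from its parts in Python; the helper name `get_record_url` and the constant `OAI_ENDPOINT` are hypothetical, and the sketch only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

# Base endpoint taken from the Metadata Access link in this record.
OAI_ENDPOINT = "https://darus.uni-stuttgart.de/oai"


def get_record_url(doi, metadata_prefix="oai_datacite"):
    """Build an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the link as published.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = get_record_url("10.18419/darus-4428")
```

The resulting URL could then be passed to `urllib.request.urlopen` to retrieve the XML metadata record, assuming the endpoint is reachable.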
Provenance
Creator Sood, Ekta; Kögel, Fabian; Bulling, Andreas
Publisher DaRUS
Contributor Sood, Ekta; Kögel, Fabian
Publication Year 2024
Rights info:eu-repo/semantics/restrictedAccess
OpenAccess false
Contact Sood, Ekta (Universität Stuttgart); Kögel, Fabian (Universität Stuttgart)
Representation
Resource Type Dataset
Format application/octet-stream; text/x-python; text/markdown
Size 62275; 131554; 4906516; 60029; 90970; 3215364; 79843; 7953; 3681; 100119; 268038; 1539411; 56643679; 96471; 437080; 12775812
Version 1.0
Discipline Other