PeerQA-XT

The rapid growth of scientific publications makes it increasingly difficult for researchers to keep up with new findings. Scientific question answering (QA) systems aim to automatically answer questions based on scientific articles. Advancing these systems requires high-quality, large-scale datasets. Existing datasets are either small, because manual annotation is costly, or lack realistic depth when generated synthetically. To address this gap, this thesis introduces a novel framework for automatically generating scientific QA pairs from research literature using large language models (LLMs). The framework extracts QA pairs from peer reviews and rebuttals with state-of-the-art open-source LLMs, applying automated filtering and validation to ensure coherence and relevance. The resulting dataset comprises 12,628 free-form, open-ended QA pairs across ten scientific domains. We conduct extensive experiments to evaluate the dataset, examining both the effect of fine-tuning on our resource and the resulting performance across several benchmarks. Results show that fine-tuning substantially improves a model’s ability to understand and apply scientific knowledge. These findings highlight the value of our framework and demonstrate the potential of peer review–based resources for advancing scientific QA, particularly for generative tasks and long-context reasoning.

Identifier
Source https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/5041
Related Identifier Cites https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4467
Related Identifier Cites https://aclanthology.org/2025.naacl-long.22/
Related Identifier IsSupplementTo https://doi.org/10.26083/tuda-7777
Metadata Access https://tudatalib.ulb.tu-darmstadt.de/server/oai/openairedata?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:tudatalib.ulb.tu-darmstadt.de:tudatalib/5041
Provenance
Creator Ngen, Joy Jiaxi
Publisher Technische Universität Darmstadt
Contributor Technische Universität Darmstadt
Publication Year 2026
Rights CC-BY-NC-SA 4.0; info:eu-repo/semantics/openAccess; https://creativecommons.org/licenses/by-nc-sa/4.0/
OpenAccess true
Contact https://tudatalib.ulb.tu-darmstadt.de/docs/en/kontakt/
Representation
Language English
Resource Type Text
Format application/zip
Size 197.1 MB
Version v1.0
Discipline Other