SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

Dataset

We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases, with the goal of ensuring that implementations are faithful to the papers they accompany. We construct SciCoQA from GitHub issues and reproducibility papers; to scale the dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the discrepancies in detail and propose discrepancy types and categories to characterize the mismatches that occur. In total, the dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic) spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others.

Identifier
Source	https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4994
Metadata Access	https://tudatalib.ulb.tu-darmstadt.de/server/oai/openairedata?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:tudatalib.ulb.tu-darmstadt.de:tudatalib/4994
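
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch of how such a URL is composed, the following Python snippet rebuilds it from its parts. The endpoint, metadata prefix, and identifier are copied from the link above; the helper name `build_getrecord_url` is hypothetical and used only for illustration. The snippet does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the Metadata Access URL in this record.
OAI_ENDPOINT = "https://tudatalib.ulb.tu-darmstadt.de/server/oai/openairedata"


def build_getrecord_url(identifier: str, metadata_prefix: str = "oai_datacite") -> str:
    """Compose an OAI-PMH GetRecord request URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # safe=":/" leaves the colons and slash of the OAI identifier unescaped,
    # which matches the form of the URL shown in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = build_getrecord_url("oai:tudatalib.ulb.tu-darmstadt.de:tudatalib/4994")

# Splitting the URL again recovers the three request parameters.
query = parse_qs(urlsplit(url).query)
print(url)
print(query["verb"], query["metadataPrefix"], query["identifier"])
```

Fetching the resulting URL should return the record's metadata as XML in the requested `oai_datacite` format; the exact response layout is not described in this record and is left unspecified here.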

Provenance
Creator	Baumgärtner, Tim; Gurevych, Iryna
Publisher	Technische Universität Darmstadt
Contributor	Technische Universität Darmstadt
Publication Year	2026
Rights	Creative Commons Attribution 4.0 International; info:eu-repo/semantics/openAccess; https://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	https://tudatalib.ulb.tu-darmstadt.de/docs/en/kontakt/

Representation
Language	English
Resource Type	Text
Format	application/zip
Size	1.87 MB
Version	v1.0
Discipline	Other