SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases, with the goal of ensuring faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers. To scale the dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the discrepancies in detail and propose discrepancy types and categories to characterize the mismatches that occur. In total, the dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic) spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others.

Identifier
Source https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4994
Metadata Access https://tudatalib.ulb.tu-darmstadt.de/server/oai/openairedata?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:tudatalib.ulb.tu-darmstadt.de:tudatalib/4994
Provenance
Creator Baumgärtner, Tim; Gurevych, Iryna
Publisher Technische Universität Darmstadt
Contributor Technische Universität Darmstadt
Publication Year 2026
Rights Creative Commons Attribution 4.0 International; info:eu-repo/semantics/openAccess; https://creativecommons.org/licenses/by/4.0
OpenAccess true
Contact https://tudatalib.ulb.tu-darmstadt.de/docs/en/kontakt/
Representation
Language English
Resource Type Text
Format application/zip
Size 1.87 MB
Version v1.0
Discipline Other