Supplementary dataset and reproducible codes for LLM-assisted mapping feedstocks of eight conversion technologies from over 121,000 studies

Dataset

This dataset was developed to systematically characterise feedstock–technology relationships across eight major biomass conversion technologies by mining a large Scopus-derived bibliographic corpus (1887–2025; partial coverage for 2025). The workflow is LLM-assisted and fully reproducible, combining automated extraction of feedstock and technology phrases from bibliographic text fields (titles, abstracts, and keywords) with rule-based cleaning and a subsequent LLM-based validation step, followed by targeted manual curation for final release. The dataset is intended for use in technology landscape analyses, evidence synthesis, and comparative assessments of biomass conversion pathways, where consistent and traceable feedstock descriptors are required across a very large volume of studies.
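The rule-based cleaning step mentioned above can be illustrated with a minimal sketch. This is not the released pipeline: the function names, the stop-list of generic terms, and the normalisation rules are all assumptions chosen for illustration, and the actual rules used for the dataset may differ.

```python
import re

# Hypothetical stop-list: generic terms assumed too vague to count as a feedstock.
GENERIC_TERMS = {"biomass", "waste", "feedstock", "material"}


def clean_phrase(phrase: str) -> str:
    """Normalise one candidate phrase: lower-case, drop punctuation, collapse spaces."""
    text = phrase.lower()
    # Keep ASCII letters, digits, spaces and hyphens; replace everything else with a space.
    text = re.sub(r"[^a-z0-9\s\-]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def clean_candidates(phrases):
    """Return cleaned phrases in first-seen order, without duplicates or generic terms."""
    seen = set()
    kept = []
    for raw in phrases:
        cleaned = clean_phrase(raw)
        if not cleaned or cleaned in GENERIC_TERMS or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


candidates = ["Rice Husk", "rice husk.", "Biomass", "  sewage   sludge "]
cleaned = clean_candidates(candidates)
```

In a workflow of this kind, phrases that survive such deterministic rules would then be passed to the LLM-based validation step, so that the model is only asked to judge candidates the rules could not resolve.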

A data descriptor titled "A large-scale, LLM-assisted and validated dataset of biomass and waste conversion technologies and feedstocks" will be published based on this dataset, with the following abstract:

Biomass, organic wastes and biogenic by-products are increasingly targeted for low-carbon fuels and value-added chemicals. However, strategic decision-making from a circular economy perspective requires a big-picture view of the relative significance of different conversion technologies in handling diverse feedstock portfolios, and no large-scale, cross-technology mapping of these portfolios is currently available. Thus, a literature-derived dataset was assembled that links eight major waste-to-x valorisation technologies (gasification, pyrolysis, hydrothermal liquefaction, torrefaction, anaerobic digestion, aerobic digestion, fermentation and transesterification) to their reported feedstocks. Using the Scopus database, 121,365 records were retrieved with harmonised search strings, spanning publications from 1887 to 2025. This constrained yet scalable search strategy both facilitates automated extraction and validation and yields a rich dataset. Further, a large language model-assisted workflow was implemented to extract candidate technology and feedstock phrases, followed by a two-level validation that combines rule-based cleaning with targeted LLM re-evaluation to minimise manual curation. The resulting dataset provides technology-specific, validated feedstock descriptors that support comparative analyses and decision-support applications in a circular bioeconomy context.

Python, 3.11

Microsoft Excel, 365

Identifier
DOI	https://doi.org/10.18710/JM6U7B
Metadata Access	https://dataverse.no/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18710/JM6U7B
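
The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from its parts in Python. The function name and constants below are illustrative assumptions; only the endpoint, metadata prefix, and DOI are taken from this record.

```python
from urllib.parse import urlencode

# Values copied from the Identifier fields of this record.
OAI_ENDPOINT = "https://dataverse.no/oai"
DATASET_DOI = "10.18710/JM6U7B"


def build_getrecord_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Assemble an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # safe=":/" leaves the colon and slash in the DOI unescaped, matching the link above.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = build_getrecord_url(DATASET_DOI)
```

The resulting URL could then be fetched with `urllib.request.urlopen(url)`, which should return the record's metadata as XML; no request is made in the sketch itself.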

Provenance
Creator	Barahmand, Zahir (ORCID: 0000-0001-9031-596X)
Publisher	DataverseNO
Contributor	Barahmand, Zahir; University of South-Eastern Norway
Publication Year	2026
Rights	CC0 1.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/publicdomain/zero/1.0
OpenAccess	true
Contact	Barahmand, Zahir (University of South-Eastern Norway)

Representation
Resource Type	Data from Literature; Dataset
Format	text/plain; application/zip
Size	21698; 13051037; 15281895; 32419720; 145381741
Version	2.0
Discipline	Chemistry; Construction Engineering and Architecture; Earth and Environmental Science; Engineering; Engineering Sciences; Environmental Research; Geosciences; Natural Sciences
Spatial Coverage	University of South-Eastern Norway