Silver Corpus for Plant-based Food Fermentation Information Extraction

This is a silver corpus for the plant-based food fermentation information extraction task. It comprises 2,500 abstracts retrieved from PubMed. The abstracts were annotated automatically with a Large Language Model (LLM), and the annotations are stored in a structured JSON format containing 23,563 entities in total. The entities fall into four classes that cover the main elements of the fermentation process: Molecule (8,809 entities), Microbe (7,677 entities), Plant/Food (5,151 entities), and Habitat (1,926 entities). By connecting microbial strains, raw plant substrates, and the resulting aromatic or chemical compounds, the dataset provides a resource for knowledge graph construction and relation extraction in the domain of sustainable food science.
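
As a quick consistency check on the figures above, the short Python sketch below sums the four per-class counts, compares the result with the stated total, and derives each class's share and the mean number of entities per abstract. Only the numbers come from the description; the names ENTITY_COUNTS, STATED_TOTAL, N_ABSTRACTS, and summarize are illustrative, and nothing here reads the actual JSON files, whose schema is not described in this record.

```python
# Per-class entity counts as stated in the dataset description.
ENTITY_COUNTS = {
    "Molecule": 8809,
    "Microbe": 7677,
    "Plant/Food": 5151,
    "Habitat": 1926,
}
STATED_TOTAL = 23563   # total number of entities given in the description
N_ABSTRACTS = 2500     # number of PubMed abstracts given in the description


def summarize(counts, n_abstracts):
    """Return (total, percentage share per class, mean entities per abstract)."""
    total = sum(counts.values())
    shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
    return total, shares, round(total / n_abstracts, 2)


total, shares, per_abstract = summarize(ENTITY_COUNTS, N_ABSTRACTS)

if __name__ == "__main__":
    print("sum of classes:", total, "| matches stated total:", total == STATED_TOTAL)
    for label, pct in shares.items():
        print(f"{label}: {pct}%")
    print("mean entities per abstract:", per_abstract)
```

The class counts add up to the stated total, so the description is internally consistent; on average each abstract carries a little over nine entities.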

Identifier
DOI	https://doi.org/10.57745/MI7Q9W
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/MI7Q9W
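
The Metadata Access address is an OAI-PMH GetRecord request. The Python sketch below shows how that address is assembled from the endpoint, the metadata prefix, and the DOI, using only the standard library. The function name getrecord_url and the constant names are illustrative; the endpoint, verb, prefix, and identifier are copied from the line above. The sketch only builds the URL and does not contact the repository.

```python
from urllib.parse import urlencode

# Values copied from the Identifier block of this record.
OAI_ENDPOINT = "https://entrepot.recherche.data.gouv.fr/oai"
DOI = "10.57745/MI7Q9W"


def getrecord_url(doi, metadata_prefix="oai_datacite"):
    """Build an OAI-PMH GetRecord URL for a DOI (illustrative helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the form shown in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = getrecord_url(DOI)

if __name__ == "__main__":
    print(url)
```

The resulting string can be passed to any HTTP client to retrieve the DataCite-formatted metadata as XML.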

Provenance
Creator	ZHU, XINGYU; Nédellec, Claire; Bossy, Robert
Publisher	Recherche Data Gouv
Contributor	ZHU, XINGYU; Entrepôt Recherche Data Gouv
Publication Year	2026
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	ZHU, XINGYU (INRAE)

Representation
Resource Type	Dataset
Format	application/zip
Size	1574464 bytes
Version	1.0
Discipline	Computer Science