Silver Corpus for Plant-based Food Fermentation Information Extraction


This is a silver corpus for the plant-based food fermentation information extraction task. It comprises 2,500 abstracts retrieved from PubMed, automatically annotated by a Large Language Model (LLM) and stored in a structured JSON format containing 23,563 entities in total. The entities fall into four classes chosen to capture the complexity of the fermentation process: Molecule (8,809 entities), Microbe (7,677 entities), Plant/Food (5,151 entities), and Habitat (1,926 entities). By linking microbial strains, raw plant substrates, and the resulting aromatic or chemical compounds, the dataset provides a resource for knowledge graph construction and relation extraction in sustainable food science.
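
As a rough illustration of how the per-class figures above could be checked against the JSON annotations, the following Python sketch counts entities by class. The record layout is an assumption: the description does not specify the schema, so the `entities` and `type` keys, the `count_entities` helper, and the two sample records are hypothetical and not taken from the corpus. Only the published counts come from the description.

```python
from collections import Counter

# Per-class totals as stated in the dataset description.
PUBLISHED_COUNTS = {
    "Molecule": 8809,
    "Microbe": 7677,
    "Plant/Food": 5151,
    "Habitat": 1926,
}
PUBLISHED_TOTAL = 23563


def count_entities(documents):
    """Count entities per class across annotated documents.

    Assumes (hypothetically) that each document is a dict with an
    "entities" list whose items carry a "type" key; the real schema
    of the corpus may differ.
    """
    counts = Counter()
    for doc in documents:
        for entity in doc.get("entities", []):
            counts[entity["type"]] += 1
    return counts


# Tiny made-up sample for illustration only; not real corpus data.
sample = [
    {"entities": [
        {"text": "Lactobacillus plantarum", "type": "Microbe"},
        {"text": "soy milk", "type": "Plant/Food"},
        {"text": "lactic acid", "type": "Molecule"},
    ]},
    {"entities": [
        {"text": "sourdough", "type": "Habitat"},
        {"text": "ethanol", "type": "Molecule"},
    ]},
]

sample_counts = count_entities(sample)

# The four published class counts add up to the published total.
assert sum(PUBLISHED_COUNTS.values()) == PUBLISHED_TOTAL
```

Run over the full corpus, and assuming the real field names are substituted, the resulting counter could be compared directly with `PUBLISHED_COUNTS`.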

Identifier
DOI https://doi.org/10.57745/MI7Q9W
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/MI7Q9W
Provenance
Creator ZHU, XINGYU; Nédellec, Claire; Bossy, Robert
Publisher Recherche Data Gouv
Contributor ZHU, XINGYU; Entrepôt Recherche Data Gouv
Publication Year 2026
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact ZHU, XINGYU (INRAE)
Representation
Resource Type Dataset
Format application/zip
Size 1574464
Version 1.0
Discipline Computer Science
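
The Metadata Access link listed under Identifier is an OAI-PMH GetRecord request. A minimal Python sketch of how that URL is assembled from the DOI is given here; the endpoint, verb, metadata prefix, and DOI are taken from the record above, while the `metadata_url` helper name is hypothetical. The sketch only builds the URL and does not perform the request.

```python
from urllib.parse import urlencode

# Endpoint and DOI as listed in the record above.
OAI_ENDPOINT = "https://entrepot.recherche.data.gouv.fr/oai"
DOI = "10.57745/MI7Q9W"


def metadata_url(doi, prefix="oai_datacite"):
    """Build the OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ':' and '/' unescaped so the result matches the listed URL.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = metadata_url(DOI)
```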