Large-scale shotgun metagenomic cohorts of gut microbiota samples with standardized minimal metadata from adults in industrialized countries

Dataset

This dataset combines 8 public cohorts from 9 countries, with a total of 5,494 human stool samples profiled by shotgun metagenomic sequencing, and includes both healthy individuals and patients with various diseases (e.g., cardiovascular disease, colorectal cancer, inflammatory bowel disease, obesity, metabolic syndrome, Parkinson's disease).

Metacohort overview

Project ID | Associated publication | Number of samples | Countries | Clinical status
------------|------------------------|-------------------|-------------|-------------------
PRJEB39223 | Asnicar et al. | 1098 | USA, GBR | Healthy
PRJEB37249 | MetaCardis | 888 | FRA, DEU, DNK | Healthy & Diseased
PRJEB11532 | Zeevi et al. | 969 | ISR | Healthy
PRJNA834801 | Wallen et al. | 653 | USA | Healthy & Diseased
PRJDB4176 | Yachida et al. | 645 | JPN | Healthy & Diseased
PRJNA319574 | Schirmer et al. | 471 | NLD | Healthy
PRJNA530339 | Wang et al. | 385 | CHN | Healthy
PRJEB21528 | Jie et al. | 385 | CHN | Healthy & Diseased
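As a quick consistency check, the per-cohort figures in the overview table can be tallied against the totals stated in the description (8 cohorts, 9 countries, 5,494 samples). The snippet below is only a sketch: the values are transcribed by hand from the table, and the variable names are illustrative rather than part of the dataset.

```python
# Per-cohort figures transcribed from the metacohort overview table.
# (project ID, number of samples, ISO 3166-1 alpha-3 country codes)
cohorts = [
    ("PRJEB39223", 1098, ["USA", "GBR"]),
    ("PRJEB37249", 888, ["FRA", "DEU", "DNK"]),
    ("PRJEB11532", 969, ["ISR"]),
    ("PRJNA834801", 653, ["USA"]),
    ("PRJDB4176", 645, ["JPN"]),
    ("PRJNA319574", 471, ["NLD"]),
    ("PRJNA530339", 385, ["CHN"]),
    ("PRJEB21528", 385, ["CHN"]),
]

# Sum the sample counts and collect the distinct countries.
total_samples = sum(n for _, n, _ in cohorts)
countries = {code for _, _, codes in cohorts for code in codes}

print(f"{len(cohorts)} cohorts, {len(countries)} countries, {total_samples} samples")
```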

Data processing

Data were processed separately for each study according to the following procedure.

Data download

Whole-metagenome sequencing data were downloaded from the European Nucleotide Archive (ENA).

Quality control

All DNA sequencing reads were quality-trimmed and cleaned of sequencing adapters using fastp (v0.23.2). Remaining host contamination was removed by aligning the reads against the human reference genome T2T-CHM13v2.0 with Bowtie2 (v2.5.1); reads aligning with at least 90% nucleotide identity were filtered out using samtools (v1.9).

Microbial species annotation

For all samples, genes and microbial species were identified and quantified with METEOR 2 using the human gut microbial gene catalogue (IGC2, comprising 10.4 million genes), with the GTDB R226 release for taxonomic annotation.

Metadata download & curation

Raw metadata were collected from ENA, NCBI, and the supplementary materials of the original publications. Variables were standardized across cohorts, and a common set of minimal metadata was retained: age, gender, body mass index (BMI), health status, country, and study name.

Metadata additional flagging

We added two annotation columns ("flag" and "flag_reason") to the minimal metadata to highlight cases that may warrant specific attention in subsequent analyses. Samples were flagged for the following reasons.

Cross-sample contamination identification: Cross-sample contamination was checked with CroCoDeEL (v1.0.8), following the current guidelines. Samples were flagged if species added by contamination exceeded 12%, or if the contamination rate exceeded 1% with additional species exceeding 10%.

Sequencing depth: Samples with a sequencing depth lower than 20,000,000 reads were flagged.

Longitudinal series: Samples originating from longitudinal studies at time points beyond baseline (i.e. t > 1) were flagged.

Missing information on the clinical status: Samples with missing health status were flagged.
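The flagging criteria described above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the `Sample` fields and the reason labels are hypothetical, the actual values stored in "flag" and "flag_reason" may differ, and the code assumes that "additional species" and "species added by contamination" refer to the same quantity, expressed here as a fraction.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_READS = 20_000_000  # sequencing depth threshold stated in the description


@dataclass
class Sample:
    # All field names are hypothetical; they are not the dataset's column names.
    sample_id: str
    read_count: int
    timepoint: int = 1                     # 1 = baseline
    health_status: Optional[str] = None    # None or "" = missing
    contamination_rate: float = 0.0        # fraction, e.g. 0.01 = 1%
    added_species_fraction: float = 0.0    # fraction of species added by contamination


def flag_reasons(s: Sample) -> List[str]:
    """Return the (hypothetical) reason labels that apply to a sample."""
    reasons = []
    # Contamination: added species > 12%, or rate > 1% with added species > 10%.
    if s.added_species_fraction > 0.12 or (
        s.contamination_rate > 0.01 and s.added_species_fraction > 0.10
    ):
        reasons.append("cross_sample_contamination")
    # Sequencing depth below 20,000,000 reads.
    if s.read_count < MIN_READS:
        reasons.append("low_sequencing_depth")
    # Time points beyond baseline (t > 1).
    if s.timepoint > 1:
        reasons.append("longitudinal_timepoint")
    # Missing clinical status.
    if not s.health_status:
        reasons.append("missing_health_status")
    return reasons


# Usage: a sample is flagged when at least one reason applies.
example = Sample("S1", read_count=5_000_000, timepoint=2)
reasons = flag_reasons(example)
print(bool(reasons), ";".join(reasons))
```

Returning a list of reasons rather than a single boolean keeps both annotation columns derivable from one pass over the metadata.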

Identifier
DOI	https://doi.org/10.57745/UPITJ0
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/UPITJ0
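The Metadata Access link is an OAI-PMH `GetRecord` request. As a sketch, the same URL can be assembled from its parts with the Python standard library; the helper name `oai_record_url` is illustrative, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import urlencode

OAI_ENDPOINT = "https://entrepot.recherche.data.gouv.fr/oai"


def oai_record_url(doi: str, metadata_prefix: str = "oai_datacite") -> str:
    """Build an OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": f"doi:{doi}",
        },
        safe=":/",  # keep ':' and '/' unescaped, as in the link shown on this page
    )
    return f"{OAI_ENDPOINT}?{query}"


url = oai_record_url("10.57745/UPITJ0")
print(url)
```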

Provenance
Creator	LE CHATELIER, Emmanuelle; SOLA, Mathilde; PLAZA ONATE, Florian
Publisher	Recherche Data Gouv
Contributor	SOLA, Mathilde; LE CHATELIER, Emmanuelle; Entrepôt Recherche Data Gouv
Publication Year	2025
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	SOLA, Mathilde (INRAE, MetaGenoPolis); LE CHATELIER, Emmanuelle (INRAE, MetaGenoPolis)

Representation
Resource Type	Dataset
Format	text/tab-separated-values
Size	43252779; 399695; 503391
Version	1.0
Discipline	Life Sciences; Medicine