This dataset combines 8 public cohorts from 9 countries, with a total of 5,494 human stool samples profiled by shotgun metagenomic sequencing, and includes both healthy individuals and patients with various diseases (e.g., cardiovascular disease, colorectal cancer, inflammatory bowel disease, obesity, metabolic syndrome, Parkinson's disease).
Metacohort overview
Project ID | Associated publication | Number of samples | Countries | Clinical status
------------|------------------------|-------------------|-------------|-------------------
PRJEB39223 | Asnicar et al. | 1098 | USA,GBR | Healthy
PRJEB37249 | MetaCardis | 888 | FRA,DEU,DNK | Healthy & Diseased
PRJEB11532 | Zeevi et al. | 969 | ISR | Healthy
PRJNA834801 | Wallen et al. | 653 | USA | Healthy & Diseased
PRJDB4176 | Yachida et al. | 645 | JPN | Healthy & Diseased
PRJNA319574 | Schirmer et al. | 471 | NLD | Healthy
PRJNA530339 | Wang et al. | 385 | CHN | Healthy
PRJEB21528 | Jie et al. | 385 | CHN | Healthy & Diseased
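As a quick consistency check, the figures in the overview table can be tallied programmatically. The sketch below simply transcribes the table; the names `COHORTS` and `summarize` are illustrative and not part of the dataset.

```python
# Cohort table transcribed from the overview above:
# (project ID, number of samples, ISO 3166-1 alpha-3 country codes).
COHORTS = [
    ("PRJEB39223", 1098, ["USA", "GBR"]),
    ("PRJEB37249", 888, ["FRA", "DEU", "DNK"]),
    ("PRJEB11532", 969, ["ISR"]),
    ("PRJNA834801", 653, ["USA"]),
    ("PRJDB4176", 645, ["JPN"]),
    ("PRJNA319574", 471, ["NLD"]),
    ("PRJNA530339", 385, ["CHN"]),
    ("PRJEB21528", 385, ["CHN"]),
]


def summarize(cohorts):
    """Return (number of cohorts, total samples, sorted distinct countries)."""
    total_samples = sum(n for _, n, _ in cohorts)
    countries = sorted({code for _, _, codes in cohorts for code in codes})
    return len(cohorts), total_samples, countries


n_cohorts, n_samples, countries = summarize(COHORTS)
print(n_cohorts, n_samples, len(countries))  # prints: 8 5494 9
```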
Data processing
Each study was processed separately according to the following procedure.
Data download
Whole-metagenome sequencing data were downloaded from the European Nucleotide Archive (ENA).
Quality control
All DNA sequencing reads were quality-trimmed and stripped of sequencing adapters using fastp (v0.23.2). Remaining host contamination was removed by aligning the reads against the human reference genome T2T-CHM13v2.0 with Bowtie2 (v2.5.1) and discarding, with samtools (v1.9), reads that aligned with at least 90% nucleotide identity.
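The exact way nucleotide identity was computed is not stated here. As an illustration only, the sketch below assumes a BLAST-like definition: identity is the fraction of alignment columns (CIGAR operations M, =, X, I, D) that are not counted in the SAM `NM` edit-distance tag. The function names and the handling of soft-clipped bases (ignored) are assumptions, not a description of the actual pipeline.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_identity(cigar: str, nm: int) -> float:
    """Assumed definition: (alignment columns - edit distance) / alignment columns.

    Columns are M, =, X, I and D operations; clipped bases are ignored.
    An unaligned read (CIGAR '*') yields 0.0.
    """
    columns = sum(int(length) for length, op in CIGAR_RE.findall(cigar) if op in "MID=X")
    if columns == 0:
        return 0.0
    return (columns - nm) / columns


def is_host_read(cigar: str, nm: int, threshold: float = 0.90) -> bool:
    """True if the read aligns to the host genome with identity >= threshold."""
    return alignment_identity(cigar, nm) >= threshold


print(is_host_read("100M", 5))   # identity 0.95, treated as host
print(is_host_read("100M", 11))  # identity 0.89, kept
```

Under this definition a read with a short, near-perfect local alignment would also be treated as host, so a real filter might add a minimum aligned length; that choice is left open here.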
Microbial species annotation
For all samples, genes and microbial species were identified and quantified with METEOR 2 using the human gut microbial gene catalogue IGC2 (10.4 million genes), with the GTDB R226 release for taxonomic annotation.
Metadata download & curation
Raw metadata were collected from ENA, NCBI, and supplementary materials of the original publications. Variables were standardized across cohorts, and a common set of minimal metadata was retained (age, gender, body mass index (BMI), health status, country, study name).
Metadata additional flagging
We added two annotation columns ("flag" and "flag_reason") to the minimal metadata to highlight cases that may warrant specific attention in subsequent analyses.
Cross-sample contamination identification
Cross-sample contamination was checked with CroCoDeEL (v1.0.8), following the current guidelines: samples were flagged if the species added by contamination exceeded 12%, or if the contamination rate exceeded 1% with additional species exceeding 10%.
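The decision rule above can be written as a small predicate. This is a minimal sketch: the parameter names are hypothetical and do not correspond to CroCoDeEL's output columns, and it assumes that "additional species" and "species added by contamination" refer to the same percentage.

```python
def is_contamination_flagged(contamination_rate_pct: float, species_added_pct: float) -> bool:
    """Apply the two flagging conditions described in the text.

    contamination_rate_pct: estimated contamination rate, in percent (hypothetical name).
    species_added_pct: species introduced by contamination, in percent (hypothetical name).
    """
    # Condition 1: a large share of species introduced by contamination.
    if species_added_pct > 12.0:
        return True
    # Condition 2: a noticeable contamination rate combined with a moderate share of added species.
    return contamination_rate_pct > 1.0 and species_added_pct > 10.0


print(is_contamination_flagged(0.5, 13.0))  # True: added species above 12%
print(is_contamination_flagged(2.0, 11.0))  # True: rate above 1% and added species above 10%
print(is_contamination_flagged(0.5, 11.0))  # False: rate too low for the second condition
```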
Sequencing depth
Samples with a sequencing depth lower than 20,000,000 reads were flagged.
Longitudinal series
Samples originating from longitudinal studies (time points beyond baseline, i.e. t > 1) were flagged.
Missing information on the clinical status
Samples with missing health status were flagged.
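The four flagging criteria can be combined into the "flag" and "flag_reason" columns along the following lines. This is a sketch under stated assumptions: the input field names (`read_count`, `timepoint`, `health_status`, `contaminated`), the boolean encoding of "flag", the reason labels, and the semicolon separator are all illustrative and may differ from the released metadata.

```python
MIN_READS = 20_000_000  # sequencing depth threshold given in the text


def flag_sample(sample: dict) -> dict:
    """Return a copy of the sample record with 'flag' and 'flag_reason' added."""
    reasons = []
    if sample.get("contaminated"):
        reasons.append("cross_sample_contamination")
    if sample["read_count"] < MIN_READS:
        reasons.append("low_sequencing_depth")
    # Time points beyond baseline (t > 1) mark follow-up samples of a longitudinal series.
    if sample.get("timepoint", 1) > 1:
        reasons.append("longitudinal_followup")
    # None or an empty string both count as missing health status.
    if not sample.get("health_status"):
        reasons.append("missing_health_status")
    return {**sample, "flag": bool(reasons), "flag_reason": ";".join(reasons)}


example = {"sample_id": "S1", "read_count": 15_000_000, "timepoint": 2,
           "health_status": "Healthy", "contaminated": False}
print(flag_sample(example)["flag_reason"])  # prints: low_sequencing_depth;longitudinal_followup
```

Keeping every applicable reason, rather than only the first, lets downstream analyses exclude samples selectively, for example dropping low-depth samples while retaining longitudinal follow-ups.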