Taxonomic profiles, functional profiles and manually curated metadata of human fecal metagenomes from public projects coming from colorectal cancer studies

In the context of the FeMAI project (Federated Microbiome AI for human health), this dataset was created to assess various machine learning classification methods for colorectal cancer risk stratification.

Cohort overview

This dataset gathers 2250 human stool samples characterized by shotgun metagenomic sequencing from 14 public cohorts spanning 9 countries. Its aim is to study the composition of the gut microbiota in healthy controls and in patients with adenoma or colorectal cancer (CRC).

The BioProjects associated with the cohorts are:

PRJDB4176 (JPN, 645 individuals, 286 CRC patients)
PRJEB10878 (CHN, 128 individuals, 74 CRC patients)
PRJEB12449 (USA, 104 individuals, 52 CRC patients)
PRJEB27928 (GER, 82 individuals, 22 CRC patients)
PRJEB6070 (FRA, 156 individuals, 53 CRC patients; GER, 43 individuals, 38 CRC patients)
PRJEB7774 (AUT, 156 individuals, 46 CRC patients)
PRJNA389927 (USA, 56 individuals, 26 CRC patients; CAN, 28 individuals, 2 CRC patients)
PRJNA397112 (IND, 110 individuals, no patients)
PRJNA447983 (ITA, 140 individuals, 61 CRC patients)
PRJNA531273 (IND, 30 individuals, 30 CRC patients)
PRJNA608088 (CHN, 18 individuals, 6 CRC patients)
PRJNA429097 (CHN, 193 individuals, 98 CRC patients)
PRJNA763023 (CHN, 200 individuals, 100 CRC patients)
PRJNA731589 (CHN, 161 individuals, 76 CRC patients)
PRJNA961076 (BRA, 90 individuals, 30 CRC patients)
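
The per-cohort figures listed above can be tallied by country with a few lines of Python. This is only a sketch: the counts are transcribed by hand from the list, "no patients" is encoded as 0, and the tuple layout is an illustrative choice rather than the format of any file in the dataset. Individuals are not necessarily the same as samples, so these totals need not match the sample count quoted in the overview.

```python
# Per-cohort counts transcribed from the BioProject list:
# (BioProject, country code, individuals, CRC patients)
COHORTS = [
    ("PRJDB4176", "JPN", 645, 286),
    ("PRJEB10878", "CHN", 128, 74),
    ("PRJEB12449", "USA", 104, 52),
    ("PRJEB27928", "GER", 82, 22),
    ("PRJEB6070", "FRA", 156, 53),
    ("PRJEB6070", "GER", 43, 38),
    ("PRJEB7774", "AUT", 156, 46),
    ("PRJNA389927", "USA", 56, 26),
    ("PRJNA389927", "CAN", 28, 2),
    ("PRJNA397112", "IND", 110, 0),  # listed as "no patients"
    ("PRJNA447983", "ITA", 140, 61),
    ("PRJNA531273", "IND", 30, 30),
    ("PRJNA608088", "CHN", 18, 6),
    ("PRJNA429097", "CHN", 193, 98),
    ("PRJNA763023", "CHN", 200, 100),
    ("PRJNA731589", "CHN", 161, 76),
    ("PRJNA961076", "BRA", 90, 30),
]


def tally_by_country(cohorts):
    """Sum individuals and CRC patients per country code."""
    totals = {}
    for _project, country, individuals, crc in cohorts:
        ind, pat = totals.get(country, (0, 0))
        totals[country] = (ind + individuals, pat + crc)
    return totals


totals = tally_by_country(COHORTS)
for country, (individuals, crc) in sorted(totals.items()):
    print(f"{country}\t{individuals}\t{crc}")
```

A per-country view like this is mainly useful for checking class balance before training or evaluating a classifier on a single cohort or country.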

Data processing

Sequencing data was downloaded from the European Nucleotide Archive.

Reads were quality-trimmed and sequencing adapters were removed using fastp. Residual host contamination was then removed by mapping the reads against the human reference genome (T2T-CHM13v2.0) with bowtie2.
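
The record does not give the exact parameters used for these two steps, so the following is only a sketch of how such a paired-end preprocessing could be assembled from Python. The file names, the index prefix `chm13v2.0` and the thread count are hypothetical; the options shown (`--detect_adapter_for_pe`, `--un-conc-gz`, and so on) are standard fastp and bowtie2 flags, not necessarily the ones the authors chose.

```python
import shlex


def fastp_command(r1, r2, out1, out2, threads=4):
    """Build a fastp call that trims adapters and low-quality bases (paired-end)."""
    return [
        "fastp",
        "-i", r1, "-I", r2,          # raw read pairs
        "-o", out1, "-O", out2,      # trimmed read pairs
        "--detect_adapter_for_pe",   # infer adapters from read overlap
        "--thread", str(threads),
    ]


def host_filter_command(index_prefix, r1, r2, unaligned_prefix, threads=4):
    """Build a bowtie2 call that keeps pairs NOT aligning concordantly to the host."""
    return [
        "bowtie2",
        "-p", str(threads),
        "-x", index_prefix,                 # hypothetical T2T-CHM13v2.0 index prefix
        "-1", r1, "-2", r2,
        "--un-conc-gz", unaligned_prefix,   # non-host pairs, gzip-compressed
        "-S", "/dev/null",                  # alignments themselves are not needed
    ]


trim = fastp_command("s_1.fastq.gz", "s_2.fastq.gz", "trim_1.fastq.gz", "trim_2.fastq.gz")
clean = host_filter_command("chm13v2.0", "trim_1.fastq.gz", "trim_2.fastq.gz", "clean_%.fastq.gz")

# Print the commands; pass the lists to subprocess.run(..., check=True) to execute them.
print(shlex.join(trim))
print(shlex.join(clean))
```

Building the commands as argument lists keeps them easy to inspect and avoids shell quoting problems. Note that `--un-conc-gz` retains every pair that fails to align concordantly; a stricter host filter that also discards pairs with a single aligned mate would need additional post-processing.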

Microbial species were identified and quantified using both the human gut reference gene catalogue (IGC2, 10.4M genes) and the human oral gene catalogue (8.4M genes), both clustered into Metagenomic Species Pangenomes that are taxonomically and functionally annotated.

Data provided

The data associated with the cohorts are:

MetaGenomic Species abundance/count tables among samples and associated taxonomy (GTDB version RS214)
Functional modules abundance among samples and associated annotation (KEGG version 92)
Manually curated metadata: all but 6 gut metagenomic samples from the 14 public projects are listed (96 virome samples from PRJNA389927 and 6 samples from PRJEB12449 not described in the associated paper and with no health status were discarded). A quality check was performed and 104 samples were identified as contaminated. They remain listed in the metadata file but are recommended for exclusion.
Comparison table between Meteor, Metaphlan2 and Metaphlan4
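
Since the contaminated samples stay in the metadata file, they have to be filtered out before analysis. The sketch below shows one way to do that with the standard library. The column names (`sample_id`, `health_status`, `contaminated`), the flag values and the inline example rows are all hypothetical, so they should be adapted to the actual headers of the metadata file.

```python
import csv
import io

# Hypothetical stand-in for the metadata file; real column names may differ.
METADATA_TSV = (
    "sample_id\thealth_status\tcontaminated\n"
    "S1\tCRC\tno\n"
    "S2\tcontrol\tyes\n"
    "S3\tadenoma\tno\n"
)

# Values treated as "flagged as contaminated" (an assumption, not the file's spec).
FLAGGED = {"yes", "true", "1"}


def load_clean_samples(handle, flag_column="contaminated"):
    """Return metadata rows whose contamination flag is not set."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[flag_column].strip().lower() not in FLAGGED]


def keep_columns(handle, sample_ids, id_column=0):
    """Yield rows of a features-by-samples table restricted to the given samples."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    keep = [id_column] + [i for i, name in enumerate(header) if name in sample_ids]
    yield [header[i] for i in keep]
    for row in reader:
        yield [row[i] for i in keep]


clean_rows = load_clean_samples(io.StringIO(METADATA_TSV))
clean_ids = {row["sample_id"] for row in clean_rows}
print(sorted(clean_ids))
```

For the gzip-compressed tables, the same functions can be given a handle opened with `gzip.open(path, "rt", newline="")` instead of the in-memory `io.StringIO` used here.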

Identifier
DOI https://doi.org/10.57745/7IVO3E
Related Identifier IsCitedBy https://doi.org/10.1093/bioinformatics/bty830
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/7IVO3E
Provenance
Creator BARBET, Pauline; ALMEIDA, Mathieu; PROBUL, Niklas; BAUMBACH, Jan; PONS, Nicolas; PLAZA ONATE, Florian; LE CHATELIER, Emmanuelle
Publisher Recherche Data Gouv
Contributor LE CHATELIER, Emmanuelle
Publication Year 2022
Funding Reference Agence Nationale de la Recherche (ANR) ANR-11-DPBS-0001; Agence Nationale de la Recherche (ANR) ANR-21-FAI1-0010; German Federal Ministry of Education and Research (BMBF) 01IS21079
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact LE CHATELIER, Emmanuelle (INRAE, MetaGenoPolis)
Representation
Resource Type Dataset
Format text/tab-separated-values; application/x-gzip
Size 13549345; 4237326; 451262; 16178; 24187; 23391977; 13033981; 364891
Version 9.0
Discipline Life Sciences; Pathology and Forensic Medicine; Microbial Ecology and Applied Microbiology; Biology; Omics