A catalog of genes and species of the human oral microbiota


Dataset overview

This dataset provides:

- a non-redundant, high-quality catalog of 8.4 million genes
- 853 Metagenomic Species Pangenomes (MSPs)

This dataset can be used to analyze shotgun sequencing data of the human oral microbiota.

Methods

Data Sources

The oral gene catalog was built using three primary sources:

- Bacterial genomes from the Human Oral Microbiome Database (HOMD)
- Fungal genomes from the NCBI RefSeq database
- Metagenomic sequencing data from multiple oral microbiome studies

The creation of the oral gene catalog was a multi-step process, combining and refining genes from each source.

Bacterial Genes

A total of 1,505 bacterial genomes were downloaded from HOMD (version 20170215, accessed in December 2017). Genes shorter than 60 nucleotides or containing ambiguous bases were filtered out. Redundancy was removed using CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0). This process yielded 1,459,394 unique HOMD genes for the catalog.

Fungal Genes

A total of 1,017 fungal genomes were downloaded from NCBI RefSeq (May 2017). For the 492 genomes lacking existing annotations, gene calling was performed using GeneMark-ES in fungi mode. After initial redundancy removal with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0), genes were selected for inclusion only if their corresponding genome was present in at least 20% of the samples in one of the metagenomic cohorts, as determined by mapping reads with Bowtie2 (v2.2.3). This led to the selection of 2,440,644 fungal genes.

Metagenomic Sequencing Data

The gene catalog was supplemented with data from 689 oral metagenomes, including newly sequenced samples, from the following studies:

- Human Microbiome Project (HMP): 382 samples (BioProject PRJNA255439)
- Chinese Cohort: 212 samples (BioProject PRJEB6997)
- TwinsUK Cohort: 48 newly sequenced samples (BioProject PRJEB38483)

Raw reads were quality-controlled and trimmed using AlienTrimmer 0.4.0 (parameters: -k 10 -l 45 -m 5 -p 40 -q 20). Human sequences were removed by mapping against the human reference genome (GRCh38.p11) using Bowtie2 2.2.3. Metagenomic assembly was performed using SPAdes 3.9.0 (parameters: "-k 21,33,55 --only-assembler --meta" for Illumina paired-end data, or "--iontorrent -t 24 -m 300 -k 21,33,55 --only-assembler" for Ion Torrent single-end data). Contigs shorter than 500 bp or with coverage below 2x were discarded. Gene calling was conducted with Prodigal (parameters: -m -p meta). Genes shorter than 60 bp were filtered out, and redundancy was removed with CD-HIT-EST (v4.6; parameters: -aS 0.9 -c 0.95 -T 0 -M 0 -t 0 -d 0 -G 0).

Final Gene Catalog

The final gene catalog was assembled by sequentially adding non-redundant genes from each data source. Genes from HOMD and fungal genomes were combined first using cd-hit-est-2d. Then, non-redundant genes from the HMP, Chinese, and TwinsUK cohorts were sequentially added using cd-hit-est-2d (same parameters as cd-hit-est). A final redundancy removal step was performed. This process resulted in a catalog of 8.4 million non-redundant genes.
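The gene-level filters mentioned during catalog construction (minimum length of 60 nucleotides, no ambiguous bases) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `keep_gene` and `filter_genes` are hypothetical, and treating any character other than A, C, G or T as "ambiguous" is an assumption made here for illustration.

```python
# Illustrative sketch only; names and the definition of "ambiguous" are assumptions.
UNAMBIGUOUS_BASES = set("ACGT")


def keep_gene(sequence, min_length=60):
    """Return True if the gene passes the length and ambiguity filters."""
    sequence = sequence.upper()
    long_enough = len(sequence) >= min_length  # genes shorter than 60 nt are dropped
    unambiguous = set(sequence) <= UNAMBIGUOUS_BASES  # e.g. 'N' fails this check
    return long_enough and unambiguous


def filter_genes(genes, min_length=60):
    """Filter a {gene_id: sequence} mapping, keeping only genes that pass."""
    return {
        gene_id: seq
        for gene_id, seq in genes.items()
        if keep_gene(seq, min_length)
    }


if __name__ == "__main__":
    example = {
        "gene_ok": "ACGT" * 15,              # 60 nt, kept
        "gene_short": "ACGT" * 14,           # 56 nt, dropped
        "gene_ambiguous": "ACGN" * 15,       # contains N, dropped
    }
    print(sorted(filter_genes(example)))
```

In practice such a filter would be applied to a FASTA file before redundancy removal; reading and writing FASTA is left out to keep the sketch self-contained.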

MSPs Recovery

The 689 metagenomic samples were aligned against the final gene catalog using the Meteor software suite to produce a gene abundance table. Co-abundant genes were then binned into 853 Metagenomic Species Pangenomes (MSPs) using MSPminer.

MSPs Taxonomic Annotation

Taxonomic annotation of the MSPs was performed by aligning all core and accessory genes against representative genomes from the GTDB database (release r214) using blastn (task: megablast, word_size: 16).

A species-level assignment was given if over 50% of the genes matched a representative genome with a mean nucleotide identity of at least 95% and a mean gene length coverage of at least 90%. The remaining MSPs were assigned to a higher taxonomic level (genus to superkingdom) if more than 50% of their genes shared the same annotation.
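The decision rule described above can be sketched in Python as follows. This is a hedged illustration, not the pipeline actually used: the function names, the input structures (a list of per-gene identity and coverage percentages for one representative genome, and a list of per-gene lineage dictionaries), and the intermediate rank names between genus and superkingdom are all assumptions. The sketch also assumes that the 50% thresholds are computed over all genes of the MSP.

```python
from collections import Counter

# Assumed rank order, from most to least specific ("genus to superkingdom").
HIGHER_RANKS = ["genus", "family", "order", "class", "phylum", "superkingdom"]


def is_species_level(n_genes, matches):
    """Check the species-level criteria against one representative genome.

    n_genes: total number of genes in the MSP.
    matches: list of (identity_pct, coverage_pct), one per gene matching the genome.
    """
    if n_genes == 0 or not matches:
        return False
    fraction_matched = len(matches) / n_genes
    mean_identity = sum(identity for identity, _ in matches) / len(matches)
    mean_coverage = sum(coverage for _, coverage in matches) / len(matches)
    return fraction_matched > 0.5 and mean_identity >= 95.0 and mean_coverage >= 90.0


def assign_higher_rank(n_genes, gene_lineages):
    """Return (rank, taxon) for the most specific rank shared by >50% of genes.

    gene_lineages: list of {rank: taxon} dicts, one per annotated gene.
    Returns None if no rank reaches the majority threshold.
    """
    if n_genes == 0:
        return None
    for rank in HIGHER_RANKS:
        counts = Counter(lin[rank] for lin in gene_lineages if lin.get(rank))
        if not counts:
            continue
        taxon, count = counts.most_common(1)[0]
        if count / n_genes > 0.5:
            return rank, taxon
    return None
```

For example, an MSP with 10 genes of which 6 match a genome at 97% mean identity and 95% mean coverage would satisfy `is_species_level`, whereas one with exactly 5 matching genes would not, because the rule requires strictly more than 50%.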

Identifier
DOI https://doi.org/10.15454/WQ4UTV
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.15454/WQ4UTV
Provenance
Creator Le Chatelier, Emmanuelle; Almeida, Mathieu; Plaza Oñate, Florian; Pons, Nicolas; Gauthier, Franck; Ghozlane, Amine; Ehrlich, Stanislav Dusko; Witherden, Elizabeth; Gomez-Cabrero, David
Publisher Recherche Data Gouv
Contributor Plaza Oñate, Florian; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2021
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Plaza Oñate, Florian (INRAE)
Representation
Resource Type Dataset
Format text/tab-separated-values; application/x-gzip
Size 204950; 28335618; 2862620653; 1989700
Version 3.1
Discipline Life Sciences; Microbial Ecology and Applied Microbiology; Pathology and Forensic Medicine; Biology; Omics