Metagenome-Assembled Genomes (MAGs) from cow rumen fluid samples

Dataset description

This dataset constitutes a collection of genomes from the cow rumen microbiome, constructed by integrating large-scale public metagenomic datasets with reference genomes from cultured isolates.

Compared to the MGnify cow rumen catalogue, currently one of the most comprehensive public resources, this dataset expands the known diversity with more than 2,100 species absent from MGnify, while sharing a core of ~1,700 species. For the shared species, the representative genome from this catalogue is of higher quality in approximately 76% of cases, based on completeness, contamination, and assembly continuity metrics. These improvements enhance the reliability of downstream genome-resolved analyses.

Data sources

Metagenomic data

Stewart et al. 2019 – BioProject PRJEB31266 (240 samples)

Stewart et al. 2018 – BioProject PRJEB21624 (43 samples)

Ruminomics – BioProject PRJEB21508 (58 samples)

Mu et al. 2021 – BioProject PRJNA639405 (24 samples)

Sato et al. 2024 – BioProject PRJDB16747 (37 samples)

7 unpublished deeply sequenced rumen metagenomes

Total: 409 metagenomic samples

Genomic data

Hungate1000 – BioProject PRJNA471733 (381 genomes)

Metagenomic assembly

Metagenomic assemblies were generated using metaSPAdes. Contigs shorter than 1,500 bp were removed prior to downstream analyses.

Genomic assembly

Isolate genomes were assembled using SPAdes with parameters --isolate and --cov-cutoff auto. Contigs shorter than 1,500 bp were discarded.
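
The 1,500 bp length cut-off applied to both the metagenomic and the isolate assemblies can be illustrated with a minimal Python sketch. It is not the script used to build this dataset: the function names (read_fasta, filter_contigs) and the toy contigs are hypothetical, and the sketch assumes that contigs of exactly 1,500 bp are kept, since only shorter ones are described as removed.

```python
from typing import Iterable, Iterator, Tuple

MIN_CONTIG_LENGTH = 1500  # bp; contigs shorter than this are discarded


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_CONTIG_LENGTH
) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose length is at least min_length."""
    for header, sequence in records:
        if len(sequence) >= min_length:
            yield header, sequence


# Toy assembly (hypothetical contigs): 1,600 bp, 1,499 bp and 1,500 bp.
example = [
    ">contig_1", "A" * 1000, "C" * 600,
    ">contig_2", "G" * 1499,
    ">contig_3", "T" * 1500,
]
kept = [name for name, _ in filter_contigs(read_fasta(example))]
print(kept)  # ['contig_1', 'contig_3']
```

For a real assembly, the list of lines would be replaced by an open file handle on the contigs FASTA file.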

MAG recovery

Metagenome-assembled genomes (MAGs) were reconstructed using COMEBin (multi-coverage mode). Genome quality was assessed using CheckM2.

MAGs were retained based on the following criteria:

Completeness ≥ 70%

Contamination ≤ 5%

N50 ≥ 5 kb
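
The three retention criteria above can be expressed as a short Python sketch. This is an illustration under stated assumptions, not the pipeline used for this dataset: the bin names, quality values and contig lengths are invented, the helper names (n50, passes_quality) are hypothetical, and 5 kb is taken to mean 5,000 bp. In practice, completeness and contamination would come from the CheckM2 report.

```python
from typing import List

MIN_COMPLETENESS = 70.0   # percent
MAX_CONTAMINATION = 5.0   # percent
MIN_N50 = 5000            # bp, assuming 5 kb = 5,000 bp


def n50(lengths: List[int]) -> int:
    """Return the contig length at which the cumulative sum of the
    length-sorted contigs reaches half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def passes_quality(completeness: float, contamination: float, contig_n50: int) -> bool:
    """Apply the three retention thresholds jointly."""
    return (
        completeness >= MIN_COMPLETENESS
        and contamination <= MAX_CONTAMINATION
        and contig_n50 >= MIN_N50
    )


# Hypothetical bins: (name, completeness %, contamination %, contig lengths in bp)
bins = [
    ("bin_1", 92.3, 1.2, [10000, 8000, 4000, 2000]),  # passes all three
    ("bin_2", 65.0, 0.5, [50000, 30000]),             # completeness too low
    ("bin_3", 88.0, 7.5, [20000, 20000]),             # contamination too high
    ("bin_4", 75.0, 2.0, [4000, 3000, 2000, 1500]),   # N50 too low
]
retained = [
    name
    for name, completeness, contamination, lengths in bins
    if passes_quality(completeness, contamination, n50(lengths))
]
print(retained)  # ['bin_1']
```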

Genome dereplication

Pairwise Average Nucleotide Identity (ANI) was computed using skani. Genomes were dereplicated at the species level using a 95% ANI threshold.
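
One simple way to turn pairwise ANI values into species-level clusters is a greedy scheme, sketched below in Python. The description above does not state which clustering procedure or which representative-selection rule was used, so this is only an illustration: the greedy strategy, the pre-ranked genome list, the genome names and the ANI values are all assumptions, and pairs missing from the ANI table are treated as falling below the threshold.

```python
from typing import Dict, FrozenSet, List

ANI_THRESHOLD = 95.0  # percent; species-level cut-off


def dereplicate(
    genomes_by_rank: List[str], ani: Dict[FrozenSet[str], float]
) -> Dict[str, List[str]]:
    """Greedy clustering: genomes are visited from best to worst rank.
    A genome joins the first representative it matches at or above the
    threshold; otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for genome in genomes_by_rank:
        for representative in clusters:
            pair = frozenset((genome, representative))
            # Pairs missing from the table are assumed to be below threshold.
            if ani.get(pair, 0.0) >= ANI_THRESHOLD:
                clusters[representative].append(genome)
                break
        else:
            clusters[genome] = [genome]
    return clusters


# Hypothetical genomes, already sorted by some quality ranking (best first).
ranked = ["g1", "g2", "g3", "g4"]
pairwise_ani = {
    frozenset(("g1", "g2")): 98.1,
    frozenset(("g1", "g3")): 82.0,
    frozenset(("g3", "g4")): 96.4,
}
species = dereplicate(ranked, pairwise_ani)
print(species)  # {'g1': ['g1', 'g2'], 'g3': ['g3', 'g4']}
```

Each key of the returned dictionary is a species representative, and its value lists the genomes assigned to that species. The result of a greedy scheme depends on the input order, which is why the genomes are ranked beforehand.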

Taxonomic annotation

Taxonomic classification of dereplicated genomes was performed using GTDB-Tk, based on GTDB release r220.

Identifier
DOI	https://doi.org/10.57745/F9BMRL
Metadata Access	https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/F9BMRL

Provenance
Creator	PLAZA ONATE, Florian
Publisher	Recherche Data Gouv
Contributor	PLAZA ONATE, Florian
Publication Year	2026
Rights	etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess	true
Contact	PLAZA ONATE, Florian (INRAE)

Representation
Resource Type	Dataset
Format	text/tab-separated-values; application/x-compressed; application/x-xz
Size	5908254; 17610319192; 590763; 2401601064
Version	1.0
Discipline	Agriculture, Forestry, Horticulture; Life Sciences