Dataset description
This dataset is a collection of genomes from the cow rumen microbiome,
built by integrating large-scale public metagenomic datasets with reference genomes from cultured isolates.
Compared to the MGnify cow rumen catalogue, currently one of the most comprehensive public resources,
this dataset expands the known diversity by identifying more than 2,100 additional species not present
in MGnify, while sharing a core of ~1,700 species. Furthermore, for shared species, representative genomes
from this catalogue exhibit higher quality in approximately 76% of cases, based on completeness,
contamination, and assembly continuity metrics. These improvements enhance the reliability of downstream
genome-resolved analyses.
Data sources
Metagenomic data
Stewart et al. 2019 – BioProject PRJEB31266 (240 samples)
Stewart et al. 2018 – BioProject PRJEB21624 (43 samples)
Ruminomics – BioProject PRJEB21508 (58 samples)
Mu et al. 2021 – BioProject PRJNA639405 (24 samples)
Sato et al. 2024 – BioProject PRJDB16747 (37 samples)
7 unpublished deeply sequenced rumen metagenomes
Total: 409 metagenomic samples
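As a quick consistency check, the per-project sample counts listed above can be summed and compared against the stated total. The dictionary below is only an illustrative tally of the numbers given in this section; the key "unpublished" is a placeholder label, not an accession.

```python
# Sample counts per source, copied from the list above.
# "unpublished" is a placeholder label for the 7 in-house metagenomes.
SAMPLES_PER_SOURCE = {
    "PRJEB31266": 240,
    "PRJEB21624": 43,
    "PRJEB21508": 58,
    "PRJNA639405": 24,
    "PRJDB16747": 37,
    "unpublished": 7,
}

total_samples = sum(SAMPLES_PER_SOURCE.values())
print(f"Total metagenomic samples: {total_samples}")
```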
Genomic data
Hungate1000 – BioProject PRJNA471733 (381 genomes)
Metagenomic assembly
Metagenomic assemblies were generated using metaSPAdes. Contigs shorter than 1,500 bp were removed prior to downstream analyses.
Genomic assembly
Isolate genomes were assembled using SPAdes with parameters --isolate and --cov-cutoff auto.
Contigs shorter than 1,500 bp were discarded.
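The 1,500 bp length filter applied to both the metagenomic and isolate assemblies can be sketched as follows. This is a minimal illustration, not the tool actually used to build the dataset (the source does not name one); the function names `read_fasta` and `filter_contigs` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

MIN_CONTIG_LEN = 1500  # bp; contigs shorter than this are dropped


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(
    lines: Iterable[str], min_len: int = MIN_CONTIG_LEN
) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose length is at least min_len."""
    for header, seq in read_fasta(lines):
        if len(seq) >= min_len:
            yield header, seq


# Tiny in-memory example: contig "b" spans two lines and is exactly 1,500 bp.
demo = [">a", "A" * 1499, ">b", "A" * 1000, "C" * 500, ">c", "G" * 2000]
kept = [header for header, _ in filter_contigs(demo)]
print(kept)
```

For a real assembly, the same function could be fed an open file handle instead of the `demo` list.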
MAG recovery
Metagenome-assembled genomes (MAGs) were reconstructed using COMEBin (multi-coverage mode). Genome quality was assessed using CheckM2.
MAGs were retained based on the following criteria:
Completeness ≥ 70%
Contamination ≤ 5%
N50 ≥ 5 kb
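The three retention criteria above can be expressed as a simple filter. The sketch below assumes that completeness and contamination are available as percentages (as reported by CheckM2) and that N50 is computed from contig lengths in bp; the `MagStats` class and the function names are hypothetical and are not part of CheckM2 or COMEBin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MagStats:
    name: str
    completeness: float   # percent
    contamination: float  # percent
    n50: int              # bp


def n50(contig_lengths: List[int]) -> int:
    """Length of the contig at which the cumulative sum of
    length-sorted contigs reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def passes_quality(
    mag: MagStats,
    min_completeness: float = 70.0,
    max_contamination: float = 5.0,
    min_n50: int = 5000,
) -> bool:
    """Apply the thresholds listed above (all bounds inclusive)."""
    return (
        mag.completeness >= min_completeness
        and mag.contamination <= max_contamination
        and mag.n50 >= min_n50
    )


# Made-up example values, for illustration only.
mags = [
    MagStats("m1", 92.0, 1.2, 40000),
    MagStats("m2", 69.9, 0.5, 80000),  # fails completeness
    MagStats("m3", 85.0, 5.1, 20000),  # fails contamination
    MagStats("m4", 70.0, 5.0, 5000),   # exactly on every threshold
]
retained = [m.name for m in mags if passes_quality(m)]
print(retained)
```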
Genome dereplication
Pairwise Average Nucleotide Identity (ANI) was computed using skani. Genomes were dereplicated at the species level using a 95% ANI threshold.
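The clustering procedure used to turn pairwise ANI values into species-level groups is not specified here. One common approach is greedy, representative-based clustering, sketched below under that assumption: genomes are visited in a pre-ranked order (for example by quality), and each genome joins the first existing representative with ANI ≥ 95%, or otherwise founds a new cluster. The `dereplicate` function and the input layout are hypothetical; the ANI values would in practice come from skani output.

```python
from typing import Dict, FrozenSet, List

ANI_THRESHOLD = 95.0  # percent; species-level cutoff stated above


def dereplicate(
    ranked_genomes: List[str],
    ani: Dict[FrozenSet[str], float],
    threshold: float = ANI_THRESHOLD,
) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative to its cluster members.

    ranked_genomes: genome names, best candidate representative first.
    ani: pairwise ANI keyed by frozenset({a, b}); missing pairs are
         treated as below threshold.
    """
    clusters: Dict[str, List[str]] = {}
    for genome in ranked_genomes:
        for rep in clusters:
            if ani.get(frozenset((genome, rep)), 0.0) >= threshold:
                clusters[rep].append(genome)
                break
        else:
            # No representative is close enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters


# Made-up ANI values, for illustration only.
ani_values = {
    frozenset(("A", "B")): 97.2,
    frozenset(("A", "C")): 88.0,
    frozenset(("B", "C")): 89.1,
}
species_clusters = dereplicate(["A", "B", "C"], ani_values)
print(species_clusters)
```

With this scheme the result depends on the ranking order, so the choice of ranking criterion determines which genome represents each species.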
Taxonomic annotation
Taxonomic classification of dereplicated genomes was performed using GTDB-Tk, based on GTDB release r220.