KEGG-Orthology-parser

When working with genome annotations, you'll likely end up with a long list of KO numbers, and many downstream analyses need the full KEGG hierarchy attached to each of them. Your first thought might be, "If I had a table with the complete hierarchical KEGG orthology, I could simply merge my annotations with it." In practice, it is not that straightforward: KEGG distributes the hierarchy in formats that are hard to compile into a simple, readable table. This script parses one of the available KEGG formats (the htext .keg file) and converts the entire orthology into a simple table that preserves the hierarchical information in a readable form.

OUTPUT STRUCTURE

The script generates a table with the following columns:

  • Class1: First KEGG supercategory (e.g., Metabolism, Genetic Information Processing, Environmental Information Processing)
  • Class2: Second KEGG category (e.g., Carbohydrate metabolism, Energy metabolism)
  • Class3: Third KEGG category, typically representing pathways for KO codes
  • PathBrite_code
  • PATH_or_BRITE: Indicates whether the annotation is a pathway or part of BRITE
  • KO_code
  • Abbreviated_gene_name: Multiple entries separated by ", "
  • Complete_gene_name
  • EC_number: Multiple entries separated by ";"
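
Because Abbreviated_gene_name and EC_number can hold several values in one cell, some analyses may need them split into one value per row. The sketch below shows one possible way to do that with data.table. The toy rows and the KO codes K99991 and K99992 are invented for illustration and are not real KEGG entries.

```r
library(data.table)

# Toy rows that mimic the column layout described above (not real KEGG entries)
ko <- data.table(
  KO_code = c("K99991", "K99992"),
  Abbreviated_gene_name = c("geneA, geneB", "geneC"),
  EC_number = c("1.1.1.1;1.1.1.2", "2.7.1.1")
)

# One row per KO and EC number: split on ";" and trim stray whitespace
ec_long <- ko[
  , .(EC_number = trimws(unlist(strsplit(EC_number, ";", fixed = TRUE)))),
  by = KO_code
]

# One row per KO and gene name: split on ", "
name_long <- ko[
  , .(Abbreviated_gene_name = unlist(strsplit(Abbreviated_gene_name, ", ", fixed = TRUE))),
  by = KO_code
]

print(ec_long)
print(name_long)
```

In the full table a KO can appear on several rows (one per pathway or BRITE entry), so it is probably worth reducing the input to unique KO_code and EC_number pairs with unique() before splitting.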

The output file is named: KEGG_Orthology_complete.tsv

The file is saved in the same folder as the input file.
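
Once the table exists, attaching the hierarchy to your own annotations is a single merge on KO_code. The following is a minimal sketch, not part of the script itself: the annotation table, its gene_id column, and the KO code K99999 are hypothetical, and the small kegg table stands in for an excerpt of the real file (which could be loaded with fread("KEGG_Orthology_complete.tsv", sep = "\t", quote = "")).

```r
library(data.table)

# Toy excerpt standing in for KEGG_Orthology_complete.tsv (only some columns shown)
kegg <- data.table(
  Class1 = c("Metabolism", "Metabolism"),
  Class2 = c("Carbohydrate metabolism", "Carbohydrate metabolism"),
  Class3 = c("Glycolysis / Gluconeogenesis", "Fructose and mannose metabolism"),
  KO_code = c("K00844", "K00844")
)

# Hypothetical annotation table: one row per gene with its assigned KO.
# K99999 is a made-up code used to show what an unmatched KO looks like.
annotations <- data.table(
  gene_id = c("gene_001", "gene_002"),
  KO_code = c("K00844", "K99999")
)

# Left join: keep every annotated gene, add the hierarchy where the KO is found.
# allow.cartesian = TRUE permits the row expansion caused by KOs in several pathways.
annotated <- merge(annotations, kegg, by = "KO_code", all.x = TRUE,
                   allow.cartesian = TRUE)

print(annotated)
```

Note that a KO assigned to several pathways produces one output row per gene and pathway combination, so the merged table is usually longer than the annotation table. KOs missing from the orthology table keep their row with NA in the hierarchy columns.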

HOW TO PARSE THE ENTIRE KEGG ORTHOLOGY TO A TAB FILE

  1. Download the htext file from KEGG: https://www.genome.jp/kegg-bin/get_htext?ko00001

  2. Set the input_file_path variable in the R script to the downloaded .keg file.

     Example: input_file_path <- "C:/Users/user/Desktop/Antrihabitans_trees_ANI_AAI/Genomes/KEGG/kegg_orthology/ko00001.keg"

  3. Run the script.

A TSV file named KEGG_Orthology_complete.tsv will be created in the same directory as the input file.
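
A quick sanity check after running the script might look like the sketch below. It derives the expected output location from input_file_path and compares the TSV header with the column names listed in this README. The placeholder path and the assumption that the header matches those names exactly are mine, so adjust both as needed.

```r
# Placeholder path; replace with the location of your downloaded .keg file
input_file_path <- file.path(tempdir(), "ko00001.keg")

# The output is written next to the input file
output_file_path <- file.path(dirname(input_file_path),
                              "KEGG_Orthology_complete.tsv")

# Column names as listed in this README (assumed to match the TSV header exactly)
expected_columns <- c(
  "Class1", "Class2", "Class3", "PathBrite_code", "PATH_or_BRITE",
  "KO_code", "Abbreviated_gene_name", "Complete_gene_name", "EC_number"
)

if (file.exists(output_file_path)) {
  # Read only the header and first row; quote = "" avoids misreading gene names
  header <- names(read.delim(output_file_path, nrows = 1, quote = "",
                             check.names = FALSE))
  missing_columns <- setdiff(expected_columns, header)
  if (length(missing_columns) == 0) {
    message("All expected columns found in ", output_file_path)
  } else {
    message("Missing columns: ", paste(missing_columns, collapse = ", "))
  }
} else {
  message("Output not found yet: ", output_file_path)
}
```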

Two test files are included in the repository:

  • ko00001_test.keg (simplified subset)
  • ko00001.keg (complete hierarchy, April 2024)

These can be used to test the script.
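
For orientation, the htext format marks the hierarchy depth with a leading letter (A, B, C for the three category levels, D for the KO entries). The sketch below is not the repository's script. It is a simplified base R illustration, under that assumption about the format, of how such lines can be flattened into rows; it fills only Class1 to Class3 and KO_code, and the input lines are a hand-written excerpt in the style of ko00001.keg.

```r
# Hand-written excerpt in the style of a KEGG htext file
keg_lines <- c(
  "A09100 Metabolism",
  "B  09101 Carbohydrate metabolism",
  "C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
  "D      K00844  HK; hexokinase [EC:2.7.1.1]",
  "D      K00845  glk; glucokinase [EC:2.7.1.2]"
)

level <- substr(keg_lines, 1, 1)          # leading letter gives the depth
text  <- trimws(substring(keg_lines, 2))  # remainder of the line

# Most recently seen category at each level
current <- c(A = NA_character_, B = NA_character_, C = NA_character_)
rows <- list()

for (i in seq_along(keg_lines)) {
  if (level[i] %in% c("A", "B", "C")) {
    # Drop the numeric code and any trailing "[PATH:...]" or "[BR:...]" tag
    name <- sub("\\s*\\[.*$", "", sub("^[0-9]+\\s+", "", text[i]))
    current[level[i]] <- name
    # Entering a new higher level invalidates the lower ones
    if (level[i] == "A") current[c("B", "C")] <- NA_character_
    if (level[i] == "B") current["C"] <- NA_character_
  } else if (level[i] == "D") {
    rows[[length(rows) + 1]] <- data.frame(
      Class1 = current[["A"]],
      Class2 = current[["B"]],
      Class3 = current[["C"]],
      KO_code = sub("\\s.*$", "", text[i]),  # first token is the KO number
      stringsAsFactors = FALSE
    )
  }
}

result <- do.call(rbind, rows)
print(result)
```

Lines that start with any other character (file header, separators) are simply skipped by this sketch. The actual script also extracts the gene names, EC numbers, and the PATH or BRITE code.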

VARIABLES TO DEFINE IN THE SCRIPT

  • input_file_path: Path to the KEGG htext (.keg) file

DEPENDENCIES

  • R (tested with version 4.2.2; works with or without RStudio)
  • data.table (tested with version 1.15.4)

CONTACT INFORMATION

For questions or issues, contact: raul.riescoj@uib.es

Identifier
DOI https://doi.org/10.34810/DATA3148
Related Identifier IsReferencedBy https://doi.org/10.1099/ijsem.0.006695
Metadata Access https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/DATA3148
Provenance
Creator Riesco Jarrín, Raúl
Publisher CORA.Repositori de Dades de Recerca
Contributor Riesco Jarrin, Raul; Universitat de les Illes Balears
Publication Year 2026
Funding Reference https://ror.org/003x0zc53 PID2021-124068NB-I00 ; https://ror.org/003x0zc53 TED2021-131105B-I00
Rights CC BY-NC 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc/4.0
OpenAccess true
Contact Riesco Jarrin, Raul (Universitat de les Illes Balears)
Representation
Resource Type Program source code; Dataset
Format type/x-r-syntax; text/tab-separated-values; application/octet-stream; text/markdown; text/plain
Size 4439; 2673; 4101414; 1706; 2873; 6234
Version 1.0
Discipline Life Sciences; Medicine