KEGG-Orthology-parser

When working with genome annotations, you will likely end up with a long list of KO numbers, and you may want to attach the entire KEGG hierarchy to them for downstream analyses. The obvious approach is to merge your annotations with a table containing the complete hierarchical KEGG orthology. In practice, it is not that straightforward: the hierarchical formats in which this information is available are hard to compile into simple, readable tables. This script parses one of the available KEGG formats (the htext .keg file) and converts the entire orthology into a simple tabular form that preserves the hierarchical information.

OUTPUT STRUCTURE

The script generates a table with the following columns:

Class1: First KEGG supercategory (e.g., Metabolism, Genetic Information Processing, Environmental Information Processing)
Class2: Second KEGG category (e.g., Carbohydrate metabolism, Energy metabolism)
Class3: Third KEGG category, typically representing pathways for KO codes
PathBrite_code
PATH_or_BRITE: Indicates whether the annotation is a pathway or part of BRITE
KO_code
Abbreviated_gene_name: Multiple entries separated by ", "
Complete_gene_name
EC_number: Multiple entries separated by ";"
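
As a rough illustration of how nested htext lines could map onto columns like those listed above, the following R sketch walks a few sample lines and carries the current A, B, and C levels down to each D (KO) line. It is not the repository's script: the function name parse_keg_sketch, the sample lines, and the assumed line layout (a leading level letter followed by a numeric code and a name) are assumptions made for illustration, and the bracketed tags and gene definitions are deliberately left unsplit.

```r
# Hypothetical mini-sample mimicking the ASSUMED layout of a KEGG htext file.
# These lines are illustrative and are not copied from the repository test files.
keg_lines <- c(
  "+D\tKO",
  "!",
  "A09100 Metabolism",
  "B",
  "B  09101 Carbohydrate metabolism",
  "C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
  "D      K00844  HK; hexokinase [EC:2.7.1.1]",
  "D      K01834  PGAM, gpmA; 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase [EC:5.4.2.11]"
)

# Sketch only: remembers the most recent A/B/C entry and emits one row per D line.
parse_keg_sketch <- function(lines) {
  class1 <- class2 <- class3 <- NA_character_
  rows <- list()
  for (ln in lines) {
    level <- substr(ln, 1, 1)          # first character gives the hierarchy level
    body  <- trimws(substring(ln, 2))  # rest of the line without padding
    if (level == "A") {
      class1 <- sub("^[0-9]+ +", "", body)   # drop the leading numeric code
    } else if (level == "B" && nzchar(body)) {
      class2 <- sub("^[0-9]+ +", "", body)
    } else if (level == "C") {
      class3 <- sub("^[0-9]+ +", "", body)   # bracketed tag is kept as-is
    } else if (level == "D") {
      rows[[length(rows) + 1]] <- data.frame(
        Class1 = class1,
        Class2 = class2,
        Class3 = class3,
        KO_code = sub(" .*$", "", body),                  # first token
        Definition = trimws(sub("^[^ ]+ +", "", body)),   # everything after it
        stringsAsFactors = FALSE
      )
    }
    # Any other line (header, separator) is ignored in this sketch.
  }
  do.call(rbind, rows)
}

parsed <- parse_keg_sketch(keg_lines)
print(parsed[, c("Class1", "Class2", "KO_code")])
```

Carrying the parent levels forward line by line is one simple way to flatten such a hierarchy; splitting the definition into abbreviated names, complete name, and EC numbers would be an additional step.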

The output file is named: KEGG_Orthology_complete.tsv

The file is saved in the same folder as the input file.
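
Once the table exists, attaching the hierarchy to a list of KO numbers could look roughly like the sketch below. The annotation table, its gene_id column, and the inline rows standing in for KEGG_Orthology_complete.tsv are hypothetical; in practice the table would be read from the TSV file (for example with data.table's fread) instead of being typed in.

```r
library(data.table)

# Hypothetical stand-in for a few rows of KEGG_Orthology_complete.tsv
# (only some of the columns are shown; values are illustrative).
kegg <- data.table(
  Class1 = c("Metabolism", "Metabolism"),
  Class2 = c("Carbohydrate metabolism", "Carbohydrate metabolism"),
  Class3 = c("Glycolysis / Gluconeogenesis", "Glycolysis / Gluconeogenesis"),
  KO_code = c("K00844", "K01834"),
  EC_number = c("2.7.1.1", "5.4.2.11")
)

# Hypothetical annotation table: one row per gene with its assigned KO.
# "K99999" is a made-up code used to show what an unmatched KO looks like.
annotations <- data.table(
  gene_id = c("gene_001", "gene_002", "gene_003"),
  KO_code = c("K00844", "K01834", "K99999")
)

# Left join: keep every annotated gene, add the hierarchy where the KO is found.
# A KO can sit in several pathways, so one gene may expand into several rows;
# allow.cartesian = TRUE permits that expansion on large tables.
annotated <- merge(annotations, kegg, by = "KO_code",
                   all.x = TRUE, allow.cartesian = TRUE)
print(annotated)

# EC_number may hold several entries separated by ";", so split before use.
ec_list <- strsplit(annotated$EC_number, ";", fixed = TRUE)
```

Genes whose KO is absent from the table keep their row, with NA in the hierarchy columns, which makes unmatched annotations easy to spot.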

HOW TO PARSE THE ENTIRE KEGG ORTHOLOGY TO A TAB FILE

Download the htext file from KEGG: https://www.genome.jp/kegg-bin/get_htext?ko00001
Set the input_file_path variable in the R script to the path of the downloaded .keg file.

Example: input_file_path <- "C:/Users/user/Desktop/Antrihabitans_trees_ANI_AAI/Genomes/KEGG/kegg_orthology/ko00001.keg"

Run the script.

A TSV file named KEGG_Orthology_complete.tsv will be created in the same directory as the input file.
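
To check in advance where the result should appear, the expected output location can be derived from the input path, as in this small sketch. The path shown is a placeholder and output_file_path is a name introduced here for illustration; the script itself may build the path differently.

```r
# Placeholder path; replace with the location of your downloaded .keg file.
input_file_path <- "path/to/ko00001.keg"

# The TSV is described as being written to the same directory as the input,
# so the expected location is the input's directory plus the fixed file name.
output_file_path <- file.path(dirname(input_file_path),
                              "KEGG_Orthology_complete.tsv")

# Warn early if the input path is wrong, before running the script.
if (!file.exists(input_file_path)) {
  message("Input file not found: ", input_file_path)
}

print(output_file_path)
```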

Two test files are included in the repository:

- ko00001_test.keg (simplified subset)
- ko00001.keg (complete hierarchy, April 2024)

These can be used to test the script.

VARIABLES TO DEFINE IN THE SCRIPT

input_file_path: Path to the KEGG htext (.keg) file

DEPENDENCIES

R (tested with version 4.2.2; works with or without RStudio)
data.table (tested with version 1.15.4)
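
A quick way to compare a local setup against the tested versions above is sketched here. It only reports what is installed and does not install anything; the variable names are illustrative.

```r
# Version of the running R session, as a plain string.
r_version <- as.character(getRversion())
message("R version: ", r_version)

# Check for data.table without attaching it.
has_data_table <- requireNamespace("data.table", quietly = TRUE)

if (has_data_table) {
  message("data.table version: ",
          as.character(utils::packageVersion("data.table")))
} else {
  message("data.table is not installed; install.packages(\"data.table\") would add it.")
}
```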

CONTACT INFORMATION

For questions or issues, contact: raul.riescoj@uib.es

Identifier
DOI	https://doi.org/10.34810/DATA3148
Related Identifier	IsReferencedBy https://doi.org/10.1099/ijsem.0.006695
Metadata Access	https://dataverse.csuc.cat/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34810/DATA3148

Provenance
Creator	Riesco Jarrín, Raúl
Publisher	CORA.Repositori de Dades de Recerca
Contributor	Riesco Jarrin, Raul; Universitat de les Illes Balears
Publication Year	2026
Funding Reference	https://ror.org/003x0zc53 PID2021-124068NB-I00 ; https://ror.org/003x0zc53 TED2021-131105B-I00
Rights	CC BY-NC 4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc/4.0
OpenAccess	true
Contact	Riesco Jarrin, Raul (Universitat de les Illes Balears)

Representation
Resource Type	Program source code; Dataset
Format	type/x-r-syntax; text/tab-separated-values; application/octet-stream; text/markdown; text/plain
Size	4439; 2673; 4101414; 1706; 2873; 6234
Version	1.0
Discipline	Life Sciences; Medicine