When working with genome annotations, you will often end up with a long list of KO numbers and want to attach the entire KEGG hierarchy to them for downstream analyses. Your first thought might be, "If I had a table with the complete hierarchical KEGG orthology, I could simply merge my annotations with it." In practice, the hierarchy is distributed in nested formats that are not straightforward to flatten into a simple, readable table. This script parses one of the available KEGG formats (the htext .keg file) and converts the entire orthology into a tab-separated table that preserves the hierarchical information in a readable form.
OUTPUT STRUCTURE
The script generates a table with the following columns:
- Class1: First KEGG supercategory (e.g., Metabolism, Genetic Information Processing, Environmental Information Processing)
- Class2: Second KEGG category (e.g., Carbohydrate metabolism, Energy metabolism)
- Class3: Third KEGG category, typically the pathway to which the KO code belongs
- PathBrite_code
- PATH_or_BRITE: Indicates whether the annotation is a pathway or part of BRITE
- KO_code
- Abbreviated_gene_name: Multiple entries separated by ", "
- Complete_gene_name
- EC_number: Multiple entries separated by ";"
The output file is named:
KEGG_Orthology_complete.tsv
The file is saved in the same folder as the input file.
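Once the table exists, attaching the hierarchy to your annotations is a join on KO_code. The following is a minimal sketch, not part of the script: it uses a small in-memory stand-in with only four of the output columns, and the annotations table, its gene_id column, and the placeholder KO K99999 are hypothetical.

```r
library(data.table)

# Stand-in for the script's output. In practice you would load the real file:
# kegg <- fread("KEGG_Orthology_complete.tsv")
kegg <- data.table(
  Class1  = c("Metabolism", "Metabolism"),
  Class2  = c("Carbohydrate metabolism", "Carbohydrate metabolism"),
  Class3  = c("Glycolysis / Gluconeogenesis", "Fructose and mannose metabolism"),
  KO_code = c("K00844", "K00844")
)

# Hypothetical annotation table: one row per gene with its assigned KO.
# K99999 is a placeholder that is absent from the hierarchy table.
annotations <- data.table(
  gene_id = c("gene_001", "gene_002"),
  KO_code = c("K00844", "K99999")
)

# Left join: keep every gene and add the hierarchy where the KO is known.
# allow.cartesian = TRUE permits one KO to match several hierarchy rows.
annotated <- merge(annotations, kegg, by = "KO_code",
                   all.x = TRUE, allow.cartesian = TRUE)
print(annotated)
```

Because a KO can belong to several pathways, the merged table can have more rows than the annotation table. Genes whose KO is missing from the hierarchy keep NA in the class columns.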
HOW TO PARSE THE ENTIRE KEGG ORTHOLOGY TO A TAB FILE
- Download the htext file from KEGG:
https://www.genome.jp/kegg-bin/get_htext?ko00001
- Set the input_file_path variable in the R script to the downloaded .keg file.
Example:
input_file_path <- "C:/Users/user/Desktop/Antrihabitans_trees_ANI_AAI/Genomes/KEGG/kegg_orthology/ko00001.keg"
- Run the script.
A TSV file named KEGG_Orthology_complete.tsv will be created in the same directory as the input file.
Two test files are included in the repository:
- ko00001_test.keg (simplified subset)
- ko00001.keg (complete hierarchy, April 2024)
These can be used to test the script.
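To give an idea of what the conversion involves, here is a minimal sketch of flattening an htext hierarchy in base R. It is not the code of the script itself. It assumes that each line starts with a level letter (A, B, C or D), that A, B and C lines carry a numeric code followed by a name, and that D lines start with the KO code. The sample lines are written by hand in that assumed layout, and only four of the output columns are reconstructed.

```r
# A few hand-written lines in the assumed layout of a ko00001.keg file
keg_lines <- c(
  "A09100 Metabolism",
  "B  09101 Carbohydrate metabolism",
  "C    00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
  "D      K00844  HK; hexokinase [EC:2.7.1.1]",
  "D      K00845  glk; glucokinase [EC:2.7.1.2]"
)

level <- substr(keg_lines, 1, 1)           # hierarchy level: A, B, C or D
text  <- trimws(substring(keg_lines, 2))   # remainder of the line

# Drop the leading numeric code of A/B/C lines (D lines are left unchanged)
label <- sub("^[0-9]+[[:space:]]+", "", text)

class1 <- class2 <- class3 <- NA_character_
rows <- list()
for (i in seq_along(keg_lines)) {
  if (level[i] == "A") {
    class1 <- label[i]
  } else if (level[i] == "B") {
    class2 <- label[i]
  } else if (level[i] == "C") {
    # Remove the trailing tag such as [PATH:ko00010]
    class3 <- sub("[[:space:]]*\\[[^]]*\\]$", "", label[i])
  } else if (level[i] == "D") {
    # The KO code is the first whitespace-delimited token
    ko <- sub("[[:space:]].*$", "", text[i])
    rows[[length(rows) + 1]] <- data.frame(
      Class1 = class1, Class2 = class2, Class3 = class3, KO_code = ko,
      stringsAsFactors = FALSE
    )
  }
}

# One row per KO, each carrying the classes it was nested under
hierarchy <- do.call(rbind, rows)
print(hierarchy)
```

The idea is to remember the most recent A, B and C entries while walking down the file and to emit one row for every D line. Header lines and any other line types in a real .keg file are simply ignored by this sketch.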
VARIABLES TO DEFINE IN THE SCRIPT
- input_file_path:
Path to the KEGG htext (.keg) file
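Since the output is written next to the input, the output location follows directly from input_file_path. The sketch below shows one way this could be derived in base R. The path is a hypothetical placeholder and output_file_path is an illustrative name, so the script may do this differently.

```r
# Hypothetical location of the downloaded file; adjust to your system
input_file_path <- "path/to/ko00001.keg"

# Report a wrong path early instead of failing later during parsing
if (!file.exists(input_file_path)) {
  message("Input file not found: ", input_file_path)
}

# The output sits in the same folder as the input, under the fixed file name
output_file_path <- file.path(dirname(input_file_path),
                              "KEGG_Orthology_complete.tsv")
print(output_file_path)
```

Forward slashes work in R paths on Windows as well, as in the example path shown in the instructions above.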
DEPENDENCIES
- R (tested with version 4.2.2; works with or without RStudio)
- data.table (tested with version 1.15.4)
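Before running the script, you can compare your setup with the tested versions. This is an optional sketch rather than part of the script, and the variable names are illustrative. Older versions may still work, so it only prints messages.

```r
# Versions the script was tested with, taken from the list above
tested_r  <- "4.2.2"
tested_dt <- "1.15.4"

r_ok <- getRversion() >= tested_r
dt_installed <- requireNamespace("data.table", quietly = TRUE)

if (!r_ok) {
  message("R is older than the tested version ", tested_r)
}
if (!dt_installed) {
  message("data.table is missing; install it with install.packages(\"data.table\")")
} else if (packageVersion("data.table") < tested_dt) {
  message("data.table is older than the tested version ", tested_dt)
}
```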
CONTACT INFORMATION
For questions or issues, contact:
raul.riescoj@uib.es