GraphML files for undirected weighted graphs with nodes that represent protein sequences of glycoside hydrolase 19 homologues. Protein sequences were clustered by a threshold of 90% sequence identity to derive representative sequences. Pairwise sequence identity between two sequences was derived from global Needleman-Wunsch alignment. Protein sequence networks were generated with edge weights of pairwise sequence identity, filtered by a predefined threshold. Metadata of the nodes (e.g. annotations) and of the edges (the edge weights) were summarized in GraphML files.
The GraphML attributes for the edges comprise the edge weights (pairwise sequence identity, "weight"). The GraphML attributes for the nodes comprise the identifiers from the GH19ED ("sequence_id", "protein_id", "hfam_id", and "sfam_id" for sequence, protein, homologous family and superfamily identifiers, respectively), the NCBI taxonomy ID ("tax_id"), the annotated (organism) source name ("tax_name"), the taxonomic lineage of the source organism ("lineage", with taxa separated by "<--"), and the length of the amino acid sequence ("sequence_length"). In addition, suggested color names are given for both fill color and border color of each node ("color" and "color_border").