GraphML files for protein sequence networks of glycoside hydrolase 19 homologues


GraphML files for undirected weighted graphs whose nodes represent protein sequences of glycoside hydrolase 19 homologues. Protein sequences were clustered at a threshold of 90% sequence identity to derive representative sequences. Pairwise sequence identities were derived from global Needleman-Wunsch alignments. Protein sequence networks were generated with pairwise sequence identity as the edge weight, and edges were filtered by a predefined identity threshold. Metadata of the nodes (e.g. annotations) and of the edges (the edge weights) are stored in the GraphML files.
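
The edge-filtering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `filter_edges`, the sequence labels, the identity values, the 0.40 cutoff, and the choice of an inclusive (`>=`) comparison are all assumptions made for illustration. Whether the files store identities as fractions or percentages is likewise not stated here.

```python
def filter_edges(pairwise_identity, threshold):
    """Return undirected weighted edges whose identity meets the threshold.

    pairwise_identity maps (sequence_a, sequence_b) to a sequence identity.
    The inclusive comparison is an assumption, not taken from the dataset.
    """
    edges = {}
    for (a, b), identity in pairwise_identity.items():
        if a == b:
            continue  # skip self-comparisons
        if identity >= threshold:
            # Sort the pair so that (a, b) and (b, a) are the same edge.
            edges[tuple(sorted((a, b)))] = identity
    return edges


# Hypothetical identities between three representative sequences.
pairwise_identity = {
    ("seq1", "seq2"): 0.62,
    ("seq1", "seq3"): 0.35,
    ("seq3", "seq2"): 0.48,
}

edges = filter_edges(pairwise_identity, threshold=0.40)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Raising the threshold removes weaker edges and splits the network into smaller, more closely related groups. Lowering it connects more distant homologues.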

The GraphML attribute for the edges is the edge weight (pairwise sequence identity, "weight"). The GraphML attributes for the nodes comprise the identifiers from the GH19ED ("sequence_id", "protein_id", "hfam_id", and "sfam_id" for sequence, protein, homologous family and superfamily identifiers, respectively), the NCBI taxonomy ID ("tax_id"), the annotated name of the source organism ("tax_name"), the taxonomic lineage of the source organism ("lineage", with taxa separated by "<--"), and the length of the amino acid sequence ("sequence_length"). In addition, suggested color names are given for the fill color and the border color of each node ("color" and "color_border").
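
As a rough sketch of how these attributes might be read, the example below parses a tiny hand-written GraphML fragment with Python's standard library. The fragment is hypothetical: the node ids, key ids, attribute values, and lineage placeholders are invented for illustration and only a subset of the listed attributes is included. Only the attribute names and the "<--" separator come from the description above.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical fragment; all values are placeholders, not data from the files.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="sequence_id" attr.type="string"/>
  <key id="d1" for="node" attr.name="lineage" attr.type="string"/>
  <key id="d2" for="node" attr.name="sequence_length" attr.type="int"/>
  <key id="d3" for="edge" attr.name="weight" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="n0">
      <data key="d0">1</data>
      <data key="d1">taxon_c &lt;-- taxon_b &lt;-- taxon_a</data>
      <data key="d2">250</data>
    </node>
    <node id="n1">
      <data key="d0">2</data>
      <data key="d1">taxon_d &lt;-- taxon_b &lt;-- taxon_a</data>
      <data key="d2">243</data>
    </node>
    <edge source="n0" target="n1">
      <data key="d3">0.62</data>
    </edge>
  </graph>
</graphml>"""


def read_graphml(root):
    """Return (nodes, edges) from a parsed GraphML root element."""
    # Map key ids (e.g. "d0") to attribute names (e.g. "sequence_id").
    keys = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
    graph = root.find("g:graph", NS)
    nodes = {}
    for node in graph.findall("g:node", NS):
        nodes[node.get("id")] = {
            keys[d.get("key")]: d.text for d in node.findall("g:data", NS)
        }
    edges = []
    for edge in graph.findall("g:edge", NS):
        attrs = {keys[d.get("key")]: d.text for d in edge.findall("g:data", NS)}
        edges.append((edge.get("source"), edge.get("target"), float(attrs["weight"])))
    return nodes, edges


nodes, edges = read_graphml(ET.fromstring(SAMPLE))

# Split the lineage string into individual taxa on the "<--" separator.
lineage = [taxon.strip() for taxon in nodes["n0"]["lineage"].split("<--")]
print(lineage)
print(edges)
```

For a file on disk, `ET.parse(path).getroot()` can be passed to `read_graphml` instead of the parsed sample string, where `path` stands for the location of a downloaded GraphML file.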

Metadata Access
Creator: Orlando, Marco (University of Milano Bicocca)
Publisher: DaRUS
Contributor: Pleiss, Jürgen
Publication Year: 2020
Funding Reference: Bundesministerium für Bildung und Forschung, 031B0571A; Deutsche Forschungsgemeinschaft, EXC2075
Rights: CC BY 4.0; info:eu-repo/semantics/openAccess
OpenAccess: true
Contact: Pleiss, Jürgen (Universität Stuttgart)
Resource Type: Dataset
Format: text/xml-graphml
Size: 14996074; 5840906; 75625241
Version: 1.0
Discipline: Life Sciences; Medicine