Foldclass databases for protein structural domains in CATH and TED

This repository contains databases of protein domains for use with Foldclass and Merizo-search. We provide databases for all 365 million domains in TED, as well as all classified domains in CATH 4.3.

Foldclass and Merizo-search use two database formats. The default format stores the data as a PyTorch tensor and a pickled list of Python tuples. This format is used for the CATH database, which is small enough to fit in memory. For larger-than-memory datasets, such as TED, we use a binary format that is searched using the Faiss library.

The CATH database requires approximately 1.4 GB of disk space, whereas the TED database requires about 885 GB. Please ensure you have enough free storage space before downloading. For best search performance with the TED database, store it on the fastest storage hardware available to you.

IMPORTANT: We recommend going into each folder and downloading the files individually. If you attempt to download a whole folder in one go, you will receive a zip file that must then be decompressed. This is particularly an issue for the TED database, as you will need roughly twice the storage space compared with downloading the individual files. Our GitHub repository (see Related Materials below) contains a convenience script to download each database; we recommend using that.
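
The storage requirements above can be checked before starting a download. The following is a minimal sketch, not part of Foldclass or Merizo-search: the function names and the target directory are hypothetical, the sizes are the approximate figures quoted in this description, and 1 GB is assumed to mean 10^9 bytes.

```python
import shutil

# Approximate database sizes quoted in the description, in megabytes.
# Stored as integers to avoid floating-point rounding.
DB_SIZE_MB = {"cath": 1_400, "ted": 885_000}


def required_bytes(db: str, as_zip: bool = False) -> int:
    """Estimate the free space needed to download a database.

    Assumes 1 MB = 10**6 bytes. Downloading a whole folder yields a zip
    file that must be decompressed, so roughly twice the space is assumed.
    """
    size = DB_SIZE_MB[db.lower()] * 10**6
    return size * 2 if as_zip else size


def has_enough_space(path: str, needed: int) -> bool:
    """Return True if the filesystem containing `path` has `needed` bytes free."""
    return shutil.disk_usage(path).free >= needed


if __name__ == "__main__":
    target = "."  # hypothetical download directory
    for db in DB_SIZE_MB:
        need = required_bytes(db)
        ok = has_enough_space(target, need)
        print(f"{db}: need ~{need / 10**9:.1f} GB, enough space: {ok}")
```

Passing `as_zip=True` gives the rough estimate for a whole-folder download, which is why downloading the individual files is the cheaper option for TED.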

Identifier
DOI https://doi.org/10.5522/04/26348605.v2
Related Identifier HasPart https://ndownloader.figshare.com/files/50813226
Related Identifier HasPart https://ndownloader.figshare.com/files/50813403
Related Identifier HasPart https://ndownloader.figshare.com/files/50813406
Related Identifier HasPart https://ndownloader.figshare.com/files/50813532
Related Identifier HasPart https://ndownloader.figshare.com/files/50813850
Related Identifier HasPart https://ndownloader.figshare.com/files/50814849
Related Identifier HasPart https://ndownloader.figshare.com/files/50817567
Related Identifier HasPart https://ndownloader.figshare.com/files/50846193
Related Identifier HasPart https://ndownloader.figshare.com/files/50846196
Related Identifier HasPart https://ndownloader.figshare.com/files/51596159
Related Identifier HasPart https://ndownloader.figshare.com/files/51596162
Related Identifier HasPart https://ndownloader.figshare.com/files/51596228
Related Identifier HasPart https://ndownloader.figshare.com/files/51596447
Metadata Access https://api.figshare.com/v2/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:figshare.com:article/26348605
Provenance
Creator Kandathil, Shaun; Lau, Andy; Buchan, Daniel; Jones, David
Publisher University College London UCL
Contributor Figshare
Publication Year 2025
Rights https://creativecommons.org/licenses/by/4.0/
OpenAccess true
Contact researchdatarepository(at)ucl.ac.uk
Representation
Language English
Resource Type Dataset
Discipline Biology; Life Sciences