Replication data for: Towards the improvement of thermodynamic solubility prediction – a review

Evaluating thermodynamic solubility is crucial for designing successful drug candidates. Yet predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models that rely on molecular descriptors. Recently, powerful solubility predictive models have been published using feature- and graph-based neural networks. These models often display attractive performance, yet their reliability can be deceptive when they are used for prospective prediction. This review investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public usage because they lack a well-defined applicability domain and overlook some historical data sources. On the basis of a carefully reviewed dataset, we illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of a pharmaceutical context, and they performed poorly. We observed the impact of several factors on model performance: interlaboratory standard deviation, ionic state of the solute, and source of the solubility data. As a consequence, we propose a general workflow to curate aqueous solubility data with the aim of producing predictive models. Our results show how the data quality and applicability domain of public models affect their utility in a real pharmaceutical industry context. We found that some data sources, for instance the eChem dataset, appear less reliable than initially expected. This exhaustive aqueous solubility data analysis led to the development of a curation workflow; the resulting models and datasets are publicly available.

Data are available as CSV files.

File AqSolDBc.csv: Curated data from AqSolDB. The available columns are:

ID: Compound ID (string)
InChI: InChI code of the chemical structure (string)
Solubility: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) at ~300 K (float)
SMILEScurated: Curated SMILES code of the chemical structure (string)
SD: Interlaboratory standard deviation, default value: -1 (float)
Group: Data quality label imported from AqSolDB (string)
Dataset: Source of the data point (string)
Composition: Purity of the substance, one of mono-constituent, multi-constituent, UVCB (categorical)
Error: Identifier error on the data point, default value: None (string)
Charge: Estimated formal charge of the compound at pH 7, one of Positive, Negative, Zwitterion, Uncharged (categorical)
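The default values listed above (-1 for SD, None for Error) are easy to mistake for real values when reading the file. The following is a minimal sketch of one way to map them to missing values, using only the Python standard library. The sample rows are arbitrary placeholders rather than data from AqSolDBc.csv, the helper name `read_rows` is hypothetical, the reading of the defaults as "no value available" is an assumption, and the delimiter is left as a parameter because it should be checked against the downloaded file.

```python
import csv
import io

# Arbitrary placeholder rows for illustration; NOT taken from AqSolDBc.csv.
SAMPLE = (
    "ID,Solubility,SD,Error,Charge\n"
    "X-1,-2.50,-1,None,Uncharged\n"
    "X-2,-4.10,0.35,None,Negative\n"
)

def read_rows(handle, delimiter=","):
    """Yield rows with the documented default values replaced by None."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["Solubility"] = float(row["Solubility"])
        sd = float(row["SD"])
        # Assumption: the default -1 means that no standard deviation is available.
        row["SD"] = None if sd == -1 else sd
        # Assumption: the default string "None" means that no error was flagged.
        if row["Error"] == "None":
            row["Error"] = None
        yield row

rows = list(read_rows(io.StringIO(SAMPLE)))
with_sd = [row["ID"] for row in rows if row["SD"] is not None]
print(with_sd)
```

To read the real file, the `io.StringIO` object can be replaced by `open("AqSolDBc.csv", newline="")`.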

File OChemUnseen.csv: Solubility data from OChem, curated and orthogonal to AqSolDB. The available columns are:

SMILES: Curated SMILES code of the chemical structure (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)

File OChemOverlapping.csv: Solubility data from OChem, curated; the chemical structures are also present in AqSolDB. The available columns are:

SMILES: Curated SMILES code of the chemical structure (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)
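Since OChemUnseen.csv is described as orthogonal to AqSolDB and OChemOverlapping.csv as sharing its structures with AqSolDB, the two files should not have any structure in common. A quick sanity check could look like the sketch below. It compares SMILES as plain strings, which is only meaningful under the assumption that both files were canonicalized in the same way. The rows are arbitrary placeholders and the helper name `smiles_set` is hypothetical.

```python
import csv
import io

# Arbitrary placeholder rows for illustration; NOT taken from the real files.
UNSEEN = "SMILES,LogS\nC,-1.0\nCC,-2.0\n"
OVERLAPPING = "SMILES,LogS\nCCC,-3.0\n"

def smiles_set(handle, delimiter=","):
    """Collect the distinct SMILES strings found in one file."""
    return {row["SMILES"] for row in csv.DictReader(handle, delimiter=delimiter)}

unseen = smiles_set(io.StringIO(UNSEEN))
overlapping = smiles_set(io.StringIO(OVERLAPPING))

# Structures present in both files; expected to be empty if the split holds.
shared = unseen & overlapping
print(len(unseen), len(overlapping), len(shared))
```

A non-empty `shared` set would point either to a real overlap or to differences in how the SMILES strings were written, so it is a prompt for inspection rather than proof of an error.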

File OChemCurated.csv: Solubility data from OChem, curated. The available columns are:

ID: Compound ID (string)
Name: Compound name (string)
SMILES: Curated SMILES code of the chemical structure (string)
SDi: Interlaboratory standard deviation, default value: -1 (float)
Reference: Unformatted bibliographic reference from which the data point originates (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)
EXTERNALID: Compound ID as it appears in its data source, default value: None (string)
CASRN: CAS number of the compound, default value: None (string)
ARTICLEID: Source ID linked to the column Reference (string)
Temperature: Temperature of the measurement, in K (float)
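Because OChemCurated.csv records the measurement temperature, a user may want to keep only the points measured near a chosen temperature. The sketch below shows one possible filter. The 300 K target echoes the ~300 K stated for AqSolDBc.csv, while the 15 K tolerance, the helper name `near_temperature` and the sample rows are assumptions made for illustration only.

```python
import csv
import io

# Arbitrary placeholder rows for illustration; NOT real measurements.
SAMPLE = (
    "ID,SMILES,LogS,CASRN,Temperature\n"
    "A,C,-1.0,None,298.15\n"
    "B,CC,-2.0,None,310.15\n"
    "C,CCC,-3.0,None,353.15\n"
)

def near_temperature(handle, target_k=300.0, tolerance_k=15.0, delimiter=","):
    """Return the rows measured within tolerance_k kelvin of target_k."""
    kept = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        temperature = float(row["Temperature"])
        if abs(temperature - target_k) <= tolerance_k:
            row["LogS"] = float(row["LogS"])
            # The documented default "None" is mapped to a missing value.
            row["CASRN"] = None if row["CASRN"] == "None" else row["CASRN"]
            kept.append(row)
    return kept

kept = near_temperature(io.StringIO(SAMPLE))
print([row["ID"] for row in kept])
```

The tolerance is a free choice; a narrower window gives a more homogeneous set at the cost of fewer data points.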

Identifier
DOI https://doi.org/10.57745/CZVZIA
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/CZVZIA
Provenance
Creator Llompart, Pierre; Minoletti, Claire; Baybekov, Shamkhal; Horvath, Dragos; Marcou, Gilles; Varnek, Alexandre (ORCID: 0000-0003-1886-925X)
Publisher Recherche Data Gouv
Contributor Marcou, Gilles; Université de Strasbourg; Centre national de la recherche scientifique; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2023
Funding Reference ANRT Cifre 2021/1684
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Marcou, Gilles (CMC - UMR7140 ; CNRS, Université de Strasbourg ; Strasbourg ; France)
Representation
Resource Type Dataset
Format text/tab-separated-values; text/plain
Size 1869557; 993073; 205089; 100787; 2104
Version 1.0
Discipline Chemistry; Natural Sciences
Spatial Coverage Laboratory of Chemoinformatics (CMC - UMR7140)
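
The Metadata Access entry above is an OAI-PMH GetRecord request. As a small sketch, the same URL can be assembled from the DOI with the Python standard library. The function name `get_record_url` is hypothetical, and the example only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the Metadata Access entry of this record.
BASE = "https://entrepot.recherche.data.gouv.fr/oai"

def get_record_url(doi, metadata_prefix="oai_datacite"):
    """Build the OAI-PMH GetRecord URL for a DOI hosted on this repository."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ":" and "/" unescaped so the result matches the URL listed above.
    return f"{BASE}?{urlencode(params, safe=':/')}"

url = get_record_url("10.57745/CZVZIA")
print(url)
```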