Dataset

Kinetic modeling of enzymatic cephalexin synthesis with neural ODEs and surrogate-accelerated Bayesian inference


α-Amino ester hydrolases (AEHs) offer a promising route to the stereoselective synthesis of β-lactams such as cephalexin. However, published kinetic studies have encountered difficulty when extended beyond fitting of the data, indicating practical non-identifiability of the underlying kinetic models. Here, we address this issue using Bayesian inference combined with a reaction-consistent neural ODE surrogate that substantially accelerates parameter estimation. This framework enables efficient development of complex enzyme kinetic models even on limited hardware while providing rigorous uncertainty quantification of all parameters. To account for batch-dependent differences in active enzyme concentration, it was treated as a free parameter in each time series. Using this approach, the number of kinetic parameters was reduced from 12 to 9, and a useful kinetic model was obtained which is identifiable, mechanistically consistent, and predictive even under high substrate conditions.

Available Models

models/model_04.json: The most comprehensive 12-parameter model including all major reaction pathways, competitive inhibition, substrate inhibition, and detailed enzyme regulation mechanisms. This model provides the most biologically detailed description but requires the most parameters to be estimated.
models/model_06.json: A streamlined 9-parameter model that simplifies some regulatory interactions while maintaining core kinetic behavior. This represents a good compromise between detail and parameter identifiability.
models/model_07.json: An intermediate 10-parameter model that includes additional regulatory terms compared to Model 06, capturing more complex enzyme behavior under varying substrate conditions.
models/model_08.json: An optimized 9-parameter model that balances predictive accuracy with parameter parsimony. This model was developed through systematic model reduction to retain essential kinetic features while minimizing parameter uncertainty.
models/model_04_no_e0.json: Identical to Model 04 but with fixed enzyme concentration (E₀) rather than estimating it from data. Use this when enzyme concentration is known or measured separately.
models/model_08_no_e0.json: Identical to Model 08 but with fixed enzyme concentration. This provides a direct comparison of modeling approaches with and without enzyme concentration estimation.

Model File Structure and Components

Each model file (JSON format) contains a complete mathematical description of the kinetic system:

Species definitions: Lists all chemical species with their names and symbolic identifiers used in equations
Constants: Fixed parameters like enzyme concentration (p0) that may be estimated or held constant
ODEs: The system of ordinary differential equations describing how each species concentration changes over time. These equations encode the reaction kinetics and mass balances.
Parameters: Adjustable kinetic parameters (rate constants, binding affinities, inhibition constants) with their prior distributions for Bayesian inference
Algebraic assignments: Complex mathematical expressions that define reaction rates, enzyme-substrate complexes, and regulatory terms as functions of the parameters and species concentrations
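To see how these components appear in a given file, the JSON can be inspected directly. The following is a minimal sketch that only assumes the file holds a JSON object at the top level; it does not assume any particular key names, since the exact schema is defined by Catalax and is not spelled out here. The helper name summarize_model_file is hypothetical.

```python
import json
from pathlib import Path


def summarize_model_file(path):
    """Return {top-level key: short description} for a JSON model file.

    Assumption: the file contains a JSON object (dict) at the top level.
    """
    with Path(path).open(encoding="utf-8") as handle:
        model = json.load(handle)
    summary = {}
    for key, value in model.items():
        if isinstance(value, (list, dict)):
            # Containers: report the type and the number of entries.
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            # Scalars (strings, numbers, booleans, null): show the value itself.
            summary[key] = repr(value)
    return summary


if __name__ == "__main__":
    # Path taken from the model list above; run from the dataset root.
    model_path = Path("models/model_08.json")
    if model_path.exists():
        for key, description in summarize_model_file(model_path).items():
            print(f"{key}: {description}")
    else:
        print(f"{model_path} not found; run this from the dataset root.")
```

Comparing the output for model_04.json and model_08.json is a quick way to see which entries differ between the 12-parameter and 9-parameter variants.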

The models use symbolic mathematics where enzyme-substrate complexes and reaction rates are expressed algebraically, making them both interpretable and computationally efficient.

System Requirements

Software Dependencies

The analysis pipeline requires several specialized Python packages for scientific computing, probabilistic programming, and machine learning:

pip install catalax

Hardware Requirements

The computational analysis is moderately demanding due to Bayesian MCMC sampling and neural network training:

CPU: Multi-core processor (recommended: 12+ cores) - MCMC chains run in parallel across available cores for efficient sampling
RAM: 16GB minimum, 32GB recommended - Memory requirements peak during MCMC sampling when storing large arrays of posterior samples
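Running chains in parallel on a CPU requires XLA to expose several host devices, which is controlled through the XLA_FLAGS environment variable. As an alternative to exporting it in the shell, the value can be derived from the machine's core count in Python. This is a minimal sketch, not part of the dataset's scripts; the helper name xla_host_device_flag is hypothetical, and the flag has to be set before JAX is imported to take effect.

```python
import os


def xla_host_device_flag(n_devices=None):
    """Build the XLA flag that exposes n_devices host (CPU) devices."""
    if n_devices is None:
        # os.cpu_count() reports logical cores and may return None; fall back to 1.
        n_devices = os.cpu_count() or 1
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return f"--xla_force_host_platform_device_count={n_devices}"


# Set this before `import jax`; JAX reads the flag when it initializes its backend.
# Any flags already present in XLA_FLAGS are kept and the new one is appended.
existing = os.environ.get("XLA_FLAGS", "")
os.environ["XLA_FLAGS"] = f"{existing} {xla_host_device_flag()}".strip()
print(os.environ["XLA_FLAGS"])
```

Passing an explicit number, for example xla_host_device_flag(12), matches the recommended 12-core setup. NumPyro's numpyro.set_host_device_count serves the same purpose when called before any JAX computation.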

Operating System and Python Version

Supported OS: Linux or macOS (primary testing on macOS)
Python version: 3.10 or higher required for compatibility with JAX and NumPyro
Shell: Bash-compatible shell for running analysis scripts

How to Reproduce

Quick Start

Install dependencies:

pip install catalax

Train the neural ODE surrogate:

jupyter notebook TrainNeuralODE.ipynb

Run all cells to create trained/rateflowode.eqx

Run the complete analysis:

export XLA_FLAGS="--xla_force_host_platform_device_count=12"  # Adjust number for your CPU cores
chmod +x fit_all.sh
./fit_all.sh

What This Does

The analysis pipeline:

Uses Bayesian inference (MCMC) to estimate kinetic parameters with uncertainty quantification
Compares multiple model complexities (Models 04, 06, 07, 08)
Treats enzyme concentration as a free parameter in each experiment
Generates diagnostic plots and statistical summaries
Saves all results to the results/ directory
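Comparing the models amounts to one inference run per model file. The sketch below assembles such command lines in Python using the run_inference.py invocation documented in this README. It is an illustration only and not a description of what fit_all.sh actually does; in particular, passing --no-e0 to every file whose name ends in _no_e0 is an assumption generalized from the single documented model_08_no_e0.json example, and build_inference_command is a hypothetical helper.

```python
import sys
from pathlib import Path

# Model files as listed under "Available Models".
MODEL_FILES = [
    "models/model_04.json",
    "models/model_06.json",
    "models/model_07.json",
    "models/model_08.json",
    "models/model_04_no_e0.json",
    "models/model_08_no_e0.json",
]


def build_inference_command(model_path, script="run_inference.py"):
    """Return the argument list for one inference run."""
    model_path = Path(model_path)
    command = [sys.executable, script, model_path.as_posix()]
    if model_path.stem.endswith("_no_e0"):
        # Assumption: *_no_e0.json files are run with fixed enzyme concentration.
        command.append("--no-e0")
    return command


for model_file in MODEL_FILES:
    # Printed only; pass the list to subprocess.run(command, check=True) to execute.
    print(" ".join(build_inference_command(model_file)))
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later handed to subprocess.run.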

Individual Model Analysis

To analyze just one model:

python run_inference.py models/model_08.json

For analysis without enzyme concentration estimation:

python run_inference.py models/model_08_no_e0.json --no-e0

Outputs

Statistical Results Files

These files contain the quantitative outcomes of the parameter estimation and model evaluation:

{model_name}_summary.csv: Comprehensive MCMC parameter statistics including posterior means, standard deviations, 95% credible intervals, effective sample sizes (ESS), and R-hat convergence diagnostics. This file provides the key numerical results for parameter interpretation.
{model_name}_samples.nc: Complete posterior distribution samples stored in NetCDF format. Contains 10,000 samples × 12 chains for each parameter, enabling detailed uncertainty analysis, prediction intervals, and further statistical computations.
{model_name}_metrics.json: Model performance metrics including various error measures (L1, L2 losses), coefficient of determination (R²), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). These metrics allow comparison of model quality and complexity.
{model_name}_mean_e0.npy: Estimated enzyme concentrations for each experimental measurement (when E₀ estimation is enabled). This file contains the posterior mean enzyme concentrations that can be used for subsequent analyses or experimental validation.
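For readers comparing the AIC and BIC values across models, the textbook definitions are AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below implements these standard formulas; how the pipeline itself computes the likelihood, and whether the per-series enzyme concentrations count towards k, is not stated here, so treat this as a reference for interpretation rather than a reproduction of the stored values. The numbers in the example are made up.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Illustrative values only (not taken from this dataset): two models that fit
# equally well, one with 12 and one with 9 parameters, on 200 observations.
for label, k in (("12-parameter model", 12), ("9-parameter model", 9)):
    print(label, "AIC:", aic(-100.0, k), "BIC:", round(bic(-100.0, k, 200), 2))
```

At equal fit quality both criteria favor the model with fewer parameters, and BIC penalizes each extra parameter more strongly than AIC once n exceeds about 7 observations (ln n > 2).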

Visualization Outputs (plots/ subdirectory)

Diagnostic and result plots for model assessment and interpretation:

Trace plots: Time series of MCMC samples for each parameter, allowing visual inspection of mixing and convergence
Corner plots: Two-dimensional projections of parameter correlations and marginal distributions
Posterior distributions: Histograms and density plots showing parameter uncertainty
Model fit plots: Comparison of model predictions vs. experimental data over time
MCMC diagnostics: Monte Carlo Standard Error (MCSE) and Effective Sample Size (ESS) plots to assess sampling quality
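What a trace plot shows visually, the R-hat statistic expresses as a number: it compares the variance between chains with the variance within chains and approaches 1 when the chains agree. The following is a minimal sketch of the classic (non-split) Gelman-Rubin statistic using only the standard library. It is meant to illustrate the idea; current tools typically report a rank-normalized split variant, so values computed this way may differ slightly from those in the summary files. The synthetic chains are generated for illustration and are not data from this dataset.

```python
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic (non-split) potential scale reduction factor for one parameter.

    chains: list of equally long lists of draws, one inner list per chain.
    """
    n_chains = len(chains)
    n_draws = len(chains[0])
    if n_chains < 2 or n_draws < 2:
        raise ValueError("need at least 2 chains with at least 2 draws each")
    if any(len(chain) != n_draws for chain in chains):
        raise ValueError("all chains must have the same length")

    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: mean of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: variance of the chain means, scaled by the number of draws per chain.
    between = n_draws * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four chains sampling the same distribution (good mixing).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains where two are stuck around a different value (poor mixing).
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]

print(f"well-mixed chains:  {gelman_rubin_rhat(mixed):.3f}")
print(f"disagreeing chains: {gelman_rubin_rhat(stuck):.3f}")
```

The first value stays very close to 1, while the second is far above it, which is the pattern to look for when scanning the R-hat column of a summary file.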

Fitted Model Files (models/ subdirectory)

Updated model definitions with estimated parameters:

{model_name}_bi.json: Model with parameters set to the Bayesian posterior means. These are point estimates that summarize the posterior given the data and priors (for skewed posteriors the mean can differ from the most probable value), suitable for point predictions and further analysis.
{model_name}_fitted.json: Model with parameters optimized using deterministic methods. These parameters minimize the prediction error and are typically used for best-fit model predictions.
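The posterior means stored in the _bi.json files are summaries of the MCMC draws. As a minimal sketch of how such a summary can be derived from the pooled draws of one parameter, the code below computes the mean and an equal-tailed 95% credible interval with the standard library. The equal-tailed interval is one common choice and an assumption here; the interval type used for the dataset's summary files is not stated. The helper names and the example draws are illustrative only.

```python
import statistics


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarize_draws(draws, level=0.95):
    """Posterior mean and equal-tailed credible interval from pooled draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return {
        "mean": statistics.fmean(ordered),
        "lower": quantile(ordered, tail),
        "upper": quantile(ordered, 1.0 - tail),
    }


# Illustrative draws only (not values from this dataset).
example = summarize_draws([float(i) for i in range(1, 101)])
print(example)
```

For a strongly skewed posterior the mean can sit well away from the region of highest density, which is worth keeping in mind when using the posterior-mean model for point predictions.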

Catalax, 0.5.2

Python, 3.11

Identifier
DOI	https://doi.org/10.18419/DARUS-5539
Metadata Access	https://darus.uni-stuttgart.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.18419/DARUS-5539

Provenance
Creator	Range, Jan Peter ; Pleiss, Jürgen ; Bommarius, Andreas
Publisher	DaRUS
Contributor	Pleiß, Jürgen; Bommarius, Andreas; Range, Jan Peter
Publication Year	2025
Funding Reference	DFG SFB 1333 - 358283783 ; DFG EXC 2075 - 390740016 ; CDER U01FD006484
Rights	MIT License; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/MIT.html
OpenAccess	true
Contact	Pleiß, Jürgen (University of Stuttgart); Bommarius, Andreas (Georgia Institute of Technology); Range, Jan Peter (University of Stuttgart)

Representation
Resource Type	Dataset
Format	application/pdf; image/png; text/x-python; image/svg+xml; application/octet-stream; application/x-sh; application/json; application/netcdf; text/csv; text/tab-separated-values; application/x-ipynb+json
Size	69703; 295562; 14374; 457692; 394; 19318; 1408; 436; 1022749; 7724; 15577433; 4005444; 3897; 1742404; 192; 246; 908459; 7589; 7045717; 2359714; 3832; 1083902; 240; 1523898; 134974700; 904; 5453347; 3826; 2253080; 137536282; 1354; 5069033; 894714; 6047; 12758088; 3239178; 3279; 1521771; 2026277; 136486898; 1160; 4507011; 936259; 6636; 13556400; 3466199; 3270; 1557780; 247; 2171657; 136496423; 1205; 5235307; 958449; 6093; 11669606; 855010; 6010; 3470735; 1936711; 952068; 242; 1261416; 133073291; 682; 3934560; 3287734; 3063; 1570905; 244; 2998; 2075757; 135887942; 1142; 4492459; 23442; 111984; 23606; 108256; 3866; 12217; 4823; 4259; 735; 730; 448; 8305035; 256
Version	1.0
Discipline	Basic Biological and Medical Research; Biochemistry; Bioinformatics and Theoretical Biology; Biology; Chemistry; Computer Science; Computer Science, Electrical and System Engineering; Engineering Sciences; Life Sciences; Mathematics; Natural Sciences; Research Data Management