Generative Scientific Intelligence Designs a Radically Novel, Functional Lipase
Executive Summary
• Enzyme design is one of science’s grand challenges
• Structural approaches to enzyme design result in low hit rates
• Xyme has built an enzyme discovery platform which brings together quantum chemistry and protein science with state-of-the-art generative AI
• The platform delivers leading expression rates in cell-free protein synthesis (78% vs 30%), early failure detection (over 70% of failures identified in silico), and a 92% hit rate for enzyme activity
• Our Generative Scientific Intelligence (GSI) platform designed a radically novel and functional lipase with a sequence so distinct from the closest known lipase that we estimate it would take nature 1 billion years to evolve
Introduction
Over one third of the world’s Gross Domestic Product (GDP) is based on, or heavily influenced by, catalysed chemical reactions. Catalysts make chemical reactions run more easily and faster, and also enable reactions that are not possible by other means. This market is currently dominated by heterogeneous catalysts and zeolites, which often require high temperatures and pressures to function, leading to high energy consumption and economically challenging processes.
To consider an alternative to the current state, we need look no further than how nature performs chemistry in cells. This is achieved through the use of biological macromolecules called enzymes: molecular machines capable of precise and exquisite chemical reactions that make up nearly all the materials and processes of life. Enzymes offer compelling advantages over the heterogeneous catalysts and zeolites that currently dominate industrial catalysis, combining extreme chemo-, regio- and enantioselectivity with operation under remarkably mild conditions, in aqueous media and at near-ambient temperature and pressure. Their high catalytic efficiencies and ability to deliver single-step, high-yield routes to complex and chiral molecules can reduce the need for multistep sequences, energy-intensive conditions and extensive separations, while their benign, biodegradable and often metal-free nature improves the overall environmental and safety profile of catalytic processes.
To date the rational design of enzymes has been one of the grand challenges of science, requiring the convergence of advances in structural biology, chemistry and computer science. To design enzymes from scratch, a process known as de novo enzyme design, three major questions must be addressed:
1. Function – is the enzyme able to chemically perform the transformation desired?
2. Dynamics – are the enzyme and its chemistry stable under the atomic motions which inevitably occur in the real world?
3. Realisation – can the computationally designed enzyme be expressed in a lab and validated using an assay?
Addressing each of these areas has proved challenging due to the scale mismatch between them. For example, understanding the chemistry of function (1) requires working at the quantum-mechanical scale of the electron, whereas understanding the dynamics of an enzyme (2) requires length- and timescales many orders of magnitude greater, and therefore completely different computational approaches. Moving to realisation (3) requires a macro-scale understanding of processes and emergent complexity.
Delivering truly differentiated enzyme design requires us to think differently about both the problems and the algorithms which enable us to tackle them. In this report we demonstrate how combining state-of-the-art chemical and protein AI with a rigorous in silico screening pipeline can enable successful enzyme design campaigns with high expression rates and hit rates.
Lipases are the future workhorses of the $100B+ biofuels market, catalysing the transformation of triglycerides into biodiesel and renewable fuel components under mild, environmentally friendly conditions. In this technical note, we describe how our Generative Scientific Intelligence (GSI) platform has directly designed and experimentally validated XYM-357 - a functionally active lipase so radically novel that we estimate natural evolution would require one billion years to discover it.
Challenge 1: Function
Moving Beyond the Bioprospecting of Theozymes with Flow Matching
AI has revolutionised the way we think about what is possible. From the well-documented successes of large language models (LLMs) to the ability of image and video generators to provide high-quality content on demand, AI is transforming industries. In 2024, the Nobel Prize in Chemistry was awarded to Baker for computational protein design, and to Hassabis and Jumper for using AI to transform protein structure prediction from an expensive, artisanal endeavour into something so routine that you can now access it via a web server.
Recent years have seen remarkable progress in protein structure prediction, with models such as AlphaFold, Boltz, and Chai achieving near-experimental accuracy for single-chain and complex prediction. Attention has now shifted toward structure generation: using generative models to design proteins with prescribed functional properties. Methods such as RFDiffusion and, more recently, BoltzGen and La-Proteina have demonstrated impressive results in generating protein backbones conditioned on a structural motif, enabling the design of scaffolds around a fixed functional site. However, a fundamental limitation remains: the motif itself is taken directly from known data.
When the goal is to design catalytic function rather than simply reproduce it, we must move beyond a model where catalytic residues are held fixed as an input. One way to address this gap is to target the generation of catalytic pockets in a guided framework that allows the active site itself to be explored and refined toward enhanced catalytic properties.
Rather than generating entire protein chains, we focused on the generation of theozymes: up to 30 residues forming the catalytic active site, conditioned on a bound ligand. This focused scope enables rapid testing against active-site-specific properties without the cost of full-protein generation for every candidate. Our model was trained on an internally curated dataset of enzyme active sites obtained from the Protein Data Bank.
To generate a realistic estimate of the substrate conformation, the full reaction pathway for the triglyceride hydrolysis reaction was computed using our proprietary energy landscape software. The rate-limiting transition state (TS) was identified by analysing reaction rates across the entire landscape. This TS geometry was then aligned with the catalytic serine residue of the lipase triad and served as the ligand input for theozyme generation, providing the structural and chemical context required to guide the design of catalytically competent active sites.
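As a rough illustration of how a slow step might be picked out from a set of computed barriers, the sketch below applies the Eyring equation from transition state theory to each step and reports the one with the smallest rate constant. This is a simplified stand-in, not the proprietary energy landscape analysis described above, which considers rates across the whole landscape. The step names and barrier values are placeholders, not results from this report.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J*s
R = 8.314462618      # molar gas constant, J/(mol*K)


def eyring_rate(barrier_kj_mol, temperature_k=298.15):
    """Eyring rate constant (1/s) for a free-energy barrier given in kJ/mol."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_kj_mol * 1000.0 / (R * temperature_k))


def slowest_step(barriers_kj_mol, temperature_k=298.15):
    """Return (step_name, rate) for the step with the smallest rate constant."""
    rates = {
        name: eyring_rate(barrier, temperature_k)
        for name, barrier in barriers_kj_mol.items()
    }
    name = min(rates, key=rates.get)
    return name, rates[name]


# Placeholder barriers for illustration only; NOT values from the report.
example_barriers = {"step_a": 45.0, "step_b": 62.0, "step_c": 51.0}
step, rate = slowest_step(example_barriers)
```

Comparing steps one at a time ignores the stability of intermediates, so a per-step comparison like this is only a first approximation to a landscape-wide rate analysis.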
Two architectural decisions drove the largest performance gains. First, ligand binding information was incorporated directly as a pairwise bias within the attention mechanism, encoding ligand-residue distances and orientations into every attention layer and ensuring geometrically consistent residue placement relative to the substrate. Second, after a cost/benefit analysis, we decided not to use the popular optimal transport (OT) coupling in our model. While OT is theoretically motivated, we found that in practice it introduced excessive rigidity, constraining the model to near-straight transport paths and limiting exploration of the multi-modal landscape of catalytic geometries. Removing OT allowed more expressive trajectories and substantially improved structural diversity.
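The idea of a pairwise bias in attention can be sketched in a few lines of NumPy. The example below is a generic single-head attention layer with an additive bias on the logits, not the architecture used in the platform. The function `distance_bias` and its `length_scale` parameter are hypothetical: they turn pairwise distances into a penalty, whereas the model described above also encodes orientations.

```python
import numpy as np


def attention_with_pair_bias(q, k, v, pair_bias):
    """Single-head scaled dot-product attention with an additive pairwise bias.

    q, k, v:    arrays of shape (n_tokens, d)
    pair_bias:  array of shape (n_tokens, n_tokens), added to the logits
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + pair_bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def distance_bias(coords, length_scale=8.0):
    """Hypothetical bias: down-weight attention between distant tokens."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return -dist / length_scale


rng = np.random.default_rng(0)
n_tokens, d = 6, 4
coords = rng.normal(scale=5.0, size=(n_tokens, 3))  # toy residue/ligand positions
q, k, v = (rng.normal(size=(n_tokens, d)) for _ in range(3))
out, weights = attention_with_pair_bias(q, k, v, distance_bias(coords))
```

Because the bias enters before the softmax, geometric information shapes which tokens attend to each other at every layer where it is applied, rather than only at the input.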
Generated theozymes were evaluated against key hydrogen bond distances and angles across catalytically important residues, and the overall alignment required for catalytic competence. Structures satisfying all geometric thresholds were classified as passing and expanded into full enzymes. Tracking the pass rate across model variants provided a direct, physics-grounded performance signal independent of sequence identity or global fold quality. As shown in Figure 1, pass rates improved progressively as physical information was added: from baseline flow matching, to ligand conditioning, and finally to a full model with binding-aware pair bias, with each step yielding a measurable increase in catalytically competent outputs.
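A geometric pass/fail check of this kind could look something like the sketch below. The thresholds (3.5 Å donor-acceptor distance, 120° donor-hydrogen-acceptor angle) are common illustrative cutoffs chosen for this example, not the thresholds used in the report, and the function names are hypothetical.

```python
import numpy as np


def distance(a, b):
    """Euclidean distance between two points, in the units of the input."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))


def angle_deg(a, b, c):
    """Angle at point b formed by a-b-c, in degrees."""
    ba = np.asarray(a, float) - np.asarray(b, float)
    bc = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def hbond_ok(donor, hydrogen, acceptor, max_dist=3.5, min_angle=120.0):
    """Check one hydrogen bond against illustrative (assumed) thresholds."""
    return (
        distance(donor, acceptor) <= max_dist
        and angle_deg(donor, hydrogen, acceptor) >= min_angle
    )


def design_passes(hbonds):
    """A design passes only if every listed hydrogen bond is satisfied."""
    return all(hbond_ok(d, h, a) for d, h, a in hbonds)


def pass_rate(designs):
    """Fraction of designs whose hydrogen bonds all satisfy the thresholds."""
    if not designs:
        return 0.0
    return sum(design_passes(hb) for hb in designs) / len(designs)
```

Each design is represented here as a list of (donor, hydrogen, acceptor) coordinate triples, so the pass rate can be compared directly across model variants.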
Xyme Design Rounds Produce Novel Enzymes
One of the areas in which we look to innovate is the ability to generate enzymes with novel active sites. Since the concept of a theozyme is a somewhat artificial construct, to ensure fair comparisons we consider novelty for both the first shell (any residue within 4 Å of the substrate) and the second shell (any residue within 4 Å of the first shell), since it is well established that second shell effects are important for active site characterisation.
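The shell definitions above translate directly into a distance query. The sketch below is one possible reading of them, with two assumptions that the text does not state: distances are measured between any pair of atoms, and first-shell residues are excluded from the second shell. The function names and the input layout (a mapping from residue id to atom coordinates) are hypothetical.

```python
import numpy as np


def residues_within(query_atoms, residues, cutoff=4.0, exclude=frozenset()):
    """Ids of residues with any atom within `cutoff` Å of any query atom."""
    query = np.asarray(query_atoms, dtype=float).reshape(-1, 3)
    hits = set()
    if query.shape[0] == 0:
        return hits
    for res_id, atoms in residues.items():
        if res_id in exclude:
            continue
        xyz = np.asarray(atoms, dtype=float).reshape(-1, 3)
        # All pairwise atom-atom distances between this residue and the query.
        dists = np.linalg.norm(xyz[:, None, :] - query[None, :, :], axis=-1)
        if dists.min() <= cutoff:
            hits.add(res_id)
    return hits


def shells(substrate_atoms, residues, cutoff=4.0):
    """Return (first_shell, second_shell) residue id sets."""
    first = residues_within(substrate_atoms, residues, cutoff)
    first_atoms = [atom for r in first for atom in residues[r]]
    # Assumption: the second shell excludes residues already in the first shell.
    second = residues_within(first_atoms, residues, cutoff, exclude=first)
    return first, second
```

For example, with a substrate atom at the origin and single-atom residues at 3 Å, 6 Å and 12 Å along one axis, the first residue lands in the first shell, the second in the second shell, and the third in neither.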
Across the design round, we analysed the similarity of each theozyme to naturally occurring structures separately for the top-performing experimentally validated candidates and for the median-performing candidates. We note that there is a degree of similarity between the theozymes of our top structures and the closest natural theozymes. This can be explained to a degree by the existence of lipases in the PDB and related training data. Whilst novelty is important, we would also want any AI model to be able both to discover existing performant motifs and to invent new ones, since failing to do so could miss significant opportunities. As we move to the second shell, however, we observe a significant drop-off in similarity. This is particularly true for XYM-357, a small yet functional lipase which has a similarity in the second shell of only 38% to any known lipase.
We find this particularly compelling. With an overall sequence identity to the closest natural lipase of just 25%, XYM-357 represents a genuinely novel solution that we estimate would take nature approximately one billion years to evolve. This demonstrates true generative capability rather than PDB memorisation; we are creating functional enzymes that occupy entirely unexplored and thus patentable regions of sequence space. The fact that these radically novel designs maintain catalytic activity validates our physics-first approach, deriving function from first-principles quantum chemistry.
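Sequence identity figures depend on how the alignment is produced and which columns are counted, so it helps to make the convention explicit. The sketch below assumes two sequences that have already been aligned (with "-" for gaps) and counts identity over columns where neither sequence has a gap. That convention is an assumption for illustration; the report does not state which definition was used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

Other common conventions divide by the alignment length or by the length of the shorter sequence, which can shift the reported value by several percentage points.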
It is important to note that this enzyme represents the smallest functional lipase reported, with ~50% less mass than common commercially available lipases such as CALB. In a world where products are measured in activity per ton, size reduction is a powerful tool to improve margins, and an important target for our design engine. Excitingly, we do not believe that the current performance characteristics of these “first-pass” designs in any way represent the ceiling for our GSI platform, and we hope to be able to demonstrate significant improvements to this enzyme in the near future.
Challenge 2: Dynamics
Catching Elusive Failure Modes with In Silico Filtering
A key principle of our pipeline is that single-structure (“static”) evaluation is not sufficient when starting from a single ML-predicted model. Many relevant observables, such as catalytic distances, are highly conformation-dependent. In real-world operations an enzyme will exhibit many different conformational states. Therefore, wherever possible, we evaluate properties as ensemble averages over molecular dynamics (MD) trajectories.
Internal benchmarking highlighted that 100 ns of production MD using a custom parameterization of AMBER in explicit TIP3P water provides a good compromise between speed and discriminative power. We also found it valuable to apply context-dependent restraints along the reactive mode, both to maintain the relevant TS geometry and to encourage sampling of catalytically meaningful conformations.
From the resulting MD trajectories the following properties were computed and averaged over all frames:
1. Catalytic distances: distances between atoms/residues involved in proton transfer were monitored to quantify the fraction of time the catalytic geometry remained accessible.
2. Active-site SASA: used to detect pocket collapse or loss of accessibility during conformational fluctuations.
3. RMSD (local stability): RMSD was partitioned across protein regions, with a focus on active-site RMSD to ensure the designed theozyme geometry remained stable after relaxation into solution.
4. Hydrophobic profiling: average hydrophobicity in the binding region was tracked to ensure compatibility with the hydrophobic pocket expected for lipase substrates.
Designs failing any of the SASA, RMSD, catalytic-distance, or hydrophobicity criteria were removed.
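The filtering logic above can be sketched as a set of ensemble-averaged checks over per-frame observables. In the example below every threshold value and dictionary key is a hypothetical placeholder, since the report does not give the actual cutoffs. The per-frame arrays are assumed to have been extracted from the MD trajectories beforehand.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Thresholds:
    """Hypothetical cutoffs for illustration; not the values used in the report."""
    max_catalytic_dist: float = 3.5      # Å, per-frame competence cutoff
    min_competent_fraction: float = 0.5  # fraction of frames within the cutoff
    min_site_sasa: float = 50.0          # Å^2, mean active-site SASA
    max_site_rmsd: float = 2.0           # Å, mean active-site RMSD
    min_hydrophobicity: float = 0.0      # mean binding-region hydrophobicity


def evaluate(traj, thr):
    """Return a dict of named pass/fail checks for one design.

    traj maps observable names to per-frame sequences of equal length.
    """
    cat = np.asarray(traj["catalytic_dist"], dtype=float)
    competent_fraction = float(np.mean(cat <= thr.max_catalytic_dist))
    return {
        "catalytic_distance": competent_fraction >= thr.min_competent_fraction,
        "sasa": float(np.mean(traj["site_sasa"])) >= thr.min_site_sasa,
        "rmsd": float(np.mean(traj["site_rmsd"])) <= thr.max_site_rmsd,
        "hydrophobicity": float(np.mean(traj["hydrophobicity"]))
        >= thr.min_hydrophobicity,
    }


def passes(traj, thr=None):
    """A design is kept only if it passes every criterion."""
    return all(evaluate(traj, thr or Thresholds()).values())
```

Returning the individual checks as well as the overall verdict makes it possible to see which failure mode removed a design, which is useful when tuning the generative model.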
For designs that passed dynamic filtering, candidates were reranked using electrostatic profiling calculated along the reactive coordinate. The electrostatic potential was determined using QM/MM-derived charges and averaged over the MD ensemble. This electrostatic alignment with the reactive mode was used as a re-ranking rather than filtering metric due to both its mechanistic importance and its sensitivity to conformation. This ranking, in concert with scores pertaining to higher-order properties such as expressibility, was then used to down-select candidates for final evaluation.
We believe that this detailed pipeline is very valuable in identifying, before experimental validation, failure modes which fall outside the scope of current generative AI models. In a designed test, our pipeline was able to identify over 70% of designs which would go on to fail experimental validation. For the lipase design round described in this report, we observed that it was still possible to identify and remove a significant percentage of structures that were highly unlikely to display activity in our experimental validation. That being said, the pass-through rates are high (70-80%), indicating that the overall quality of the candidate list generated by our AI system is high.
Xyme Design Rounds Have Class Leading Hit Rates
To date, the prediction and design of de novo enzymes has not had the widespread success of the more general protein structure prediction challenge. Whilst there have been some notable successes, these studies have typically been characterised by low hit rates:
These hit rates demonstrate the challenge of treating enzyme design as a structural data problem, which can be solved through carefully curated datasets and model tuning on structural losses. Following the in silico screening that was performed on the outputs of the generative AI-derived candidates, and subsequent expression, 52 candidates were sent for validation against an experimental activity assay. Of those 52 enzymes, 48 displayed functional activity: a 92% hit rate, which is well above the hit rates seen for other approaches.
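The hit rate follows directly from the counts above (48 active out of 52 tested). Because the sample is modest in size, it can also be useful to attach an uncertainty estimate. The sketch below computes the rate and a Wilson score interval, which is one standard choice for a binomial proportion and is offered here as an illustration; the report itself quotes only the point estimate.

```python
import math


def hit_rate(hits, tested):
    """Fraction of tested candidates that showed activity."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return hits / tested


def wilson_interval(hits, tested, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = hit_rate(hits, tested)
    denom = 1.0 + z * z / tested
    centre = (p + z * z / (2.0 * tested)) / denom
    half = z * math.sqrt(p * (1.0 - p) / tested + z * z / (4.0 * tested**2)) / denom
    return centre - half, centre + half


rate = hit_rate(48, 52)
low, high = wilson_interval(48, 52)
print(f"hit rate: {rate:.1%}, interval: {low:.1%} to {high:.1%}")
```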
Challenge 3: Realisation
Xyme Designed Enzymes Displayed Enhanced Cell Free Expression
Cell-free protein synthesis (CFPS) is an emerging technology for the synthesis of proteins which does not rely on a cellular host. Instead, the cellular machinery necessary for protein folding is extracted from a cell and transposed to an in vitro environment. This approach makes the fast prototyping of de novo designs possible, and in addition enables expression of proteins which might be toxic to a heterologous host.
Unfortunately, CFPS has in common practice been characterised by low expression rates, which have limited its use, with 30% expression rates being considered a reasonable expectation.
Since our GSI platform is capable of designing to arbitrary property profiles, we were able to positively select for sequences which had a strong likelihood of expressing via CFPS. We employed a range of fast scoring tools that predict key properties of the computationally generated enzymes to achieve an initial candidate ranking. These scores span sequence and structure-based assessment and are predicted using both protein language models and classical bioinformatics tooling. We have found that targeting thermostability, solubility and aggregation propensity enables us to achieve high expression rates in experimental assessment.
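One simple way to combine several predicted properties into an initial ranking is a weighted sum of standardised scores. The sketch below does this for the three properties named above and is purely illustrative: the equal weights, the z-score normalisation and the function name are assumptions, and the predictors that would supply the input scores are not shown.

```python
import numpy as np


def rank_candidates(names, thermostability, solubility, aggregation,
                    weights=(1.0, 1.0, 1.0)):
    """Rank candidates by a weighted sum of z-scored property predictions.

    Higher thermostability and solubility are rewarded; higher aggregation
    propensity is penalised. Returns (name, score) pairs, best first.
    """
    def zscore(values):
        x = np.asarray(values, dtype=float)
        sd = x.std()
        return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)

    score = (
        weights[0] * zscore(thermostability)
        + weights[1] * zscore(solubility)
        - weights[2] * zscore(aggregation)
    )
    order = np.argsort(-score, kind="stable")
    return [(names[i], float(score[i])) for i in order]
```

Standardising each property first keeps one score with a large numerical range, such as a predicted melting temperature, from dominating the others.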
Since the introduction of this approach to our system we have seen a significant enhancement in CFPS rates, with some expression rounds reaching expression rates as high as 92%. For the lipase designs discussed in this report, the expression rate was 78% - well above the industry average.
Summary
Xyme's Generative Scientific Intelligence platform has successfully designed XYM-357, a radically novel and functional lipase with only 25% sequence identity to the closest known natural lipase - a sequence so distant from biology that we estimate it would take nature approximately one billion years to evolve. The platform integrates quantum chemistry-derived energy landscapes, physics-infused flow matching for catalytic active site generation, and a multi-stage in silico filtering pipeline that identified over 70% of experimental failures before a single experiment was run. Of 52 candidates taken forward to experimental validation, 48 showed functional activity, giving a 92% hit rate that stands well above anything reported in the recent de novo enzyme design literature. Expression in cell-free protein synthesis reached 78%, more than double the 30% typically considered acceptable. Beyond its novelty, XYM-357 is also the smallest functional lipase on record, with roughly half the mass of commercial alternatives like CALB, a meaningful advantage in a market where performance is measured per ton. Taken together, these results make the case that designing from first-principles quantum chemistry, rather than pattern-matching on structural databases, is the route to enzymes that are simultaneously novel, functional, and industrially relevant.