We present a scalable, end-to-end workflow for protein design. By augmenting protein sequences with natural language descriptions of their biochemical properties, we train generative models that can be preferentially aligned with protein fitness landscapes. We integrate complex experimental and simulation-based observations as preferred parameters for generating new protein variants, and demonstrate our workflow on five diverse supercomputers. We achieve >1 ExaFLOPS sustained performance in mixed precision on each supercomputer, with a maximum sustained performance of 4.11 ExaFLOPS and a peak performance of 5.57 ExaFLOPS. We establish the scientific performance of our model on two tasks: (1) optimizing the fitness-determining mutations in the yeast protein across a predetermined benchmark dataset of deep mutational scanning experiments, and (2) optimizing the design of the enzyme malate dehydrogenase to achieve lower activation barriers (and therefore increased catalytic rates) using simulation data. Our implementation thus sets high-water marks for multimodal protein design workflows.
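The link stated above between lower activation barriers and increased catalytic rates can be illustrated with the Eyring equation from transition state theory, k = (k_B T / h) exp(-ΔG‡ / RT). The sketch below is an illustration only: the text does not say which rate theory the simulations use, so the choice of the Eyring form, a transmission coefficient of 1, the default temperature, and the function names `eyring_rate` and `rate_speedup` are all assumptions.

```python
import math

# Physical constants in SI units.
K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J*s
R = 8.314462618      # molar gas constant, J/(mol*K)


def eyring_rate(barrier_kj_per_mol: float, temperature_k: float = 298.15) -> float:
    """Rate constant (1/s) from the Eyring equation.

    Assumes a transmission coefficient of 1 (hypothetical simplification).
    """
    barrier_j_per_mol = barrier_kj_per_mol * 1000.0
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-barrier_j_per_mol / (R * temperature_k))


def rate_speedup(barrier_reduction_kj_per_mol: float, temperature_k: float = 298.15) -> float:
    """Factor by which the rate grows when the barrier drops by the given amount."""
    return math.exp(barrier_reduction_kj_per_mol * 1000.0 / (R * temperature_k))


if __name__ == "__main__":
    # Illustrative barrier values only; these are not results from the paper.
    for barrier in (65.0, 60.0):
        print(f"barrier {barrier} kJ/mol -> k = {eyring_rate(barrier):.3e} 1/s")
    print(f"speedup for a 5.7 kJ/mol reduction: {rate_speedup(5.7):.2f}x")
```

Because the dependence is exponential, a barrier reduction of about 5.7 kJ/mol near room temperature corresponds to roughly a tenfold increase in rate under these assumptions, which is why small changes in a designed variant's barrier can matter.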
Gautham Dharuman is an Assistant Computational Scientist in the Data Science and Learning Division of Argonne National Laboratory. He earned his dual Ph.D. in Computational Science and Engineering and Electrical Engineering from Michigan State University in 2018. His research focuses on developing and applying advanced AI models and methods at scale to tackle challenges in automated scientific discovery. This work includes multimodal models for protein design workflows, preference optimization and reinforcement learning methods for incorporating experimental feedback from automated laboratories, neural operator surrogates for complex dynamical systems, agentic frameworks for scientific workflows, and scaling large language model training frameworks on emerging Exascale systems.