Deep Inverse Design, Discovery, and Optimization of Molecular Structure through 3D Invariant and Multimodal Machine Learning
Accelerating the discovery of novel materials and molecules with desired functionalities is crucial for continuing our progression towards technical solutions to some of the world’s most pressing issues, including disease and climate change. The unique structural and atomic makeup of a molecule dictates its functional behavior, yet we still lack a fundamental understanding of the relationship between structure and function that would allow us to quickly predict the effect of structural modifications on downstream chemical behavior. Learning this mapping would vastly expedite molecular discovery and is at the heart of the field of de novo molecular design. In this work, we explore a variety of generative statistical machine learning methods for approximating the joint probability manifold of molecular structure and function (JPM-SF). There are several key choices that impact a model’s ability to accurately learn this manifold, navigate it, and exploit the regions that it predicts to have optimal properties. The choice of molecular representation both fed to and generated by the model determines the amount of structural information embedded within the model. Many properties are dependent on the distances and angles between atoms, and thus 3D representations are often necessary to accurately approximate the JPM-SF. However, the complexity of modeling and generating 3D structures imposes an increased computational burden, and empirical results suggest that models which approximate the distribution of 1D sequence-based molecular representations are better at replicating the physicochemical property distributions of the training set from which they are drawn. The choice of model architecture determines the inductive priors given to the model.
These can include both structural priors, such as the rotational and translational invariance of molecular structures with respect to their intrinsic properties, and statistical priors, such as the dimensionality and type of distribution from which latent variables are sampled. The choice of sampling and optimization methods determines, in part, the efficiency of the model at achieving a particular goal during inference. This can include adopting design criteria beyond those that the model was explicitly trained to predict and leveraging the mathematical structure of latent variables to aid in exploration and decision making. Each of these choices affects the others, and thus we must take a holistic approach when developing a de novo design algorithm to tackle a new problem. Herein, we present a detailed look at the characteristics and performance of three generative de novo molecular design methods with respect to these design decisions. These will serve as case studies to compare the effect that the choices enumerated above have on the ultimate utility of such methods in real-world molecular design scenarios.
- Chemical engineering