Analyzing small molecule inhibition of enzymes: A preliminary machine learning approach towards drug lead generation

Philip, Pearl
My University, 2017
Beck, David A.C.
Thesis (Master's)--University of Washington, 2017-06
data science; drug design; genetic algorithms; machine learning; molecule; QSAR
Pharmaceutical sciences; Pharmacology; Information science; Chemical engineering
2017-08-11
en-US
CC BY
Philip_washington_0250O_17372.pdf (application/pdf)
http://hdl.handle.net/1773/39980

This project implements quantitative structure-activity relationship (QSAR) models in Python to predict the inhibitory action of small-molecule drugs on USP1, an enzyme essential to DNA repair in proliferating cancer cells. Molecular descriptors are calculated using PyChem and used to characterize about 400,000 drug-like compounds from a high-throughput screening assay available on PubChem. After feature selection and processing, multiple machine learning models are trained on the data using Scikit-learn and Theano. A genetic algorithm then generates an ideal enzyme inhibitor to be tested for activity and for use as a drug compound. The high error and poor model fits can be attributed to several sources: measurement of activity using AC50, a dataset imbalanced in favor of molecules with zero inhibition, an incomplete feature space, highly non-linear interactions between the enzyme and the drug, and convergence to local minima during hyperparameter optimization. Solutions are suggested for each of these issues and proposed as future work. The genetic algorithm is used to synthesize a molecule in silico; as model prediction accuracy improves, that molecule can be pursued as a drug lead in clinical trials.
This project provides a promising pipeline for future work in open-source molecular drug design and can be extended for use with other datasets and target species.
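
The genetic-algorithm step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not describe the encoding, operators, or fitness function used. The sketch assumes candidates are fixed-length vectors of normalized descriptors, and the names predicted_activity, TARGET, mutate, crossover, and evolve are hypothetical. The scoring function is a toy stand-in for a trained QSAR model's prediction.

```python
import random

# Hypothetical stand-in for a trained QSAR model (an assumption, not from
# the thesis): the score is highest when the descriptors equal TARGET.
TARGET = [0.2, 0.8, 0.5, 0.1]


def predicted_activity(descriptors):
    """Toy fitness: negative squared distance to TARGET (0 is the best score)."""
    return -sum((d - t) ** 2 for d, t in zip(descriptors, TARGET))


def mutate(individual, rng, rate=0.2, scale=0.1):
    """Add Gaussian noise to each gene with probability `rate`, clamped to [0, 1]."""
    return [
        min(1.0, max(0.0, g + rng.gauss(0.0, scale))) if rng.random() < rate else g
        for g in individual
    ]


def crossover(a, b, rng):
    """Single-point crossover: head of parent `a`, tail of parent `b`."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(pop_size=30, generations=50, seed=0):
    """Evolve descriptor vectors; return the best one and the best score per generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in TARGET] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=predicted_activity, reverse=True)
        history.append(predicted_activity(population[0]))
        # Elitism: the top half survives unchanged, so the best score never drops.
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=predicted_activity)
    history.append(predicted_activity(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print("best descriptors:", [round(g, 3) for g in best])
    print("best score:", history[-1])
```

In a real pipeline the fitness call would be replaced by the trained model's predicted activity, and a descriptor vector found this way would still have to be mapped back to a chemically valid molecule, which this sketch does not attempt.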