Machine Learning Methods and Their Applications in Precision Medicine

Image by David Shia

Image by David Shia

Whenever the topic of artificial intelligence is brought up, the first thought that comes to mind is the multitude of human-like robots increasingly featured in TV shows and movies. However, despite our collective fascination with the anthropomorphized artificial intelligence in mainstream media, perhaps the most exciting promise of artificial intelligence is its capacity to perform data analysis at a level impossible for humans.


The field of machine learning is a branch of computer science in which statistical models are employed in algorithms that give a computer the capability to learn from given data without any explicit instructions. In many cases, the ultimate goal is to produce a model that can perform predictive decision-making based on analysis of a given input dataset, oftentimes much too large and complex for human comprehension. Such methods lend themselves well to fields in which the complexity of datasets makes it difficult for the human eye to discern meaningful patterns, with one particularly exciting field being precision medicine.


While the fine details of machine learning methods can be rather inaccessible to those without backgrounds in computer science, statistics, or mathematics, the most basic concepts are key to understanding the types of problems that machine learning seeks to solve. Machine learning algorithms can be divided into the specific tasks that each algorithm is designed to perform. On one end of the spectrum is supervised learning, in which an algorithm is given a set of input data and a set of corresponding correct outputs. From this initial dataset, the algorithm then builds a model that is challenged with a new set of input data. With the new, previously unseen input data, the model generates a set of outputs that can then be compared to the true outputs, which are then used to assess the performance of the trained model [1]. This can be thought of as how a student is taught how to do arithmetic in that they are given examples and are then tested with different, previously unseen problems to assess how well they have learned to perform the task. Two common supervised learning problems are classification and regression. Classification aims to assign inputs to certain categories, while regression aims to predict a continuous value for a given input. A number of methods exist to accomplish these tasks and are briefly summarized in Table 1.


Table 1: A brief summary of common machine learning methods.

Table 1: A brief summary of common machine learning methods.

Unsupervised learning can be thought of as the other extreme in the spectrum of machine learning tasks. In this type of modeling, an algorithm is only given an input dataset. Commonly, the algorithm is tasked with either clustering the input data into relevant groups anew or reducing the complexity of the dataset by determining and removing aspects of the data that result in the most minimal loss of meaningful information. Oftentimes, these types of analysis are used in exploratory data analysis or data mining, where the goal is to derive new meaning or knowledge from the dataset being analyzed. This contrasts nicely with the goal of supervised learning methods, in which the goal is to create a model that allows for accurate replication of relationships between inputs and outputs to be used for prediction. An overview of methods in unsupervised learning is also provided in Table 1.


With the ability to parse through incomprehensibly large datasets and observe patterns that allow for effective decision making, machine learning algorithms are a powerful and promising technology to be applied in the field of precision medicine. Ultimately, this boils down to a final task of classification. Proponents of the precision medicine model believe in being able to use the massive amount of available data associated with patients – including lifestyle, environmental, molecular, and genetic – in an integrative manner in order to determine optimal modes of treatment. In the arena of medicine, with its foundation on the imperfect science of biology, finding the most important factors in the massive wealth of patient data to inform decision making can be a daunting task. This job is certainly no easier when considering the wealth of biological data that we have yet to understand of in the context of disease.


However, such an undertaking is being tackled head on. In a massive tour de force, a study by Costello et. al sought to predict breast cancer cell responses to 28 drugs using machine learning methods. The study was set up as a competition among different labs across the United States to develop an algorithm that could most accurately predict how sensitive a set of breast cancer cells lines were to the 28 different drugs [2, 3]. Competing groups were provided with drug response data for only a portion of the cell lines and were tasked with accurately predicting the response for the remaining cell lines to the drugs. The final methods developed and submitted encompassed a number of different statistical approaches, including advanced variations on decision tree methodology and support vector machines, linear regression, and non-linear regression. Using a ranking system to evaluate performance of the methods, the top two methods were identified and validated using a resampling analysis over 10,000 iterations [3].


In their analysis of these methods, the authors identified three key features that contributed to increased accuracy of the predictions: 1) the ability for a given algorithm to model nonlinear relationships, 2) the ability to combine heterogeneous data types (i.e. genomic, transcriptomic, and proteomic profiles) into a single model – termed multiview learning, and 3) incorporation of methods that could leverage prior knowledge in the form of known biological pathways. Put more simply, the best algorithms were able to look for correlations between pieces of the data that were more complex than just linear relationships. Moreover, top performing algorithms incorporated datasets derived from different “omic” methods and further increased accuracy by taking into account previous knowledge about given pathways when building final predictive models [3].


Perhaps most compelling was the demonstration that using multiple predictive models in combination could result in far superior predictive performance compared to that of any single given model. Termed ‘wisdom of crowds,’ this phenomenon requires that the individual predictive models being combined provide complementary information [2]. The successful demonstration of boosted prediction suggests that the combined use of different predictive methods can impart complementary information that lead to more accurate prediction. Thus, having a single ‘best’ algorithm should not be the goal in future iterations of drug sensitivity prediction. Rather, being able to build a number of optimized models each using complementary but distinct statistical models and combining their predictions to guide final classifications should be the goal.


The modeling of cancer cell line drug sensitivity may seem like a far cry from researchers trying to resolve discordant behaviors between cell lines and bona fide tumors. However, this demonstration of the predictive capability of machine learning is an important step towards the goal of identifying meaningful treatment options for patients with no further options.


David Shia

Staff Writer, Signal to Noise Magazine

MD/PhD Candidate in Molecular Biology Interdepartmental Doctoral Program, UCLA




[1] Malik, Y. & Jens, A (Eds). miRNomics: microRNA biology and computational analysis. Springer Protocols (2014).

[2] Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophysical reviews, 1-9 (2018).

[3] Costello, J. C., et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology32(12), 1202-1212 (2014).