# Literature Review

The application of machine learning in medicine has garnered enormous
attention over the past decade [@rabbani2022]. Artificial intelligence
(AI), and especially its subdiscipline machine learning (ML), has
become a topic of increasing interest among laboratory professionals.
AI is a broad term that can be defined as the theory and development of
computer systems that perform complex tasks typically requiring human
intelligence, such as decision-making, visual perception, speech
recognition, and translation between languages. ML is the science of
programming computers so that they can learn from data without being
explicitly programmed [@debruyne2021]. The growing use of ML in clinical
and basic medical research is reflected in the number of PubMed-indexed
papers whose titles and abstracts refer to it: comparing papers
published until 2006 with those published in 2007--2017 shows a nearly
10-fold increase, from 1000 to slightly more than 9000 articles
[@cabitza2018]. A literature review by Rabbani et al. found 39 articles
on ML in the field of clinical chemistry in laboratory medicine
published between 2011 and 2021 [-@rabbani2022].

## A Brief Primer on Machine Learning

Although this literature review does not aim to provide an extensive
account of the mathematics behind ML algorithms, some basic concepts
are introduced to allow a sufficient understanding of the topics
discussed in the paper. ML models can be classified into broad
categories based on several criteria. These criteria include the type
of supervision, whether or not the algorithm can learn incrementally
from an incoming stream of data (batch versus online learning), and how
the model generalizes (instance-based versus model-based learning)
[@debruyne2021]. Rabbani et al. further classified the specific clinical
chemistry uses into five broad categories: predicting laboratory test
values, improving laboratory utilization, automating laboratory
processes, promoting precision laboratory test interpretation, and
improving laboratory medicine information systems [-@rabbani2022].

### Supervised vs. Unsupervised Learning

Four important categories can be distinguished based on the amount and
type of supervision the models receive during training: supervised,
unsupervised, semi-supervised, and reinforcement learning. In
supervised learning, the training data are labeled, so data samples are
predicted with knowledge of the desired solutions [@debruyne2021].
Supervised models are typically used for classification and regression.
Some of the essential supervised algorithms are Linear Regression,
Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines
(SVMs), Decision Trees (DTs), Random Forests (RFs), and supervised
neural networks. In unsupervised learning, the training data are
unlabeled. In other words, observations are classified without prior
knowledge about the data samples [@debruyne2021]. Unsupervised
algorithms can be used for clustering (e.g., k-means clustering,
density-based spatial clustering of applications with noise,
hierarchical cluster analysis), visualization and dimensionality
reduction (e.g., principal component analysis (PCA), kernel PCA,
locally linear embedding, t-distributed stochastic neighbor embedding),
anomaly detection and novelty detection (e.g., one-class SVM, isolation
forest), and association rule learning (e.g., Apriori, Eclat). Some
models can also deal with partially labeled training data (i.e.,
semi-supervised learning). Finally, in reinforcement learning, an agent
(i.e., the learning system) learns what actions to take to optimize the
outcome of a strategy (i.e., a policy) or to obtain the maximum
cumulative reward [@debruyne2021]. This process resembles a human
learning to ride a bike. It is typically used for learning games, such
as Go, chess, or even poker, or in settings where the outcome is
continuous rather than dichotomous (i.e., right or wrong)
[@debruyne2021]. The proposed study will use supervised learning because
the data are labeled and a particular outcome is expected.
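
The difference between learning with and without labels can be
illustrated with a minimal sketch. Everything in it is an assumption
made for illustration: the function names, the single numeric feature,
the data values, and the choice of 1-nearest-neighbor and two-cluster
k-means as stand-ins for the supervised and unsupervised families.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: return the label of the closest labeled training value."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def two_means(values, iterations=10):
    """Unsupervised: 1-D k-means with k = 2 (needs two distinct values)."""
    c_low, c_high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        # Assign each value to its nearest center, then recompute the centers.
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        c_low, c_high = sum(low) / len(low), sum(high) / len(high)
    return c_low, c_high


# Hypothetical one-feature data set (illustrative values only).
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["normal", "normal", "normal", "abnormal", "abnormal", "abnormal"]

predicted = nearest_neighbor_predict(values, labels, 4.5)  # uses the labels
centers = two_means(values)                                # ignores the labels
```

The supervised function cannot run without `labels`, whereas the
clustering function recovers two groups from the values alone and leaves
their interpretation to the analyst.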

### Model Types

#### Random Forests

Random forests are an ensemble learning method that combines multiple
decision trees to make predictions or classify data. The method was
introduced by Leo Breiman in 2001 and has since gained popularity due to
its robustness and accuracy [@liaw2002]. The algorithm grows many
decision trees, each trained on a different bootstrap sample of the
data, a technique known as bootstrap aggregating or "bagging." The
random forests algorithm (for both classification and regression) is as
follows:

1. Draw ntree bootstrap samples from the original data.

2. For each bootstrap sample, grow an unpruned classification or
   regression tree, with the following modification: at each node,
   rather than choosing the best split among all predictors, randomly
   sample mtry of the predictors and choose the best split from among
   those variables. (Bagging can be considered a special case of random
   forests obtained when mtry = p, the number of predictors.)

3. Predict new data by aggregating the predictions of the ntree trees
   (i.e., majority vote for classification, average for regression)
   [@liaw2002].
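
The three steps above can be sketched in plain Python for the
classification case. This is a teaching sketch under stated assumptions,
not the implementation of any particular package: the function names,
the Gini impurity split criterion, the toy data, and the default values
of `ntree`, `mtry`, and `seed` are all choices made here for
illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def grow_tree(X, y, mtry, rng):
    """Grow an unpruned tree; each node only considers mtry random predictors."""
    if len(set(y)) == 1:
        return y[0]  # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted({row[j] for row in X})[:-1]:  # candidate thresholds
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority(y)  # sampled predictors are constant in this node
    _, j, t, left, right = best
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))


def predict_tree(tree, row):
    """Follow the splits until a leaf (a label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def random_forest(X, y, ntree=25, mtry=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        # Step 1: bootstrap sample (drawn with replacement).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        # Step 2: unpruned tree with random predictor sampling at each node.
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Step 3: aggregate the trees by majority vote."""
    return majority([predict_tree(tree, row) for tree in forest])


# Hypothetical two-predictor data set (illustrative values only).
X = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.5], [1.1, 10.5],
     [3.0, 20.0], [3.2, 21.0], [2.9, 19.5], [3.1, 20.5]]
y = ["normal"] * 4 + ["abnormal"] * 4

forest = random_forest(X, y)
low_case = predict_forest(forest, [0.5, 9.0])
high_case = predict_forest(forest, [3.5, 22.0])
```

Setting `mtry` equal to the number of predictors in this sketch reduces
it to plain bagging, which mirrors the special case noted in step 2.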

Random forests offer several advantages that make them well-suited for
predictive modeling in healthcare:

1. Robustness: Random forests are less prone to overfitting than
   individual decision trees. Aggregating multiple trees helps to
   reduce the impact of outliers and noise in the data, resulting in
   more stable and reliable predictions.

2. Variable Importance: Random forests provide estimates of the
   importance of different features in making predictions. This
   information aids in feature selection, identifying the most
   influential factors, and gaining insights into the underlying data
   relationships.

3. Handling Complex Data: Random forests can handle various data types,
   including categorical and numerical features, without extensive
   preprocessing. This flexibility makes them suitable for healthcare
   datasets, which often comprise diverse variables [@breiman2001a].

#### Gradient Boosting

Gradient boosting machines (GBMs) are an extremely popular machine
learning algorithm that has proven successful across many domains and
is one of the leading methods for winning Kaggle competitions. Whereas
random forests build an ensemble of deep independent trees, GBMs build
an ensemble of shallow trees in sequence, with each tree learning from
and improving on the previous one. Although shallow trees by themselves
are relatively weak predictive models, they can be "boosted" to produce
a powerful "committee" that, when appropriately tuned, is often hard to
beat with other algorithms [@boehmke2020]. Gradient boosting involves
the following key steps:

1. Building an Initial Model: The algorithm creates an initial model,
   typically a simple decision tree, to make predictions.

2. Calculation of Residuals: The residuals represent the differences
   between the actual values and the predictions of the current model.

3. Fitting Subsequent Models: Subsequent weak models are trained to
   predict the residuals of the previous model. These models are fitted
   to minimize residual errors, typically using gradient descent
   optimization.

4. Ensemble Creation: The predictions of all the weak models are
   combined by summing them, creating a strong predictive model.

5. Iterative Improvement: The process is repeated for multiple
   iterations, with each new model attempting to further reduce the
   errors made by the previous models [@chen2016].
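
The steps above can be sketched for a regression problem with
squared-error loss, in which case the residuals coincide with the
negative gradient of the loss. The sketch makes several simplifying
assumptions that are not part of the cited descriptions: the initial
model is a constant (the mean) rather than a tree, each weak learner is
a one-split "stump" on a single predictor, and the `learning_rate`
shrinkage parameter, the function names, and the data are introduced
here only for illustration.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_stump(x, residuals):
    """Least-squares one-split regression tree (x needs two distinct values)."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate split points
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    base = sum(y) / len(y)  # step 1: initial model (a constant here)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_trees):  # step 5: repeat for several iterations
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # step 2
        stump = fit_stump(x, residuals)                   # step 3
        stumps.append(stump)
        # Step 4: add the (shrunken) prediction of the new weak model.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        history.append(mse(y, pred))  # training error after this iteration

    def model(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return model, history


# Hypothetical data with a roughly linear trend (illustrative values only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

model, history = gradient_boost(x, y)
```

In this sketch `history` records the training error after each
iteration, which makes the sequential error reduction described in step
5 directly visible. A smaller `learning_rate` slows that reduction and
usually calls for more trees.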

Gradient boosting offers several advantages, including:

1. High Predictive Accuracy: By combining multiple weak models,
   gradient boosting can achieve high predictive accuracy, often
   outperforming other machine learning algorithms.

2. Handling Complex Relationships: Gradient boosting can capture
   complex nonlinear relationships between input and target variables,
   making it suitable for datasets with intricate patterns.

3. Robustness to Outliers and Noise: The iterative nature of gradient
   boosting can help reduce the impact of outliers and noise in the
   data, particularly when the model is regularized, leading to more
   robust predictions [@chen2016].

### Machine Learning Workflow

Because this study uses supervised learning, this section focuses on
the supervised workflow. Machine learning can be broken into three
broad phases: cleaning and processing the data; training and testing
the model; and evaluating, deploying, and monitoring the model
[@debruyne2021]. In the first phase, data are collected, cleaned, and
labeled. Data cleaning, or pre-processing, is one of the essential steps
in designing a reliable model [@debruyne2021]. Common pre-processing
steps include handling missing data, detecting outliers, and encoding
categorical data. At this stage, the data are also split into training
and test sets, typically with something close to a 70-30 split. The two
sets serve different purposes in the rest of the model building. The
training set is used to develop feature sets, train algorithms, tune
hyperparameters, compare models, and carry out all the other activities
required to choose a final model (e.g., the model to be put into
production) [@boehmke2020]. Once the final model is selected, the test
set is used to obtain an unbiased estimate of the model's performance,
which is referred to as the generalization error [@boehmke2020]. Most of
the time (as much as 80%) is invested in this data processing phase. In
the second phase, after feature engineering, an ML model is trained and
tested on the collected data. Feature engineering is performed on the
training set to select a good set of features to train on. The ML model
can only learn efficiently if the training data contain enough relevant
features and few irrelevant ones [@géron2019]. The data are then run
through various models, such as Linear Regression, Logistic Regression,
K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Decision
Trees (DTs), and Random Forests (RFs).
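
The splitting step can be sketched as follows. The function name, the
fixed random seed, and the use of integers to stand in for patient
records are assumptions made for illustration; the 70-30 proportion is
the one mentioned above.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical records, represented by their row numbers.
records = list(range(10))
train_set, test_set = split_train_test(records)
```

Shuffling before cutting avoids a split that simply follows the order
in which the records were collected. The test set is then set aside
until a final model has been chosen.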

Once a model is selected, the third phase begins: evaluating the
model's performance. Historically, the performance of statistical models
was primarily assessed with goodness-of-fit tests and the assessment of
residuals. Unfortunately, misleading conclusions may follow from
predictive models that pass these assessments [@breiman2001]. Today, it
is widely accepted that a sounder approach to assessing model
performance is to determine predictive accuracy via loss functions
[@boehmke2020]. *Loss functions* are metrics that compare the predicted
values to the actual values (the output of a loss function is often
referred to as the error or pseudo residual). When performing resampling
methods, the predicted values for a validation set are compared with the
actual target values. The overall validation error of the model is
computed by aggregating the errors across the entire validation data set
[@boehmke2020].
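
As a minimal sketch of these ideas, the functions below implement three
common loss functions and one way of aggregating validation error over
resamples. The choice of metrics, the k-fold scheme with interleaved
folds, the function names, and the trivial mean-predicting model are all
assumptions made for illustration rather than the procedure of the cited
sources.

```python
import math


def mse(actual, predicted):
    """Mean squared error (regression)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(mse(actual, predicted))


def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class is wrong."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def k_fold_error(x, y, fit, loss, k=5):
    """Average validation loss over k folds.

    fit(x_train, y_train) must return a function mapping one x to a prediction.
    """
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, len(x), k))  # every k-th observation
        x_train = [xi for i, xi in enumerate(x) if i not in val_idx]
        y_train = [yi for i, yi in enumerate(y) if i not in val_idx]
        x_val = [xi for i, xi in enumerate(x) if i in val_idx]
        y_val = [yi for i, yi in enumerate(y) if i in val_idx]
        model = fit(x_train, y_train)
        fold_errors.append(loss(y_val, [model(xi) for xi in x_val]))
    return sum(fold_errors) / k


def fit_mean(x_train, y_train):
    """Hypothetical baseline model: always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda xi: mean


example_mse = mse([3.0, 5.0], [2.0, 7.0])
baseline_error = k_fold_error(list(range(10)), [2.0] * 10, fit_mean, mse)
```

Any model that follows the same `fit` convention can be passed to
`k_fold_error`, so competing models can be compared on the same folds
with the same loss function.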

### Machine Learning in the Clinical Laboratory

Rabbani et al. performed a comprehensive review of the current state of
machine learning in laboratory medicine [-@rabbani2022]. Their review
revealed several exciting applications, including predicting laboratory
test values, improving laboratory utilization, automating laboratory
processes, promoting precision laboratory test interpretation, and
improving laboratory medicine information systems. In these studies,
tree-based learning algorithms and neural networks often performed best.
@tbl-lab_ml provides an overview of their findings.

| **Author and Year** | **Objective and Machine Learning Task** | **Best Model** | **Major Themes** |
|:-----------------|:-----------------|:-----------------|:-----------------|
| Azarkhish (2012) | Predict iron deficiency anemia and serum iron levels from CBC indices | Neural Network | Prediction |
| Cao (2012) | Triage manual review for urinalysis samples | Tree-based | Automation |
| Yang (2013) | Predict normal reference ranges of ESR for various laboratories based on geographic and other clinical features | Neural Network | Interpretation |
| Lidbury (2015) | Predict liver function test results from other tests in the panel, highlighting redundancy in the liver function panel | Tree-based | Prediction, Utilization |
| Demirci (2016) | Classify whether critical lab result is valid or invalid using other lab values and clinical information | Neural Network | Automation, Interpretation, Validation |
| Luo (2016) | Predict ferritin from other tests in iron panel | Tree-based | Prediction, Utilization |
| Poole (2016) | Create personalized reference ranges that take into account patients' diagnoses | Unsupervised learning | Interpretation |
| Parr (2018) | Automate mapping of Veterans Affairs laboratory data to LOINC codes | Tree-based | Information systems, Automation |
| Wilkes (2018) | Classify urine steroid profiles as normal or abnormal, and further interpret into specific disease processes | Tree-based | Interpretation, Automation |
| Fillmore (2019) | Automate mapping of Veterans Affairs laboratory data to LOINC codes | Tree-based | Information systems, Automation |
| Lee (2019) | Predict LDL-C levels from a limited lipid panel more accurately than current gold standard equations | Neural Network | Interpretation, Prediction |
| Xu (2019) | Identify redundant laboratory tests and predict their results as normal or abnormal | Tree-based | Prediction, Utilization |
| Islam (2020) | Use prior ordering patterns to create an algorithm that can recommend best practice tests for specific diagnoses | Neural Network | Utilization |
| Peng (2020) | Interpret newborn screening assays based on gestational age and other clinical information to reduce false positives | Tree-based | Interpretation, Utilization |
| Wang (2020) | Automatically verify if lab test result is valid or invalid | Tree-based | Validation, Automation |
| Dunn (2021) | Predict laboratory test results from wearable data | Tree-based | Prediction |
| Fang (2021) | Classify blood specimen as clotted or not clotted based on coagulation indices | Neural Network | Quality control |
| Farrell (2021) | Automatically identify mislabelled laboratory samples | Neural Network | Quality control, Automation |

: Summary of characteristics of machine learning algorithms [@rabbani2022]. {#tbl-lab_ml}

## Reflex Testing

The laboratory diagnosis of thyroid dysfunction relies on the
measurement of circulating concentrations of thyrotropin (TSH), free
thyroxine (fT4), and, in some cases, free triiodothyronine (fT3). TSH
measurement is the most sensitive initial laboratory test for screening
individuals for thyroid hormone abnormalities [@woodmansee2018]. TSH and
fT4 have a complex, nonlinear relationship, such that small changes in
fT4 result in relatively large changes in TSH [@plebani2020]. Many
clinicians and laboratories check TSH alone as the initial test for
thyroid problems and only add an fT4 measurement if the TSH is abnormal
(outside the laboratory's normal reference range). This is known as
reflex testing [@woodmansee2018]. Reflex testing became possible with
the advent of laboratory information systems (LIS) that were
sufficiently flexible to permit modification of existing test requests
at various stages of the analytical process [@srivastava2010]. Reflex
testing is widely used, the principal aim being to optimize the use of
laboratory tests. However, the common practice of reflex testing relies
simply on hard-coded rules that allow no flexibility. For instance, in
the case of TSH, fT4 is added to the patient order whenever the TSH
value falls outside the established laboratory reference range. A
further issue is that the thresholds used to trigger the reflex addition
of tests vary widely. Murphy found that the hypocalcaemic threshold used
to trigger magnesium measurement varied from 1.50 mmol/L up to 2.20
mmol/L [-@murphy2021]. Even allowing for differences in the nature,
size, and staffing of hospital laboratories and populations served, the
extent of the observed variation invites scrutiny [@murphy2021].
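
A hard-coded reflex rule of the kind described above can be sketched in
a few lines. The function name and the TSH reference limits of 0.4 and
4.0 mIU/L are hypothetical placeholders, since each laboratory
establishes its own range; the two calcium thresholds are the extremes
reported by Murphy, and the calcium result of 1.90 mmol/L is an invented
example value.

```python
def reflex_add_on(value, add_on, lower=None, upper=None):
    """Return the add-on test if the result falls outside the given limits."""
    below = lower is not None and value < lower
    above = upper is not None and value > upper
    return [add_on] if (below or above) else []


# TSH rule with placeholder reference limits (assumed, in mIU/L).
normal_tsh = reflex_add_on(2.0, "fT4", lower=0.4, upper=4.0)
raised_tsh = reflex_add_on(6.5, "fT4", lower=0.4, upper=4.0)

# The same calcium result under the lowest and highest reported thresholds.
calcium = 1.90  # mmol/L, hypothetical patient result
lab_a = reflex_add_on(calcium, "magnesium", lower=1.50)
lab_b = reflex_add_on(calcium, "magnesium", lower=2.20)
```

Under these assumptions the same calcium result triggers a magnesium
measurement in one laboratory but not in the other, which illustrates
how a fixed threshold alone determines whether a test is added.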