# Literature Review
The application of machine learning in medicine has garnered enormous attention over the past decade [@rabbani2022]. Artificial intelligence (AI), and especially its subdiscipline machine learning (ML), has become a hot topic that is generating increasing interest among laboratory professionals. AI is a broad term that can be defined as the theory and development of computer systems able to perform complex tasks normally requiring human intelligence, such as decision-making, visual perception, speech recognition, and translation between languages. ML is the science of programming computers so that they can learn from data without being explicitly programmed [@debruyne2021]. The ever-wider use of ML in clinical and basic medical research is reflected in the number of titles and abstracts of papers indexed on PubMed: compared with the period up to 2006, the years 2007--2017 saw a nearly 10-fold increase, from about 1,000 to slightly more than 9,000 articles [@cabitza2018]. A literature review by Rabbani et al. found 39 articles pertaining to the field of clinical chemistry in laboratory medicine between 2011 and 2021 [-@rabbani2022].

## A Brief Primer on Machine Learning
While the aim of this literature review is not to provide an extensive treatment of the mathematics behind ML algorithms, some basic concepts are introduced here to allow a sufficient understanding of the topics discussed in this paper. ML models can be classified into broad categories based on several criteria, such as the type of supervision they receive, whether or not the algorithm can learn incrementally from an incoming stream of data (batch versus online learning), and how they generalize (instance-based versus model-based learning) [@debruyne2021]. Rabbani et al. further classified the specific clinical chemistry applications into five broad categories: predicting laboratory test values, improving laboratory utilization, automating laboratory processes, promoting precision laboratory test interpretation, and improving laboratory medicine information systems [-@rabbani2022].

### Supervised vs Unsupervised Learning
Four important categories can be distinguished based on the amount and type of supervision the models receive during training: supervised, unsupervised, semi-supervised, and reinforcement learning. In supervised learning, the training data are labeled, so the model learns from data samples whose desired solutions are already known [@debruyne2021]. Supervised models are typically used for classification and regression. Some of the most important supervised algorithms are Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Decision Trees (DTs), Random Forests (RFs), and supervised neural networks.
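
To make the distinction concrete, the following minimal sketch (illustrative only, using Python's scikit-learn on a synthetic dataset rather than the study data) trains a supervised classifier on labeled examples and then predicts the labels of held-out samples.

```python
# Minimal supervised-learning sketch: labeled data in, label predictions out.
# The synthetic dataset stands in for real laboratory data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic feature matrix X and known labels y (the "supervision").
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 30% of the labeled data to check the predictions later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a Random Forest on the labeled training examples.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict labels for unseen samples and compare them to the known answers.
y_pred = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")
```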
In unsupervised learning, the training data are unlabeled; in other words, observations are grouped or described without any prior knowledge of the desired output [@debruyne2021]. Unsupervised algorithms can be used for clustering (e.g. k-means clustering, density-based spatial clustering of applications with noise, hierarchical cluster analysis), visualization and dimensionality reduction (e.g. principal component analysis (PCA), kernel PCA, locally linear embedding, t-distributed stochastic neighbor embedding), anomaly and novelty detection (e.g. one-class SVM, isolation forest), and association rule learning (e.g. Apriori, Eclat).
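
By contrast, an unsupervised algorithm receives no labels at all. The sketch below (again illustrative, with synthetic data and arbitrary parameter choices) clusters observations with k-means and reduces the same data to two principal components with PCA; neither step uses an outcome variable.

```python
# Minimal unsupervised-learning sketch: no labels are given to either model.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic, unlabeled feature matrix (the generator's labels are discarded).
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=42)

# Clustering: k-means groups observations by similarity alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: PCA projects the data onto the two components
# that capture the most variance, e.g. for visualization.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("Variance explained by 2 components:", round(pca.explained_variance_ratio_.sum(), 2))
```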
Some models can also deal with partially labeled training data (i.e. semi-supervised learning). Finally, in reinforcement learning, an agent (i.e. the learning system) learns which actions to take to optimize the outcome of a strategy (i.e. a policy) or to maximize the cumulative reward [@debruyne2021]. This approach resembles a human learning to ride a bike and is typically applied to learning games, such as Go, chess, or even poker, or to settings where the outcome is continuous rather than dichotomous (i.e. right or wrong) [@debruyne2021]. The proposed study will use supervised learning, as the data are labeled and a particular outcome is expected.

### Machine Learning Workflow
Since this study will use supervised learning, the remainder of this review focuses on the supervised workflow. Machine learning can be broken into three broad steps: (1) data cleaning and pre-processing, (2) training and testing the model, and (3) evaluating, deploying, and monitoring the model [@debruyne2021]. In the first phase, data are collected, cleaned, and labeled. Data cleaning, or pre-processing, is one of the most important steps in designing a reliable model [@debruyne2021]. Common pre-processing steps include handling missing data, detecting outliers, and encoding categorical variables. At this stage the data are also split into training and testing sets, typically close to a 70-30 split. These two data sets are used for different parts of the rest of model building: the training set is used to develop feature sets, train the algorithms, tune hyperparameters, compare models, and perform all of the other activities required to choose a final model (e.g., the model we want to put into production) [@boehmke2020]. Once the final model is chosen, the test set is used to obtain an unbiased estimate of the model's performance, referred to as the generalization error [@boehmke2020]. Most of the effort (as much as 80%) is invested in this data-processing stage.
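
As a concrete illustration of this first phase, the sketch below (a hypothetical example with a made-up pandas data frame, not the study data) imputes a missing value, one-hot encodes a categorical column, and performs an approximately 70-30 train-test split.

```python
# Minimal pre-processing sketch: handle missing data, encode a categorical
# feature, and hold out roughly 30% of the rows as a test set.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Made-up example data with one missing numeric value and a categorical column.
df = pd.DataFrame({
    "glucose": [5.1, 6.2, np.nan, 7.8, 5.5, 6.9, 8.1, 5.0],
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "outcome": [0, 1, 0, 1, 0, 1, 1, 0],   # label column
})
X = df[["glucose", "sex"]]
y = df["outcome"]

# 70-30 split; the test set is set aside until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Median imputation for the numeric feature, one-hot encoding for the categorical one.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["glucose"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Fit the pre-processing on the training data only, then apply it to both sets.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```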
In the second phase, an ML model is trained and tested on the collected data after feature engineering. Feature engineering is performed on the training set to select a good set of features to train on; the model will only be able to learn efficiently if the training data contain enough relevant features and few irrelevant ones [@géron2019]. The training data are then run through several candidate models, such as Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Decision Trees (DTs), and Random Forests (RFs).
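
A common way to run the training data through several candidate models is k-fold cross-validation, sketched below. This is a generic illustration with synthetic data and default settings, not the study's actual model shortlist or tuning strategy.

```python
# Minimal model-comparison sketch: score several candidate classifiers on the
# training data with 5-fold cross-validation and report the mean accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training set.
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Mean cross-validated accuracy for each candidate, using training data only.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name:20s} mean CV accuracy = {scores.mean():.3f}")
```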
Once a model is selected, the third phase begins: evaluating the model's performance. Historically, the performance of statistical models was judged largely by goodness-of-fit tests and assessment of residuals. Unfortunately, misleading conclusions may follow from predictive models that pass these kinds of assessments [@breiman2001]. Today, it is widely accepted that a sounder approach to assessing model performance is to measure predictive accuracy via loss functions [@boehmke2020]. Loss functions are metrics that compare the predicted values to the actual values (the output of a loss function is often referred to as the error or pseudo-residual). When performing resampling methods, the predicted values for a validation set are compared to the actual target values, and the overall validation error of the model is computed by aggregating the errors across the entire validation data set [@boehmke2020].
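
The snippet below sketches this idea with synthetic data and a single, arbitrarily chosen model: a loss function (here log loss) is computed on each cross-validation fold, the fold errors are aggregated into an overall validation error, and the untouched test set then provides a one-time estimate of the generalization error.

```python
# Minimal evaluation sketch: aggregate a loss across validation folds, then
# estimate the generalization error once on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)

# Loss on each validation fold of the training data.
fold_losses = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_train):
    model.fit(X_train[train_idx], y_train[train_idx])
    val_pred = model.predict_proba(X_train[val_idx])
    fold_losses.append(log_loss(y_train[val_idx], val_pred))

# Overall validation error = aggregate (mean) of the per-fold errors.
print(f"Mean validation log loss: {sum(fold_losses) / len(fold_losses):.3f}")

# Final, one-time estimate of the generalization error on the test set.
model.fit(X_train, y_train)
test_loss = log_loss(y_test, model.predict_proba(X_test))
print(f"Test (generalization) log loss: {test_loss:.3f}")
```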
<!--# should I talk about Model types ?-->

### Machine Learning in the Clinical Laboratory
<!--# Can I copy this table? -->

| **Author and Year** | **Objective and Machine Learning Task**                                                                          | **Best Model** | **Major Themes** |
|---------------------|------------------------------------------------------------------------------------------------------------------|----------------|------------------|
| Azarkhish (2012)    | Predict iron deficiency anemia and serum iron levels from CBC indices                                             | Neural Network | Prediction       |
| Cao (2012)          | Triage manual review for urinalysis samples                                                                       | Tree-based     | Automation       |
| Yang (2013)         | Predict normal reference ranges of ESR for various laboratories based on geographic and other clinical features   | Neural Network | Interpretation   |

: Table 1. Summary of characteristics of machine learning algorithms [@rabbani2022].