chapter 2 updates

This commit is contained in:
Kyle Belanger 2022-09-10 18:16:42 -04:00
parent 1c2d279910
commit cd13f25a36
2 changed files with 45 additions and 1 deletion

@@ -85,6 +85,27 @@ want to put into production) [@boehmke2020]. Once the final model is
chosen the test set data is used to estimate an unbiased assessment of
the model's performance, which we refer to as the generalization error
[@boehmke2020]. Most of the time (as much as 80%) is invested in the
data processing stage. In the second phase, an ML model is trained and
tested on the collected data after feature engineering. Feature
engineering is performed on the training set to select a good set of
features to train on. The ML model will only be able to learn
efficiently if the training data contains enough relevant features and
minimal irrelevant ones [@géron2019]. The data is then run through
various models, such as Linear Regression, Logistic Regression,
K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Decision
Trees (DTs), and Random Forests (RFs).
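
The sketch below is a minimal illustration of this second phase, not
the pipeline used in this work: it assumes a generic tabular data set,
scikit-learn, and placeholder settings (the synthetic data, the choice
of ten features, and the default model parameters are all assumptions).
The data is split, a simple feature-selection step is fit on the
training set only, and several of the candidate model families listed
above are trained.

```python
# Minimal sketch of the second phase: split the data, select features
# on the training set only, and fit several candidate models.
# The data set (X, y), k=10 features, and model settings are
# hypothetical placeholders, not the ones used in this work.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; in practice X and y come out of the data processing stage.
X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=42)

# Hold out a test set for the final, unbiased assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature engineering/selection is fit on the training set only.
selector = SelectKBest(score_func=f_regression, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Candidate models drawn from the families listed above (regression
# variants; Logistic Regression would be the classification analogue).
candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}
for name, model in candidates.items():
    model.fit(X_train_sel, y_train)
```
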
Once a model is selected, the third phase begins: evaluating the
model's performance. Historically, the performance of statistical
models was largely based on goodness-of-fit tests and assessment of
residuals. Unfortunately, misleading conclusions may follow from
predictive models that pass these kinds of assessments [@breiman2001].
Today, it is widely accepted that a sounder approach to assessing model
performance is to assess predictive accuracy via loss functions
[@boehmke2020]. Loss functions are metrics that compare the predicted
values to the actual values (the output of a loss function is often
referred to as the error or pseudo residual). When performing
resampling methods, we assess the predicted values for a validation set
against the actual target values. The overall validation error of the
model is computed by aggregating the errors across the entire
validation data set [@boehmke2020].
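
As a minimal, hypothetical illustration of this resampling idea
(assuming scikit-learn and a placeholder data set and model rather than
those used in this work), k-fold cross-validation computes the loss,
here the root mean squared error, on each validation fold and then
aggregates the per-fold errors into an overall validation error:

```python
# Minimal sketch of assessing predictive accuracy with a loss function
# under resampling: each of 5 validation folds yields an error, and the
# overall validation error is their aggregate (mean). The data set and
# model are hypothetical placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=42)
model = RandomForestRegressor(random_state=42)

# scikit-learn reports losses as negated scores ("greater is better"),
# so the sign is flipped to recover the RMSE loss on each fold.
fold_scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
fold_rmse = -fold_scores

# Aggregate the per-fold errors into the overall validation error.
print("RMSE per validation fold:", np.round(fold_rmse, 2))
print("Overall validation RMSE:", round(float(fold_rmse.mean()), 2))
```
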
####

@@ -191,3 +191,26 @@ PMID: 33045173}
date = {2020-02-01},
url = {https://bradleyboehmke.github.io/HOML/}
}
@book{géron2019,
title = {Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems},
author = {Géron, Aurélien},
year = {2019},
date = {2019},
publisher = {O'Reilly Media, Inc},
edition = {Second edition},
address = {Beijing [China] ; Sebastopol, CA}
}
@article{breiman2001,
title = {Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)},
author = {Breiman, Leo},
year = {2001},
month = {08},
date = {2001-08-01},
journal = {Statistical Science},
volume = {16},
number = {3},
doi = {10.1214/ss/1009213726},
url = {http://dx.doi.org/10.1214/ss/1009213726}
}