chapter 2 updates
parent 1c2d279910
commit cd13f25a36
2 changed files with 45 additions and 1 deletion
23 chapter2.qmd
@@ -85,6 +85,27 @@ want to put into production) [@boehmke2020]. Once the final model is
chosen, the test set data is used to obtain an unbiased assessment of
the model's performance, which we refer to as the generalization error
[@boehmke2020]. Most of the time (as much as 80%) is invested in the data
processes stage. In the second phase, an ML model is trained and tested
on the collected data after feature engineering. Feature engineering is
performed on the training set to select a good set of features to train
on. The ML model can learn efficiently only if the training data
contains enough relevant features and few irrelevant ones
[@géron2019]. The data is then run through several candidate models, such
as Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN),
Support Vector Machines (SVMs), Decision Trees (DTs), and Random Forests
(RFs). Once a model is selected, the third phase begins: evaluating the
model's performance. Historically, the performance of statistical models
was judged largely by goodness-of-fit tests and assessment of residuals.
Unfortunately, misleading conclusions may follow from predictive models
that pass these kinds of assessments [@breiman2001]. Today, it is widely
accepted that a sounder approach to assessing model performance is to
assess predictive accuracy via loss functions [@boehmke2020]. Loss
functions are metrics that compare the predicted values to the actual
values (the output of a loss function is often referred to as the error
or pseudo residual). When performing resampling methods, we compare the
predicted values for a validation set with the actual target values. The
overall validation error of the model is computed by aggregating the
errors across the entire validation data set [@boehmke2020].
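
As a minimal sketch of the aggregation described above (not taken from the cited sources), the Python snippet below computes two common loss functions, mean squared error and mean absolute error, on validation data and then averages the per-fold errors of a resampling run into a single validation error. The function names and the toy numbers are illustrative assumptions.

```python
from statistics import mean


def mse(actual, predicted):
    """Mean squared error: average of squared prediction errors."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: average of absolute prediction errors."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def cross_validated_error(folds, loss):
    """Average a loss function over validation folds.

    folds: list of (actual, predicted) pairs, one pair per validation fold.
    loss:  function taking (actual, predicted) and returning a number.
    """
    return mean(loss(actual, predicted) for actual, predicted in folds)


# Hypothetical validation folds: (actual targets, model predictions).
folds = [
    ([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]),
    ([1.0, 4.0], [1.0, 2.0]),
]

print("validation MSE:", cross_validated_error(folds, mse))
print("validation MAE:", cross_validated_error(folds, mae))
```

Averaging the fold-level errors weights every fold equally. Pooling all validation observations before averaging is an alternative that weights every observation equally, and the two differ when fold sizes differ.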

####
@@ -191,3 +191,26 @@ PMID: 33045173}
  date = {2020-02-01},
  url = {https://bradleyboehmke.github.io/HOML/}
}

@book{géron2019,
  title = {Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems},
  author = {{Géron}, {Aurélien}},
  year = {2019},
  date = {2019},
  publisher = {O'Reilly Media, Inc},
  edition = {Second edition},
  address = {Beijing [China] ; Sebastopol, CA}
}

@article{breiman2001,
  title = {Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)},
  author = {Breiman, Leo},
  year = {2001},
  month = {08},
  date = {2001-08-01},
  journal = {Statistical Science},
  volume = {16},
  number = {3},
  doi = {10.1214/ss/1009213726},
  url = {http://dx.doi.org/10.1214/ss/1009213726}
}