211 lines
7.3 KiB
Text
211 lines
7.3 KiB
Text
# Methods
|
|
|
|
## IRB
|
|
|
|
This study was submitted to the Cambell University Institutional Review
|
|
Board (Campbell IRB) . The study was determined to be Not Human Subjects
|
|
Research as defined by 45 CFR 46.102(e), and thus exempt from further
|
|
review by the IRB.
|
|
|
|
## Population and Data
|
|
|
|
This study used the Medical Information Mart for Intensive Care (MIMIC)
|
|
database [@johnsonalistair]. MIMIC (Medical Information Mart for
|
|
Intensive Care) is an extensive, freely-available database comprising
|
|
de-identified health-related data from patients who were admitted to the
|
|
critical care units of the Beth Israel Deaconess Medical Center. The
|
|
database contains many different types of information, but only data
|
|
from the patients and laboratory events table are used in this study.
|
|
The study uses version IV of the database, comprising data from 2008 -
|
|
2019.
|
|
|
|
## Data Variables and Outcomes
|
|
|
|
```{r}
|
|
#| include: FALSE
|
|
|
|
library(magrittr)
|
|
|
|
source(here::here("ML","1-data-exploration.R"))
|
|
|
|
```
|
|
|
|
A total of 18 variables were chosen for this study. The age and gender
|
|
of the patient were pulled from the patient table in the MIMIC database.
|
|
While this database contains some additional demographic information, it
|
|
is incomplete and thus unusable for this study. 15 lab values were
|
|
selected for this study, this includes:
|
|
|
|
- **BMP**: BUN, bicarbonate, calcium, chloride, creatinine, glucose,
|
|
potassium, sodium
|
|
|
|
- **CBC**: Hematocrit, hemoglobin, platelet count, red blood cell
|
|
count, white blood cell count
|
|
|
|
- TSH
|
|
|
|
- Free T4
|
|
|
|
The unique patient id and chart time were also retained for identifying
|
|
each sample. Each sample contains one set of 15 lab values for each
|
|
patient. Patients may have several samples in the data set run at
|
|
different times. Rows were retained as long as they had less than three
|
|
missing results. These missing results can be filled in by imputation
|
|
later in the process. Samples were also filtered for those with TSH
|
|
above or below the reference range of 0.27 - 4.2 uIU/mL. These represent
|
|
samples that would have reflexed for Free T4 testing. After filtering,
|
|
the final data set contained `r nrow(ds1)` rows.
|
|
|
|
Once the final data set was collected, an additional column was created
|
|
for the outcome variable to determine if the Free T4 value was
|
|
diagnostic. This outcome variable was used for building classification
|
|
models. The classification variable was not used in regression models.
|
|
@tbl-outcome_var shows how the outcomes were added
|
|
|
|
| TSH Value | Free T4 Value | Outcome |
|
|
|---------------|---------------|---------------------|
|
|
| \>4.2 uIU/ml | \>0.93 ng/dL | Non-Hypothyroidism |
|
|
| \>4.2 uIU/ml | \<0.93 ng/dL | Hypothyroidism |
|
|
| \<0.27 uIU/ml | \<1.7 ng/dL | Non-Hyperthyroidism |
|
|
| \<0.27 uIU/ml | \>1.7 ng/dL | Hyperthyroidism |
|
|
|
|
: Outcome Variable {#tbl-outcome_var}
|
|
|
|
. @tbl-data_summary shows the summary statistics of each variable
|
|
selected for the study. Each numeric variable is listed with the percent
|
|
missing, median, and interquartile range (IQR). The data set is weighted
|
|
toward elevated TSH levels, with 80% of values falling into that
|
|
category. Glucose and Calcium have several missing values at
|
|
`r gtsummary::inline_text(summary_tbl, variable = GLU, column = n)` and
|
|
`r gtsummary::inline_text(summary_tbl, variable = CA, column = n)`,
|
|
respectively.
|
|
|
|
```{r}
|
|
#| label: tbl-data_summary
|
|
#| tbl-cap: Data Summary
|
|
#| echo: false
|
|
|
|
summary_tbl %>% gtsummary$as_kable()
|
|
```
|
|
|
|
## Data Inspection
|
|
|
|
By examining @tbl-data_summary several important data set
|
|
characteristics quickly come to light without explanation. The median
|
|
age across the data set, as a whole, is quite similar, with a median age
|
|
across all categories of 62.5. Females are better represented in the
|
|
data set, with higher percentages in all categories. Across all
|
|
categories, the median values for each lab result are pretty similar.
|
|
The expectation for this is Red Blood cells, which show more
|
|
considerable variation across the various categories.
|
|
|
|
{#fig-distro_histo}
|
|
|
|
When examining @fig-distro_histo, many clinical chemistry values do not
|
|
show a standard distribution. However, the hematology results typically
|
|
do appear to follow a standard distribution. While not a problem for
|
|
most tree-based classification models, many regression models perform
|
|
better with standard variables. Standardizing variables provides a
|
|
common comparable unit of measure across all the variables
|
|
[@boehmke2020]. Since lab values do not contain negative numbers, all
|
|
numeric values will be log-transformed to bring them to normal
|
|
distributions.
|
|
|
|
{#fig-corr_plot}
|
|
|
|
@fig-corr_plot shows a high correlation between Hemoglobin, hematocrit,
|
|
and Red Blood Cell values (as expected). While high correlation does not
|
|
lead to model issues, it can cause unnecessary computations with little
|
|
value. However, due to the small number of variables, the computation
|
|
burden is not expected to cause delays, and thus the variables will not
|
|
be removed.
|
|
|
|
## Data Tools
|
|
|
|
All data handling and modeling were performed using R and R Studio. The
|
|
current report was rendered in the following environment.
|
|
|
|
```{r}
|
|
#| label: tbl-platform-info
|
|
#| tbl-cap: Session Info R Environment
|
|
#| echo: false
|
|
#| message: false
|
|
#| #| warning: false
|
|
|
|
df_session_platform <- devtools::session_info()$platform %>%
|
|
unlist(.) %>%
|
|
as.data.frame(.) %>%
|
|
tibble::rownames_to_column(.)
|
|
|
|
colnames(df_session_platform) <- c("Setting", "Value")
|
|
|
|
knitr::kable(
|
|
df_session_platform
|
|
,align = 'l'
|
|
,booktabs = TRUE
|
|
)
|
|
|
|
```
|
|
|
|
```{r}
|
|
#| label: tbl-package-info
|
|
#| tbl-cap: Package Info R Environment
|
|
#| echo: false
|
|
#| message: false
|
|
#| warning: false
|
|
|
|
|
|
df_session_packages <- devtools::session_info(include_base = TRUE)$packages %>%
|
|
as.data.frame(.) %>%
|
|
dplyr::select(loadedversion, date) %>%
|
|
tibble::rownames_to_column()
|
|
|
|
colnames(df_session_packages) <- c("Package", "Loaded Version", "Date")
|
|
|
|
used_packages <-renv::dependencies(progress = FALSE) %>% dplyr::select(2)
|
|
|
|
df_session_packages <- df_session_packages %>%
|
|
dplyr::filter(Package %in% used_packages$Package)
|
|
|
|
knitr::kable(
|
|
df_session_packages
|
|
,align = 'l'
|
|
,booktabs = TRUE
|
|
)
|
|
|
|
```
|
|
|
|
## Model Selection
|
|
|
|
Both classification and regression models were screened using a random
|
|
grid search to tune hyperparameters. The models were tested against the
|
|
training data set to find the best-fit model. @fig-reg-screen shows the
|
|
results of the model screening for regression models, using root mean
|
|
square error (RMSE) as the ranking method. Random Forest models and
|
|
boosted trees performed similarly and were selected for further testing.
|
|
A full grid search was performed on both models, with a Random Forest
|
|
model as the final selection. The final hyperparameters selected were:
|
|
|
|
- mtry: 8
|
|
|
|
- trees: 1000
|
|
|
|
- minimum nodes: 2
|
|
|
|
{#fig-reg-screen}
|
|
|
|
@fig-class-screen shows the results of the model screen for
|
|
classification models using accuracy as the ranking method. As with
|
|
regression models, boosted trees and random forest models performed the
|
|
best. After completing a full grid search of both model types, a random
|
|
forest model was again chosen as the final model. The final
|
|
hyperparameters for the model selected were:
|
|
|
|
- mtry: 8
|
|
|
|
- trees: 2000
|
|
|
|
- minimum nodes: 2
|
|
|
|
{#fig-class-screen}
|