
# Methods

## IRB

Based on the information submitted for this project, the Campbell
University Institutional Review Board (Campbell IRB) determined that this
submission is Not Human Subjects Research as defined by 45 CFR
46.102(e).

## Population and Data

This study used the Medical Information Mart for Intensive Care (MIMIC)
database [@johnsonalistair]. MIMIC is an extensive, freely available
database comprising de-identified health-related data from patients who
were admitted to the critical care units of the Beth Israel Deaconess
Medical Center. The database contains many different types of
information, but only data from the patients and laboratory events
tables are used in this study. The study uses version IV of the
database, comprising data from 2008 to 2019.

## Data Variables and Outcomes

```{r}
#| include: FALSE

source(here::here("ML","1-data-exploration.R"))

```

A total of 18 variables were chosen for this study. The age and gender
of the patient were pulled from the patients table in the MIMIC database.
While this database contains some additional demographic information, it
is incomplete and thus unusable for this study. Fifteen lab values were
selected for this study:

-   **BMP**: BUN, bicarbonate, calcium, chloride, creatinine, glucose,
    potassium, sodium

-   **CBC**: Hematocrit, hemoglobin, platelet count, red blood cell
    count, white blood cell count

-   TSH

-   Free T4

The unique patient ID and chart time were also retained to identify
each sample. Each sample contains one set of the selected lab values for
a patient. Patients may have several samples in the data set, run at
different times. Rows were retained as long as they had fewer than three
missing results. These missing results can be filled in by imputation
later in the process. Samples were also restricted to those with TSH
outside the reference range of 0.27 to 4.2 uIU/mL. These represent
samples that would have reflexed to Free T4 testing. After filtering,
the final data set contained `r nrow(ds1)` rows.
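
As a rough illustration of the filtering rules described above, the
following sketch applies them to a small made-up data frame. The column
names (`subject_id`, `TSH`, `FT4`, `GLU`, `CA`, `BUN`) and all values
are assumptions for illustration only; they are not taken from the
study's scripts or from MIMIC.

```r
# Toy data: hypothetical column names and values, not MIMIC data
labs <- data.frame(
  subject_id = c(1, 2, 3, 4),
  TSH = c(5.10, 1.80, 0.10, 6.30),
  FT4 = c(0.80, 1.20, 2.10, 1.00),
  GLU = c(98, 110, NA, NA),
  CA  = c(9.1, NA, 8.8, NA),
  BUN = c(14, 20, 18, NA)
)

lab_cols <- c("TSH", "FT4", "GLU", "CA", "BUN")

# Count missing lab results per row and keep rows with fewer than three
n_missing <- rowSums(is.na(labs[, lab_cols]))
kept <- labs[n_missing < 3, ]

# Keep only samples whose TSH falls outside the 0.27 to 4.2 uIU/mL range
kept <- kept[kept$TSH < 0.27 | kept$TSH > 4.2, ]
```

In this toy example, one row is dropped for having three missing
results and another for having a TSH inside the reference range.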

Once the final data set was collected, an additional column was created
for the outcome variable to determine if the Free T4 value was
diagnostic. This outcome variable was used for building classification
models. The classification variable was not used in regression models.
@tbl-outcome_var shows how the outcomes were assigned.

| TSH Value     | Free T4 Value | Outcome             |
|---------------|---------------|---------------------|
| \>4.2 uIU/mL  | \>0.93 ng/dL  | Non-Hypothyroidism  |
| \>4.2 uIU/mL  | \<0.93 ng/dL  | Hypothyroidism      |
| \<0.27 uIU/mL | \<1.7 ng/dL   | Non-Hyperthyroidism |
| \<0.27 uIU/mL | \>1.7 ng/dL   | Hyperthyroidism     |

: Outcome Variable {#tbl-outcome_var}
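
The mapping in @tbl-outcome_var can be expressed as a small function.
This is a minimal sketch, not the study's code: `classify_outcome` is a
hypothetical helper, and because the table does not say how a Free T4
value exactly equal to a cutoff is handled, the sketch assumes such
values fall into the "Non-" category.

```r
# Hypothetical helper; thresholds come from the outcome table.
# Assumption: a Free T4 exactly at a cutoff is treated as non-diagnostic.
classify_outcome <- function(tsh, ft4) {
  ifelse(
    tsh > 4.2,
    # Elevated TSH: low Free T4 indicates hypothyroidism
    ifelse(ft4 < 0.93, "Hypothyroidism", "Non-Hypothyroidism"),
    ifelse(
      tsh < 0.27,
      # Suppressed TSH: high Free T4 indicates hyperthyroidism
      ifelse(ft4 > 1.7, "Hyperthyroidism", "Non-Hyperthyroidism"),
      NA_character_  # TSH inside the reference range: no outcome assigned
    )
  )
}

outcomes <- classify_outcome(
  tsh = c(6.0, 6.0, 0.1, 0.1),
  ft4 = c(1.2, 0.5, 1.0, 2.3)
)
```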

@tbl-data_summary shows the summary statistics of each variable
selected for the study. Each numeric variable is listed with the percent
missing, median, and interquartile range (IQR). The data set is weighted
toward elevated TSH levels, with 80% of values falling into that
category. Glucose and calcium have several missing values at
`r gtsummary::inline_text(summary_tbl, variable = GLU, column = n)` and
`r gtsummary::inline_text(summary_tbl, variable = CA, column = n)`,
respectively.

```{r}
#| label: tbl-data_summary
#| tbl-cap: Data Summary
#| echo: false

summary_tbl %>% gtsummary::as_kable()
```
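
The three statistics reported for each numeric variable (percent
missing, median, and the quartiles that bound the IQR) can also be
computed by hand in base R. The sketch below uses made-up glucose
values; the vector name `glu` is an assumption and the numbers are not
study data.

```r
# Toy values with two missing results (made up for illustration)
glu <- c(90, 105, NA, 120, 98, NA, 110, 101)

# Percent of results that are missing
pct_missing <- mean(is.na(glu)) * 100

# Median of the observed values
med <- median(glu, na.rm = TRUE)

# First and third quartiles, which bound the interquartile range
q <- quantile(glu, probs = c(0.25, 0.75), na.rm = TRUE)
```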

## Data Inspection

Examining @tbl-data_summary quickly brings several important
characteristics of the data set to light. The median age is similar
across the outcome categories, with an overall median of 62.5. Females
are better represented in the data set, with higher percentages in all
categories. Across all categories, the median values for each lab result
are similar. The exception is red blood cell count, which shows more
variation across the categories.

![Distribution of
Variables](figures/distrubution_histo){#fig-distro_histo}

In @fig-distro_histo, many clinical chemistry values do not show a
normal distribution, whereas the hematology results typically do. While
this is not a problem for most tree-based classification models, many
regression models perform better with standardized variables.
Standardizing variables provides a common comparable unit of measure
across all the variables [@boehmke2020]. Since lab values do not contain
negative numbers, all numeric values will be log-transformed to bring
them closer to normal distributions.
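
To make the two preprocessing ideas concrete, the sketch below log
transforms a right-skewed variable and then standardizes it. The
creatinine values are made up, and the variable names are assumptions
rather than names from the study's scripts.

```r
# Toy right-skewed lab values (made up); all strictly positive,
# since log() of zero is -Inf and log() of a negative number is NaN
creatinine <- c(0.6, 0.8, 0.9, 1.0, 1.1, 1.4, 2.5, 6.0)

# Log transform to reduce the right skew
log_creat <- log(creatinine)

# Standardize: subtract the mean and divide by the standard deviation,
# so the result has mean 0 and standard deviation 1
z_creat <- (log_creat - mean(log_creat)) / sd(log_creat)
```

After these two steps, every variable is expressed on the same unitless
scale, which is what allows values such as sodium and TSH to be compared
directly.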

![Variable Correlation Plot](figures/corr_plot){#fig-corr_plot}

@fig-corr_plot shows a high correlation between hemoglobin,
hematocrit, and red blood cell values, as would be expected. While high
correlation does not lead to model issues, it can cause unnecessary
computations with little value. However, due to the small number of
variables to begin with

## Data Tools

All data handling and modeling were performed using R and RStudio. The
current report was rendered in the following environment.

```{r}
#| label: tbl-platform-info
#| tbl-cap: Session Info R Environment
#| echo: false
#| message: false
#| warning: false

df_session_platform <-  devtools::session_info()$platform %>% 
  unlist(.) %>% 
  as.data.frame(.) %>% 
  tibble::rownames_to_column(.)

colnames(df_session_platform) <- c("Setting", "Value")

knitr::kable(
  df_session_platform
  ,align = 'l'
  ,booktabs = TRUE
)

```

```{r}
#| label: tbl-package-info
#| tbl-cap: Package Info R Environment
#| echo: false
#| message: false
#| warning: false


df_session_packages <-  devtools::session_info(include_base = TRUE)$packages %>% 
  as.data.frame(.) %>% 
  dplyr::select(loadedversion, date) %>% 
  tibble::rownames_to_column()

colnames(df_session_packages) <- c("Package", "Loaded Version", "Date")

used_packages <- renv::dependencies(progress = FALSE) %>% dplyr::select(Package)

df_session_packages <- df_session_packages %>% 
  dplyr::filter(Package %in% used_packages$Package)

knitr::kable(
  df_session_packages
  ,align = 'l'
  ,booktabs = TRUE
)

```

## Model Selection

Both classification and regression models were screened using a random
grid search to tune hyperparameters. The models were tested against the
training data set to find the best-fit model. @fig-reg-screen shows the
results of the model screening for regression models, using root mean
square error (RMSE) as the ranking method. Random Forest models and
boosted trees performed similarly and were selected for further testing.
A full grid search was performed on both models, with a Random Forest
model as the final selection. The final hyperparameters selected were:

-   mtry: 8

-   trees: 1000

-   minimum nodes: 2

![Regression Model Screen](figures/reg_screen){#fig-reg-screen}
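
The difference between the random grid used for screening and the full
grid used for final tuning can be sketched in base R. This is a
simplified illustration, not the study's tuning code: the candidate
values are assumptions, `min_n` is an assumed name for the minimum nodes
parameter, and the random grid is drawn here as a random subset of the
full grid.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Full grid: every combination of the candidate values
# (the candidate values are illustrative assumptions)
full_grid <- expand.grid(
  mtry  = c(2, 4, 6, 8, 10),
  trees = c(500, 1000, 2000),
  min_n = c(2, 5, 10)
)

# Random grid: a random subset of combinations, cheaper to screen
random_grid <- full_grid[sample(nrow(full_grid), size = 10), ]
```

Screening many model types on a small random grid keeps the cost
manageable, and the full grid is then reserved for the few model types
that look promising.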

@fig-class-screen shows the results of the model screen for
classification models using accuracy as the ranking method. As with
regression models, boosted trees and random forest models performed the
best. After completing a full grid search of both model types, a random
forest model was again chosen as the final model. The final
hyperparameters for the model selected were:

-   mtry: 8

-   trees: 2000

-   minimum nodes: 2

![Classification Model Screen](figures/class_screen){#fig-class-screen}
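
For reference, the two ranking metrics used in the model screens can be
written as short base R functions. This is a minimal sketch with made-up
inputs; the function names `rmse` and `accuracy` are chosen for
illustration and are not taken from the study's scripts.

```r
# Root mean square error for numeric predictions (lower is better)
rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Accuracy for class predictions: share of correct labels (higher is better)
accuracy <- function(truth, estimate) {
  mean(truth == estimate)
}

# Made-up Free T4 values and predictions
reg_score <- rmse(truth = c(1.0, 1.5, 2.0), estimate = c(1.0, 1.0, 3.0))

# Made-up outcome labels and predictions
class_score <- accuracy(
  truth    = c("Hypothyroidism", "Non-Hypothyroidism", "Hypothyroidism"),
  estimate = c("Hypothyroidism", "Hypothyroidism", "Hypothyroidism")
)
```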