# Methods

need brief description and IRB statement

## Population and Data

This study was designed using the Medical Information Mart for Intensive Care (MIMIC) database [@johnsonalistair]. MIMIC is an extensive, freely available database comprising de-identified health-related data from patients who were admitted to the critical care units of the Beth Israel Deaconess Medical Center. The database contains many kinds of information, but only data from the patients and laboratory events tables are used in this study. The study uses version IV of the database, comprising data from 2008 to 2019.

## Data Variables and Outcomes
```{r}
#| include: FALSE
source(here::here("ML","1-data-exploration.R"))
```

A total of 18 variables were chosen for this study. The age and gender of the patient were pulled from the patients table in the MIMIC database. While this database contains some additional demographic information, it is incomplete and thus unusable for this study. In addition, 16 lab values were selected:

- **Basic metabolic panel (BMP)**: BUN, bicarbonate, calcium, chloride, creatinine, glucose, potassium, sodium
- **Complete blood count (CBC)**: hematocrit, hemoglobin, platelet count, red blood cell count, white blood cell count
- TSH
- Free T4

The unique patient ID and chart time were also retained to identify each sample. Each sample contains one set of 16 lab values for a patient, and a patient may have several samples in the data set that were run at different times. Rows were retained if they had fewer than three missing results; these missing results can be filled in by imputation later in the process. Samples were then filtered to those with a TSH value outside the reference range of 0.27 - 4.2 uIU/mL, because these are the samples that would have reflexed to Free T4 testing. After filtering, the final data set contained `r nrow(ds1)` rows.
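
The row-level filtering described above can be sketched in base R as follows. This is a minimal illustration rather than the study's actual script: the function `filter_samples()`, the column names, and the toy data frame are hypothetical, and only four lab columns are shown instead of the full set.

```r
# Hypothetical helper: keep rows with fewer than three missing lab results,
# then keep only rows whose TSH is outside the reference range.
filter_samples <- function(samples, lab_cols, tsh_low = 0.27, tsh_high = 4.2) {
  # Count missing lab results in each row
  n_missing <- rowSums(is.na(samples[, lab_cols, drop = FALSE]))
  samples <- samples[n_missing < 3, , drop = FALSE]

  # TRUE only when TSH is present and below or above the reference range
  abnormal_tsh <- !is.na(samples$TSH) &
    (samples$TSH < tsh_low | samples$TSH > tsh_high)
  samples[abnormal_tsh, , drop = FALSE]
}

# Toy data (illustrative values only, not taken from MIMIC)
labs <- data.frame(
  subject_id = c(1, 2, 3, 4),
  charttime  = c("2015-01-01 08:00", "2015-01-02 09:30",
                 "2015-01-03 07:15", "2015-01-04 10:45"),
  TSH = c(5.10, 1.20, 0.10, 6.30),
  BUN = c(14, 20, NA, NA),
  CA  = c(NA, 9.1, NA, NA),
  GLU = c(NA, 95, 110, NA)
)

kept <- filter_samples(labs, lab_cols = c("TSH", "BUN", "CA", "GLU"))
```

In this toy example, the second row is dropped because its TSH is within the reference range, and the fourth row is dropped because it has three missing results.
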

Once the final data set was collected, an outcome column was added to indicate whether the Free T4 value was diagnostic. After the outcome variable was added, the Free T4 value was dropped from each row. @tbl-outcome_var shows how the outcomes were assigned.

| TSH Value     | Free T4 Value | Outcome             |
|---------------|---------------|---------------------|
| \>4.2 uIU/mL  | \>0.93 ng/dL  | Non-Hypothyroidism  |
| \>4.2 uIU/mL  | \<0.93 ng/dL  | Hypothyroidism      |
| \<0.27 uIU/mL | \<1.7 ng/dL   | Non-Hyperthyroidism |
| \<0.27 uIU/mL | \>1.7 ng/dL   | Hyperthyroidism     |

: Outcome Variable {#tbl-outcome_var}

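
The rules in @tbl-outcome_var could be applied with a small function like the one below. This is a sketch under stated assumptions: `assign_outcome()`, the column names `TSH` and `FT4`, and the example values are hypothetical. The table does not say how a Free T4 value exactly equal to a cutoff is handled, so this sketch leaves such rows as `NA` rather than guessing.

```r
# Hypothetical helper that maps TSH and Free T4 values to the four outcomes.
assign_outcome <- function(tsh, ft4) {
  outcome <- rep(NA_character_, length(tsh))

  # Elevated TSH: Free T4 below 0.93 ng/dL is treated as hypothyroidism
  outcome[which(tsh > 4.2 & ft4 < 0.93)] <- "Hypothyroidism"
  outcome[which(tsh > 4.2 & ft4 > 0.93)] <- "Non-Hypothyroidism"

  # Suppressed TSH: Free T4 above 1.7 ng/dL is treated as hyperthyroidism
  outcome[which(tsh < 0.27 & ft4 > 1.7)] <- "Hyperthyroidism"
  outcome[which(tsh < 0.27 & ft4 < 1.7)] <- "Non-Hyperthyroidism"

  outcome
}

# Illustrative values only
samples <- data.frame(
  TSH = c(6.00, 5.20, 0.10, 0.05),
  FT4 = c(0.70, 1.20, 2.10, 1.10)
)

samples$outcome <- assign_outcome(samples$TSH, samples$FT4)

# Drop Free T4 once the outcome has been derived from it
samples$FT4 <- NULL
```

Using `which()` means rows with a missing TSH or Free T4 value are skipped and keep an `NA` outcome instead of raising an error.
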

@tbl-data_summary shows the summary statistics for each variable selected for the study. Each numeric variable is listed with the percent missing, median, and interquartile range (IQR). The data set is weighted toward elevated TSH levels, with 80% of values falling into that category. Glucose and calcium both have a high number of missing values, at `r gtsummary::inline_text(summary_tbl, variable = GLU, column = n)` and `r gtsummary::inline_text(summary_tbl, variable = CA, column = n)`, respectively.

```{r}
#| label: tbl-data_summary
#| tbl-cap: Data Summary
#| echo: false
summary_tbl %>% gtsummary::as_kable()
```
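
For reference, the three statistics reported per numeric variable (percent missing, median, and IQR) can be reproduced by hand in base R. The table itself is built with gtsummary; the helper `summarise_numeric()` and the glucose values below are hypothetical and only illustrate what each column means.

```r
# Hypothetical helper: percent missing, median, and the quartiles that bound the IQR.
summarise_numeric <- function(x) {
  c(
    pct_missing = 100 * mean(is.na(x)),
    median      = median(x, na.rm = TRUE),
    q1          = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3          = unname(quantile(x, 0.75, na.rm = TRUE))
  )
}

# Illustrative values only: one of five results is missing
glucose <- c(90, 100, NA, 110, 120)
stats <- summarise_numeric(glucose)
```
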
## Data Inspection

![Distribution of Variables](figures/distrubution_histo){#fig-distro_histo}

![Variable Correlation Plot](figures/corr_plot){#fig-corr_plot}

## Data Transformations

In progress

## Model Selection

In progress