This commit is contained in:
Kyle Belanger 2023-06-25 08:45:15 -04:00
parent 31c852f413
commit 5885dd08b3
8 changed files with 349 additions and 36 deletions

View file

@@ -140,6 +140,8 @@ gp2$ggsave(
)
ys$accuracy(class_test_results %>% tune::collect_predictions(), truth = ft4_dia, estimate = .pred_class)
ys$sensitivity(class_test_results %>% tune::collect_predictions(), truth = ft4_dia, estimate = .pred_class)
ys$specificity(class_test_results %>% tune::collect_predictions(), truth = ft4_dia, estimate = .pred_class)
class_test_results %>%
workflows::extract_fit_parsnip() %>%

View file

@@ -7,13 +7,13 @@ book:
  date: "8/2/2022"
  chapters:
    - index.qmd
    - abstract.qmd
    - chapter1.qmd
    - chapter2.qmd
    - chapter3.qmd
    - chapter4.qmd
    - chapter5.qmd
    - references.qmd
abstract: "This is a test to see what happens with this"

abstract.qmd Normal file
View file

@@ -0,0 +1,51 @@
## Abstract {.unnumbered}
**Introduction:** This study focuses on developing and testing
a machine learning algorithm to predict the FT4 result, or diagnose hyper-
or hypothyroidism, in clinical chemistry. The goal is to bridge the gap
between hard-coded reflex testing and fully manual reflective testing
using machine learning algorithms. The significance of this study lies
in rising healthcare costs, to which laboratory services contribute
substantially through their role in medical decisions and budgets. By
implementing automated reflex testing with machine learning algorithms,
unnecessary laboratory tests can be reduced, resulting in cost savings
and improved efficiency in the healthcare system.
**Methods:** The study used the Medical Information Mart for Intensive
Care (MIMIC) database for data collection. The database consists of
de-identified health-related data from critical care units. Eighteen
variables, including patient demographics and lab values, were selected
for the study. The data set was filtered based on specific criteria, and
an outcome variable was created to indicate whether the Free T4 value was
diagnostic. Data handling and modeling were performed using R and
RStudio. Regression and classification models were screened using a
random grid search to tune hyperparameters, and random forest models were
selected as the final models based on their performance. The selected
hyperparameters for both the regression and classification models are
reported.
**Results:** The study analyzed a data set of 11,340 observations,
randomly split into a training set (9071 observations) and a testing set
(2269 observations) with stratification on the Free T4 laboratory
diagnostic value. Classification algorithms were used to predict whether
Free T4 would be diagnostic, achieving an accuracy of 0.796 and an AUC of
0.918. The model had a sensitivity of 0.632 and a specificity of 0.892.
The importance of individual analytes was assessed, with TSH being the
most influential variable. The study also evaluated the predictability of
Free T4 results using regression, achieving a root mean square error
(RMSE) of 0.334. The predicted results had an accuracy of 0.790, similar
to the classification model.
**Discussion:** The study found that the diagnostic value of Free T4 can
be accurately predicted 80% of the time using machine learning
algorithms. However, the model had limitations in terms of sensitivity,
with a false negative rate of 16% for elevated TSH results and 20% for
decreased TSH results. The model achieved a specificity of 89% but did
not meet the threshold for clinical deployment. The importance of
individual analytes was explored, revealing unexpected correlations
between TSH and hematology results, which could be valuable for future
algorithms. Real-world applications could use predictive models in
clinical decision-making systems to determine the need for Free T4 lab
tests based on predictions and patient signs and symptoms. However,
implementing such algorithms in existing laboratory information systems
poses challenges.

View file

@@ -2,11 +2,22 @@
## IRB
This study was submitted to the Campbell University Institutional Review
Board (Campbell IRB). The study was determined to be Not Human Subjects
Research as defined by 45 CFR 46.102(e) and thus exempt from further
review by the IRB.
## Population and Data
This study used the Medical Information Mart for Intensive Care (MIMIC)
database [@johnsonalistair]. MIMIC is an extensive, freely available
database comprising de-identified health-related data from patients
admitted to the critical care units of the Beth Israel Deaconess Medical
Center. The database contains many different types of information, but
only data from the patients and laboratory events tables are used in this
study. The study uses version IV of the database, comprising data from
2008-2019.
## Data Variables and Outcomes
@@ -19,19 +30,37 @@ source(here::here("ML","1-data-exploration.R"))
```
A total of 18 variables were chosen for this study. The age and gender
of the patient were pulled from the patients table in the MIMIC database.
While this database contains some additional demographic information, it
is incomplete and thus unusable for this study. Fifteen lab values were
selected for this study, including:
- **BMP**: BUN, bicarbonate, calcium, chloride, creatinine, glucose, potassium, sodium
- **CBC**: Hematocrit, hemoglobin, platelet count, red blood cell count, white blood cell count
- TSH
- Free T4
The unique patient ID and chart time were also retained to identify each
sample. Each sample contains one set of 15 lab values for each patient,
and patients may have several samples in the data set run at different
times. Rows were retained as long as they had fewer than three missing
results; these missing results can be filled in by imputation later in
the process. Samples were also filtered for those with TSH above or below
the reference range of 0.27-4.2 uIU/mL, representing samples that would
have reflexed for Free T4 testing. After filtering, the final data set
contained `r nrow(ds1)` rows.
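A minimal sketch of this filtering step is shown below; it assumes a wide data frame `labs` with one column per analyte, and the object name and column range are illustrative rather than taken from the study's actual code.

```r
# Sketch only: `labs`, the analyte column range BUN:FT4, and the name ds1
# are assumptions; the TSH limits come from the reference range above.
library(dplyr)

ds1 <- labs %>%
  # count missing analyte results per sample and keep rows with fewer than three
  mutate(n_missing = rowSums(is.na(pick(BUN:FT4)))) %>%
  filter(n_missing < 3) %>%
  # keep only samples with TSH outside the 0.27 - 4.2 uIU/mL reference range
  filter(TSH < 0.27 | TSH > 4.2) %>%
  select(-n_missing)
```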
Once the final data set was collected, an additional column was created
for the outcome variable, indicating whether the Free T4 value was
diagnostic. This outcome variable was used for building classification
models; it was not used in the regression models. @tbl-outcome_var shows
how the outcomes were assigned.
| TSH Value | Free T4 Value | Outcome |
|---------------|---------------|---------------------|
@@ -42,7 +71,14 @@ Once the final data set was collected, an additional column was created for the
: Outcome Variable {#tbl-outcome_var}
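A minimal sketch of how such an outcome column could be derived with `dplyr::case_when()` is shown below; the four class labels and the Free T4 limits (`ft4_low`, `ft4_high`) are placeholders, not values or names taken from the study.

```r
# Sketch only: class labels and ft4_low / ft4_high are placeholders; the
# TSH limits come from the reference range quoted above.
library(dplyr)

ds1 <- ds1 %>%
  mutate(
    ft4_dia = case_when(
      TSH > 4.20 & FT4 < ft4_low  ~ "hypo_diagnostic",
      TSH > 4.20                  ~ "hypo_nondiagnostic",
      TSH < 0.27 & FT4 > ft4_high ~ "hyper_diagnostic",
      TSH < 0.27                  ~ "hyper_nondiagnostic"
    ),
    ft4_dia = factor(ft4_dia)
  )
```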
@tbl-data_summary shows the summary statistics of each variable selected
for the study. Each numeric variable is listed with the percent missing,
median, and interquartile range (IQR). The data set is weighted toward
elevated TSH levels, with 80% of values falling into that category.
Glucose and Calcium have several missing values at
`r gtsummary::inline_text(summary_tbl, variable = GLU, column = n)` and
`r gtsummary::inline_text(summary_tbl, variable = CA, column = n)`,
respectively.
```{r}
#| label: tbl-data_summary
@@ -54,19 +90,41 @@ summary_tbl %>% gtsummary$as_kable()
## Data Inspection
Examining @tbl-data_summary brings several important characteristics of
the data set to light. The median age is similar across all categories,
with an overall median of 62.5. Females are better represented in the
data set, with higher percentages in all categories. The median values
for each lab result are also quite similar across categories. The
exception to this is red blood cells, which show more considerable
variation across the various categories.
![Distribution of Variables](figures/distrubution_histo){#fig-distro_histo}
When examining @fig-distro_histo, many clinical chemistry values do not
follow a normal distribution. However, the hematology results typically
do appear to follow a normal distribution. While not a problem for most
tree-based classification models, many regression models perform better
with standardized variables. Standardizing variables provides a common,
comparable unit of measure across all the variables [@boehmke2020].
Since lab values do not contain negative numbers, all numeric values will
be log-transformed to bring them closer to normal distributions.
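A minimal sketch of this transformation as a tidymodels recipe is shown below; the recipe itself, the outcome `FT4`, and the training data object are assumptions, since the study's actual preprocessing code is not shown here.

```r
# Sketch only: FT4 as the outcome and ft4_train as the training data are
# assumptions; the log transform mirrors the step described in the text.
library(recipes)

ft4_recipe <- recipe(FT4 ~ ., data = ft4_train) %>%
  step_log(all_numeric_predictors()) %>%       # pull skewed lab values toward normal
  step_impute_knn(all_numeric_predictors())    # fill the small number of missing results
```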
![Variable Correlation Plot](figures/corr_plot){#fig-corr_plot}
@fig-corr_plot shows a high correlation between hemoglobin, hematocrit,
and red blood cell values, as expected. While high correlation does not
lead to model issues, it can cause unnecessary computation with little
added value. However, due to the small number of variables, the
computational burden is not expected to cause delays, and thus the
variables will not be removed.
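A minimal sketch of producing such a correlation plot is shown below; the use of the corrr package here is an assumption for illustration and may differ from the code that generated @fig-corr_plot.

```r
# Sketch only: corrr is used for illustration; ds1 is the filtered data set.
library(dplyr)
library(corrr)

ds1 %>%
  select(where(is.numeric)) %>%
  correlate(method = "pearson") %>%
  rplot()                        # dot plot of pairwise correlations
```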
## Data Tools
All data handling and modeling were performed using R and RStudio. The
current report was rendered in the following environment.
```{r}
#| label: tbl-platform-info
@@ -120,7 +178,14 @@ knitr::kable(
## Model Selection
Both classification and regression models were screened using a random
grid search to tune hyperparameters. The models were tested against the
training data set to find the best-fit model. @fig-reg-screen shows the
results of the model screening for regression models, using root mean
square error (RMSE) as the ranking method. Random forest models and
boosted trees performed similarly and were selected for further testing.
A full grid search was performed on both models, with a random forest
model as the final selection. The final hyperparameters selected were:
- mtry: 8
@@ -130,7 +195,12 @@ Both classification and regression models were screened using a random grid sear
![Regression Model Screen](figures/reg_screen){#fig-reg-screen}
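A minimal sketch of this screening step for the regression model is shown below, using tidymodels; the formula, resampling scheme, and grid size are assumptions rather than the study's exact setup.

```r
# Sketch only: the formula, resamples, and grid size are assumptions; the
# random grid search over random forest hyperparameters mirrors the text.
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(FT4 ~ .)          # illustrative formula

folds <- vfold_cv(ft4_train, v = 10)

rf_grid <- grid_random(
  finalize(mtry(), ft4_train),  # mtry range depends on the number of predictors
  min_n(),
  trees(),
  size = 25
)

rf_res <- tune_grid(rf_wf, resamples = folds, grid = rf_grid,
                    metrics = metric_set(rmse))
show_best(rf_res, metric = "rmse")
```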
@fig-class-screen shows the results of the model screening for
classification models, using accuracy as the ranking method. As with the
regression models, boosted trees and random forest models performed the
best. After completing a full grid search of both model types, a random
forest model was again chosen as the final model. The final
hyperparameters for the selected model were:
- mtry: 8

View file

@@ -8,7 +8,13 @@ load(here::here("figures", "strata_table.Rda"))
```
The final data set used for this analysis consisted of 11,340
observations. All observations contained a TSH and Free T4 result and
fewer than three missing results from the other analytes selected for the
study. The data set was then randomly split into a training set
containing 9071 observations and a testing set containing 2269
observations. The data were split with stratification on the Free T4
laboratory diagnostic value. @tbl-strata shows the split percentages.
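A minimal sketch of this stratified split with rsample is shown below; the seed, object names, and the exact proportion are assumptions chosen to match the approximate training and testing sizes reported here.

```r
# Sketch only: the seed and the 80/20 proportion are assumptions.
library(rsample)

set.seed(123)
ft4_split <- initial_split(ds1, prop = 0.8, strata = ft4_dia)
ft4_train <- training(ft4_split)   # ~9071 observations
ft4_test  <- testing(ft4_split)    # ~2269 observations
```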
```{r}
#| label: tbl-strata
@@ -19,26 +25,66 @@ strata_table %>% knitr::kable()
```
First, the report shows the ability of classification algorithms to
predict whether Free T4 will be diagnostic, with prediction quality
measured by area under the curve (AUC) and accuracy. Data regarding the
importance of the association between each predictor analyte and the Free
T4 diagnostic value are then presented. Finally, the extent to which Free
T4 itself can be predicted is presented by examining correlation
statistics relating measured and predicted Free T4 values.
## Predictability of Free T4 Classifications
In clinical decision-making, a key consideration in interpreting
numerical laboratory results is often simply whether the results fall
within the normal reference range [@luo2016]. In the case of Free T4
reflex testing, the results will either fall within the normal range,
indicating the Free T4 is not diagnostic of hyper- or hypothyroidism, or
they will fall outside those ranges, indicating they are diagnostic. The
final model achieved an accuracy of 0.796 and an AUC of 0.918.
@fig-roc_curve provides ROC curves for each of the four outcome classes.
The same model achieved a sensitivity of 0.632 and a specificity of 0.892.
![ROC curves for each of the four outcome classes](figures/roc_curve_class){#fig-roc_curve}
@fig-conf-matrix-class shows the confusion matrix of the final testing
data. Of the 2269 total results, 1805 were predicted correctly, leaving
464 incorrectly predicted results. Of the incorrectly predicted results,
72 predicted a diagnostic Free T4 when the correct result was
non-diagnostic, and 392 were predicted as non-diagnostic when the correct
result was diagnostic.
![Final Model Confusion Matrix](figures/conf_matrix_class){#fig-conf-matrix-class}
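A minimal sketch of producing such a confusion matrix with yardstick is shown below; the `class_test_results` object and `ft4_dia` column follow the names used in the analysis script included in this commit, while the heatmap rendering itself is an assumption.

```r
# Sketch only: the heatmap plot is illustrative; object and column names
# follow the metric calls shown earlier in the analysis script.
library(tune)       # collect_predictions()
library(yardstick)  # conf_mat()
library(ggplot2)    # autoplot() method for conf_mat

class_test_results %>%
  collect_predictions() %>%
  conf_mat(truth = ft4_dia, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
```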
## Contributions of Individual Analytes
Understanding how an ML model makes predictions helps build trust in the
model and is the fundamental idea of the emerging field of interpretable
machine learning (IML) [@greenwell2020]. @fig-vip-class shows the
importance of features in the final model. Importance can be defined as
the extent to which a feature has a "meaningful" impact on the predicted
outcome [@laan2006]. As expected, TSH is the leading variable in the
importance rankings, leading all other variables by over 2,000 points.
The next three variables are all parts of a complete blood count (CBC),
followed by the patient's glucose value.
![Variable Importance Plot](figures/vip_class){#fig-vip-class}
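A minimal sketch of producing a variable-importance plot like @fig-vip-class is shown below; it assumes the final fitted workflow is available under an illustrative name and that the ranger engine was fit with an importance measure, which may differ from the study's exact code.

```r
# Sketch only: class_final_fit is an assumed name for the final fitted
# workflow; ranger must have been fit with importance = "impurity" (or
# similar) for vip() to have scores to plot.
library(workflows)  # extract_fit_parsnip()
library(vip)        # vip()

class_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 15)
```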
## Predictability of Free T4 Results (Regression)
Today, it is widely accepted that a sounder approach to assessing model
performance is to assess predictive accuracy via loss functions. Loss
functions are metrics that compare the predicted values to the actual
values (the output of a loss function is often referred to as the error
or pseudo-residual) [@boehmke2020]. The loss function selected to
evaluate the final model was the root mean square error (RMSE), and the
final testing data achieved an RMSE of 0.334. @fig-reg-pred shows the
plotted results. The predicted results were also used to assign the
diagnostic classification of Free T4; these results achieved an accuracy
of 0.790, very similar to the classification model.
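For reference, the root mean square error over the $n$ testing-set observations is

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
$$

where $y_i$ is the measured Free T4 value and $\hat{y}_i$ is the predicted value.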
![Regression Predictions Plot](figures/reggression_pred){#fig-reg-pred}

View file

@@ -1,31 +1,154 @@
# Discussion
Intro Paragraph - In progress<!--# Write after I write everything else -->
## Summary of Results
The findings of this study indicate that, from other commonly ordered
laboratory tests, the diagnostic value of Free T4 can be predicted
accurately 80% of the time. When examining only the elevated TSH results,
the algorithm had a false positive rate of 2% and a false negative rate
of 16%; in the original data, the result was non-diagnostic for
hypothyroidism 76% of the time. For the decreased TSH results, the
algorithm had a false positive rate of 8% and a false negative rate of
20%; in the original data, the result was non-diagnostic for
hyperthyroidism 67% of the time.
While the model achieved an overall accuracy of 80%, it struggled to
identify positives with a sensitivity of only 63%. However, the model
did achieve a specificity of 89%. Sensitivity refers to a test's ability
to designate an individual with the disease as positive. A highly
sensitive test means few false negative results, and thus fewer disease
cases are missed. The specificity of a test is its ability to designate
an individual who does not have a disease as negative. A highly specific
test means that there are few false positive results. It may not be
feasible to use a test with low specificity for screening since many
people without the disease will screen positive and potentially receive
unnecessary diagnostic procedures [@newyorkstatedepartmentofhealth].
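In terms of the confusion-matrix counts (true/false positives and negatives), these definitions are

$$
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}.
$$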
In a study by Xu et al., a machine learning model was used to predict
laboratory test results as normal or abnormal in order to identify
low-yield, repetitive laboratory tests [-@xu2019]. Their group performed
a multi-site study of nearly 200,000 inpatient laboratory testing orders
to identify the most repetitive laboratory tests and then attempted to
predict each one. They achieved an AUROC of \> 90% for 20 common
laboratory tests, including sodium, hemoglobin, and lactate
dehydrogenase. They proposed a sensitive decision threshold of a
negative predictive value of 95% to power a clinical decision support
tool aimed at reducing low-yield, repetitive testing [@xu2019]. No other
published studies in the clinical laboratory propose a value for what
counts as a successful machine learning model. If that 95% threshold is
used, the current model does not achieve the result necessary to be
considered ready for deployment.
While TSH was expected to be the most important variable in building
random forest models, it was entirely unexpected that the next three most
important variables would be hematology results. In the clinical
laboratory, TSH and CBCs are often run on different analyzers and in
different departments. Finding this slight correlation could be valuable
in building further algorithms. The currently available literature states
that TSH and FT4 have a complex, nonlinear relationship, such that small
changes in FT4 result in relatively large changes in TSH [@plebani2020].
However, no currently available literature explores a relationship
between TSH and any of the CBC tests. If this link can be expanded on, it
may help explain these small changes between FT4 and TSH. While this
study only focuses on high-level CBC testing, most automated CBC
analyzers can run many more tests, which could be used in the development
of future algorithms.
## Real World Applications
While the current algorithm did not quite achieve an accuracy ready for
deployment, it is hypothesized that a system like this could be
implemented in clinical decision-making systems. As stated previously,
current practice is that a physician (or other care provider) orders a
TSH, and if the value is outside laboratory-established reference ranges,
the Free T4 is added on. In the current study database, this reflex
testing was non-diagnostic 76% of the time for elevated TSH values and
67% of the time for decreased TSH values. By first using clinical
decision support to predict whether the Free T4 would be diagnostic, the
care provider could combine this prediction with the patient's signs and
symptoms to determine whether running a Free T4 lab test is needed.
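A minimal sketch of how a fitted classification workflow could back such a decision-support step is shown below; the workflow object, the class-probability column names, and the probability threshold are all illustrative assumptions.

```r
# Sketch only: the .pred_* column names and the 0.5 threshold are
# placeholders; real deployment would need a validated cutoff.
library(workflows)

suggest_ft4_addon <- function(fitted_wf, new_sample, threshold = 0.5) {
  probs <- predict(fitted_wf, new_data = new_sample, type = "prob")
  p_diagnostic <- probs$.pred_hypo_diagnostic + probs$.pred_hyper_diagnostic
  # recommend adding the Free T4 only when a diagnostic result looks likely
  p_diagnostic >= threshold
}
```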
Similar to Luo et al., the idea that the diagnostic information offered
by Free T4 often duplicates what other diagnostic tests provide suggests
a notion of "informationally" redundant testing [-@luo2016]. It is
speculated that informationally redundant testing occurs in various
diagnostic settings and workups and is much more frequent than the more
traditionally defined and narrowly framed notion of redundant testing,
which most often includes unintended duplications of the same or similar
tests. Under this narrow definition, redundant laboratory testing is
estimated to waste more than \$5 billion annually in the United States, a
figure potentially dwarfed by the waste from informationally redundant
testing [@luo2016]. However, since Free T4 and all other tests used in
this study are performed on automated instruments, the cost savings to
the lab and patient may be minimal.
As the study by Rabbani et al. showed, machine learning in the clinical
laboratory is an emerging field, and few existing studies relate to
predicting laboratory values based on other results [-@rabbani2022]. The
few studies that do exist follow a similar premise: all are trying to
reduce redundant laboratory testing and thus lower costs for the patient.
## Study Limitations
While the MIMIC-IV database allowed for a first run of the study, it does
suffer from some issues compared to other patient populations. The
MIMIC-IV database only contains results from ICU patients, so the results
may not represent those of patients typically screened for hyper- or
hypothyroidism. Tyler et al. found that laboratory value ranges from
critically ill patients deviate significantly from those of healthy
controls [-@tyler2018]. In their study, distribution curves based on ICU
data differed considerably from the standard hospital range (mean \[SD\]
overlapping coefficient, 0.51 \[0.32-0.69\]) [@tyler2018]. The data range
from 2008 to 2019, and during this time there could have been several
unknown laboratory changes. Laboratories often change methods, reference
ranges, or even vendors, and none of this information is available in the
MIMIC database. A change in method or vendor could cause a shift in
results, causing the algorithm to assign incorrect outcomes.
The data set also suffers from incompleteness. Because the database was
not explicitly designed for this study, many patients do not have
complete sets of lab results. The study also had to pick and choose lab
tests to allow for as many groups of TSH and Free T4 results as possible.
For instance, in a study by Luo et al., a total of 42 different lab tests
were selected for a machine learning study, compared to only 16 selected
for this study [-@luo2016]. The patient demographic data suffered from
the same incompleteness, so only the age and gender of the patient were
used in developing the algorithm. An early study by Schectman et al.
found the mean TSH level of Blacks was 0.4 (SE 0.053) mU/L lower than
that of Whites after age and sex adjustment, with race explaining 6.5
percent of the variation in TSH levels [-@schectman1991]. This variation
should potentially be accounted for in a future algorithm; however, as it
stands, the current data set has incomplete data for patient race and
ethnicity.
As machine learning algorithms become more and more powerful, it is also
vital from an infrastructure standpoint to have the processing power
capable of handling them. This becomes even more important when putting
an algorithm into practice, as the computer must be able to process
results in mere milliseconds.
## Future Studies
While the current algorithm is not quite ready for production use, it
does lead to many promising ideas. The first step in further developing
this algorithm would be collecting data on non-ICU patients, the idea
being to gather data on patients much closer to those typically screened
for hypo- and hyperthyroidism. With data closer to normal, the
hyperparameters could continue to be tuned and the model retrained. There
could also be value in testing the current algorithm on different patient
data to assess performance. This would be similar to what Li et al. did
in their study to identify unnecessary laboratory tests [-@li2022]: after
developing their algorithm on the MIMIC-III database, they gathered data
from Memorial Hermann Hospital in Houston, Texas. However, their
algorithm was designed for ICU patients, so theirs was a more direct
performance comparison. In the case of this study, the algorithm was
intended more as a proof of concept than a production-ready product.
One of the most challenging parts of this study, and of any machine
learning in the clinical laboratory, is implementation. Developing an
algorithm that can predict laboratory testing is just half the idea; many
current laboratory information systems would be unable to handle this
type of clinical decision-making system, as it falls well outside their
expected behavior.

BIN
extras/irbletter.docx Normal file

Binary file not shown.

View file

@@ -410,3 +410,24 @@ PMCID: PMC6324400}
note = {PMID: 2003636
PMCID: PMC1405055}
}
@misc{newyorkstatedepartmentofhealth,
  title = {Disease Screening - Statistics Teaching Tools - New York State Department of Health},
  author = {{New York State Department of Health}},
  url = {https://www.health.ny.gov/diseases/chronic/discreen.htm}
}
@article{xu2019,
  title = {Prevalence and Predictability of Low-Yield Inpatient Laboratory Diagnostic Tests},
  author = {Xu, Song and Hom, Jason and Balasubramanian, Santhosh and Schroeder, Lee F. and Najafi, Nader and Roy, Shivaal and Chen, Jonathan H.},
  year = {2019},
  month = {09},
  date = {2019-09-11},
  journal = {JAMA Network Open},
  pages = {e1910967},
  volume = {2},
  number = {9},
  doi = {10.1001/jamanetworkopen.2019.10967},
  url = {https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2749559},
  langid = {en}
}