DHSC-Capstone/chapter5.qmd

# Discussion

## Summary of Results

The findings of this study indicate that within another commonly ordered
laboratory testing, the diagnostic value of Free T4 can be predicted
accurately 80% of the time. While examining only the elevated TSH
results, the algorithm had a false positive rate of 2% and a false
negative rate of 16%. In the original data, 76% of the time, the result
was non-diagnostic for Hypo-Thryodism. For the decreased TSH results,
the algorithm had a false positive rate of 8% and a false negative rate
of 20%. In the original data, 67% of the time, the result was
non-diagnostic for Hyper-Thryodism.

While the model achieved an overall accuracy of 80%, it struggled to
identify positives with a sensitivity of only 63%. However, the model
did achieve a specificity of 89%. Sensitivity refers to a test's ability
to designate an individual with the disease as positive. A highly
sensitive test means few false negative results, and thus fewer disease
cases are missed. The specificity of a test is its ability to designate
an individual who does not have a disease as negative. A highly specific
test means that there are few false positive results. It may not be
feasible to use a test with low specificity for screening since many
people without the disease will screen positive and potentially receive
unnecessary diagnostic procedures [@newyorkstatedepartmentofhealth].

In a study by Xu et al., a machine learning model was used to predict
laboratory test results as normal or abnormal to identify low-yield,
repetitive laboratory tests [-@xu2019]. Their group performed a
multi-site study of nearly 200,000 inpatient laboratory testing orders
to identify the most repetitive laboratory tests and then attempted to
predict each one. They achieved an AUROC of \> 90% for 20 common
laboratory tests, including sodium, hemoglobin, and lactate
dehydrogenase. They proposed a sensitive decision threshold of a
negative predictive value of 95% to power a clinical decision support
tool aimed at reducing low-yield, repetitive testing [@xu2019]. No other
published studies exist in the clinical laboratory with a proposed value
for the success of a machine learning model. If using the 95%
specificity threshold, the current model does not achieve the result
necessary to be considered final.

While TSH was expected to be the most important variable in building
random forest models, it was entirely unexpected that the following
three values would be Hematology results. In the clinical laboratory,
TSH and CBCs are often run on different analyzers and in other
departments. Finding this slight correlation could be valuable to
building further algorithms. The currently available literature states
TSH and fT4 have a complex, nonlinear relationship, such that small
changes in fT4 result in relatively large changes in TSH [@plebani2020].
However, no currently available literature explores a relationship
between TSH and any of the CBC tests. These small changes between FT4
and TSH may be explained if this link can be expanded. While this study
only focuses on high-level CBC testing, most automated CBC analyzers can
run many more tests, which could be used in the development of future
algorithms.

## Real World Applications

While the current algorithm did not quite achieve an accuracy ready for
deployment, it is hypothesized that a system like this could be
implemented in clinical decision-making systems. As stated previously,
current practice is a physician (or other care providers) orders a TSH,
and if the value is outside laboratory-established reference ranges, the
Free T4 is added on. In the current study database, this reflex testing
was non-diagnostic 76% of the time for elevated TSH values and 67% for
decreased TSH values. Using clinical decision support first to predict
whether the Free T4 would be diagnostic, the care provider can use this
prediction and other patient signs and symptoms to determine if running
a Free T4 lab test is needed.

Similarly to Luo et al., the idea that the diagnostic information
offered by Free T4 often duplicates what other diagnostic tests provide
suggests a notion of "informationally" redundant testing [-@luo2016]. It
is speculated that informationally redundant testing occurs in various
diagnostic settings and diagnostic workups. It is much more frequent
than the more traditionally defined and narrowly framed notion of
redundant testing, which most often includes unintended duplications of
the same or similar tests. Under this narrow definition, redundant
laboratory testing is estimated to waste more than \$5 billion annually
in the United States, potentially dwarfed by the waste from
informationally redundant testing [@luo2016]. However, since Free T4 and
all other tests used in this study are performed on automated
instruments, the cost savings to the lab and patient may be minimal.

As Rabbani et al. study showed, Machine Learning in the Clinical
Laboratory is an emerging field. However, few existing studies relate to
predicting laboratory values based on other results [-@rabbani2022]. The
few studies that do exist follow a similar premise. All are trying to
reduce redundant laboratory testing, thus lowering the patient's cost.

## Study Limitations

While the MIMIC-IV database allowed for a first run of the study, it
does suffer from some issues compared to other patient results. The
MIMIC-IV database only contains results from ICU patients. Thus the
result may not represent normal results for patients typically screened
for hyper or hypothyroidism. In a study by Tyler et al., they found that
laboratory value ranges from critically ill patients deviate
significantly from those of healthy controls [-@tyler2018]. In their
study, distribution curves based on ICU data, have differed considerably
from the standard hospital range (mean \[SD\] overlapping coefficient,
0.51 \[0.32-0.69\]) [@tyler2018]. The data ranges from 2008 to 2019.
During this time, there could have been several unknown laboratory
changes. Often laboratories change methods, reference ranges, or even
vendors. None of this data is available in the MIMIC database. A change
in method or vendor could cause a shift in results, thus causing the
algorithm to assign incorrect outcomes.

The dataset also sufferers from incompleteness. Due to the fact the
database was not explicitly designed for this study, many patients do
not have complete sets of lab results. The study also had to pick and
choose lab tests to allow for as many groups of TSH and Free T4 results
as possible. For instance, in a study by Luo et al., a total of 42
different lab tests were selected for a Machine Learning study, compared
to only 16 selected for this study [-@luo2016]. The patient demographic
data also suffered from the same incompleteness. Due to this fact, only
the age and gender of the patient were used in developing the algorithm.
An early study by Schectman et al. found the mean TSH level of Blacks
was 0.4 (SE .053) mU/L lower than that for Whites after age and sex
adjustment, race explaining 6.5 percent of the variation in TSH levels
[-@schectman1991]. This variation in results should potentially be
included in developing a future algorithm. However, as it stands, the
current data set has incomplete data for patient race and ethnicity.

As Machine learning algorithms become more and more powerful, it is
additionally vital from an infrastructure standpoint to have the
processing power capable of handling the algorithms. This becomes even
more important in an attempt to put the algorithm into practice, as the
computer must be able to process results in mere milliseconds.

## Future Studies

While the current algorithm is not quite ready for production use, it
does lead to many promising ideas. The first step to further develop
this algorithm would be collecting data on non-ICU patients. The idea
would be gathering data on patients much closer to those screened for
Hypo and Hyper-Thyrodism. With data closer to normal, the optimal
hyperparameters could continue to be tweaked, as well as training the
model with this data. There could also be a reason to try and test the
current algorithm with different patient data to assess performance.
This would be similar to what Li et al. performed with their study to
identify unnecessary laboratory tests [-@li2022]. After developing their
algorithm on the MIMIC-III database, they gathered data from Memorial
Hermann Hospital in Houston, Texas. However, their algorithm was
designed for ICU patients in this study, so this was a more direct
performance comparison. In the case of this study, the algorithm was
intended more as a proof of concept than are production-ready idea.

One of the most challenging parts of this study and any machine learning
in the clinical laboratory is implementing it after the fact. Developing
an algorithm that can predict laboratory testing is just half the idea.
Many current laboratory information systems would be unable to handle
this type of clinical decision-making system, as this would be much
outside the expected behavior of these systems.