College of Graduate Studies: Theses & Dissertations

Term of Award

Spring 2026

Degree Name

Master of Science, Information Technology

Document Type and Release Option

Thesis (open access)

Copyright Statement / License for Reuse

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Department

Department of Information Technology

Committee Chair

Atef Mohamed (Shalan)

Committee Member 1

Lei Chen

Committee Member 2

Hayden Wimmer

Abstract

Intensive Care Unit (ICU) patients do not follow a single uniform physiological pattern. Patients admitted with the same diagnosis show different clinical trajectories over time, making standardized classification and treatment approaches insufficient. The increasing availability of large-scale electronic health records in MIMIC-IV makes it possible to investigate such heterogeneity through a data-driven approach that captures how physiology evolves during the early phase of ICU admission. This thesis compares two analytical pipelines designed to identify physiological subtypes from the first 48 hours of ICU time series data, and then assesses how well these subtypes predict in-hospital mortality. The first pipeline, referred to as Study 1, employs a compact temporal representation of 94 features derived from a selected set of patient vital signs and laboratory variables. The second pipeline, referred to as Study 2, expands this representation to 358 features by incorporating a broader set of physiological variables (15 vital signs and 22 laboratory variables) across the general ICU population, without restricting to a specific diagnosis.

Both pipelines employ Principal Component Analysis (PCA) to reduce the dimensionality of the feature representation, which is then subjected to K-Means (K = 3) and Bayesian Gaussian Mixture Model (BGMM) clustering algorithms to discover the underlying patient subtypes. Supervised learning is employed using Logistic Regression and XGBoost classifiers to predict patient in-hospital mortality risk.
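The unsupervised stage described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn on synthetic data, not the thesis code: the feature matrix, the number of principal components (5), and the BGMM upper bound of 6 components are all assumptions chosen for the example. Only K = 3 for K-Means comes from the abstract.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for a per-stay feature matrix (rows = ICU stays).
# Sizes and values are placeholders, not MIMIC-IV data.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(100, 20)),
    rng.normal(loc=0.0, scale=1.0, size=(100, 20)),
    rng.normal(loc=2.0, scale=1.0, size=(100, 20)),
])

# Standardize the features, then project onto a few principal components.
# n_components=5 is an assumed value for illustration.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=5, random_state=0).fit_transform(X_scaled)

# K-Means with K = 3, as stated in the abstract.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)

# Bayesian GMM: n_components acts as an upper bound (assumed to be 6 here);
# components the data do not support tend to receive very small weights.
bgmm = BayesianGaussianMixture(n_components=6, max_iter=500, random_state=0)
bgmm_labels = bgmm.fit_predict(X_pca)

print("K-Means cluster sizes:", np.bincount(kmeans_labels))
print("BGMM cluster sizes:", np.bincount(bgmm_labels, minlength=6))
```

In a sketch like this, the cluster labels could then be added as inputs to, or compared against, the supervised mortality classifiers. How the thesis combines the two stages is not specified in the abstract.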

Study 1 achieves strong performance with an XGBoost AUROC of 0.85 and an AUPRC of 0.63, along with good calibration reflected by a low Brier score. Study 2 achieves slightly lower discrimination, with an XGBoost AUROC of 0.828 and an AUPRC of 0.443, but a significantly improved Brier score of 0.0785. Study 2 also introduces SHAP (SHapley Additive exPlanations) analysis of the top features, which identifies oxygen saturation (SpO2) variability, lactate slope, blood pH slope, creatinine, and lactate levels as the key predictors of in-hospital mortality. The identified subtypes show clear clinical separation, with the high-severity group having an in-hospital mortality rate of 21.5% compared to 5.4% in the more stable group.
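For readers unfamiliar with the metrics reported above, the following is a small sketch of how the Brier score and AUROC are defined, written in plain Python. The labels and probabilities are toy values made up for illustration and the function names are hypothetical. Neither is drawn from the thesis.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Probability that a random positive case is scored above a random negative case.

    Ties count as half. This pairwise form is O(n_pos * n_neg), fine for small examples.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = died in hospital, 0 = survived (made-up values).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(brier_score(y_true, y_prob))
print(auroc(y_true, y_prob))
```

A constant prediction equal to the event rate p has a Brier score of p(1 − p), so Brier scores are easiest to compare between cohorts with similar mortality rates.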

Overall, this comparative analysis suggests that increasing feature complexity does not necessarily improve the predictive performance of the model. A smaller, carefully selected feature set can achieve better classification, while a larger feature set provides deeper clinical insight and supports interpretability.

OCLC Number

1588663753

Research Data and Supplementary Material

No
