College of Graduate Studies: Theses & Dissertations

Application of Machine Learning and Large Language Models in Healthcare for Data Prediction and Summarization

Chiazam Chisom Izuchukwu, Georgia Southern University

Term of Award

Spring 2025

Degree Name

Master of Science, Information Technology

Document Type and Release Option

Thesis (open access)

Copyright Statement / License for Reuse

This work is licensed under a Creative Commons Attribution 4.0 License.

Department

Department of Information Technology

Committee Chair

Hayden Wimmer

Committee Member 1

Jongyeop Kim

Committee Member 2

Atef Mohamed

Abstract

This study examines the use of machine learning (ML) and large language models (LLMs) in healthcare to enhance disease prediction, clinical decision-making, and information management. Five supervised ML models, Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), and Naïve Bayes (NB), were applied to disease classification on three computing platforms: Google Colab, Databricks, and Snowflake. Data preprocessing included handling missing values, encoding categorical variables with one-hot encoding, scaling features where needed, and addressing class imbalance with the Synthetic Minority Over-sampling Technique (SMOTE) before an 80-20 train-test split. Models were built with scikit-learn (Google Colab), Spark MLlib (Databricks), and Snowpark (Snowflake), and their performance was measured with classification metrics (accuracy, precision, recall, F1-score, and AUC-ROC) and regression metrics (Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R²). The study also explores whether LLMs can generate concise summaries of oncology reports (HTML) to reduce information overload and inform clinical decision-making. The summaries were generated with pre-trained transformer models such as BART, T5, and Pegasus and evaluated using BLEU, ROUGE, and BERTScore. Additionally, performance was compared across recursive (summary-of-summaries) and direct summarization techniques and against outputs from conversational AI models (e.g., ChatGPT, Google NotebookLM).
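As an illustration of the classification metrics named in the abstract, the following minimal sketch computes accuracy, precision, recall, and F1-score for a binary classifier in plain Python. It is not taken from the thesis: the function name `classification_metrics` and the toy labels are assumptions made purely for the example, and the thesis itself reports using scikit-learn, Spark MLlib, and Snowpark for this step.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration only; not code from the thesis.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts relative to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed data): 1 = disease present, 0 = disease absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.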

OCLC Number

1521193846

Catalog Permalink

https://galileo-georgiasouthern.primo.exlibrisgroup.com/permalink/01GALI_GASOUTH/1r4bu70/alma9916621325502950

Recommended Citation

Izuchukwu, Chiazam Chisom, "Application of Machine Learning and Large Language Models in Healthcare for Data Prediction and Summarization" (2025). College of Graduate Studies: Theses & Dissertations. 2919.
https://digitalcommons.georgiasouthern.edu/etd/2919

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons, Health Information Technology Commons
