Early-Stage Diabetes Prediction using Apache Spark and MLlib

Faculty Mentor

Dr. Hayden Wimmer

Location

Poster 210

Session Format

Poster Presentation

Academic Unit

Department of Information Technology

Background

  • We develop a diabetes prediction system on Apache Spark.

  • We use the Hadoop Distributed File System (HDFS) to store our dataset and load it into Spark. We use PySpark so that we can write our Spark commands in Python.

  • We use the ‘Early-stage diabetes risk prediction dataset’ from the UCI Machine Learning Repository.

  • To develop our prediction models, we utilize four machine learning algorithms: Decision Trees, Random Forest, Gradient Boosted Trees and Naïve Bayes.
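
The workflow described above (load the dataset from HDFS with PySpark, then fit the four MLlib classifiers) might look roughly like the following sketch. It is not the authors' implementation. The column names are assumed to match the header of the UCI CSV file, and the 80/20 split, the seed, and accuracy as the metric are illustrative choices that the abstract does not state. The helper hdfs_uri and the function names are hypothetical. The PySpark imports sit inside the functions so that the configuration can be inspected without a Spark installation.

```python
# Assumed column names for the UCI early-stage diabetes CSV; adjust to the real header.
LABEL_COLUMN = "class"
NUMERIC_COLUMNS = ["Age"]
CATEGORICAL_COLUMNS = [
    "Gender", "Polyuria", "Polydipsia", "sudden weight loss", "weakness",
    "Polyphagia", "Genital thrush", "visual blurring", "Itching",
    "Irritability", "delayed healing", "partial paresis",
    "muscle stiffness", "Alopecia", "Obesity",
]


def hdfs_uri(host, port, path):
    """Build an hdfs:// URI for a file stored in HDFS (hypothetical helper)."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def build_classifiers():
    """Return the four MLlib classifiers named in the text, keyed by name.

    Must be called after a SparkSession exists, because MLlib estimators
    are backed by JVM objects.
    """
    from pyspark.ml.classification import (
        DecisionTreeClassifier, GBTClassifier, NaiveBayes, RandomForestClassifier,
    )
    common = {"featuresCol": "features", "labelCol": "label"}
    return {
        "Decision Tree": DecisionTreeClassifier(**common),
        "Random Forest": RandomForestClassifier(**common),
        "Gradient Boosted Trees": GBTClassifier(**common),
        "Naive Bayes": NaiveBayes(**common),
    }


def train_and_score(csv_uri, seed=42):
    """Load the CSV, fit each classifier, and return test accuracy per model."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("diabetes-prediction-sketch").getOrCreate()
    df = spark.read.csv(csv_uri, header=True, inferSchema=True)

    # Encode the string-valued columns and the label as numeric indices,
    # then assemble all features into a single vector column.
    indexers = [
        StringIndexer(inputCol=c, outputCol=f"{c}_idx") for c in CATEGORICAL_COLUMNS
    ]
    label_indexer = StringIndexer(inputCol=LABEL_COLUMN, outputCol="label")
    assembler = VectorAssembler(
        inputCols=NUMERIC_COLUMNS + [f"{c}_idx" for c in CATEGORICAL_COLUMNS],
        outputCol="features",
    )
    prep = Pipeline(stages=indexers + [label_indexer, assembler])
    prepared = prep.fit(df).transform(df)

    # Illustrative 80/20 split; the abstract does not state the split used.
    train, test = prepared.randomSplit([0.8, 0.2], seed=seed)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )

    scores = {}
    for name, classifier in build_classifiers().items():
        model = classifier.fit(train)
        scores[name] = evaluator.evaluate(model.transform(test))
    return scores
```

With a running cluster, a call such as train_and_score(hdfs_uri("namenode", 9000, "/data/diabetes.csv")) would return a dictionary of accuracies keyed by model name. The host, port, and file path here are placeholders, not values from the poster.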

Keywords

Allen E. Paulson College of Engineering and Computing Student Research Symposium, Hadoop Distributed File System, HDFS

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Presentation Type and Release Option

Presentation (File Not Available for Download)

Start Date

January 1, 2022, 12:00 AM

Early-Stage Diabetes Prediction using Apache Spark and MLlib

  • The rapid growth in the volume of data being generated has produced larger and more complex datasets compiled from multiple sources.
  • Conventional data processing technology cannot manage the size and complexity of these datasets. This has driven the need for big data processing tools that can handle such workloads more efficiently.