Early-Stage Diabetes Prediction using Apache Spark and MLlib
Faculty Mentor
Dr. Hayden Wimmer
Location
Poster 210
Session Format
Poster Presentation
Academic Unit
Department of Information Technology
Background
- We develop a diabetes prediction system on Apache Spark.
- We use the Hadoop Distributed File System (HDFS) to store our dataset and load it into Spark. We write our Spark commands in Python using PySpark.
- We use the ‘Early-stage diabetes risk prediction dataset’ retrieved from the UCI Machine Learning Repository.
- To develop our prediction models, we use four machine learning algorithms: Decision Trees, Random Forest, Gradient Boosted Trees, and Naïve Bayes.
Keywords
Allen E. Paulson College of Engineering and Computing Student Research Symposium, Hadoop Distributed File System, HDFS
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Presentation Type and Release Option
Presentation (File Not Available for Download)
Start Date
January 2022
Early-Stage Diabetes Prediction using Apache Spark and MLlib
- The rapid growth in the volume of data being generated has led to larger and more complex datasets compiled from multiple sources.
- Conventional data processing technology cannot manage the size and complexity of these datasets. This has driven the need for big data processing tools that can handle such workloads more efficiently.