Principle Component Analysis for Feature Reduction and Data Preprocessing in Data Science

Document Type

Contribution to Book

Publication Date


Publication Title

Proceedings of the Conference on Information Systems Applied Research




Medical datasets are large and complex. Because medical data contain so many variables, machine learning algorithms may fail to induce patterns from the data, or may overfit the learned model to the data, thereby reducing the model's generalizability. Feature reduction seeks to limit the number of input variables by identifying correlations among them and reducing the feature set to the minimum number of variables needed to describe the data. This research examines the effects of principal component analysis (PCA) for feature reduction when applied to decision trees. Results indicate that PCA may be employed to reduce the number of features; however, the results suffer minor degradation.
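
As a rough illustration of the comparison described above, the following sketch trains a decision tree on a full feature set and on a PCA-reduced one. It is not the authors' experimental setup. The scikit-learn library, its bundled breast cancer dataset, the 95% explained-variance threshold, and 5-fold cross-validation are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: a small public medical dataset stands in for the paper's data.
X, y = load_breast_cancer(return_X_y=True)

# Baseline: decision tree trained on every original feature.
baseline = DecisionTreeClassifier(random_state=0)

# Reduced: standardize, keep enough principal components to explain
# 95% of the variance (an assumed threshold), then fit the same tree.
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    DecisionTreeClassifier(random_state=0),
)

# Mean 5-fold cross-validated accuracy for each model.
baseline_acc = cross_val_score(baseline, X, y, cv=5).mean()
reduced_acc = cross_val_score(reduced, X, y, cv=5).mean()

# Fit once on all data only to read off how many components PCA kept.
reduced.fit(X, y)
n_kept = reduced.named_steps["pca"].n_components_

print(f"original features: {X.shape[1]}")
print(f"components kept:   {n_kept}")
print(f"accuracy (all features): {baseline_acc:.3f}")
print(f"accuracy (PCA-reduced):  {reduced_acc:.3f}")
```

Placing the scaler and PCA inside the pipeline means both are refit on each training fold, so the held-out fold does not influence the projection. Whether accuracy falls, and by how much, depends on the dataset and the variance threshold.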