Learning Big Data on Spark for the Optimal IDW-Based Spatiotemporal Interpolation

Document Type

Conference Proceeding

Publication Date


Publication Title

Proceedings of the Association of American Geographers Annual Meeting


To better assess the relationships between environmental exposures and health outcomes, an appropriate spatiotemporal interpolation is critical. Air pollution data is usually collected at a limited number of monitoring locations and at non-continuous points in time. Traditional spatiotemporal methods treat space and time separately when interpolating pollution data in the continuous space-time domain, and the results of such interpolation may be far from satisfactory. Li et al. (2004) proposed the extension approach, which incorporates the spatial and temporal dimensions simultaneously by treating time as another dimension in space. Unfortunately, existing work on spatiotemporal interpolation has used simplistic methods to scale the range of the time dimension. In addition, because the data sets are large, experiments are usually very expensive in running time. Building on recent work by Li et al. (2014), we develop an IDW (Inverse Distance Weighting)-based spatiotemporal interpolation, employ the efficient k-d tree structure to store data, and combine the extension approach with machine learning methods, such as k-fold cross validation and bootstrap aggregating, to learn optimal parameters. Furthermore, we implement our method on Apache Spark, a fast cluster computing framework for big data processing. Our experimental results demonstrate the computational power and improved performance of our method, which significantly outperforms the previous work in terms of speed and accuracy.
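The core idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the time scaling factor c, the exponent p, the neighbor count, and the RMSE error measure are all assumptions made for illustration. For brevity it uses a brute-force neighbor search where the abstract describes a k-d tree, and it omits bootstrap aggregating and Spark.

```python
import math
import random


def st_distance(a, b, c):
    """Euclidean distance in (x, y, c*t) space; c (assumed) scales time into spatial units."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (c * (a[2] - b[2])) ** 2)


def idw_predict(samples, query, c, p=2.0, n_neighbors=4):
    """Inverse-distance-weighted estimate at query = (x, y, t).

    samples is a list of (x, y, t, value) tuples.
    """
    # Brute-force search for the nearest samples; a k-d tree would replace this sort.
    nearest = sorted((st_distance(s, query, c), s[3]) for s in samples)[:n_neighbors]
    if nearest[0][0] == 0.0:
        return nearest[0][1]  # query coincides with a measured point
    weights = [1.0 / d ** p for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def cv_rmse(samples, c, p, n_neighbors, k=5, seed=0):
    """k-fold cross-validation RMSE for one parameter combination."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held]
        for i in fold:
            pred = idw_predict(train, samples[i][:3], c, p, n_neighbors)
            sq_err += (pred - samples[i][3]) ** 2
    return math.sqrt(sq_err / len(samples))


def grid_search(samples, c_values, p_values, n_values, k=5):
    """Return the (c, p, n_neighbors) combination with the lowest cross-validation RMSE."""
    candidates = [(c, p, n) for c in c_values for p in p_values for n in n_values]
    return min(candidates, key=lambda g: cv_rmse(samples, *g, k=k))
```

Each parameter combination is scored independently of the others, so in a cluster setting the candidate grid is a natural unit to distribute across workers; whether the authors partition the work this way is not stated in the abstract.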