Learning Big Data on Spark for the Optimal IDW-Based Spatiotemporal Interpolation
Assessing the relationships between environmental exposures and health outcomes requires appropriate spatiotemporal interpolation, because air pollution data are usually collected at a limited number of monitoring locations and in a non-continuous manner. Traditional spatiotemporal methods treat space and time separately when interpolating pollution data over the continuous space-time domain, and the resulting interpolations are often far from satisfactory. Li et al. (2004) proposed the extension approach, which incorporates the spatial and temporal dimensions simultaneously by treating time as an additional dimension in space. Unfortunately, later work on spatiotemporal interpolation has relied on simplistic methods to scale the range of the time dimension, and, because the data sets are large, experiments are usually very expensive in running time. Building on recent work by Li et al. (2014), we develop an IDW (Inverse Distance Weighting)-based spatiotemporal interpolation that stores data in an efficient k-d tree and combines the extension approach with machine learning techniques, such as k-fold cross validation and bootstrap aggregating, to learn optimal parameters. Furthermore, we implement our method on Apache Spark, a fast cluster computing framework at the forefront of big data processing tools. Our experimental results demonstrate the computational power of our method, which significantly outperforms previous work in both speed and accuracy.
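To illustrate the core idea, the sketch below shows IDW interpolation with time treated as a third spatial dimension, scaled by a factor `c` (the extension approach). The function name `st_idw` and the parameter names `c`, `p`, and `k` are illustrative assumptions, not the authors' actual implementation; the paper learns such parameters via cross validation and uses a k-d tree (and Spark) in place of the linear nearest-neighbor scan shown here.

```python
import math

def st_idw(query, samples, values, c=1.0, p=2.0, k=4):
    """IDW interpolation in (x, y, t) space, with the time axis
    scaled by c so space and time share one distance metric.
    query: (x, y, t); samples: list of (x, y, t); values: list of floats.
    c (time scaling) and p (distance power) are the kind of parameters
    one would tune, e.g. by k-fold cross validation."""
    def dist(a, b):
        # Euclidean distance with time rescaled by c (extension approach)
        return math.sqrt((a[0] - b[0]) ** 2
                         + (a[1] - b[1]) ** 2
                         + (c * (a[2] - b[2])) ** 2)

    # k nearest neighbors; a k-d tree would replace this O(n log n) scan
    nbrs = sorted(zip(samples, values),
                  key=lambda sv: dist(query, sv[0]))[:k]

    num = den = 0.0
    for s, v in nbrs:
        d = dist(query, s)
        if d == 0.0:           # query coincides with a sample point
            return v
        w = d ** -p            # inverse-distance weight
        num += w * v
        den += w
    return num / den
```

A query at a known sample returns that sample's value exactly, and a query equidistant from all neighbors returns their mean, as expected for IDW.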
Tong, Weitian, Xiaolu Zhou, Lixin Li, Gina Besenyi, and Heather Yates. "Learning Big Data on Spark for the Optimal IDW-Based Spatiotemporal Interpolation." American Association of Geographers Annual Meeting (AAG). Computer Science Faculty Presentations.