Application of machine learning to fill in the missing monitoring data of air quality



Vietnam Journal of Science and Technology 56 (2C) (2018) 104-110

Mac Duy Hung (1, 2, *), Nghiem Trung Dung (1), Hoang Xuan Co (3)

(1) Hanoi University of Science and Technology, 1 Dai Co Viet road, Ha Noi, Viet Nam
(2) Thai Nguyen University of Technology, 3/2 road, Tich Luong ward, Thai Nguyen, Viet Nam
(3) VNU University of Science, 334 Nguyen Trai road, Ha Noi, Viet Nam
(*) Email: macdh@tnut.edu.vn

Received: 10 May 2018; Accepted for publication: 21 August 2018

ABSTRACT

In this paper, three machine learning models, namely Autoregressive Moving Average (ARMA), Artificial Neural Network (ANN) and Support Vector Regression (SVR), were applied to predict and fill in missing air quality monitoring data for the Gia Lam station (Hanoi) and the Nha Trang station (Khanh Hoa). Two air pollutants, NO2 and PM10, were selected for this study. The experimental results showed that all three models perform better than some traditional approaches, including Multiple Linear Regression (LR) and Spline interpolation. In addition, ARMA, ANN and SVR can capture the fluctuation of the concentrations of the selected pollutants. These results indicate that machine learning is a feasible approach to dealing with missing data, which is one of the biggest problems of air quality monitoring stations in Viet Nam.

Keywords: air quality, ANN, ARMA, SVR, missing data.

1. INTRODUCTION

Monitoring and modeling of air quality are of great significance for understanding the trends and characteristics of air pollutants. Understanding and simulating the fluctuation of an air pollutant requires an air quality dataset that is not only reliable and long enough in time but also complete, with no gaps in the time series of observations.
However, the continuity of time-series measurements is commonly interrupted by factors such as equipment malfunction, power cuts and irregular maintenance, which leave gaps in the data (missing data). Many statistical approaches, such as linear or logistic regression, polynomial or spline interpolation/extrapolation [1, 2, 3] and the Kalman filter [4], have been proposed to deal with this problem. However, none of them is effective when the number of gaps is large. In recent studies, machine learning approaches have been successfully applied to predict the concentrations of air pollutants [2, 5, 6, 7, 8, 9, 10, 11]. This study, therefore, aimed to apply machine learning to fill in the gaps in air quality monitoring data in Viet Nam, focusing on the autoregressive moving average (ARMA), artificial neural network (ANN) and support vector regression (SVR) methods.

2. METHODOLOGY

2.1. Autoregressive moving average

Autoregressive moving average (ARMA) is a statistical model for time series analysis which combines the autoregressive (AR) and moving average (MA) methods. An ARMA model of a time series $x_t$ can be defined by the following equations [12].

AR component:

$$x_t = \alpha_1 x_{t-1} + \dots + \alpha_p x_{t-p} + z_t \quad (1)$$

MA component:

$$x_t = \beta_0 z_t + \beta_1 z_{t-1} + \dots + \beta_q z_{t-q} \quad (2)$$

ARMA model:

$$x_t = \alpha_1 x_{t-1} + \dots + \alpha_p x_{t-p} + z_t + \beta_1 z_{t-1} + \dots + \beta_q z_{t-q} \quad (3)$$

where $\alpha_1, \dots, \alpha_p$ and $\beta_0, \dots, \beta_q$ are the corresponding coefficients (with $\beta_0 = 1$ in equation (3)) and $z_t$ is a white-noise term.

2.2. Artificial neural network

An artificial neural network (ANN) is a mathematical model inspired by the biological neural system. It consists of three or more layers of neurons and is intended to simulate learning and pattern recognition [13]. An example of a typical ANN structure with a single hidden layer is shown in Figure 1.

Figure 1. The structure of a typical artificial neural network (with three layers).
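As a rough illustration of the three-layer structure in Figure 1, the following Python sketch passes an input vector through one hidden layer and a single output neuron. Each neuron applies a transfer function to a weighted sum of its inputs plus a bias. The function names, the toy weights and the linear output neuron are assumptions made for illustration; they are not taken from the study.

```python
import math


def sigmoid(z):
    """Sigmoidal transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights, biases, transfer):
    """Outputs of all neurons in one layer.

    weights[i][j] is the weight from input j to neuron i and
    biases[i] is the threshold (bias) of neuron i.
    """
    outputs = []
    for w_row, bias in zip(weights, biases):
        potential = bias + sum(w * x for w, x in zip(w_row, inputs))
        outputs.append(transfer(potential))
    return outputs


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input layer -> sigmoidal hidden layer -> one output."""
    hidden = layer_output(inputs, hidden_w, hidden_b, sigmoid)
    # Assumption: the output neuron is linear, a common choice for regression.
    return layer_output(hidden, out_w, out_b, lambda z: z)[0]


# Toy example with two inputs and two hidden neurons (weights are arbitrary).
prediction = forward(
    inputs=[0.0, 0.0],
    hidden_w=[[1.0, 1.0], [1.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[1.0, 1.0]],
    out_b=[0.0],
)
```

In practice the weights and biases would be learned from training data rather than set by hand.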
The output of the i-th neuron ($x_i$) is determined by equations (4) and (5):

$$x_i = f(\xi_i) \quad (4)$$

$$\xi_i = \varepsilon_i + \sum_{j \in \Gamma_i^{-1}} W_{ij} x_j \quad (5)$$

where $\xi_i$ is the potential of the i-th neuron; $f(\xi_i)$ is the transfer function; $W_{ij}$ is the weight of the connection from neuron j to neuron i; and the threshold $\varepsilon_i$ can be regarded as the weight of a connection to a formally added neuron (the so-called bias). The sum in equation (5) runs over all neurons j that transfer their signals ($x_j$) to neuron i. In this study, a multilayer feedforward neural network (FFNN) with a sigmoidal transfer function was used.

2.3. Support vector regression

Support vector machine (SVM) was proposed by V. N. Vapnik [14] to deal with data classification problems. SVM constructs a hyperplane in a high-dimensional (possibly infinite-dimensional) feature space, which can be used for classification and regression. Support vector regression (SVR) performs linear regression in that feature space using the SVM technique. For a given set of data x, the linear regression function can be written as $F(x) = w^T \Phi(x_i) + b$ in the feature space, where w is the coefficient vector, b is a threshold and $\Phi(x_i)$ is a nonlinear function which maps the input x to a vector in the feature space. In order to estimate a function F within a finite accuracy, the estimate $\hat{F}$ of F needs to satisfy the condition $|\hat{F} - F| \le \varepsilon$, where $\varepsilon$ is the maximum allowable deviation during training. The reliability of the predicted values is measured by a loss function called the ε-insensitive loss function [5, 15]. The SVR function for nonlinear prediction becomes equation (6):

$$\hat{y} = \hat{F}(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x_i, x) + b \quad (6)$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers and $K(x_i, x)$ is the kernel function. Any function that meets Mercer's condition [14] can be used as the kernel function. For this study, several preliminary tests were conducted to select the kernel function from among the linear, RBF and polynomial kernels.
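The three candidate kernels and the prediction rule of equation (6) can be sketched in Python as follows. This is only an illustration: the parameter names and default values (gamma, degree, coef0) are assumptions, as the settings used in the preliminary tests are not given, and the dual coefficients passed to the hypothetical svr_predict function stand for the differences between the paired Lagrange multipliers that a trained SVR would supply.

```python
import math


def linear_kernel(a, b):
    """K(a, b) = a . b"""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=0.1):
    """K(a, b) = exp(-gamma * ||a - b||^2); gamma is an assumed setting."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """K(a, b) = (a . b + coef0)^degree; degree and coef0 are assumed."""
    return (linear_kernel(a, b) + coef0) ** degree


def svr_predict(x, support_vectors, dual_coefs, b, kernel):
    """Equation (6): sum of dual coefficient times K(x_i, x), plus b.

    dual_coefs[i] holds the difference of the two Lagrange multipliers
    for support vector x_i.
    """
    total = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs))
    return total + b
```

With the linear kernel, svr_predict reduces to an ordinary linear function of the input, since the weighted sum of dot products can be collapsed into a single coefficient vector.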
The results indicated that the linear kernel performed best; therefore, it was chosen for this study.

2.4. Datasets

Data used for this study were extracted from the databases of the air quality monitoring stations in Gia Lam (Hanoi, from 2014 to 2016) and Nha Trang (Khanh Hoa, from 2013 to 2015). This study focused on two primary air quality parameters, namely NO2 and PM10. The datasets for training and testing the models were extracted from these databases for periods in which no data are missing. The input data take the following form:

$$X = (X_{t-24}, X_{t-23}, \dots, X_{t-2}, X_{t-1}, t)$$

where $X_{t-24}, \dots, X_{t-1}$ are the concentrations of the pollutant whose gaps are to be filled and t is the time of the gap. That is, to predict a missing value of a pollutant, the models need 25 input values: the 24 concentrations of this pollutant in the 24 previous hours, and t, which is used as the variable representing the concentration trend of this pollutant.

2.5. Evaluation of performance

The performance of the selected models was evaluated using statistical indicators, namely the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (r).

3. RESULTS AND DISCUSSIONS

In this study, the testing datasets are complete segments of the data sources that are treated as missing. The positions of these segments were chosen at random. In addition, the value predicted at time t is fed back to the model input as $X_{t-1}$ in order to predict the missing concentration at the next time t+1. The gap-filling performance of the selected models on the testing datasets is presented below.

3.1. Filling in the data of NO2

The results presented in Table 1 show that, in almost all experiments, the traditional approaches tested in this study, LR and Spline interpolation, are not reliable for segments with a large number of missing values. This is because these models try to build a function that fits the trend of the studied pollutant.
However, the fluctuation of the concentration of a pollutant in the air is a nonlinear function influenced by many factors (e.g., time, precursors and meteorological conditions); therefore, LR and Spline interpolation cannot build a suitable fitting function for it. In contrast, most machine learning models, including ARMA, ANN and SVR, do not need to build such a fitting function. The values predicted by these models are calculated from historical data in the learning process; therefore, they can predict more accurately. The ARMA, ANN and SVR models used in this study predicted the data of Gia Lam and Nha Trang stations well. Their performance is much better than that of the LR and Spline models, not only in terms of the statistical indicators but also in terms of the fluctuation trend of the pollutant, as presented in Figure 2. It can be seen from Figure 2 that the three models, ARMA, ANN and SVR, followed the fluctuation of the measured NO2 concentration well. This is very important for the prediction of a parameter through a time series.

Table 1. The performance of the selected models on the testing datasets for filling in the data of NO2 with more than 100 missing points at Gia Lam and Nha Trang stations.
                                      Gia Lam station (GL)                    Nha Trang station (NT)
Year                  Indicator           LR  Spline    ARMA     ANN     SVR      LR  Spline    ARMA     ANN     SVR
2014 (GL), 2013 (NT)  RMSE (µg/m3)     41.49   43.27   16.72   17.52   22.30   11.68   12.45    8.59    8.82    9.90
                      MAE (µg/m3)      33.80   35.66   13.15   14.56   18.17    9.36    9.95    6.53    6.93    7.81
                      r                 0.07   -0.06    0.75    0.73    0.48    0.14    0.13    0.68    0.58    0.52
2014 (GL), 2013 (NT)  RMSE (µg/m3)     32.35   35.40   16.15   15.32   26.86   13.52   14.77    9.10   11.52   13.08
                      MAE (µg/m3)      26.97   30.35   12.06   12.59   21.32   10.54   11.81    7.16    9.10   10.25
                      r                -0.04   -0.03    0.74    0.74    0.30   -0.16   -0.15    0.63    0.46    0.24
2014 (GL), 2013 (NT)  RMSE (µg/m3)     15.91   19.85    6.98   12.37    8.61   10.33   11.87    8.72    8.84   10.02
                      MAE (µg/m3)      13.42   17.32    5.95   10.35    7.00    8.06    9.22    7.04    7.41    7.92
                      r                -0.06   -0.19    0.80    0.75    0.70    0.14    0.17    0.58    0.44    0.35

In addition, the performance of the selected models for Nha Trang station is worse than that for Gia Lam station. This is because the proportion of missing data points at Nha Trang station is large (the rates of missing points in 2013, 2014 and 2015 are 30 %, 37 % and 13 %, respectively); therefore, the models had less information to learn from.

3.2. Filling in the data of PM10

PM10 is a typical air quality parameter. It can be emitted directly into the air from local sources and/or can come from remote sources through long-range transport. It is also formed in the air as a secondary pollutant. Besides being generated, it can be removed from the air by wet and dry deposition. Its level therefore depends on many factors, including emission sources, meteorological conditions, topography and the concentrations of precursors such as NO2 and SO2. The fluctuation of its concentration in the air is, therefore, very complex. In addition, the number of missing PM10 values at the two stations is large. That is why the models cannot fully predict the fluctuation trend of PM10.
Thus, as can be seen from Table 2 and Figure 3, the performance of these models for PM10 is worse than that for NO2, not only in terms of the statistical indicators but also in terms of the ability to capture the trend of the pollutant.

Figure 2. The comparison of measured and predicted values of NO2 for the selected models at (a) Gia Lam station and (b) Nha Trang station.

Table 2. The performance of the selected models on the testing datasets for filling in the data of PM10 with more than 100 missing points at Gia Lam and Nha Trang stations.

                                      Gia Lam station (GL)                    Nha Trang station (NT)
Year                  Indicator           LR  Spline    ARMA     ANN     SVR      LR  Spline    ARMA     ANN     SVR
2014 (GL), 2013 (NT)  RMSE (µg/m3)     49.21   55.95   32.57   43.94   64.68   16.83   17.10   12.69   15.43   29.14
                      MAE (µg/m3)      36.84   42.30   24.43   37.63   46.65   14.25   14.43   10.11   12.88   22.96
                      r                 0.63    0.60    0.85    0.84   -0.58    0.59    0.57    0.68    0.57   -0.24
2014 (GL), 2013 (NT)  RMSE (µg/m3)      7.50    8.03    4.35    8.14   16.05   11.88   12.63    8.01    9.83   19.95
                      MAE (µg/m3)       6.30    6.67    3.68    7.16   14.06    9.64   10.66    6.10    7.45   17.26
                      r                -0.18   -0.17    0.89    0.78    0.37   -0.14   -0.02    0.69    0.35    0.26
2014 (GL), 2013 (NT)  RMSE (µg/m3)     17.60   17.21   18.34   25.17   21.55   14.85   15.00   14.26   14.95   16.15
                      MAE (µg/m3)      13.98   13.98   11.56   20.39   12.61   11.99   12.30    9.32    9.52   10.44
                      r                 0.63    0.63    0.25    0.54   -0.31   -0.22   -0.24    0.55    0.43   -0.01

Furthermore, the results also indicated that the performance of SVR is the worst among the three models. This is contrary to what was reported by previous studies [6, 7, 8], in which SVM/SVR was better than ANN and ARMA in the prediction of air quality. This might be explained by the quality of the data, the selection of input variables and the different approach to prediction.
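The input construction of Section 2.4, the feedback scheme described at the start of Section 3 and the indicators of Section 2.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the hour index is used as t, and mean_of_window is a hypothetical stand-in for a trained ARMA, ANN or SVR model, not one of the models used in the study.

```python
import math

WINDOW = 24  # number of previous hourly values used as inputs


def build_input(history, t):
    """Return (X_{t-24}, ..., X_{t-1}, t) from the values known before t."""
    if len(history) < WINDOW:
        raise ValueError("at least 24 previous values are required")
    return list(history[-WINDOW:]) + [t]


def fill_gap(known, gap_length, predict):
    """Fill gap_length consecutive missing hours that follow known.

    Each predicted value is appended to the history, so it is fed back
    as X_{t-1} when the next missing hour is predicted.
    """
    history = list(known)
    filled = []
    for _ in range(gap_length):
        t = len(history)  # assumption: the hour index serves as t
        value = predict(build_input(history, t))
        filled.append(value)
        history.append(value)
    return filled


def mean_of_window(x):
    """Hypothetical placeholder model: mean of the 24 lagged values."""
    return sum(x[:WINDOW]) / WINDOW


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def pearson_r(observed, predicted):
    """Correlation coefficient r (undefined if either series is constant)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    dev_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (dev_o * dev_p)
```

To mimic the testing procedure, a complete segment would be hidden, refilled with fill_gap and then compared with the hidden measurements using rmse, mae and pearson_r. Because predictions are fed back as inputs, errors can accumulate over long gaps.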
As presented above, this study used 25 input variables (24 variables for the fluctuation trend of the studied pollutant in the 24 previous hours, and the remaining one used to represent the activity of the emission source), while other forecasting studies used meteorological variables [7, 8] and precursors related to the pollutant to be predicted [6, 7, 8]. Nevertheless, these results indicate that the performance of machine learning models is better than that of traditional approaches, which is consistent with the results of our previous studies [9, 10].

Figure 3. The comparison of measured and predicted values of PM10 for the selected models at (a) Gia Lam station and (b) Nha Trang station.

4. CONCLUSIONS

Three machine learning models were used to predict the missing values of air quality monitoring data for the Gia Lam (Hanoi) and Nha Trang (Khanh Hoa) stations. Extensive experimental results indicated that the three studied machine learning models, namely ARMA, ANN and SVR, are more effective than traditional approaches such as LR and Spline interpolation. It was found that the quality of the dataset, in terms of missing data points, significantly influences the performance of the selected models. Among the three studied models, ARMA is the best at filling in the missing air quality monitoring data. However, it is hard to say which model is better in general, because the appropriate model depends on the properties of the data and the objectives of the analysis, and this choice influences the performance achieved. A surprising point is that the performance of the SVR model in this study is worse than that of the ANN and ARMA models. This differs from what was reported by several previous studies, which suggests the need for further study. Nevertheless, for the prediction of the fluctuation trend of pollutant concentrations, the studied SVR model is better than the traditional approaches, including LR and Spline interpolation.
This study suggests that machine learning approaches, including ARMA, ANN and SVR, are promising methods for filling in the missing values of air quality monitoring data.

Acknowledgements. The authors would like to acknowledge the Center for Environmental Monitoring (CEM), Viet Nam Environment Administration, for providing the data of the air quality monitoring stations for this study.

REFERENCES

1. Koutsoyiannis D. and Langousis A. - Precipitation, Treatise on Water Science, ed. Wilderer P. and Uhlenbrook S., Academic Press, Oxford, 2011.
2. Şahin Ü. A., Bayat C., and Uçan O. N. - Application of cellular neural network (CNN) to the prediction of missing air pollutant data, Atmospheric Research 101 (2011) 314-326.
3. B. H., Raleigh M. S., Fisher A., and Lundquist J. D. - A comparison of methods for filling gaps in hourly near-surface air temperature data, J. Hydrometeorol. 14 (3) (2013) 929-945.
4. Alavi N., Warland J. S., and Berg A. A. - Filling gaps in evapotranspiration measurements for water budget studies: Evaluation of a Kalman filtering approach, Agric. For. Meteorol. 141 (1) (2006) 57-66.
5. Lin K. P., Pai P. F., and Yang S. L. - Forecasting concentrations of air pollutants by logarithm support vector regression with immune algorithms, Applied Mathematics and Computation 217 (2011) 5318-5327.
6. Sánchez A. S., Nieto P. J. G., Fernández P. R., Díaz J. J. d. C., and Iglesias-Rodríguez F. J. - Application of an SVM-based regression model to the air quality study at local scale in the Avilés urban area (Spain), Mathematical and Computer Modelling 54 (2011) 1453-1466.
7. Lu W.-Z. and Wang W.-J. - Potential assessment of the "support vector machine" method in forecasting ambient air pollutant trends, Chemosphere 59 (2005) 693-701.
8. Luna A. S., Paredes M. L. L., Oliveira G. C. G., and Correa S. M. - Prediction of ozone concentration in tropospheric levels using artificial neural networks and support vector machine at Rio de Janeiro, Brazil, Atmospheric Environment 98 (2014) 98-104.
9. Mac Duy Hung, Nghiem Trung Dung, and Dinh Thu Hang - Application of artificial neural network to fill in the missing monitoring data of air quality, Vietnam Journal of Science and Technology (VAST) 53 (3A) (2015) 199-204.
10. Mac Duy Hung and Nghiem Trung Dung - Application of Echo State Network for the forecast of air quality, Vietnam Journal of Science and Technology (VAST) 54 (1) (2016) 54-63.
11. Mac Duy Hung, Nghiem Trung Dung, and Hoang Xuan Co - Application of Multilayer Perceptron Neural Network for the forecast of tropospheric ozone in Hanoi, Journal of Science and Technology of Technical Universities 111 (2016) 46-51.
12. Neusser K. - Autoregressive moving average models, Time series econometrics, Springer International Publishing, Switzerland, 2016.
13. Ooba M., Hirano T., Mogami J. I., Hirata R., and Fujinuma Y. - Comparisons of gap-filling methods for carbon flux dataset: A combination of a genetic algorithm and artificial neural network, Ecological Modelling 198 (2006) 473-486.
14. Vapnik V. N. - An overview of statistical learning theory, IEEE Transactions on Neural Networks 10 (5) (1999) 988-999.
15. Pai P. F. and Hong W. C. - A recurrent support vector regression model in rainfall forecasting, Hydrological Processes 21 (2007) 819-827.
