Three machine learning models were used to predict the missing values of monitoring data of
air quality for Gia Lam, Hanoi and Nha Trang, Khanh Hoa stations. Extensive experimental results
indicated that the effectiveness of the three studied machine learning models, namely ARMA,
ANN and SVR is better than that of traditional approaches such as LR and Spline interpolation. It
is found that the quality of dataset in terms of missing data points significantly influences on the
performance of the selected models. Among the three studied models, ARMA is the best in terms
of filling in the missing monitoring data of air quality. However, it is hard to say which model is
better, because the selection of the appropriate model which is based on data properties and the
objectives of the analysis, influences to the performance of models. A strange point is that the
performance of SVR model in this study is worse than that of ANN and ARMA models. This is
different from what reported by several previous studies supposing the need for further studies.
Nevertheless, for the prediction of the fluctuation trend of pollutant concentrations, the studied
SVR model is better the traditional approaches including LR and Spline interpolation. This study
suggested that the machine learning approaches including ARMA, ANN and SVR are potential
methods for filling-in the missing values of air quality monitoring data
7 trang |
Chia sẻ: honghp95 | Lượt xem: 639 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Application of machine learning to fill in the missing monitoring data of air quality, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Vietnam Journal of Science and Technology 56 (2C) (2018) 104-110
APPLICATION OF MACHINE LEARNING TO FILL IN THE
MISSING MONITORING DATA OF AIR QUALITY
Mac Duy Hung1, 2, *, Nghiem Trung Dung1, Hoang Xuan Co3
1Hanoi University of Science and Technology, 1 Dai Co Viet road, Ha Noi, Viet Nam
2Thai Nguyen University of Technology, 3/2 road, Tich Luong ward, Thai Nguyen, Viet Nam
3VNU University of Science, 334 Nguyen Trai road, Ha Noi, Viet Nam
*Email: macdh@tnut.edu.vn
Received: 10 May 2018; Accepted for publication: 21 August 2018
ABSTRACT
In this paper, three machine learning models have been applied to predict and fill in the
missing monitoring data of air quality for Gia Lam and Nha Trang stations in Hanoi and Khanh
Hoa respectively, including Autoregressive Moving Average (ARMA), Artificial Neural
Network (ANN), and Support Vector Regression (SVR). Two air pollutants being NO2 and PM10
were selected for this study. The experimental results showed that the performance of all three
studied models is better than that of some traditional approaches, including Multiple Linear
Regression (LR) and Spline interpolation. Besides that, ARMA, ANN and SVR can capture the
fluctuation of concentrations of the selected pollutants. These results indicated that the machine
learning is a feasible approach to deal with the missing of data which is one of the biggest
problems of air quality monitoring stations in Viet Nam.
Keywords: air quality, ANN, ARMA, SVR, missing data.
1. INTRODUCTION
Monitoring and modeling of air quality is of ultimate significance for understanding the
trend and characteristics of air pollutants. For understanding and simulating the fluctuation of an
air pollutant, it is required to have the dataset of air quality which is not only long enough in
time and reliable but also time-serially completion of observations. However, the continuity of
time-series measurements is normally plagued with different factors including malfunction of
the equipment, power cut off, not regularly maintained, etc., resulting in the gap of data points or
missing data. Many statistical approaches such as linear or logistic regression, polynomial or
spline interpolation/extrapolation [1, 2, 3], Kalman filter approach [4] and so on have been
proposed to deal with this problem. However, none of them is effective when the number of
gaps is large. In recent studies, the machine learning approaches have been successfully applied
to predict values of concentrations of air pollutants [2, 5, 6, 7, 8, 9, 10, 11]. This study, therefore,
aimed at the application of machine learning to fill in the gaps of air quality monitoring data in
Mac Duy Hung, Nghiem Trung Dung, Hoang Xuan Co
Viet Nam focusing on autoregressive moving average (ARMA), artificial neural network (ANN)
and support vector regression methods.
2. METHODOLOGY
2.1. Autoregressive moving average
Autoregressive moving average (ARMA) is a statistical model of time series analysis
which combines autoregressive analysis (AR) and moving average (MA) methods. An ARMA
model of xt time series data can be defined by following equations [12].
AR component: MA component:
1 1 ...t t p t p tx x x zα α− −= + + + (1); 0 1 1 ...t t t q t qx z x xβ β β− −= + + + (2)
And ARMA model:
1 1 1 1... ...t t p t p t t q t qx x x z x xα α β β− − − −= + + + + + + (3)
where, α1, , αp and β1, , βp are the corresponding coefficients.
2.2. Artificial neural network
Artificial neural network (ANN), a mathematical model, is built based on a biological neural
system that consists of three or more layers which are formed by neurons intended to simulate the
learning and pattern recognition [13]. An example of typical NN structure is shown in Figure 1,
which only has one hidden layer.
Figure 1. The structure of a typical artificial neuron network (with three layers).
The output of i-th neuron (xi) is determined by the following equations (4) and (5):
( )ξ=i ix f (4)
1
.ξ ε
−⊆Γ
= +∑
i
i ij j i
j
W x
(5)
where, ξi is the potential of i-th neuron; f(ξi) is called transfer function; the threshold εi is a
weight coefficient of the connection with formally added neuron j (so called bias).
The equation (5) is carried out over all neuron j (xj) transferring the signals to i (xi) neuron.
In this study, multilayer feedforward neural network (FFNN) was used with the transfer
function being sigmoidal.
2.3. Support vector regression
Application of machine learning to fill in the missing monitoring data of air quality
Support vector machine (SVM) was proposed by V. N. Vapnik [14] to deal with the
problems of data classification. SVM creates a hyper plane in boundless dimensional (feature)
space, which is used for classification and regression. Support vector regression (SVR) is a
linear regression based on the SVM technique. A linear regression function of a given set of data
x can be wrote in form ( ) ( )= Φ +T iF x w x b in a feature space , where w is the coefficient
vector, b is a threshold and Ф(xi) is a nonlinear function which maps the input x to a vector in .
In order to estimate a function F within a finite accuracy, the estimation value ˆF of F needs to
satisfy the condition ˆ ε− ≤F F , where ε is the allowably maximum deviation during the
training state. The reliability of prediction values is measured by a loss function which is called
ε-intensitive cost lost function [5, 15]. The SVR function for nonlinear predictions becomes the
equation (6) below:
( ) ( ) ( )*
1
ˆ
ˆ ,α α
=
= = − +∑
N
i i i
i
y F x K x x b
(6)
where, α and α* are Lagrange parameters; K(xi, x) is known as a Kernel function. Any function
that meets Mercer’s condition [14] can be used as the Kernel function. For this study, several
preliminary tests were conducted to select the Kernel function including linear, RBF and
polynomial. The results indicated that the linear function is the best, therefore, it was chosen for
this study.
2.4. Datasets
Data used for this study were extracted from the databases of air quality monitoring stations
in Gia Lam (Hanoi, from 2014 to 2016) and Nha Trang (Khanh Hoa, from 2013 to 2015). This
study focused on two primary air quality parameters, namely NO2 and PM10. Datasets for
training and testing the models are extracted from these databases in periods in which data are
not missing. The input data are set as the following form:
( )24 23 2 1, ,..., , ,− − − −= t t t tX X X X X t
where, Xt-24, , Xt-1 are concentrations of a studied pollutant which is needed to fill in the gaps; t
is the time of gaps, i.e., to predict a missing value of a pollutant, these models need 25 values,
where 24 values are concentrations of this pollutant in 24 previous hours and t is used as the
variable of the concentration trend of this pollutant.
2.5. Evaluation of performance
The performance of selected models was evaluated based on statistical indicators including
root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (r).
3. RESULTS AND DISCUSSIONS
In this study, testing datasets are complete segments from data sources that are assumed to
be missing. The positions of these segments are set by random. In addition, the predicted value
in the time t is feedback to the input of models as Xt-1 to predict the missing concentration of next
time t+1. The performance of gap filling of the selected models on testing datasets are presented
below.
Mac Duy Hung, Nghiem Trung Dung, Hoang Xuan Co
3.1. Filling in the data of NO2
Obtained results presenting in Table 1 showed that, in almost all experiments, the
performance of traditional approaches tested in this study including LR and Spline interpolation
in segments with huge number of missing values is not reliable. It is because, these models try to
build a function that fit the trend of the studied pollutants. However, the fluctuation of the
concentrations of a pollutant in the air is a nonlinear function which is influenced by many
factors (e.g., time, precursors, meteorological conditions and so on), therefore, they cannot build
a fitting function for that. On the contrary, almost machine learning models including ARMA,
ANN and SVR do not need to build a fitting function. Values predicting by these models are
calculated from historical data in the learning process. Therefore, they can predict more
accuracy. The ARMA, ANN and SVR used in this study predicted well data of Gia Lam and
Nha Trang stations. The performances of these models are much better than those of LR and
Spline models, not only in terms of statistical indicators but also in terms of the fluctuation trend
of pollutant as presented in Figure 2. It can be seen from Figure 2 that, three models, ARMA,
ANN and SVM, adapted well with the fluctuation of real NO2 concentration. This is very
important for the prediction of a parameter through time series.
Table 1. The performance of selected models on testing datasets for filling in the data of NO2 with more
than 100 missing points in Gia Lam and Nha Trang stations.
Year Indicators
Gia Lam station (GL) Nha Trang station (NT)
LR Spline ARMA ANN SVR LR Spline ARMA ANN SVR
2014(GL)
2013(NT)
RMSE(µg/m3) 41.49 43.27 16.72 17.52 22.30 11.68 12.45 8.59 8.82 9.90
MAE(µg/m3) 33.80 35.66 13.15 14.56 18.17 9.36 9.95 6.53 6.93 7.81
r 0.07 -0.06 0.75 0.73 0.48 0.14 0.13 0.68 0.58 0.52
2014(GL)
2013(NT)
RMSE(µg/m3) 32.35 35.40 16.15 15.32 26.86 13.52 14.77 9.10 11.52 13.08
MAE(µg/m3) 26.97 30.35 12.06 12.59 21.32 10.54 11.81 7.16 9.10 10.25
r -0.04 -0.03 0.74 0.74 0.30 -0.16 -0.15 0.63 0.46 0.24
2014(GL)
2013(NT)
RMSE(µg/m3) 15.91 19.85 6.98 12.37 8.61 10.33 11.87 8.72 8.84 10.02
MAE(µg/m3) 13.42 17.32 5.95 10.35 7.00 8.06 9.22 7.04 7.41 7.92
r -0.06 -0.19 0.80 0.75 0.70 0.14 0.17 0.58 0.44 0.35
In addition, the performance of the selected models for Nha Trang station is worse than that
for Gia Lam station. It is because, the number of missing data points of Nha Trang station is
huge (the rates of missing points in 2013, 2014 and 2015 are 30 %, 37 % and 13 %,
respectively), therefore, the models had less information to learn.
3.2. Filling in the data of PM10
PM10 is a typical air quality parameter. It can be directly emitted into the air from local
sources and/or can come from remote sources by the long-range transport. It is also formed in
the air as a secondary pollutant. Besides its generation process, it can be removed from the air by
the wet and dry deposition. Its level is, hence, dependent on many factors including emissions
sources, meteorological conditions, topography, the concentrations of precursors such as NO2
Application of machine learning to fill in the missing monitoring data of air quality
and SO2, etc. The fluctuation of its concentration in the air is, therefore, very complex. In
addition, the number of missing values of PM10 in the two stations is huge. That is why, the
models cannot predict fully the fluctuation trend of PM10. Thus, as can be seen from Table 2 and
Figure 3, the performance of these models for PM10 is worse than those for NO2 not only in
terms of statistical indicators but also in terms of the ability to capture of the trend of pollutants.
(a) (b)
Figure 2. The comparison of measured and predicted values of NO2 for selected models
(a) Gia Lam station and (b) Nha Trang station.
Table 2. The performance of selected models on testing datasets for filling in the data of PM10 with more
than 100 missing points in Gia Lam and Nha Trang stations.
Year Indicators
Gia Lam station (GL) Nha Trang station (NT)
LR Spline ARMA ANN SVR LR Spline ARMA ANN SVR
2014(GL)
2013(NT)
RMSE(µg/m3) 49.21 55.95 32.57 43.94 64.68 16.83 17.10 12.69 15.43 29.14
MAE(µg/m3) 36.84 42.30 24.43 37.63 46.65 14.25 14.43 10.11 12.88 22.96
r 0.63 0.60 0.85 0.84 -0.58 0.59 0.57 0.68 0.57 -0.24
2014(GL)
2013(NT)
RMSE(µg/m3) 7.50 8.03 4.35 8.14 16.05 11.88 12.63 8.01 9.83 19.95
MAE(µg/m3) 6.30 6.67 3.68 7.16 14.06 9.64 10.66 6.10 7.45 17.26
r -0.18 -0.17 0.89 0.78 0.37 -0.14 -0.02 0.69 0.35 0.26
2014(GL)
2013(NT)
RMSE(µg/m3) 17.60 17.21 18.34 25.17 21.55 14.85 15.00 14.26 14.95 16.15
MAE(µg/m3) 13.98 13.98 11.56 20.39 12.61 11.99 12.30 9.32 9.52 10.44
r 0.63 0.63 0.25 0.54 -0.31 -0.22 -0.24 0.55 0.43 -0.01
Furthermore, the results also indicated that the performance of SVR is the worst among the
three models. It is contrary to what reported by previous studies [6, 7, 8] in which SVM/SVR is
better than ANN and ARRMA in the prediction of air quality. This might be explained by the
quality of data, the selection of inputs variable, and the different way of approach for prediction.
As presented above, this study used 25 input variables (24 variables for fluctuation trend of
Mac Duy Hung, Nghiem Trung Dung, Hoang Xuan Co
studied pollutants in 24 previous hours and the remaining one is used as the activity of emission
source), while other forecasting studies used meteorological variables [7, 8] and precursors
related to the pollutant to be predicted [6, 7, 8]. However, these results indicate that the
performance of machine learning models is better than that of traditional approaches, which is
consistent with the results of our previous studies [9, 10].
(a) (b)
Figure 3. The comparison of measured and predicted values of PM10 for selected models
(a) Gia Lam station and (b) Nha Trang station.
4. CONCLUSIONS
Three machine learning models were used to predict the missing values of monitoring data of
air quality for Gia Lam, Hanoi and Nha Trang, Khanh Hoa stations. Extensive experimental results
indicated that the effectiveness of the three studied machine learning models, namely ARMA,
ANN and SVR is better than that of traditional approaches such as LR and Spline interpolation. It
is found that the quality of dataset in terms of missing data points significantly influences on the
performance of the selected models. Among the three studied models, ARMA is the best in terms
of filling in the missing monitoring data of air quality. However, it is hard to say which model is
better, because the selection of the appropriate model which is based on data properties and the
objectives of the analysis, influences to the performance of models. A strange point is that the
performance of SVR model in this study is worse than that of ANN and ARMA models. This is
different from what reported by several previous studies supposing the need for further studies.
Nevertheless, for the prediction of the fluctuation trend of pollutant concentrations, the studied
SVR model is better the traditional approaches including LR and Spline interpolation. This study
suggested that the machine learning approaches including ARMA, ANN and SVR are potential
methods for filling-in the missing values of air quality monitoring data.
Acknowledgements. The authors would like to acknowledge the Center for Environmental Monitoring
(CEM), Viet Nam Environment Administration for providing with the data of air quality monitoring
stations for this study.
Application of machine learning to fill in the missing monitoring data of air quality
REFERENCES
1. Koutsoyianis D. and Langousis A. - Precipitation, Treaties on water science, ed. Wilderer P.
and Uhlenbrook S. Academic Press, Oxford, 2011.
2. Şahin Ü. A., Bayat C., and Uçanc O. N. - Application of cellular neural network (CNN) to
the prediction of missing air pollutant data, Atmospheric Research 101 (2011) 314-326.
3. B. H., Raleigh M. S., Fisher A., and Lundquist J. D. - A comparision of methods for filling
gaps in hourly near-suface air temperature data, J. Hydrometeorol 14 (3) (2013) 929-945.
4. Alavi N., Warland J. S., and Berg A. A. - Filling gaps in evapotranspiration mearurements
for water budget studies: Evaluation of a Kalman filtering approach, Agric. For. Meteorol.
141 (1) (2006) 57-66.
5. Lin K.P., Pai P.F., and Yang S.L. - Forecasting concentrations of air pollutants by logarithm
support vector regression with immune algorithms, Applied Mathematics and Computation
217 ( 2011) 5318-5327.
6. Sánchez A. S., Nieto P. J. G., Fernández P. R., Díaz J. J. d. C., and Iglesias-Rodríguez F. J. -
Application of an SVM-based regression model to the air quality study at local scale in the
Avilés urban area (Spain), Mathematical and Computer Modelling 54 (2011) 1453-1466.
7. Lu W.-Z. and Wang W.-J. - Potential assessment of the ‘‘support vector machine’’ method
in forecasting ambient air pollutant trends, Chemosphere 59 (2005) 693–701.
8. Luna A. S., Paredes M. L. L., Oliveira G. C. G., and Correa S. M. - Prediction of ozone
concentration in tropospheric levels using artificial neural networks and support vector
machine at Rio de Janeiro, Brazil, Atmospheric Environment 98 (2014) 98-104.
9. Mac Duy Hung, Nghiem Trung Dung, and Dinh Thu Hang. - Application of artificial neural
network to fill in the missing monitoring data of air quality, Vietnam Journal of Science and
Technology (VAST) 53 (3A) (2015) 199-204.
10. Mac Duy Hung and Nghiem Trung Dung - Application of Echo State Network for the
forecast of air quality, Vietnam Journal of Science and Technology (VAST) 54 (1) (2016)
54-63.
11. Mac Duy Hung, Nghiem Trung Dung, and Hoang Xuan Co - Application of Multilayer
Perceptron Neural Network for the forecast of tropospheric ozone in Hanoi, Journal of
Science and Technology of Technical Universities 111 (2016) 46-51.
12. Neusser K. - Autoregressive moving average models, Time series econometrics. Springer
Intenational Publishing, Switzerland, 2016.
13. Ooba M., Hirano T., Mogami J. I., Hirata R., and Fujinuma Y. - Comparisions of gap-filling
methods for carbon flux dataset: A combination of a genetic algorithm and artificial neural
network, Ecological Modelling 198 (2006) 473-486.
14. Vapnik V. N. - An Overview of Statistical Learning Theory, Proceeding of the IEEE
Transactions on Neural Networks 10 (5) (1999) 988-999.
15. Pai P. F. and Hong W. C. - A recurrent support vector regression model in rainfall
forecasting, Hydrological Processes 21 ( 2007) 819-827.
Các file đính kèm theo tài liệu này:
- 13036_103810386336_1_sm_0063_2081343.pdf