Classification model for water quality in reservoir using an integration of one-against-one strategy and least square support vector machines - Thi Phuong Trang Pham


(Vietnamese title: MÔ HÌNH PHÂN LOẠI CHẤT LƯỢNG NGUỒN NƯỚC HỒ CHỨA BẰNG SỰ KẾT HỢP CHIẾN LƯỢC MỘT ĐỐI MỘT VÀ BÌNH PHƯƠNG MÁY HỌC VÉC TƠ HỖ TRỢ)

Thi Phuong Trang Pham
University of Technology and Education - The University of Danang; trangpham3112@gmail.com

Abstract - An inefficient water management system can be a major obstacle to human-centered sustainable development. Classifying water quality in reservoirs is therefore essential to resolving environmental problems and is a relevant tool for the sustainable and harmonious development of communities. This article proposes a classification model for reservoir water quality based on an integration of the one-against-one (OAO) strategy and the least square support vector machine (LSSVM). The paper analyzes and compares the performance of various classification models and algorithms to demonstrate the suitability of the proposed model, which classifies water quality with an accuracy of 92.196%.

Key words - Management system; classification model; reservoir water quality; one-against-one; least square support vector machines.

1. Introduction

Water is a primary natural resource that supports human survival and health through drinking, irrigation, hydroelectricity, fish farming and recreation. Reservoirs are subject to intense multi-objective demands on limited resources, and growing water use draws more attention to water quality. Water quality affects other environmental interests, such as fish and wildlife, and can impair water use. An efficient water management system is therefore a major goal of contemporary societies, given the importance of water to health and the need to safeguard and promote its sustainable use. However, reservoir water quality is currently assessed through analytical methods, which can be impractical because of the distances to be covered, the number of parameters to be considered, and the cost of obtaining the data.

Technological breakthroughs have provided new ways to create and store information. Many organizations now accumulate large amounts of information daily about their cities and processes, on the assumption that a large volume of data can be a source of knowledge for improving their performance and behavior, either by revealing trends or by speeding up efficient decision-making.
However, conventional data analysis tools have many drawbacks: they cannot detect singularities within such massive data sets, they are time consuming and labor intensive, and they give low accuracy. Some studies [1, 2] have used artificial neural networks (ANN) to evaluate water quality directly, and Liao et al. (2011) combined a multiclass support vector machine (SVM) with biomonitoring to assess water quality [3].

Classifying reservoir water quality is a multi-class classification problem, for which single-machine methods and decomposition strategies are the most popular approaches. Decomposition strategies [4] are commonly used to solve classification problems with multiple classes. These methods transform a multi-class classification problem into several binary classification problems [5]. Several studies [6, 7] have demonstrated that one-against-one (OAO) [8] is one of the most effective decomposition strategies. Machine learning techniques are powerful research tools, and the least squares support vector machine (LSSVM) is a highly enhanced machine-learning technique with many advanced features [9]. Single-machine methods, by contrast, take a significant amount of computing time to solve large optimization problems and are not suitable for practical applications [10, 11].

The aim of this research is to propose a suitable multi-class classification model for reservoir water quality based on the combination of the OAO approach and the LSSVM model. This study therefore proposes a multi-class classification model, OAO-LSSVM, to predict reservoir water quality grades. To verify its effectiveness, the paper compares its performance with that of other models; on the water quality data, the proposed model achieves an accuracy of 92.196%, higher than the other models. The rest of this paper is organized as follows.
Section 2 reviews the LSSVM, the OAO strategy and the classification evaluation measures. Section 3 describes the dataset and the analytical results. Section 4 gives the conclusions.

ISSN 1859-1531 - TẠP CHÍ KHOA HỌC VÀ CÔNG NGHỆ ĐẠI HỌC ĐÀ NẴNG, SỐ 11(132).2018, QUYỂN 2

2. Methodology

2.1. Least square support vector machines for classification

The LSSVM was proposed by Suykens et al. (2002) [12]. For function estimation with the LSSVM, the optimization problem is formulated as

$$\min_{\omega, b, e} J(\omega, e) = \frac{1}{2}\|\omega\|^2 + \frac{1}{2} C \sum_{k=1}^{N} e_k^2 \qquad (1)$$

where $\omega$ is the weight vector, $e_k$ are the error variables and $C$ is the regularization constant. Equation (2) is the resulting LSSVM model for function prediction.

$$f(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b \qquad (2)$$

where $\alpha_k$ and $b$ are the Lagrange multipliers and the bias term, respectively, and $K(x, x_k)$ is the kernel function. In this study, a radial basis function (RBF) kernel is used. Equation (3) is the RBF kernel, where $\sigma$ is the kernel width.

$$K(x, x_k) = \exp\left(-\|x - x_k\|^2 / 2\sigma^2\right) \qquad (3)$$

2.2. One-against-one strategy

Decomposing the original problem into several sub-problems is widely used to solve multi-class classification problems with binary classifiers. One-against-one (OAO) [8] is one of the most effective available decomposition strategies [6], so the OAO algorithm is used for decomposition herein. The OAO scheme divides the original problem into as many binary problems as there are pairs of classes. The OAO method therefore constructs k(k - 1)/2 classifiers [5], where k is the number of classes. The outputs of all classifiers are combined to yield the final result. Several combination methods are available for the OAO scheme; the most common is simple voting [13].

The LSSVM is a useful tool for binary classification, but many real-world problems are multi-class. For this reason, the author combines the OAO approach with the LSSVM model.
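As an illustration of how the two pieces fit together, the following Python sketch trains one RBF-kernel LSSVM per pair of classes and combines the pairwise decisions by simple voting. It is not the author's MATLAB code. The function names (`rbf_kernel`, `lssvm_fit`, `oao_fit`, `oao_predict`), the default values `C=10.0` and `sigma=1.0`, and the choice to fit each binary LSSVM in the function-estimation form of Eq. (2) on targets +1/-1 and classify by the sign of f(x) are all assumptions made for this example. For the five water quality classes, k(k - 1)/2 gives 10 binary classifiers.

```python
import itertools

import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel of Eq. (3) between the rows of A and the rows of B."""
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))


def lssvm_fit(X, t, C, sigma):
    """Fit one binary LSSVM on targets t in {-1, +1}; returns (b, alpha).

    Solves the standard LSSVM linear system
        [0   1^T      ] [b    ]   [0]
        [1   K + I / C] [alpha] = [t]
    """
    n = len(t)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / C
    solution = np.linalg.solve(A, np.concatenate(([0.0], t)))
    return solution[0], solution[1:]


def oao_fit(X, y, C=10.0, sigma=1.0):
    """Train k(k - 1)/2 binary LSSVMs, one per pair of classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    machines = []
    for i, j in itertools.combinations(range(len(classes)), 2):
        mask = (y == classes[i]) | (y == classes[j])
        X_pair = X[mask]
        # Class i is the positive side (+1), class j the negative side (-1).
        t = np.where(y[mask] == classes[i], 1.0, -1.0)
        b, alpha = lssvm_fit(X_pair, t, C, sigma)
        machines.append((i, j, X_pair, b, alpha))
    return {"classes": classes, "sigma": sigma, "machines": machines}


def oao_predict(model, X_new):
    """Combine the pairwise decisions by simple voting."""
    X_new = np.asarray(X_new, dtype=float)
    rows = np.arange(len(X_new))
    votes = np.zeros((len(X_new), len(model["classes"])), dtype=int)
    for i, j, X_pair, b, alpha in model["machines"]:
        # Eq. (2): f(x) = sum_k alpha_k K(x, x_k) + b
        f = rbf_kernel(X_new, X_pair, model["sigma"]) @ alpha + b
        winner = np.where(f >= 0.0, i, j)
        votes[rows, winner] += 1
    return model["classes"][votes.argmax(axis=1)]
```

For instance, `model = oao_fit(X, y)` followed by `oao_predict(model, X_new)` returns one predicted class label per row of `X_new`. Ties in the vote go to the lowest class label, which is a simplification; in practice `C` and `sigma` would also need to be tuned on the data.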
The final model was built by integrating the OAO and LSSVM codes in MATLAB.

2.3. Evaluation

Various approaches have been suggested for evaluating the performance of multiclass classifiers. This study employs six evaluation measures: accuracy, precision, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and overall average performance score (S).

Accuracy can be defined as the degree of uncertainty in a measurement with respect to an absolute standard. The predictive accuracy of a classification algorithm is calculated as follows

$$\text{Accuracy} = \frac{tp + tn}{tp + fp + tn + fn} \qquad (4)$$

where the true positives (tp), the number of correctly recognized class examples, and the true negatives (tn), the number of correctly recognized examples that do not belong to the class, represent accurate classifications. The false positives (fp), the number of examples incorrectly assigned to the class, and the false negatives (fn), the number of examples of the class that were not recognized as such, represent erroneous classifications.

Two extended versions of accuracy are precision and sensitivity. Precision measures the reproducibility of a measurement, whereas sensitivity (also called recall) measures its completeness. Precision in Eq. (5) is the number of true positives as a proportion of the total number of true positives and false positives provided by the classifier. Sensitivity in Eq. (6) is the number of correctly classified positive examples divided by the number of positive examples in the data. Sensitivity is useful for estimating how effectively a classifier identifies positive labels.

$$\text{Precision} = \frac{tp}{tp + fp} \qquad (5)$$

$$\text{Sensitivity} = \frac{tp}{tp + fn} \qquad (6)$$

Another performance metric is specificity, the ability of a test to correctly identify the cases that do not belong to the class.
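To make Eqs. (4)-(6) concrete, the sketch below counts tp, fp, tn and fn for one class against all the others and evaluates the three measures. Treating each class one-vs-rest in this way is an assumption of this example (the paper does not state how the five-class results were aggregated), and the function names and the six-sample label lists are hypothetical.

```python
def per_class_counts(y_true, y_pred, cls):
    """Count tp, fp, tn, fn for class `cls` against all other classes."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == cls and truth == cls:
            tp += 1          # correctly recognized class example
        elif guess == cls:
            fp += 1          # incorrectly assigned to the class
        elif truth == cls:
            fn += 1          # class example that was missed
        else:
            tn += 1          # correctly recognized as not in the class
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Eq. (4)."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Eq. (5); returns 0.0 when the class was never predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Eq. (6); returns 0.0 when the class has no examples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels for six samples drawn from three classes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

tp, fp, tn, fn = per_class_counts(y_true, y_pred, cls=1)
print(tp, fp, tn, fn)                 # 1 1 3 1
print(precision(tp, fp))              # 0.5
print(sensitivity(tp, fn))            # 0.5
```

Returning 0.0 for an empty denominator is a convention chosen here to keep the sketch runnable; other conventions (for example, reporting the value as undefined) are equally defensible.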
Specificity is estimated by calculating the number of true negatives as a proportion of the total number of true negatives and false positives. Equation (7) is the formula for specificity.

$$\text{Specificity} = \frac{tn}{tn + fp} \qquad (7)$$

The AUC is the area under the receiver operating characteristic (ROC) curve. The ROC curve is the most commonly used tool for visualizing the performance of a classifier, and the AUC summarizes that performance as a single number [14]. The AUC, sometimes referred to as the balanced accuracy [15], is obtained using Eq. (8).

$$\text{AUC} = \frac{1}{2}\left(\frac{tp}{tp + fn} + \frac{tn}{tn + fp}\right) \qquad (8)$$

To combine the effects of the preceding measures, an overall average performance score (S) for the distinct classification models is proposed in Eq. (9).

$$S = \frac{1}{m} \sum_{i=1}^{m} P_i \qquad (9)$$

where m is the number of distinct performance measures and $P_i$ is the i-th performance measure.

3. Data preparation and analytical results

3.1. Data preparation

The case study in this paper, from the field of hydroelectric engineering, uses a dataset on reservoir water quality from 150 reservoirs in Taiwan. The author collected the data from the Taiwan water annual report. The quality of water plays an important role because water is a primary natural resource that supports the survival and health of humans through drinking, irrigation, hydroelectricity, aquaculture and recreation. Predicting water quality is also critical to managing it, because it enables managers to make better choices. The accurate prediction of phenomena related to water is essential to the optimal management of water resources.

Table 1. Statistical attributes of reservoir water quality dataset

| No. | Input parameter | No. | Output parameter (reservoir water quality grade) |
|---|---|---|---|
| 1 | Secchi disk depth (SD) | 1 | Excellent - Class 1 |
| 2 | Chlorophyll a (Chla) | 2 | Good - Class 2 |
| 3 | Total phosphorus (TP) | 3 | Average - Class 3 |
| | | 4 | Fair - Class 4 |
| | | 5 | Poor - Class 5 |

Table 1 shows the details of the water quality dataset. Carlson's Trophic State Index (CTSI) has long been used in the region to assess eutrophication in reservoirs. The factors generally considered in evaluating reservoir water quality include Secchi disk depth (SD), chlorophyll a (Chla), total phosphorus (TP), dissolved oxygen (DO), ammonia (NH3), biochemical oxygen demand (BOD), temperature (TEMP) and others. In this investigation, SD, Chla and TP are used to classify the quality of water in a reservoir because they are the most important and most widely used factors in assessing water quality. The single-indicator water quality differentiations of the Organization for Economic Cooperation and Development (OECD) (Table 2) [16] are used to generate five levels for each evaluation factor: excellent (Class 1), good (Class 2), average (Class 3), fair (Class 4) and poor (Class 5). The database includes 1576 data points with three independent inputs (SD, Chla and TP); the output is one of the five ratings of reservoir water quality.

Table 2. Water quality parameter classification

| Water quality constituent \ Level | Excellent (1) | Good (2) | Average (3) | Fair (4) | Poor (5) |
|---|---|---|---|---|---|
| Secchi disk depth (SD) * | > 4.5 | 4.5-3.7 | 3.7-2.3 | 2.3-1.7 | < 1.7 |
| Chlorophyll a (Chla) * | … | … | … | … | 10 |
| Total phosphorus (TP) * | … | … | … | … | 40 |
| Carlson's Trophic State Index (CTSI) | … | … | … | … | 70 |

* Trophic state as a function of nutrient levels defined by OECD

In the source, only the last value of the Chla, TP and CTSI rows (10, 40 and 70) survives, without its comparison sign; the entries marked … are missing.

3.2. Analytical results

To demonstrate the effectiveness of the OAO-LSSVM model, its predictive performance is compared with that of single multi-class classification algorithms: Sequential Minimal Optimization (SMO), Multiclass Classifier, Naïve Bayes, Logistic and the Library for Support Vector Machines (LibSVM). The performance of the OAO-LSSVM model is evaluated in terms of accuracy, precision, sensitivity, specificity, AUC and S. Table 3 compares the performance of the SMO, Multiclass Classifier, Naïve Bayes, Logistic, LibSVM and OAO-LSSVM models in predicting reservoir water quality on the test data.

The numerical results show that the OAO-LSSVM was the best model for this dataset in terms of accuracy, precision, sensitivity, AUC and S (92.196%, 90.794%, 90.633%, 91.405% and 91.421%, respectively). Although its specificity (92.078%) is lower than that of the Naïve Bayes and Multiclass Classifier models, the proposed model has the highest accuracy and S, the indexes most commonly used to compare multi-class classification models. Each S value in Table 3 is the mean of the five preceding measures; for OAO-LSSVM, (92.196 + 90.794 + 90.633 + 92.078 + 91.405)/5 = 91.421. Figure 1 ranks the models by overall average performance score (S).

Table 3. Comparison of other multi-class models and the proposed model

| Multi-class model | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | AUC (%) | S (%) |
|---|---|---|---|---|---|---|
| SMO | 75.238 | 75.200 | 77.500 | 85.900 | 81.705 | 79.109 |
| Multiclass Classifier | 85.397 | 85.400 | 86.500 | 94.900 | 90.71 | 88.581 |
| Naïve Bayes | 76.000 | 76.000 | 78.700 | 99.500 | 89.15 | 83.870 |
| Logistic | 89.580 | 89.600 | 89.600 | 90.600 | 90.36 | 89.948 |
| LibSVM | 80.950 | 81.000 | 81.000 | 87.600 | 84.306 | 82.971 |
| OAO-LSSVM | 92.196 | 90.794 | 90.633 | 92.078 | 91.405 | 91.421 |

[Figure 1: bar chart of S for each model (the S column of Table 3); axes: Percentage (%), Predicted models]

Figure 1. Comparison of models in terms of overall average performance score (S)

4. Conclusions

This study investigates the effectiveness of a hybrid intelligence model that integrates the LSSVM algorithm with the OAO decomposition strategy to improve predictive accuracy on a multi-class problem: classifying water quality into five levels (Excellent, Good, Average, Fair and Poor). The effectiveness of the OAO-LSSVM model is compared with that of SMO, Multiclass Classifier, Naïve Bayes, Logistic and LibSVM. The proposed model yields a higher predictive accuracy and overall average performance score than the other models, at 92.196% and 91.421%, respectively. The OAO-LSSVM model can therefore be used as a potential tool for classifying reservoir water quality. In further study, the author hopes to improve the proposed model to obtain a more robust classification model for water quality and to handle more real-world multi-class classification problems.

REFERENCES

[1] S. Palani, S.-Y. Liong, P. Tkalich, An ANN application for water quality forecasting, Marine Pollution Bulletin 56 (2008) 1586-1597.
[2] K.P. Singh, A. Basant, A. Malik, G. Jain, Artificial neural network modeling of the river water quality - A case study, Ecological Modelling 220(6) (2009) 888-895.
[3] Y. Liao, J. Xu, W. Wang, A Method of Water Quality Assessment Based on Biomonitoring and Multiclass Support Vector Machine, Procedia Environmental Sciences 10 (2011) 451-457.
[4] A.C. Lorena, A.C.P.L.F. de Carvalho, J.M.P. Gama, A review on the combination of binary classifiers in multiclass problems, Artificial Intelligence Review 30(1) (2009) 19.
[5] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes, Pattern Recognition 44(8) (2011) 1761-1776.
[6] M. Galar, A. Fernández, E. Barrenechea, F. Herrera, DRCW-OVO: Distance-based relative competence weighting combination for One-vs-One strategy in multi-class problems, Pattern Recognition 48(1) (2015) 28-42.
[7] S. Kang, S. Cho, P. Kang, Constructing a multi-class classifier using one-against-one approach with different binary classifiers, Neurocomputing 149(PB) (2015) 677-682.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations Newsletter 11(1) (2009) 10-18.
[9] J.-S. Chou, A.-D. Pham, Nature-inspired metaheuristic optimization in least squares support vector regression for obtaining bridge scour information, Information Sciences 399 (2017) 64-80.
[10] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13(2) (2002) 415-425.
[11] R. Rifkin, A. Klautau, In Defense of One-Vs-All Classification, Journal of Machine Learning Research 5 (2004) 101-141.
[12] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least squares support vector machines, World Scientific, Singapore, 2002.
[13] N. García-Pedrajas, D. Ortiz-Boyer, An empirical study of binary classifier fusion methods for multiclass classification, Information Fusion 12(2) (2011) 111-130.
[14] J.-S. Chou, C.-F. Tsai, Y.-H. Lu, Project dispute prediction by hybrid machine learning techniques, Journal of Civil Engineering and Management 19(4) (2013) 505-517.
[15] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing & Management 45(4) (2009) 427-437.
[16] N.T.U. Hydrotech Research Institute, Reservoir eutrophication prediction and prevention by using remote sensing technique, Water Resources Agency (in Chinese) (2005).

(The Board of Editors received the paper on 19/9/2018; its review was completed on 31/10/2018)
