Calculation of environmental properties for organic compounds using quantitative structure-solubility relationships




Journal of Science and Technology 54 (4B) (2016) 162-169

CALCULATION OF ENVIRONMENTAL PROPERTIES FOR ORGANIC COMPOUNDS USING QUANTITATIVE STRUCTURE-SOLUBILITY RELATIONSHIPS

Vo Thanh Cong1, Nguyen Minh Quang1, Pham Nu Ngoc Han2, Nguyen Xuan Truong4, Tran Kim Cuong3, Pham Van Tat2, *

1Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Go Vap Tp.HCM
2Faculty of Science and Technology, Hoa Sen University, 8 Nguyen Van Trang, Dist.1, HCMC
3Department of Chemistry, University of Dalat, 01-Phu Dong Thien Vuong, Dalat City
4University of Natural Resources and Environment, 236B Le Van Si, Dist. Tan Binh, HCMC

*Email: vantat@gmail.com

Received: 15 August 2016; Accepted for publication: 10 November 2016

ABSTRACT

The solubility of organic compounds in water is closely related to their environmental behavior. In this work, the solubility values of 27 organic compounds were calculated from different molecular descriptors. Quantitative structure-solubility relationships (QSSRs) were constructed by combining multivariable regression with a genetic algorithm. The important molecular descriptors logP, SsCH3_acnt, ABSQ, nelem, nrings, SHBa, Gmax, Gmin, Xvp6, and Xvpc4 were selected by the genetic algorithm to construct the linear QSSR models. The best four-variable linear QSSR model was obtained from those descriptors. The quality of this linear model is indicated by the statistical values R2training of 96.60, standard error of estimation SE of 0.2961, F-statistic of 156.0, P-value of 0.0, R2test of 95.02, and RSS of 2.823. A neural network with the architecture I(4)-HL(4)-O(1) and R2training of 99.03 was constructed from the molecular descriptors of the four-variable linear model. The solubility values predicted by these models for the organic substances in the test group are in good agreement with those from the literature.

Keywords: QSSRs, molecular descriptors, multiple regression, neural network.

1. INTRODUCTION

The solubility of organic compounds in water is one of the most important properties for evaluating environmental effects. It is used in treating environmental pollutants in the wastewater of chemical plants, and the capacity of pollutants to dissolve in water is evaluated through it. Solubility is thus one of the standard scales for investigating the distribution and toxicity level of chemical substances. The COD and BOD parameters are also related to the solubility of organic compounds, so solubility is used to evaluate water quality as well. These aspects are considered in this work with a view to applying suitable chemicals in industry and separating inorganic substances in nature.

Quantitative structure-property relationships (QSPR) have been modeled by multiple regression techniques and evaluated with different statistics [1,2]. Artificial neural networks have been used in previous studies of quantitative structure-activity relationships (QSAR) [3,4]. Artificial intelligence techniques that combine neural networks, fuzzy logic and genetic algorithms are flexible when searching for complex and sophisticated relationships in the data mining process [5].

In this study, we use multiple linear regression and neural networks to construct different quantitative structure-solubility relationships (QSSR). The 2D and 3D molecular structure descriptors of the organic compounds are calculated with a combination of MM+ molecular mechanics and the semi-empirical SCF PM3 quantum chemical method. The linear and neural QSSR models are constructed from the structural parameters with the support of a genetic algorithm. The solubility of the organic compounds is predicted by the linear and neural QSSR models, and the calculated results are compared with experimental data.

2. COMPUTATIONAL METHOD

2.1. Data and software

The experimental solubilities of 27 organic compounds were taken from previous observations [6] and are given in Table 1. These compounds were chosen because they are present in industrial wastewaters. The 2D and 3D molecular descriptors and the linear QSSR models were obtained with Regress and QSARIS [5, 7, 8], whereas the neural QSSR models were constructed with INForm [9].

Table 1. Experimental solubility (S) of organic compounds at 25°C [1].

No   Compound                   logS
1    Isooctane                  -3.699
2    Pentane                    -1.398
3    Cyclohexane                -2.222
4    Cyclopentane               -2.000
5    Heptane                    -3.523
6    Hexane                     -1.854
7    1,1,2-trichloroethane      -1.770
8    1,2,4-trichlorobenzene     -2.600
9    Toluene                    -1.284
10   Chlorobenzene              -1.300
11   Chloroform                 -0.089
12   n-butyl chloride           -0.959
13   Ethylene dichloride        -0.092
14   Dichloromethane            -0.204
15   o-dichlorobenzene          -1.796
16   n-butyl acetate            -0.168
17   Ethyl ether                -0.838
18   Methyl isoamyl ketone      0.231
19   Methyl tert-butyl ether    0.681
20   Methyl isobutyl ketone     -0.268
21   Ethyl acetate              0.940
22   Methyl n-propyl ketone     0.775
23   Triethylamine              0.740
24   Propylene carbonate        1.243
25   Methyl ethyl ketone        1.380
26   Isobutyl alcohol           0.930
27   n-butyl alcohol            0.893

2.2. Calculation of the molecular parameters

The molecular structures of the organic compounds are optimized, and their parameters calculated, using the molecular mechanics of HyperChem [10]. Molecular parameters such as 2D and 3D structural, geometric, electrostatic and charge descriptors and the octanol/water partition coefficient are obtained from QSARIS [8].

2.3. Construction and validation of QSSR model

Model construction and evaluation proceed in the following steps:
− All cases except the first are used to fit (train) the model. The first case is then predicted with the linear or neural QSSR model, and the deviation Y1 - Ŷ1 is determined.
− All cases except the second are used to fit (train) the model. The second case is then predicted with the linear or neural QSSR model, and the deviation Y2 - Ŷ2 is determined.
− The step above is repeated for each of the remaining cases in turn.

The average R2test value is obtained from the models above. Cross-validation is carried out over the database, which is separated into two smaller groups, the training data and the test data. Each QSSR model established from the training data is used to predict the solubility of the organic compounds in the test data. The goodness of fit of the linear and neural QSSR models is expressed by the R2training and adjusted R2adj values; the predictability of the models is evaluated by cross-validation through the R2test value on the test data, according to the following equation:

$$ R^2 = \left( 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2} \right) \times 100 $$

where Y is the observed value, Ŷ the predicted value, and Ȳ the average value.

3. RESULTS AND DISCUSSION

3.1. Construction of QSSR model

The linear QSSR models were established using the Regress [5, 7] and QSARIS [8] systems. The molecular structural parameters were selected by a genetic algorithm with the differential evolution technique. The selection of structural parameters is based on model statistics such as R2training, standard error SE, R2adj, R2test, and F-stat. The optimum models are listed in Table 2.

Table 2. The calculations of QSSR linear models (k from 1 to 5) and statistical values.
Statistic / descriptor   A (k = 1)   B (k = 2)   C (k = 3)   D (k = 4)   E (k = 5)
R2training               93.32       94.82       96.01       96.6        96.68
R2adj                    93.05       94.39       95.49       95.98       95.89
Standard error, SE       0.389       0.3495      0.3136      0.2961      0.2994
F-stat                   349.2283    219.818     184.3612    156.0465    122.1842
R2test                   92.17       92.98       94.42       95.02       93.83
Constant                 0.9217      1.5831      2.1581      1.8666      0.3449
logP                     -1.1566     -1.135      -1.1926     -1.2251     -0.9714
SsCH3_acnt               -           0.1503      0.1931      -           0.1933
ABSQ                     -           -           -0.5721     -           -
nelem                    -           -           -           -           0.4477
nrings                   -           -           -           -0.5465     -
Gmax                     -           -           -           -           -0.0469
Gmin                     -           -           -           0.3202      -
Xvp6                     -           -           -           -           -2.9653
Xvpc4                    -           -           -           0.5461      -

As Table 2 shows, an optimal linear QSSR model was selected for each number of structural parameters (k from 1 to 5). Changing the number of structural parameters changes the R2training and R2test values, as described in Figure 1.

Figure 1. a) The change of R2training and R2test values depending on k in the models; b) A comparison between experimental and predicted solubility for each substance.

Table 2 further shows that the QSSR model with k = 4 gives the highest R2test value, and that this value decreases when k is increased further. Thus, the QSSR model with k = 4 is the most favorable one. Its quality is reflected in the R2 value of 96.600, the SE value of 0.2961, the F-stat value of 156.0, and the R2test value of 95.020. The QSSR model with k = 4 was examined using the cross-validation technique, which removes each case in turn, with an RSS statistic of 2.823. The linear regression equation of the QSSR model is as follows:

logS = -1.225logP + 0.5461xvpc4 + 0.3202Gmin - 0.5465nrings + 1.86663 (1)

Thus, the training data set is well described by QSSR model (1), which is statistically significant.
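The leave-one-out procedure of Section 2.3 and the percentage R2 statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Regress or QSARIS implementation: it uses a one-descriptor model of the form logS = a + b·logP (the shape of model A in Table 2), the function names are illustrative, and the data in `x_demo` and `y_demo` are hypothetical, noise-free values chosen so that the outcome is known in advance.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (one descriptor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def loo_predictions(x, y):
    """Leave-one-out: refit without case i, then predict case i."""
    preds = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        preds.append(a + b * x[i])
    return preds


def r2_percent(y_obs, y_pred):
    """R2 in percent: (1 - sum (Y - Yhat)^2 / sum (Y - Ymean)^2) * 100."""
    mean_y = sum(y_obs) / len(y_obs)
    ss_res = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))
    ss_tot = sum((yo - mean_y) ** 2 for yo in y_obs)
    return (1.0 - ss_res / ss_tot) * 100.0


# HYPOTHETICAL toy data (not from the paper): exact line logS = 1.0 - 1.2 * logP.
x_demo = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y_demo = [1.0 - 1.2 * v for v in x_demo]

y_loo = loo_predictions(x_demo, y_demo)
r2_test = r2_percent(y_demo, y_loo)                         # cross-validated R2 (%)
rss = sum((yo - yp) ** 2 for yo, yp in zip(y_demo, y_loo))  # sum of squared LOO deviations
```

Because the toy data lie exactly on a line, every held-out point is predicted exactly, so `r2_test` is 100 and `rss` is zero up to rounding. With real descriptor values the same loop would be expected to give cross-validated statistics of the kind reported in the R2test row of Table 2.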
The cross-validation technique shows that QSSR model (1) can be used to predict logS values. The statistical values used to check the significance of the coefficients in QSSR model (1) (with k = 4) are shown in Table 3. To test the significance of the selected parameters in the model, the logS values were randomly reassigned among the substances 100 times. The R2 - R2n values, with n = 1, 2, ..., 100, were calculated for each corresponding QSSR model. The average R2n value is 0.1504; the average squared deviation is 0.09849; the R2n values range from 0.004609 to 0.4679.

Table 3. The statistical values, the coefficient of QSSR model (1) with k = 4, and hypothesis testing.

Parameter   Coefficient   P value   Standard error   t-stat      Hypothesis testing
Constant    1.8666        0         0.1171           15.9421     P < α = 0.05
log P       -1.2251       0         0.0575           -21.2943    P < α = 0.05
xvpc4       0.5461        0.0419    0.2528           2.1603      P < α = 0.05
Gmin        0.3202        0.0019    0.0908           3.526       P < α = 0.05
nrings      -0.5465       0.001     0.1448           -3.7736     P < α = 0.05

The contribution percentage Pmxk of each independent parameter in QSSR model (1) with k = 4 is determined from the contributions of the parameters through the Ctotal value and is presented in Table 4. The average contribution percentage MPxk of each independent variable is given by the equation:

$$ MPx_k,\% = \frac{1}{N} \sum_{m=1}^{N} \frac{100\,|b_{m,k}\,x_{m,k}|}{\sum_{i} |b_{m,i}\,x_{m,i}|} = \frac{1}{N} \sum_{m=1}^{N} \frac{100\,|b_{m,k}\,x_{m,k}|}{C_{total}} $$ (2)

where N = 27 is the total number of compounds, m is the compound for which Pmxk,% is calculated, b_{m,k} x_{m,k} is the term of descriptor k in model (1) for compound m, and Ctotal is the sum of the absolute values of these terms over all descriptors i in the model.
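A minimal Python sketch of Eq. (2) follows. The coefficients are those of Table 3; the function names are illustrative. The descriptor values for the two compounds are an assumption: the paper does not list them, so they were back-calculated here from the Ctotal and percentage columns of Table 4, and the signs of Gmin and xvpc4 were taken as positive.

```python
# Coefficients and constant of the four-descriptor model (1), from Table 3.
COEF = {"logP": -1.2251, "xvpc4": 0.5461, "Gmin": 0.3202, "nrings": -0.5465}
INTERCEPT = 1.8666


def contributions(desc):
    """Return (C_total, {descriptor: contribution in %}) for one compound."""
    terms = {name: abs(COEF[name] * desc[name]) for name in COEF}
    c_total = sum(terms.values())
    return c_total, {name: 100.0 * t / c_total for name, t in terms.items()}


def mean_contributions(compounds):
    """Average the per-compound percentages over all compounds (MPx_k)."""
    sums = {name: 0.0 for name in COEF}
    for desc in compounds:
        _, pct = contributions(desc)
        for name in COEF:
            sums[name] += pct[name]
    return {name: s / len(compounds) for name, s in sums.items()}


def predict_logs(desc):
    """Evaluate the linear model (1) for one compound."""
    return INTERCEPT + sum(COEF[name] * desc[name] for name in COEF)


# ASSUMPTION: descriptor values back-calculated from Table 4, not given in the paper.
cyclohexane = {"logP": 3.4015, "xvpc4": 0.0, "Gmin": 1.5, "nrings": 1}
o_dichlorobenzene = {"logP": 3.0890, "xvpc4": 0.7492, "Gmin": 0.6057, "nrings": 1}

c_total, pct = contributions(cyclohexane)
mean_pct = mean_contributions([cyclohexane, o_dichlorobenzene])
logs_cyclohexane = predict_logs(cyclohexane)
```

For cyclohexane this reproduces the corresponding row of Table 4 (Ctotal of about 5.194 and an nrings share of about 10.52 %). As a consistency check on the back-calculated descriptors, the same values inserted into model (1) give logS of about -2.367, the linear-model prediction listed for cyclohexane in Table 5. Averaging over only two compounds is for illustration; the paper averages over all N compounds.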
According to the MPxk values, the importance of the molecular structure parameters in the model follows the order logP > Gmin > nrings > xvpc4, whereas the magnitudes of the corresponding coefficients in the model follow the order logP > nrings > xvpc4 > Gmin.

Table 4. The Pmxk value (%) and MPxk value (%) of each parameter in QSSR model (1) with k = 4.

                                            Pmxk (%)
Compound (m = 1-27)              Ctotal     xvpc4      Gmin       nrings     logP
Isooctane                        6.0274     2.6157     4.8036     0          92.5807
Heptane                          5.8206     0          7.4877     0          92.5123
1,2,4-trichlorobenzene           5.877      8.7359     2.6695     9.2981     79.2965
Cyclohexane                      5.1939     0          9.2474     10.521     80.2315
Cyclopentane                     4.5477     0          10.5614    12.016     77.4226
Hexane                           5.1794     0          8.3769     0          91.6231
o-dichlorobenzene                4.9338     8.2924     3.931      11.0756    76.701
1,1,2-trichlorotrifluoroethane   5.4325     16.8278    24.9094    0          58.2628
Pentane                          4.5067     0          9.5473     0          90.4527
Chlorobenzene                    3.7351     3.1908     6.8066     14.6301    75.3725
Toluene                          4.3299     2.4274     9.7745     12.6203    75.1777

The results in Table 4 show that the contribution level of each parameter in QSSR model (1) reflects its contribution to the properties of the compounds; the order of importance of the parameters for the properties of the substances is not determined by the magnitude of the coefficients. The logP parameter is strongly related to the solubility of organic compounds. Thus, the solubility of organic compounds is closely tied to the partitioning ability of the substance, as expressed by logP. The Gmin parameter represents the magnitude of the smallest electrostatic potential of the atoms in the molecule; after logP, it has the largest effect on the solubility of the compounds. This reflects the nature of the global electrostatic potential of the molecule.
Further, the nrings parameter also contributes to solubility. It depends on the rings of the molecule and is determined from R = p-(nvx-1), where p is the number of edges (bonds) and nvx is the number of vertices (atoms other than hydrogen) of the molecular graph.

3.2. The construction of QSSR neural model

The neural QSSR models are constructed on the basis of the neuro-fuzzy technique with the support of genetic algorithms on the INForm [9] system. The neural network has three layers, I(4)-HL(4)-O(1): the input layer I(4) consists of 4 neurons, namely the parameters logP, Gmin, nrings and xvpc4; the output layer O(1) has 1 neuron, log S; and the hidden layer HL(4) has 4 neurons. The error back-propagation algorithm is used to train the network. A transfer function is set on each neuron of the network layers; the training parameters are a learning rate of 0.7 and a momentum of 0.7. The mean square error (MSE) was 0.000816 after 10,000 epochs. The trained network has an R2training value of 99.030, compared with 96.600 for the linear QSSR model (1). As a result, the neural QSSR model based on the I(4)-HL(4)-O(1) network fits the data better than QSSR model (1). This can be seen in Figure 1 and Figure 2, which show the relation and agreement between predicted and experimental values.

Figure 2. a) A comparison between experimental solubility, logSexp, and predicted solubility, logStest; b) the correlation between the experimental values logSexp and the predicted values logStest.

3.3. Prediction of the solubility of compounds in test group

The predictive ability of QSSR model (1) and the neural QSSR model was carefully cross-validated by the leave-one-out (LOO) technique. The predicted results for 7 compounds selected randomly from Table 1 are shown in Table 5.
Table 5. Solubility values of test compounds predicted by QSSR (1) and the QSSR neural model.

                                            Neural QSSR model       Linear QSSR model
No   Compound                  logSexp.     logStest    ARE (%)     logStest    ARE (%)
1    n-butyl chloride          -0.9586      -1.0117     5.5425      -0.7427     22.5235
2    Ethylene dichloride       -0.092       -0.2191     138.1826    0.0356      138.7148
3    Isobutyl alcohol          0.93         1.0523      13.1505     1.1382      22.3885
4    Methyl ethyl ketone       1.38         1.1438      17.1167     0.1973      85.701
5    Methyl tert-butyl ether   0.6812       0.7703      13.0741     0.7886      15.7661
6    Cyclohexane               -2.222       -2.3304     4.8771      -2.3667     6.51
7    o-dichlorobenzene         -1.796       -1.8548     3.2717      -1.861      3.621
     MARE (%)                                           27.8879                 42.175

The predicted results of the QSSR models are evaluated by the absolute value of the relative error, ARE (%), through the following equation:

$$ ARE,\% = 100 \left| \frac{\log S_{exp} - \log S_{test}}{\log S_{exp}} \right| $$ (3)

The mean absolute relative error, MARE (%), is used to evaluate the overall error of a QSSR model through the equation

$$ MARE,\% = \frac{100}{n} \sum \left| \frac{\log S_{exp} - \log S_{test}}{\log S_{exp}} \right| $$ (4)

where n = 7 is the number of substances, and logSexp. and logStest are the experimental and predicted solubility, respectively. Comparing QSSR model (1) with the neural QSSR model on the basis of the MARE values (%) in Table 5 shows that QSSR model (1) has a lower predictive ability than the neural QSSR model. The logS values predicted by the neural QSSR model are in agreement with the experimental values, and the neural QSSR model fits the data better than QSSR model (1).

In this work the linear model nevertheless proved suitable for predicting the solubility of organic compounds. If a non-linear model is chosen for constructing the QSSR model, it is not readily adapted to multivariable models, and fitting a set of multivariable data becomes very difficult. Moreover, linear models help to identify the important variables with confidence.
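Equations (3) and (4) can be checked directly against Table 5. The short Python sketch below recomputes ARE and MARE from the logSexp. and logStest columns of that table; the function and variable names are illustrative, and the numbers are the rounded table entries, so the last digits may differ slightly from the published ones.

```python
# (compound, logS_exp, logS_test neural, logS_test linear), copied from Table 5.
TABLE5 = [
    ("n-butyl chloride", -0.9586, -1.0117, -0.7427),
    ("Ethylene dichloride", -0.092, -0.2191, 0.0356),
    ("Isobutyl alcohol", 0.93, 1.0523, 1.1382),
    ("Methyl ethyl ketone", 1.38, 1.1438, 0.1973),
    ("Methyl tert-butyl ether", 0.6812, 0.7703, 0.7886),
    ("Cyclohexane", -2.222, -2.3304, -2.3667),
    ("o-dichlorobenzene", -1.796, -1.8548, -1.861),
]


def are_percent(log_s_exp, log_s_test):
    """Absolute relative error in percent, Eq. (3)."""
    return 100.0 * abs((log_s_exp - log_s_test) / log_s_exp)


def mare_percent(pairs):
    """Mean of the absolute relative errors in percent, Eq. (4)."""
    errors = [are_percent(exp, test) for exp, test in pairs]
    return sum(errors) / len(errors)


mare_neural = mare_percent([(exp, neural) for _, exp, neural, _ in TABLE5])
mare_linear = mare_percent([(exp, linear) for _, exp, _, linear in TABLE5])
```

Recomputed this way, the MARE is about 27.9 % for the neural model and about 42.2 % for the linear model, in line with the last row of Table 5. Both averages are dominated by ethylene dichloride, whose experimental logS is close to zero, so its relative error is large even though the absolute deviation is small.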
The physicochemical properties of organic compounds are successfully predicted in this way. These properties can be used to establish relationships between the properties and the toxicity of organic compounds, and hence to predict their environmental toxicity. This is an important objective of this work.

4. CONCLUSION

This study has successfully constructed linear QSSR models with the support of genetic algorithms. This technique allows linear regression models to be constructed for large data sets, and the genetic algorithm selects the important parameters in the model. The linear QSSR models obtained satisfy the evaluated statistics. In addition, the artificial intelligence technique based on fuzzy neural relations was also supported by genetic algorithms to design the neural network I(4)-HL(4)-O(1), which fits the database well. The neural QSSR model gives better predictions than the linear QSSR model: the MARE value (%) of the linear QSSR model is larger than that of the neural QSSR model. The results of this work open up new research and many promising applications in environmental treatment and in the design of medical products.

REFERENCES

1. Zeng X. L., Wang H. J., Wang Y. - QSPR models of n-octanol/water partition coefficients and aqueous solubility, J. Chemosphere 86 (2012) 619-625.
2. Hawker D. W., Cumming J. L., Neale P. A., Bartkow M. E., Escher B. I. - A screening level fate model of organic contaminants from advanced water treatment in a potable water supply reservoir, J. Water research 45 (2011) 768-780.
3. Zhao H., Xie Q., Feng T., Chen J., Xie Q., Baocheng Q., Zhang X., Xiaona L. - Determination and prediction of octanol-air partition coefficients of hydroxylated and methoxylated polybrominated diphenyl ethers, J. Chemosphere 80 (2010) 660-664.
4. Wen Z., Zhicai Z., Zunyao W., Liansheng W. - Estimation of n-octanol/water partition coefficients (Kow) of all PCB congeners by density functional theory, J. Molecular Structure: Theochem 755 (2005) 137-145.
5. Steppan D. D., Werner J., Yeater P. R. - Essential Regression - Free Statistical Software for Microsoft Excel, John Wiley & Sons, Inc, (2000) 1-154.
6. Smallwood I. M. - Handbook of organic solvent properties, John Wiley Inc, 1996, pp. 50-326.
7. Joseph B. E. - Excel for Chemists: A Comprehensive Guide, Wiley-VCH, (2001) 150-300.
8. Tat P. V. - Development of Quantitative Structure - Activity Relationship (QSAR) and Quantitative Structure - Property Relationship (QSPR), Publisher of Natural Science and Technology, Ha Noi, (2009) 113-184.
9. INForm v2.0, Intelligensys Ltd., UK (2000) 50-150.
10. HyperChem (TM) Release 7.05, Hypercube, Inc., Florida 32601, USA, (2002) 250-1500.
