Journal of Science and Technology 54 (4B) (2016) 162-169
CALCULATION OF ENVIRONMENTAL PROPERTIES FOR
ORGANIC COMPOUNDS USING QUANTITATIVE STRUCTURE-
SOLUBILITY RELATIONSHIPS
Vo Thanh Cong1, Nguyen Minh Quang1, Pham Nu Ngoc Han2,
Nguyen Xuan Truong4, Tran Kim Cuong3, Pham Van Tat2, *
1Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Go Vap Tp.HCM
2Faculty of Science and Technology, Hoa Sen University, 8 Nguyen Van Trang, Dist.1, HCMC
3Department of Chemistry, University of Dalat, 01-Phu Dong Thien Vuong, Dalat City
4University of Natural Resources and Environment, 236B Le Van Si, Dist. Tan Binh, HCMC
*Email: vantat@gmail.com
Received: 15 August 2016; Accepted for publication: 10 November 2016
ABSTRACT
The solubility of organic compounds in water is related to their environmental behavior.
In this work, the solubility values of 27 organic compounds were calculated using
different molecular descriptors. The quantitative structure-solubility relationships (QSSRs) were
constructed by combining the multivariable regression technique with a genetic algorithm. The
important molecular descriptors logP, SsCH3_acnt, ABSQ, nelem, nrings, SHBa, Gmax,
Gmin, Xvp6, and Xvpc4 were selected by the genetic algorithm to construct the linear QSSR
models. The best linear QSSR model, with four variables, was obtained from those descriptors. The
quality of this linear model is indicated by R2training of 96.60, a standard error of
estimation (SE) of 0.2961, an F-statistic of 156.0, a P-value of 0.0, R2test of 95.02, and an RSS value of
2.823. A neural network with the architecture I(4)-HL(4)-O(1) and R2training of 99.03 was
constructed from the molecular descriptors of the four-variable linear model. The solubility
values predicted by these models for the organic substances in the test group are in good
agreement with those from the literature.
Keywords: QSSRs, molecular descriptors, multiple regression, neural network.
1. INTRODUCTION
The solubility of organic compounds in water is one of the most important targets
for evaluating environmental effects. This property is employed in treating the environmental
pollutants in the wastewater of chemical factories. The ability of pollutants to dissolve in water is
evaluated by the solubility. This parameter is thus one of the standard scales for investigating the
distribution and toxicity level of chemical substances. The COD and BOD parameters are also
related to the solubility of organic compounds. Therefore, solubility is used to evaluate water
quality. These aspects are considered in this work in order to apply suitable chemicals in industry
and to separate the inorganic substances in nature.
The quantitative structure-property relationships (QSPR) have been modeled by multiple
regression techniques and evaluated with different statistics [1,2]. Artificial neural networks
have been used in previous studies of quantitative structure-activity relationships (QSAR)
[3,4]. Artificial intelligence techniques that combine neural networks, fuzzy logic, and
genetic algorithms are flexible when searching for complex and sophisticated relationships
in the data mining process [5].
In this study, we use multiple linear regression and neural networks
to construct different quantitative structure-solubility relationships (QSSR). The
descriptors of the 2D and 3D molecular structure of the organic compounds are calculated
with a combination of MM+ molecular mechanics and the semi-empirical SCF PM3 quantum
chemical method. The QSSR linear and QSSR neural models are constructed from the structural
parameters with the support of a genetic algorithm. The solubility of the organic compounds
is predicted by the QSSR linear and QSSR neural models, and the calculated results are
compared with experimental data.
2. COMPUTATIONAL METHOD
2.1. Data and software
The experimental solubility values of 27 organic compounds were taken from previous observations
[6] and are given in Table 1. These compounds were chosen because they are present in industrial
wastewaters. The 2D and 3D molecular descriptors and the QSSR linear models were obtained with
Regress and QSARIS [5, 7, 8], whereas the QSSR neural models were constructed with INForm [9].
Table 1. Experimental solubility (S) of organic compounds at 25°C [1].
No   Compound                  logS       No   Compound                  logS
1    Isooctane                 -3.699     15   o-dichlorobenzene         -1.796
2    Pentane                   -1.398     16   n-butyl acetate           -0.168
3    Cyclohexane               -2.222     17   Ethyl ether               -0.838
4    Cyclopentane              -2.000     18   Methyl isoamyl ketone      0.231
5    Heptane                   -3.523     19   Methyl tert-butyl ether    0.681
6    Hexane                    -1.854     20   Methyl isobutyl ketone    -0.268
7    1,1,2-trichloroethane     -1.770     21   Ethyl acetate              0.940
8    1,2,4-trichlorobenzene    -2.600     22   Methyl n-propyl ketone     0.775
9    Toluene                   -1.284     23   Triethylamine              0.740
10   Chlorobenzene             -1.300     24   Propylene carbonate        1.243
11   Chloroform                -0.089     25   Methyl ethyl ketone        1.380
12   n-butyl chloride          -0.959     26   Isobutyl alcohol           0.930
13   Ethylene dichloride       -0.092     27   n-butyl alcohol            0.893
14   Dichloromethane           -0.204
2.2. Calculation of the molecular parameters
The molecular structures of the organic compounds are optimized using the
molecular mechanics of HyperChem [10]. Molecular parameters such as the 2D and 3D
structural, geometric, electrostatic, and charge descriptors and the octanol/water partition
coefficient are obtained from the QSARIS program [8].
2.3. Construction and validation of QSSR model
The models are constructed and evaluated through the following steps:
− All the cases except the first one are used to fit (train) the model. The first case is
then predicted with the QSSR linear or QSSR neural model, and the
deviation Y1 − Ŷ1 is determined.
− All the cases except the second one are used to fit (train) the model. The second case
is then predicted with the QSSR linear or QSSR neural
model, and the deviation Y2 − Ŷ2 is determined.
− The step above is repeated in the same way for each of the remaining cases.
The average R2test value is obtained from the models above.
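The leave-one-out steps above can be sketched in Python as follows. This is a minimal illustration that assumes an ordinary least-squares fit with NumPy in place of the Regress and QSARIS software; the function name `loo_predictions` and the synthetic descriptor matrix are hypothetical and are not data from this study.

```python
import numpy as np

def loo_predictions(X, y):
    """Predict each case from a linear model fitted to all the other cases."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # intercept column + descriptors
    y_hat = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # all the cases except case i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_hat[i] = A[i] @ coef                # prediction for the left-out case
    return y_hat

# Hypothetical data: 27 compounds, 4 descriptors (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(27, 4))
y_demo = 1.0 + X_demo @ np.array([-1.2, 0.5, 0.3, -0.5]) + rng.normal(scale=0.1, size=27)

deviations = y_demo - loo_predictions(X_demo, y_demo)   # Y_i - Yhat_i for every case
press = float(np.sum(deviations ** 2))                  # sum of squared LOO deviations
```

The deviations collected in this way are the quantities that enter the numerator of a cross-validated R2.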
Cross-validation is performed on the database. The database is separated into two smaller
groups, the training data and the test data. Each QSSR model established from the
training data is used to predict the solubility of the organic compounds in the test data. The
goodness of fit of the QSSR linear and QSSR neural models is shown by the R2training and
adjusted R2adj values; the predictability of the models is evaluated by cross-validation
through the R2test value on the test data, according to the following equation:
$$R^2 = \left(1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}\right) \times 100$$
where Y_i is the observed value, Ŷ_i is the predicted value, and Ȳ is the average of the observed values.
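The R2 expression above translates directly into code. The sketch below is a plain-Python illustration; the function name `r2_percent` and the four observed and predicted values are hypothetical.

```python
def r2_percent(y_obs, y_pred):
    """R2 on the percentage scale: (1 - SS_res / SS_tot) * 100."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))   # sum of (Y - Yhat)^2
    ss_tot = sum((y - mean_obs) ** 2 for y in y_obs)              # sum of (Y - Ybar)^2
    return (1.0 - ss_res / ss_tot) * 100.0

# Hypothetical observed and predicted logS values (illustration only).
y_obs = [-2.0, -1.0, 0.0, 1.0]
y_pred = [-1.9, -1.2, 0.1, 0.9]
r2_demo = r2_percent(y_obs, y_pred)
```

A perfect prediction gives 100, and predicting the mean of the observed values for every case gives 0.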
3. RESULTS AND DISCUSSION
3.1. Construction of QSSR model
The QSSR linear models were established using the Regress [5, 7] and QSARIS [8] systems.
The molecular structural parameters were selected by a genetic algorithm with
the differential evolution technique. The selection of the structural parameters is based on
model statistics such as R2training, the standard error SE, R2adj, R2test, and F-stat. The optimum
models are listed in Table 2.
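To illustrate how a genetic algorithm can pick k descriptors out of a larger pool, a small sketch follows. It is not the procedure implemented in Regress or QSARIS: it uses a plain genetic algorithm with truncation selection rather than differential evolution, the fitness is simply the R2 of an ordinary least-squares fit, and the names `r2_fit` and `ga_select`, the population settings, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def r2_fit(X, y, subset):
    """R2 (as a fraction) of a least-squares fit on the descriptor columns in `subset`."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, k, pop_size=30, generations=40, seed=0):
    """Search for a k-descriptor subset with high R2 using a simple genetic algorithm."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    # Each individual is a sorted tuple of k distinct descriptor indices.
    pop = [tuple(sorted(int(d) for d in rng.choice(n_desc, size=k, replace=False)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: r2_fit(X, y, s), reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            pool = sorted(set(parents[a]) | set(parents[b]))
            child = set(int(d) for d in rng.choice(pool, size=k, replace=False))  # crossover
            if rng.random() < 0.3:                                                # mutation
                unused = [d for d in range(n_desc) if d not in child]
                if unused:
                    child.remove(int(rng.choice(sorted(child))))
                    child.add(int(rng.choice(unused)))
            children.append(tuple(sorted(child)))
        pop = parents + children
    return max(pop, key=lambda s: r2_fit(X, y, s))

# Synthetic pool of 10 descriptors for 27 compounds (illustration only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(27, 10))
y_demo = X_demo[:, [1, 4, 6, 9]] @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.05, size=27)
best_subset = ga_select(X_demo, y_demo, k=4)
```

Because k is fixed during each run, the in-sample R2 is a fair fitness for comparing subsets of the same size; comparing models with different k needs statistics such as R2adj or R2test, as in Table 2.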
Table 2. The calculations of QSSR linear models (k from 1 to 5) and statistical values.
Statistical parameters and   QSSR linear models
molecular descriptors        A (k = 1)   B (k = 2)   C (k = 3)   D (k = 4)   E (k = 5)
R2training                   93.32       94.82       96.01       96.6        96.68
R2adj                        93.05       94.39       95.49       95.98       95.89
Standard error, SE           0.389       0.3495      0.3136      0.2961      0.2994
F-stat                       349.2283    219.818     184.3612    156.0465    122.1842
R2test                       92.17       92.98       94.42       95.02       93.83
Constant                     0.9217      1.5831      2.1581      1.8666      0.3449
logP                         -1.1566     -1.135      -1.1926     -1.2251     -0.9714
SsCH3_acnt                   -           0.1503      0.1931      -           0.1933
ABSQ                         -           -           -0.5721     -           -
nelem                        -           -           -           -           0.4477
nrings                       -           -           -           -0.5465     -
Gmax                         -           -           -           -           -0.0469
Gmin                         -           -           -           0.3202      -
Xvp6                         -           -           -           -           -2.9653
Xvpc4                        -           -           -           0.5461      -
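The R2adj and F-stat rows of Table 2 can be cross-checked against R2training with the standard textbook formulas for a linear model with k descriptors. The source does not state these formulas, and the sketch assumes that all N = 27 compounds enter the fit; with these assumptions the computed values come out close to the tabulated ones.

```python
def adjusted_r2(r2_percent, n, k):
    """Adjusted R2 (in %) for n cases and k descriptors."""
    r2 = r2_percent / 100.0
    return (1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)) * 100.0

def f_statistic(r2_percent, n, k):
    """F-statistic of the regression computed from R2."""
    r2 = r2_percent / 100.0
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

N = 27  # assumption: every compound of Table 1 is used in the fit
R2_TRAINING = {1: 93.32, 2: 94.82, 3: 96.01, 4: 96.60, 5: 96.68}   # from Table 2

check = {k: (adjusted_r2(r2, N, k), f_statistic(r2, N, k)) for k, r2 in R2_TRAINING.items()}
for k, (r2_adj, f_stat) in check.items():
    print(k, round(r2_adj, 2), round(f_stat, 1))
```

Small differences from the tabulated F-stat values are expected, because R2training is rounded to two decimals in the table.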
As shown in Table 2, the optimum QSSR linear model was selected for each
number of structural parameters (k from 1 to 5). Changing the number of structural parameters
changes the R2training and R2test values, as described in Figure 1.
Figure 1. a) The change of the R2training and R2test values with the k value of the models;
b) a comparison between experimental and predicted solubility for each substance.
Table 2 also shows that the QSSR model with k = 4 gives the
highest R2test value. This value decreases when k is increased further. Thus, the QSSR model with k = 4
is the most favorable one. The quality of this model is given by the R2training value of 96.600, the
SE value of 0.2961, the F-stat value of 156.0, and the R2test value of 95.020. The QSSR model
with k = 4 was examined using the cross-validation technique, which removes each case in turn,
giving a residual sum of squares (RSS) of 2.823. The linear regression equation of the
QSSR model is as follows:
logS = –1.2251 logP + 0.5461 xvpc4 + 0.3202 Gmin – 0.5465 nrings + 1.8666 (1)
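Equation (1) can be evaluated with a few lines of Python. The coefficients below are those listed for the k = 4 model in Tables 2 and 3; the function name `predict_logS` and the descriptor values of the example compound are hypothetical.

```python
# Coefficients of QSSR model (1), k = 4, as listed in Tables 2 and 3.
COEFFICIENTS = {"logP": -1.2251, "xvpc4": 0.5461, "Gmin": 0.3202, "nrings": -0.5465}
CONSTANT = 1.8666

def predict_logS(descriptors):
    """Evaluate logS = constant + sum of coefficient * descriptor value."""
    return CONSTANT + sum(COEFFICIENTS[name] * descriptors[name] for name in COEFFICIENTS)

# Hypothetical descriptor values for a single compound (illustration only).
example = {"logP": 3.4, "xvpc4": 0.0, "Gmin": 1.5, "nrings": 1}
logS_example = predict_logS(example)
```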
Thus, the training data set is well described by QSSR model (1), which
is statistically significant. The cross-validation technique shows that QSSR model
(1) can be used to predict logS values. The statistics used to check the significance of the
coefficients in QSSR model (1) (with k = 4) are shown in Table 3. To test the significance of
the selected parameters in the model, the logS values were randomly reassigned
among the substances 100 times. The R2n values, with n = 1, 2, ..., 100, were calculated for each
corresponding QSSR model. The average R2n value is 0.1504; the average square
deviation is 0.09849; the R2n values range from 0.004609 to 0.4679. These values are well
below the R2training of 96.60 % (0.966) obtained for model (1), which suggests that the model does
not result from a chance correlation.
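The randomization test described above can be sketched as follows. The code assumes an ordinary least-squares fit with NumPy and a synthetic data set; the function names `r2_fraction` and `y_randomization` are hypothetical, and the numbers it produces are not those reported in this work.

```python
import numpy as np

def r2_fraction(X, y):
    """R2 (0 to 1) of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=100, seed=0):
    """Refit the model n_trials times with the y values shuffled among the cases."""
    rng = np.random.default_rng(seed)
    return np.array([r2_fraction(X, rng.permutation(y)) for _ in range(n_trials)])

# Synthetic data: 27 compounds, 4 descriptors (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(27, 4))
y_demo = 1.0 + X_demo @ np.array([-1.2, 0.5, 0.3, -0.5]) + rng.normal(scale=0.1, size=27)

r2_true = r2_fraction(X_demo, y_demo)
r2_random = y_randomization(X_demo, y_demo)
summary = (r2_random.mean(), r2_random.min(), r2_random.max())
```

If the R2 of the real model lies far above the whole range of the randomized R2n values, a chance correlation is unlikely.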
Table 3. The statistical values, the coefficients of QSSR model (1) with k = 4, and hypothesis testing.
Parameters   Coefficient   P value   Standard error   t-stat     Hypothesis testing
Constant     1.8666        0         0.1171           15.9421    P < α = 0.05
logP         -1.2251       0         0.0575           -21.2943   P < α = 0.05
xvpc4        0.5461        0.0419    0.2528           2.1603     P < α = 0.05
Gmin         0.3202        0.0019    0.0908           3.526      P < α = 0.05
nrings       -0.5465       0.001     0.1448           -3.7736    P < α = 0.05
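The t-stat column of Table 3 is the ratio of each coefficient to its standard error, which the short sketch below recomputes. The comparison with a critical value is an assumption of this example: it takes 22 degrees of freedom (27 compounds minus 4 descriptors minus the constant) and the tabulated two-sided 5 % critical value of Student's t, 2.074.

```python
# (coefficient, standard error) pairs copied from Table 3.
TABLE3 = {
    "Constant": (1.8666, 0.1171),
    "logP": (-1.2251, 0.0575),
    "xvpc4": (0.5461, 0.2528),
    "Gmin": (0.3202, 0.0908),
    "nrings": (-0.5465, 0.1448),
}
T_CRITICAL = 2.074   # assumption: two-sided 5 % critical value for 22 degrees of freedom

t_stats = {name: coef / se for name, (coef, se) in TABLE3.items()}
significant = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    print(f"{name:9s} t = {t:9.4f}  significant at 5 %: {significant[name]}")
```

The recomputed ratios agree with the tabulated t-stat values to within the rounding of the coefficients and standard errors.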
The contribution percentage, Pmxk, of each independent parameter in QSSR
model (1) with k = 4 is determined from the contribution of that parameter relative to the Ctotal value,
as presented in Table 4. The average contribution percentage, MPxk, of each
independent variable is given by the equation:
$$MPx_k,\% = \frac{1}{N}\sum_{m=1}^{N}\left(100 \cdot \frac{|b_{m,k}\,x_{m,k}|}{\sum_{i=1}^{k}|b_{m,i}\,x_{m,i}|}\right) = \frac{100}{N}\sum_{m=1}^{N}\frac{|b_{m,k}\,x_{m,k}|}{C_{total}} \qquad (2)$$
where N = 27 is the total number of compounds, m is the compound for which Pmxk,% is calculated,
and Ctotal is the sum of the absolute contributions of all parameters for that compound.
The importance of the molecular structural parameters in the model, ranked
by the MPxk value, is in the order logP > Gmin > nrings > xvpc4, whereas the
magnitudes of the corresponding coefficients in the model are in the
order logP > nrings > xvpc4 > Gmin.
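The contribution percentages can be computed as in the sketch below. It assumes that each contribution is the absolute value of coefficient times descriptor value, which is consistent with the rows of Table 4 summing to 100 %; the function names and the descriptor values of the two example compounds are hypothetical.

```python
# Coefficients of QSSR model (1), k = 4, from Table 3.
COEF = {"logP": -1.2251, "xvpc4": 0.5461, "Gmin": 0.3202, "nrings": -0.5465}

def contributions(x):
    """Return C_total and the percentage contribution P_m x_k of each descriptor."""
    terms = {name: abs(COEF[name] * x[name]) for name in COEF}
    c_total = sum(terms.values())
    return c_total, {name: 100.0 * t / c_total for name, t in terms.items()}

def mean_contributions(compounds):
    """Average the percentage contributions over all compounds (MPx_k)."""
    per_compound = [contributions(x)[1] for x in compounds]
    return {name: sum(p[name] for p in per_compound) / len(per_compound) for name in COEF}

# Hypothetical descriptor values for two compounds (illustration only).
compounds = [
    {"logP": 3.4, "xvpc4": 0.0, "Gmin": 1.5, "nrings": 1},
    {"logP": 2.1, "xvpc4": 0.3, "Gmin": 0.8, "nrings": 0},
]
c_total_first, p_first = contributions(compounds[0])
mp = mean_contributions(compounds)
```

Ranking by these percentages can differ from ranking by coefficient magnitude, because a small coefficient multiplied by a large descriptor value may still dominate.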
Table 4. The Pmxk value (%) and MPxk value (%) of each parameter in QSSR model (1) with k = 4.
Compounds, m = 1 – 27              Ctotal    Pmxk (%)
                                             xvpc4      Gmin       nrings     logP
Isooctane                          6.0274    2.6157     4.8036     0          92.5807
Heptane                            5.8206    0          7.4877     0          92.5123
1,2,4-trichlorobenzene             5.877     8.7359     2.6695     9.2981     79.2965
Cyclohexane                        5.1939    0          9.2474     10.521     80.2315
Cyclopentane                       4.5477    0          10.5614    12.016     77.4226
Hexane                             5.1794    0          8.3769     0          91.6231
o-dichlorobenzene                  4.9338    8.2924     3.931      11.0756    76.701
1,1,2-trichloro trifluoro ethane   5.4325    16.8278    24.9094    0          58.2628
Pentane                            4.5067    0          9.5473     0          90.4527
Chlorobenzene                      3.7351    3.1908     6.8066     14.6301    75.3725
Toluene                            4.3299    2.4274     9.7745     12.6203    75.1777
The results in Table 4 show the level of contribution of each parameter in QSSR model (1)
to the properties of the compounds. The order of importance of the parameters for the
properties of the substances is therefore not based on the magnitude of the
coefficients. The logP parameter relates strongly to the solubility of organic compounds. Thus, the
solubility of organic compounds is closely connected with the partitioning ability of the substance, expressed by logP.
The Gmin parameter represents the magnitude of the smallest electrostatic potential of the atoms
in the molecule; after the logP parameter, it has the largest effect on the solubility of the
compounds. This reflects the nature of the global electrostatic potential of the molecule. Further, the
nrings parameter also contributes to solubility. It depends on the rings of the molecule and
is determined from R = p − (nvx − 1), where p is the number of edges (bonds) and nvx is the
number of vertices (atoms other than hydrogen) in the molecule.
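The ring-count relation R = p − (nvx − 1) can be checked on simple hydrogen-suppressed molecular graphs. The sketch assumes a connected molecule described by a list of bonds between numbered heavy atoms; the function name `ring_count` and the atom numbering are hypothetical.

```python
def ring_count(bonds):
    """R = p - (nvx - 1) for a connected hydrogen-suppressed molecular graph."""
    atoms = {atom for bond in bonds for atom in bond}   # vertices: non-hydrogen atoms
    p = len(bonds)                                      # edges: bonds between them
    nvx = len(atoms)
    return p - (nvx - 1)

# Bond lists over heavy atoms only; the atom numbering is arbitrary.
cyclohexane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
pentane = [(0, 1), (1, 2), (2, 3), (3, 4)]
toluene = cyclohexane + [(0, 6)]    # six-membered ring plus the methyl carbon

rings = {name: ring_count(b) for name, b in
         [("cyclohexane", cyclohexane), ("pentane", pentane), ("toluene", toluene)]}
```

Bond orders do not matter for this count, so the six-membered ring of toluene is written here with the same bond list as cyclohexane.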
3.2. The construction of QSSR neural model
The QSSR neural models are constructed on the basis of the neuro-fuzzy technique with the
support of genetic algorithms in the INForm [9] system. The neural network consists of
three layers, I(4)–HL(4)–O(1): the input layer I(4) has 4 neurons, for the parameters logP,
Gmin, nrings, and xvpc4; the hidden layer HL(4) has 4 neurons; and the output layer O(1)
has 1 neuron, for logS. The error back-propagation algorithm is used to train the
network. A transfer function is set on each neuron of the network layers; the training
parameters are a learning rate of 0.7 and a momentum of 0.7. The mean square error
(MSE) is 0.000816 after 10,000 epochs. The trained network has an
R2training value of 99.030, whereas the R2training value of the linear QSSR model (1) is 96.600. As a
result, the QSSR neural model based on the network architecture I(4)-HL(4)-O(1)
fits the data better than QSSR model (1). This can be seen in Figure 1 and Figure 2, which
show the agreement between the predicted and experimental values.
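A network with the I(4)-HL(4)-O(1) layout can be sketched with NumPy as shown below. This is not the neuro-fuzzy model built by INForm: it is a plain feed-forward network with a tanh hidden layer and a linear output, trained by full-batch back-propagation with momentum. Only the layer sizes, the learning rate of 0.7 and the momentum of 0.7 come from the text; the transfer function, the weight initialization, the number of epochs, and the synthetic data are assumptions.

```python
import numpy as np

def train_mlp(X, y, hidden=4, lr=0.7, momentum=0.7, epochs=2000, seed=0):
    """Train an input-hidden-output network by back-propagation with momentum."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = [rng.normal(scale=0.5, size=(d, hidden)), np.zeros(hidden),   # W1, b1
              rng.normal(scale=0.5, size=(hidden, 1)), np.zeros(1)]        # W2, b2
    velocity = [np.zeros_like(p) for p in params]
    target = y.reshape(-1, 1)
    for _ in range(epochs):
        W1, b1, W2, b2 = params
        h = np.tanh(X @ W1 + b1)                  # hidden layer HL(4)
        out = h @ W2 + b2                         # output layer O(1), linear
        g_out = (out - target) / n                # gradient of 0.5 * MSE w.r.t. the output
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)     # error propagated back through tanh
        grads = [X.T @ g_h, g_h.sum(axis=0), h.T @ g_out, g_out.sum(axis=0)]
        for p, v, g in zip(params, velocity, grads):
            v *= momentum                         # keep part of the previous step
            v -= lr * g                           # add the new gradient step
            p += v                                # update the weights in place
    return params

def predict(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

# Synthetic, already scaled data: 27 compounds, 4 descriptors (illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(27, 4))
y_demo = np.tanh(X_demo @ np.array([0.8, -0.5, 0.3, 0.2]))

params = train_mlp(X_demo, y_demo)
mse = float(np.mean((predict(params, X_demo) - y_demo) ** 2))
```

With real descriptors, the inputs and logS would normally be scaled before training, since a learning rate as large as 0.7 can make gradient descent unstable on unscaled data.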
Figure 2. a) A comparison between the experimental solubility, logSexp, and the predicted solubility, logStest;
b) the correlation between the experimental values logSexp and the predicted values logStest.
3.3. Prediction of the solubility of compounds in test group
The predictive ability of QSSR model (1) and the QSSR neural model is carefully cross-validated
by the leave-one-out (LOO) technique. The predicted results for 7 compounds selected randomly from
Table 1 are shown in Table 5.
Table 5. Solubility values of test compounds are predicted by QSSR (1) and QSSR neural model.
No   Compound                  logSexp.   The neural model QSSR    The linear model QSSR
                                          logStest    ARE (%)      logStest    ARE (%)
1    n-butyl chloride          -0.9586    -1.0117     5.5425       -0.7427     22.5235
2    Ethylenedichloride        -0.092     -0.2191     138.1826     0.0356      138.7148
3    Isobutyl alcohol          0.93       1.0523      13.1505      1.1382      22.3885
4    Methyl ethyl ketone       1.38       1.1438      17.1167      0.1973      85.701
5    Methyl tert-butyl ether   0.6812     0.7703      13.0741      0.7886      15.7661
6    Cyclohexane               -2.222     -2.3304     4.8771       -2.3667     6.51
7    o-dichlorobenzene         -1.796     -1.8548     3.2717       -1.861      3.621
     MARE (%)                                         27.8879                  42.175
The predicted results of the QSSR models are evaluated by the absolute value of the relative error,
ARE (%), through the following equation:
$$ARE,\% = 100\left|\frac{\log S_{exp} - \log S_{test}}{\log S_{exp}}\right| \qquad (3)$$
The mean absolute relative error, MARE (%), is employed to evaluate the overall
error of a QSSR model through the equation
$$MARE,\% = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{\log S_{exp,i} - \log S_{test,i}}{\log S_{exp,i}}\right| \qquad (4)$$
where n = 7 is the number of substances, and logSexp and logStest are the experimental and predicted
solubility, respectively.
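Equations (3) and (4) can be applied to the values of Table 5 as in the sketch below. The logS values are copied from Table 5; the function names are hypothetical. Because the tabulated logS values are rounded, the recomputed MARE values agree with Table 5 only to within about 0.01.

```python
# (compound, logS_exp, logS_test neural, logS_test linear), copied from Table 5.
TABLE5 = [
    ("n-butyl chloride", -0.9586, -1.0117, -0.7427),
    ("Ethylenedichloride", -0.092, -0.2191, 0.0356),
    ("Isobutyl alcohol", 0.93, 1.0523, 1.1382),
    ("Methyl ethyl ketone", 1.38, 1.1438, 0.1973),
    ("Methyl tert-butyl ether", 0.6812, 0.7703, 0.7886),
    ("Cyclohexane", -2.222, -2.3304, -2.3667),
    ("o-dichlorobenzene", -1.796, -1.8548, -1.861),
]

def are_percent(log_s_exp, log_s_test):
    """Absolute relative error in percent, equation (3)."""
    return 100.0 * abs((log_s_exp - log_s_test) / log_s_exp)

def mare_percent(pairs):
    """Mean absolute relative error in percent, equation (4)."""
    pairs = list(pairs)
    return sum(are_percent(e, t) for e, t in pairs) / len(pairs)

mare_neural = mare_percent((e, nn) for _, e, nn, _ in TABLE5)
mare_linear = mare_percent((e, lin) for _, e, _, lin in TABLE5)
```

The relative error becomes very large when logSexp is close to zero, as for ethylene dichloride, so a single compound can dominate the MARE value.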
Thus, the comparison between QSSR model (1) and the QSSR neural model based on
the MARE values (%) in Table 5 shows that the predictive ability of QSSR model (1)
is not better than that of the QSSR neural model. The logS values predicted
by the QSSR neural model agree with the experimental values, and the QSSR
neural model fits the data better than QSSR model (1). In this work the linear model
nevertheless proved suitable for predicting the solubility of organic compounds. If a non-linear model
is chosen for constructing the QSSR model, it cannot be adapted to multivariable models, and
it is very difficult to fit to a set of multivariable data. Moreover, the linear models make it
possible to identify the important variables with confidence.
The physicochemical properties of organic compounds are predicted successfully in this way.
They can be used to establish relationships between the properties and the toxicity of organic
compounds, and thus to predict their environmental toxicity. This is an important
objective of this work.
4. CONCLUSION
This study has successfully constructed QSSR linear models with the support of genetic
algorithms. This technique allows linear regression models to be constructed for large data sets.
The genetic algorithm allows the important parameters in the model to be selected. The QSSR linear
models obtained satisfy the evaluated statistics. In addition, the artificial intelligence
technique based on fuzzy neural relations is also supported by genetic algorithms to construct
a neural network with the architecture I(4)-HL(4)-O(1), which fits the database well. The QSSR neural
model gives better predictions than the QSSR linear model: the MARE value (%)
of the QSSR linear model is larger than that of the QSSR neural model.
The results of this work open up new research and many promising applications in
environmental treatment and in the design of medical products.
REFERENCES
1. Zeng X. L., Wang H. J., Wang Y. - QSPR models of n-octanol/water partition coefficients
and aqueous solubility, J. Chemosphere 86 (2012) 619–625.
2. Hawker D. W., Cumming J. L., Neale P. A., Bartkow M. E., Escher B. I. - A screening
level fate model of organic contaminants from advanced water treatment in a potable
water supply reservoir, J. Water Research 45 (2011) 768–780.
3. Zhao H., Xie Q., Feng T., Chen J., Xie Q., Baocheng Q., Zhang X., Xiaona L. -
Determination and prediction of octanol–air partition coefficients of hydroxylated and
methoxylated polybrominated diphenyl ethers, J. Chemosphere 80 (2010) 660–664.
4. Wen Z., Zhicai Z., Zunyao W., Liansheng W. - Estimation of n-octanol/water partition
coefficients (Kow) of all PCB congeners by density functional theory, J. Molecular
Structure: Theochem 755 (2005) 137–145.
5. Steppan D. D., Werner J., Yeater P. R. - Essential Regression - Free Statistical Software
for Microsoft Excel, John Wiley & Sons, Inc, (2000) 1–154.
6. Smallwood I. M. - Handbook of organic solvent properties, John Wiley Inc, (1996) 50–326.
7. Joseph B. E. - Excel for Chemists: A Comprehensive Guide, Wiley-VCH, (2001) 150–300.
8. Tat P. V. - Development of Quantitative Structure – Activity Relationship (QSAR) and
Quantitative Structure – Property Relationship (QSPR), Publisher of Natural Science and
Technology, Ha Noi, (2009) 113–184.
9. INForm v2.0, Intelligensys Ltd., UK (2000) 50–150.
10. HyperChem (TM) Release 7.05, Hypercube, Inc., Florida 32601, USA, (2002) 250–1500.