It can be observed that the performance of CFS and LLCFS achieved the best performance when using a large number of attributes, e.g. 55 for CFS and 57 for LLCFS. Compared
with LLCFS, the performance when using CFS is more stable when the number of attributes
is larger than 31 with an average of accuracy approximately 86% and an average of AUC
approximately 0.92. However, CFS performed much better than ILFS when the number of
selected attributes ranging from 7 to 30. For instance, the performance by only using 13
attributes selected by CFS can achieve an accuracy of 80.56% and an AUC of 0.87 while
the performance when using ILFS only achieve an accuracy of 74.10% and an AUC of 0.77.
The results demonstrated that different feature selection methods select different features
for the classification and the performance varies quite differently between the methods. The
results also suggested that depending on the availability of the given number of attributes,
different feature selections can be applied in consideration with the desirable performance.
Our aim is to find out the best feature selection method for HD prediction in terms of the
high performance achieved and the number of selected attributes, for this reason ILFS is
preferable compared with other methods.
Feature selection plays a critical role in many real-life applications, especially in Healthcare diagnosis, through which doctors, clinicians and clinical experts could explore the most
significant symptoms which drastically impact on the potential of having disease. In this
study, we have successfully applied feature selection method based on data mining technique
to apply in the application of HD prediction. For the 58 attributes provided, we have reduced
and selected a subset of selected features and achieved the best HD prediction performance.
Our method can be applied in many real-life applications or in other disease diagnosis applications to analyze the data, identify the risk factors to assist doctors in generating more
accurate prediction. Our future work includes applying our method on a large variety of healthcare datasets (e.g. Breast Cancer, Chronic Kidney) and providing a more comprehensive
15 trang |
Chia sẻ: huongthu9 | Lượt xem: 415 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Automatic heart disease prediction using feature selection and data mining technique, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Journal of Computer Science and Cybernetics, V.34, N.1 (2018), 33–47
DOI: 10.15625/1813-9663/34/1/12665
AUTOMATIC HEART DISEASE PREDICTION USING FEATURE
SELECTION AND DATA MINING TECHNIQUE
LE MINH HUNG1,a, TRAN DINH TOAN1, TRAN VAN LANG2
1Information Technology Faculty, Ho Chi Minh City University of Food Industry
2Institute of Applied Mechanics and Informatics, VAST
ahunglm@cntp.edu.vn
Abstract. This paper presents an automatic Heart Disease (HD) prediction method based on fe-
ature selection with data mining techniques using the provided symptoms and clinical information
assigned in the patients dataset. Data mining which allows the extraction of hidden knowledges
from the data and explores the relationship between attributes, is the promising technique for HD
prediction. HD symptoms can be effectively learned by the computer to classify HD into different
classes. However, the information provided may include redundant and interrelated symptoms. The
use of such information may degrade the classification performance. Feature selection is an effective
way to remove such noisy information meanwhile improving the learning accuracy and facilitating a
better understanding for learning model. In our method, HD attributes are weighted and re-ordered
based on their rank and weights assigned by Infinite Latent Feature Selection (ILFS) method. A soft
margin linear Support Vector Machine (SVM) is applied to classify a subset of selected attributes
into different HD classes. The experiment is performed using UCI Machine Learning Repository
Heart Disease public dataset. Experimental results demonstrated the effectiveness of the proposed
method for precise HD prediction making, our method gained the best performance with an accuracy
of 90.65% and an AUC of 0.96 for distinguishing ‘No presence’ HD with ‘Presence’ HD.
Keywords. Data mining, Heart Disease Prediction, Feature Selection, Classification.
1. INTRODUCTION
Heart disease (HD) is one of the top leading causes of death accounting for 17.7 million
deaths each year, 31% of all global deaths, as reported by World Health Organization 2017.
Patients unhealthy habits such as tobacco use, unhealthy diet, physical inactivity and alcohol
usage are the main reasons leading to many types of HD. Several clinical information and
symptoms are found to be related to HD including age, blood pressure, total cholesterol,
diabetes, hyper tension [1]. HD dataset basically consists of the above-mentioned information
and attributes which summarized and collected from the patients. With the increasing of
the huge amounts of dataset made available in recent years, the diagnosis of HD can be
automatically performed using traditional statistical methods to predict the potential of
having HD on each patient. Working with HD database can be considered as a real-life
application and learning such attributes helps clinicians in identifying the main risk factors
associated with HD. However, with a large number of attributes, it is challenging to identify
c© 2018 Vietnam Academy of Science & Technology
34 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
which attributes are the most significant risk factors for HD prediction by just only based
on conventional statistical methods.
To tackle this problem, there have been numerous dedicated approaches based on data
mining techniques proposed in recent years to help healthcare professionals in the diagnosis
of HD. HD prediction systems based on data mining techniques could assist doctors in giving
accurately HD prediction making based on the clinical information data of patients. Data
mining techniques which refers to mining the information, allow the extraction of hidden
knowledge and establishe the relationships between attributes inside the data, is the pro-
mising techniques for HD prediction [2, 3, 4]. Such invention could assist doctors in better
health policy-making, prevention of hospital errors, early detection, prevention of diseases
and preventable hospital deaths. Specifically, Deepika et al. proposed association rule for
classification of Heart-attack patients [5]. K. Srinivas et al. presented data mining techni-
ques in Healthcare and Prediction of Heart Attacks based on Naive Bayes algorithm, K-NN,
Decision Tree, wherein Decision Tree achieved the best performance among the methods [6].
Similarly, several classification algorithms including Naive Bayes, Decision Tree, and Neural
Network were compared in [7] for the prediction of stroke diseases. The experimental results
showed that the Neural Network performed much better than the other two algorithms. Jab-
bar et al. proposed association rule mining for heart attack prediction based on the sequence
number and clustering, in which the patterns are extracted from the database with signifi-
cant weights calculation [8]. Shouman et al. combined k-means clustering with decision tree
method to predict the HD on a subset of 13 input attributes [9]. This study suggested that
integrating k-means clustering and decision tree could achieve a higher accuracy than other
traditional methods in the diagnosis of HD patients. Dangare et al. proposed an improved
Study of Heart Disease Prediction System using Data Mining Classification Techniques [10].
Their purpose was to build an Intelligent Heart Disease Prediction System that gives diagno-
sis of HD by using historical heart database such as sex, blood pressure, cholesterol, obesity
and smoking, etc. Neural networks were adopted for the classification of 14 attributes by
considering the single and multilayer neural network models in [11]. Olatubosun et al. [12]
proposed to use Artificial Neural Network with back propagation procedure for the diagnosis
of Cerebrovascular disease. M. Anbarasi et al. proposed Enhanced Prediction of Heart Di-
sease with Feature Subset Selection using Genetic Algorithm [13]. Classification techniques
such as Naive Bayes, Decision Tree and Classification were adopted, in which Naive Bayes
achieved the highest performance across the methods. Patel et al. [14] proposed to use the
reduced number of attributes using tree classification function techniques in data mining
including Naive Bayes, Decision Tree and Classification by Clustering, in which Decision
Tree gained the best performance among the methods.
For feature selection, Singh et al. [15] proposed to use Genetic feature selection method
combined with Naive Bayes method for HD prediction. Takci searched for the best ma-
chine learning method and feature selection method for heart attacks prediction, in which
SVM with linear kernel in combination with Relief-Based Feature achieved the best perfor-
mance [16]. However, this study used a small number of dataset with 270 instances and a
limited number of HD attributes (13 attributes). Similarly, Suganya et al. proposed a novel
feature selection method for Cardiac diseases prediction on the selected 13 attributes with a
total of 303 instances of patients dataset [17]. Mirmozaffari et al. applied clustering methods
integrated in WEKA data mining tool on a patients dataset with 8 attributes and a total
AUTHOR GUIDELINES FOR JCC SUBMISSION 35
Figure 1. Our 3-step proposed feature selection for data mining in HD diagnosis. (a) Step 1: Data
preparation. (b) Step 2: Feature Selection. (c) Step 3: Classification
of 209 instances for heart disease prediction [18]. Uma et al. applied several classification
algorithms (e.g. SVM, Bagging, Naive Bayes, Regression, J48) and feature selection met-
hods (e.g. CfsSubsetEval, Information Gain, Gain Ratio and Wrapper method) on a subset
of 18 attributes of HD on a dataset with a total of 689 instances [19]. They proved that
SVM achieved the best performance among the classifiers and most of the adopted feature
selection methods achieved nearly identical accuracy.
Despite various approaches have been proposed for HD prediction, most of the recent
feature selection methods were designed on a small subset of attributes with 14 attributes
or 6 attributes. There is still a lack of effective methods based on feature selection and
data mining techniques to study the significant risk factors associated with HD on the fully
provided attributes. There might be existing other hidden factors or attributes that play an
important role on making HD prediction, which has not yet been comprehensively explored
in previous studies. In this work, we proposed a method to efficiently and effectively predict
different classes of HD based on feature selection and data mining technique. The HD
diagnosis prediction task in this study is distinguishing between ‘No presence’ HD (labeled as
0 in the dataset) and ‘Presence’ HD (labeled as 1, 2, 3, 4 in the dataset). Our method consists
of three main steps which are: Step 1: Data Preparation; Step 2: Feature Selection; and
Step 3: Classification. Specifically, the unnecessary and noisy attributes are first manually
removed in the step 1. Then feature selection based ILFS described in [20] is adopted to
select the most significant attributes based on the extracted weights and rank. These selected
useful attributes could drastically affect the performance of the prediction diagnosis system.
A soft-margin linear kernel Support Vector Machine (SVM) is finally applied to classify
the subset of selected attributes into two classes of ‘No presence’ and ‘Presence’ HD. Our
contributions can be highlighted as follows:
• We performed feature selection with data mining methods on the fully provided attri-
butes of HD with a larger number of instances (699 instances), which is different from
previous studies which mainly based on a given subset of attributes (e.g. 13 attributes
or 6 attributes) and a limited number of patient dataset.
• We applied ILFS feature selection method based on [20] to select the most discrimi-
native and meaningful attributes used for the HD prediction making. We found that
36 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
by using only an approximately half of the given HD attributes selected by ILFS, the
prediction performance is competitive compared with using the fully given HD attri-
butes. This demonstrated that the HD dataset contains more redundant attributes
which play less important roles for prediction making.
• We found that different feature selection methods select different attributes for HD
prediction and the performance varies quite differently. The choice of feature selection
methods may depend on the availability of the given number of the attributes to achieve
a desirable performance.
• The proposed method can be feasibly applied and integrated in many healthcare diag-
nosis systems for disease prediction making as well as real-life applications. The source
code of our method will be made available with the publication of this paper.
The rest of the paper is organized as follows: Sec. 2 describes in details our 3-step method
for HD prediction. Sec. 3 summarizes the results from our method. Sec. 4 is the discussion
of our paper.
2. METHODOLOGY
As illustrated in Fig. 1, our proposed method, which demonstrates an excellent agreement
with the manually assigned labels, consists of 3 main steps. Firstly, irrelevant attributes and
noisy information are manually removed from the original raw dataset and only the most
meaningful attributes are preserved. ILFS [20] for feature ranking and feature selection is
utilized in step 2 to select a subset of discriminative attributes, i.e. the most significant risk
factors associated with HD. A supervised SVM with soft-margin linear kernel is finally used
to classify the selected attributes into different classes.
2.1. Data preparation
Irrelevant attributes are firstly manually removed from the original dataset. As a result,
58 attributes are preserved from the provided original 75 attributes in each instance as
described in details in Table 1. To reduce the inhomogeneity in each attribute among the
patients, the numeric-valued numbers assigned in each attribute is normalized by z-score
method. The dataset is organized in the form of a matrix with the size of N ×M , where
N is the number of patients and M is the number of attributes (N = 699,M = 58 in this
study). After preprocessing, 80% of the dataset is selected for training and the remaining
20% of the dataset is used for testing.
2.2. Feature selection
It is worth noticing that most of the real-life data contains more information than it is
needed to build a model, or the wrong kind of information. Noisy or redundant information
makes it more challenging to extract the most meaningful information. Feature selection
which refers to the process of reducing the inputs for processing and analysis, or finding
the most meaningful subset of information, is effective for the prediction performance. Fe-
ature selection does not only improve the quality of the model but also makes the process
AUTHOR GUIDELINES FOR JCC SUBMISSION 37
of modeling more efficient. The most highlighted techniques proposed recently can be re-
ferred to is Recursive Feature Elimination Support Vector Machine (RFE-SVM) [22] which
successfully applied in the application of prostate cancer diagnosis to reduce the dimension
of hand-crafted features extracted from the lesion region of interest and achieved a very
high accuracy compared with using the fully dimension of data attributes [23]. However, in
the work [20] which provides a more comprehensive overview of feature selection techniques,
ILFS achieved the best performance among the 14 popular feature selection methods.
Inspired by [20], ILFS was adopted to select the most discriminative attributes of the
feature vectors used for HD prediction in our paper. ILFS allows the selection of a subset
of features expected to be most likely to discriminate between classes of HD. The HD at-
tributes weights and rank are automatically assigned based on ILFS method. Weights are
assigned by a Graph-weighting which is basically based on the undirected fully connected
graph and automatically learnt based on a learning framework on the probabilistic latent
semantic analysis (PLSA) [24]. Expectation Maximization algorithm is adopted to estimate
the parameters. The ranking step is built based on Infinite Feature Selection [25] filter al-
gorithm in an unsupervised manner, followed with the cross-validation strategy for selecting
the best subset of features. Specifically, suppose X =
{
X1, X2, , Xn
}
is a set of given trai-
ning features, m as the number of samples, m× 1 vector Xi is the distribution of the values
assumed by ith feature. Weights are associated with the undirected graph nodes
aij = ϕ(Xi, Xj) (1)
where aij is the node corresponding to features and edges model relationship between any
pairs of nodes, ϕ(Xi, Xj) is considered to be a real-valued function learned by the probability
of each co-occurrence in Xi, Xj as a mixture of an independent multinomial distributions.
Each weight represents the likelihood that features Xi and Xj are good candidates. For furt-
her details about the ILFS algorithm, interested readers can refer to [20]. We implemented
ILFS based on the MATLAB code provided in Feature Selection Library (FSLib 2017) [26].
2.3. Classification
A linear supervised SVM classifier is applied to map the selected attributes into 2 classes
of ‘No presence’ and ‘Presence’ HD. The basic idea of the SVM is to construct a hyperplane
to separate and maximize the margin of the positive and negative classes with the largest
margin. Suppose
{
(xi, yi)
}N
i=1
is a set of training samples which contain the most discrimi-
nant attributes selected by ILFS, (xi, yi) is the input feature for the i
th instance and its the
corresponding target output, respectively. The decision boundary separates the instances by
the equation form
wTxi + b ≥ 0 for yi = +1 (positive class), (2)
wTxi + b < 0 for yi = −1 (negative class), (3)
where w is an adjustable weight vector, x is an input vector, and b is a bias. Assume that the
features selected by ILFS are linear separable, the optimization problem of SVM to maximize
38 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
Figure 2. SVM with soft margin kernel with different cases of slack variables.
the margin can be defined as
(w, b) = arg min
w,b
1
2
‖w‖22 s.t yi(wT .xi + b) ≥ 1, ∀i = 1, 2, ..., N. (4)
The normal SVM normally works with the linear separable features. However, in some
cases when there exist noises which belong to one class but appear closely to another class,
even if the two classes are linear separable, SVM in this scenario will construct a hyperplane
with a very small margin, which is very sensitive to noise. If the algorithm sacrifices these
noises, SVM could generate a better hyperplane with a better margin to best separate the
two classes. Another scenario is when the two classes are near linear separable, in which
there exist a small number of instances appeared unproperly, the optimization algorithm of
SVM margin is infeasible. Similarly, if the algorithm ignores those instances, SVM could
also generate a better margin that could mostly separate the two classes. This technique
called SVM with soft margin. The formulization of the SVM optimization problem can be
re-written as follows
(w, b, ξ) = arg min
w,b,ξ
1
2
‖w‖22 + C
N∑
i=1
ξi s.t 1− ξi − yi(wT .xi + b) ≥ 1, (5)
∀i = 1, 2, ..., N, ξi ≥ 0, C > 0,
where C is the regularization term used to avoid overfitting, ξ = [ξ1, ξ2, ..., ξN ] is a set of
slack variables. As shown in Fig. 2, for the variables which are located in the safety margin,
then ξi = 0 (e.g. x1, x2). For the variables which are not located in the safety margin, but
still in the right side of their class, then 0 < ξi < 1 (e.g. x3). For the variables which are
located in the wrong side of their class, then ξi > 1 (e.g. x4, x5).
AUTHOR GUIDELINES FOR JCC SUBMISSION 39
Table 1. Description of 58 HD attributes used for HD prediction
No. Attribute No. Attribute
1 age 30 tpeakbpd: peak exercise blood pres-
sure
2 sex 31 trestbpd: resting blood pressure
3 painloc: chest pain location 32 exang: exercise induced angina (1 =
yes; 0 = no)
4 painexer (1 = provoked by exertion; 0
= otherwise)
33 xhypo: (1 = yes; 0 = no)
5 relrest (1 = relieved after rest; 0 = ot-
herwise)
34 oldpeak = ST depression induced by
exercise relative to rest
6 cp: chest pain type 35 slope: the slope of the peak exercise
ST segment
7 trestbps: resting blood pressure 36 rldv5: height at rest
8 htn 37 rldv5e: height at peak exercise
9 chol: serum cholesterol in mg/dl 38 ca: number of major vessels (0-3) co-
lored by fluoroscopy
10 cigs (cigarettes per day) 39 thal: 3 = normal; 6 = fixed defect; 7
= reversable defect
11 years (number of years as a smoker) 40 thalsev
12 fbs: (fasting blood sugar > 120 mg/dl) 41 thalpul
13 famhist: family history of coronary ar-
tery disease
42 cmo: month of cardiac
14 restecg: resting electrocardiographic
results
43 cday: day of cardiac
15 20 ekgmo (month of exercise ECG re-
ading)
44 cyr: year of cardiac
16 ekgday(day of exercise ECG reading) 45 lmt
17 ekgyr (year of exercise ECG reading) 46 ladprox
18 dig (digitalis used furring exercise
ECG)
47 laddist
19 24 prop (Beta blocker used during exe-
rcise ECG)
48 diag
20 nitr (nitrates used during exercise
ECG)
49 cxmain
21 pro (calcium channel blocker used du-
ring exercise ECG)
50 ramus
22 diuretic (diuretic used during exercise
ECG)
51 om1
23 proto: exercise protocol 52 om2
24 thaldur: duration of exercise test in
minutes
53 rcaprox
25 thaltime: time when ST measure de-
pression was noted
54 rcadist
26 met: mets achieved 55 lvx3
27 thalach: maximum heart rate achieved 56 lvx4
28 thalrest: resting heart rate 57 lvf
29 tpeakbps: peak exercise blood pres-
sure
58 cathef
40 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
3. EXPERIMENTAL RESULTS
3.1. Datasets
The HD database used in our study is the public dataset collected from UCI Machine
Learning Repository [21]. This directory consists of 4 HD datasets collected from 4 different
hospitals, which include
• Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
• University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
• University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
• V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano,
M.D., Ph.D.
We select 3 datasets with the total number of 699 instances including the Cleveland dataset
(282 instances), Hungarian dataset (294 instances) and the Switzerland dataset (123 instan-
ces) dataset. The instances in the original dataset are labeled into 5 different classes in which
class 0 indicates ‘No presence’ HD and class 1 to class 4 indicate the risk levels of HD, de-
noted as ‘Presence’ HD. Finally, a total number of instances of the two classes ‘No presence’
and ‘Presence’ HD are 353 and 346, respectively. The UCI Heart Disease database has been
examined by professional clinicians and widely used in many previous data mining-based
approaches for HD prediction. 76 raw attributes presented as numeric-valued numbers in
each row are the collection of different diagnosis attributes and medical information collected
from each patient. Unlike most of the recent studies which just only investigate a subset of
14 attributes or 6 attributes from this database, our study fully explores most of the provided
information in the original dataset (except for the attribute with missing values).
3.2. Experimental designs
In this section, we conducted 2 experiments to investigate the performance of several
classification and feature selection methods. In the experiment 1, different classification
methods are performed to select the most reliable method for HD prediction. The charac-
teristic of the HD dataset is also analyzed in this experiment. The selected classification
method is then utilized in the experiment 2 to classify the selected attributes of HD into
two classes. To avoid overfitting, the validation of all the methods are performed using the
hold-out strategy, where the dataset is randomly split into 2 independent parts for training
(80%) and testing (20%). We selected the hold-out strategy instead of k-fold cross validation
since the hold-out strategy avoids the overlap between training set and testing set, which
provides a more accurate estimate for the generalization performance of the algorithm. With
k-fold cross validation strategy, the feature selection and classification have to be performed
independently k times yielding k feature rankings and k models, respectively. With the li-
mited number of the given dataset, the ranking of the features given by the same feature
selection algorithm may be slightly inconsistent for each running time, which is not feasible
for the testing.
Experiment 1: Classifiers comparisons
AUTHOR GUIDELINES FOR JCC SUBMISSION 41
To evaluate the effectiveness of our proposed method, we compare our method with 4
classification methods including Non-linear SVM (Polynomial kernel, Gaussian kernel, and
Sigmoid kernel), Nave Bayes and Logistic regression classifier. Nave Bayes and Logistic
regression classifiers are performed using the WEKA data mining tool, which is an open
source software issued under the GNU General Public License and is a very popular software
for solving data mining problems. WEKA also includes a collection of machine learning
algorithms for data mining tasks and has been adopted in many data mining applications
due to its simplicity and friendly user interface. SVM algorithms with linear and non-linear
kernels are implemented using Matlab (Release 2017a, Natick MA) on a PC running on a
single Intel core i7 CPU, Windows 10 OS.
Experiment 2: Feature selection methods comparisons
In this experiment, several feature selection methods are selected for the performance
comparisons including:
• Principle Component Analysis denoted as PCA [27]: is one of the most important
unsupervised statistical procedure in machine learning for dimensionality reduction or
feature selection. The goal of PCA is to find the best representation of the data by
projecting them onto a lower dimension space called principal components (PCs), in
which the first PC has the largest variance and so on. Ziasabounchi et al. success-
fully applied PCA together with k-means clustering in the application of heart disease
prediction [28].
• Sort features according to pairwise correlations which is denoted as CFS [29]: CFS is a
simple filter algorithm which ranks the features based on the correlation with the class
labels and select the most informative features subset which highly correlated used for
classification. CFS is based on the assumption that good features are highly correlated
with the classification and not correlated to each other.
• Feature Selection and Kernel Learning for Local Learning-Based Clustering denoted as
LLCFS [30]: is an unsupervised clustering feature selection method which considers the
relevance of each feature for clustering based on a built-in structure learning procedure
to iteratively update the Laplacian graph. Feature weight learning process is performed
using the constructed k-nearest neighbor graph built on the weighted feature space.
3.3. Results
Accuracy, sensitivity and specificity are used as the evaluation metrics to evaluate the
classification performance of our HD diagnosis prediction system. Area under the curve
(AUC) of receiver operating characteristics (ROC) is also provided for the binary classifica-
tion. The classification accuracy, sensitivity and specificity are defined as follows
Accuracy =
TP + TN
TP + FP + FN + TN
, (6)
Sensitivity =
TP
TP + FN
, (7)
42 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
Specificity =
TN
TN + FP
, (8)
where TP, FP, TN, FN are true positive, false positive, true negative and false negative,
respectively. In this study, we consider the best performance in term of Accuracy and AUC.
Experiment 1
Table 2. Results of HD prediction using different classifiers
Methods Accuracy (%) AUC Sensitivity Specificity
Linear SVM 89.93 0.96 0.87 0.93
Non-linear SVM (Gaussian) 49.64 0.66 0.00 1.00
Non-linear SVM (Polynomial) 83.45 0.92 0.85 0.81
Non-linear SVM (Sigmoid) 49.64 0.41 0.00 1.00
Nave Bayes 77.70 0.86 0.64 0.91
Logistic regression Classifier 85.61 0.91 0.81 0.90
We performed 3 classification methods on the selected 58 attributes from the original
dataset. Table 2 summarizes the results comparison among the methods, in which SVM
with linear kernel generates the best performance with an accuracy of 89.21% and an AUC
of 0.95. Logistic regression classifier achieved a competitive result with an accuracy of
85.61% and an AUC of 0.91 followed with Nave Bayes with an accuracy of 76.98% and an
AUC of 0.86. SVM with Gaussian and Sigmoid kernels fail to predict the two classes of
HD. Although, the performance of using SVM with Polynomial kernel could generate a high
result with an accuracy of 83.45% and AUC of 0.92, the performance achieved is still lower
than using linear kernel. It is maybe because of the overfitting problem caused when the
hyperplane of SVM is too fit to the data, which is too sensitive to the data. The results
demonstrate the effectiveness of soft-margin linear SVM in the classification task of HD. The
results also show that the attributes of HD dataset can be considered as linear-separable and
a linear SVM with soft margin is feasible for making precise prediction for HD. According
to the results, we select SVM as a classifier to perform in the next experiment where SVM is
used to classify a subset of the selected attributes extracted from feature selection methods.
Experiment 2
In order to intuitively visualize the effect of the selected attributes on the HD prediction,
we plotted the accuracy and AUC curves according to the number of attributes selected by
different feature selection methods, as shown in Fig. 3. Overall, the performance of all the
feature selection algorithms increase when the numbers of attributes increase. According to
Fig. 3, it can be observed that for the number of the selected attributes ranging from 1 to 31,
PCA yields a better performance compared with other methods. However, the performance
of PCA downgrades when the number of selected attributes increases until it reaches the
best performance using 58 PCs with an accuracy of 89.93% and an AUC of 0.96.
AUTHOR GUIDELINES FOR JCC SUBMISSION 43
Figure 3. Accuracy and AUC of different feature selection methods with different number of selected
attributes for HD prediction
Table 3. The best performances of HD prediction using different methods
Methods
Number of
attributes
AUC
Accuracy
(%)
Sensitivity Specificity
ILFS 39 0.96 90.65 0.91 0.90
CFS 55 0.95 89.93 0.91 0.89
LLCFS 57 0.95 89.93 0.93 0.90
PCA 58 0.96 89.93 0.92 0.88
For a subset with the number of selected attributes larger than 31, ILFS performed
the best and maintained stable in term of AUC, which reflects the effectiveness of ILFS in
selecting and re-ordering the attributes to best optimize the classification performance. The
classification accuracy achieved by CFS is competitive with ILFS when using a subset of
over 31 attributes and the performance of LLCFS increases when the number of attributes
increases. Table 3 summarizes the best performance achieved from the feature selection
methods. It can be observed that ILFS achieved the best performance by only using 39
attributes of HD, and the performance is slightly higher than using the fully 58 attributes
in term of accuracy. Although the effect of ILFS is not negligible for the incremental of
the performance, ILFS only uses 39 selected attributes to achieve the best performance
discarding the remaining 19 attributes.
4. DISCUSSION
In this study, we have conducted 2 experiments to investigate the performance of HD
prediction using different classification and feature selection methods. Although the HD
dataset can be considered as linear separable, a hard-margin SVM hardly separates the two
classes. Soft-margin kernel SVM is selected as the classifier to compare the effectiveness of 4
44 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
different feature selection methods including ILFS, CFS, LLCFS and PCA. Our experiments
results show that PCA could generate a competitive result when the number of PCs used
is less than 31 while CFS and LLCFS perform well with over 31 attributes. ILFS generates
the best performance and maintains stable when the number of attributes used is over 31.
Although ILFS could effectively select and combine a set of attributes to best optimize
the classification performance, in which the best performance is recorded when using the
39 attributes selected by ILFS. However, it can be observed that using a subset with the
number of attributes ranging from 31 to 39 is still feasible since the performance in this
range of attributes is still reliable with the average of accuracy is approximately 88% and
average of AUC is approximately 0.94. This is interesting when the doctors can only work
with an approximately half of the given number of attributes but still achieve competitive
results compared with using fully given attributes. This helps to reduce the workloads and
time for doctors and to avoid other unnecessary clinical measurements for patients.
It can be observed that the performance of CFS and LLCFS achieved the best perfor-
mance when using a large number of attributes, e.g. 55 for CFS and 57 for LLCFS. Compared
with LLCFS, the performance when using CFS is more stable when the number of attributes
is larger than 31 with an average of accuracy approximately 86% and an average of AUC
approximately 0.92. However, CFS performed much better than ILFS when the number of
selected attributes ranging from 7 to 30. For instance, the performance by only using 13
attributes selected by CFS can achieve an accuracy of 80.56% and an AUC of 0.87 while
the performance when using ILFS only achieve an accuracy of 74.10% and an AUC of 0.77.
The results demonstrated that different feature selection methods select different features
for the classification and the performance varies quite differently between the methods. The
results also suggested that depending on the availability of the given number of attributes,
different feature selections can be applied in consideration with the desirable performance.
Our aim is to find out the best feature selection method for HD prediction in terms of the
high performance achieved and the number of selected attributes, for this reason ILFS is
preferable compared with other methods.
Feature selection plays a critical role in many real-life applications, especially in Healt-
hcare diagnosis, through which doctors, clinicians and clinical experts could explore the most
significant symptoms which drastically impact on the potential of having disease. In this
study, we have successfully applied feature selection method based on data mining technique
to apply in the application of HD prediction. For the 58 attributes provided, we have reduced
and selected a subset of selected features and achieved the best HD prediction performance.
Our method can be applied in many real-life applications or in other disease diagnosis ap-
plications to analyze the data, identify the risk factors to assist doctors in generating more
accurate prediction. Our future work includes applying our method on a large variety of he-
althcare datasets (e.g. Breast Cancer, Chronic Kidney) and providing a more comprehensive
analysis on the classification and feature selection methods.
REFERENCES
[1] Y.E. Shao, C.D. Hou, and C.C. Chiu, “Hybrid intelligent modeling schemes for heart
disease classification” Applied Soft Computing, vol. 14, no. 1, pp. 47–52, 2014.
AUTHOR GUIDELINES FOR JCC SUBMISSION 45
[2] R.D. Canlas, “Data mining in healthcare: Current applications and issues,”School of
Information Systems & Management, Carnegie Mellon University, Australia, 2009.
[3] Helma, Christoph, Eva Gottmann, and Stefan Kramer, “Knowledge discovery and data
mining in toxicology”Statistical methods in medical research, vol. 9, no. 4, pp. 329–358,
2000.
[4] I-N. Lee, S-C. Liao, and M. Embrechts, “Data mining techniques applied to medical
information”, Medical informatics and the Internet in medicine, vol. 25, no. 2, pp 81–
102, 2000.
[5] N. Deepika, K. Chandrashekar, “Association rule for classification of heart attack pa-
tients”, International Journal of Advanced Engineering Science and Technologies, vol.
11, no. 2, pp. 253–57, 2011.
[6] K. Srinivas, B. Kavitha Rani, and Dr. A. Govrdhan,“Application of data mining techni-
ques in healthcare and prediction of heart attacks”, International Journal on Computer
Science and Engineering, vol. 2, no. 2, pp. 250–255, 2011.
[7] A. Sudha, P. Gayathiri, and N. Jaisankar, “Effective analysis and predictive model of
stroke disease using classification methods”, International Journal of Computer Applica-
tions, vol. 43, no. 14, pp. 0975–8887, 2012.
[8] M. A. Jabbar, Priti Chandra, and B. L. Deekshatulu, “Cluster based association rule
mining for heart attack prediction”, Journal of Theoretical and Applied Information
Technology, vol. 32, no. 2, pp. 196–201, 2011.
[9] Shouman, Mai, Tim Turner, and Rob Stocker, “Integrating decision tree and k-means
clustering with different initial centroid selection methods in the diagnosis of heart disease
patients”, Proceedings of the International Conference on Data Mining (DMIN). The
Steering Committee of The World Congress in Computer Science, Computer Engineering
and Applied Computing (WorldComp), 2012.
[10] Dangare, Chaitrali S., and Sulabha S. Apte, “Improved study of heart disease prediction
system using data mining classification techniques”, International Journal of Computer
Applications, vol. 47, no. 10, pp. 44–48, 2012.
[11] K. Usha Rani, “Analysis of heart diseases dataset using neural network approach”,
International Journal of Data Mining and Knowledge Management Processive, vol. 1,
no. 5, pp. 1–8, 2011.
[12] Olatubosun Olabode and Bola Titilayo Olabode, “Cerebrovascular accident attack clas-
sification using multilayer feed forward artificial neural network with back propagation
error”, Journal of Computer Science, vol. 8, no. 1, pp.18–25, 2012.
[13] M. Anbarasi, E. Anupriya, and N.CH.S.N. Iyenga, “Enhanced prediction of heart di-
sease with feature subset selection using genetic algorithm”, International Journal of
Engineering Science and Technology, vol. 2, no. 10, pp. 5370–5376, 2010.
46 LE MINH HUNG, TRAN DINH TOAN, TRAN VAN LANG
[14] S. B. Patel, P. K. Yadav, D. D. Shukla, “Predict the diagnosis of heart disease pa-
tients using classification mining techniques”, IOSR Journal of Agriculture and Veteri-
nary Science (IOSR-JAVS), vol. 4, no. 2, pp. 61–64, 2013.
[15] N. Singh, P. Ferozepur, S. Jindal, “Heart disease prediction using classification and
feature selection techniques”, International Journal of Advance Research, Ideas and In-
novations in Technology, vol. 4, no. 2, 2018.
[16] H. Takci, “Improvement of heart attack prediction by the feature selection methods”,
Turkish Journal of Electrical Engineering & Computer Sciences, vol. 26, no. 1, pp. 1–10,
2018.
[17] R. Suganya, S. Rajaram, A. S. Abdullah, V. Rajendran, “A novel feature selection
method for predicting heart diseases with data mining techniques”, Asian Journal of
Information Technology, vol. 15, no. 8, 2016.
[18] M. Mirmozaffari, A. Alinezhad, A. Gilanpour, “Heart disease prediction with data mi-
ning clustering algorithms”, Int’l Journal of Computing, Communications & Instrumen-
tation Engineering (IJCCIE), vol. 4, no. 1, 2017.
[19] K. Uma, M. Hanumathappa, “Heart disease prediction using classification techniques
with feature selection method”, Adarsh Journal of Information Technology, vol. 5, no. 2,
pp. 22–29, 2016.
[20] Roffo, Giorgio, Melzi, Simone, Castellani, Umberto, Vinciarelli, Alessandro, “Infinite
latent feature selection: a probabilistic latent graph-based ranking approach”, Computer
Vision and Pattern Recognition, 2017.
[21]
disease.names, The contents of the heart-disease directory.
[22] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification
using support vector machines”, Mach. Learn., vol. 46, no. 1-3, pp. 389-422, 2002.
[23] D. Fehr, H. Veeraraghavan, A. Wibmer, T. Gondo, K. Matsumoto, HA. Vargas, E. Sala,
H. Hricak, JO. Deasy, “Automatic classification of prostate cancer Gleason scores from
multiparametric magnetic resonance images”, Proceedings of the National Academy of
Sciences, vol. 112, no. 46, E6265-73, 2015.
[24] T. Hofmann, “Probabilistic latent semantic analysis”, Proceedings of the Fifteenth con-
ference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc., 1999
(pp. 289-296).
[25] G. Roffo, S. Melzi, M. Cristani, “Infinite feature selection”, In Conf. IEEE International
Conference on Computer Vision, 2015 (pp. 4202-4210).
[26] https://www.mathworks.com/matlabcentral/fileexchange/56937-feature-selection-
library, The MATLAB Feature Selection Library 2017.
[27] Jolliffe, Ian., “Principal component analysis”, International encyclopedia of statistical
science, Springer, Berlin, Heidelberg, 2011 (pp. 1094–1096).
AUTHOR GUIDELINES FOR JCC SUBMISSION 47
[28] Ziasabounchi, Negar, Askerzade, Iman N., “A comparative study of heart disease pre-
diction based on principal component analysis and clustering methods”, Turkish Journal
of Mathematics and Computer Science (TJMCS), 16.17: 18, 2014.
[29] Hall, Mark Andrew, “Correlation-based feature selection for machine learning”, PhD
thesis, 1999.
[30] H. Zeng, Y. M. Cheung, “Feature selection and kernel learning for local learning-based
clustering”, IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 33,
no. 8, pp. 1532–1547, 2011.
Received on June 11, 2018
Revised on July 20, 2018
Các file đính kèm theo tài liệu này:
- automatic_heart_disease_prediction_using_feature_selection_a.pdf