This paper proposed some metrics for validating the efficiency of unsupervised anomaly
detection algorithms. The new metrics relies on the internal evaluation metrics used for
validating the goodness of clustering algorithms. The experiments were conducted on a
number of benchmarking anomaly datasets. The results showed that four of the proposed
metrics produce competitive or better than the results obtained by the previous method [17].
Thus, these metrics could replace the logistic regression method [17] when being used for
validating the outcome of unsupervised anomaly detection techniques.
There are some research areas for future work which arise from this paper. First, the
authors would like to apply the good metrics in this paper to evaluate the performance of
unsupervised anomaly detection algorithms in real world applications. In this paper, the
metrics has been investigated using a number of benchmarking datasets with the available
labels. In the future, the authors want to use the good metrics in this paper to evaluate the
performance of various unsupervised anomaly detection algorithms when they are applied
to solving real world problems such as online games cheating detection [17] and credit card
fraud detection [12].
Second, the authors would like to propose better evaluation metrics for unsupervised
anomaly detection algorithms. The metrics tested in this paper are often based on compactness or both compactness and separation of the clusters. This is suitable for clustering
problems. However, for anomaly detection problems, it may be better if the metrics is only
based on separation properties. One of such metrics is the dissimilarity measures of data in
hierarchical clustering [26]. In the future, the authors would like to study this method for
anomaly detection algorithms.
14 trang |
Chia sẻ: huongthu9 | Lượt xem: 422 | Lượt tải: 0
Bạn đang xem nội dung tài liệu An evaluation method for unsupervised anomaly detection algorithms, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Journal of Computer Science and Cybernetics, V.32, N.3 (2016), 259–272
DOI 10.15625/1813-9663/32/3/8455
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY
DETECTION ALGORITHMS
HUY VAN NGUYEN1, TRUNG THANH NGUYEN2, AND QUANG UY NGUYEN2
1Institute of Information Technology, Vietnam Academy of Military Science and Technology;
2Faculty of IT, Le Quy Don Technical University;
1vannguyenhuy.vn@gmail.com; 2quanguyhn@gmail.com, 2trungthanhnt@gmail.com
Abstract. In data mining, anomaly detection aims at identifying the observations which do not
conform to an expected behavior. To date, a large number of techniques for anomaly detection have
been proposed and developed. These techniques have been successfully applied to many real world
applications such as fraud detection for credit cards and intrusion detection in network security. How-
ever, there are very little research relating to the method for evaluating the goodness of unsupervised
anomaly detection techniques. In this paper, the authors introduce a method for evaluating the
performance of unsupervised anomaly detection techniques. The method is based on the application
of internal validation metrics in clustering algorithms to anomaly detection. The experiments were
conducted on a number of benchmarking datasets. The results are compared with the result of a
recent proposed approach that shows that some proposed metrics are very consistent when being
used to evaluate the performance of unsupervised anomaly detection algorithms.
Keywords. Anomaly detection, evaluation, clustering validation.
1. INTRODUCTION
Detecting anomaly has received great attention from the research community in machine
learning [6]. Anomaly detection aims at finding samples in data that do not follow the
expected behavior. These samples are often referred to as anomalies or outliers. These two
terms are often used interchangeably. Anomaly detection techniques have extensively been
applied to a wide variety of applications such as fraud detection for credit cards, insurance
or health care, intrusion detection for cyber-security [18].
Since the first study of anomaly detection by statisticians in the early 19th century,
there has been a variety of anomaly detection techniques developed for diverse application
domains [6]. Regarding the availability of labeled data, anomaly detection techniques are
classified into three classes: Supervised anomaly detection, semi-supervised anomaly detec-
tion and unsupervised anomaly detection. Among them, unsupervised anomaly detection
are the most popular techniques and they have been applied to a wide range of problems [6].
In unsupervised anomaly detection, since labeled data is not available, evaluating the
accuracy of detection methods has been a constant challenge in data mining research [31].
So far, the performance of unsupervised anomaly detection techniques has often been tested
by using labeled data sets. In other words, the labels are not used by the algorithms during
c© 2016 Vietnam Academy of Science & Technology
260 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
the training process, but only for evaluating their results [5]. This method is often referred
to as external evaluation approach.
The downside of the external methods is that they are not applicable to many real
world problems where the labeled data is not available. To the best of our knowledge,
there has only been two published research on the approach for evaluating the accuracy of
unsupervised anomaly detection techniques by Marquest et al. and Nguyen et al. [15, 17].
They proposed the idea of using classification algorithm to measure the performance of
unsupervised anomaly detection techniques. Their method assumes that abnormal samples
are often farther from normal samples and can therefore be more easily separated from other
samples. Based on this assumption, they applied a classification algorithm on the output of
anomaly detection and used the accuracy of the classification algorithm as the indicator for
the performance of detection techniques.
The drawback of the approach in [15] and [17] is that it requires to execute one more algo-
rithm (logistic regression in [17] and kernel based classification in [15]) on the top of anomaly
detection techniques. Subsequently, this method may be considerably computational expen-
sive. Moreover, the results may also depend on the selected classification algorithm and the
parameters settings for the classification algorithms. Various classification algorithms with
different settings may lead to significantly different results of the performance of anomaly
detection methods.
In this paper, a new method for evaluating the accuracy of unsupervised anomaly detec-
tion approaches is introduced. The main contributions of the paper are:
• The use of internal validation metrics in clustering for measuring the performance of
unsupervised anomaly detection algorithms proposed.
• Ten validation metrics are tested on some benchmarking abnormal datasets and com-
pared with the method that used logistic regression [17].
• The experimental results show that one of the tested metrics is better than logistic
regression when used for measuring the accuracy of unsupervised anomaly detection
algorithms.
The paper is organized as follows. The next section introduces some popular anomaly
detection techniques. The method for measuring the performance of unsupervised anomaly
detection algorithms is discussed in Section 3. Section 4 presents ten internal metrics that
will be examined in this paper. Section 5 describes the tested datasets and the experimen-
tal setup. The results of testing evaluation metrics are presented in Section 6. Section 7
concludes the paper and highlights some future work.
2. BACKGROUNDS
Anomaly detection has been the topic of a large number of research. For a comprehensive
review of the research on anomaly detection, the readers are recommended to read [6]. In
this section, some well-known anomaly detection approaches are described. Based on the
extent to which the labeled data is available, anomaly detection techniques are categorized
into the following three classes [6].
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 261
Supervised anomaly detection: These methods rely on the assumption that labeled in-
stances for both normal and anomaly class are available during the training process. The
objective is to learn a predictive model for normal versus abnormal classes. The resulting
model is then used to determine which class an unseen data sample belongs to [23, 27].
Although, the accuracy of supervised methods is often higher than other approaches, la-
beling data is often strenuous and expensive. Subsequently, supervised anomaly detection
techniques have not been applied as frequently as unsupervised methods.
Semi-Supervised anomaly detection: Semi-supervised techniques assume that there is
only one class of instances (often normal class) in the training data. The typical approach
is to construct a model for normal behavior, and use the model to identity anomalies in the
test data. In the testing phase, if an unseen sample is not recognized by the learnt model,
then this sample is considered as anomaly. Popular anomaly detection techniques based on
one class learning include one-class Kernel Fisher Discriminants [20] and one class support
vector machine [19]. These methods are more widely applicable than supervised techniques
since abnormal instances are not required in the training phase.
Unsupervised anomaly detection: The techniques that operate in unsupervised mode do
not require labeled samples for both classes. Thus they are the most widely applicable tech-
niques. The techniques in this category assume that normal instances are far more frequent
than anomalies in the dataset and that abnormal samples are often significantly different from
the normal samples. To date, a large number of unsupervised anomaly detection approaches
have been developed [6]. Among them, nearest neighbor based techniques, clustering based
techniques and statistical techniques have been widely applied.
Nearest neighbor based techniques assume that the density in normal region is higher
than in abnormal region. In other words, the abnormal samples are supposed to be isolated
or lied in the region of sparse density. Therefore, the distance of a data instance to its kth
nearest neighbor can be considered as the anomaly score. If this distance is too high then
the data sample is suspected to be anomaly [3, 30]. The advantage of nearest neighbor
based techniques is that they are unsupervised in nature. Nevertheless, the computational
complexity of the techniques in testing phase is often high (O(N2), N is the number of
samples). This hinders the application of nearest neighbor based techniques to some real
world applications where time constraint is important.
Clustering based techniques make the assumption that normal data instances belong to
large clusters, while anomalies belong to small clusters. The techniques use clustering algo-
rithms to divide the dataset into a number of clusters and report any instance that does not
belong to any cluster or belongs to the clusters with a small number of samples as anoma-
lous [14, 29]. Similar to nearest neighbor based techniques, clustering based techniques can
operate fully in unsupervised mode. However, the computational complexity for clustering
the data is also challenging in testing phase especially for the algorithms such as hierarchical
clustering [24] where the complexity is O(N2).
Statistical anomaly detection techniques rely on the assumption that normal data in-
stances are generated by a probabilistic model. The training process aims to learn the
parameters of the probabilistic model. In the testing phase, the methods declare any sample
with low probability of being generated from the learnt model as anomalous [2, 9]. The
advantage of statistical approaches is that the complexity of fitting data is low (often lin-
ear). Consequently, statistical approaches have extensively been used in variety of real world
262 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
problems particularly when the data volume is high. However, statistical approaches are
based on the assumption that the data is generated from a particular distribution. If this
assumption does not hold, the results of statistical methods may not be robust.
3. METHODS
The evaluation method introduced in this paper relies on the assumption that if an
anomaly detection technique performs well on a dataset then the technique will separate the
normal and abnormal samples into different clusters. In other words, let C1 and C2 be the
normal and abnormal sets/clusters determined by applying an anomaly detection technique
A on the data D, then we can see that: the performance of A on D is good if C1 and C2 are
well separated and the performance of A on D is not good if C1 and C2 are not well separated
(i.e. some normal samples are mixed in the abnormal clusters and vice versa). Therefore, it
will be relevant to use the metrics for evaluating the performance of clustering algorithms
(i.e. to evaluate if C1 and C2 are well separated) to measure the quality of anomaly detection
techniques. The objective of this paper is to examine whether the validation metrics used in
clustering could be applied for measuring the performance of anomaly detection approaches.
In data mining, clustering validation has long been recognized as one of the vital issues
to the success of clustering applications [16]. Clustering validation is based on the metric
to evaluate the goodness of clustering results. Clustering validation metrics can be catego-
rized into two main types: external clustering validation and internal clustering validation.
External validation metrics use external information (for example the data labels) that is
not presented in clustering process to evaluate the extent to which the clustering structure
discovered by an algorithm matches to the external structure. On the other hand, inter-
nal measures evaluate the goodness of a clustering structure without using any external
information.
In this paper, we focus only on examining the internal validation metrics since they are
more suitable for the unsupervised anomaly detection algorithms. In order to examine these
metrics, a number of benchmarking datasets were selected. The labels of the chosen datasets
are available but they are not used when calculating the tested validation metrics. This is to
mimic the scenario in many real-world applications such as bank transfer, online game, etc.
where the data label is not available. The data label is used only for swapping some data
samples between normal and abnormal cluster to simulate the situation in which an anomaly
detection technique inaccurately identifies some normal samples as abnormal samples and
vice versa.
For each dataset, the normal and abnormal samples are divided into two clusters (the
normal and the abnormal cluster). The validation metrics are then measured on these two
clusters and the obtained value is referred to as t0. Next, a number of abnormal samples is
selected and swapped to the normal cluster. Correspondingly, the same number of normal
samples are swapped to the abnormal cluster. The validation metrics are again applied to
two new formed clusters (the normal and abnormal clusters that were formed after swapping
some data samples). The swapping process is repeated 10 times with the number of swapped
samples is varied from 10% to 100% of abnormal samples. The value of the validation metrics
calculated on the new formed clusters is referred to as t1, ..., t10, respectively.
Assuming that, for a specific validation metric, the greater value presents the better
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 263
clustering result. Then we expect its value will gradually decrease from t0 to t10. Conversely,
if the smaller value of the metric indicates the better clustering result, then we expect that the
value from t0 to t10 is gradually increased. To measure the correlation between the validation
metric result and the percent value of the swapped samples in the dataset, we calculate the
Pearson correlation coefficient of two vectors: The first vector is T = {t0, t1, ..., t10} and the
second vector is K = {0, 1, ..., 10}.
In statistics, the correlation between sets of data is a measure of how well they are
related [1]. The most common measure of correlation in statistics is the Pearson correlation.
This measure shows the linear relationship between two sets of data. If the value of Pearson
correlation coefficient is close to 1 or -1 then two sets of data are highly correlated. In this
case, the tested validation metric is good for measuring the performance of anomaly detection
techniques. However, if the correlation coefficient is close to zero, then two datasets weakly
correlated and the metric is not reliable for measuring the quality of detection approaches.
In the following section, we will present 10 internal clustering validation metrics that are
applied to measuring the goodness of anomaly detection techniques.
4. CLUSTERING VALIDATION METRICS
This section presents ten clustering validation metrics that will be tested for the evalua-
tion of the goodness of anomaly detection. Since the target of clustering is ensuring objects
within each cluster similar and objects in different clusters distinct [25], most of internal
clustering validation measures are based on two criteria that are compactness and separa-
tion. Compactness measures the level of differently or closely related between the samples in
the same cluster. Separation measures how a cluster is distinct or well-separated from other
ones. The rest of this section will briefly present ten validation metrics. For the sake of the
presentation, some important notations are shown in Table 1. In six out of ten tested metrics
including RS,H,CH, S, I,D, the greater values present the better clustering result. Con-
versely, for four metrics (STD,DB,XB, SD), the smaller values mean the obtained clusters
are better separated.
• The root-mean-square standard deviation (STD) [22] is the square root of the
pooled sample variance of all the attributes. STD takes homogeneous level of the
formed clusters into account[10] by summing up them and then normalizing the result.
RMSSTD =
(∑NC
i=1
∑
x∈Ci ‖x− ci‖
2
P ·∑NCi=1 (ni − 1)
) 1
2
. (1)
• The R-squared (RS) [22] is the ratio of sum of squares between clusters to the total
sum of squares of the whole data set. RS measures the degree of homogeneity between
clusters [10].
RS =
∑
x∈D ‖x− c‖2 −
∑NC
i=1
∑
x∈Ci ‖x− ci‖
2∑
x∈D ‖x− c‖2
. (2)
264 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
Table 1. Notation in clustering validation measure
Notation Meaning
D Data set
n Number of objects in data set
c Center of D
P Number of attributes in data set
p Contrast factor (taken p = 2)
NC Number of clusters
Ci The ith Cluster
ni Number of objects in Ci
ci Center of Ci
σ (Ci) Variance vector of Ci
d (x, y) The distance between x and y
‖Xi‖
(
XTi ·Xi
) 1
2
• The Modified Hubert statistic (H) [13] evaluates the clustering quality the cor-
relation between two square matrices of the same size. The first one is the proximity
matrix of all objects in the data set and the second is the proximity matrix of the
clusters’ centers to which each object belongs.
H =
2
n (n− 1) ·
∑
x∈D
∑
y∈D
d (x, y) dx∈Ci,y∈Cj (ci, cj) . (3)
• The CalinskiHarabasz index (CH) evaluates the cluster solution based on the av-
erage between- and within-cluster sum of squares [4].
CH =
[∑NC
i=1
nid
2 (ci, c) / (NC − 1)
]/[∑NC
i=1
∑
x∈Ci
d2 (x, ci) / (n−NC)
]
. (4)
• The Index I (I) [16] takes the maximum distance between cluster centers as separa-
tion, and the sum of distances between objects and their cluster center as compactness.
I =
(
1
NC
·
∑
x∈D d (x, c)∑NC
i=1
∑
x∈Ci d (x, ci)
·maxi,j d (ci, cj)
)p
. (5)
• The Dunns index (D) [8] uses the minimum pair-wise distance between objects in
different clusters as the inter-cluster separation and the maximum cluster diameter as
the intracluster compactness. The index’s value is ratio of the inter-cluster separation
to the intra-cluster compactness.
D =
min1≤i≤NC min1≤j≤NC,j 6=i minx∈Ci,y∈Cj d (x, y)
max1≤k≤NC maxx∈Ck,y∈Ck d (x, y)
. (6)
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 265
• The Silhouette index (S) [21] measures clustering partition based on the dissimilarity
of each instance to its cluster’s instances, and to its ‘neighbor’ cluster’s instances.
S =
1
NC
NC∑
i=1
[
1
ni
∑
x∈Ci
b (x)− a (x)
max [b (x) , a (x)]
]
, (7)
a (x) =
1
ni − 1
∑
y∈Ci,y 6=x
d (x, y) ,
b (x) = min
j,j 6=i
[
1
nj
∑
y∈Cj
d (x, y)
]
.
• The DaviesBouldin index (DB) [7] is average of cluster’s similarities. The similarity
of each cluster is defined as the maximum value of its similarities to other clusters.
DB =
1
NC
∑
i
max
j,j 6=i
1
ni
∑
x∈Ci
d (x, ci) +
1
nj
∑
x∈Cj
d (x, cj)
/d (ci, cj)
. (8)
• The Xie-Beni index (XB) uses the minimum square distance between cluster centers
as intercluster separation and the mean square distance between each data object and
its cluster center as the intracluster compactness. This index is defined as the ratio of
the compactness to the separation [28].
XB =
∑NC
i=1
∑
x∈Ci d
2 (x, ci)
/
n
mini,j d2 (ci, cj)
. (9)
• The last metric is SD index (SD) [11] the summation of two terms: the average
scattering and the total separation of clusters. The former evaluates compactness based
on variances of clusters and dataset, and the latter evaluates separation difference based
on distances between cluster centers.
SD = Dis (NCmax) ·Scat (NC) +Dis (NC) , (10)
Scat (NC) =
∑NC
i=1 ‖σ (Ci)‖
/
NC
‖σ (D)‖ ,
Dis (NC) =
maxi,j d (ci, cj)
mini,j d (ci, cj)
·
∑NC
i=1
(∑NC
j=1
d (ci, cj)
)−1
.
Among the ten above metrics, there are three metrics (STD, RS and H) that take
either separation or compactness into account. In fact, RS and H consider only separation
while STD considers only compactness. All other metrics consider both separation and
compactness when measuring the goodness of the clustering result.
266 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
Table 2. Datasets used for evaluation measures
# Data set Instance Field Class
Name Notation Abnor. Nom.
1 Ionosphere IONO 22 225 34 2
2 Wisconsin Breast Cancer WBC 23 236 9 2
3 Wisconsin Diagnostic Breast Cancer WDBC 35 357 30 2
4 Pima Indians Diabetes DIAB 50 500 8 2
5 Diabetic Retinopathy Debrecen MESS 60 606 19 2
6 Banknote Authentication BNAU 73 738 4 2
7 Cardiotocography CARD 164 1645 21 2
8 MAGIC gamma telescope MAGI 1233 12332 10 2
5. EXPERIMENTAL SETTINGS
This section presents the settings of the experiments in this paper. First, the selected
datasets for testing the validation metrics are introduced. After that, the data pre-processing
steps are presented.
5.1. Selected datasets
The validation metrics are tested on eight benchmarking datasets drawn from UCI ma-
chine learning repository. The tested datasets, their notation and properties (pre-processed)
are presented in Table 2. The detailed description of the datasets is as follows.
• Ionosphere dataset (IONO): This radar data was collected by a system in Goose Bay,
Labrador. The targets were free electrons in the ionosphere. “Good” radar returns are
those showing evidence of some type of structure in the ionosphere. “Bad” returns are
those that do not. The dataset includes 351 instances, each has 34 attributes.
• Wisconsin Dataset (WBC): The Wisconsin diagnostic breast cancer dataset contains
699 instances and has 9 attributes.
• The Wisconsin Diagnostic Breast Cancer (WDBC): This dataset describes nuclear
characteristics for breast cancer diagnosis, also distinguishing cancer types as benign
(normal) or malignant (abnormal). The dataset includes 659 samples and has 30
attributes.
• Diabetes dataset (DIAB): Several constraints were placed on the selection of these
instances from a larger database. The dataset includes 768 (500 negative, 268 positive)
instances and has 8 attributes.
• Diabetic Retinopathy Debrecen dataset (MESS): This dataset contains features ex-
tracted from the Messidor image set to predict whether an image contains signs of
diabetic retinopathy or not. There are 1151 instances and each instance has 19 at-
tributes.
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 267
• Banknote Authentication dataset (BNAU): Data were extracted from images that were
taken from genuine and forged banknote-like specimens. There are 1372 instances and
each of them has 4 attributes.
• Cardiotocography dataset (CARD): 2126 fetal cardiotocograms (CTGs) were automat-
ically processed and the respective diagnostic features measured. The CTGs were also
classified by three expert obstetricians into a fetal state including three classes (N, S,
P).
• Magic gamma telescope dataset (MAGI): The data are Monte Carlo generated to
simulate registration of high energy gamma particles in a ground-based atmospheric
Cherenkov gamma telescope using the imaging technique. In the dataset, there are
19020 instances (12332 gamma and 6688 hadron), each has 10 attributes.
5.2. Pre-processing
Before the datasets could be used for testing the validation metrics, they need to be
processed to present for the anomaly detection problems. The following pre-processing steps
are applied to each dataset.
• Removing missing values: there are some methods for dealing with missing values in a
dataset. In this paper, to avoid side-effect from pre-processing to the accuracy of the
internal measures, the instances that have missing values are removed from datasets.
• Removing duplicated samples: the datasets may have some duplicate instances that
may affect to the calculation of validation metrics. In this paper, each instance in a
dataset will be checked for duplication, then all other identical ones will be removed.
• Normalization: datasets are usually normalized to avoid dominance of some properties
over the others. The method is chosen to normalize datasets is as follows:
xnew =
x− xmin
xmax − xmin . (11)
After being normalized, the value of all properties in the datasets are in range [0, 1].
• Down-sampling: To produce datasets that are suitable for the anomaly detection prob-
lem, the original datasets are down-sampled. Instances of the biggest class are retained
to form normal class, while instances from others class are randomly selected to create
abnormal class. The rate between number of abnormal instances and normal instance
is 1/10.
6. RESULTS AND DISCUSSION
We first calculated the value of each metric when the number of abnormal samples that
are swapped to the normal cluster is varied from 10% to 100%. The results of 10 validation
metrics are compared with the result obtained by using logistic regression [17]. Table 3 and 4
present the comparison between ten validation metrics and logistic regression (F1) on two
268 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
datasets WBC and BNAU, respectively 1. In these tables, if a value is smaller than 1, this
value is presented as .xyz (for example 0.298 is presented as .298) and the first column (%)
presents the number of the samples swapped between two clusters. Moreover, in each metric,
if any result that does not follow the desired rule of the metric, this result is printed bold
faced.
Table 3. Values of measures on dataset WBC
% STD RS H CH I D S DB XB SD F1
0 .298 .152 .708 46.20 .529 .114 .388 1.02 .45 2.42 .759
10 .303 .122 .619 35.77 .408 .041 .502 1.26 .58 2.91 .615
20 .309 .089 .512 25.17 .285 .041 .469 1.64 .83 3.62 .466
30 .312 .072 .446 19.86 .225 .040 .406 1.89 1.05 4.12 .447
40 .316 .045 .345 12.14 .137 .040 .317 2.54 1.71 5.47 .348
50 .319 .027 .260 7.23 .081 .040 .221 3.35 2.88 7.15 .327
60 .320 .023 .237 6.04 .068 .040 .162 3.68 3.44 7.81 .263
70 .321 .017 .202 4.45 .050 .040 .116 4.30 4.67 9.13 .250
80 .323 .007 .125 1.78 .020 .040 .064 6.79 11.71 14.36 .203
90 .324 .002 .056 .39 .004 .040 .024 14.09 53.38 29.97 .206
100 .323 .003 .076 .73 .008 .099 .009 10.09 28.40 21.49 .158
It can be seen from Table 3 that most metrics produce a consistent result when the
number of swapped abnormal samples varied. Apparently, there are nine out of ten metrics
(excepts D) that are very good for measuring the quality of anomaly detection approaches.
With these measures, there is only one result that did not follow the expected rule. Con-
versely, the results on metric D are not as good as the others. The metric D produces many
results that did not follow the metric’s rules on this dataset. Moreover, the logistic regression
approach [17] produces the best result on this tested problem.
The results on Table 4 are mostly consistent with the results in Table 3. It can be
observed that metric D also did not produce good results on this dataset. Moreover, the
nine metrics that produced the good result in Table 3 are also the good metrics in Table 4.
However, the results of metric S are not as consistent as those in the previous table. This
metric created 2 inconsistent results on this dataset. For logistic regression approach, its
result is still very good on this problems with only one inconsistent value.
The last result presented in the section is the Pearson correlation coefficient of each
metric on the tested dataset. The Pearson correlation coefficients of ten metrics and of the
logistic regression approach are shown in Table 5. In this table, the last row presents the
average correlation value of each metrics over the tested datasets. Moreover, if the average
value of a metric is better than the logistic regression method, its average value is printed
bold faced, and if it is worse than the logistic regression method, the value is printed italic
faced.
It can be seen from Table 5 that, on average, there are is only one metric (H) that is
better than the logistic regression method when being used for evaluating the performance
1The results on other datasets are consistent with those in these tables and they are not presented for the
succinct presentation of the paper.
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 269
Table 4. Values of indices with data set BNAU
% STD RS H CH I D S DB XB SD F1
0 .175 .098 .082 87.57 .044 .052 .222 1.70 .76 10.34 .848
10 .177 .083 .074 72.89 .037 .011 .293 1.89 .91 11.42 .747
20 .178 .072 .068 63.23 .032 .010 .295 2.07 1.05 12.36 .630
30 .180 .054 .056 46.02 .023 .009 .257 2.43 1.44 14.48 .534
40 .181 .043 .050 36.19 .018 .006 .210 2.77 1.83 16.53 .444
50 .182 .025 .036 20.70 .011 .006 .155 3.57 3.20 21.10 .345
60 .183 .018 .029 14.61 .007 .005 .111 4.22 4.54 24.88 .252
70 .184 .011 .023 9.40 .005 .005 .075 5.11 7.05 30.24 .213
80 .184 .007 .017 5.49 .003 .002 .049 6.47 12.07 38.12 .183
90 .184 .001 .007 1.12 .001 .000 .024 13.80 59.36 81.53 .164
100 .184 .002 .008 1.27 .001 .000 .015 12.69 52.14 75.40 .196
Table 5. Correlation coefficient of the validation measures
Dataset STD RS H CH I D S DB XB SD F1
IONO .981 -.981 -.982 -.980 -.980 -.494 -.916 .874 .764 .869 -.955
WBC .943 -.946 -.984 -.933 -.932 -.095 -.956 .862 .718 .858 -.948
WDBC .932 -.937 -.972 -.915 -.923 -.846 -.947 .717 .545 .719 -.974
DIAB .928 -.929 -.969 -.927 -.926 -.500 -.972 .914 .806 .912 -.951
MESS .965 -.965 -.990 -.965 -.965 -.671 -.892 .924 .845 .926 -.931
BNAU .969 -.970 -.992 -.965 -.965 -.700 -.939 .869 .761 .867 -.959
CARD .959 -.960 -.989 -.958 -.958 -.933 -.925 .871 .750 .870 -.971
MAGI .937 -.938 -.979 -.936 -.935 -.801 -.946 .764 .587 .765 -.976
AVG .952 -.953 -.982 -.947 -.948 -.630 -.937 .849 .722 .848 -.958
of anomaly detection techniques. The average of the Pearson correlation coefficient of H
is very close to -1 (-0.982). Therefore, this metric might be very reliable for validating the
goodness of anomaly detection methods. All other metrics are worse compared to the logistic
regression method. However, four metrics including STD, RS, CH and I also produce very
solid results and the average value of their Pearson correlation coefficient is also very close
the value of the logistic regression method. In fact, the mean of the correlation value of
STD, RS, CH, and I is 0.952, -0.953, -947 and -0.948 respectively while this value for
logistic regression method is 0.958. There are four out of ten tested metrics that produced
substantially worse results that are: D, DB, XB, SD. Thus, these four metrics should not
be used for measuring the performance of anomaly detection techniques.
One of the reason while H metric is better than other metrics in this experiment is that H
metric is one of the two metrics that consider only separation. Another metric that considers
only separation is RS and this metric is the second best metric amongst ten tested metrics.
We suppose that for measuring the performance of anomaly detection methods, the metrics
that take into account only separation are more suitable than the metrics that consider
270 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
only compactness or both. In the future, we will conduct more research to investigate this
hypothesis. Overall, the results in this section show that some internal validation metrics for
clustering algorithms such as STD, RS and H are also very reliable for measuring for the
performance of anomaly detection methods. Particularly, the modified Hubert (H) is even
better than the method using logistic regression. Therefore, this method could replace the
logistic regression approach when validating the goodness of unsupervised anomaly detection
algorithms.
7. CONCLUSIONS AND FUTURE WORK
This paper proposed some metrics for validating the efficiency of unsupervised anomaly
detection algorithms. The new metrics relies on the internal evaluation metrics used for
validating the goodness of clustering algorithms. The experiments were conducted on a
number of benchmarking anomaly datasets. The results showed that four of the proposed
metrics produce competitive or better than the results obtained by the previous method [17].
Thus, these metrics could replace the logistic regression method [17] when being used for
validating the outcome of unsupervised anomaly detection techniques.
There are some research areas for future work which arise from this paper. First, the
authors would like to apply the good metrics in this paper to evaluate the performance of
unsupervised anomaly detection algorithms in real world applications. In this paper, the
metrics has been investigated using a number of benchmarking datasets with the available
labels. In the future, the authors want to use the good metrics in this paper to evaluate the
performance of various unsupervised anomaly detection algorithms when they are applied
to solving real world problems such as online games cheating detection [17] and credit card
fraud detection [12].
Second, the authors would like to propose better evaluation metrics for unsupervised
anomaly detection algorithms. The metrics tested in this paper are often based on com-
pactness or both compactness and separation of the clusters. This is suitable for clustering
problems. However, for anomaly detection problems, it may be better if the metrics is only
based on separation properties. One of such metrics is the dissimilarity measures of data in
hierarchical clustering [26]. In the future, the authors would like to study this method for
anomaly detection algorithms.
ACKNOWLEDGMENT
This research is funded by Vietnam National Foundation for Science and Technology
Development (NAFOSTED) under grant number 102.01-2014.09. Trung Thanh Nguyen was
partly funded by VNG company.
REFERENCES
[1] J. Benesty, J. Chen, and Y. Huang, “On the importance of the pearson correlation coefficient in
noise reduction,” IEEE Trans. Audio, Speech & Language Processing, vol. 16, no. 4, pp. 757–765,
2008.
[2] M. Bouguessa, “A mixture model-based combination approach for outlier detection,” Interna-
tional Journal on Artificial Intelligence Tools, vol. 23, no. 4, 2014.
AN EVALUATION METHOD FOR UNSUPERVISED ANOMALY DETECTION 271
[3] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local
outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data, May 16-18, 2000, Dallas, Texas, USA., 2000, pp. 93–104.
[4] T. Caliski and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statis-
tics, vol. 3, no. 1, pp. 1–27, 1974.
[5] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova´, E. Schubert, I. Assent,
and M. E. Houle, “On the evaluation of unsupervised outlier detection: measures, datasets, and
an empirical study,” Data Min. Knowl. Discov., vol. 30, no. 4, pp. 891–927, 2016.
[6] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv.,
vol. 41, no. 3, 2009.
[7] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 1, no. 2, pp. 224–227, 1979.
[8] J. C. Dunn, “Well-separated clusters and optimal fuzzy partitions,” Journal of Cybernetics,
vol. 4, no. 1, pp. 95–104, 1974.
[9] M. Gebski and R. K. Wong, “An efficient histogram method for outlier detection,” in Advances
in Databases: Concepts, Systems and Applications, 12th International Conference on Database
Systems for Advanced Applications, DASFAA 2007, Bangkok, Thailand, April 9-12, 2007, Pro-
ceedings, 2007, pp. 176–187.
[10] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” J. Intell.
Inf. Syst., vol. 17, no. 2-3, pp. 107–145, 2001.
[11] M. Halkidi, M. Vazirgiannis, and Y. Batistakis, “Quality scheme assessment in the clustering
process,” in Principles of Data Mining and Knowledge Discovery, 4th European Conference,
PKDD 2000, Lyon, France, September 13-16, 2000, Proceedings, 2000, pp. 265–276.
[12] N. S. Halvaiee and M. K. Akbari, “A novel model for credit card fraud detection using artificial
immune systems,” Appl. Soft Comput., vol. 24, pp. 40–49, 2014.
[13] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp.
193–218, 1985.
[14] E. Leo´n, O. Nasraoui, and J. Go´mez, “Anomaly detection based on unsupervised niche clus-
tering with application to network intrusion detection,” in Proceedings of the IEEE Congress
on Evolutionary Computation, CEC 2004, 19-23 June 2004, Portland, OR, USA, 2004, pp.
502–508.
[15] H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander, “On the internal evaluation of
unsupervised outlier detection,” in Proceedings of the 27th International Conference on Scientific
and Statistical Database Management, SSDBM ’15, La Jolla, CA, USA, June 29 - July 1, 2015,
2015, pp. 7:1–7:12.
[16] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and
validity indices,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, 2002.
[17] T. T. Nguyen, A. T. Nguyen, T. A. H. Nguyen, L. T. Vu, Q. U. Nguyen, and L. D. Hai,
“Unsupervised anomaly detection in online game,” in Proceedings of the Sixth International
Symposium on Information and Communication Technology, Hue City, Vietnam, December 3-4,
2015, 2015, pp. 4–10.
272 HUY VAN NGUYEN, TRUNG THANH NGUYEN AND QUANG UY NGUYEN
[18] A. Patcha and J. Park, “An overview of anomaly detection techniques: Existing solutions and
latest technological trends,” Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007.
[19] G. Ra¨tsch, S. Mika, B. Scho¨lkopf, and K. Mu¨ller, “Constructing boosting algorithms from svms:
An application to one-class classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24,
no. 9, pp. 1184–1199, 2002.
[20] V. Roth, “Kernel fisher discriminants for outlier detection,” Neural Computation, vol. 18, no. 4,
pp. 942–960, 2006.
[21] P. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis,” J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, Nov. 1987.
[22] S. Sharma, Applied Multivariate Techniques. New York, NY, USA: John Wiley & Sons, Inc.,
1996.
[23] S. Singh and M. Markou, “An approach to novelty detection applied to the classification of image
regions,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 4, pp. 396–407, 2004.
[24] G. J. Sze´kely and M. L. Rizzo, “Hierarchical clustering via joint between-within distances: Ex-
tending ward’s minimum variance method,” J. Classification, vol. 22, no. 2, pp. 151–183, 2005.
[25] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.
[26] D. Wei, Q. Jiang, Y. Wei, and S. Wang, “A novel hierarchical clustering algorithm for gene
sequences,” BMC Bioinformatics, vol. 13, p. 174, 2012.
[27] W. Wong, A. W. Moore, G. F. Cooper, and M. M. Wagner, “Bayesian network anomaly pat-
tern detection for disease outbreaks,” in Machine Learning, Proceedings of the Twentieth In-
ternational Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003, pp.
808–815.
[28] X. L. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 13, no. 8, pp. 841–847, 1991.
[29] D. Yu, G. Sheikholeslami, and A. Zhang, “Findout: Finding outliers in very large datasets,”
Knowl. Inf. Syst., vol. 4, no. 4, pp. 387–412, 2002.
[30] J. Zhang and H. H. Wang, “Detecting outlying subspaces for high-dimensional data: the new
task, algorithms, and performance,” Knowl. Inf. Syst., vol. 10, no. 3, pp. 333–355, 2006.
[31] A. Zimek, R. J. G. B. Campello, and J. Sander, “Ensembles for unsupervised outlier detection:
challenges and research questions a position paper,” SIGKDD Explorations, vol. 15, no. 1, pp.
11–22, 2013.
Received on June 30 - 2016
Revised on February 28 - 2017
Các file đính kèm theo tài liệu này:
- an_evaluation_method_for_unsupervised_anomaly_detection_algo.pdf