Abstract
Over the last decade, data mining, also known as knowledge discovery in databases
(KDD), has become one of the dominant research trends in computer science and
knowledge engineering. The many research results that have been proposed and
successfully applied in practice demonstrate that data mining is an active research
area resting on a solid foundation, rather than the impulsive and short-lived
fashion some once suspected it to be.
Data mining draws on several approaches. Most techniques used in this area are
borrowed from databases, machine learning, artificial intelligence, information
theory, probability and statistics, and high performance computing (including
parallel computing). A wide range of problems needs to be solved in this area,
such as classification/prediction, clustering, association rules mining, and
sequence mining. Data mining is also an interdisciplinary field, because its
research results are put into practice in commerce, finance and the stock market,
biology, medical treatment, education, telecommunications, and so on.
Aware of the potential of this research domain, I chose "Parallel Mining of
Fuzzy Association Rules" as the topic of my master thesis. This dissertation
not only summarizes previous work but also remarks on the delicate relationship
between fuzzy association rules and fuzzy logic theory. In addition, the
dissertation proposes a novel parallel algorithm for fuzzy association rules mining.
Roadmap: the thesis is organized as follows.
Chapter one presents an overview of data mining. It covers several descriptive
definitions of data mining and the main steps in KDD. This chapter also outlines
the major approaches and techniques, as well as the classification of data mining
systems according to different criteria. The conclusion of the chapter enumerates
the issues that have received the most attention during the last few years (and
will likely continue to), and outlines the chief applications of this research domain.
Chapter two formally describes the problem of association rules mining, introducing
basic concepts such as frequent itemset, minimum support, and minimum confidence.
Furthermore, this chapter sums up the significant proposals made in this field over
its roughly ten-year history. A thorough understanding of this chapter will make
the next two chapters much easier to follow.
Chapter three presents fuzzy association rules. We first state the problem of mining
quantitative and categorical association rules, together with the corresponding
methods of data discretization. Quantitative association rules, however, expose
some drawbacks, such as the "sharp boundary problem" and a non-intuitive
interpretation. Fuzzy association rules were proposed to overcome these disadvantages.
Besides summarizing previous research on this kind of association rule, this
chapter tries to highlight the subtle relationship between fuzzy association
rules and fuzzy set theory by determining which functions are suitable as the
T-norm operator. Finally, this chapter suggests a method to convert fuzzy
association rules into quantitative ones, based on a threshold wf of the
membership function associated with the fuzzy set of each fuzzy attribute.
Chapter four focuses on parallel mining of association rules. The first part of this
chapter sums up previously proposed parallel algorithms that have been successfully
tested. Most of these algorithms share a common shortcoming: they must perform
substantial data communication or synchronization among the participating
processors while mining frequent itemsets. Exploiting the distinct features of
fuzzy association rules, I propose a new parallel algorithm for mining this kind
of rule that significantly decreases the amount of data and control information
that needs to be communicated or synchronized. In addition, the algorithm pays
attention to load balancing among processors, thanks to an appropriate strategy
for dividing the set of candidate itemsets.
Chapter five concludes the dissertation by summing up the achievements of the
previous four chapters. Furthermore, this chapter points out several complex
issues whose solutions remain unconvincing. Finally, I outline future research topics.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Notations & Abbreviations
Chapter 1. Introduction to Data Mining
1.1 Data Mining
1.1.1 Motivation: Why Data Mining?
1.1.2 Definition: What is Data Mining?
1.1.3 Main Steps in Knowledge Discovery in Databases (KDD)
1.2 Major Approaches and Techniques in Data Mining
1.2.1 Major Approaches and Techniques in Data Mining
1.2.2 What Kinds of Data Can Be Mined?
1.3 Applications of Data Mining
1.3.1 Applications of Data Mining
1.3.2 Classification of Data Mining Systems
1.4 Focused Issues in Data Mining
Chapter 2. Association Rules
2.1 Motivation: Why Association Rules?
2.2 Association Rules Mining – Problem Statement
2.3 Main Research Trends in Association Rules Mining
Chapter 3. Fuzzy Association Rules Mining
3.1 Quantitative Association Rules
3.1.1 Association Rules with Quantitative and Categorical Attributes
3.1.2 Methods of Data Discretization
3.2 Fuzzy Association Rules
3.2.1 Data Discretization Based on Fuzzy Sets
3.2.2 Fuzzy Association Rules
3.2.3 Algorithm for Fuzzy Association Rules Mining
3.2.4 Relation between Fuzzy Association Rules and Quantitative Association Rules
3.2.5 Experiments and Conclusions
Chapter 4. Parallel Mining of Fuzzy Association Rules
4.1 Several Previously Proposed Parallel Algorithms
4.1.1 Count Distribution Algorithm
4.1.2 Data Distribution Algorithm
4.1.3 Candidate Distribution Algorithm
4.1.4 Algorithm for Parallel Generation of Association Rules
4.1.5 Other Parallel Algorithms
4.2 A New Parallel Algorithm for Fuzzy Association Rules Mining
4.2.1 Our Approach
4.2.2 The New Algorithm
4.3 Experiments and Conclusions
Chapter 5. Conclusions
5.1 Achievements throughout the Dissertation
5.2 Future Research
References
Each of the rules we receive has the form X′ is A′ => (X \ X′) is (A \ A′),
in which X′ and A′ are non-empty subsets of X and A respectively, and the
backslash (the \ sign) in the implication denotes the set subtraction operator.
fc, the fuzzy confidence factor of the rule, must satisfy fc ≥ fminconf.
The inputs of the algorithm are a database D with attribute set I and record set T,
together with fminsup and fminconf.
The outputs of the algorithm are all confident fuzzy association rules.
Notation table:
Notation: Description
D: A relational or transactional database
I: The attribute set of D
T: The record set of D
DF: The database obtained after fuzzifying the original database D
IF: The set of fuzzy attributes in DF; each is attached to a fuzzy set, and each
fuzzy set f in turn has a threshold wf as used in formula (3.7)
TF: The set of records in DF; the value of each record at a given fuzzy attribute
lies in [0, 1]
Ck: The set of candidate fuzzy k-itemsets
Fk: The set of frequent fuzzy k-itemsets
F: The set of all frequent itemsets of the database DF
fminsup: The fuzzy minimum support
fminconf: The fuzzy minimum confidence
Table 9 - Notations used in the fuzzy association rules mining algorithm
The algorithm:

BEGIN
  (DF, IF, TF) = FuzzyMaterialization(D, I, T);
  F1 = Counting(DF, IF, TF, fminsup);
  k = 2;
  while (Fk-1 ≠ ∅) {
    Ck = Join(Fk-1);
    Ck = Prune(Ck);
    Fk = Checking(Ck, DF, fminsup);
    F = F ∪ Fk;
    k = k + 1;
  }
  GenerateRules(F, fminconf);
END

Table 10 - Algorithm for mining fuzzy association rules
The algorithm in table 10 uses the following sub-programs:
• (DF, IF, TF) = FuzzyMaterialization(D, I, T): this function converts the
original database D into the fuzzified database DF; I and T are likewise
transformed into IF and TF respectively.
For example, for the database in table 8, running this function yields:
IF = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3),
[Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5),
[BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After conversion, IF contains 9 fuzzy attributes, compared with the 4 attributes
in I. Each fuzzy attribute is a pair, enclosed in square brackets, consisting of
the name of the original attribute and the name of the corresponding fuzzy set.
For instance, fuzzifying the Age attribute produces three new fuzzy attributes:
[Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into TF, as shown in
the following table:
A 1 2 3 C 4 5 S 6 7 H 8 9
60 0.00 0.41 0.92 206 0.60 0.40 0 1 0 2 0 1
54 0.20 0.75 0.83 239 0.56 0.44 0 1 0 2 0 1
54 0.20 0.75 0.83 286 0.52 0.48 0 1 0 2 0 1
52 0.29 0.82 0.78 255 0.54 0.46 0 1 0 2 0 1
68 0.00 0.32 1.00 274 0.53 0.47 1 0 1 2 0 1
54 0.20 0.75 0.83 288 0.51 0.49 1 0 1 1 1 0
46 0.44 0.97 0.67 204 0.62 0.38 0 1 0 1 1 0
37 0.59 0.93 0.31 250 0.54 0.46 0 1 0 1 1 0
71 0.00 0.28 1.00 320 0.43 0.57 0 1 0 1 1 0
74 0.00 0.25 1.00 269 0.53 0.47 0 1 0 1 1 0
29 0.71 0.82 0.25 204 0.62 0.38 0 1 0 1 1 0
70 0.00 0.28 1.00 322 0.43 0.57 0 1 0 2 0 1
67 0.00 0.32 1.00 544 0.00 1.00 0 1 0 1 1 0
Table 11 - TF: Values of records at attributes after fuzzifying
Note that the characters A, C, S, and H in table 11 are the first letters of
Age, Cholesterol, Sugar, and Heart respectively; the numbered columns hold the
membership values of the correspondingly numbered fuzzy attributes listed above.
Each fuzzy set f is accompanied by a threshold wf, so only membership values
greater than or equal to that threshold are taken into consideration; all other
values are treated as 0. In the original table, cells whose values meet the
threshold (all thresholds in table 11 are 0.5) are shaded gray, and the values
in the unshaded cells are treated as 0.
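To make this materialization step concrete, here is a minimal Python sketch of
it for the Age attribute. The triangular membership function and the fuzzy-set
parameters below are illustrative assumptions, not the functions actually
defined in section 3.2.1; only the thresholding behavior follows formula (3.7).

# A sketch of FuzzyMaterialization for one attribute, assuming triangular
# membership functions with illustrative parameters.
def triangular(x, a, b, c):
    """Membership of x in a fuzzy set rising over [a, b], falling over [b, c]."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

AGE_FUZZY_SETS = {                       # hypothetical fuzzy sets for Age
    "Age_Young":       lambda x: triangular(x, -1, 20, 45),
    "Age_Middle-aged": lambda x: triangular(x, 25, 45, 65),
    "Age_Old":         lambda x: triangular(x, 40, 70, 120),
}

W_F = 0.5  # threshold w_f of formula (3.7), here identical for every fuzzy set

def fuzzify_age(age):
    """Fuzzy attribute values for one Age value; values below w_f become 0."""
    return {name: (mu(age) if mu(age) >= W_F else 0.0)
            for name, mu in AGE_FUZZY_SETS.items()}

print(fuzzify_age(60))  # only the memberships reaching w_f survive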
• F1 = Counting(DF, IF, TF, fminsup): this function generates F1, the set of
all frequent fuzzy 1-itemsets. Every element of F1 must have support greater
than or equal to fminsup. For instance, applying ordinary multiplication as the
T-norm (⊗) operator in formula (3.6), with fminsup = 46%, yields the F1 shown
in the following table:
Fuzzy 1-itemsets Support Is it frequent? (fminsup = 46%)
{[Age, Age_Young]} (1) 10 % No
{[Age, Age_Middle-aged]} (2) 45 % No
{[Age, Age_Old]} (3) 76 % Yes
{[Serum cholesterol, Cholesterol_Low]} (4) 43 % No
{[Serum cholesterol, Cholesterol_High]} (5) 16 % No
{[BloodSugar, BloodSugar_0]} (6) 85 % Yes
{[BloodSugar, BloodSugar_1]} (7) 15 % No
{[HeartDisease, HeartDisease_No]} (8) 54 % Yes
{[HeartDisease, HeartDisease_Yes]} (9) 46 % Yes
Table 12 - C1: set of candidate 1-itemsets
F1 = {{3}, {6}, {8}, {9}}
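The support computation behind Counting can be sketched as follows; this assumes
formula (3.6) with ordinary multiplication as the T-norm and a record set TF
stored as a list of per-record dictionaries, both of which are representation
choices made here for illustration.

# Fuzzy support: average over all records of the product (the chosen T-norm)
# of the membership values of the itemset's fuzzy attributes.
from functools import reduce

def fuzzy_support(itemset, TF):
    total = 0.0
    for record in TF:
        total += reduce(lambda acc, attr: acc * record[attr], itemset, 1.0)
    return total / len(TF)

# Keep the 1-itemsets whose support reaches fminsup (46% in the example):
# F1 = [(a,) for a in IF if fuzzy_support((a,), TF) >= 0.46]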
• Ck = Join(Fk-1): this function produces the set of candidate fuzzy k-itemsets
(Ck) from the set of frequent fuzzy (k-1)-itemsets (Fk-1) discovered in the
previous step. The following SQL statement indicates how elements of Fk-1 are
combined to form candidate k-itemsets:

INSERT INTO Ck
SELECT p.i1, p.i2, …, p.ik-1, q.ik-1
FROM Fk-1 p, Fk-1 q
WHERE p.i1 = q.i1 AND … AND p.ik-2 = q.ik-2 AND p.ik-1 < q.ik-1 AND p.ik-1.o ≠ q.ik-1.o;

Here, p.ij and q.ij are the index numbers of the jth fuzzy attributes in itemsets
p and q respectively, and p.ij.o and q.ij.o are the index numbers of their
original attributes. Two fuzzy attributes sharing a common original attribute
must not occur in the same fuzzy itemset. For example, after running the above
SQL command we obtain C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The
2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from
the common attribute HeartDisease.
• Ck = Prune(Ck): this function prunes unnecessary candidate k-itemsets from Ck,
using the downward closure property: all subsets of a frequent itemset are also
frequent, and any superset of a non-frequent itemset cannot be frequent. To
evaluate a candidate k-itemset in Ck, the Prune function verifies that all of
its (k-1)-subsets are present in Fk-1.
For instance, pruning leaves C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}
unchanged, since every 1-subset of each candidate is in F1.
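A compact Python sketch of Join and Prune is given below. It assumes itemsets
are represented as sorted tuples of fuzzy-attribute indices, and that origin
maps each fuzzy attribute to its original attribute (so indices 8 and 9 both
map to HeartDisease).

from itertools import combinations

def join(F_prev, origin):
    """Combine frequent (k-1)-itemsets sharing a (k-2)-prefix into candidates."""
    Ck = []
    for p in F_prev:
        for q in F_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                Ck.append(p + (q[-1],))
    return Ck

def prune(Ck, F_prev):
    """Keep a candidate only if every (k-1)-subset of it is frequent."""
    frequent = set(F_prev)
    return [c for c in Ck if all(s in frequent for s in combinations(c, len(c) - 1))]

# With F_prev = [(3,), (6,), (8,), (9,)] and origin = {3: 'Age', 6: 'Sugar',
# 8: 'Heart', 9: 'Heart'}, join and prune yield [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)].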
• Fk = Checking(Ck, DF, fminsup): this function first scans all records of the
database to accumulate the support of each candidate itemset in Ck. Afterwards,
Checking eliminates every infrequent candidate itemset, i.e., those whose
support is smaller than fminsup; the frequent itemsets are retained and placed
in Fk.
After running F2 = Checking(C2, DF, 46%), we receive F2 = {{3, 6}, {6, 8}}.
The following table displays the detailed information.
Candidate 2-itemset Support factor Is it frequent?
{3, 6} 62 % Yes
{3, 8} 35 % No
{3, 9} 41 % No
{6, 8} 46 % Yes
{6, 9} 38 % No
Table 13 - Support counts of the candidate 2-itemsets in C2; the frequent ones form F2
• GenerateRules(F, fminconf): this function generates all confident fuzzy
association rules from the set F of all frequent fuzzy itemsets.
In the example above, after finishing the first phase (finding all frequent
itemsets), we get F = F1 ∪ F2 = {{3}, {6}, {8}, {9}, {3, 6}, {6, 8}}
(F3 is not created because C3 is empty). The following table lists the
discovered fuzzy association rules.
No. Fuzzy association rule or 1-itemset Support Confidence
1 Old people 76 %
2 Blood sugar ≤ 120 mg/ml 85 %
3 Not suffering from heart disease 54 %
4 Suffering from heart disease 46 %
5 Old people => Blood sugar ≤ 120 mg/ml 62 % 82 %
6 Blood sugar ≤ 120 mg/ml => Old people 62 % 73 %
7 Blood sugar ≤ 120 mg/ml => Not suffering from heart disease 46 % 54 %
8 Not suffering from heart disease => Blood sugar ≤ 120 mg/ml 46 % 85 %
Table 14 - Fuzzy association rules generated from the database in table 8
Since the minimum confidence is 70%, rule 7 (confidence 54%) is rejected.
3.2.4 Relation between Fuzzy Association Rules and Quantitative
Association Rules
According to formula (3.7), the membership function of each fuzzy set f is
attached to a threshold wf. Based on this threshold, we can defuzzify a fuzzy
association rule, converting it into a form similar to a quantitative one.
For example, the fuzzy rule "Old people => Blood sugar ≤ 120 mg/ml, with
support 62% and confidence 82%" in table 14 can be changed into the rule
"Age ≥ 46 => Blood sugar ≤ 120 mg/ml, with support 62% and confidence 82%".
The smallest membership value of the attribute [Age, Age_Old] that is greater
than or equal to wAge_Old (= 0.5) is 0.67. The age corresponding to the fuzzy
value 0.67 is 46, and any person aged 46 or more has a fuzzy value of at least
0.67. Therefore, we substitute "Age ≥ 46" for "Age_Old". Any fuzzy association
rule can be changed into a quantitative one in the same way.
Because most fuzzy membership functions have smooth graphs and are piecewise
monotonic (their derivatives change sign at most a few times), this
defuzzification is relatively simple.
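The following Python fragment sketches this conversion; it assumes the
membership function is non-decreasing over the scanned range, as Age_Old is
for increasing ages.

def fuzzy_to_quantitative(mu, w_f, lo, hi, step=1):
    """Smallest x in [lo, hi] with mu(x) >= w_f, or None if no such x exists."""
    x = lo
    while x <= hi:
        if mu(x) >= w_f:
            return x
        x += step
    return None

# With the Age_Old membership of the running example and w_f = 0.5, the search
# returns 46, turning "Age is Age_Old" into the condition "Age >= 46".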
3.2.5 Experiments and Conclusions
• Experiments varying the database size (increasing or decreasing the number of
records) to measure the mining time.
• Experiments varying fminsup and fminconf.
• Experiments varying the thresholds associated with the fuzzy sets.
• Experiments varying the choice of T-norm operator (min function versus
ordinary multiplication).
• Experiments on the transformation from fuzzy association rules to
quantitative ones.
Chapter 4. Parallel Mining of Fuzzy Association Rules
One of the most essential and time-consuming tasks in association rules mining is
finding all frequent itemsets in immense volumes of data. It demands both much
CPU time (it is CPU-bound) and many I/O operations (it is I/O-bound). Thus,
researchers have continually tried to improve the existing algorithms or devise
new ones in order to speed up the whole mining process [11] [13] [20] [24] [30]
[35]. Most of these algorithms are sequential and work efficiently on small or
medium databases (sizes being judged by the number of attributes and records).
However, they lose performance and expose shortcomings on extremely large
databases (usually hundreds of megabytes or more), due to the limits of the
processor speed and internal memory capacity of a single computer.
Fortunately, with the explosive development of the hardware industry, high
performance computing systems have come onto the market. This has opened up a
new research direction in the data mining community: since 1995, researchers
have continually devised efficient parallel and distributed algorithms for
association rules mining [5] [12] [18] [26] [31] [32] [34]. These algorithms
are diverse because they depend tightly on the architectures of the various
parallel computing systems.
In the first section of this chapter, I briefly present previous parallel
algorithms that have been successfully tested on real databases. In the second
section, I propose a novel parallel algorithm for mining fuzzy association
rules. It has been tested on a Linux-based PC cluster using the Message Passing
Interface (MPI) standard [27] [28] [29] and has returned promising results. The
algorithm comes close to the ideal in that it strongly reduces the data
communication and synchronization among processors. However, it can mine only
fuzzy or quantitative association rules, and it suits relational rather than
transactional databases.
4.1 Several Previously Proposed Parallel Algorithms
In this section, I restate some existing parallel algorithms. They are all
designed to run on shared-nothing parallel systems, which have the following
characteristics:
• A shared-nothing system has N processors. Each processor Pi has its own
internal memory (RAM) and external memory (hard disk).
• The N processors communicate with one another over a high-speed network
using a message passing mechanism.
4.1.1 Count Distribution Algorithm
The Count Distribution approach proposed in [34] is an Apriori-like algorithm
[35]. In this algorithm, N denotes the number of processors, Pi stands for the
ith processor, and Di is the data partition of processor Pi (the original
database D is divided into N equal partitions, each placed on the hard disk of
one processor). The algorithm consists of the following steps:
• (1) Pass one (k = 1): the first pass is special; at its end, every processor
in the system holds the complete set of frequent 1-itemsets, L1.
• (2) Subsequent passes (for every k > 1): the algorithm repeatedly performs
the following steps:
o (2.1): Each processor Pi generates the complete Ck using the complete set of
frequent itemsets Lk-1 created in pass k-1. Observe that since each processor
has the identical Lk-1, all processors generate an identical Ck.
o (2.2): Processor Pi makes a pass over its data partition Di and develops
local support counts for the candidate itemsets in Ck. This is the parallel
step: all processors independently scan their own data partitions.
o (2.3): Processor Pi exchanges its local Ck support counts with all other
processors to develop the global Ck support counts. The processors are forced
to synchronize in this step.
o (2.4): Each processor Pi now computes Lk from Ck based on the minsup parameter.
o (2.5): Each processor Pi independently decides whether to terminate or to
continue with the next iteration. The decision is identical on all processors,
as they all have the identical Lk.
The figure below illustrates the Count Distribution algorithm.
Figure 7 - Count distribution algorithm on a 3-processor parallel system
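For concreteness, here is a minimal sketch of one Count Distribution pass
written with mpi4py (an assumption made here; the algorithms in this chapter
are defined independently of any particular MPI binding). count_supports is a
hypothetical helper that scans the local partition and returns one count per
candidate in Ck.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def count_distribution_pass(Ck, local_partition, minsup_count, count_supports):
    # step 2.2: scan only the local partition Di
    local = np.array(count_supports(Ck, local_partition), dtype=np.int64)
    global_counts = np.empty_like(local)
    # step 2.3: sum the per-candidate counts across all processors (synchronizes)
    comm.Allreduce(local, global_counts, op=MPI.SUM)
    # step 2.4: every processor derives the identical Lk
    return [c for c, n in zip(Ck, global_counts) if n >= minsup_count]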
4.1.2 Data Distribution Algorithm
The attractive feature of the Count Distribution algorithm is that no data
tuples are exchanged between processors; only support counts are exchanged.
Thus, processors can operate independently and asynchronously while reading the
data. The disadvantage, however, is that the algorithm does not exploit the
aggregate memory of the system effectively. Suppose each processor has memory
of size |M|. The number of candidate itemsets that can be counted in one pass
is determined by |M|. As we increase the number of processors from 1 to N, the
system has N × |M| total memory, but we still count the same number of
candidate itemsets in one pass, as every processor counts an identical Ck.
The Data Distribution algorithm is designed to better exploit the total system
memory as the number of processors increases. In this algorithm, each processor
counts mutually exclusive candidate itemsets; thus, as the number of processors
increases, a larger number of candidate itemsets can be counted in one pass.
The downside is that every processor must broadcast its local data to all other
processors in every pass. Therefore, this algorithm is feasible only on a
machine with very fast and stable communication.
Data Distribution is also based on Apriori. Again, N denotes the number of
processors, Pi is the ith processor, and Di is the data partition of processor
Pi (the original database D is divided into N equal partitions, each placed on
the hard disk of one processor). The algorithm includes the following steps:
• (1) Pass one (k = 1): the same as in the Count Distribution algorithm.
• (2) Subsequent passes (for every k > 1):
o (2.1): Processor Pi generates Ck from Lk-1 but retains only 1/N of the
itemsets, forming the candidate subset Cki that it will count. Which itemsets
are retained is determined by the processor identifier and can be computed
without communicating with the other processors; in a real implementation,
itemsets are assigned in round-robin fashion. The Cki sets are pairwise
disjoint, and their union is the original Ck (i.e., Cki ∩ Ckj = ∅ for i ≠ j,
and Ck1 ∪ Ck2 ∪ … ∪ CkN = Ck).
o (2.2): Processor Pi develops support counts for the itemsets in its local
candidate set Cki, using both its local data pages and the data pages received
from other processors.
o (2.3): At the end of the pass over the data, each processor Pi calculates Lki
from its local Cki. Again, the Lki sets are pairwise disjoint, and their union
is Lk (i.e., Lki ∩ Lkj = ∅ for i ≠ j, and Lk1 ∪ Lk2 ∪ … ∪ LkN = Lk).
o (2.4): The processors exchange their Lki so that every processor obtains the
complete Lk for generating Ck+1 in the next pass. This step requires the
processors to synchronize. Having obtained the complete Lk, each processor can
independently (but identically) decide whether to terminate or to continue with
the next iteration.
The following figure depicts this algorithm:
Figure 8 - Data distribution algorithm on a 3-processor parallel system
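The heart of step (2.1) can be sketched in a few lines of Python: every
processor keeps only its round-robin share of the candidates, so the Cki are
pairwise disjoint, their union is Ck, and no communication is needed to agree
on the split.

def my_candidates(Ck, rank, num_procs):
    # the split depends only on the processor's rank, so it is computed locally
    return [c for idx, c in enumerate(Ck) if idx % num_procs == rank]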
4.1.3 Candidate Distribution Algorithm
Both Count and Data distribution algorithms require processors to synchronize at
the end of a pass to exchange support counts or frequent itemsets respectively. If
the workload is not perfectly balanced, this can cause all the processors to wait for
whichever processor finishes last in every pass. The Candidate distribution
algorithm attempts to do away with the dependence between processors so that
they may proceed independently without synchronizing at the end of every pass.
In some pass l, where l is heuristically determined, this algorithm divides the
frequent itemsets Ll-1 between the processors in such a way that each processor
Pi can generate a unique Cmi (m ≥ l) independently of all other processors
(Cmi ∩ Cmj = ∅, ∀i ≠ j). At the same time, the data is selectively replicated
so that each processor can count the candidates in Cmi independently of all
other processors. The choice of the redistribution pass is a trade-off between
decoupling the processors as soon as possible and waiting until the itemsets
become more easily and equitably partitionable. The partitioning algorithm
exploits the semantics of the Apriori candidate generation procedure described
before.
After this candidate distribution, the only dependence that a processor has on other
processors is for pruning the local candidate set during the prune step of candidate
generation. However, a processor does not wait for the complete pruning
information to arrive from all other processors. During the prune step of candidate
generation, it prunes the candidate set as much as possible using whatever
information has arrived, and opportunistically starts counting the candidates. The
late arriving pruning information can be used in subsequent passes. The algorithm
is described below:
• (1) Passes k < l: use either the Count or the Data Distribution algorithm.
• (2) Pass k = l:
o (2.1): Partition Lk-1 among the N processors such that the partitions are
"well balanced" (we discuss below how this partitioning is done). Record, with
each frequent itemset in Lk-1, which processor it has been assigned to. This
partitioning is done identically, in parallel, by every processor.
o (2.2): Processor Pi produces Cki logically using only the Lk-1 partition
assigned to it. Note that Pi still has access to the complete Lk-1, and hence
can use standard pruning while generating Cki in this pass.
o (2.3): Pi develops global counts for the candidates in Cki, and the database
is repartitioned into DRi at the same time. The details of this step are given
below.
o (2.4): After Pi has processed all of its local data and any data received
from other processors, it posts N-1 asynchronous receive buffers to receive the
Lkj from all other processors. These Lkj are needed for pruning Ck+1i in the
prune step of candidate generation.
o (2.5): Processor Pi computes Lki from Cki and asynchronously broadcasts it to
the other N-1 processors using N-1 asynchronous sends.
• (3) Passes k > l:
o (3.1): Processor Pi collects all frequent itemsets that have been sent to it
by other processors. They are used in the prune step of candidate generation,
but not in the join step. Itemsets received from processor Pj may be of length
k-1, shorter than k-1 (a slower processor), or longer than k-1 (a faster
processor). Pi keeps track, for each processor Pj, of the largest size of
frequent itemset sent by it. Receive buffers for the frequent itemsets are
reposted after processing.
o (3.2): Pi creates Cki using its local Lk-1i. It can happen that Pi has not
yet received Lk-1j from all other processors, so Pi must be careful at pruning
time: it needs to distinguish a (k-1)-long subset of a candidate that is not
present in any Lk-1j from one that is present in some Lk-1j which has not yet
been received. It does so by probing Ll-1 (recall that repartitioning took
place in pass l) with the (l-1)-long prefix of the itemset in question, finding
the processor responsible for it, and checking whether Lk-1j has been received
from that processor.
o (3.3): Pi makes a pass over DRi and counts Cki. It then computes Lki and
asynchronously broadcasts Lki to every other processor using N-1 asynchronous
sends.
Partitioning: We motivate the algorithm for partitioning Lk with an example.
Let L3 be {ABC, ABD, ABE, ACD, ACE, BCD, BCE, BDE, CDE}. Then L4 = {ABCD, ABCE,
ABDE, ACDE, BCDE}, L5 = {ABCDE}, and L6 = ∅. Consider ε = {ABC, ABD, ABE},
whose members all share the common prefix AB. Note that the candidates ABCD,
ABCE, ABDE, and ABCDE also have the prefix AB: the Apriori candidate generation
procedure generates these candidates by joining only the items in ε.
Therefore, assuming that the items within itemsets are lexicographically
ordered, we can partition the itemsets in Lk based on their common (k-1)-long
prefixes. By ensuring that no partition is assigned to more than one processor,
we ensure that each processor can create its candidates independently (ignoring
the prune step). Suppose we also repartition the database so that any tuple
supporting an itemset contained in any of the Lk partitions assigned to a
processor is copied to the local disk of that processor. The processor can then
proceed entirely independently, without further synchronization.
The actual algorithm is more involved, for two reasons. First, a processor may
have to obtain frequent itemsets computed by other processors for the prune
step of candidate generation: in the example above, the processor assigned the
set ε has to know whether BCDE is frequent in order to decide whether to prune
the candidate ABCDE, yet the set with prefix BC may have been assigned to a
different processor. Second, the load must be balanced across processors.
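The prefix grouping itself is straightforward; the sketch below (illustrative
only, with the load-balancing assignment to processors left out) groups Lk by
common (k-1)-prefix, which is exactly the granularity at which itemsets must
stay together.

from collections import defaultdict

def partition_by_prefix(Lk):
    """Group lexicographically sorted itemsets by their common (k-1)-prefix."""
    groups = defaultdict(list)
    for itemset in Lk:
        groups[itemset[:-1]].append(itemset)
    return list(groups.values())  # each group is assigned to a single processor

L3 = [tuple(s) for s in ["ABC", "ABD", "ABE", "ACD", "ACE", "BCD", "BCE", "BDE", "CDE"]]
# partition_by_prefix(L3) keeps {ABC, ABD, ABE} together (prefix AB), as in the
# example above, so each group's candidates can be generated by one processor.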
4.1.4 Algorithm for Parallel Generation of Association Rules
To generate rules, for every large itemset l we find all non-empty subsets of l.
For every such subset a, we output a rule of the form a => (l - a) if the ratio
of s(l) to s(a) is at least minconf. We consider all subsets of l in order to
generate rules with multiple consequents. Since the large itemsets are stored
in hash tables, the support counts of the subset itemsets can be found
efficiently.
We can improve the above procedure by generating the subsets of a large itemset
in a recursive, depth-first fashion. For example, given an itemset ABCD, we
first consider the subset ABC, then AB, and so on. If a subset a of a large
itemset l does not generate a rule, then the subsets of a need not be
considered for generating rules from l. For example, if ABC => D does not have
adequate confidence, we need not check whether AB => CD holds. We miss no rules
this way, because the support of any subset a' of a must be at least as great
as the support of a; therefore the confidence of the rule a' => (l - a') cannot
be more than the confidence of a => (l - a). Hence, if a did not yield a rule
involving all the items of l with a as the antecedent, neither will a'.
// Simple algorithm: generate rules from every frequent itemset of size > 1
forall frequent itemsets lk, k > 1 do
  call gen_rules(lk, lk);

// gen_rules generates all valid rules α => (lk - α), for all α ⊂ lk
procedure gen_rules(lk: frequent k-itemset, am: frequent m-itemset)
begin
  A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
  forall am-1 ∈ A do begin
    conf = s(lk) / s(am-1);
    if (conf ≥ minconf) then begin
      output the rule am-1 => (lk - am-1);
      if (m - 1 > 1) then
        call gen_rules(lk, am-1);  // recurse on shorter antecedents
    end
  end
end
Table 15 - The sequential algorithm for generating association rules
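For illustration, here is a Python rendering of the same procedure; support is
a hypothetical mapping from each frequent itemset (as a frozenset) to its
support count.

from itertools import combinations

def gen_rules(lk, am, support, minconf, out):
    """Emit every confident rule am1 => (lk - am1), recursing depth-first."""
    for c in combinations(sorted(am), len(am) - 1):
        am1 = frozenset(c)
        conf = support[frozenset(lk)] / support[am1]
        if conf >= minconf:
            out.append((am1, frozenset(lk) - am1, conf))
            if len(am1) > 1:  # subsets of a confident antecedent may still yield rules
                gen_rules(lk, am1, support, minconf, out)

# Usage: for every frequent itemset lk with more than one item,
# call gen_rules(lk, lk, support, minconf, rules).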
Generating rules in parallel simply involves partitioning the set of all
frequent itemsets among the processors. Each processor then generates rules for
its own partition only, using the algorithm above. Since the number of rules
that can be generated from an itemset is sensitive to the itemset's size, we
attempt equitable balancing by partitioning the itemsets of each length equally
across the processors.
Note that when calculating the confidence of a rule, a processor may need to
examine the support of an itemset for which it is not responsible; for this
reason, every processor must have access to all the frequent itemsets before
rule generation can begin. This is not a problem for the Count and Data
Distribution algorithms, because at the end of the last pass all processors
hold all the frequent itemsets. In the Candidate Distribution algorithm, fast
processors may need to wait until slower processors have discovered and
transmitted all of their frequent itemsets. For this reason, and because the
rule generation step is relatively cheap, it may be better in the Candidate
Distribution algorithm to simply discover the frequent itemsets in parallel and
generate the rules off-line, possibly on a serial processor. This would allow
processors to be freed for other jobs as soon as they finish finding frequent
itemsets, even while other processors in the system are still working.
4.1.5 Other Parallel Algorithms
In addition to the four algorithms above, we also list here some other parallel
algorithms for mining association rules.
The Intelligent Data Distribution algorithm [12] builds on the ordinary Data
Distribution algorithm with an improvement in the data communication among
processors: instead of transmitting data between every pair of processors, it
transmits data along a logical ring set up when the algorithm starts.
The Multiple Local Frequent Pattern Tree (MLFPT) algorithm [31] is derived from
FP-growth. It needs no candidate generation, reduces the number of passes over
the database, and also addresses load balancing among processors.
Another parallel algorithm, working on shared-everything systems (typically SMP
systems), was proposed in [26].
4.2 A New Parallel Algorithm for Fuzzy Association Rules Mining
Almost all known parallel algorithms need, to a greater or lesser degree, data
communication and synchronization among processors, which adds complexity to
their real implementations; hence they are not considered "ideal" parallel
computing problems. Based on the approach to fuzzy association rules described
above, I propose a new parallel algorithm for mining this kind of rule. It is
nearly ideal in that little communication takes place during processing: data
communication happens only twice, once at startup for dividing and delivering
the fuzzy attributes among the processors, and once for gathering the rules as
the algorithm finishes.
4.2.1 Our Approach
According to the fuzzy association rules mining algorithm mentioned in chapter
three, each attribute iu in I is associated with a set of fuzzy sets uiF as follows:
{ }kiiii uuuu fffF ,...,, 21=
For example, in the sample database in table 8, we have:
1i
F = FAge = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)
2i
F = FSerum_cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
The fuzzy association rule has the form:
X is A ⇒ Y is B
where:
• X, Y ⊆ I are itemsets: X = {x1, x2, …, xp} (xi ≠ xj if i ≠ j) and
Y = {y1, y2, …, yq} (yi ≠ yj if i ≠ j).
• A = {fx1, fx2, …, fxp} and B = {fy1, fy2, …, fyq} are sets of fuzzy sets
corresponding to the attributes in X and Y, with fxi ∈ Fxi and fyj ∈ Fyj.
Each fuzzy attribute is a pair consisting of an original attribute name and a fuzzy set name.
For instance, with I = {Age, SerumCholesterol, BloodSugar, HeartDisease}, we
now have the set of fuzzy attributes IF as:
IF = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3),
[Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar,
BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
Table 16 - The set of fuzzy attributes obtained by fuzzifying the database in table 8
After fuzzification, IF consists of 9 fuzzy attributes, compared with the 4
attributes in I, and the value of every record at each fuzzy attribute in the
record set TF lies in the range [0, 1]. The objective of the mining problem is
to discover all confident fuzzy association rules.
We observe that no fuzzy association rule (in either its antecedent or its
consequent) should contain two fuzzy attributes that share a common original
attribute in I. For example, rules such as "Age_Old AND Cholesterol_High AND
Age_Young => HeartDisease_Yes" or "BloodSugar > 120mg/ml AND HeartDisease_No =>
HeartDisease_Yes" are invalid: the former contains both Age_Old and Age_Young
(both derived from Age), and the latter contains both HeartDisease_No and
HeartDisease_Yes (both derived from HeartDisease).
There are two chief reasons for this restriction. First, fuzzy attributes
sharing a common original attribute are usually mutually exclusive in meaning,
so they would largely decrease the support of any rule containing them
together; for example, Age_Old is semantically opposite to Age_Young, because
nobody is "both young and old". Second, such a rule is not worthwhile and
carries little meaning.
Thus, we can require that all fuzzy attributes in the same rule be independent,
in the sense that no two of them share the same original attribute. This
observation is the foundation of our new parallel algorithm.
I will sketch our idea through a simple example. Suppose we run the algorithm
on the database in table 8 on a 6-processor parallel system. We need to divide
the set of fuzzy attributes IF among the processors, so that the processors can
operate in parallel and independently, as follows.
The processor P1:
IF1 = {[Age, Age_Young] (1), [Cholesterol, Cholesterol_Low] (4),
[BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
= {1, 4, 6, 7, 8, 9}
The processor P2: IF2 = {1, 5, 6, 7, 8, 9}
The processor P3: IF3 = {2, 4, 6, 7, 8, 9}
The processor P4: IF4 = {2, 5, 6, 7, 8, 9}
The processor P5: IF5 = {3, 4, 6, 7, 8, 9}
The processor P6: IF6 = {3, 5, 6, 7, 8, 9}
We divide IF based on the first two attributes, Age and Cholesterol. The 9
initial fuzzy attributes are distributed among the 6 processors, and each
processor receives 6 fuzzy attributes. This division is "ideal" because the
number of processors (6) equals the product of the number of fuzzy sets
associated with Age (3) and the number associated with Cholesterol (2), i.e.,
6 = 3 × 2.
An optimal division is one that spreads the fuzzy attributes equally over all
processors in the system. When an optimal division cannot be obtained, we
settle for "the most reasonable" one, in which case some processors may sit
idle at times while others carry more load. Below, I present an algorithm for
dividing the fuzzy attributes; it first tries to find an optimal solution and,
failing that, returns "the most reasonable" one. The division algorithm is
formally described below:
Given a database D, let I = {i1, i2, …, in} be its set of n attributes and
T = {t1, t2, …, tm} its set of m records. After fuzzification, D, I, and T are
converted into DF, IF, and TF respectively:
IF = {[i1, fi11], …, [i1, fi1k1], [i2, fi21], …, [i2, fi2k2], …, [in, fin1], …, [in, finkn]}
where fiju is the uth fuzzy set associated with attribute ij, and kj is the
number of fuzzy sets associated with ij. For example, for the database in
table 8 we have I = {Age, SerumCholesterol, BloodSugar, HeartDisease}, and
after conversion we receive the IF shown in table 16. In this case, k1 = 3,
k2 = 2, k3 = 2, and k4 = 2 are the numbers of fuzzy sets associated with the
original attributes in I.
Let FN = {k1} ∪ {k2} ∪ … ∪ {kn} = {s1, s2, …, sv} (v ≤ n, since some pairs
(ki, kj) may be equal), and let N be the number of processors in the system.
The division task then becomes the following problem: find a non-empty subset
Fn of FN such that the product of the elements of Fn equals N (the optimal
solution). When no optimal solution exists, the algorithm returns "the most
reasonable" one, i.e., a subset whose product approximates N from below as
closely as possible.
The algorithm follows a backtracking strategy and is implemented iteratively,
without recursion.
The division algorithm:

BOOLEAN Subset(FN, N, Idx)
  k = 1;
  Idx[1] = 0;
  S = 1;                       // running product of the chosen elements
  while (k > 0) {
    Idx[k]++;                  // try the next element at depth k
    if (Idx[k] <= sizeof(FN)) {
      if (S * FN[Idx[k]] <= N) {
        if (S * FN[Idx[k]] == N)
          return TRUE;         // exact product found: Idx[1..k] is the answer
        else {
          S *= FN[Idx[k]];     // take FN[Idx[k]] and descend one level
          Idx[k + 1] = Idx[k];
          k++;
        }
      }
    } else {
      k--;                     // this depth is exhausted: backtrack
      S /= FN[Idx[k]];
    }
  }
  return FALSE;

FindSubset(FN, N, Idx, Fn)
  for (n = N; n > 0; n--)
    if (Subset(FN, n, Idx)) {
      Fn = {FN[i] | i ∈ Idx};
      return;
    }

Table 17 - Algorithm for dividing the fuzzy attributes among processors
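The same search can be written compactly in Python. The sketch below is a
recursive variant of table 17: it prunes with divisibility (an element may be
chosen only if it divides the remaining target), which is equivalent to
requiring the running product to reach n exactly.

def subset_with_product(FN, n, chosen=(), start=0):
    """Backtracking search for indices into FN whose values multiply to n."""
    if n == 1 and chosen:
        return chosen
    for i in range(start, len(FN)):
        if n % FN[i] == 0:  # FN[i] must divide the remaining target
            found = subset_with_product(FN, n // FN[i], chosen + (i,), i + 1)
            if found:
                return found
    return None

def find_subset(FN, N):
    """Try the targets N, N-1, ... and return the first achievable subset."""
    for n in range(N, 0, -1):
        idx = subset_with_product(FN, n)
        if idx:
            return [FN[i] for i in idx]
    return []

print(find_subset([3, 2], 6))  # [3, 2]: the optimal division for 6 processors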
In the above example, Fn is {3, 2}. The algorithm is guaranteed to return "the
most reasonable" solution whenever no optimal one exists, since FindSubset
gradually decreases n, searching the space of products that approximate N from
below.
4.2.2 The New Algorithm
Inputs: the database D with attribute set I and record set T; the number of
processors, referred to as N; and the minimum support and minimum confidence
thresholds, minsup and minconf.
Outputs: all confident fuzzy association rules.
The parallel algorithm for mining fuzzy association rules consists of the
following steps:
• (1): Convert the database D, attribute set I, and record set T into DF, IF,
and TF respectively (the fuzzification process).
• (2): Use the algorithm in table 17 to scatter the fuzzy attributes among the
N processors in the system.
• (3): Each processor Pi uses the sequential algorithm in table 10 to mine
frequent itemsets, and the algorithm in table 15 to generate confident fuzzy
association rules.
• (4): Collect the discovered rules from all processors in the system.
This algorithm applies successfully not only to fuzzy association rules but
also to quantitative and categorical ones.
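A skeleton of these steps in mpi4py might look as follows; assign_partition,
mine_frequent_itemsets, and generate_rules are placeholders for the procedures
of tables 17, 10, and 15 respectively, and the fuzzified DF, IF, and TF are
assumed to be available on every processor.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def parallel_fuzzy_arm(DF, IF, TF, fminsup, fminconf,
                       assign_partition, mine_frequent_itemsets, generate_rules):
    rank, N = comm.Get_rank(), comm.Get_size()
    # step 2: divide the fuzzy attributes; every rank computes the same split
    my_attrs = assign_partition(IF, rank, N)
    # step 3: purely local, sequential mining on this rank's attribute subset
    F = mine_frequent_itemsets(DF, my_attrs, TF, fminsup)
    rules = generate_rules(F, fminconf)
    # step 4: the only other communication: gather all rules at rank 0
    return comm.gather(rules, root=0)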
4.3 Experiments and Conclusions
• Experiments varying the number of attributes and the number of records to
measure the mining time.
• Experiments increasing and decreasing the number of processors in the system.
Chapter 5. Conclusions
5.1 Achievements throughout the dissertation
Building on previous research results in the field of data mining, this
dissertation not only sums up the most noticeable features of this research
direction but also presents several new proposals. We summarize here the chief
achievements of the thesis.
In the first chapter, the thesis outlines the overall picture of data mining:
some descriptive definitions as well as its motivation. This chapter also
surveys the dominant techniques applied to subproblems such as clustering,
classification/prediction, and association rules mining. Finally, the chapter
enumerates the research trends that have recently received the most attention
in this domain.
Chapter two formally describes the problem of association rules mining as
proposed by R. Agrawal in 1993, and gives the most fundamental concepts, such
as itemset, frequent itemset, and confident rule. Moreover, this chapter
depicts some of the research topics that have attracted the most interest in
recent years, such as fuzzy association rules and parallel mining of
association rules. The core goal of the chapter is to present the background of
association rules mining to readers; a thorough understanding of it makes the
next two chapters much easier to study.
On the foundation of the earlier proposals in [4] [9] [38] [39], chapter three
presents the problem of quantitative and categorical association rules mining,
together with its advantages and disadvantages. The main goal of the chapter,
however, is to study in depth a special kind of rule: the fuzzy association
rule. This type of rule is much more flexible, readable, and intuitive than the
elementary kind described in the previous chapter. The treatment of this kind
of rule in [4] [9] is so brief that it fails to emphasize the subtle relation
between fuzzy association rules and fuzzy logic theory. The thesis therefore
also explains why we choose the min function and algebraic multiplication for
the T-norm operator in formula (3.6). In addition, the chapter restates the
algorithm for mining fuzzy association rules of [4] [9], which is based on
Apriori with slight customizations. Finally, the chapter offers a method for
transforming fuzzy association rules into quantitative ones.
Chapter four proposes a novel parallel algorithm for mining fuzzy association
rules. Under this algorithm, the processors largely reduce the amount of
information that must be communicated during processing. The algorithm can be
considered close to "ideal" thanks to its strategy for distributing the
original set of fuzzy attributes among the processors: the division is both
balanced and intelligent, in that the resulting partitions are of equal size
and each processor can operate on its partition independently. However, the
algorithm has some drawbacks. First, it applies only to mining fuzzy or
quantitative association rules. Second, it works well only on shared-nothing
parallel systems.
While writing this thesis, I have tried my best to research and consult many
reference documents. Nevertheless, the thesis cannot avoid some mistakes and
shortcomings, and I would appreciate any comments and suggestions from readers,
on both technical and editorial matters.
5.2 Future research
Association rules mining is attractive to many researchers because of its
widespread applications in various areas, and because it contains many
potential research directions. In this dissertation, I have focused on fuzzy
association rules and a parallel algorithm for mining this kind of rule. In
the future, the following research directions will be taken into consideration:
• Mining fuzzy association rules with weighted attributes. The major goal of
this kind of rule is to measure the "contribution level" of each attribute to
the overall rule. For example, when mining association rules on data related to
heart disease, information from attributes such as blood pressure, blood sugar,
and cholesterol is much more critical than that from body weight and age; thus
we attach higher weights to blood pressure, blood sugar, and cholesterol than
to body weight and age.
• The parallel algorithm recommended in chapter four works only on
shared-nothing systems. In the near future, we would like to concentrate on
designing parallel algorithms for other architectures, such as SMP.
• Although association rules mining is largely independent of the kind of data
being mined, we would like to focus on a specific database in order to improve
the proposed algorithms and tune the related parameters.
References
Vietnamese documents:
[1]. [PDD99] Phan Dinh Dieu. Logic in knowledge systems. Faculty of
Technology, Vietnam National University, Hanoi – 1999.
[2]. [DMT03] Dinh Manh Tuong. Artificial Intelligence. Faculty of
Technology, Vietnam National University, Hanoi – 2003.
English documents:
[3]. [AR95] Alan Rea. Data Mining – An Introduction. The Parallel Computer
Centre, The Queen's University of Belfast.
[4]. [AG00] Attila Gyenesei. A Fuzzy Approach for Mining Quantitative
Association Rules. Turku Centre for Computer Science, TUCS Technical
Reports, No 336, March 2000.
[5]. [AM95] Andreas Mueller. Fast Sequential and Parallel Algorithms for
Association Rule Mining: A Comparison. Department of Computer
Science, University of Maryland-College Park, MD 20742.
[6]. [LHM99] Bing Liu, Wynne Hsu, and Yiming Ma. Mining Association
Rules with Multiple Minimum Supports. In ACM SIGKDD International
Conference on KDD & Data Mining (KDD-99), August 15-18, 1999, San
Diego, CA, USA.
[7]. [KV01] Boris Kovalerchuk and Evgenii Vityaev. Data Mining in Finance –
Advances in Relational and Hybrid Methods. Kluwer Academic Publishers,
Boston – Dordrecht - London. 2001.
[8]. [MHT02] Bui Quang Minh, Phan Xuan Hieu, Ha Quang Thuy. Some
Parallel Computing Experiments with PC-Cluster. In Proc. of Conference
on IT of Faculty of Technology, VNUH. Hanoi 2002.
[9]. [KFW98] Chan Man Kuok, Ada Fu, and Man Hon Wong. Mining Fuzzy
Association Rules in Databases. Department of Computer Science and
Engineering, The Chinese University of Hong Kong, Shatin, New
Territories, Hong Kong.
[10]. [THH02] Do Van Thanh, Pham Tho Hoan, and Phan Xuan Hieu. Mining
Association Rules with different supports. In Proc. of the National
Conference on Information Technology, Nhatrang, Vietnam, May 2002.
[11]. [BCJ01] Doug Burdick, Manuel Calimlim, and Johannes Gehrke. MAFIA:
A Maximal Frequent Itemset Algorithm for Transactional Databases.
Department of Computer Science, Cornell University.
[12]. [HKK97] Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar.
Scalable Parallel Data Mining for Association Rules. Department of
Computer Science, University of Minnesota, 4-192 EECS Building, 200
Union St. SE, Minneapolis, MN 55455, USA.
[13]. [PHM01] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An Efficient
Algorithm for Mining Frequent Closed Itemsets. Intelligent Database
Systems Research Lab, School of Computing Science, Simon Fraser
University, Burnaby, B.C., Canada.
[14]. [HK02] Jiawei Han and Micheline Kamber. Data Mining: Concepts and
Techniques. University of Illinois, Morgan Kaufmann Publishers 2002.
[15]. [HF95] Jiawei Han and Yongjian Fu. Discovery of Multiple-Level
Association Rules from Large Databases. In Proc. of the 21st International
Conference on Very Large Databases, Zurich, Switzerland, Sep 1995.
[16]. [LZDRS99] Jinyan Li, Xiuzhen Zhang, Guozhu Dong, Kotagiri
Ramamohanarao, and Qun Sun. Efficient Mining of High Confidence
Association Rules without Support Thresholds. Department of CSSE, The
University of Melbourne, Parkville, Vic, 3052, Australia.
[17]. [HG00] Jochen Hipp, Ulrich Guntzer, and Gholamreza Nakhaeizadeh.
Algorithms for Association Rule Mining – A General Survey and
Comparison. ACM SIGKDD, July 2000, Volume 2, Issue 1 – page 58 – 64.
[18]. [PCY95] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient
Parallel Data Mining for Association Rules. In Fourth International
Conference on Information and Knowledge Management, Baltimore,
Maryland, Nov 1995.
[19]. [PYC98] Jong Soon Park (Sungshin Women’s Univ, Seoul, Korea), Philip
S. Yu (IBM T.J. Watson Res. Ctr.), and Ming-Syan Chen (National Taiwan
Univ., Taipei, Taiwan). Mining Association Rules with Adjustable
Accuracy.
[20]. [MTV94] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.
Efficient Algorithms for Discovering Association Rules. In KDD-1994:
AAAI Workshop on Knowledge Discovery in Databases, pages 181-192,
Seattle, Washington, July 1994.
[21]. [LAZ65] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965.
[22]. [KMRTV94] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and
A.I. Verkamo. Finding Interesting Rules from Large Sets of Discovered
Association Rules. In Proc. 3rd International Conference on Information
and Knowledge Management, pages 401-408, Gaithersburg, Maryland,
November 1994.
[23]. [MM00] Manoel Mendonca. Mining Software Engineering Data: A Survey.
University of Maryland, Department of Computer Science, A. V. Williams
Building #3225 College Park, MD 20742. 2000.
[24]. [ZH99] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An Efficient
Algorithm for Closed Association Rules Mining. RPI Technical Report 99-
10, 1999.
[25]. [ZO98] Mohammed J. Zaki and Mitsunori Ogihara. Theoretical
Foundations of Association Rules. In 3rd ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, June 1998.
[26]. [ZPO01] Mohammed J. Zaki, Srinivasan Parthasarathy, and Mitsunori
Ogihara. Parallel Data Mining for Association Rules on Shared-Memory
Systems. In Knowledge and Information Systems, Vol. 3, Number 1, pages
1-29 February 2001.
[27]. [MPIS95] MPI: A Message-Passing Interface Standard, Message Passing
Interface Forum. June 12, 1995.
[28]. [EMPI97] MPI-2: Extensions to the Message-Passing Interface, Message
Passing Interface Forum, July 18, 1997
[29]. [JDMPI97] MPI-2 Journal of Development, Message Passing Interface
Forum, July 18, 1997.
[30]. [PBTL99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal.
Discovering Frequent Closed Itemsets for Association Rules. Laboratoire
d’Informatique, Université Blaise Pascal – Clermont-Ferrand II, Complexe
Scientifique des Cézeaux.
[31]. [ZHL98] Osmar R. Zaiane, Mohammad El-Hajj, and Paul Lu. Fast Parallel
Association Rule Mining Without Candidacy Generation. University of
Alberta, Edmonton, Alberta, Canada.
[32]. [DP01] Qin Ding and William Perrizo. Using Active Networks in Parallel
Mining of Association Rules. Computer Science Department, North Dakota
State University, Fargo ND 58105-5164.
[33]. [AY98] R. Agrawal and P. Yu. Online Generation of Association Rules. In
IEEE International Conference on Data Mining, February 1998.
[34]. [AS96] Rakesh Agrawal and John Shafer. Parallel mining of association
rules: Design, implementation and experience. Research Report RJ 10004,
IBM Almaden Research Center, San Jose, California, February 1996.
[35]. [AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for
Mining Association Rules. In Proc. of the 20th International Conference on
Very Large Databases, Santiago, Chile, Sep 1994.
[36]. [AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining
association rules between sets of items in large databases. In Proc. of
the ACM SIGMOD Conference on Management of Data, pages 207-216,
Washington, D.C., May 1993.
[37]. [SA95] Ramakrishnan Srikant and Rakesh Agrawal. Mining Generalized
Association Rules. In Proc. of the 21st International Conference on Very
Large Databases, Zurich, Switzerland. Sep 1995.
[38]. [SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining Quantitative
Association Rules in Large Relational Tables. IBM Almaden Research
Center, San Jose, CA 95120.
[39]. [MY98] R. J. Miller and Y. Yang. Association Rules over Interval Data.
Department of Computer & Information Science, Ohio State University,
USA.
[40]. [PMPIP] RS/6000 SP: Practical MPI Programming, Yukiya Aoyama &
Jun Nakano, International Technical Support Organization,
www.redbooks.ibm.com
[41]. [MS00] T. Murai and Y. Sato. Association Rules from a Point of View of
Modal Logic and Rough Sets. In Proc. of the Fourth Asian Fuzzy
Symposium, May 31 - June 3, 2000, Tsukuba, Japan, pp. 427-432.
[42]. [HHMT02] Tran Vu Ha, Phan Xuan Hieu, Bui Quang Minh, and Ha Quang
Thuy. A Model for Parallel Association Rules Mining from The Point of
View of Rough Set. In Proc. of International Conference on East-Asian
Language Processing and Internet Information Technology, Hanoi,
Vietnam, January 2002.
[43]. [FSSU96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth,
and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data
Mining. AAAI Press / The MIT Press 1996.
[44]. [WYY01] Wei Wang, Jiong Yang, and Philip S. Yu. Efficient Mining of
Weighted Association Rules (WAR). IBM Watson Research Center.
[45]. [WMPPP] Writing Message-Passing Parallel Programs with MPI, Neil
MacDonald, Elspeth Minty, Tim Harding, Simon Brown, Edinburgh
Parallel Computing Centre, The University of Edinburgh.
[46]. [ZKM01] Zijian Zheng, Ron Kohavi, and Llew Mason. Real World
Performance of Association Rule Algorithms. Blue Martini Software, 2600
Campus Drive, San Mateo, CA 94403, USA.
[47]. [ZHJ91] Zimmermann H. J. Fuzzy Set Theory and Its Applications. Kluwer
Academic Publishers, 1991.