Abstract
Over the last decade, data mining, also known as knowledge discovery in databases
(KDD), has become one of the dominant research trends in computer science and
knowledge engineering. The many research results that have been proposed and
successfully applied in practice demonstrate that data mining is an active research
area resting on a solid foundation, rather than the impulsive and short-lived
fashion some once suspected it to be.
Data mining draws on several approaches. Most techniques used in this area are
borrowed from databases, machine learning, artificial intelligence, information
theory, probability and statistics, and high performance computing (including
parallel computing). A wide range of problems needs to be solved in this area,
such as classification/prediction, clustering, association rules mining, and
sequence mining. Data mining is also an interdisciplinary field, because its
research results are put into practice in commerce, finance and the stock market,
biology, medical treatment, education, telecommunications, and so on.
Aware of the potential of this research domain, I chose "Parallel Mining of
Fuzzy Association Rules" as the topic of my master thesis. This dissertation
not only summarizes previous work but also remarks on the delicate relationship
between fuzzy association rules and fuzzy logic theory. In addition, the
dissertation proposes a novel parallel algorithm for fuzzy association rules mining.
Roadmap: the thesis is organized as follows.
Chapter one presents an overview of data mining. It covers several descriptive
definitions of data mining and the main steps in KDD. This chapter also outlines
the major approaches and techniques, as well as the classification of data mining
systems according to different criteria. The conclusion of the chapter enumerates
the issues that have received the most attention during the last few years (and
will likely continue to), and outlines the chief applications of this research domain.
Chapter two formally describes the problem of association rules mining, introducing
basic concepts such as frequent itemset, minimum support, and minimum confidence.
Furthermore, this chapter sums up the significant proposals made in this field over
its roughly ten-year history. A thorough understanding of this chapter will make
the next two chapters much easier to follow.
Chapter three presents fuzzy association rules. We first state the problem of mining
quantitative and categorical association rules, together with the corresponding
methods of data discretization. Quantitative association rules, however, expose
some drawbacks, such as the "sharp boundary problem" and a non-intuitive
interpretation. Fuzzy association rules were proposed to overcome these disadvantages.
Besides summarizing previous research on this kind of association rule, this
chapter tries to highlight the subtle relationship between fuzzy association
rules and fuzzy set theory by determining which functions are suitable as the
T-norm operator. Finally, this chapter suggests a method to convert fuzzy
association rules into quantitative ones, based on a threshold wf of the
membership function associated with the fuzzy set of each fuzzy attribute.
Chapter four focuses on parallel mining of association rules. The first part of this
chapter sums up previously proposed parallel algorithms that have been successfully
tested. Most of these algorithms share a common shortcoming: they must perform
substantial data communication or synchronization among the participating
processors while mining frequent itemsets. Exploiting the distinct features of
fuzzy association rules, I propose a new parallel algorithm for mining this kind
of rule that significantly decreases the amount of data and control information
that needs to be communicated or synchronized. In addition, the algorithm pays
attention to load balancing among processors, thanks to an appropriate strategy
for dividing the set of candidate itemsets.
Chapter five concludes the dissertation by summing up the achievements of the
previous four chapters. Furthermore, this chapter points out several complex
issues whose solutions remain unconvincing. Finally, I outline future research topics.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Notations & Abbreviations
Chapter 1. Introduction to Data Mining
1.1 Data Mining
1.1.1 Motivation: Why Data Mining?
1.1.2 Definition: What is Data Mining?
1.1.3 Main Steps in Knowledge Discovery in Databases (KDD)
1.2 Major Approaches and Techniques in Data Mining
1.2.1 Major Approaches and Techniques in Data Mining
1.2.2 What Kinds of Data Can Be Mined?
1.3 Applications of Data Mining
1.3.1 Applications of Data Mining
1.3.2 Classification of Data Mining Systems
1.4 Focused Issues in Data Mining
Chapter 2. Association Rules
2.1 Motivation: Why Association Rules?
2.2 Association Rules Mining – Problem Statement
2.3 Main Research Trends in Association Rules Mining
Chapter 3. Fuzzy Association Rules Mining
3.1 Quantitative Association Rules
3.1.1 Association Rules with Quantitative and Categorical Attributes
3.1.2 Methods of Data Discretization
3.2 Fuzzy Association Rules
3.2.1 Data Discretization Based on Fuzzy Sets
3.2.2 Fuzzy Association Rules
3.2.3 Algorithm for Fuzzy Association Rules Mining
3.2.4 Relation between Fuzzy Association Rules and Quantitative Association Rules
3.2.5 Experiments and Conclusions
Chapter 4. Parallel Mining of Fuzzy Association Rules
4.1 Several Previously Proposed Parallel Algorithms
4.1.1 Count Distribution Algorithm
4.1.2 Data Distribution Algorithm
4.1.3 Candidate Distribution Algorithm
4.1.4 Algorithm for Parallel Generation of Association Rules
4.1.5 Other Parallel Algorithms
4.2 A New Parallel Algorithm for Fuzzy Association Rules Mining
4.2.1 Our Approach
4.2.2 The New Algorithm
4.3 Experiments and Conclusions
Chapter 5. Conclusions
5.1 Achievements throughout the Dissertation
5.2 Future Research
References
Each of the rules we receive has the form X′ is A′ => (X \ X′) is (A \ A′),
in which X′ and A′ are non-empty subsets of X and A respectively, and the
backslash (the \ sign) in the implication denotes the set subtraction operator.
fc, the fuzzy confidence factor of the rule, must satisfy fc ≥ fminconf.
The inputs of the algorithm are a database D with attribute set I and record set T,
together with fminsup and fminconf.
The outputs of the algorithm are all confident fuzzy association rules.
Notation table:
Notation: Description
D: A relational or transactional database
I: The attribute set of D
T: The record set of D
DF: The database obtained after fuzzifying the original database D
IF: The set of fuzzy attributes in DF; each is attached to a fuzzy set, and each
fuzzy set f in turn has a threshold wf as used in formula (3.7)
TF: The set of records in DF; the value of each record at a given fuzzy attribute
lies in [0, 1]
Ck: The set of candidate fuzzy k-itemsets
Fk: The set of frequent fuzzy k-itemsets
F: The set of all frequent itemsets of the database DF
fminsup: The fuzzy minimum support
fminconf: The fuzzy minimum confidence
Table 9 - Notations used in the fuzzy association rules mining algorithm
The algorithm:

BEGIN
  (DF, IF, TF) = FuzzyMaterialization(D, I, T);
  F1 = Counting(DF, IF, TF, fminsup);
  k = 2;
  while (Fk-1 ≠ ∅) {
    Ck = Join(Fk-1);
    Ck = Prune(Ck);
    Fk = Checking(Ck, DF, fminsup);
    F = F ∪ Fk;
    k = k + 1;
  }
  GenerateRules(F, fminconf);
END

Table 10 - Algorithm for mining fuzzy association rules
The algorithm in table 10 uses the following sub-programs:
• (DF, IF, TF) = FuzzyMaterialization(D, I, T): this function converts the
original database D into the fuzzified database DF; I and T are likewise
transformed into IF and TF respectively.
For example, for the database in table 8, running this function yields:
IF = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3),
[Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5),
[BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
After conversion, IF contains 9 fuzzy attributes, compared with the 4 attributes
in I. Each fuzzy attribute is a pair, enclosed in square brackets, consisting of
the name of the original attribute and the name of the corresponding fuzzy set.
For instance, fuzzifying the Age attribute produces three new fuzzy attributes:
[Age, Age_Young], [Age, Age_Middle-aged], and [Age, Age_Old].
In addition, the function FuzzyMaterialization converts T into TF, as shown in
the following table:
A 1 2 3 C 4 5 S 6 7 H 8 9
60 0.00 0.41 0.92 206 0.60 0.40 0 1 0 2 0 1
54 0.20 0.75 0.83 239 0.56 0.44 0 1 0 2 0 1
54 0.20 0.75 0.83 286 0.52 0.48 0 1 0 2 0 1
52 0.29 0.82 0.78 255 0.54 0.46 0 1 0 2 0 1
68 0.00 0.32 1.00 274 0.53 0.47 1 0 1 2 0 1
54 0.20 0.75 0.83 288 0.51 0.49 1 0 1 1 1 0
46 0.44 0.97 0.67 204 0.62 0.38 0 1 0 1 1 0
37 0.59 0.93 0.31 250 0.54 0.46 0 1 0 1 1 0
71 0.00 0.28 1.00 320 0.43 0.57 0 1 0 1 1 0
74 0.00 0.25 1.00 269 0.53 0.47 0 1 0 1 1 0
29 0.71 0.82 0.25 204 0.62 0.38 0 1 0 1 1 0
70 0.00 0.28 1.00 322 0.43 0.57 0 1 0 2 0 1
67 0.00 0.32 1.00 544 0.00 1.00 0 1 0 1 1 0
Table 11 - TF: Values of records at attributes after fuzzifying
Note that the characters A, C, S, and H in table 11 are the first letters of
Age, Cholesterol, Sugar, and Heart respectively; the numbered columns hold the
membership values of the correspondingly numbered fuzzy attributes listed above.
Each fuzzy set f is accompanied by a threshold wf, so only membership values
greater than or equal to that threshold are taken into consideration; all other
values are treated as 0. In the original table, cells whose values meet the
threshold (all thresholds in table 11 are 0.5) are shaded gray, and the values
in the unshaded cells are treated as 0.
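To make this materialization step concrete, here is a minimal Python sketch of
it for the Age attribute. The triangular membership function and the fuzzy-set
parameters below are illustrative assumptions, not the functions actually
defined in section 3.2.1; only the thresholding behavior follows formula (3.7).

# A sketch of FuzzyMaterialization for one attribute, assuming triangular
# membership functions with illustrative parameters.
def triangular(x, a, b, c):
    """Membership of x in a fuzzy set rising over [a, b], falling over [b, c]."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

AGE_FUZZY_SETS = {                       # hypothetical fuzzy sets for Age
    "Age_Young":       lambda x: triangular(x, -1, 20, 45),
    "Age_Middle-aged": lambda x: triangular(x, 25, 45, 65),
    "Age_Old":         lambda x: triangular(x, 40, 70, 120),
}

W_F = 0.5  # threshold w_f of formula (3.7), here identical for every fuzzy set

def fuzzify_age(age):
    """Fuzzy attribute values for one Age value; values below w_f become 0."""
    return {name: (mu(age) if mu(age) >= W_F else 0.0)
            for name, mu in AGE_FUZZY_SETS.items()}

print(fuzzify_age(60))  # only the memberships reaching w_f survive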
• F1 = Counting(DF, IF, TF, fminsup): this function generates F1, the set of
all frequent fuzzy 1-itemsets. Every element of F1 must have support greater
than or equal to fminsup. For instance, applying ordinary multiplication as the
T-norm (⊗) operator in formula (3.6), with fminsup = 46%, yields the F1 shown
in the following table:
Fuzzy 1-itemsets Support Is it frequent? (fminsup = 46%)
{[Age, Age_Young]} (1) 10 % No
{[Age, Age_Middle-aged]} (2) 45 % No
{[Age, Age_Old]} (3) 76 % Yes
{[Serum cholesterol, Cholesterol_Low]} (4) 43 % No
{[Serum cholesterol, Cholesterol_High]} (5) 16 % No
{[BloodSugar, BloodSugar_0]} (6) 85 % Yes
{[BloodSugar, BloodSugar_1]} (7) 15 % No
{[HeartDisease, HeartDisease_No]} (8) 54 % Yes
{[HeartDisease, HeartDisease_Yes]} (9) 46 % Yes
Table 12 - C1: set of candidate 1-itemsets
F1 = {{3}, {6}, {8}, {9}}
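The support computation behind Counting can be sketched as follows; this assumes
formula (3.6) with ordinary multiplication as the T-norm and a record set TF
stored as a list of per-record dictionaries, both of which are representation
choices made here for illustration.

# Fuzzy support: average over all records of the product (the chosen T-norm)
# of the membership values of the itemset's fuzzy attributes.
from functools import reduce

def fuzzy_support(itemset, TF):
    total = 0.0
    for record in TF:
        total += reduce(lambda acc, attr: acc * record[attr], itemset, 1.0)
    return total / len(TF)

# Keep the 1-itemsets whose support reaches fminsup (46% in the example):
# F1 = [(a,) for a in IF if fuzzy_support((a,), TF) >= 0.46]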
• Ck = Join(Fk-1): this function produces the set of candidate fuzzy k-itemsets
(Ck) from the set of frequent fuzzy (k-1)-itemsets (Fk-1) discovered in the
previous step. The following SQL statement indicates how elements of Fk-1 are
combined to form candidate k-itemsets:

INSERT INTO Ck
SELECT p.i1, p.i2, …, p.ik-1, q.ik-1
FROM Fk-1 p, Fk-1 q
WHERE p.i1 = q.i1 AND … AND p.ik-2 = q.ik-2 AND p.ik-1 < q.ik-1 AND p.ik-1.o ≠ q.ik-1.o;

Here, p.ij and q.ij are the index numbers of the jth fuzzy attributes in itemsets
p and q respectively, and p.ij.o and q.ij.o are the index numbers of their
original attributes. Two fuzzy attributes sharing a common original attribute
must not occur in the same fuzzy itemset. For example, after running the above
SQL command we obtain C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}. The
2-itemset {8, 9} is invalid because its two fuzzy attributes are derived from
the common attribute HeartDisease.
• Ck = Prune(Ck): this function prunes unnecessary candidate k-itemsets from Ck,
using the downward closure property: all subsets of a frequent itemset are also
frequent, and any superset of a non-frequent itemset cannot be frequent. To
evaluate a candidate k-itemset in Ck, the Prune function verifies that all of
its (k-1)-subsets are present in Fk-1.
For instance, pruning leaves C2 = {{3, 6}, {3, 8}, {3, 9}, {6, 8}, {6, 9}}
unchanged, since every 1-subset of each candidate is in F1.
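A compact Python sketch of Join and Prune is given below. It assumes itemsets
are represented as sorted tuples of fuzzy-attribute indices, and that origin
maps each fuzzy attribute to its original attribute (so indices 8 and 9 both
map to HeartDisease).

from itertools import combinations

def join(F_prev, origin):
    """Combine frequent (k-1)-itemsets sharing a (k-2)-prefix into candidates."""
    Ck = []
    for p in F_prev:
        for q in F_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1] and origin[p[-1]] != origin[q[-1]]:
                Ck.append(p + (q[-1],))
    return Ck

def prune(Ck, F_prev):
    """Keep a candidate only if every (k-1)-subset of it is frequent."""
    frequent = set(F_prev)
    return [c for c in Ck if all(s in frequent for s in combinations(c, len(c) - 1))]

# With F_prev = [(3,), (6,), (8,), (9,)] and origin = {3: 'Age', 6: 'Sugar',
# 8: 'Heart', 9: 'Heart'}, join and prune yield [(3, 6), (3, 8), (3, 9), (6, 8), (6, 9)].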
• Fk = Checking(Ck, DF, fminsup): this function first scans all records of the
database to accumulate the support of each candidate itemset in Ck. Afterwards,
Checking eliminates every infrequent candidate itemset, i.e., those whose
support is smaller than fminsup; the frequent itemsets are retained and placed
in Fk.
After running F2 = Checking(C2, DF, 46%), we receive F2 = {{3, 6}, {6, 8}}.
The following table displays the detailed information.
Candidate 2-itemset Support factor Is it frequent?
{3, 6} 62 % Yes
{3, 8} 35 % No
{3, 9} 41 % No
{6, 8} 46 % Yes
{6, 9} 38 % No
Table 13 - Support counts of the candidate 2-itemsets in C2; the frequent ones form F2
• GenerateRules(F, fminconf): this function generates all confident fuzzy
association rules from the set F of all frequent fuzzy itemsets.
In the example above, after finishing the first phase (finding all frequent
itemsets), we get F = F1 ∪ F2 = {{3}, {6}, {8}, {9}, {3, 6}, {6, 8}}
(F3 is not created because C3 is empty). The following table lists the
discovered fuzzy association rules.
No. Fuzzy association rule or 1-itemset Support Confidence
1 Old people 76 %
2 Blood sugar ≤ 120 mg/ml 85 %
3 Not suffering from heart disease 54 %
4 Suffering from heart disease 46 %
5 Old people => Blood sugar ≤ 120 mg/ml 62 % 82 %
6 Blood sugar ≤ 120 mg/ml => Old people 62 % 73 %
7 Blood sugar ≤ 120 mg/ml => Not suffering from heart disease 46 % 54 %
8 Not suffering from heart disease => Blood sugar ≤ 120 mg/ml 46 % 85 %
Table 14 - Fuzzy association rules generated from the database in table 8
Since the minimum confidence is 70%, rule 7 (confidence 54%) is rejected.
3.2.4 Relation between Fuzzy Association Rules and Quantitative
Association Rules
According to formula (3.7), the membership function of each fuzzy set f is
attached to a threshold wf. Based on this threshold, we can defuzzify a fuzzy
association rule, converting it into a form similar to a quantitative one.
For example, the fuzzy rule "Old people => Blood sugar ≤ 120 mg/ml, with
support 62% and confidence 82%" in table 14 can be changed into the rule
"Age ≥ 46 => Blood sugar ≤ 120 mg/ml, with support 62% and confidence 82%".
The smallest membership value of the attribute [Age, Age_Old] that is greater
than or equal to wAge_Old (= 0.5) is 0.67. The age corresponding to the fuzzy
value 0.67 is 46, and any person aged 46 or more has a fuzzy value of at least
0.67. Therefore, we substitute "Age ≥ 46" for "Age_Old". Any fuzzy association
rule can be changed into a quantitative one in the same way.
Because most fuzzy membership functions have smooth graphs and are piecewise
monotonic (their derivatives change sign at most a few times), this
defuzzification is relatively simple.
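The following Python fragment sketches this conversion; it assumes the
membership function is non-decreasing over the scanned range, as Age_Old is
for increasing ages.

def fuzzy_to_quantitative(mu, w_f, lo, hi, step=1):
    """Smallest x in [lo, hi] with mu(x) >= w_f, or None if no such x exists."""
    x = lo
    while x <= hi:
        if mu(x) >= w_f:
            return x
        x += step
    return None

# With the Age_Old membership of the running example and w_f = 0.5, the search
# returns 46, turning "Age is Age_Old" into the condition "Age >= 46".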
3.2.5 Experiments and Conclusions
• Experiments varying the database size (increasing or decreasing the number of
records) to measure the mining time.
• Experiments varying fminsup and fminconf.
• Experiments varying the thresholds associated with the fuzzy sets.
• Experiments varying the choice of T-norm operator (min function versus
ordinary multiplication).
• Experiments on the transformation from fuzzy association rules to
quantitative ones.
Chapter 4. Parallel Mining of Fuzzy Association Rules
One of the most essential and time-consuming tasks in association rules mining is
finding all frequent itemsets in immense volumes of data. It demands both much
CPU time (it is CPU-bound) and many I/O operations (it is I/O-bound). Thus,
researchers have continually tried to improve the existing algorithms or devise
new ones in order to speed up the whole mining process [11] [13] [20] [24] [30]
[35]. Most of these algorithms are sequential and work efficiently on small or
medium databases (sizes being judged by the number of attributes and records).
However, they lose performance and expose shortcomings on extremely large
databases (usually hundreds of megabytes or more), due to the limits of the
processor speed and internal memory capacity of a single computer.
Fortunately, with the explosive development of the hardware industry, high
performance computing systems have come onto the market. This has opened up a
new research direction in the data mining community: since 1995, researchers
have continually devised efficient parallel and distributed algorithms for
association rules mining [5] [12] [18] [26] [31] [32] [34]. These algorithms
are diverse because they depend tightly on the architectures of the various
parallel computing systems.
In the first section of this chapter, I briefly present previous parallel
algorithms that have been successfully tested on real databases. In the second
section, I propose a novel parallel algorithm for mining fuzzy association
rules. It has been tested on a Linux-based PC cluster using the Message Passing
Interface (MPI) standard [27] [28] [29] and has returned promising results. The
algorithm comes close to the ideal in that it strongly reduces the data
communication and synchronization among processors. However, it can mine only
fuzzy or quantitative association rules, and it suits relational rather than
transactional databases.
4.1 Several Previously Proposed Parallel Algorithms
In this section, I restate some existing parallel algorithms. They are all
designed to run on shared-nothing parallel systems, which have the following
characteristics:
• A shared-nothing system has N processors. Each processor Pi has its own
internal memory (RAM) and external memory (hard disk).
• The N processors communicate with one another over a high-speed network
using a message passing mechanism.
4.1.1 Count Distribution Algorithm
The Count Distribution approach proposed in [34] is an Apriori-like algorithm
[35]. In this algorithm, N denotes the number of processors, Pi stands for the
ith processor, and Di is the data partition of processor Pi (the original
database D is divided into N equal partitions, each placed on the hard disk of
one processor). The algorithm consists of the following steps:
• (1) Pass one (k = 1): the first pass is special; at its end, every processor
in the system holds the complete set of frequent 1-itemsets, L1.
• (2) Subsequent passes (for every k > 1): the algorithm repeatedly performs
the following steps:
o (2.1): Each processor Pi generates the complete Ck using the complete set of
frequent itemsets Lk-1 created in pass k-1. Observe that since each processor
has the identical Lk-1, all processors generate an identical Ck.
o (2.2): Processor Pi makes a pass over its data partition Di and develops
local support counts for the candidate itemsets in Ck. This is the parallel
step: all processors independently scan their own data partitions.
o (2.3): Processor Pi exchanges its local Ck support counts with all other
processors to develop the global Ck support counts. The processors are forced
to synchronize in this step.
o (2.4): Each processor Pi now computes Lk from Ck based on the minsup parameter.
o (2.5): Each processor Pi independently decides whether to terminate or to
continue with the next iteration. The decision is identical on all processors,
as they all have the identical Lk.
The figure below illustrates the Count Distribution algorithm.
Figure 7 - Count distribution algorithm on a 3-processor parallel system
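For concreteness, here is a minimal sketch of one Count Distribution pass
written with mpi4py (an assumption made here; the algorithms in this chapter
are defined independently of any particular MPI binding). count_supports is a
hypothetical helper that scans the local partition and returns one count per
candidate in Ck.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def count_distribution_pass(Ck, local_partition, minsup_count, count_supports):
    # step 2.2: scan only the local partition Di
    local = np.array(count_supports(Ck, local_partition), dtype=np.int64)
    global_counts = np.empty_like(local)
    # step 2.3: sum the per-candidate counts across all processors (synchronizes)
    comm.Allreduce(local, global_counts, op=MPI.SUM)
    # step 2.4: every processor derives the identical Lk
    return [c for c, n in zip(Ck, global_counts) if n >= minsup_count]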
4.1.2 Data Distribution Algorithm
The attractive feature of the Count Distribution algorithm is that no data
tuples are exchanged between processors; only support counts are exchanged.
Thus, processors can operate independently and asynchronously while reading the
data. The disadvantage, however, is that the algorithm does not exploit the
aggregate memory of the system effectively. Suppose each processor has memory
of size |M|. The number of candidate itemsets that can be counted in one pass
is determined by |M|. As we increase the number of processors from 1 to N, the
system has N × |M| total memory, but we still count the same number of
candidate itemsets in one pass, as every processor counts an identical Ck.
The Data Distribution algorithm is designed to better exploit the total system
memory as the number of processors increases. In this algorithm, each processor
counts mutually exclusive candidate itemsets; thus, as the number of processors
increases, a larger number of candidate itemsets can be counted in one pass.
The downside is that every processor must broadcast its local data to all other
processors in every pass. Therefore, this algorithm is feasible only on a
machine with very fast and stable communication.
Data Distribution is also based on Apriori. Again, N denotes the number of
processors, Pi is the ith processor, and Di is the data partition of processor
Pi (the original database D is divided into N equal partitions, each placed on
the hard disk of one processor). The algorithm includes the following steps:
• (1) Pass one (k = 1): the same as in the Count Distribution algorithm.
• (2) Subsequent passes (for every k > 1):
o (2.1): Processor Pi generates Ck from Lk-1 but retains only 1/N of the
itemsets, forming the candidate subset Cki that it will count. Which itemsets
are retained is determined by the processor identifier and can be computed
without communicating with the other processors; in a real implementation,
itemsets are assigned in round-robin fashion. The Cki sets are pairwise
disjoint, and their union is the original Ck (i.e., Cki ∩ Ckj = ∅ for i ≠ j,
and Ck1 ∪ Ck2 ∪ … ∪ CkN = Ck).
o (2.2): Processor Pi develops support counts for the itemsets in its local
candidate set Cki, using both its local data pages and the data pages received
from other processors.
o (2.3): At the end of the pass over the data, each processor Pi calculates Lki
from its local Cki. Again, the Lki sets are pairwise disjoint, and their union
is Lk (i.e., Lki ∩ Lkj = ∅ for i ≠ j, and Lk1 ∪ Lk2 ∪ … ∪ LkN = Lk).
o (2.4): The processors exchange their Lki so that every processor obtains the
complete Lk for generating Ck+1 in the next pass. This step requires the
processors to synchronize. Having obtained the complete Lk, each processor can
independently (but identically) decide whether to terminate or to continue with
the next iteration.
The following figure depicts this algorithm:
Figure 8 - Data distribution algorithm on a 3-processor parallel system
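The heart of step (2.1) can be sketched in a few lines of Python: every
processor keeps only its round-robin share of the candidates, so the Cki are
pairwise disjoint, their union is Ck, and no communication is needed to agree
on the split.

def my_candidates(Ck, rank, num_procs):
    # the split depends only on the processor's rank, so it is computed locally
    return [c for idx, c in enumerate(Ck) if idx % num_procs == rank]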
4.1.3 Candidate Distribution Algorithm
Both Count and Data distribution algorithms require processors to synchronize at
the end of a pass to exchange support counts or frequent itemsets respectively. If
the workload is not perfectly balanced, this can cause all the processors to wait for
whichever processor finishes last in every pass. The Candidate distribution
algorithm attempts to do away with the dependence between processors so that
they may proceed independently without synchronizing at the end of every pass.
In some pass l, where l is heuristically determined, this algorithm divides the
frequent itemsets Ll-1 between the processors in such a way that each processor
Pi can generate a unique Cmi (m ≥ l) independently of all other processors
(Cmi ∩ Cmj = ∅, ∀i ≠ j). At the same time, the data is selectively replicated
so that each processor can count the candidates in Cmi independently of all
other processors. The choice of the redistribution pass is a trade-off between
decoupling the processors as soon as possible and waiting until the itemsets
become more easily and equitably partitionable. The partitioning algorithm
exploits the semantics of the Apriori candidate generation procedure described
before.
After this candidate distribution, the only dependence that a processor has on other
processors is for pruning the local candidate set during the prune step of candidate
generation. However, a processor does not wait for the complete pruning
information to arrive from all other processors. During the prune step of candidate
generation, it prunes the candidate set as much as possible using whatever
information has arrived, and opportunistically starts counting the candidates. The
late arriving pruning information can be used in subsequent passes. The algorithm
is described below:
• (1) Passes k < l: use either the Count or the Data Distribution algorithm.
• (2) Pass k = l:
o (2.1): Partition Lk-1 among the N processors such that the partitions are
"well balanced" (we discuss below how this partitioning is done). Record, with
each frequent itemset in Lk-1, which processor it has been assigned to. This
partitioning is done identically, in parallel, by every processor.
o (2.2): Processor Pi produces Cki logically using only the Lk-1 partition
assigned to it. Note that Pi still has access to the complete Lk-1, and hence
can use standard pruning while generating Cki in this pass.
o (2.3): Pi develops global counts for the candidates in Cki, and the database
is repartitioned into DRi at the same time. The details of this step are given
below.
o (2.4): After Pi has processed all of its local data and any data received
from other processors, it posts N-1 asynchronous receive buffers to receive the
Lkj from all other processors. These Lkj are needed for pruning Ck+1i in the
prune step of candidate generation.
o (2.5): Processor Pi computes Lki from Cki and asynchronously broadcasts it to
the other N-1 processors using N-1 asynchronous sends.
• (3) Passes k > l:
o (3.1): Processor Pi collects all frequent itemsets that have been sent to it
by other processors. They are used in the prune step of candidate generation,
but not in the join step. Itemsets received from processor Pj may be of length
k-1, shorter than k-1 (a slower processor), or longer than k-1 (a faster
processor). Pi keeps track, for each processor Pj, of the largest size of
frequent itemset sent by it. Receive buffers for the frequent itemsets are
reposted after processing.
o (3.2): Pi creates Cki using its local Lk-1i. It can happen that Pi has not
yet received Lk-1j from all other processors, so Pi must be careful at pruning
time: it needs to distinguish a (k-1)-long subset of a candidate that is not
present in any Lk-1j from one that is present in some Lk-1j which has not yet
been received. It does so by probing Ll-1 (recall that repartitioning took
place in pass l) with the (l-1)-long prefix of the itemset in question, finding
the processor responsible for it, and checking whether Lk-1j has been received
from that processor.
o (3.3): Pi makes a pass over DRi and counts Cki. It then computes Lki and
asynchronously broadcasts Lki to every other processor using N-1 asynchronous
sends.
Partitioning: We motivate the algorithm for partitioning Lk with an example.
Let L3 be {ABC, ABD, ABE, ACD, ACE, BCD, BCE, BDE, CDE}. Then L4 = {ABCD, ABCE,
ABDE, ACDE, BCDE}, L5 = {ABCDE}, and L6 = ∅. Consider ε = {ABC, ABD, ABE},
whose members all share the common prefix AB. Note that the candidates ABCD,
ABCE, ABDE, and ABCDE also have the prefix AB: the Apriori candidate generation
procedure generates these candidates by joining only the items in ε.
Therefore, assuming that the items within itemsets are lexicographically
ordered, we can partition the itemsets in Lk based on their common (k-1)-long
prefixes. By ensuring that no partition is assigned to more than one processor,
we ensure that each processor can create its candidates independently (ignoring
the prune step). Suppose we also repartition the database so that any tuple
supporting an itemset contained in any of the Lk partitions assigned to a
processor is copied to the local disk of that processor. The processor can then
proceed entirely independently, without further synchronization.
The actual algorithm is more involved, for two reasons. First, a processor may
have to obtain frequent itemsets computed by other processors for the prune
step of candidate generation: in the example above, the processor assigned the
set ε has to know whether BCDE is frequent in order to decide whether to prune
the candidate ABCDE, yet the set with prefix BC may have been assigned to a
different processor. Second, the load must be balanced across processors.
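The prefix grouping itself is straightforward; the sketch below (illustrative
only, with the load-balancing assignment to processors left out) groups Lk by
common (k-1)-prefix, which is exactly the granularity at which itemsets must
stay together.

from collections import defaultdict

def partition_by_prefix(Lk):
    """Group lexicographically sorted itemsets by their common (k-1)-prefix."""
    groups = defaultdict(list)
    for itemset in Lk:
        groups[itemset[:-1]].append(itemset)
    return list(groups.values())  # each group is assigned to a single processor

L3 = [tuple(s) for s in ["ABC", "ABD", "ABE", "ACD", "ACE", "BCD", "BCE", "BDE", "CDE"]]
# partition_by_prefix(L3) keeps {ABC, ABD, ABE} together (prefix AB), as in the
# example above, so each group's candidates can be generated by one processor.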
4.1.4 Algorithm for Parallel Generation of Association Rules
To generate rules, for every large itemset l we find all non-empty subsets of l.
For every such subset a, we output a rule of the form a => (l - a) if the ratio
of s(l) to s(a) is at least minconf. We consider all subsets of l in order to
generate rules with multiple consequents. Since the large itemsets are stored
in hash tables, the support counts of the subset itemsets can be found
efficiently.
We can improve the above procedure by generating the subsets of a large itemset
in a recursive, depth-first fashion. For example, given an itemset ABCD, we
first consider the subset ABC, then AB, and so on. If a subset a of a large
itemset l does not generate a rule, then the subsets of a need not be
considered for generating rules from l. For example, if ABC => D does not have
adequate confidence, we need not check whether AB => CD holds. We miss no rules
this way, because the support of any subset a' of a must be at least as great
as the support of a; therefore the confidence of the rule a' => (l - a') cannot
be more than the confidence of a => (l - a). Hence, if a did not yield a rule
involving all the items of l with a as the antecedent, neither will a'.
// Simple algorithm: generate rules from every frequent itemset of size > 1
forall frequent itemsets lk, k > 1 do
  call gen_rules(lk, lk);

// gen_rules generates all valid rules α => (lk - α), for all α ⊂ lk
procedure gen_rules(lk: frequent k-itemset, am: frequent m-itemset)
begin
  A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
  forall am-1 ∈ A do begin
    conf = s(lk) / s(am-1);
    if (conf ≥ minconf) then begin
      output the rule am-1 => (lk - am-1);
      if (m - 1 > 1) then
        call gen_rules(lk, am-1);  // recurse on shorter antecedents
    end
  end
end
Table 15 - The sequential algorithm for generating association rules
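For illustration, here is a Python rendering of the same procedure; support is
a hypothetical mapping from each frequent itemset (as a frozenset) to its
support count.

from itertools import combinations

def gen_rules(lk, am, support, minconf, out):
    """Emit every confident rule am1 => (lk - am1), recursing depth-first."""
    for c in combinations(sorted(am), len(am) - 1):
        am1 = frozenset(c)
        conf = support[frozenset(lk)] / support[am1]
        if conf >= minconf:
            out.append((am1, frozenset(lk) - am1, conf))
            if len(am1) > 1:  # subsets of a confident antecedent may still yield rules
                gen_rules(lk, am1, support, minconf, out)

# Usage: for every frequent itemset lk with more than one item,
# call gen_rules(lk, lk, support, minconf, rules).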
Generating rules in parallel simply involves partitioning the set of all
frequent itemsets among the processors. Each processor then generates rules for
its own partition only, using the algorithm above. Since the number of rules
that can be generated from an itemset is sensitive to the itemset's size, we
attempt equitable balancing by partitioning the itemsets of each length equally
across the processors.
Note that when calculating the confidence of a rule, a processor may need to
examine the support of an itemset for which it is not responsible; for this
reason, every processor must have access to all the frequent itemsets before
rule generation can begin. This is not a problem for the Count and Data
Distribution algorithms, because at the end of the last pass all processors
hold all the frequent itemsets. In the Candidate Distribution algorithm, fast
processors may need to wait until slower processors have discovered and
transmitted all of their frequent itemsets. For this reason, and because the
rule generation step is relatively cheap, it may be better in the Candidate
Distribution algorithm to simply discover the frequent itemsets in parallel and
generate the rules off-line, possibly on a serial processor. This would allow
processors to be freed for other jobs as soon as they finish finding frequent
itemsets, even while other processors in the system are still working.
4.1.5 Other Parallel Algorithms
In addition to the four algorithms above, we also list here some other parallel
algorithms for mining association rules.
The Intelligent Data Distribution algorithm [12] builds on the ordinary Data
Distribution algorithm with an improvement in the data communication among
processors: instead of transmitting data between every pair of processors, it
transmits data along a logical ring set up when the algorithm starts.
The Multiple Local Frequent Pattern Tree (MLFPT) algorithm [31] is derived from
FP-growth. It needs no candidate generation, reduces the number of passes over
the database, and also addresses load balancing among processors.
Another parallel algorithm, working on shared-everything systems (typically SMP
systems), was proposed in [26].
4.2 A New Parallel Algorithm for Fuzzy Association Rules Mining
Almost all known parallel algorithms need, to a greater or lesser degree, data
communication and synchronization among processors, which adds complexity to
their real implementations; hence they are not considered "ideal" parallel
computing problems. Based on the approach to fuzzy association rules described
above, I propose a new parallel algorithm for mining this kind of rule. It is
nearly ideal in that little communication takes place during processing: data
communication happens only twice, once at startup for dividing and delivering
the fuzzy attributes among the processors, and once for gathering the rules as
the algorithm finishes.
4.2.1 Our Approach
According to the fuzzy association rules mining algorithm mentioned in chapter
three, each attribute iu in I is associated with a set of fuzzy sets uiF as follows:
{ }kiiii uuuu fffF ,...,, 21=
For example, in the sample database in table 8, we have:
1i
F = FAge = {Age_Young, Age_Middle-aged, Age_Old} (with k = 3)
2i
F = FSerum_cholesterol = {Cholesterol_Low, Cholesterol_High} (with k = 2)
The fuzzy association rule has the form:
X is A ⇒ Y is B
where:
• X, Y ⊆ I are itemsets: X = {x1, x2, …, xp} (xi ≠ xj if i ≠ j) and
Y = {y1, y2, …, yq} (yi ≠ yj if i ≠ j).
• A = {fx1, fx2, …, fxp} and B = {fy1, fy2, …, fyq} are sets of fuzzy sets
corresponding to the attributes in X and Y, with fxi ∈ Fxi and fyj ∈ Fyj.
Each fuzzy attribute is a pair consisting of an original attribute name and a fuzzy set name.
For instance, with I = {Age, SerumCholesterol, BloodSugar, HeartDisease}, we
now have the set of fuzzy attributes IF as:
IF = {[Age, Age_Young] (1), [Age, Age_Middle-aged] (2), [Age, Age_Old] (3),
[Cholesterol, Cholesterol_Low] (4), [Cholesterol, Cholesterol_High] (5), [BloodSugar,
BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
Table 16 - The set of fuzzy attributes obtained by fuzzifying the database in table 8
After fuzzification, IF consists of 9 fuzzy attributes, compared with the 4
attributes in I, and the value of every record at each fuzzy attribute in the
record set TF lies in the range [0, 1]. The objective of the mining problem is
to discover all confident fuzzy association rules.
We observe that no fuzzy association rule (in either its antecedent or its
consequent) should contain two fuzzy attributes that share a common original
attribute in I. For example, rules such as "Age_Old AND Cholesterol_High AND
Age_Young => HeartDisease_Yes" or "BloodSugar > 120mg/ml AND HeartDisease_No =>
HeartDisease_Yes" are invalid: the former contains both Age_Old and Age_Young
(both derived from Age), and the latter contains both HeartDisease_No and
HeartDisease_Yes (both derived from HeartDisease).
There are two chief reasons for this restriction. First, fuzzy attributes
sharing a common original attribute are usually mutually exclusive in meaning,
so they would largely decrease the support of any rule containing them
together; for example, Age_Old is semantically opposite to Age_Young, because
nobody is "both young and old". Second, such a rule is not worthwhile and
carries little meaning.
Thus, we can require that all fuzzy attributes in the same rule be independent,
in the sense that no two of them share the same original attribute. This
observation is the foundation of our new parallel algorithm.
I will sketch our idea through a simple example. Suppose we run the algorithm
on the database in table 8 on a 6-processor parallel system. We need to divide
the set of fuzzy attributes IF among the processors, so that the processors can
operate in parallel and independently, as follows.
The processor P1:
IF1 = {[Age, Age_Young] (1), [Cholesterol, Cholesterol_Low] (4),
[BloodSugar, BloodSugar_0] (6), [BloodSugar, BloodSugar_1] (7),
[HeartDisease, HeartDisease_No] (8), [HeartDisease, HeartDisease_Yes] (9)}
= {1, 4, 6, 7, 8, 9}
The processor P2: IF2 = {1, 5, 6, 7, 8, 9}
The processor P3: IF3 = {2, 4, 6, 7, 8, 9}
The processor P4: IF4 = {2, 5, 6, 7, 8, 9}
The processor P5: IF5 = {3, 4, 6, 7, 8, 9}
The processor P6: IF6 = {3, 5, 6, 7, 8, 9}
We divide IF based on the first two attributes, Age and Cholesterol. The 9
initial fuzzy attributes are distributed among the 6 processors, and each
processor receives 6 fuzzy attributes. This division is "ideal" because the
number of processors (6) equals the product of the number of fuzzy sets
associated with Age (3) and the number associated with Cholesterol (2), i.e.,
6 = 3 × 2.
An optimal division is one that spreads the fuzzy attributes equally over all
processors in the system. When an optimal division cannot be obtained, we
settle for "the most reasonable" one, in which case some processors may sit
idle at times while others carry more load. Below, I present an algorithm for
dividing the fuzzy attributes; it first tries to find an optimal solution and,
failing that, returns "the most reasonable" one. The division algorithm is
formally described below:
Given a database D, let I = {i1, i2, …, in} be its set of n attributes and
T = {t1, t2, …, tm} its set of m records. After fuzzification, D, I, and T are
converted into DF, IF, and TF respectively:
IF = {[i1, fi11], …, [i1, fi1k1], [i2, fi21], …, [i2, fi2k2], …, [in, fin1], …, [in, finkn]}
where fiju is the uth fuzzy set associated with attribute ij, and kj is the
number of fuzzy sets associated with ij. For example, for the database in
table 8 we have I = {Age, SerumCholesterol, BloodSugar, HeartDisease}, and
after conversion we receive the IF shown in table 16. In this case, k1 = 3,
k2 = 2, k3 = 2, and k4 = 2 are the numbers of fuzzy sets associated with the
original attributes in I.
Let FN = {k1} ∪ {k2} ∪ … ∪ {kn} = {s1, s2, …, sv} (v ≤ n, since some pairs
(ki, kj) may be equal), and let N be the number of processors in the system.
The division task then becomes the following problem: find a non-empty subset
Fn of FN such that the product of the elements of Fn equals N (the optimal
solution). When no optimal solution exists, the algorithm returns "the most
reasonable" one, i.e., a subset whose product approximates N from below as
closely as possible.
The algorithm follows a backtracking strategy and is implemented iteratively,
without recursion.
The division algorithm:

BOOLEAN Subset(FN, N, Idx)
  k = 1;
  Idx[1] = 0;
  S = 1;                       // running product of the chosen elements
  while (k > 0) {
    Idx[k]++;                  // try the next element at depth k
    if (Idx[k] <= sizeof(FN)) {
      if (S * FN[Idx[k]] <= N) {
        if (S * FN[Idx[k]] == N)
          return TRUE;         // exact product found: Idx[1..k] is the answer
        else {
          S *= FN[Idx[k]];     // take FN[Idx[k]] and descend one level
          Idx[k + 1] = Idx[k];
          k++;
        }
      }
    } else {
      k--;                     // this depth is exhausted: backtrack
      S /= FN[Idx[k]];
    }
  }
  return FALSE;

FindSubset(FN, N, Idx, Fn)
  for (n = N; n > 0; n--)
    if (Subset(FN, n, Idx)) {
      Fn = {FN[i] | i ∈ Idx};
      return;
    }

Table 17 - Algorithm for dividing the fuzzy attributes among processors
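The same search can be written compactly in Python. The sketch below is a
recursive variant of table 17: it prunes with divisibility (an element may be
chosen only if it divides the remaining target), which is equivalent to
requiring the running product to reach n exactly.

def subset_with_product(FN, n, chosen=(), start=0):
    """Backtracking search for indices into FN whose values multiply to n."""
    if n == 1 and chosen:
        return chosen
    for i in range(start, len(FN)):
        if n % FN[i] == 0:  # FN[i] must divide the remaining target
            found = subset_with_product(FN, n // FN[i], chosen + (i,), i + 1)
            if found:
                return found
    return None

def find_subset(FN, N):
    """Try the targets N, N-1, ... and return the first achievable subset."""
    for n in range(N, 0, -1):
        idx = subset_with_product(FN, n)
        if idx:
            return [FN[i] for i in idx]
    return []

print(find_subset([3, 2], 6))  # [3, 2]: the optimal division for 6 processors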
In the above example, Fn is {3, 2}. The algorithm is guaranteed to return "the
most reasonable" solution whenever no optimal one exists, since FindSubset
gradually decreases n, searching the space of products that approximate N from
below.
4.2.2 The New Algorithm
Inputs: the database D with attribute set I and record set T; the number of
processors, referred to as N; and the minimum support and minimum confidence
thresholds, minsup and minconf.
Outputs: all confident fuzzy association rules.
The parallel algorithm for mining fuzzy association rules consists of the
following steps:
• (1): Convert the database D, attribute set I, and record set T into DF, IF,
and TF respectively (the fuzzification process).
• (2): Use the algorithm in table 17 to scatter the fuzzy attributes among the
N processors in the system.
• (3): Each processor Pi uses the sequential algorithm in table 10 to mine
frequent itemsets, and the algorithm in table 15 to generate confident fuzzy
association rules.
• (4): Collect the discovered rules from all processors in the system.
This algorithm applies successfully not only to fuzzy association rules but
also to quantitative and categorical ones.
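A skeleton of these steps in mpi4py might look as follows; assign_partition,
mine_frequent_itemsets, and generate_rules are placeholders for the procedures
of tables 17, 10, and 15 respectively, and the fuzzified DF, IF, and TF are
assumed to be available on every processor.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def parallel_fuzzy_arm(DF, IF, TF, fminsup, fminconf,
                       assign_partition, mine_frequent_itemsets, generate_rules):
    rank, N = comm.Get_rank(), comm.Get_size()
    # step 2: divide the fuzzy attributes; every rank computes the same split
    my_attrs = assign_partition(IF, rank, N)
    # step 3: purely local, sequential mining on this rank's attribute subset
    F = mine_frequent_itemsets(DF, my_attrs, TF, fminsup)
    rules = generate_rules(F, fminconf)
    # step 4: the only other communication: gather all rules at rank 0
    return comm.gather(rules, root=0)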
4.3 Experiments and Conclusions
• Experiments varying the number of attributes and the number of records to
measure the mining time.
• Experiments increasing and decreasing the number of processors in the system.
Chapter 5. Conclusions
5.1 Achievements throughout the dissertation
Building on previous research results in the field of data mining, this
dissertation not only sums up the most noticeable features of this research
direction but also presents several new proposals. We summarize here the chief
achievements of the thesis.
In the first chapter, the thesis outlines the overall picture of data mining:
some descriptive definitions as well as its motivation. This chapter also
surveys the dominant techniques applied to subproblems such as clustering,
classification/prediction, and association rules mining. Finally, the chapter
enumerates the research trends that have recently received the most attention
in this domain.
Chapter two formally describes the problem of association rules mining as
proposed by R. Agrawal in 1993, and gives the most fundamental concepts, such
as itemset, frequent itemset, and confident rule. Moreover, this chapter
depicts some of the research topics that have attracted the most interest in
recent years, such as fuzzy association rules and parallel mining of
association rules. The core goal of the chapter is to present the background of
association rules mining to readers; a thorough understanding of it makes the
next two chapters much easier to study.
On the foundation of the earlier proposals in [4] [9] [38] [39], chapter three
presents the problem of quantitative and categorical association rules mining,
together with its advantages and disadvantages. The main goal of the chapter,
however, is to study in depth a special kind of rule: the fuzzy association
rule. This type of rule is much more flexible, readable, and intuitive than the
elementary kind described in the previous chapter. The treatment of this kind
of rule in [4] [9] is so brief that it fails to emphasize the subtle relation
between fuzzy association rules and fuzzy logic theory. The thesis therefore
also explains why we choose the min function and algebraic multiplication for
the T-norm operator in formula (3.6). In addition, the chapter restates the
algorithm for mining fuzzy association rules of [4] [9], which is based on
Apriori with slight customizations. Finally, the chapter offers a method for
transforming fuzzy association rules into quantitative ones.
Chapter four proposes a novel parallel algorithm for mining fuzzy association
rules. Under this algorithm, the processors largely reduce the amount of
information that must be communicated during processing. The algorithm can be
considered close to "ideal" thanks to its strategy for distributing the
original set of fuzzy attributes among the processors: the division is both
balanced and intelligent, in that the resulting partitions are of equal size
and each processor can operate on its partition independently. However, the
algorithm has some drawbacks. First, it applies only to mining fuzzy or
quantitative association rules. Second, it works well only on shared-nothing
parallel systems.
While writing this thesis, I have tried my best to research and consult many
reference documents. Nevertheless, the thesis cannot avoid some mistakes and
shortcomings, and I would appreciate any comments and suggestions from readers,
on both technical and editorial matters.
5.2 Future research
Association rules mining is attractive to many researchers because of its
widespread applications in various areas, and because it contains many
potential research directions. In this dissertation, I have focused on fuzzy
association rules and a parallel algorithm for mining this kind of rule. In
the future, the following research directions will be taken into consideration:
• Mining fuzzy association rules with weighted attributes. The major goal of
this kind of rule is to measure the "contribution level" of each attribute to
the overall rule. For example, when mining association rules on data related to
heart disease, information from attributes such as blood pressure, blood sugar,
and cholesterol is much more critical than that from body weight and age; thus
we attach higher weights to blood pressure, blood sugar, and cholesterol than
to body weight and age.
• The parallel algorithm recommended in chapter four works only on
shared-nothing systems. In the near future, we would like to concentrate on
designing parallel algorithms for other architectures, such as SMP.
• Although association rules mining is largely independent of the kind of data
being mined, we would like to focus on a specific database in order to improve
the proposed algorithms and tune the related parameters.
References
Vietnamese documents:
[1]. [PDD99] Phan Dinh Dieu. Logic in knowledge systems. Faculty of
Technology, Vietnam National University, Hanoi – 1999.
[2]. [DMT03] Dinh Manh Tuong. Artificial Intelligence. Faculty of
Technology, Vietnam National University, Hanoi – 2003.
English documents:
[3]. [AR95] Alan Rea. Data Mining – An Introduction. The Parallel Computer
Centre, The Queen's University of Belfast.
[4]. [AG00] Attila Gyenesei. A Fuzzy Approach for Mining Quantitative
Association Rules. Turku Centre for Computer Science, TUCS Technical
Reports, No 336, March 2000.
[5]. [AM95] Andreas Mueller. Fast Sequential and Parallel Algorithms for
Association Rule Mining: A Comparison. Department of Computer
Science, University of Maryland-College Park, MD 20742.
[6]. [LHM99] Bing Liu, Wynne Hsu, and Yiming Ma. Mining Association
Rules with Multiple Minimum Supports. In ACM SIGKDD International
Conference on KDD & Data Mining (KDD-99), August 15-18, 1999, San
Diego, CA, USA.
[7]. [KV01] Boris Kovalerchuk and Evgenii Vityaev. Data Mining in Finance –
Advances in Relational and Hybrid Methods. Kluwer Academic Publishers,
Boston – Dordrecht - London. 2001.
[8]. [MHT02] Bui Quang Minh, Phan Xuan Hieu, Ha Quang Thuy. Some
Parallel Computing Experiments with PC-Cluster. In Proc. of Conference
on IT of Faculty of Technology, VNUH. Hanoi 2002.
[9]. [KFW98] Chan Man Kuok, Ada Fu, and Man Hon Wong. Mining Fuzzy
Association Rules in Databases. Department of Computer Science and
Engineering, The Chinese University of Hong Kong, Shatin, New
Territories, Hong Kong.
[10]. [THH02] Do Van Thanh, Pham Tho Hoan, and Phan Xuan Hieu. Mining
Association Rules with different supports. In Proc. of the National
Conference on Information Technology, Nhatrang, Vietnam, May 2002.
[11]. [BCJ01] Doug Burdick, Manuel Calimlim, and Johannes Gehrke. MAFIA:
A Maximal Frequent Itemset Algorithm for Transactional Databases.
Department of Computer Science, Cornell University.
[12]. [HKK97] Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar.
Scalable Parallel Data Mining for Association Rules. Department of
Computer Science, University of Minnesota, 4-192 EECS Building, 200
Union St. SE, Minneapolis, MN 55455, USA.
[13]. [PHM01] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An Efficient
Algorithm for Mining Frequent Closed Itemsets. Intelligent Database
Systems Research Lab, School of Computing Science, Simon Fraser
University, Burnaby, B.C., Canada.
[14]. [HK02] Jiawei Han and Micheline Kamber. Data Mining: Concepts and
Techniques. University of Illinois, Morgan Kaufmann Publishers 2002.
[15]. [HF95] Jiawei Han and Yongjian Fu. Discovery of Multiple-Level
Association Rules from Large Databases. In Proc. of the 21st International
Conference on Very Large Databases, Zurich, Switzerland, Sep 1995.
[16]. [LZDRS99] Jinyan Li, Xiuzhen Zhang, Guozhu Dong, Kotagiri
Ramamohanarao, and Qun Sun. Efficient Mining of High Confidence
Association Rules without Support Thresholds. Department of CSSE, The
University of Melbourne, Parkville, Vic, 3052, Australia.
[17]. [HG00] Jochen Hipp, Ulrich Guntzer, and Gholamreza Nakhaeizadeh.
Algorithms for Association Rule Mining – A General Survey and
Comparison. ACM SIGKDD, July 2000, Volume 2, Issue 1 – page 58 – 64.
[18]. [PCY95] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient
Parallel Data Mining for Association Rules. In Fourth International
Conference on Information and Knowledge Management, Baltimore,
Maryland, Nov 1995.
[19]. [PYC98] Jong Soon Park (Sungshin Women’s Univ, Seoul, Korea), Philip
S. Yu (IBM T.J. Watson Res. Ctr.), and Ming-Syan Chen (National Taiwan
Univ., Taipei, Taiwan). Mining Association Rules with Adjustable
Accuracy.
[20]. [MTV94] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.
Efficient Algorithms for Discovering Association Rules. In KDD-1994:
AAAI Workshop on Knowledge Discovery in Databases, pages 181-192,
Seattle, Washington, July 1994.
[21]. [LAZ65] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965.
[22]. [KMRTV94] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and
A.I. Verkamo. Finding Interesting Rules from Large Sets of Discovered
Association Rules. In Proc. 3rd International Conference on Information
and Knowledge Management, pages 401-408, Gaithersburg, Maryland,
November 1994.
[23]. [MM00] Manoel Mendonca. Mining Software Engineering Data: A Survey.
University of Maryland, Department of Computer Science, A. V. Williams
Building #3225 College Park, MD 20742. 2000.
[24]. [ZH99] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An Efficient
Algorithm for Closed Association Rules Mining. RPI Technical Report 99-
10, 1999.
[25]. [ZO98] Mohammed J. Zaki and Mitsunori Ogihara. Theoretical
Foundations of Association Rules. In 3rd ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, June 1998.
[26]. [ZPO01] Mohammed J. Zaki, Srinivasan Parthasarathy, and Mitsunori
Ogihara. Parallel Data Mining for Association Rules on Shared-Memory
Systems. In Knowledge and Information Systems, Vol. 3, Number 1, pages
1-29 February 2001.
[27]. [MPIS95] MPI: A Message-Passing Interface Standard, Message Passing
Interface Forum. June 12, 1995.
[28]. [EMPI97] MPI-2: Extensions to the Message-Passing Interface, Message
Passing Interface Forum, July 18, 1997
[29]. [JDMPI97] MPI-2 Journal of Development, Message Passing Interface
Forum, July 18, 1997.
[30]. [PBTL99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal.
Discovering Frequent Closed Itemsets for Association Rules. Laboratoire
d’Informatique, Université Blaise Pascal – Clermont-Ferrand II, Complexe
Scientifique des Cézeaux.
[31]. [ZHL98] Osmar R. Zaiane, Mohammad El-Hajj, and Paul Lu. Fast Parallel
Association Rule Mining Without Candidacy Generation. University of
Alberta, Edmonton, Alberta, Canada.
[32]. [DP01] Qin Ding and William Perrizo. Using Active Networks in Parallel
Mining of Association Rules. Computer Science Department, North Dakota
State University, Fargo ND 58105-5164.
[33]. [AY98] R. Agrawal and P. Yu. Online Generation of Association Rules. In
IEEE International Conference on Data Mining, February 1998.
[34]. [AS96] Rakesh Agrawal and John Shafer. Parallel mining of association
rules: Design, implementation and experience. Research Report RJ 10004,
IBM Almaden Research Center, San Jose, California, February 1996.
[35]. [AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for
Mining Association Rules. In Proc. of the 20th International Conference on
Very Large Databases, Santiago, Chile, Sep 1994.
[36]. [AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining
association rules between sets of items in large databases. In Proc. of
the ACM SIGMOD Conference on Management of Data, pages 207-216,
Washington, D.C., May 1993.
[37]. [SA95] Ramakrishnan Srikant and Rakesh Agrawal. Mining Generalized
Association Rules. In Proc. of the 21st International Conference on Very
Large Databases, Zurich, Switzerland. Sep 1995.
[38]. [SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining Quantitative
Association Rules in Large Relational Tables. IBM Almaden Research
Center, San Jose, CA 95120.
[39]. [MY98] R. J. Miller and Y. Yang. Association Rules over Interval Data.
Department of Computer & Information Science, Ohio State University,
USA.
[40]. [PMPIP] RS/6000 SP: Practical MPI Programming, Yukiya Aoyama &
Jun Nakano, International Technical Support Organization,
www.redbooks.ibm.com
[41]. [MS00] T. Murai and Y. Sato. Association Rules from a Point of View of
Modal Logic and Rough Sets. In Proc. of the Fourth Asian Fuzzy
Symposium, May 31 - June 3, 2000, Tsukuba, Japan, pp. 427-432.
[42]. [HHMT02] Tran Vu Ha, Phan Xuan Hieu, Bui Quang Minh, and Ha Quang
Thuy. A Model for Parallel Association Rules Mining from The Point of
View of Rough Set. In Proc. of International Conference on East-Asian
Language Processing and Internet Information Technology, Hanoi,
Vietnam, January 2002.
[43]. [FSSU96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth,
and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data
Mining. AAAI Press / The MIT Press 1996.
[44]. [WYY01] Wei Wang, Jiong Yang, and Philip S. Yu. Efficient Mining of
Weighted Association Rules (WAR). IBM Watson Research Center.
[45]. [WMPPP] Writing Message-Passing Parallel Programs with MPI, Neil
MacDonald, Elspeth Minty, Tim Harding, Simon Brown, Edinburgh
Parallel Computing Centre, The University of Edinburgh.
[46]. [ZKM01] Zijian Zheng, Ron Kohavi, and Llew Mason. Real World
Performance of Association Rule Algorithms. Blue Martini Software, 2600
Campus Drive, San Mateo, CA 94403, USA.
[47]. [ZHJ91] Zimmermann H. J. Fuzzy Set Theory and Its Applications. Kluwer
Academic Publishers, 1991.