Applying the attributed prefix tree for mining closed sequential patterns - Pham Thi Thiet

This paper proposed an algorithm for mining closed sequential patterns. This algorithm combined the parent–child relationship between nodes on prefix tree and the definition of closed sequential pattern to determine whether a sequential pattern is a closed sequential pattern by adding IsCSP field into each sequential pattern node on the prefix tree. The algorithm also applied the prime block encoding approach and the join operations over the prime blocks in [3] for generating candidate sequences and determining the frequency for each candidate. Experimental results are examined on the small sequence database also showed that the number of closed sequential patterns is much smaller than that of the sequential patterns. In future, based on this algorithm, we will perform more experimental results with the larger sequence databases such as synthetic and real databases, namely C6T5S4I4N1kD1k, Chess, and Mushroom and so on.

pdf9 trang | Chia sẻ: honghp95 | Lượt xem: 390 | Lượt tải: 0download
Bạn đang xem nội dung tài liệu Applying the attributed prefix tree for mining closed sequential patterns - Pham Thi Thiet, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
Journal of Science and Technology 54 (3A) (2016) 106-114 APPLYING THE ATTRIBUTED PREFIX TREE FOR MINING CLOSED SEQUENTIAL PATTERNS Pham Thi Thiet * , Vo Thi Thanh Van Faculty of Information Technology, Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao Street, Ward 4, Go Vap District, HoChiMinh City Email: phamthithiet@iuh.edu.vn Received: 4 May 2016; Accepted for publication: 28 July 2016 ABSTRACT Mining closed sequential patterns is one of important tasks in data mining. It is proposed to resolve difficult problems in mining sequential pattern such as mining long frequent sequences that contain a combinatorial number of frequent subsequences or using very low support thresholds to mine sequential patterns is usually both time- and memory-consuming. This paper applies the characteristics of closed sequential patterns and sequence extensions into the prefix tree structure to mine closed sequential patterns from the sequence database. The paper uses the parent–child relationship on prefix tree structure and each node on prefix tree is also added fields to determine whether that is a closed sequential pattern or not. Experimental results show that the number of sequential patterns is reduced significantly. Keywords: sequential pattern, closed sequential pattern, prefix tree, sequence database. 1. INTRODUCTION Sequential pattern mining, since it was first introduced by Agrawal [1], has played an important role in data mining tasks with broad applications including market and customer analysis, web log analysis, pattern discovery in protein sequences, and mining XML query access patterns for caching and so on. The sequential pattern mining algorithms proposed so far have a good performance in databases with short frequent sequences [2, 3, 9 - 10, 16]. However, when mining long frequent sequences that contain a combinatorial number of frequent subsequences, such a mining will generate an explosive number of frequent subsequences for long patterns, or when using very low support thresholds to mine sequential patterns, which is prohibitively expensive in both time and space. So, the performance of such algorithms often degrades dramatically. To overcome this difficultly, the problem for mining closed sequential patterns have also been proposed. A sequence S is called closed if there exists no supersequence of S with the same support in the database. For example, sequence pattern (AB)(B)(BC) is a closed sequence pattern of sequence patterns (B)(BC) and (AB)(B) if they have the same support. Applying the attributed prefix tree for mining closed sequential patterns 107 Several studies have been recently proposed to mine closed sequential patterns [4, 11 - 12, 14 - 15]. But these algorithms used the corresponding projected databases of frequent subsequences to find closed sequences. It consumes much time to construct projected databases of frequent subsequences for a set of sequence. In this paper, we propose an efficient algorithm for directly finding closed sequential patterns at the generating sequential patterns process, which is based on the combination of the parent–child relationship and its propertied on prefix tree structure. On the prefix tree in our approach, each node stores a sequential pattern and its corresponding support value. Besides, it will be added one field to consider whether this node is a closed sequential pattern (IsCSP). Based on the IsCSP field added to each node, the algorithm easily determines if a node is a closed sequential pattern so the mining time is reduced significantly. This algorithm also uses join operations over the prime block encoding approach of the prime factorization theory to represent candidate sequences and determine the frequency for each candidate. The experimental results showed that the performance for mining closed sequential patterns in this paper is much better. The rest of this paper is organized as follows. Section 2 reviews some works related to mine closed sequential patterns. Section 3 presents some problem definitions related to sequential patterns/closed sequential pattern and prefix tree. The algorithm for mining closed sequential patterns is discussed in Section 4. Section 5 presents the experimental results, and conclusion and future work are presented in Section 6. 2. RELATED WORK Mining sequential patterns with closed patterns may significantly reduce the number of patterns generated in the mining process without losing any information because it can be used to derive the complete set of sequential patterns; so, the number of closed sequential patterns is usually fewer than the number of sequential patterns. Several studies have been recently proposed to mine closed sequential patterns [4, 11 - 12, 14 - 15]. The CloSpan algorithm [15] has been proposed. Like most of the frequent closed itemset mining algorithms CLOSET [6] and CHARM [17], CloSpan algorithm used the candidate maintenance and test approach. It needs to maintain the set of already mined closed sequence candidates for doing the backward subpattern and backward superpattern check to verify if a newly found frequent sequence is promising to be closed or not. Because CloSpan needs to maintain the set of historical closed sequence candidates, when there are many frequent closed sequences, it will consume much memory and lead to huge search space for pattern closure checking. BIDE [14] is another faster closed sequence mining algorithm. Different from CloSpan, it used a novel sequence closure checking scheme called BI-Directional Extension, and pruned the search space more by using the BackScan pruning method and the ScanSkip optimization technique to directly get the complete set of the frequent closed sequence patterns without candidate maintenance. Thus, in most cases, BIDE is more efficient than CloSpan, especially when a database is dense or the minimum support value is low. But to implement the closure check, the BIDE algorithm spends a lot of time on scanning the pseudo-projected database repeatedly to verify the existence of extension of position with a prefix sequence, which costs much time in the mining process. To reduce the time consumed on scanning the pseudo-projected database for verifying in the BIDE algorithm, the FCSM-PD algorithm was proposed by Huang et al. [4]; the positional data was used to reserve the position information of items in the data sequences. In the pattern growth process, the extension of position with a prefix sequence is checked directly and all the position information of the new prefix sequences will be recorded. The FCSM-PD algorithm must store Thi-Thiet Pham, Van Vo 108 all the position information of a prefix sequence in the process of pattern growth in advance; so it consumes more memory in this algorithm. In 2013, Pham et al. proposed an algorithm called CloGen algorithm [7]. It used the parent–child relationship on prefix tree structure to find closed sequences and their corresponding generators at the same time. The result of CloGen algorithm has been used to generate non-redundant sequential rules [8]. This paper is using the parent– child relationship on prefix tree structure to find only closed sequences and it’s result can be used to mine sequential rules. 3. PROBLEM DEFINITIONS A sequence database SD is a set of sequences S = {s1, s2, , sn} and a set of items I = {i1, i2, , in}, where each sequence sx is an ordered list of itemsets sx ={x1, x2, , xn}, and s1 occurs before s2, which occurs before s3, and so on, such that x1, x2, , xn I. The size of a sequence is the number of itemsets in the sequence. The length of a sequence is the number of instances of items in the sequence. A sequence with length l is called an l-pattern sequence. A sequence with size k is called a k-sequence. Given two sequences = a1 a2 an and = b1 b2 bm (where ai, bi are itemsets), sequence α is called a subsequence of β and β is a supersequence of α, denoted as α β, if there exist integers 1 ≤ j1 < j2 < < jn ≤m (n ≤ m) such that a1 bj1 , a2 bj2 , . . . , an bjn. For example, if α = (ab), d and β = (abc), (de) , where a, b, c, d, and e are items, then α is a subsequence of β and β is a supersequence of α. The support of a sequence α (denoted by Sup(α)) in a sequence database is the number of sequences in the database containing α. Sequence α is a frequent sequence in sequence database SD, if Sup(α) ≥ minSup where minSup is the minimum support threshold defined by user. A frequent sequence is called a sequential pattern. A sequential pattern α is called a closed sequential pattern if and only if such that (i.e., contains ) and Sup(α) = Sup(β). Considering the sequence database in table 1 and minSup = 50%, sequences (AB) , (B)(BC) and (B)(B)(BC) are sequential patterns because their support are greater and equal than minSup ( (AB) has the support value 4, (B)(BC) has the support value 3, and (B)(B)(BC) has the support value 3). Sequential pattern (B)(B)(BC) is a closed sequential pattern (B)(BC) Sequence α is a prefix of β if and only if ai = bi for all 1≤ j≤n. After eliminating the prefix part α of sequence β, the remainder of β is a postfix of β. From the above definition, we know that a sequence of size k has (k-1) prefixes. For example, a sequence (A)(BC)(D) has 2 prefixes: (A) and (A)(BC) . Therefore, (BC)(D) is the postfix for prefix (A) , and (D) is the postfix for prefix (A)(BC) . A prefix tree is similar to a lexicographic tree, which starts from the tree root at level 0. In this paper, the prefix tree is started at the root with a null sequence . Each child node stores a sequential pattern, its support value, one field to consider whether this node is a closed sequential pattern (IsCSP). At level 1, each node is set with a single frequent item; at level k, each node is set with a k-pattern sequence. Recursively, there are nodes at the next level (k+1) after a k-pattern sequence is extended with a single frequent item. There are two ways to extend a k-pattern sequence, namely sequence extension and itemset extension [3]. In sequence extension, a single frequent item from I is added to the k-pattern sequence as a new itemset, increasing the size of the sequence. A sequence α is a prefix of all sequence-extended sequences of α, and α is the prefix of all subnodes of the nodes that are sequence-extended in α. In itemset extension, single frequent item from I that is greater than all items in the last itemset is added to Applying the attributed prefix tree for mining closed sequential patterns 109 the last itemset in the k-pattern sequence. The size of itemset-extended sequences does not change and α is an incomplete prefix of all subnodes of itemset-extended nodes in α. 4. PROPOSED ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS In this section, the extension of a sequence on the prefix tree by performing a depth-first search and the attribute of closed seuquential patterns are used to produce an algorithm for generating all closed sequential patterns. Using the prefix tree, new sequences, which are children nodes Cnode, can be easily created by appending an item to the last position of a parent node Pnode as an itemset extension or a sequence extension. When a new node Pnode is created, if Sup(Pnode) = Sup(Cnode), then we set the IsCSP of Cnode to false and the IsCSP of Pnode to true. First, the algorithm initializes the prefix tree pretree with the root node being null and children nodes being sequential 1-patterns with its IsCSP field as true. Each child node cn on pretree is considered as a root node for EXTEND_TREE(cn, pretree) function to create its children nodes and extend the pretree tree. In this function, each child node of root node P is created by itemset extension or sequence extension. To represent candidate sequences as well as determine the frequency for each candidate, it uses the prime block encoding approach and the join operations over the prime blocks in [3]. With each new child node is created Pnew from P, if Sup(Pnew)=Sup(P), then the value of Pnew.IsCSP is set to true and the value of P.IsCSP is set to false. The CHECK_CLOSED(Pnew, pretree) function is called to update closed sequential patterns on the pretree tree. Finally, the algorithm returns the corresponding pretree tree with the sequential patters which have the corresponding IsCSP values. The details of the proposed algorithm are introduced in Figure 1. Input: SD, minSup. Output: Prefix tree that contains Sequential/Closed sequential patterns. Method: MCSP-PreTree(SD, minSup) pretree ; SPs all frequent 1-pattern sequences; for each pattern P in SPs Add P into pretree as a child node; for each child node r in pretree EXTEND_TREE(r, pretree);//extend tree with root node r return pretree; //...................................... EXTEND_TREE(Root, pretree) EXTEND_ITEMSET(Root, pretree);//create a new child node of root by itemset extension EXTEND_SEQUENCE(Root, pretree );//create a new child node of root by sequence extension For each node Pi that is an itemset extension of Root EXTEND_TREE(Pi, pretree);// recursively call to extend tree with root node Pi For each node Ps that is a sequence extension of Root EXTEND_TREE(Ps, pretree); // recursively call to extend tree with root node Ps // .................................................... EXTEND_ITEMSET(P, pretree) Thi-Thiet Pham, Van Vo 110 For each pattern Pi in SPs with Pi last item in last itemset of P Pnew is a new node created by adding Pi into last position in last itemset of P and using block encoding based on prism factorization in [3] to count the support; If (sup(Pnew) ≥ minSup) Set Pnew as a closed pattern; If (sup(P) = sup(Pnew)) Set P as a non-closed pattern; Else CHECK_CLOSED(Pnew, pretree);//check whether there exist subsequences of Pnew in pretree which are closed sequential patterns. Add Pnew into pretree as a itemset-extended child node of P; //................................................... EXTEND_SEQUENCE(P, pretree) For each 1-pattern Pi in SPs Pnew is a new node created by adding Pi as last itemset into P and using block encoding based on prism factorization in [3] to count the support; If (sup(Pnew) ≥ minSup) Set Pnew as a closed pattern; If (sup(P) = sup(Pnew)) Set P as non-closed pattern; Else CHECK_CLOSED(Pnew, pretree);//check whether there exist subsequences of Pnew in pretree which are closed sequential patterns Add Pnew into pretree as a sequence-extended child node of P; //................................................... CHECK_CLOSED (Pnew, pretree) For each node P in pretree If Pnew is a supersequence of P If P is a closed pattern and sup(P) = sup(Pnew) Set P as a non-closed pattern; subtree = the tree which rooted at P; CHECK_CLOSED(Pnew, subtree); Figure 1. Algorithm for generating set of closed sequential patterns. 5. EXPERIMENTAL RESULTS Figures 2–4 illustrate the process of mining closed sequential patterns on prefix tree using the proposed algorithm for the sequence database as presented in Table 1, where minSup = 50 %. First, the root node of prefix tree is null ({}) with each child node containing: sequential pattern, support value, and IsCSP field that contains one of two values (true T or false F). Level 1 of the prefix tree contains nodes with sequential 1-pattern, their support values and the values of IsCSP set to true as in Figure 2, for example node (B) :5:T . Consider node (A) :4:T in Figure 2, the children nodes of (A) :4:T are created by itemset extension or sequence extension, which include sequential pattern nodes: (A)(B) :3:T, (A)(C) :3:T, (AB) :4:T. The support value of Applying the attributed prefix tree for mining closed sequential patterns 111 the child node containing sequential pattern (AB) is 4, which is same as the support value of the parent node containing (A) , according to the definition of closed sequential pattern, then (A) is not a closed sequential pattern, so the IsCSP value of (A) is set to false T. After updating nodes in the prefix tree, we have the prefix tree as in Figure 3. The above process is repeated for all children nodes of (A) and all remaining nodes on the prefix tree. The final experimental results are shown in Figure 4. The experimental results in Figure 4 show that the number of closed sequential patterns is always smaller than the number of sequential patterns, where the complete set of closed sequential patterns consists of only 8 sequences while the whole set of sequential patterns consists of 21 sequences. Table 1. Sequence database. SID Sequence 1 (AB)(B)(B)(AB)(B)(AC) 2 (AB)(BC)(BC) 3 (B)(AB) 4 (B)(B)(BC) 5 (AB)(AB)(AB)(A)(BC) Figure 2. Level 1 of the prefix tree (each node contains: sequential pattern, support, and IsCSP). Figure 3. Updating nodes in the prefix tree after the children nodes of (A) node are created. { } (B) :5:T (C) :4:T (A) :4:T Itemset extension Sequence extension (B) : 5 :T (C) :4:T (A) :4:F (A)(B) :3:T (AB) :4:T (A)(C) :3:T { } Thi-Thiet Pham, Van Vo 112 Figure 4. The prefix tree of sequence database in Table 1 with minSup = 50 %, which has the corresponding IsCSP values. 5. CONCLUSIONS AND FUTURE WORK This paper proposed an algorithm for mining closed sequential patterns. This algorithm combined the parent–child relationship between nodes on prefix tree and the definition of closed sequential pattern to determine whether a sequential pattern is a closed sequential pattern by adding IsCSP field into each sequential pattern node on the prefix tree. The algorithm also applied the prime block encoding approach and the join operations over the prime blocks in [3] for generating candidate sequences and determining the frequency for each candidate. Experimental results are examined on the small sequence database also showed that the number of closed sequential patterns is much smaller than that of the sequential patterns. In future, based on this algorithm, we will perform more experimental results with the larger sequence databases such as synthetic and real databases, namely C6T5S4I4N1kD1k, Chess, and Mushroom and so on. C6T5S4I4N1kD1k was generated using the synthetic data generator developed by IBM to mimic transactions in a retail environment Chess and Mushroom databases were downloaded from Besides, the lattice-base approach has been proposed for mining association rules and classification association rules in recent Itemset extension Sequence extension (A)(B)(B) 3:F { } (B) 5:F (C) 4 :F (A) 4:F (A)(B) 3:F (AB) 4 :T (A)(B)(C) 3:F (AB)(C) 3:F (AB)(B) 3:F (A)(C) 3:F (B)(A) 3:F (B)(B) 5:T (B)(C) 4:F (B)(AB) 3:T (B)(B)(B) 4 :T (B)(B)(C) 4:T (AB)(B)(B) 3:T (AB)(B)(C) 3:T (BC) 3:F (B)(BC) 3:F (B)(B)(BC) 3:T Applying the attributed prefix tree for mining closed sequential patterns 113 years [5, 13]. We will study how to apply this approach for mining sequential patterns and sequential rules in the future. Acknowledgements. This research is funded by Viet Nam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2015.07. REFERENCES 1. Agrawal R. and Srikant R. - Mining Sequential Patterns. In: Proc. of 11th Int’l Conf. Data Engineering, DC, USA, (1995) 3 –14. 2. Ayres J., Gehrke J.E., Yiu T. and Flannick J. - Sequential Pattern Mining using a Bitmap Representaion. In: SIGKDD Conf., NY, USA, (2002) 1–7. 3. Gouda K., Hassaan M. and Zaki M.J. - PRISM: A Primal-Encoding Approach for Frequent Sequence Mining. Journal of Computer and System Sciences 76(1) (2010) 88- 102. 4. Huang G.-Y., Yang F., Hu C.-Z. and Ren J.-D. - Fast Discovery of Frequent Closed Sequential Patterns based on Positional. Proc. of the International Conference on Machine Learning and Cybernetics, ICMLC 2010, Qingdao, China, (2010) 444 – 449. 5. Nguyen L.T.T., Vo B., Hong T.P. and Thanh H.C. - Classification based on association rules: A lattice-based. Expert Systems with Applications, 39(13), (2012) 1135 –1136. 6. Pei J., Han J. and Mao R. - CLOSET: An efficient algorithm for mining frequent closed itemsets. In DMKD’01 workshop, Dallas, TX, (2001) 21 – 30. 7. Pham T.-T., Luo J. and Vo B. - An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. International Journal of Intelligent Information and Database Systems 7(4) (2013) 324-339. 8. Pham T.T., Luo J., Hong T.-P. and Vo B. - An Efficient Method for Mining Non- Redundant Sequential Rules Using Prefix-Trees, Engineering Applications of Artificial Intelligence 32 (2014), 88 - 99. 9. Pei J., et al - Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. In: IEEE Trans. Knowledge and Data Engineering 16(10) (2004) 1424 –1440. 10. Srikant R. and Agrawal R. - Mining Sequential Patterns: Generalizations and Performance Improvements. In: Proc. of 5th Int’l Conf. Extending Database Technology, London, UK, (1996) 3–17. 11. Thilagu M., Nadarajan R., Ahmed M.S.I. and Bama S.S. - PBFMCSP: Prefix Based Fast Mining of Closed Sequential Patterns. The International Conference on Advances in Computing, Control, and Telecommunication Technologies ATC’09, Trivandrum, Kerala, India,(2009) 484 – 488. 12. Tzvetkov P., Yan X. and Han J. - TSP: Mining Top -K Closed Sequential Patterns. Knowl. Inf. Syst., (2005) 438-457. 13. Vo B. and Le B. - Interestingness measures for association rules: Combination between lattice and hash tables. Expert Systems with Applications 38(9) (2011) 1630 - 1640. 14. Wang J. and Han J. - BIDE: Efficient mining of frequent closed sequences. In proc of the 20th Int’ Conf on Data Engineering (ICDE95): IEEE Computer Society Press, DC, USA, (2004) 79-91. Thi-Thiet Pham, Van Vo 114 15. Yan X., Han J. and Afshar R. - CloSpan: Mining closed sequential patterns in large datasets. Proc of the 3th SIAM International Conference on Data Mining, San Francisco, CA, USA: SIAM Press, (2003) 166 -177. 16. Zaki M.J. - SPADE: An Efficient Algorithm for Mining Frequent Sequences. In: Machine Learning Journal 42(1/2) (2000) 31- 60. 17. Zaki M.J. and Hsiao C. - CHARM: An efficient algorithm for closed itemset mining, In SDM ‘02, Arlington, VA, (2002) 457 - 473.

Các file đính kèm theo tài liệu này:

  • pdf11964_103810382404_1_sm_284_2061583.pdf