# PrefixSpan Algorithm: Efficient Sequential Pattern Mining

This article discusses the PrefixSpan algorithm, a widely used approach for mining sequential patterns efficiently. The article introduces the problem statement and describes the existing solutions before presenting the proposed solution. The article concludes with references for further reading.

• Uploaded on | 9 Views
• lilyhaag

## About PrefixSpan Algorithm: Efficient Sequential Pattern Mining

PowerPoint presentation about 'PrefixSpan Algorithm: Efficient Sequential Pattern Mining'. This presentation describes the topic on This article discusses the PrefixSpan algorithm, a widely used approach for mining sequential patterns efficiently. The article introduces the problem statement and describes the existing solutions before presenting the proposed solution. The article concludes with references for further reading.. The key topics included in this slideshow are PrefixSpan Algorithm, Mining Sequential Patterns, Prefix Projected Pattern Growth, Algorithm, Efficient, Sequence Mining,. Download this presentation absolutely free.

## Presentation Transcript

1. 1/16 PREFIXSPAN ALGORITHM PREFIXSPAN ALGORITHM Mining Sequential Patterns Efficiently by Prefix- Projected Pattern Growth al113301m@student.etf.rs

2. 2/16 2. Problem statement 5. A lgorithm 1. Introduction 3. Existing Solutions 4. Proposed Solution 6. Conclusion 7. References Contet Contet al113301m@student.etf.rs

3. 3/16 Introduction Introduction Given a set of sequences, where each sequence consists of a list of elements and each element consists of set of items. - 5 elements, 9 items - 9-sequence id Sequence 10 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 al113301m@student.etf.rs

4. 4/16 Subsequence vs. super sequence Subsequence vs. super sequence Given two sequences = and =. is called a subsequence of , denoted as , if there exist integers 1j 1 Correct : 1 = 2 = < ( ac)(ac)d(cf)> 3 = Not correct: 4 =< df(cf)> 5 =<( cf)d> 6 =<( abc)dcf al113301m@student.etf.rs

5. 5/16 Sequential Pattern Mining Sequential Pattern Mining Find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support (user-specified). id Sequence 10 20 <(ad)c(bc)(ae) > 30 <(ef)(ab)(df)cb > 40 Solution 53 frequent subsequences: <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <(bc)> <(bc)a> min_support = 2 al113301m@student.etf.rs

6. 6/16 Existing Solutions Existing Solutions Apriori-like approache s (AprioriSome (1995.), AprioriAll (1995.), DynamicSome (1995.), GSP (1996.)): Potentially huge set of candidate sequences, Multiple scans of databases, Difficulties at mining long sequential patterns. FreeSpan (2000.) - pattern groth method ( Frequent pattern-projected Sequential pattern mining) General idea is to use frequent items to recursively project sequence databases into a smaller projected databases , and grow subsequence fragments in each projected database. PrefixSpan (Prefix-projected Sequential pattern mining) Less projections and quickly shrinking sequence. al113301m@student.etf.rs

7.

8. 8/16 Projection Projection Given sequences and , such that is a subsequence of . A subsequence of sequence is called a projection of w.r.t. prefix if and only if: has prefix ; There exist no proper super-sequence of such that: is a subsequence of and also has prefix . Example: =< a(abc)(ac)d(cf)> =<( bc)a> =<( bc)(ac)d(cf)> al113301m@student.etf.rs

9. 9/16 Postfix Postfix Let =< a 1 ,a 2 a n > be the projection of w.r.t. prefix =< a 1 a 2 a m-1 a m > (m n) . Sequence =< a m a m+1 a n > is called the postfix of w.r.t. prefix , denoted as = / , where a m =(a m - a m ). We also denote = . Example: =< a(abc)(ac)d(cf)>, =< a(abc)a>, =<(_ c)d(cf)> . al113301m@student.etf.rs

10. 10/16 PrefixSpan Algorithm PrefixSpan Algorithm Input of the algorithm : A sequence database S, and the minimum support threshold min_support. Output of the algorithm: The complete set of sequential patterns. Subroutine: PrefixSpan( , L, S| ) . Parameters: : sequential pattern, L: the length of ; S| : : the -projected database, if <>; otherwise; the sequence database S. Call PrefixSpan(<>,0,S). id Sequence 10 20 <(ad)c(bc)(ae) > 30 <(ef)(ab)(df)cb > 40 al113301m@student.etf.rs

11. 11/16 PrefixSpan Algorithm (2) PrefixSpan Algorithm (2) Method: 1. Scan S | once, find the set of frequent items b such that: b can be assembled to the last element of to form a sequential pattern; or can be appended to to form a sequential pattern. 2. For each frequent item b: append it to to form a sequential pattern and output ; output ; 3. For each : construct -projected database S | and call PrefixSpan( , L+1, S | ). al113301m@student.etf.rs

12.

13. 13/16 PrefixSpan Example (2) PrefixSpan Example (2) Find subsets of sequential patterns: <(cf)> <(_f)cb> <(bc)(ae)> <(_c)(ae)> <> <_f> 1 2 3 0 1 1 1 2 1 1 1 al113301m@student.etf.rs

14. 14/16 Conclusions Conclusions PrefixSpan Efficient pattern growth method. Outperforms both GSP and FreeSpan. Explores prefix-projection in sequential pattern mining. Mines the complete set of patterns, but reduces the effort of candidate subsequence generation. Prefix-projection reduces the size of projected database and leads to efficient processing. Bi-level projection and pseudo-projection may improve mining efficiency. al113301m@student.etf.rs

15. 15/16 References References Pei J., Han J., Mortazavi-Asl J., Pinto H., Chen Q., Dayal U., Hsu M., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix- Projected Pattern Growth, 17th International Conference on Data Engineering (ICDE), April 2001. Agrawal R., Srikant R., Mining sequential patterns, Proceedings 1995 Int. Conf. Very Large Data Bases (VLDB94), pp. 487-499, 1999. Han J., Dong G., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C., Freespan: Frequent pattern-projected sequential pattern mining, Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD00), pp. 355-359, 2000. Wojciech Stach, http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput695- 04/slides/PrefixSpan-Wojciech.pdf. al113301m@student.etf.rs

16. 16/16 Thank you for attention. Questions? Lazar Arsi 2011/3301 AL113301m@student.etf.rs Thank you for attention. Questions? Lazar Arsi 2011/3301 AL113301m@student.etf.rs al113301m@student.etf.rs