Data Mining: Concepts and Techniques. Mining sequence patterns in transactional databases.

Slide 1

Data Mining: Concepts and Techniques. Mining sequence patterns in transactional databases

Slide 2

Sequence Databases & Sequential Patterns
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining:
  Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
  Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
  Telephone calling patterns, Weblog click streams
  DNA sequences and gene structures

Slide 3

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database: a set of such sequences.
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
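The subsequence and support definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the helper names `parse`, `contains` and `support` are made up here, items are assumed to be single characters, and the database holds only the two sequences quoted on this slide.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of item sets, one per element.
    Assumes single-character items and no whitespace."""
    body = text.strip().strip("<>")
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements

def contains(seq, pattern):
    """True if pattern is a subsequence of seq: each pattern element must be
    a subset of a distinct element of seq, in the same order (greedy match)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)

# Only the two sequences shown on the slide (an assumption of this sketch).
db = [parse("<a(abc)(ac)d(cf)>"), parse("<(ef)(ab)(df)cb>")]
is_sub = contains(db[0], parse("<a(bc)dc>"))
sup = support(db, parse("<(ab)c>"))
print(is_sub, sup)
```

With min_sup = 2, a pattern is reported as sequential when `support` returns at least 2.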

Slide 4

Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should:
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints

Slide 5

Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm: Agrawal & Srikant, Mining sequential patterns, ICDE'95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

Slide 6

The Apriori Property of Sequential Patterns
A basic property: Apriori (Agrawal & Srikant'94)
If a sequence S is not frequent, then none of the super-sequences of S is frequent.
E.g., <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
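The <hb> example can be checked directly against the five sequences on this slide. The sketch below is illustrative only: each element is written as a string of single-character items, and `contains` and `support` are hypothetical helper names.

```python
# The five-sequence database from the slide; each string is one element.
DB = [
    ["bd", "c", "b", "ac"],               # 10
    ["bf", "ce", "b", "fg"],              # 20
    ["ah", "bf", "a", "b", "f"],          # 30
    ["be", "ce", "d"],                    # 40
    ["a", "bd", "b", "c", "b", "ade"],    # 50
]

def contains(seq, pattern):
    """Greedy left-to-right test: is pattern a subsequence of seq?"""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)

supports = {
    "<hb>": support(["h", "b"]),
    "<hab>": support(["h", "a", "b"]),
    "<(ah)b>": support(["ah", "b"]),
}
print(supports)
```

All three supports come out below min_sup = 2, which is what the Apriori property predicts once <hb> is known to be infrequent: a super-sequence can never be contained in more sequences than the sequence it extends.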

Slide 7

GSP: Generalized Sequential Pattern Mining
GSP (Generalized Sequential Pattern) mining algorithm proposed by Agrawal and Srikant, EDBT'96
Outline of the method:
  Initially, every item in DB is a candidate of length 1
  For each level (i.e., sequences of length k) do
    scan database to collect support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  Repeat until no frequent sequence or no candidate can be found
Major strength: candidate pruning by Apriori
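The level-wise loop in the outline can be sketched as follows. This is a simplified generate-and-test miner in the spirit of GSP, not GSP itself: candidates here are built by extending each frequent length-k sequence with one frequent item, whereas GSP joins pairs of frequent length-k sequences and prunes candidates that have an infrequent subsequence. Candidate counts will therefore differ from GSP's. The function names and the string-per-element encoding are assumptions of this sketch; the database is the five-sequence example used in these slides.

```python
DB = [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]

def contains(seq, pattern):
    """Greedy left-to-right test: is pattern a subsequence of seq?"""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, db):
    return sum(contains(seq, pattern) for seq in db)

def extend(pattern, items):
    """Candidates one item longer than pattern."""
    for x in items:
        yield pattern + (x,)                          # new element at the end
        if x > pattern[-1][-1]:                       # keep elements sorted
            yield pattern[:-1] + (pattern[-1] + x,)   # grow the last element

def mine(db, min_sup):
    items = sorted({x for seq in db for element in seq for x in element})
    level = {(x,): support((x,), db) for x in items}          # scan 1
    level = {p: s for p, s in level.items() if s >= min_sup}
    frequent_items = [p[0] for p in level]
    result = {}
    while level:                                              # one scan per level
        result.update(level)
        candidates = {c for p in level for c in extend(p, frequent_items)}
        counted = {c: support(c, db) for c in candidates}
        level = {c: s for c, s in counted.items() if s >= min_sup}
    return result

patterns = mine(DB, min_sup=2)
print(patterns[("bd", "c", "b", "a")])   # support of <(bd)cba>
```

Each pass of the `while` loop corresponds to one database scan in the outline: count the current candidates, keep the frequent ones, and derive the next, longer candidates only from those.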

Slide 8

Finding Length-1 Sequential Patterns
Examine GSP using an example (min_sup = 2)
Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
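The first scan is a plain count of how many sequences contain each item. A short sketch, assuming the same string-per-element encoding of the example database (the variable names are illustrative):

```python
from collections import Counter

DB = [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]
MIN_SUP = 2

counts = Counter()
for seq in DB:
    # An item counts once per sequence, however often it occurs inside it.
    counts.update({item for element in seq for item in element})

frequent = sorted(item for item, n in counts.items() if n >= MIN_SUP)
print(dict(sorted(counts.items())))
print(frequent)
```

Items g and h each occur in a single sequence, so six of the eight candidates survive as length-1 sequential patterns.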

Slide 9

GSP: Generating Length-2 Candidates
51 length-2 candidates
Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
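The counts on this slide follow from enumerating both kinds of length-2 candidate: <xy> for every ordered pair of items (including x = y), and <(xy)> for every unordered pair of distinct items. The sketch below is illustrative; `length2_candidates` is a hypothetical helper.

```python
from itertools import combinations, product

def length2_candidates(items):
    """All length-2 candidates over items: <xy> and <(xy)>."""
    two_elements = [(x, y) for x, y in product(items, repeat=2)]   # <xy>
    one_element = [(x + y,) for x, y in combinations(items, 2)]    # <(xy)>
    return two_elements + one_element

with_apriori = len(length2_candidates("abcdef"))       # 6 frequent items
without_apriori = len(length2_candidates("abcdefgh"))  # all 8 items
pruned_pct = round(100 * (without_apriori - with_apriori) / without_apriori, 2)
print(with_apriori, without_apriori, pruned_pct)
```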

Slide 10

The GSP Mining Process (min_sup = 2)

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

1st scan: 8 cand., 6 length-1 seq. pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all: <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat.: <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat.: <(bd)cba>
Figure annotations: "Cand. cannot pass sup. threshold", "Cand. not in DB at all"

Slide 11

Candidate Generate-and-Test: Drawbacks
A huge set of candidate sequences is generated, especially 2-item candidate sequences.
Multiple scans of the database are needed: the length of each candidate grows by one at each database scan.
Inefficient for mining long sequential patterns:
  A long pattern grows up from short patterns.
  The number of short patterns is exponential in the length of the mined patterns.

Slide 12

The SPADE Algorithm
SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
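The vertical mapping can be sketched as an id-list per item, with a join of two id-lists giving the occurrences of a longer pattern. This is an illustration of the idea rather than Zaki's implementation: the example database is borrowed from the GSP slides, EID is taken to be the position of the element within its sequence (an assumption), and `vertical`, `temporal_join` and `support` are hypothetical names.

```python
from collections import defaultdict

DB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

def vertical(db):
    """Map every item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists

def temporal_join(list_x, list_y):
    """Id-list of <x y>: occurrences of y that come after some x
    in the same sequence."""
    first_x = {}
    for sid, eid in list_x:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return [(sid, eid) for sid, eid in list_y
            if sid in first_x and eid > first_x[sid]]

def support(idlist):
    """Number of distinct sequences appearing in an id-list."""
    return len({sid for sid, _ in idlist})

idlists = vertical(DB)
ab = temporal_join(idlists["a"], idlists["b"])
print(idlists["a"])
print(ab, support(ab))
```

Once the id-lists are built, support counting for a longer pattern needs only the id-lists of its parts, not another pass over the original sequences.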

Slide 13

The SPADE Algorithm

Slide 14

Bottlenecks of GSP and SPADE
A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
Multiple scans of the database in mining
Breadth-first search
Mining long sequential patterns
  Needs an exponential number of short candidates
  A length-100 sequential pattern needs about 10^30 candidate sequences!
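Both figures on this slide can be checked with a little arithmetic. The sketch rests on two assumptions that are not stated on the slide itself: that length-2 candidates are counted by the same n*n + n*(n-1)/2 rule used for the 8-item example, and that the short candidates behind a length-100 pattern are all of its non-empty subsequences when every element holds a single item.

```python
from math import comb

n = 1000
# <xy> for every ordered pair plus <(xy)> for every unordered pair.
length2_candidates = n * n + n * (n - 1) // 2

# Non-empty subsequences of a 100-item pattern: sum of C(100, k), k = 1..100.
short_candidates = sum(comb(100, k) for k in range(1, 101))

print(length2_candidates)
print(f"{short_candidates:.3e}")
```

The second sum equals 2^100 - 1, which is slightly above 10^30.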

Slide 15

Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>
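A small sketch of the prefix test and the matching suffix, assuming the usual PrefixSpan definition: a prefix agrees with the sequence on all but its last element, its last element is a subset of the corresponding element, and every item left over in that element sorts after the prefix's items. The definition is recalled from the PrefixSpan literature rather than quoted from this slide, and `is_prefix`, `suffix` and `show` are hypothetical helper names.

```python
def is_prefix(beta, alpha):
    """Is beta a prefix of alpha? Sequences are lists of sorted item strings."""
    m = len(beta)
    if m == 0 or m > len(alpha) or beta[:-1] != alpha[:m - 1]:
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    if not last_b <= last_a:
        return False
    # Items left in the shared element must all come after beta's items.
    return all(x > max(last_b) for x in last_a - last_b)

def suffix(alpha, beta):
    """Suffix of alpha with respect to a prefix beta. Leftover items of the
    shared element are kept in a partial element marked with '_'."""
    m = len(beta)
    rest = sorted(set(alpha[m - 1]) - set(beta[-1]))
    head = ["_" + "".join(rest)] if rest else []
    return head + alpha[m:]

def show(seq):
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"

alpha = ["a", "abc", "ac", "d", "cf"]            # <a(abc)(ac)d(cf)>
for beta in (["a"], ["a", "a"], ["a", "ab"], ["a", "abc"]):
    print(show(beta), is_prefix(beta, alpha), show(suffix(alpha, beta)))
```

Under this definition <ab> is a subsequence of <a(abc)(ac)d(cf)> but not a prefix, because item a is left over in the shared element and sorts before b.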

Slide 16

Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

Slide 17

Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
    Having prefix <aa>;
    …
    Having prefix <af>
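The projection step and the search for length-2 patterns can be sketched as below. The four-sequence database is an assumption: it is the running example usually paired with PrefixSpan, chosen here because its <a>-projection reproduces the four suffixes listed on this slide. The helper names are illustrative.

```python
from collections import Counter

# Assumed source database (not listed on this slide).
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]

def project(seq, item):
    """Suffix of seq after the first occurrence of item, or None if absent."""
    for i, element in enumerate(seq):
        if item in element:
            rest = "".join(x for x in element if x > item)
            return (["_" + rest] if rest else []) + seq[i + 1:]
    return None

def show(seq):
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"

def length2_patterns(projected, item, min_sup):
    """Frequent one-item extensions of <item>, counted in its projected DB."""
    new_element, same_element = Counter(), Counter()
    for seq in projected:
        s_items, i_items = set(), set()
        for element in seq:
            if element.startswith("_"):      # shares the element with item
                i_items.update(element[1:])
            else:
                s_items.update(element)      # item, then x in a later element
                if item in element:          # item and x together, later on
                    i_items.update(x for x in element if x > item)
        new_element.update(s_items)
        same_element.update(i_items)
    found = {f"<{item}{x}>" for x, n in new_element.items() if n >= min_sup}
    found |= {f"<({item}{x})>" for x, n in same_element.items() if n >= min_sup}
    return found

a_projected = [p for p in (project(seq, "a") for seq in DB) if p is not None]
print([show(p) for p in a_projected])
print(sorted(length2_patterns(a_projected, "a", min_sup=2)))
```

Only the projected sequences are scanned when looking for the length-2 patterns, which is why the slide says projections w.r.t. <a> are all that need to be considered.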

Slide 18

Completeness of PrefixSpan
SDB
  Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
    Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
      Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
        Having prefix <aa>: <aa>-proj. db
        …
        Having prefix <af>: <af>-proj. db
    Having prefix <b>: <b>-projected database …
    Having prefix <c>, …, <f>: …
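The recursive partitioning in this tree can be sketched as a compact prefix-growth miner. It is one possible reading of PrefixSpan, not the authors' implementation. For brevity each projected database is stored as a map from sequence index to the element index where the earliest match of the prefix ends, instead of copying suffixes. The four-sequence database is assumed (it is the one whose <a>-projection matches this slide), and the names `grow` and `prefixspan` are illustrative.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]

def grow(prefix, pointers, min_sup, out):
    """Find frequent one-item extensions of prefix, then recurse on each.
    pointers: {sequence index: element index ending the earliest match}."""
    last = set(prefix[-1])
    s_ext, i_ext = {}, {}            # item -> pointers of the extended prefix
    for sid, end in pointers.items():
        seq = DB[sid]
        for pos in range(end, len(seq)):
            element = set(seq[pos])
            if pos > end:            # x opens a new element after the prefix
                for x in element:
                    s_ext.setdefault(x, {}).setdefault(sid, pos)
            if last <= element:      # x joins the prefix's last element
                for x in element:
                    if x > prefix[-1][-1]:
                        i_ext.setdefault(x, {}).setdefault(sid, pos)
    for x, ptrs in s_ext.items():
        if len(ptrs) >= min_sup:
            new = prefix + (x,)
            out[new] = len(ptrs)
            grow(new, ptrs, min_sup, out)
    for x, ptrs in i_ext.items():
        if len(ptrs) >= min_sup:
            new = prefix[:-1] + (prefix[-1] + x,)
            out[new] = len(ptrs)
            grow(new, ptrs, min_sup, out)

def prefixspan(min_sup):
    out, first = {}, {}
    for sid, seq in enumerate(DB):
        for pos, element in enumerate(seq):
            for x in element:
                first.setdefault(x, {}).setdefault(sid, pos)
    for x, ptrs in sorted(first.items()):     # one subset per length-1 pattern
        if len(ptrs) >= min_sup:
            out[(x,)] = len(ptrs)
            grow((x,), ptrs, min_sup, out)
    return out

patterns = prefixspan(min_sup=2)
with_prefix_a = {p for p in patterns
                 if p[0][0] == "a" and sum(len(e) for e in p) == 2}
print(sorted(with_prefix_a))
```

Every pattern is reached along exactly one path of prefixes, which mirrors the completeness argument of the tree: the subsets for different prefixes are disjoint and together cover all sequential patterns.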

Slide 19

Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
  Can be improved by pseudo-projections

Slide 20

Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
  Postfixes of sequences often appear repeatedly in recursive projected databases
When the (projected) database can be held in main memory, use pointers to form projections
  Pointer to the sequence
  Offset of the postfix
s = <a(abc)(ac)d(cf)>
s|<a>: (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
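The (pointer, offset) pairs on this slide can be illustrated as follows, reading the offset as a 1-based position among the items of the sequence. The helpers `advance` and `postfix` are hypothetical, and `advance` is deliberately simplified: it moves to the next occurrence of an item and ignores whether that item extends the current element or starts a new one.

```python
def fmt(element):
    return element if len(element) == 1 else f"({element})"

def flatten(seq):
    """(element index, item) pairs in reading order."""
    return [(i, x) for i, element in enumerate(seq) for x in element]

def advance(seq, offset, item):
    """Offset of the postfix that follows the next occurrence of item at or
    after the given 1-based offset; None if item does not occur there."""
    flat = flatten(seq)
    for pos in range(offset - 1, len(flat)):
        if flat[pos][1] == item:
            return pos + 2
    return None

def postfix(seq, offset):
    """Materialise the postfix an offset denotes, for display only."""
    flat = flatten(seq)
    start = offset - 1
    if start >= len(flat):
        return "<>"
    elem = flat[start][0]
    head = "".join(x for i, x in flat[start:] if i == elem)
    partial = start > 0 and flat[start - 1][0] == elem
    first = f"(_{head})" if partial else fmt(head)
    return "<" + first + "".join(fmt(e) for e in seq[elem + 1:]) + ">"

s = ["a", "abc", "ac", "d", "cf"]        # <a(abc)(ac)d(cf)>
off_a = advance(s, 1, "a")               # projection on <a>
off_ab = advance(s, off_a, "b")          # then on <ab>
print(off_a, postfix(s, off_a))
print(off_ab, postfix(s, off_ab))
```

Nothing is copied while mining: a projection is just a reference to `s` plus an integer, and `postfix` is only needed when a suffix has to be shown.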

Slide 21

Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
  Efficient in running time and space when the database can be held in main memory
However, it is not efficient when the database cannot fit in main memory
  Disk-based random access is very costly
Suggested approach:
  Integration of physical and pseudo-projection
  Swapping to pseudo-projection when the data set fits in memory

Slide 22

Constraint-Based Seq.-Pattern Mining
Constraint-based sequential pattern mining
  Constraints: user-specified, for focused mining of desired patterns
  How to explore efficient mining with constraints? Optimization
Classification of constraints
  Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10
  Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera}
  Succinct: e.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
  Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
  Inconvertible: e.g., avg(S) - median(S) = 0
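An anti-monotone constraint such as value_sum(S) < 150 is useful for pruning because, once a pattern violates it, no longer pattern built on top can satisfy it again. The sketch below illustrates this with made-up item values. The numbers, the helper names, and the assumption that values are non-negative are all hypothetical; only the constraint itself comes from the slide.

```python
# Hypothetical item values, invented for illustration only.
VALUE = {"PC": 120, "CD-ROM": 20, "digital_camera": 90}

def value_sum(pattern):
    """Total value of all items in a pattern (a tuple of item tuples)."""
    return sum(VALUE[item] for element in pattern for item in element)

def satisfies(pattern, limit=150):
    return value_sum(pattern) < limit

def extensions(pattern):
    """Every pattern obtained by appending one single-item element."""
    for item in VALUE:
        yield pattern + ((item,),)

ok = (("PC",), ("CD-ROM",))               # value 140: satisfies the constraint
bad = ok + (("digital_camera",),)         # value 230: violates it

print(satisfies(ok), satisfies(bad))
# With non-negative values, no extension of a violating pattern can recover.
print(any(satisfies(e) for e in extensions(bad)))
```

A miner can therefore drop `bad` and its entire subtree of longer patterns as soon as the check fails, in the same way infrequent sequences are dropped under the Apriori property.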

Slide 23

From Sequential Patterns to Structured Patterns
Sets, sequences, trees, graphs, and other structures
  Transaction DB: sets of items {{i_1, i_2, …, i_m}, …}
  Seq. DB: sequences of sets: {<{i_1, i_2}, …, {i_m, i_n, i_k}>, …}
  Sets of sequences: {{<i_1, i_2>, …, <i_m, i_n, i_k>}, …}
  Sets of trees: {t_1, t_2, …, t_n}
  Sets of graphs (mining for frequent subgraphs): {g_1, g_2, …, g_n}
Mining structured patterns in XML documents, bio-chemical structures, etc.

Slide 24

Episodes and Episode Pattern Mining
Other methods for specifying the kinds of patterns
  Serial episodes: A → B
  Parallel episodes: A & B
  Regular expressions: (A | B)C*(D → E)
Methods for episode pattern mining
  Variations of Apriori-like algorithms, e.g., GSP
  Database projection-based pattern growth
    Similar to the frequent pattern growth without candidate generation

Slide 25

Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power consumption
