Time Series II: Data Mining and Analysis Course

This course covers various techniques in data mining and analysis, including association rules, clustering, classification, similarity matching, combining models, time series analysis, ranking, and model evaluation. Homework assignments and exercise sessions will be included, culminating in a final exam and re-exam.

Presentation Transcript


1. Time Series II

2. Syllabus
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

3. Last time: What is a time series? How do we compare time series data?

4. Today: What is the structure of time series data? Can we represent this structure compactly and accurately? How can we search streaming time series?

5. Time series summarization

6. Why summarization? We can reduce the length of the time series; we should not lose any information; we can process it faster.

7. Discrete Fourier Transform (DFT). (Jean Fourier, 1768-1830.) Basic idea: represent the time series as a linear combination of sines and cosines; transform the data from the time domain to the frequency domain; highlight the periodicities, but keep only the first n/2 coefficients. Why n/2 coefficients? Because they are symmetric. [figure: a time series X and its DFT-based approximation X'] Excellent free Fourier primer: Hagit Shatkay, "The Fourier Transform - A Primer", Technical Report CS-95-37, Department of Computer Science, Brown University, 1995. http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/
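
To ground this (an editorial sketch, not from the slides; the toy signal, NumPy calls, and variable names are my own), the symmetry that justifies keeping only the first n/2 coefficients can be checked directly:

```python
import numpy as np

n = 128
t = np.arange(n)
# hypothetical toy series: two sinusoids, for illustration only
x = np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.sin(2 * np.pi * 10 * t / n)

X = np.fft.fft(x)                 # full DFT: n complex coefficients
# for a real-valued signal, X[f] is the complex conjugate of X[n-f]
print(np.allclose(X[3], np.conj(X[n - 3])))    # True
# so half the coefficients are redundant; rfft returns only the first n/2 + 1
print(len(np.fft.rfft(x)))                     # 65
```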

8. Why DFT? A: several real sequences are periodic. Q: Such as? A: sales patterns follow the seasons; the economy follows a 50-year cycle (or 10?); temperature follows daily and yearly cycles. Many real signals follow (multiple) cycles.

9. How does it work? DFT decomposes the signal into a sum of sine and cosine waves. How do we assess the similarity of x with a (discrete) wave? [figure: value vs. time, t = 0 ... n-1] x = {x_0, x_1, ..., x_{n-1}}, s = {s_0, s_1, ..., s_{n-1}}

10. How does it work? Consider the waves with frequency 0, 1, ...; use the inner product (~cosine similarity) of x with each. [figures: waves with freq. f = 0 and freq. f = 1, over t = 0 ... n-1] For f = 1 the wave is sin(t · 2π/n); in general, freq = 1/period.

11. How does it work? Consider the waves with frequency 0, 1, ...; use the inner product (~cosine similarity). [figure: wave with freq. f = 2, over t = 0 ... n-1]

12. How does it work? Basis functions. [figures: sine waves with freq = 1 and freq = 2, and cosine waves with f = 1 and f = 2, each over t = 0 ... n-1]

13. How does it work? The basis functions are actually n-dimensional vectors, orthogonal to each other; the similarity of x with each of them is an inner product. DFT: ~ all the similarities of x with the basis functions.
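
As a quick sanity check of this view (a sketch I added, assuming NumPy's unnormalized FFT convention), the f-th DFT coefficient is exactly the inner product of x with the complex sinusoid of frequency f:

```python
import numpy as np

n = 64
t = np.arange(n)
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                  # an arbitrary series

f = 3
basis = np.exp(-2j * np.pi * f * t / n)     # cos(...) - j*sin(...) at frequency f
# inner product with the basis vector == the f-th DFT coefficient
print(np.allclose(np.dot(x, basis), np.fft.fft(x)[f]))   # True
```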

14. How does it work? Since e^(j·f) = cos(f) + j·sin(f), with j = sqrt(-1), we finally have:
X_f = (1/√n) · Σ_{t=0}^{n-1} x_t · e^(-j·2π·f·t/n)   (DFT)
x_t = (1/√n) · Σ_{f=0}^{n-1} X_f · e^(j·2π·f·t/n)   (inverse DFT)

15. How does it work? Each X_f is a complex number: X_f = a + b·j, where a is the real part and b is the imaginary part. Examples: 10 + 5j, 4.5 - 4j.

16. How does it work? SYMMETRY property of the DFT of a real signal: X_f = (X_{n-f})*, where * denotes the complex conjugate: (a + b·j)* = a - b·j. Thus we use only the first n/2 coefficients.

17. DFT: amplitude spectrum. The amplitude A_f = |X_f| = sqrt(a² + b²); intuition: the strength of frequency f. [figure: a time series (value vs. time) and its amplitude spectrum (A_f vs. freq. f), with a peak at freq = 12]

18. Example: reconstruction using 1 coefficient. [figure: the original series and its 1-coefficient reconstruction]

19. Example: reconstruction using 2 coefficients. [figure]

20. Example: reconstruction using 7 coefficients. [figure]

21. Example: reconstruction using 20 coefficients. [figure]
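
These reconstructions can be reproduced in a few lines; a sketch I added, under the assumption that "using k coefficients" means keeping the first k low-frequency rfft coefficients and zeroing the rest (the toy series is hypothetical):

```python
import numpy as np

def reconstruct(x, k):
    """Inverse DFT after keeping only the first k frequency coefficients."""
    X = np.fft.rfft(x)
    X[k:] = 0                        # discard all but the first k coefficients
    return np.fft.irfft(X, n=len(x))

t = np.arange(256)
x = 3 * np.sin(2 * np.pi * 2 * t / 256) + np.random.randn(256)  # periodic + noise
for k in (1, 2, 7, 20):
    err = np.linalg.norm(x - reconstruct(x, k))
    print(f"k={k:2d}  reconstruction error = {err:.2f}")  # error shrinks as k grows
```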

22. DFT: amplitude spectrum. We can achieve excellent approximations with only very few frequencies! So what?

23. DFT: amplitude spectrum. We can achieve excellent approximations with only very few frequencies! We can reduce the dimensionality of each time series by representing it with the k most dominant frequencies. Each frequency needs two numbers (real part and imaginary part). Hence, a time series of length n can be represented using 2·k real numbers, where k << n.

24. Raw data. [figure: a time series C of 128 points; alongside, the first 30 or so raw values are listed: 0.4995, 0.5264, 0.5523, 0.5761, 0.5973, ...] The graphic shows a time series with 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown). n = 128

25. Fourier coefficients. [figure: the series C decomposed into sine waves; alongside, the first 30 or so Fourier coefficients are listed: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, ...] We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).

26. Truncated Fourier coefficients. [figure: the series C and its reconstruction from the first 8 Fourier coefficients: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928] We have discarded 15/16 of the data. n = 128, N = 8, compression ratio = 1/16.

27. Sorted truncated Fourier coefficients. Instead of taking the first few coefficients, we could take the best coefficients. [figure: the series C and its reconstruction from the 8 best coefficients: 1.5698, 1.0485, 0.7160, 0.8406, 0.2667, 0.1928, 0.1438, 0.1416]
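
A sketch of the "best coefficients" variant (same assumptions as the previous sketch): keep the k largest-magnitude coefficients rather than the first k. Note that this scheme must also store which frequencies were kept, not just their values:

```python
import numpy as np

def reconstruct_best(x, k):
    """Keep the k largest-magnitude rfft coefficients, zero the rest, invert."""
    X = np.fft.rfft(x)
    keep = np.argsort(-np.abs(X))[:k]   # indices of the k strongest frequencies
    pruned = np.zeros_like(X)
    pruned[keep] = X[keep]
    return np.fft.irfft(pruned, n=len(x))
```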

28. Discrete Fourier Transform: recap. Pros and cons of DFT as a time series representation. Pros: good ability to compress most natural signals; fast, off-the-shelf DFT algorithms exist, O(n·log(n)). Cons: difficult to deal with sequences of different lengths. [figure: a time series X and its DFT-based approximation X']

29. Piecewise Aggregate Approximation (PAA). [figure: a time series X approximated by 8 equal-length boxes x̄_1, ..., x̄_8] Basic idea: represent the time series as a sequence of box basis functions, each box being of the same length. (Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS 2000; Byoung-Kee Yi & Christos Faloutsos, VLDB 2000.) Computation: a time series X of length n is represented in N-dimensional space as X̄ = (x̄_1, ..., x̄_N), where x̄_i = (N/n) · Σ_{j=(n/N)(i-1)+1}^{(n/N)·i} x_j, i.e., the mean of the i-th of N equal-length segments.

30. Piecewise Aggregate Approximation (PAA): example. Let X = [1 3 -1 4 4 4 5 3 7]. X can be mapped from its original dimension n = 9 to a lower dimension, e.g., N = 3, as follows: [1 3 -1 | 4 4 4 | 5 3 7] → [1 4 5].
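
A minimal PAA sketch (mine, not the authors' code) that reproduces this example; it assumes the series length is divisible by N:

```python
import numpy as np

def paa(x, N):
    """Piecewise Aggregate Approximation: mean of each of N equal-length segments."""
    x = np.asarray(x, dtype=float)
    return x.reshape(N, -1).mean(axis=1)

print(paa([1, 3, -1, 4, 4, 4, 5, 3, 7], N=3))   # [1. 4. 5.]
```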

31. Pros and cons of PAA as a time series representation. Pros: extremely fast to calculate; as efficient as other approaches (empirically); supports queries of arbitrary lengths; can support any Minkowski metric; supports non-Euclidean measures; simple and intuitive! Cons: if visualized directly, looks aesthetically unpleasing. [figure: a time series X and its PAA approximation X' with segments x̄_1, ..., x̄_8]

32. Symbolic Aggregate approXimation (SAX): similar in principle to PAA; uses segments to represent data series; represents segments with symbols (rather than real numbers); small memory footprint.

33. Creating SAX. Input: a time series (blue curve). Output: the SAX representation of the input time series (red string), e.g., baabccbc. [figure: input series, its PAA, and the resulting SAX string]

34. The process (STEP 1): represent a time series T of length n with w segments using Piecewise Aggregate Approximation: PAA(T, w) = (t̄_1, ..., t̄_w), where t̄_i = (w/n) · Σ_{j=(n/w)(i-1)+1}^{(n/w)·i} t_j. [figure: a time series T of length 16 and PAA(T, 4)]

35. The process (STEP 2): discretize into a vector of symbols; use breakpoints to map the segment averages to a small alphabet of symbols. [figure: PAA(T, 4) mapped through the breakpoints; with binary symbol labels 00, 01, 10, 11 this yields iSAX(T, 4, 4)]

36. Symbol mapping. Each average value from the PAA vector is replaced by a symbol from an alphabet. An alphabet size a of 5 to 8 is recommended: {a, b, c, d, e}, {a, b, c, d, e, f}, {a, b, c, d, e, f, g}, {a, b, c, d, e, f, g, h}. Given an average value, we need a symbol.

37. Symbol mapping. This is achieved by using the normal distribution from statistics: assuming our input series is normalized, we can use the normal distribution as the data model. We divide the area under the normal distribution into a equal-sized areas, where a is the alphabet size. Each such area is bounded by breakpoints.

38. SAX computation in pictures. [figure: a time series C, its PAA, and the resulting SAX string baabccbc] (This slide is taken from Eamonn's tutorial on SAX.)

39. Finding the breakpoints. Breakpoints for different alphabet sizes can be structured as a lookup table. When a = 3: average values below -0.43 are replaced by A; average values between -0.43 and 0.43 are replaced by B; average values above 0.43 are replaced by C.

      a=3     a=4     a=5
b1   -0.43   -0.67   -0.84
b2    0.43    0.00   -0.25
b3            0.67    0.25
b4                    0.84
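
Putting STEP 1 and STEP 2 together, here is a small SAX sketch I added; the breakpoints are the ones from the lookup table above, and the function assumes the series length is divisible by the number of segments w:

```python
import numpy as np

# breakpoints that cut the standard normal into a equiprobable regions
BREAKPOINTS = {3: [-0.43, 0.43],
               4: [-0.67, 0.0, 0.67],
               5: [-0.84, -0.25, 0.25, 0.84]}

def sax(x, w, a):
    """z-normalize, reduce to w segment averages (PAA), map each to a letter."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()                  # the normal data model assumes this
    means = x.reshape(w, -1).mean(axis=1)
    idx = np.searchsorted(BREAKPOINTS[a], means)  # region index of each average
    return ''.join(chr(ord('a') + i) for i in idx)

rng = np.random.default_rng(1)
print(sax(rng.standard_normal(16).cumsum(), w=8, a=3))   # an 8-letter string
```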

40. The GEMINI framework. Raw data live in the original, full-dimensional space; the summarization lives in a reduced-dimensionality space. Searching in the original space is costly; searching in the reduced space is faster: less data, indexing techniques available, lower bounding. Lower bounding enables us to prune the search space: throw away data series based on their reduced-dimensionality representation alone, while guaranteeing the correctness of the answer (no false negatives); false positives are filtered out based on the raw data.

41. GEMINI solution: quick filter-and-refine. Extract m features (numbers, e.g., averages); map each series to a point in m-dimensional feature space; organize the points; retrieve the answer using a NN query; discard the false alarms.

42. Generic search using lower bounding. [figure: the query is simplified and matched against the simplified DB, yielding an answer superset; the superset is verified against the original DB to produce the final answer set] No false negatives! Remove the false positives!

43. GEMINI: contractiveness. GEMINI works when: D_feature(F(x), F(y)) <= D(x, y). Note that the closer the feature distance is to the actual one, the better.
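
A sketch of the whole filter-and-refine loop with PAA as the summarization (my own illustration; a linear scan stands in for a real index). The sqrt(n/N) scaling makes the reduced-space distance satisfy exactly the contractiveness condition above, so pruning produces no false negatives:

```python
import numpy as np

def paa(x, N):
    return x.reshape(N, -1).mean(axis=1)

def gemini_nn(query, database, N):
    """1-NN: prune with the PAA lower bound, verify survivors on the raw data.
    Assumes all series are NumPy arrays of the same length, divisible by N."""
    n = len(query)
    q_feat = paa(query, N)
    best, best_d = None, np.inf
    for x in database:
        # D_feature(F(x), F(q)) <= D(x, q), so discarding here is always safe
        lb = np.sqrt((n / N) * np.sum((q_feat - paa(x, N)) ** 2))
        if lb >= best_d:
            continue                               # pruned without touching raw data
        d = np.sqrt(np.sum((query - x) ** 2))      # refine: exact Euclidean distance
        if d < best_d:
            best, best_d = x, d
    return best, best_d
```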

44. Streaming algorithms. Similarity search is the bottleneck for most time series data mining algorithms, including streaming algorithms, and scaling such algorithms can be tedious when the target time series length becomes very large! Fast similarity search allows us to solve higher-level time series data mining problems, e.g., similarity search in data streams and motif discovery, at scales that would otherwise be untenable.

45. Fast serial scan: a streaming algorithm for fast and exact search in very large data streams. [figure: a query slid along a data stream]

46. Z-normalization. Needed when we are interested in detecting trends rather than absolute values. For streaming data, each subsequence of interest should be z-normalized before being compared to the z-normalized query; otherwise the trends are lost. Z-normalization guarantees offset invariance and scale/amplitude invariance. [figure: three series A, B, C]

47. Pre-processing: z-normalization. Data series encode trends; we are usually interested in identifying similar trends, but absolute values may mask this similarity.

48. Pre-processing: z-normalization. Two data series with similar trends but a large distance. [figure: series v1 and v2]

49. Pre-processing: z-normalization. Zero mean: compute the mean of the sequence, and subtract the mean from every value of the sequence. [figure: series v1 and v2]

50. Pre-processing: z-normalization. Zero mean: compute the mean of the sequence, and subtract the mean from every value of the sequence. [figure: step continued]

51. Pre-processing: z-normalization. Zero mean: compute the mean of the sequence, and subtract the mean from every value of the sequence. [figure: step continued]

52. Pre-processing: z-normalization. Zero mean: compute the mean of the sequence, and subtract the mean from every value of the sequence. [figure: step continued]

53. Pre-processing: z-normalization. Standard deviation one: compute the standard deviation of the sequence, and divide every value of the sequence by the stddev. [figure: step continued]

54. Pre-processing: z-normalization. Standard deviation one: compute the standard deviation of the sequence, and divide every value of the sequence by the stddev. [figure: step continued]

55. Pre-processing: z-normalization. Standard deviation one: compute the standard deviation of the sequence, and divide every value of the sequence by the stddev. [figure: step continued]

56. Pre-processing: z-normalization. Result: zero mean, standard deviation one. [figure: the two z-normalized series]

57. Pre-processing: z-normalization. When to z-normalize: when interested in trends. When not to z-normalize: when interested in absolute values.
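
The whole pipeline of slides 49-56 collapses to two lines; a sketch I added (the example vectors are hypothetical):

```python
import numpy as np

def z_normalize(x):
    """Zero mean, unit standard deviation: offset and amplitude invariance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

v1 = np.array([10.0, 11.0, 12.0, 11.0, 10.0])  # same trend, offset by 10...
v2 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])       # ...as this series
print(np.allclose(z_normalize(v1), z_normalize(v2)))   # True: only the trend remains
```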

58. Proposed method: UCR Suite. An algorithm for similarity search in large data streams; supports both ED and DTW search; works for both z-normalized and un-normalized data series; a combination of various optimizations.

59. Squared distance + LB. Use the squared distance, plus lower bounding: LB_Yi, LB_Kim, LB_Keogh. [figure: candidate C, query Q, and the envelope U, L used by LB_Keogh]

60. Lower bounding: LB_Yi, LB_Kim, LB_Keogh. [figure: panels A-D illustrating the lower bounds, using max(Q), min(Q), and the envelope U, L around the query Q]
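
A sketch of LB_Keogh as used here (my own naive O(n·r) version; the UCR Suite computes the envelope more efficiently): build the envelope [L, U] around the query with warping window r, and accumulate only the parts of the candidate that fall outside it:

```python
import numpy as np

def lb_keogh(q, c, r):
    """Lower bound of DTW(q, c): distance from candidate c to the envelope of q."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    total = 0.0
    for i in range(len(q)):
        window = q[max(0, i - r): i + r + 1]
        U, L = window.max(), window.min()   # envelope of q at position i
        if c[i] > U:
            total += (c[i] - U) ** 2        # candidate sticks out above the envelope
        elif c[i] < L:
            total += (c[i] - L) ** 2        # candidate sticks out below the envelope
    return np.sqrt(total)
```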

61. Early abandoning: early abandoning of ED; early abandoning of LB_Keogh. [figure: U, L is an envelope of Q]
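
A sketch of early abandoning for the squared Euclidean distance (function and parameter names are mine): the running sum can only grow, so as soon as it passes the best-so-far the candidate is hopeless:

```python
def ed_early_abandon(q, c, bsf_sq):
    """Squared ED that stops as soon as the partial sum exceeds best-so-far."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf_sq:        # abandon: this candidate can no longer win
            return float('inf')
    return total
```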

62. Early abandoning of DTW, and earlier early abandoning of DTW using LB_Keogh. Stop if dtw_dist >= bsf. [figure: candidate C, query Q, warping window R]

63. Early abandoning of DTW, earlier, using LB_Keogh: compute the (partial) dtw_dist on a prefix and the (partial) lb_keogh on the remaining suffix. Stop if dtw_dist + lb_keogh >= bsf. [figure: candidate C, query Q, warping window R]

64. Early abandoning Z-normalization: do the normalization only when needed (just in time). Every subsequence needs to be normalized before it is compared to the query, so an online mean and std calculation is needed: keep a buffer of size m and compute a running mean and standard deviation.
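
A sketch of the running mean and standard deviation over a sliding buffer of size m, using two running sums so each new stream value costs O(1) (a simplified stand-in for what the UCR Suite does; in practice the sums are recomputed periodically to control floating-point error):

```python
from collections import deque
from math import sqrt

class RunningStats:
    """Mean and std of the last m stream values, maintained incrementally."""
    def __init__(self, m):
        self.m, self.buf = m, deque()
        self.s = self.s2 = 0.0           # running sum and running sum of squares

    def push(self, v):
        self.buf.append(v); self.s += v; self.s2 += v * v
        if len(self.buf) > self.m:       # slide the window: retire the oldest value
            old = self.buf.popleft(); self.s -= old; self.s2 -= old * old

    def mean_std(self):
        k = len(self.buf)
        mu = self.s / k
        return mu, sqrt(max(self.s2 / k - mu * mu, 0.0))
```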

65. The pseudocode. [figure: pseudocode, not reproduced in this transcript]

66. Reordering early abandoning. We don't have to compute ED or LB from left to right. Idea: order the points by expected contribution: order by the absolute height of the query point. This step is performed only once for the query and can save about 30%-50% of the calculations.

67. Reordering: intuition. The query will be compared to many data stream points during a search, and the candidates are z-normalized: the distribution of many candidates will be Gaussian, with a mean of zero, so the sections of the query that are farthest from the mean (zero) will on average have the largest contributions to the distance measure.
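
A sketch of the reordering heuristic combined with early abandoning (function names are mine): the visiting order is computed once from the z-normalized query and reused for every candidate:

```python
import numpy as np

def contribution_order(q_norm):
    """Indices of the z-normalized query, largest absolute value first."""
    return np.argsort(-np.abs(q_norm))

def ed_early_abandon_ordered(q, c, order, bsf_sq):
    """Early-abandoning squared ED, visiting points in expected-contribution order."""
    total = 0.0
    for i in order:
        total += (q[i] - c[i]) ** 2
        if total >= bsf_sq:              # big contributors first => abandon sooner
            return float('inf')
    return total
```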

68. Different envelopes: reversing the query/data role in LB_Keogh. Makes LB_Keogh tighter; much cheaper than DTW; online envelope calculation. [figure: envelope built on Q vs. envelope built on C]

69. Cascading lower bounds. At least 18 lower bounds of DTW have been proposed; use only the lower bounds that lie on the skyline. [figure: lower bounds plotted by tightness, LB/DTW]

70. Experimental result: random walk, varying the size of the data.

            Million (seconds)   Billion (minutes)   Trillion (hours)
UCR-ED           0.034               0.22                3.16
SOTA-ED          0.243               2.40               39.80
UCR-DTW          0.159               1.83               34.09
SOTA-DTW         2.447              38.14              472.80

Code and data are available at: www.cs.ucr.edu/~eamonn/UCRsuite.html

71. Experimental result: ECG. Data: one year of electrocardiograms, 8.5 billion data points. Query: an idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (R = 21 = 5%).

        UCR-ED        SOTA-ED        UCR-DTW        SOTA-DTW
ECG     4.1 minutes   66.6 minutes   18.0 minutes   49.2 hours

~30,000x faster than real time!

72. Up next
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: No Lecture
Jan 14: EXAM
Feb 23: Re-EXAM
