Approximate Frequency Counts over Data Streams

Slide 1

Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku (Stanford), Rajeev Motwani (Stanford). Presented by Michal Spivak, November 2003.

Slide 2

Stream ... The Problem: Identify all elements whose current frequency exceeds the support threshold s = 0.1%.

Slide 3

Stream ... Related problem: Identify all subsets of items whose current frequency exceeds s = 0.1%.

Slide 4

Purpose of this paper: Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages: simple; low memory footprint; output is approximate but guaranteed not to exceed a user-specified error parameter; can be deployed for streams of singleton items and can handle streams of variable-sized sets of items.

Slide 5

Overview: Introduction. Frequency counting applications. Problem definition. Algorithm for Frequent Items. Algorithm for Frequent Sets of Items. Experimental results. Summary.

Slide 6


Slide 7

Motivating examples. Iceberg Queries: perform an aggregate function over an attribute and eliminate those below some threshold. Association Rules: require computation of frequent itemsets. Iceberg Datacubes: group-by's of a CUBE operator whose aggregate frequency exceeds a threshold. Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.

Slide 8

What " s out there today … Algorithms that register careful results Attempt to minimize number of information passes (best calculations take two passes). Issues when adjusted to streams: Only one pass is permitted. Results are relied upon to be accessible with short reaction time. Neglect to give any from the earlier ensure on the nature of their yield.

Slide 9

Why Streams? Streams vs. stored data: the volume of a stream over its lifetime can be huge, and queries over streams require timely answers with small response times. It is therefore impractical to store the stream in its entirety.

Slide 10

Frequency counting applications

Slide 11

Existing applications for the following problems. Iceberg Queries: perform an aggregate function over an attribute and discard those below some threshold. Association Rules: require computation of frequent itemsets. Iceberg Datacubes: group-by's of a CUBE operator whose aggregate frequency exceeds a threshold. Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.

Slide 12

Iceberg Queries: identify aggregates that exceed a user-specified threshold r. One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.* Basic idea: in the first pass a set of counters is maintained; each incoming item is hashed to one of the counters, which is incremented. These counters are then compressed into a bitmap, with a 1 denoting a large counter value. In the second pass, exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1. This algorithm is hard to adapt for streams since it requires two passes. * M. FANG, N. SHIVAKUMAR, H. GARCIA-MOLINA, R. MOTWANI, AND J. ULLMAN. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.
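A minimal sketch of that two-pass idea, assuming a single hash scan (the original uses multiple hash scans); the function name and counter-array size are illustrative, not from the paper:

```python
from collections import Counter

def iceberg_candidates(data, threshold, num_counters=8):
    """Two-pass iceberg-query sketch in the spirit of Fang et al.

    Pass 1: hash every item into a small counter array, then build a
    bitmap with a 1 for each counter that reached the threshold.
    Pass 2: count exactly only the items whose bucket is marked.
    """
    counters = [0] * num_counters
    for item in data:
        counters[hash(item) % num_counters] += 1
    bitmap = [c >= threshold for c in counters]

    exact = Counter(x for x in data if bitmap[hash(x) % num_counters])
    return {x: f for x, f in exact.items() if f >= threshold}
```

Hash collisions can let rare items survive pass 1, but the exact counts of pass 2 remove them; it is exactly this second pass over the data that a one-pass stream setting forbids.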

Slide 13

Association Rules - Definitions. Transaction: a subset of items drawn from I, the universe of all items. An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions. Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s. Confidence of a rule X => Y is the quantity support(X ∪ Y) / support(X).

Slide 14

Example - Market basket analysis. For support = 50%, confidence = 50%, we have the following rules: 1 => 3 with 50% support and 66% confidence; 3 => 1 with 50% support and 100% confidence.

Slide 15

Reduce to computing frequent itemsets. For support = 50%, confidence = 50%. For the rule 1 => 3: Support = Support({1, 3}) = 50%; Confidence = Support({1, 3}) / Support({1}) = 66%.
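These definitions translate directly into code. The basket data below is hypothetical (the slides do not show the underlying transactions), chosen only so that the rule 1 => 3 reproduces the numbers above:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical market baskets (not from the paper).
baskets = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 4}]
```

Here support({1, 3}) = 2/4 = 50%, support({1}) = 3/4, so confidence(1 => 3) = 50% / 75% ≈ 66%, and confidence(3 => 1) = 50% / 50% = 100%, matching the example slide.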

Slide 16

Toivonen " s calculation Based on testing of the information stream. Fundamentally, in the main pass, frequencies are registered for tests of the stream, and in the second pass these the legitimacy of these things is resolved. Can be adjusted for information stream Problems: - false negatives happen in light of the fact that the blunder in recurrence tallies is two sided - for little estimations of e , the quantity of tests is gigantic ~ 1/e (100 million examples)

Slide 17

Network flow identification. Flow: a sequence of transport-layer packets that share the same source and destination addresses. Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. Their algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries. The algorithm presented in this paper is directly applicable to the problem of network flow identification, and it beats that algorithm in terms of space requirements.

Slide 18

Problem definition

Slide 19

Problem Definition. The algorithm accepts two user-specified parameters: a support threshold s ∈ (0, 1) and an error parameter ε ∈ (0, 1), with ε << s. N: length of the stream (i.e. the number of tuples seen so far). Itemset: a set of items. We write item(set) to mean an item or an itemset. At any point in time, the algorithm can be asked to produce a list of item(set)s along with their estimated frequencies.

Slide 20

Approximation guarantees. 1. All item(set)s whose true frequency exceeds sN are output: there are no false negatives. 2. No item(set) whose true frequency is less than (s - ε)N is output. 3. Estimated frequencies are less than the true frequencies by at most εN.
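The three properties can be stated as a small checker. This helper is illustrative only: it assumes access to the exact frequencies, which a streaming algorithm of course does not have, so it is useful only for testing an implementation offline.

```python
def satisfies_guarantees(output, true_freq, s, eps, n):
    """Check an (element -> estimated f) map against the three properties,
    given the exact frequencies `true_freq` and stream length n."""
    for e, f in true_freq.items():
        if f > s * n and e not in output:
            return False                      # property 1: no false negatives
        if f < (s - eps) * n and e in output:
            return False                      # property 2: no very-rare output
    # property 3: estimates undercount by at most eps * n, never overcount
    return all(true_freq.get(e, 0) - eps * n <= est <= true_freq.get(e, 0)
               for e, est in output.items())
```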

Slide 21

Input example: s = 0.1%. ε, as a rule of thumb, should be set to one-tenth or one-twentieth of s; here ε = 0.01%. By property 1, ALL elements with frequency exceeding 0.1% will be output. By property 2, NO element with frequency below 0.09% will be output. Elements between 0.09% and 0.1% may or may not be output; those that "make it" are false positives. By property 3, all reported frequencies are less than their true frequencies by at most 0.01%.

Slide 22

Problem Definition cont ... An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties. Goal: to devise algorithms that support an ε-deficient synopsis using as little main memory as possible.

Slide 23

The algorithms for frequent items: Sticky Sampling; Lossy Counting.

Slide 24

Sticky Sampling Algorithm. Stream: 34 15 30 28 31 41 23 35 19 ... Create counters by sampling.

Slide 25

Notations ... Data structure S: a set of entries of the form (e, f), where f estimates the frequency of an element e. r: sampling rate. Sampling an element with rate r means we select the element with probability 1/r.

Slide 26

Sticky Sampling cont ... Initially S is empty and r = 1. For each incoming element e: if e exists in S, increment the corresponding f; else { sample the element with rate r; if sampled, add the entry (e, 1) to S; else ignore it }.

Slide 27

The sampling rate. Let t = (1/ε) log(1/(sδ)), where δ is the probability of failure. The first 2t elements are sampled at rate = 1, the next 2t elements at rate = 2, the next 4t elements at rate = 4, and so on ...
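Assuming natural log, the batch schedule above can be written as a function returning the rate in effect for the n-th element (the function name is illustrative):

```python
import math

def sampling_rate(n, eps, s, delta):
    """Sticky Sampling rate for the n-th stream element (1-indexed).

    t = (1/eps) * ln(1/(s*delta)); the first 2t elements use r = 1,
    the next 2t use r = 2, the next 4t use r = 4, and so on: each time
    the rate doubles, so does the length of the batch it covers.
    """
    t = math.log(1.0 / (s * delta)) / eps
    r, covered = 1, 2 * t        # elements covered so far at rate r
    while n > covered:
        r *= 2
        covered += r * t         # the batch at the doubled rate has length r*t
    return r
```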

Slide 28

Sticky Sampling cont ... Whenever the sampling rate r changes: for each entry (e, f) in S, repeat { toss an unbiased coin; if the toss is not successful, diminish f by one; if (f == 0) { delete the entry from S; break } } until the toss is successful.

Slide 29

Sticky Sampling cont ... The number of unsuccessful coin tosses follows a geometric distribution. Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the very beginning. When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s - ε)N.
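Putting the last three slides together (insertion, the doubling rate, pruning by coin tosses, and the query answer), a sketch of Sticky Sampling might look like the class below; the interface is hypothetical, not the paper's code:

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling: S maps element -> estimated frequency f."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps
        self.S = {}
        self.r = 1                       # current sampling rate
        self.n = 0                       # stream length so far
        self.batch_end = 2 * self.t      # first 2t elements at r = 1

    def _advance_rate_if_needed(self):
        if self.n <= self.batch_end:
            return
        self.r *= 2
        self.batch_end += self.r * self.t
        # Prune: toss a fair coin per entry until heads, decrementing f once
        # per tails; afterwards S looks as if the new rate had been used
        # from the very beginning.
        for e in list(self.S):
            while random.random() < 0.5:
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def insert(self, e):
        self.n += 1
        self._advance_rate_if_needed()
        if e in self.S:
            self.S[e] += 1                     # existing entries count exactly
        elif random.random() < 1.0 / self.r:   # sample new elements at rate r
            self.S[e] = 1

    def output(self):
        """Entries with f >= (s - eps) * N."""
        cut = (self.s - self.eps) * self.n
        return {e: f for e, f in self.S.items() if f >= cut}
```

Once an element is "stuck" in S it is counted exactly from then on, which is what keeps the undercount bounded by εN with high probability.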

Slide 30

Theorem 1. Sticky Sampling computes an ε-deficient synopsis with probability at least 1 - δ, using at most (2/ε) log(1/(sδ)) entries in expectation.

Slide 31

Theorem 1 - proof. The first 2t elements find their way into S. When r ≥ 2: N = rt + rt', with t' ∈ [1, t), so 1/r ≥ t/N. The error in an element's frequency corresponds to a run of unsuccessful coin tosses during the first few occurrences of e. The probability that this run exceeds εN is at most (1 - 1/r)^(εN) ≤ (1 - t/N)^(εN) < e^(-εt). The number of elements with frequency exceeding s is at most 1/s, so the probability that the estimate for any of them is deficient by εN is at most e^(-εt)/s.

Slide 32

Theorem 1 - proof cont ... The probability of failure should be at most δ. This yields e^(-εt)/s < δ, i.e. t ≥ (1/ε) log(1/(sδ)). Since the space requirement is 2t, the space bound follows ...

Slide 33

Sticky Sampling summary. The algorithm is called Sticky Sampling because S sweeps over the stream like a magnet, attracting all elements that already have an entry in S. The space complexity is independent of N. Maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem. This algorithm differs in that the sampling rate r increases logarithmically so as to produce ALL items with frequency > s, not just the top k.

Slide 34

Lossy Counting. bucket 1, bucket 2, bucket 3 ... Divide the stream into buckets. Keep exact counters for items within a bucket. Prune entries at bucket boundaries.

Slide 35

Lossy Counting cont ... A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined in Section 3 using at most (1/ε) log(εN) space, where N denotes the current length of the stream. The user specifies two parameters: support s and error ε.
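A minimal deterministic sketch of the bucket scheme described above. The specifics (bucket width w = ceil(1/ε), current bucket id b = ceil(N/w), a per-entry maximum-undercount Δ, and pruning entries with f + Δ ≤ b) follow the paper's standard formulation but are not shown on these slides, so treat them as assumptions here; the class interface is hypothetical:

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting: D maps element -> (f, delta)."""

    def __init__(self, s, eps):
        self.s, self.eps = s, eps
        self.w = math.ceil(1.0 / eps)   # bucket width
        self.D = {}
        self.n = 0                      # stream length so far

    def insert(self, e):
        self.n += 1
        b = math.ceil(self.n / self.w)  # current bucket id
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)  # existing entries count exactly
        else:
            self.D[e] = (1, b - 1)      # delta = maximum possible undercount
        if self.n % self.w == 0:        # bucket boundary: prune small entries
            self.D = {x: (f, d) for x, (f, d) in self.D.items() if f + d > b}

    def output(self):
        """Elements with f >= (s - eps) * N; no false negatives."""
        cut = (self.s - self.eps) * self.n
        return {e: f for e, (f, d) in self.D.items() if f >= cut}
```

Unlike Sticky Sampling this sketch is fully deterministic: an element that is genuinely frequent keeps its counter above the pruning line at every bucket boundary, while rare elements are discarded at the end of the bucket in which they arrived.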
