
Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku (Stanford), Rajeev Motwani (Stanford). Presented by Michal Spivak, November 2003.

The Problem … Identify all elements of the stream whose current frequency exceeds the support threshold s = 0.1%.

Related problem … Identify all subsets of items whose current frequency exceeds s = 0.1%.

Purpose of this paper: present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages: Simple. Low memory footprint. Output is approximate but guaranteed not to exceed a user-specified error parameter. Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items.

Overview: Introduction. Frequency counting applications. Problem definition. Algorithm for Frequent Items. Algorithm for Frequent Sets of Items. Experimental results. Summary.

Introduction

Motivating examples: Iceberg Queries – perform an aggregate function over an attribute and eliminate those below some threshold. Association Rules – require computation of frequent itemsets. Iceberg Datacubes – group-bys of a CUBE operator whose aggregate frequency exceeds a threshold. Traffic measurement – requires identification of flows that exceed a certain fraction of total traffic.

What " s out there today … Algorithms that register careful results Attempt to minimize number of information passes (best calculations take two passes). Issues when adjusted to streams: Only one pass is permitted. Results are relied upon to be accessible with short reaction time. Neglect to give any from the earlier ensure on the nature of their yield.

Why Streams? Streams versus stored data: the volume of a stream over its lifetime can be huge, and queries over streams require timely answers with small response times. It is therefore not feasible to store the stream in its entirety.

Frequency checking applications

Existing applications for the following problems: Iceberg Queries – perform an aggregate function over an attribute and discard those below some threshold. Association Rules – require computation of frequent itemsets. Iceberg Datacubes – group-bys of a CUBE operator whose aggregate frequency exceeds a threshold. Traffic measurement – requires identification of flows that exceed a certain fraction of total traffic.

Iceberg Queries: Identify aggregates that exceed a user-specified threshold r. One of the published algorithms computes iceberg queries efficiently using repeated hashing over multiple passes.* Basic idea: In the first pass a set of counters is maintained; each incoming item is hashed to one of the counters, which is incremented. These counters are then compressed into a bitmap, with a 1 denoting a large counter value. In the second pass, exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1. This algorithm is hard to adapt for streams since it requires two passes. * M. FANG, N. SHIVAKUMAR, H. GARCIA-MOLINA, R. MOTWANI, AND J. ULLMAN. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299–310, 1998.
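The two-pass idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Fang et al. implementation; the bucket count and the use of Python's built-in `hash` are assumptions made for the example:

```python
from collections import Counter

def iceberg_query(data, threshold, num_buckets=16):
    """Two-pass iceberg sketch: coarse hashed counters, a bitmap of
    heavy buckets, then exact counts only for items in heavy buckets."""
    # Pass 1: hash every item into a small array of counters.
    counters = [0] * num_buckets
    for item in data:
        counters[hash(item) % num_buckets] += 1
    # Compress counters into a bitmap: True means the bucket is "heavy".
    bitmap = [c >= threshold for c in counters]
    # Pass 2: keep exact counts only for items hashing to heavy buckets.
    exact = Counter(item for item in data
                    if bitmap[hash(item) % num_buckets])
    return {item: n for item, n in exact.items() if n >= threshold}
```

Note that an item can survive pass 1 only because it shares a bucket with a genuinely heavy item; pass 2 removes such false positives, but both passes over the data are unavoidable, which is exactly why this scheme does not carry over to streams.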

Association Rules: Definitions. Transaction – a subset of items drawn from I, the universe of all items. An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions. Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s. The confidence of a rule X => Y is the value support(X ∪ Y) / support(X).

Example – Market basket analysis. For support = 50%, confidence = 50%, we have the following rules: 1 => 3 with 50% support and 66% confidence; 3 => 1 with 50% support and 100% confidence.

Reduce to computing frequent itemsets. For support = 50%, confidence = 50%, for the rule 1 => 3: Support = Support({1, 3}) = 50%. Confidence = Support({1, 3}) / Support({1}) = 66%.
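The support/confidence arithmetic above can be made concrete with a small sketch. The transaction list below is a hypothetical dataset constructed so the numbers match the slide's example; it is not from the paper:

```python
# Hypothetical transactions chosen so that support({1,3}) = 50%,
# support({1}) = 75% and support({3}) = 50%, reproducing the slide.
transactions = [{1, 3}, {1, 3, 4}, {1, 2}, {2, 5}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

sup_13 = support({1, 3}, transactions)              # 0.5
conf_1_to_3 = sup_13 / support({1}, transactions)   # ~0.667 (66%)
conf_3_to_1 = sup_13 / support({3}, transactions)   # 1.0 (100%)
```

This shows why finding association rules reduces to finding frequent itemsets: once all itemset supports are known, confidences are just ratios of supports.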

Toivonen " s calculation Based on testing of the information stream. Fundamentally, in the main pass, frequencies are registered for tests of the stream, and in the second pass these the legitimacy of these things is resolved. Can be adjusted for information stream Problems: - false negatives happen in light of the fact that the blunder in recurrence tallies is two sided - for little estimations of e , the quantity of tests is gigantic ~ 1/e (100 million examples)

Network flow identification. Flow – a sequence of transport-layer packets that share the same source and destination addresses. Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. Their algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries. The algorithm presented in this paper is directly applicable to the problem of network flow identification, and it beats their algorithm in terms of space requirements.

Problem definition

Problem Definition. The algorithm accepts two user-specified parameters: a support threshold s ∈ (0,1) and an error parameter ε ∈ (0,1), with ε << s. N – current length of the stream (i.e. the number of tuples seen so far). Itemset – a set of items. Denote by item(set) an item or an itemset. At any point in time, the algorithm can be asked to produce a list of item(set)s along with their estimated frequencies.

Approximation guarantees: (1) All item(set)s whose true frequency exceeds sN are output; there are no false negatives. (2) No item(set) whose true frequency is less than (s − ε)N is output. (3) Estimated frequencies are less than the true frequencies by at most εN.

Input Example. s = 0.1%. As a rule of thumb, ε should be set to one-tenth or one-twentieth of s, so ε = 0.01%. By property 1, ALL elements with frequency exceeding 0.1% will be output. By property 2, NO element with frequency below 0.09% will be output. Elements between 0.09% and 0.1% may or may not be output; those that "make it" are false positives. By property 3, all reported frequencies are less than the true frequencies by at most 0.01%.

Problem Definition cont … An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties. Goal: devise algorithms that support an ε-deficient synopsis using as little main memory as possible.

The Algorithms for Frequent Items: Sticky Sampling; Lossy Counting.

Sticky Sampling Algorithm. Stream: 34 15 30 28 31 41 23 35 19 … Create counters by sampling.

Notations … Data structure S – a set of entries of the form (e, f), where f estimates the frequency of an element e. r – sampling rate. Sampling an element with rate r means we select the element with probability 1/r.

Sticky Sampling cont … Initially S is empty and r = 1. For each incoming element e: if e exists in S, increment the corresponding f; else { sample the element with rate r; if sampled, add the entry (e, 1) to S, else ignore it }.
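The per-element step above can be sketched as follows. This is a minimal illustration of the insertion rule; the function name `process` is an assumption, and S is represented as a plain dict from element to estimated count:

```python
import random

def process(e, S, r):
    """One arrival in Sticky Sampling. S maps element -> estimated
    count f; r is the current sampling rate (select w.p. 1/r)."""
    if e in S:
        S[e] += 1                      # existing entries are always counted
    elif random.random() < 1.0 / r:
        S[e] = 1                       # a new element enters only if sampled
```

The "sticky" behavior is visible here: once an element has an entry, every later occurrence is counted exactly, regardless of the sampling rate.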

The sampling rate. Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the probability of failure. The first 2t elements are sampled at rate = 1, the next 2t elements at rate = 2, the next 4t elements at rate = 4, and so on …

Sticky Sampling cont … Whenever the sampling rate r changes: for each entry (e, f) in S, repeat { toss an unbiased coin; if the toss is not successful, diminish f by one; if (f == 0) { delete the entry from S; break } } until the toss is successful.
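The rate-change adjustment can be sketched directly from the pseudocode above. The function name is an assumption; S is again a dict from element to count:

```python
import random

def adjust_for_rate_change(S):
    """Sticky Sampling adjustment when r doubles: for each entry, toss
    unbiased coins, decrementing f once per unsuccessful toss and
    stopping at the first success; entries reaching zero are deleted."""
    for e in list(S):                  # list() lets us delete while iterating
        while random.random() < 0.5:   # "unsuccessful" toss, probability 1/2
            S[e] -= 1
            if S[e] == 0:
                del S[e]
                break
```

Each entry loses a geometrically distributed number of counts, which is what makes S look as if the new, coarser rate had been in effect from the start.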

Sticky Sampling cont … The number of unsuccessful coin tosses follows a geometric distribution. Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the very beginning. When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s − ε)N.
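The query step at the end of the slide is a one-line filter; a minimal sketch (the function name is an assumption):

```python
def frequent_items(S, s, eps, N):
    """Answer a user query: report the entries of S whose estimated
    count f passes the (s - eps) * N threshold from the slide."""
    return {e: f for e, f in S.items() if f >= (s - eps) * N}
```

Lowering the threshold from sN to (s − ε)N is what guarantees no false negatives: a truly frequent element may be undercounted by up to εN, but never by more.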

Theorem 1: Sticky Sampling computes an ε-deficient synopsis with probability at least 1 − δ, using at most (2/ε) log(s⁻¹ δ⁻¹) entries in expectation.

Theorem 1 – proof. The first 2t elements find their way into S. When r ≥ 2, N = rt + rt′ (t′ ∈ [1, t)) => 1/r ≥ t/N. An error in frequency corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 − 1/r)^(εN) < (1 − t/N)^(εN) < e^(−εt). The number of elements with frequency exceeding s is at most 1/s => the probability that the estimate for any of them is deficient by εN is at most e^(−εt)/s.

Theorem 1 – proof cont … The probability of failure should be at most δ. This yields e^(−εt)/s < δ => t ≥ (1/ε) log(s⁻¹ δ⁻¹). Since the space requirement is 2t entries, the space bound follows …
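The last step on this slide can be written out explicitly, using the same symbols as the slides (δ the failure probability, t the count threshold):

```latex
\frac{e^{-\epsilon t}}{s} < \delta
\;\Longleftrightarrow\; e^{-\epsilon t} < s\,\delta
\;\Longleftrightarrow\; -\epsilon t < \ln(s\,\delta)
\;\Longleftrightarrow\; t > \frac{1}{\epsilon}\,\ln\!\frac{1}{s\,\delta}
  \;=\; \frac{1}{\epsilon}\,\log\!\left(s^{-1}\delta^{-1}\right).
```

Since S holds at most 2t entries in expectation, the (2/ε) log(s⁻¹ δ⁻¹) space bound of Theorem 1 follows.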

Sticky Sampling summary. The algorithm is called sticky sampling because S sweeps over the stream like a magnet, attracting all elements that already have an entry in S. The space complexity is independent of N. Maintaining samples in this way was first presented by Gibbons and Matias, who used it to solve the top-k problem. This algorithm differs in that the sampling rate r increases logarithmically so as to produce ALL items with frequency > s, not just the top k.

bucket 1 bucket 2 bucket 3. Lossy Counting: Divide the stream into buckets. Keep exact counters for items within the buckets. Prune entries at bucket boundaries.

Lossy Counting cont … A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined in Section 3 using at most (1/ε) log(εN) space, where N denotes the current length of the stream. The user specifies two parameters: support s and error ε.
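The bucket-and-prune scheme above can be sketched as follows. This is a minimal illustration under the standard Lossy Counting bookkeeping (bucket width w = ⌈1/ε⌉, and per entry a maximum-undercount term Δ = b − 1 recorded at insertion time); the function and variable names are choices made for the example:

```python
from math import ceil

def lossy_counting(stream, s, eps):
    """Sketch of Lossy Counting: D maps e -> (f, delta); entries with
    f + delta <= current bucket id b are pruned at bucket boundaries."""
    w = ceil(1.0 / eps)                # bucket width
    D = {}
    N = 0
    for e in stream:
        N += 1
        b = ceil(N / w)                # current bucket id
        if e in D:
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:
            D[e] = (1, b - 1)          # delta = max possible undercount
        if N % w == 0:                 # bucket boundary: prune light entries
            D = {k: (f, d) for k, (f, d) in D.items() if f + d > b}
    # Query: report entries passing the (s - eps) * N threshold.
    return {e: f for e, (f, d) in D.items() if f >= (s - eps) * N}
```

The pruning rule is what bounds the space at (1/ε) log(εN): an entry can survive only if it keeps accumulating roughly one count per bucket, so rare items are repeatedly evicted while their maximum undercount stays below εN.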