Description

17% security fundamentalists, 56% down to business lion's share, 27% hardly concerned ... We can draftsman registering frameworks to ensure values we accept are ...

Transcripts

Hippocratic Data Management Rakesh Agrawal IBM Almaden Research Center

Thesis We require data frameworks that regard the security of information they oversee AND don\'t block the helpful stream of data. It is achievable to accommodate the obvious inconsistency

Outline Why Privacy in Data Systems Some Technology Directions Some Challenging Problems

Drivers for Privacy Surveys: 17% protection fundamentalists, 56% realistic greater part, 27% insignificantly concerned (Understanding net clients\' demeanor about online security, April 99) 83% would quit working with an organization in the event that it abused client data (Privacy on and off the Internet: What purchasers need, Nov. 2001) Govt. enactments & rules: Fair Information Practices Act (US, 1974) OECD Guidelines (Europe, 1980) Canadian Standards Association\'s Model Code (1995) Australian Privacy Amendment (2000) Japan: proposed enactment (2003) HIPAA, GLB, Recent U.S. Government & State Initiatives

Privacy Violations Accidents: Kaiser, GlobalHealthrax Lax security: Massachusetts govt. Morally faulty conduct: Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food Illegal: Toysmart

Assertion Enterprises need instruments and advancements for overseeing private information and upholding protection approaches.

Founding Tenets of Current Database Systems Ullman, "Standards of Database and Knowledgebase Systems" Fundamental: Manage tenacious information. Access a lot of information proficiently. Attractive: Support for information model, abnormal state dialects, exchange administration, access control , and strength. Comparable rundown in other database course readings.

Statistical & Secure Databases Statistical Databases Provide measurable data (whole, number, and so on.) without bargaining delicate data about people, [AW89] Multilevel Secure Databases Multilevel relations, e.g., records labeled "mystery", "classified", or "unclassified", e.g. [JS91] Need to ensure protection in value-based databases that bolster every day operations. Can\'t limit questions to factual inquiries. Can\'t label every one of the records "top mystery".

Our Research Directions Privacy Preserving Data Mining Hippocratic Databases

Data Mining and Privacy The essential errand in information mining: improvement of models about totaled information. Can we create exact models without access to exact data in individual information records? R. Agrawal, R. Srikant. Security Preserving Data Mining . ACM Int\'l Conf. On Management of Data (SIGMOD), May 2000.

30 | 25K | … 50 | 40K | … Randomizer 65 | 50K | … 35 | 60K | … Reconstruct Age Distribution Reconstruct Salary Distribution Data Mining Algorithm Model Privacy Preserving Data Mining

Reconstruction Problem Original qualities x 1 , x 2 , ..., x n from likelihood circulation X To conceal these qualities, we utilize y 1 , y 2 , ..., y n from likelihood conveyance Y Given x 1 +y 1 , x 2 +y 2 , ..., x n +y n the likelihood dissemination of Y Estimate the likelihood dispersion of X.

Intuition (Reconstruct single point) Use Bayes\' principle for thickness capacities

Intuition (Reconstruct single point) Use Bayes\' tenet for thickness capacities

Reconstruction: Intuition Combine assessments of where a point originated from for every one of the focuses: yields evaluation of unique appropriation.

Reconstruction Algorithm f X 0 := Uniform dispersion j := 0 rehash f X j+1 (a) := Bayes\' Rule j := j+1 until (halting measure met) Converges to greatest probability gauge. D. Agrawal & C.C. Aggarwal, PODS 2001.

Works Well

Classification Naïve Bayes Assumes autonomy between qualities. Choice Tree Correlations are debilitated by randomization.

Experimental Methodology Compare precision against Original : unperturbed information without randomization. Randomized : annoyed information however without making any remedies for randomization. Test information not randomized. Manufactured information benchmark from [AGI+92]. Preparing set of 100,000 records, split similarly between the two classes.

Decision Tree Experiments

Accuracy versus Randomization

So far … Question: Can we create exact models without access to exact data in individual information records? Answer: yes, by randomization. for numerical traits, grouping How about Association Rules?

Associations Recap An exchange t is an arrangement of things (e.g. books) All exchanges shape a set T of exchanges Any itemset A has bolster s in T if Itemset An is successive if s s min Task: Find all regular itemsets

The Problem How to randomize exchanges with the goal that we can discover incessant itemsets while safeguarding security at exchange level? Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data . eighth Int\'l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002 .

Randomization Overview Alice J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … Recommendation Service B. Lances, baseball, cnn.com, … Bob B. Lances, baseball, cnn.com, … B. Marley, outdoors, linux.org, … Chris B. Marley, outdoors, linux.org, …

Randomization Overview Alice J.S. Bach, painting, nasa.gov, … J.S. Bach, painting, nasa.gov, … Recommendation Service B. Lances, baseball, cnn.com, … Bob Associations B. Lances, baseball, cnn.com, … B. Marley, outdoors, linux.org, … Chris Recommendations B. Marley, outdoors, linux.org, …

Randomization Overview Alice J.S. Bach, painting, nasa.gov, … Metallica, painting, nasa.gov, … Recommendation Service Support Recovery B. Lances, soccer, bbc.co.uk, … Bob Associations B. Lances, baseball, cnn.com, … B. Marley, outdoors, ibm.com … Chris Recommendations B. Marley, outdoors, linux.org, …

Uniform Randomization Given an exchange, keep thing with, say 20% likelihood, supplant with another arbitrary thing with 80% likelihood.

10 M exchanges of size 10 with 10 K things: 1% have { x , y , z } 5% have { x , y } , { x , z } , or { y , z } just 94% have one or zero things of { x , y , z } Example: { x , y , z } at most • 0.2 • (9/10,000) 2 • 0.2 3 • 0.2 2 • 8/10,000 0.008% 800 ts. 97.8% 0.00016% 16 trans. 1.9% under 0.00002% 2 exchanges 0.3% Privacy Breach : Given { x , y , z } in the randomized exchange, we have around 98% sureness of { x , y , z } in the first one

Privacy Breach Suppose: t is a unique exchange; t\' is the comparing randomized exchange; A will be a (continuous) itemset. Definition : Itemset A causes a protection break of level if, for some thing z A ,

Our Solution "Where does an astute man conceal a leaf? In the backwoods. Be that as it may, what does he do if there is no backwoods?" "He grows a woods to conceal it in." G.K. Chesterton Insert numerous false things into every exchange Hide genuine itemsets among false ones Can regardless we find successive itemsets while having adequate protection?

Cut and Paste Randomization Given exchange t of size m , build t\' : t = a , b , c , u , v , w , x , y , z t\' =

Cut and Paste Randomization Given exchange t of size m , develop t\' : Choose a number j somewhere around 0 and K m (cutoff); t = a , b , c , u , v , w , x , y , z t\' = j = 4

Cut and Paste Randomization Given exchange t of size m , build t\' : Choose a number j somewhere around 0 and K m (cutoff); Include j things of t into t\' ; t = a , b , c , u , v , w , x , y , z t\' = b , v , x , z j = 4

Cut and Paste Randomization Given exchange t of size m , develop t\' : Choose a number j somewhere around 0 and K m (cutoff); Include j things of t into t\' ; Each other thing is incorporated into t\' with likelihood p m . The decision of K m and p m depends on the wanted level of security. t = a , b , c , u , v , w , x , y , z t\' = b , v , x , z œ , å , ß , ξ , ψ , € , א , ъ , ђ , … j = 4

Partial Supports To recoup unique backing of an itemset, we require randomized backings of its subsets. Given an itemset An of size k and exchange size m , A vector of halfway backings of An is Here s k is the same as the backing of A . Randomized incomplete backings are signified by

Transition Matrix Let k = | A | , m = | t |. Move lattice P = P ( k, m ) associates randomized halfway backings with unique ones: Randomized backings are conveyed as an aggregate of multinomial dispersions.

The Unbiased Estimators Given randomized fractional backings, we can evaluate unique halfway backings: Covariance lattice for this estimator: To gauge it, substitute s l with ( s est ) l . Unique case: estimators for backing and its change

Privacy Breach Analysis what number added things are sufficient to ensure security? Need to fulfill Pr [ z t | A t\' ] < ( no security ruptures) Select parameters with the goal that it holds for all itemsets. Use equation ( ): Parameters are to be chosen ahead of time! Build a security testing test: an itemset whose all subsets have most extreme conceivable backing. Enough to know maximal backing of an itemset for every size.

Lowest Discoverable Support LDS is s.t., when pre