Demonstrating and Querying Possible Repairs in Duplicate Detection .


52 views
Uploaded on:
Description
Displaying and Questioning Conceivable Repairs in Copy Identification. George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David. Information Cleaning. Genuine information is messy Cases: Grammatical Mistakes: e.g., Micrsoft Heterogeneous Configurations: e.g., Telephone number arrangements
Transcripts
Slide 1

Demonstrating and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David

Slide 2

Data Cleaning Real-world information is messy Examples: Syntactical Errors: e.g., Micrsoft Heterogeneous Formats: e.g., Phone number organizations Missing Values (Incomplete information) Violation of Integrity Constraints: e.g., FDs/INDs Duplicate Records Data Cleaning is the way toward enhancing information quality by evacuating mistakes, irregularities and peculiarities of information George Beskales

Slide 3

Duplicate Elimination Process Unclean Relation P1 P2 Compute Pairwise Similarity 0.9 Cluster Records P6 0.3 0.5 P5 P3 P4 1.0 P1 P2 Clean Relation C1 Merge Clusters P6 C3 P5 P3 P4 C2 George Beskales

Slide 4

Probabilistic Data Cleaning Probabilistic Results Queries Results RDBMS Uncertainty and Cleaning-mindful RDBMS Deterministic Database Uncertain Database Single Clean Instance Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Deterministic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database Unclean Database (a) One-Shot Cleaning (b) Probabilistic Cleaning George Beskales

Slide 5

Motivation Probabilistic Data Cleaning: Avoid deterministic determination of contentions amid information cleaning Enrich question comes about by considering all conceivable cleaning cases Allows indicating inquiry time cleaning prerequisites George Beskales

Slide 6

Probabilistic Results Queries Uncertainty and Cleaning-mindful RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Outline Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-mindful RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Probabilistic Duplicate Elimination Unclean Database George Beskales

Slide 7

Uncertainty in Duplicate Detection Uncertain Duplicates : Determining what records ought to be grouped is dubious because of loud closeness estimations A conceivable repair of a connection is a bunching (apportioning) of the unclean connection Person Possible Repairs Uncertain Clustering George Beskales

Slide 8

Challenges The space of every single conceivable bunching (repairs) is exponentially expansive How to productively create, store and inquiry the conceivable repairs? George Beskales

Slide 9

Possible repairs given by any settled parameterized bunching calculation 2 Possible repairs given by any settled progressive grouping calculation 3 Link-based calculations Hierarchical cut bunching calculation … The Space of Possible Repairs All conceivable repairs 1 George Beskales

Slide 10

Hierarchical Clustering Algorithms Records are grouped in a type of a chain of importance: all singletons are at leaves, and one bunch is at root Example : Linkage-based Clustering Algorithm Distance Threshold (  ) {r 1 ,r 2 },{r 3 ,r 4 ,r 5 } Pair-wise Distance r 1 r 2 r 3 r 4 r 5 George Beskales

Slide 11

Uncertain Hierarchical Clustering Hierarchical bunching calculations can be changed to acknowledge unverifiable parameters The quantity of created conceivable repairs is direct {r 1 ,r 2 },{r 3 ,r 4 ,r 5 } Possible Thresholds [  l ,  u ] {r 1 ,r 2 },{r 3 },{r 4 ,r 5 } {r 1 },{r 2 },{r 3 },{r 4 ,r 5 } {r 1 },{r 2 },{r 3 },{r 4 },{r 5 } r 1 r 2 r 3 r 4 r 5 George Beskales

Slide 12

Probabilities of Repairs Clustering 1 Clustering 2 Clustering 3  ~U(0,10) 0   1 Pr = 0.1 1    3 Pr = 0.2 3  10 Pr = 0.7 George Beskales

Slide 13

Representing the Possible Repairs Clustering 1 Clustering 2 Clustering 3 0   1 Pr = 0.1 1    3 Pr = 0.2 3  10 Pr = 0.7 U-clean Relation Person C George Beskales

Slide 14

Probabilistic Results Queries Uncertainty and Cleaning-mindful RDBMS Outline Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation Probabilistic Results Queries Uncertainty and Cleaning-mindful RDBMS Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Probabilistic Duplicate Elimination Unclean Database George Beskales

Slide 15

Queries over U-Clean Relations We embrace the conceivable universes semantics to characterize inquiries on U-clean relations U-Clean Relation(s) R C Implementation U-Clean Relation Q(R C ) Instances Representation Possible Clean Instances X 1 ,X 2 ,… ,X n Possible Clean Instances Q(X 1 ), Q(X 2 ),… ,Q(X n ) Definition George Beskales

Slide 16

Example: Selection Query Person C SELECT ID, Income FROM Person c WHERE Income>35k George Beskales

Slide 17

Example: Projection Query Person C SELECT DISTINCT Income FROM Person c George Beskales

Slide 18

Example: Join Query Vehicle c Person c SELECT Income, Price FROM Person c , Vehicle c WHERE Income/10 >= Price George Beskales

Slide 19

Aggregation Queries Person c SELECT Sum(Income) FROM Person c X 1 X 2 X 3 Pr = 0.7 Sum=109k Pr = 0.2 Pr = 0.1 Sum=156k Sum=187k George Beskales

Slide 20

Other Meta-Queries Obtaining the most plausible clean case Obtaining the -certain groups Obtaining a spotless occurrence comparing to a particular parameters of the grouping calculations Obtaining the likelihood of grouping an arrangement of records together George Beskales

Slide 21

Outline Generating and Storing the Possible Clean Instances Querying the Clean Instances Experimental Evaluation George Beskales

Slide 22

Experimental Evaluation A model as an augmentation of PostgreSQL Synthetic information generator gave by Febrl (openly extensible biomedical record linkage) Two various leveled calculations: Single-Linkage (S.L.) Nearest-neighbor based grouping calculation (N.N.) [Chaudhuri et al., ICDE\'05] George Beskales

Slide 23

Experimental Evaluation George Beskales

Slide 24

Experimental Evaluation George Beskales

Slide 25

Experimental Evaluation George Beskales

Slide 26

Conclusion We permit speaking to and questioning different conceivable clean cases We altered progressive bunching calculations to permit producing numerous repairs We minimally store the conceivable repairs by keeping the genealogy data of groups in unique qualities New (probabilistic) inquiry sorts can be issued against the number of inhabitants in conceivable repairs George Beskales

Recommended
View more...