Improving Digest-Based Collaborative Spam Detection

MICS. Improving Digest-Based Collaborative Spam Detection. Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec, EPFL, Switzerland. MIT_Spam_Conference, Mar 27-28, 2008, MIT, Cambridge. Talk content: Digest-based filtering – global picture overview
Slide 1

MICS
Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland
MIT_Spam_Conference, Mar 27-28, 2008, MIT, Cambridge.

Slide 2

Talk content
- Digest-based filtering – global picture overview
- Understanding "HOW digests WORK": the "Open Digest" paper [1] (very positive results/conclusions, cited and referred to a lot!)
- Understanding it better: our re-evaluation of the "Open Digest" paper results (different conclusions!)
- Our alternative digests: results IMPROVE a lot, understanding "WHY"
- Understanding the "why" => further improvements possible (negative selection)
- Conclusions

[1] "An Open Digest-based Technique for Spam Detection", E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA USA, September 15-17, 2004.

Slide 3

Two main collaborative spam detection approaches
1) White-listing using social networks (diagram: User 1, User 2, User 3, ... User n linked by connections). Example: PGP graph of certificates.
2) Bulky content detection using digests (diagram: User 1, User 2, ... User n sharing recent digests). Examples: DCC, Vipul's Razor, Commtouch.
Implementations (in both cases): centralized or decentralized, open or proprietary.
This talk (paper): re-evaluates the digest approach for bulky content detection.

Slide 4

A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
[Diagram: a spammer (sends in bulk) reaches mail clients (MC) through mail servers (MS); the mail servers query the DCC servers.]
- ~ 250 DCC servers
- ~ n * 10 000 mail servers
- ~ n * a large number of mail clients
- Query = digest, Reply = counter (n=3)
Strengths/drawbacks:
- fast response
- not precise (FP problems)
- limited obfuscation resistance
Reproducible evaluation of digest efficiency: "Open Digest" paper
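The query/reply exchange on this slide can be illustrated with a minimal sketch. It assumes a single in-memory server that counts how often each digest has been queried; the class name `DigestServer`, the function `is_bulk`, and the bulk threshold of 3 are hypothetical and are not taken from DCC itself.

```python
from collections import Counter

# Hypothetical bulk threshold; the slide only shows a reply "counter (n=3)".
BULK_THRESHOLD = 3


class DigestServer:
    """Toy stand-in for a digest clearinghouse: counts queries per digest."""

    def __init__(self):
        self.counts = Counter()

    def query(self, digest: str) -> int:
        """Record one more report of this digest and reply with its counter."""
        self.counts[digest] += 1
        return self.counts[digest]


def is_bulk(reply_count: int, threshold: int = BULK_THRESHOLD) -> bool:
    """A mail server could treat a message as bulk once the counter is high."""
    return reply_count >= threshold
```

In this sketch an exact digest match is required, which is why obfuscation resistance depends entirely on how similar messages map to digests.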

Slide 5

Producing digests: Nilsimsa similarity hashing, as explained in the OD-paper
- E-mail, L characters long (example text: "Cheapest vac... Best Regards, John").
- A sliding window of N=5 characters moves over the e-mail (first window: "Cheap").
- At each step, 8 trigrams are formed from the characters in the window (e.g. "Che", "Cha", "hea").
- Each trigram is hashed to one byte (b7 ... b0), Hash: 30^3 -> 2^8, and the accumulator at that position (0 ... 255) is incremented by 1 (e.g. hash value 00001111 => accumulator 15 gets +1).
- After L-N+1 steps, each of the 256 accumulators (0 ... 15 ... 255) is converted to one bit: Digest = 0 1 0 1 ...
Digest is a binary string of 256 bits.

Definition: the Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128.
Identical e-mails => NCV=128, unrelated e-mails => NCV close to 0.
More similar e-mails => more similar digests => higher NCV.
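The digest construction and the NCV definition above can be sketched as follows. This is a simplified illustration, not the Nilsimsa reference implementation: the choice of the 8 trigrams inside the window (`TRIPLES`), the use of `zlib.crc32` as the trigram hash, and the median rule for turning accumulators into bits are all assumptions made for the example.

```python
import zlib
from statistics import median

WINDOW = 5  # N = 5 characters, as on the slide

# Assumed selection of 8 character-index triples inside the 5-character window.
TRIPLES = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (0, 1, 4),
           (0, 2, 4), (0, 3, 4), (1, 2, 3), (2, 3, 4)]


def digest(text):
    """Return a 256-bit digest (list of 0/1) for the given text."""
    acc = [0] * 256
    # One step per window position: L - N + 1 steps in total.
    for start in range(len(text) - WINDOW + 1):
        window = text[start:start + WINDOW]
        for i, j, k in TRIPLES:
            trigram = (window[i] + window[j] + window[k]).encode("utf-8")
            # Assumed hash: CRC32 reduced to one byte selects the accumulator.
            acc[zlib.crc32(trigram) % 256] += 1
    cut = median(acc)
    # Assumed rule: an accumulator above the median yields bit 1.
    return [1 if a > cut else 0 for a in acc]


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128
```

With these definitions, two identical texts always give NCV=128, and the value can range from -128 to 128.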

Slide 6

> Looking at the "Open Digest" paper experiments and results
Evaluation <= experiment:
- spam bulk detection <= detection of similarity between two e-mails from the same spam bulk
- ham miss-detection <= miss-detection of similarity between unrelated e-mails
Bulk detection experiment (repeated many times, to get statistics):
Spam Corpus -> Select randomly -> Obfuscate (2 copies) -> Compute digests (010110…10, 011010…11) -> Evaluate similarity (Threshold=54) -> NCV value (integer), Matching indicator (0/1)
OD-paper result for "adding random text" obfuscation: the OD-paper only evaluates (discusses) the average NCV.
OD-paper conclusion: Average NCV > Threshold => bulk detection is resistant to strong obfuscation by the spammer.
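The bulk detection experiment can be sketched as a small loop. This is a sketch under stated assumptions: `digest_fn` is any caller-supplied function that maps text to a 256-bit list, "adding random text" is modelled by appending random lowercase characters, the obfuscation ratio is taken as added length divided by original length, and a match is taken to mean NCV >= threshold. The function names are hypothetical.

```python
import random
import string
from statistics import mean

THRESHOLD = 54  # threshold value quoted on the slide


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128


def add_random_text(text, ratio, rng):
    """Assumed obfuscation: append ratio * len(text) random characters."""
    alphabet = string.ascii_lowercase + " "
    n_extra = int(ratio * len(text))
    return text + "".join(rng.choice(alphabet) for _ in range(n_extra))


def bulk_detection_stats(corpus, digest_fn, ratio, trials=100, seed=0):
    """Return (average NCV, fraction of trials with NCV >= THRESHOLD)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        message = rng.choice(corpus)                     # select randomly
        copy_a = add_random_text(message, ratio, rng)    # obfuscate, copy 1
        copy_b = add_random_text(message, ratio, rng)    # obfuscate, copy 2
        values.append(ncv(digest_fn(copy_a), digest_fn(copy_b)))
    match_rate = mean(1 if v >= THRESHOLD else 0 for v in values)
    return mean(values), match_rate
```

Returning the match rate next to the average NCV makes it visible when the average alone hides how many individual pairs fall below the threshold.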

Slide 7

> Looking at the "Open Digest" paper experiments and results (cont.)
Ham miss-detection experiment:
Ham and Spam Corpus (n1~2500, n2~2500 e-mails) -> For each pair of unrelated e-mails -> Compute digests (100110…10, 011100…11) -> Evaluate similarity (Threshold=54) -> NCV values (integer), Matching indicators (0/1)
OD-paper result: no matching (miss-detection) case is observed.
OD-paper conclusion: miss-detection of good e-mails must be low; approximating the miss-detection probability by use of the Binomial distribution supports the observed result.
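The Binomial argument can be made concrete with a short calculation. It assumes, as a simplification, that each of the 256 bit positions of two unrelated digests agrees independently with probability 1/2, and that a match means NCV >= threshold; the slide does not spell out either detail, and the function name is hypothetical.

```python
from math import comb

BITS = 256       # digest length
THRESHOLD = 54   # NCV threshold quoted on the slide


def miss_detection_probability(threshold=THRESHOLD, bits=BITS):
    """P(NCV >= threshold) when equal bits ~ Binomial(bits, 1/2)."""
    # NCV = equal_bits - bits/2, so NCV >= threshold needs this many equal bits.
    min_equal = threshold + bits // 2
    tail = sum(comb(bits, k) for k in range(min_equal, bits + 1))
    return tail / 2 ** bits
```

Under the independence assumption the resulting probability is extremely small, which is consistent with the OD-paper observing no matching case. It says nothing about real ham digests whose bits are not independent.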

Slide 8

> Extending the OD-paper experiments: spam bulk detection
Bulk detection experiment, same as in the OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics):
Spam Corpus -> Select randomly -> Obfuscate (2 copies) -> Compute digests (010110…10, 011010…11) -> Evaluate similarity (Threshold=54) -> NCV value (integer), Matching indicator (0/1)
The OD-paper result is well recovered (blue dotted line).
The OD-paper conclusion does not hold! Even an only slightly higher obfuscation ratio brings the average NCV below the threshold.

Slide 9

> Understanding better what happens
"Compare X to Database" (generic experiment):
- X: selected randomly from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1); compute digest (010110…10).
- Database: DB of spam and ham digests (represents "previous digest queries"), built from Ham Corpus 2/2 and Spam Corpus (Obfuscation 2), with sizes n1 and n2.
- Compare the digest of X to each digest from the DB (Threshold=54) -> NCV values (integer), Matching indicators (0/1)
We look at more statistics:
- Probability of email-to-email matching
- Average Max(NCV)
- NCV histogram
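The per-message part of the "Compare X to Database" experiment can be sketched as below. The sketch assumes that digests are lists of 256 bits, that the database is non-empty, and that a match means NCV >= threshold; the function name `compare_to_db` and the returned dictionary keys are hypothetical.

```python
from collections import Counter

THRESHOLD = 54  # threshold value quoted on the slide


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128


def compare_to_db(x_digest, db_digests, threshold=THRESHOLD):
    """Compare one digest to every DB digest and collect the statistics."""
    values = [ncv(x_digest, d) for d in db_digests]
    return {
        "matched": any(v >= threshold for v in values),  # matching indicator
        "max_ncv": max(values),                          # Max(NCV) for this X
        "histogram": Counter(values),                    # NCV value -> count
    }
```

Averaging `matched` and `max_ncv` over many randomly selected X, and summing the histograms, would give the three statistics listed on the slide.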

Slide 10

SPAM – DB experiment results: Mean Max(NCV) value is not informative. The effect of obfuscation changes gracefully. The spammer may gain by additional obfuscation.

Slide 11

SPAM – DB, NCV histograms: effect of obfuscation. Small obfuscation: digests are still useful for bulk detection.

Slide 12

SPAM – DB, NCV histograms: effect of obfuscation. Stronger obfuscation: most of the digests are rendered useless!

Slide 13

HAM – DB experiment results: Mean Max(NCV) value is not informative. Miss-detection probability is still too high for practical use.

Slide 14

HAM – DB, NCV histograms: effect of obfuscation. Spam obfuscation does not affect miss-detection of good e-mails. Shifted and wide histograms => the high false positives are explained.

Slide 15

Alternative digests
Sampling strings: fixed length, random positions; one digest per sampled string (011010…11, 101110…11, 001010…10).
Email-to-email matching: max NCV over the pairs of digests from the two sets (find how similar the most similar parts are, e.g. spammy phrases).
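The sampling and matching steps of the alternative digests can be sketched as follows. The sketch assumes that each sampled string is digested separately by some digest function that is not shown here, and that digests are lists of 256 bits; the names `sample_strings` and `email_ncv` and the handling of texts shorter than the sample length are hypothetical.

```python
import random


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128


def sample_strings(text, length, count, rng):
    """Sample `count` substrings of fixed `length` at random positions."""
    if len(text) <= length:
        return [text]  # assumed fallback for short e-mails
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def email_ncv(digests_a, digests_b):
    """Email-to-email similarity: max NCV over all pairs of digests."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)
```

Taking the maximum over pairs means that one shared passage can be enough for two e-mails to match, even when the rest of each message is random text.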

Slide 16

SPAM – DB experiment results (alt. digests): spam bulk detection is no longer vulnerable to obfuscation...

Slide 17

SPAM – DB (alt. digests), effect of obfuscation: … and we can see why it is like that.

Slide 18

HAM – DB experiment results (alt. digests): miss-detection probability is still too high.

Slide 19

HAM – DB (alt. digests), effect of obfuscation: What can be done to decrease ham miss-detection?

Slide 20

Alternative digests open new possibilities
New email -> digest(s) -> Negative selection against a database of good digests -> digests that don't match -> Compare to the collaborative database of digests (DB)
The last part (comparison to the DB) is the same as without negative selection.
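The negative selection pipeline above can be sketched in a few lines. The sketch assumes that digests are lists of 256 bits, that a match means NCV >= threshold, and that the same threshold of 54 is used for both comparisons; the function names are hypothetical.

```python
THRESHOLD = 54  # assumed to be the same threshold as in the other experiments


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128


def negative_selection(new_digests, good_digests, threshold=THRESHOLD):
    """Keep only the digests that match no digest in the good database."""
    return [d for d in new_digests
            if all(ncv(d, g) < threshold for g in good_digests)]


def matches_collaborative_db(new_digests, good_digests, collab_db,
                             threshold=THRESHOLD):
    """Compare the surviving digests to the collaborative database."""
    survivors = negative_selection(new_digests, good_digests, threshold)
    return any(ncv(d, c) >= threshold for d in survivors for c in collab_db)
```

Digests that resemble known good content are dropped before the collaborative comparison, so they cannot cause a ham message to match the shared database.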

Slide 21

Effect of negative selection on miss-detection of ham:

Slide 22

Conclusions
Use of proper statistics is crucial for proper conclusions from experiments.
Alternative digests provide much better results, and by use of NCV histograms w
