MICS
Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.
Talk content
• Digest-based filtering: global picture
• Understanding “how digests work”: the “Open Digest” paper (very positive results and conclusions, widely cited)
• Understanding it better: our re-evaluation of the “Open Digest” paper results (different conclusions!)
• Our alternative digests: results improve a lot, and we understand why
• Understanding the “why” => further improvements possible (negative selection)
• Conclusions
“Open Digest” paper: “An Open Digest-based Technique for Spam Detection”, E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA USA, September 15-17, 2004.
Two main collaborative spam detection approaches
1) White-listing using Social Networks. Example: PGP graph of certificates
   [Diagram: User 1, User 2, ..., User n linked by relationships]
2) Bulky Content Detection using Digests. Examples: DCC, Vipul’s Razor, Commtouch
   [Diagram: User 1, User 2, User 3, ..., User n and their recent digests]
Implementations (in both cases): centralized or decentralized, open or proprietary
This talk (paper): digests approach for bulky content detection
A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
[Diagram: spammer (sends in bulk); ~ n * millions of mail users; ~ n * 10 000 mail servers; ~ 250 DCC servers; Query = digest, Reply = counter (n=3)]
• Strengths/drawbacks:
 - fast response
 - not precise (FP problems)
 - limited obfuscation resistance
Reproducible evaluation of digest efficiency: “Open Digest” paper
Producing Digests: Nilsimsa similarity hashing, as explained in the OD-paper
[Figure: e-mail text “Cheapest vac... Best Regards, John”, sliding window, trigrams “Che”, “Cha”, “hea”, ..., accumulators 0 ... 15 ... 255, Digest = 0 1 1 0 1 ...]
• A sliding window of N=5 characters moves over the e-mail, which is L characters long
• At each window position, trigrams are formed from the characters in the window (numbered 1, 2, ..., 8 in the figure)
• Each trigram is hashed to one of 256 values (Hash: 30^3 -> 2^8), and the accumulator at that position (0 ... 255) is incremented by 1
• After L-N+1 steps, the accumulators are converted into the digest
• Digest is a binary string of 256 bits
• Definition: the Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128
• Identical emails: NCV=128; unrelated emails: NCV close to 0. More similar emails => more similar digests => higher NCV
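The digest construction and the NCV definition above can be sketched in a few lines of Python. This is a minimal illustration, not the Nilsimsa reference implementation: the trigram selection (all 3-character combinations of the window, which gives 10 rather than the 8 shown in the figure), the CRC32-based hash, and the "above the mean" rule for turning accumulators into bits are all assumptions made for the sketch.

```python
import zlib
from itertools import combinations

WINDOW = 5      # N on the slide
BUCKETS = 256   # one accumulator per digest bit


def digest(text):
    """Return a 256-element list of 0/1 bits for `text` (illustrative only)."""
    acc = [0] * BUCKETS
    total = 0
    # L-N+1 window positions, as on the slide.
    for i in range(len(text) - WINDOW + 1):
        window = text[i:i + WINDOW]
        # Assumption: every 3-character combination of the window is a trigram.
        for tri in combinations(window, 3):
            # Assumption: CRC32 modulo 256 stands in for the slide's Hash().
            bucket = zlib.crc32("".join(tri).encode("utf-8")) % BUCKETS
            acc[bucket] += 1
            total += 1
    # Assumption: a bit is 1 when its accumulator is above the mean count.
    mean = total / BUCKETS
    return [1 if a > mean else 0 for a in acc]


def ncv(d1, d2):
    """Nilsimsa Compare Value: number of equal bits minus 128."""
    return sum(b1 == b2 for b1, b2 in zip(d1, d2)) - 128


a = digest("Cheapest vacations ever. Best Regards, John")
b = digest("Cheapest vacations ever! Best Regards, Johnny")
print(ncv(a, a), ncv(a, b))
```

The first printed value is 128 by construction, since a digest always agrees with itself in all 256 positions.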
“Open Digest” paper experiments and results
• Evaluation <= experiment:
 - spam bulk detection <= detection of similarity between two emails from the same spam bulk
 - ham misdetection <= misdetection of similarity between unrelated emails
• Bulk detection experiment (repeated many times, to get statistics): Spam Corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> evaluate similarity against Threshold=54 -> NCV value (integer), matching indicator (0/1)
• OD-paper result for “adding random text” obfuscation: the OD-paper only evaluates (talks about) the average NCV
• OD-paper conclusion: Average NCV > Threshold => bulk detection resistant to strong obfuscation by spammer
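The bulk detection experiment can be sketched as follows. The sketch reports both the average NCV and the fraction of trials that match, since the average alone is the only quantity the OD-paper discusses. Everything concrete here is an assumption: `toy_digest` is a stand-in similarity hash (not Nilsimsa, so the threshold of 54 is not calibrated for it), "adding random text" is modelled as appending random characters, and `corpus` is a hypothetical two-message sample.

```python
import random
import string
import zlib

BITS = 256
THRESHOLD = 54  # threshold value shown on the slide


def toy_digest(text):
    """Stand-in digest (NOT Nilsimsa): hashed character trigrams as a 256-bit int."""
    acc = [0] * BITS
    for i in range(len(text) - 2):
        acc[zlib.crc32(text[i:i + 3].encode("utf-8")) % BITS] += 1
    mean = sum(acc) / BITS
    return sum(1 << k for k, a in enumerate(acc) if a > mean)


def ncv(a, b):
    """Equal bits minus 128, computed from the number of differing bits."""
    return BITS // 2 - bin(a ^ b).count("1")


def add_random_text(email, ratio, rng):
    """Assumed obfuscation: append ratio * len(email) random characters."""
    alphabet = string.ascii_lowercase + " "
    n_extra = int(len(email) * ratio)
    return email + "".join(rng.choice(alphabet) for _ in range(n_extra))


def run_experiment(corpus, ratio, trials, rng):
    """Return (average NCV, fraction of trials with NCV > THRESHOLD)."""
    values = []
    for _ in range(trials):
        email = rng.choice(corpus)
        copy1 = add_random_text(email, ratio, rng)
        copy2 = add_random_text(email, ratio, rng)
        values.append(ncv(toy_digest(copy1), toy_digest(copy2)))
    mean_ncv = sum(values) / trials
    match_rate = sum(v > THRESHOLD for v in values) / trials
    return mean_ncv, match_rate


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "Cheapest vacations this summer, book now and save. Best Regards, John",
    "Lowest prices on watches, limited offer, order today from our store",
]
for ratio in (0.0, 0.5, 2.0):
    print(ratio, run_experiment(corpus, ratio, 200, random.Random(0)))
```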
“Open Digest” paper experiments and results (cont.)
• Ham misdetection experiment: Ham and Spam Corpus -> for each pair of unrelated emails -> compute digests (100110…10, 011100…11) -> evaluate similarity against Threshold=54 -> NCV values (integer), matching indicators (0/1)
• OD-paper result:
 - n1~2500, n2~2500 emails
 - no matching (misdetection) case is observed
• OD-paper conclusion:
 - Misdetection of good emails must be very low
 - approximating the misdetection probability by use of the Binomial distribution supports the observed result
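A Binomial approximation of this kind can be written out directly. The sketch below assumes that, for unrelated emails, each of the 256 digest bits agrees independently with probability 1/2, and that a match means NCV strictly above the threshold; neither detail is stated on the slide. The re-evaluation later in this talk (shifted and wide NCV histograms for ham) suggests that this independence assumption is optimistic.

```python
from math import comb

DIGEST_BITS = 256
THRESHOLD = 54


def prob_ncv_above(threshold, bits=DIGEST_BITS):
    """P(NCV > threshold) when the number of equal bits is Binomial(bits, 1/2)."""
    # NCV = equal_bits - bits/2, so NCV > threshold
    # means equal_bits >= threshold + bits/2 + 1.
    min_equal = max(threshold + bits // 2 + 1, 0)
    favourable = sum(comb(bits, k) for k in range(min_equal, bits + 1))
    return favourable / 2 ** bits


p = prob_ncv_above(THRESHOLD)
pairs = 2500 * 2500  # n1 * n2 from the slide (approximate sizes)
print("per-pair misdetection probability:", p)
print("expected matches over all pairs:", pairs * p)
```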
Extending OD-paper experiments: spam bulk detection
• Bulk detection experiment, identical to the one in the OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics): Spam Corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> evaluate similarity against Threshold=54 -> NCV value (integer), matching indicator (0/1)
• The OD-paper result is well reproduced (blue dotted line)
• The OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold
Understanding better what happens
“Compare X to Database” (generic experiment):
• X: EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1) -> select at random -> compute digest (010110…10)
• Database: DB of spam and ham digests (represents “previous digest queries”), built from Spam Corpus (Obfuscation 2) and Ham Corpus 2/2 (sizes n1, n2 in the figure)
• Compare the digest of X to each digest from the DB, Threshold=54 -> NCV values (integer), matching indicators (0/1)
We look at more metrics:
 - probability of email-to-email matching
 - Max(NCV) average
 - NCV histogram
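The three metrics of the generic experiment can be computed as in the sketch below. It works on digests that are already computed and stored as 256-bit integers, so it does not depend on a particular hashing function. The function and key names are hypothetical, the random digests in the demo are placeholders rather than data, and a match is assumed to mean NCV strictly above the threshold.

```python
import random
from collections import Counter

BITS = 256
THRESHOLD = 54


def ncv(a, b):
    """Equal bits minus 128, for digests stored as 256-bit integers."""
    return BITS // 2 - bin(a ^ b).count("1")


def compare_to_db(queries, db, threshold=THRESHOLD):
    """Compare every query digest to every DB digest and collect the metrics."""
    histogram = Counter()   # NCV value -> number of (query, DB entry) pairs
    matches = 0
    max_sum = 0
    for q in queries:
        values = [ncv(q, d) for d in db]
        histogram.update(values)
        matches += sum(v > threshold for v in values)
        max_sum += max(values)
    return {
        "match_probability": matches / (len(queries) * len(db)),
        "mean_max_ncv": max_sum / len(queries),
        "histogram": histogram,
    }


# Placeholder digests: uniformly random 256-bit values, not real email digests.
rng = random.Random(0)
db = [rng.getrandbits(BITS) for _ in range(100)]
queries = [rng.getrandbits(BITS) for _ in range(20)]
result = compare_to_db(queries, db)
print(result["match_probability"], result["mean_max_ncv"])
```

Keeping the full histogram next to the two summary numbers matters for the argument of the talk: the slides that follow use the histograms to explain results that the mean Max(NCV) value does not reveal.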
SPAM – DB experiment results
• Mean Max(NCV) value is not informative
• Effect of obfuscation changes gracefully
• Spammer may gain by additional obfuscation
SPAM – DB, NCV histograms: effect of obfuscation
• Small obfuscation: digests are still useful for bulk detection
SPAM – DB, NCV histograms: effect of obfuscation
• Stronger obfuscation: most of the digests are rendered useless!
HAM – DB experiment results
• Mean Max(NCV) value is not informative
• Misdetection probability still too high for practical use
HAM – DB, NCV histograms: effect of obfuscation
• Spam obfuscation does not impact misdetection of good emails
• The histograms are shifted and wide => this explains the high false positives
Alternative digests
• Sampling strings: fixed length, random positions; one digest per sampled string (011010…11, 101110…11, 001010…10)
• Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases)
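A minimal sketch of the alternative digests, under stated assumptions: the number of samples (8) and the sample length (60 characters) are hypothetical parameters, since the slide gives no values, and `toy_digest` is a stand-in similarity hash rather than Nilsimsa, so its NCV values are not comparable with the threshold used elsewhere in the talk.

```python
import random
import zlib

BITS = 256


def toy_digest(text):
    """Stand-in digest (NOT Nilsimsa): hashed character trigrams as a 256-bit int."""
    acc = [0] * BITS
    for i in range(len(text) - 2):
        acc[zlib.crc32(text[i:i + 3].encode("utf-8")) % BITS] += 1
    mean = sum(acc) / BITS
    return sum(1 << k for k, a in enumerate(acc) if a > mean)


def ncv(a, b):
    """Equal bits minus 128, for digests stored as 256-bit integers."""
    return BITS // 2 - bin(a ^ b).count("1")


def sample_digests(text, rng, n_samples=8, sample_len=60):
    """Digest fixed-length strings taken at random positions of the email."""
    if len(text) <= sample_len:
        return [toy_digest(text)]  # email too short to sample: use it whole
    starts = [rng.randrange(len(text) - sample_len + 1) for _ in range(n_samples)]
    return [toy_digest(text[s:s + sample_len]) for s in starts]


def email_similarity(digests_a, digests_b):
    """Email-to-email score: the max NCV over all pairs of digests."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)


rng = random.Random(0)
spam1 = "Cheapest vacations this summer, book now and save big. " + "x9 q " * 30
spam2 = "zz k1 " * 30 + "Cheapest vacations this summer, book now and save big."
print(email_similarity(sample_digests(spam1, rng), sample_digests(spam2, rng)))
```

Taking the maximum over pairs means that one well-preserved shared passage can produce a high score even when the rest of each email is random filler, which is the intuition stated on the slide.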
SPAM – DB experiment results (alt. digests)
• Spam bulk detection is no longer vulnerable to obfuscation...
SPAM – DB (alt. digests): effect of obfuscation
• … and we can see why.
HAM – DB experiment results (alt. digests)
• Misdetection probability still too high
HAM – DB (alt. digests): effect of obfuscation
• What can be done to decrease ham misdetection?
Alternative digests open new possibilities
Negative selection:
• New email -> digest(s)
• Compare the digests to a database of good digests; keep only the digests that do not match
• Compare the remaining digests to the collaborative database of digests (DB). This part is the same as without negative selection
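The negative selection step can be sketched as a filter placed in front of the usual database lookup. The sketch assumes digests stored as 256-bit integers, a match defined as NCV strictly above the threshold of 54, and the same threshold for both databases; the function names are hypothetical.

```python
BITS = 256
THRESHOLD = 54


def ncv(a, b):
    """Equal bits minus 128, for digests stored as 256-bit integers."""
    return BITS // 2 - bin(a ^ b).count("1")


def matches_any(digest, database, threshold=THRESHOLD):
    """True if the digest is similar to at least one digest in the database."""
    return any(ncv(digest, d) > threshold for d in database)


def negative_selection(email_digests, good_digests):
    """Keep only the digests that do NOT match the database of good digests."""
    return [d for d in email_digests if not matches_any(d, good_digests)]


def is_bulk(email_digests, good_digests, collaborative_db):
    """Query the collaborative DB only with digests that survive the filter."""
    survivors = negative_selection(email_digests, good_digests)
    return any(matches_any(d, collaborative_db) for d in survivors)


# Constructed example: one digest looks like known good content, one does not.
good_like = 0                 # identical to an entry in the good database
spam_like = (1 << BITS) - 1   # all bits differ from that entry (NCV = -128)
print(is_bulk([good_like, spam_like], [good_like], [spam_like]))  # True
print(is_bulk([good_like], [good_like], [good_like]))             # False
```

In the second call the only digest is removed by negative selection, so it is never sent to the collaborative database even though that database contains a matching entry.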
Conclusions
• Use of proper metrics is crucial for proper conclusions from experiments.
• Alternative digests provide much better results, and by use of NCV histograms we understand why.
• Use of proper metrics is crucial for understanding what happens… and for understanding how to fix the problems.