Machine Learning for Personal Information Management


Description
ML for email. Starting point: Ishmail, an emacs RMAIL extension written by Charles Isbell in summer '95 (largely for Ron Brachman). Could manually write mailbox definitions and filtering rules in Lisp. [Cohen, AAAI Spring Symposium on ML and IR 1996]. Foldering tasks. Rule-learning method [Cohen, ICML95].
Transcripts
Slide 1

Machine Learning for Personal Information Management William W. Cohen Machine Learning Department and Language Technologies Institute School of Computer Science Carnegie Mellon University and Vitor Carvalho, Einat Minkov, Tom Mitchell, Andrew Ng (Stanford) and Ramnath Balasubramanyan

Slide 2

ML for email [Cohen, AAAI Spring Symposium on ML and IR 1996] Starting point: Ishmail, an emacs RMAIL extension written by Charles Isbell in summer '95 (largely for Ron Brachman). Could manually write mailbox definitions and filtering rules in Lisp.

Slide 3

Foldering tasks. Rule-learning method [Cohen, ICML95] [Rocchio, 71]

Slide 4

Machine Learning in Email. Why study learning for email? Email has more visible impact than anything else you do with computers. Email is hard to manage: People get overwhelmed. People lose important information in email archives. People make horrible mistakes.

Slide 5

Machine Learning in Email. Why study learning for email? For which tasks can learning help? Foldering (search, don't sort!); Spam filtering (important and well-studied); Search: beyond keyword search; Recognizing errors ("Oops, did I just hit reply-to-all?"); Help for tracking tasks ("Failing")

Slide 6

Learning to Search Email [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007] [Figure: example email graph with nodes CALO, William, graph, proposal, CMU, 6/17/07, 6/18/07, einat@cs.cmu.edu and edge types Term In Subject, Sent To]

Slide 7

Q: "what are Jason's email aliases?" Basic idea: learning to search email is learning to query a graph for information. [Figure: graph walk from the term "Jason" and einat@cs.cmu.edu through Msg 2, Msg 5 and Msg 18 to the person Jason Ernst and the addresses jernst@cs.cmu.edu and jernst@andrew.cmu.edu; edge types include Has term (inv.), Sent To, Sent to Email, Sent from Email, EmailAddressOf, Similar to]

Slide 8

How do you pose queries to a graph? An extended similarity measure via graph walks:

Slide 9

How do you pose queries to a graph? An extended similarity measure via graph walks: Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths.

Slide 10

How do you pose queries to a graph? An extended similarity measure via graph walks: Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths. Fixed probability of halting the walk at each step, i.e., shorter connecting paths have greater importance (exponential decay).

Slide 11

How do you pose queries to a graph? An extended similarity measure via graph walks: Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths. Fixed probability of halting the walk at each step, i.e., shorter connecting paths have greater importance (exponential decay). In practice we can approximate with a short finite graph walk, implemented with sparse matrix multiplication.

Slide 12

How do you pose queries to a graph? An extended similarity measure via graph walks: Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths. Fixed probability of halting the walk at each step, i.e., shorter connecting paths have greater importance (exponential decay). In practice we can approximate with a short finite graph walk, implemented with sparse matrix multiplication. The result is a list of nodes, sorted by "similarity" to an input node distribution (final node probabilities).
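
The finite graph walk described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `graph_walk`, the dictionary-of-dictionaries sparse representation, the default `stop_prob` and `max_steps` values, and the toy node names are all assumptions made for the example.

```python
from collections import defaultdict

def graph_walk(transitions, start, stop_prob=0.5, max_steps=3):
    """Short finite graph walk with a fixed halting probability per step.

    transitions: dict node -> dict neighbor -> probability (each row sums to 1).
    start: dict node -> probability (the query distribution).
    Returns (node, score) pairs sorted by descending score.
    """
    current = dict(start)
    scores = defaultdict(float)
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for node, mass in current.items():
            # A fixed fraction of the mass halts here, so longer paths
            # contribute exponentially less.
            scores[node] += stop_prob * mass
            moving = (1.0 - stop_prob) * mass
            out = transitions.get(node, {})
            if not out:
                nxt[node] += moving  # assumption: mass at a dead end stays put
                continue
            # One step of a sparse vector-matrix product.
            for neighbor, prob in out.items():
                nxt[neighbor] += moving * prob
        current = nxt
    # The walk is truncated: whatever mass is still moving stops where it is.
    for node, mass in current.items():
        scores[node] += mass
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical toy graph loosely modeled on the "Jason" example.
toy = {
    "term:jason": {"msg2": 0.5, "msg5": 0.5},
    "msg2": {"jernst@cs": 1.0},
    "msg5": {"jernst@andrew": 0.5, "einat@cs": 0.5},
}
ranked = graph_walk(toy, {"term:jason": 1.0}, stop_prob=0.5, max_steps=2)
```

Because the probability mass is conserved, the returned scores form a distribution over nodes; filtering the list by node type would give the typed answer list the slides describe.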

Slide 13

A query language: Q: { , } Email, contacts etc.: a graph. Graph nodes are typed, edges are directed and typed. Multiple edges may connect two given nodes. Each edge type is assigned a fixed weight, which determines the probability of being followed in a walk: e.g., uniform. Returns a list of nodes (of type ) ranked by the graph walk probs. = query "terms". Random walk with restart, graph kernels, heat diffusion kernels, diffusion processes, Laplacian regularization, graph databases (BANKS, DbExplorer, …), graph mincut, associative Markov networks, …
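
One way to turn per-edge-type weights into walk probabilities is sketched below. It assumes the simplest reading of the slide, where each outgoing edge is followed with probability proportional to the weight of its type; the slides do not spell out the exact normalization, and the function name, edge-type names and weights here are hypothetical.

```python
from collections import defaultdict

def transition_probs(edges, type_weight):
    """Build walk probabilities from typed, directed edges.

    edges: iterable of (source, edge_type, target) triples.
    type_weight: dict edge_type -> fixed non-negative weight.
    Returns dict source -> dict target -> probability.
    """
    raw = defaultdict(lambda: defaultdict(float))
    for src, etype, dst in edges:
        # Multiple edges between the same two nodes add up.
        raw[src][dst] += type_weight[etype]
    probs = {}
    for src, outs in raw.items():
        total = sum(outs.values())
        if total > 0:
            probs[src] = {dst: w / total for dst, w in outs.items()}
    return probs

# Hypothetical message node with one sent-to edge and two has-term edges.
edges = [
    ("msg1", "sent-to", "einat@cs.cmu.edu"),
    ("msg1", "has-term", "term:graph"),
    ("msg1", "has-term", "term:proposal"),
]
probs = transition_probs(edges, {"sent-to": 2.0, "has-term": 1.0})
```

With uniform weights every outgoing edge is equally likely; raising the weight of one edge type shifts the walk toward that kind of evidence.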

Slide 14

Tasks that are like similarity queries. Person name disambiguation: [ term "andy" file msgId ] "person". Threading: What are the adjacent messages in this thread? A proxy for finding "more messages like this one": [ file msgId ] "file". Alias finding: What are the email-addresses of Jason?... [ term Jason ] "email-address". Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ] "email-address"

Slide 15

Learning to search better. Task T (query class). Standard set of features used for x on each problem: Edge n-grams in all paths from V q to x; Number of reachable source nodes; Features of top-ranking paths (e.g. edge bigrams); … [Figure: Query a, Query b, …, Query q, each paired with its relevant answers and passed through GRAPH WALK to produce a ranked node list (node rank 1, node rank 2, …, node rank 50)]

Slide 16

Learning. Node re-ordering (train task): Graph walk → Feature generation → Learn re-ranker → Re-ranking function

Slide 17

Learning approach. Node re-ordering. Train task: Graph walk → Feature generation → Learn re-ranker → Re-ranking function. Test task: Graph walk → Feature generation → Score by re-ranking function. Boosting; Voted Perceptron; RankSVM; PerceptronCommittees; … [Joachims KDD 2002, Elsas et al WSDM 2008] [Collins & Koo, CL 2005; Collins, ACL 2002]
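
The train/test flow above can be illustrated with a very small re-ranker. The sketch uses a plain pairwise perceptron, which is simpler than the voted perceptron, boosting, or RankSVM learners named on the slide; the two-element feature vectors (a walk score and one path-based indicator) and all function names are assumptions for illustration only.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_perceptron(examples, n_features, epochs=10):
    """examples: list of (candidates, relevant_index), where candidates is a
    list of feature vectors for the nodes returned by the graph walk."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for candidates, relevant in examples:
            good = candidates[relevant]
            for i, bad in enumerate(candidates):
                if i == relevant:
                    continue
                # Mistake: a non-relevant node scores at least as high as the
                # relevant one, so move the weights toward (good - bad).
                if dot(w, good) <= dot(w, bad):
                    w = [wi + g - b for wi, g, b in zip(w, good, bad)]
    return w

def rerank(w, candidates):
    """Return candidate indices ordered by descending re-ranking score."""
    return sorted(range(len(candidates)), key=lambda i: -dot(w, candidates[i]))

# Hypothetical features per node: [graph-walk score, path-feature indicator].
train = [
    ([[0.9, 0.0], [0.5, 1.0]], 1),
    ([[0.8, 0.0], [0.3, 1.0], [0.6, 0.0]], 1),
]
w = train_pairwise_perceptron(train, n_features=2)
order = rerank(w, [[0.7, 0.0], [0.2, 1.0]])
```

In this toy data the graph walk alone ranks the relevant node second, and the learned weights promote it by relying on the path feature.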

Slide 18

Tasks that are like similarity queries. Person name disambiguation: [ term "andy" file msgId ] "person". Threading: What are the adjacent messages in this thread? A proxy for finding "more messages like this one": [ file msgId ] "file". Alias finding: What are the email-addresses of Jason?... [ term Jason ] "email-address". Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ] "email-address"

Slide 19

Corpora and datasets. Corpora. PERSON NAME DISAMBIGUATION. Person names. Nicknames: Dave for David, Kai for Keiko, Jenny for Qing. Common names are ambiguous.

Slide 20

CSpace Email: collected at CMU. 15,000+ emails from a semester-long management course. Students formed groups that acted as "companies" and worked together. Many groups with some known social connections (e.g., "president").

Slide 21

Results. Mgmt. game. PERSON NAME DISAMBIGUATION

Slide 22

Results. Mgmt. game. PERSON NAME DISAMBIGUATION

Slide 23

Results. Mgmt. game. PERSON NAME DISAMBIGUATION

Slide 24

Results. Mgmt. game. PERSON NAME DISAMBIGUATION

Slide 25

Results On All Three Problems. PERSON NAME DISAMBIGUATION. Mgmt. Game; Enron: Sager-E; Enron: Shapiro-R

Slide 26

Tasks. Person name disambiguation: [ term "andy" file msgId ] "person". Threading: What are the adjacent messages in this thread? A proxy for finding "more messages like this one": [ file msgId ] "file". Alias finding: What are the email-addresses of Jason?... [ term Jason ] "email-address". Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ] "email-address"

Slide 27

Threading: Results (MAP). Mgmt. Game: 73.8, 71.5, 60.3, 58.4, 50.2, 36.2. Enron: Farmer: 79.8, 65.7, 65.1, 36.1. Feature sets compared: Header & Body, Subject, Reply lines; Header & Body, Subject; Header & Body.

Slide 28

Learning approaches. Edge weight tuning: Graph walk → Weight update → Theta*

Slide 29

Learning approaches. Node re-ordering: Graph walk → Feature generation → Learn re-ranker → Re-ranking function; then Graph walk → Feature generation → Score by re-ranking function. Boosting; Voted Perceptron. Edge weight tuning: [Diligenti et al, IJCAI 2005; Toutanova & Ng, ICML 2005; … ] Graph walk → Weight update → Theta*; then Graph walk on the task. Question: which is better?

Slide 30

Results (MAP) for name disambiguation, threading, and alias finding. Reranking and edge-weight tuning are complementary. Best result is usually to tune weights, and then rerank. Reranking overfits on small datasets (meetings).
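
For readers unfamiliar with the metric, mean average precision (MAP) over a set of queries can be computed as in the sketch below. This is the standard definition, not code from the talk; the function names and the toy rankings are illustrative.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant node appears."""
    hits = 0
    total = 0.0
    for rank, node in enumerate(ranked, start=1):
        if node in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, set_of_relevant_nodes), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries: relevant nodes at ranks 1 and 3, then at rank 2.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```

The first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean over both is 2/3.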

Slide 31

Machine Learning in Email. Why study learning for email? For which tasks can learning help? Foldering; Spam filtering; Search beyond keyword search; Recognizing errors ("Oops, did I just hit reply-to-all?"); Help for tracking tasks ("Failing")

Slide 33

http://www.sophos.com/

Slide 34

Slide 35

Preventing errors in Email [SDM 2007]. Idea. Email Leak: email accidentally sent to the wrong person. Goal: to detect emails accidentally sent to the wrong person. Generate artificial leaks: email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc. Method: Look for outliers.

Slide 36

Preventing Email Leaks. Method: Create simulated/artificial email recipients. Build model for (msg.recipients): train classifier on real data to detect synthetically created outliers (added to the true recipient list). Features: textual (subject, body), network features (frequencies, co-occurrences, etc.). Rank potential outliers; detect the outlier and warn the user based on confidence. [Figure: recipients ranked by P(rec t), from most likely outlier (Rec 6, Rec 2, …, Rec K, Rec 5) to least likely outlier] P(rec t) = probability recipient t is an outlier given "message text and other recipients in the me
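
The outlier-ranking step can be illustrated with one of the network features mentioned on the slide. The sketch below is a deliberate simplification: instead of the trained classifier described in the talk, it ranks recipients by a plain co-occurrence count with the other recipients in previously sent mail. The function names, the sent-mail history and the simulated leak ("dan") are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sent_messages):
    """Count how often each pair of addresses appeared together as recipients."""
    counts = Counter()
    for recipients in sent_messages:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts

def rank_outliers(recipients, counts):
    """Order recipients from most to least likely outlier: the less often a
    recipient co-occurred with the others, the more suspicious it is."""
    def support(r):
        return sum(counts[tuple(sorted((r, o)))] for o in recipients if o != r)
    return sorted(recipients, key=support)

# Hypothetical sent-mail history (recipient lists only).
history = [
    ["ann", "bob"],
    ["ann", "bob", "cat"],
    ["bob", "cat"],
    ["dan", "eve"],
]
counts = cooccurrence_counts(history)

# Simulated leak: "dan" is added to a message meant for ann, bob and cat.
suspects = rank_outliers(["ann", "bob", "cat", "dan"], counts)
```

A fuller version would combine such network scores with textual features of the subject and body and pass them to a classifier whose confidence decides whether to warn the user.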
