**Ravindra Jaju**
An Introduction to Text Mining

**Outline of the presentation**
• Initiation/Introduction ...
• What makes text stand apart from other kinds of data?
• Classification
• Clustering
• Mining on “The Web”

**Data Mining**
• What: looking for information in (usually large) amounts of data
• Mainly two kinds of activities: descriptive and predictive
• Example of a descriptive activity: clustering
• Example of a predictive activity: classification

**What kind of data is this?**
• <1, 1, 0, 0, 1, 0>
• <0, 0, 1, 1, 0, 1>
• It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.
• Or it could be two documents: “Java programming language” and “India beat Pakistan”

**And what kind of data is this?**
• <550000, 155>
• <750000, 115>
• <120000, 165>
• Data about people: <income, IQ> pairs!

**Data representation**
• Humans understand data in various forms
  • Text
  • Sales figures
  • Images
• Computers understand only numbers

**Working with data**
• Most mining algorithms work only with numeric data
• Hence all data are represented as numbers, so that the algorithms can be applied to them
• Whether it is sales figures, crime rates, text, or images, one has to find a suitable way to transform the data into numbers.

**Text mining – Working with numbers**
• “Java Programming Language”, “India beat Pakistan”
• OR
• <1, 1, 0, 0, 1, 0>, <0, 0, 1, 1, 0, 1>
• The transformation to 1's and 0's hides every relationship between Java and Language, and between India and Pakistan, that humans can make out (How?)

**Text mining – Working with numbers (contd.)**
• As we have seen, data transformation (here, from a text/word to an index number) means that some information is lost
• One big challenge in this field today is to find a good data representation for input to the mining algorithms

**Text Representation Issues**
• Each word has a dictionary meaning, or meanings
  • Run: (1) the verb; (2) the noun, in cricket
  • Cricket: (1) the game; (2) the insect
• Each word is used in various “senses”
  • Tendulkar made 100 runs
  • Because of an injury, Tendulkar cannot run and will need a runner between the wickets
• Capturing the “meaning” of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!
• Automatically finding out who the “he” is in “He is the President”, given a document, is hard. And “president of?” Well ...

**Text Representation Issues (contd.)**
• In general, it is hard to capture these features from a text document
  • One, it is difficult to extract them automatically
  • Two, even if we did it, it won't scale!
• One simplification is to represent documents as a vector of words
  • We have already seen examples
• Each document is represented as a vector, and each component of the vector represents some “quantity” related to a single word.

**The Document Vector**
• “Java Programming Language”: <1, 1, 0, 0, 1, 0, 0> (document A)
• “India beat Pakistan”: <0, 0, 1, 1, 0, 1, 0> (document B)
• “India beat Australia”: <0, 0, 1, 1, 0, 0, 1> (document C)
• What vector operation can you think of to find two similar documents?
• How about the dot product?
  • As we can easily verify, documents B and C have a higher dot product than any other combination (B · C = 2, while A · B = A · C = 0)

**More on document similarity**
• The dot product or cosine between two vectors is a measure of similarity.
• Documents about related topics should have higher similarity
• [Figure: vector space with axes Language, Java and Indonesia, origin at (0, 0, 0)]

**Document Similarity (contd.)**
• How about distance measures?
• The cosine similarity measure will not capture the inter-cluster distances!

**Further refinements to the DV representation**
• Not all words are equally important
  • the, is, and, to, he, she, it (Why?)
• Of course, these words could be important in certain contexts
• We have the option of scaling the components of these words, or completely removing them from the corpus
• In general, we prefer to remove the stopwords and scale the remaining words
  • Important words should be scaled upwards, and vice versa
• One widely used scaling factor: TF-IDF
  • TF-IDF stands for the product of Term Frequency and Inverse Document Frequency for a word.

**Text Mining – Moving Further**
• Document/Term Clustering
  • Given a large set, group similar entities
• Text Classification
  • Given a document, find which topic it talks about
• Information Retrieval
  • Search engines
• Information Extraction
  • Question Answering

**Clustering (Descriptive Activity)**
• Activity: group together similar documents
• Techniques used
  • Partitioning
  • Hierarchical
    • Agglomerative
    • Divisive
  • Grid based
  • Model based

**Clustering (contd.)**
• Partitioning
  • Divide the input data into k partitions
  • K-means, K-medoids
• Hierarchical clustering
  • Agglomerative
    • Each data point is assumed to be a cluster representative
    • Keep merging similar clusters till we get a single cluster
  • Divisive
    • The opposite of agglomerative

**“Frequent term-based text clustering”**
• Idea
  • Frequent terms carry more information about the “cluster” they might belong to
  • Highly correlated frequent terms probably belong to the same cluster
• $D = \{D_1, \ldots, D_n\}$ is the set of documents
• $D_j \subseteq T$, the set of all terms
• Then candidate clusters are generated from $F = \{F_1, \ldots, F_k\}$, where each $F_i$ is a set of all frequent terms which occur together.

**Classification**
• The problem statement
  • Given a set of documents, each with a label called the class label for that document
  • Given a classifier which learns from the above data set
  • For a new, unseen document, the classifier should be able to “predict”, with a high degree of accuracy, the correct class to which the new document belongs

**Decision Tree Classifier**
• A tree
  • Each node represents some kind of an “evaluation” for an attribute of the data
  • Each edge, the decision taken
• The evaluation at each node is some kind of an information gain measure
  • Reduction in entropy means more information gained
  • Entropy $E(x) = -\sum_i p_i \log_2 p_i$
  • $p_i$ represents the probability that the data corresponds to sample $i$
• Each edge represents a choice for the value of the attribute the node represents
• Good for text mining. But doesn’t scale

**Statistical (Bayesian) Classification**
• For document-class data, we calculate the probabilities of occurrence of events
• Bayes’ Theorem
  • $P(c \mid d) = P(c) \cdot P(d \mid c) / P(d)$
• Given a document $d$, the probability that it belongs to a class $c$ is given by the above formula.
• In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples

**Naïve Bayes Classification**
• Probability of the document event $d$
  • $P(d) = P(w_1, \ldots, w_n)$, where the $w_i$ are the words
• The RHS is generally a headache.
  • We have to consider the inter-dependence of each of the $w_j$ events
• Naïve Bayes: assume all the $w_j$ events are independent. The RHS expands to $\prod_j P(w_j)$
• Most of the Bayesian text classifiers work with this simplification

**Bayesian Belief Networks**
• This is an intermediate approach
• Not all words are independent
  • “If java and program occur together, then boost the probability value of class computer programming”
  • “If java and indonesia occur together, then the document is more likely about some-other-class”
• Problem?
• How do we come up with correlations like the above?

**Other classification techniques**
• Support Vector Machines
  • Find the best discriminant plane between two classes
• k Nearest Neighbour
• Association Rule Mining
• Neural Networks
• Case-based reasoning

**An example – “Text Classification from labeled and unlabeled documents with Expectation Maximization”**
• Problem setting
  • Labeling documents is a manual process
  • A lot more unlabeled documents are available as compared to labeled documents
  • Unlabeled documents contain information which could help in the classification activity

**An example (contd.)**
• Train a classifier with the labeled documents
  • Say, a Naïve Bayes classifier
  • This classifier estimates the model parameters (the prior probabilities of the various events)
• Now, classify the unlabeled documents.
• Assuming the applied labels to be correct, re-estimate the model parameters
• Repeat the above step till convergence

**Expectation Maximization**
• A useful technique for estimating hidden parameters
  • In the previous example, the class labels were missing from some documents
• Consists of two steps, where $\theta$ denotes the model parameters, $z$ the missing labels and $D$ the documents
  • E-step: set $z^{(k+1)} = E[z \mid D; \theta^{(k)}]$
  • M-step: set $\theta^{(k+1)} = \arg\max_{\theta} P(\theta \mid D; z^{(k+1)})$
• The above steps are repeated till convergence, and convergence does occur

**Another example – “Fast and accurate Text Classification via Multiple Linear Discriminant Projections”**

**Contd.**
• Idea
  • Find a direction which maximizes the separation between classes.
  • Why?
    • Reduce “noise”, or rather
    • Enhance the differences between classes
  • The vector corresponding to this direction is Fisher’s discriminant
  • Project the data-points onto this direction
  • For all data-points not separated by this vector, choose another direction

**Contd.**
• Repeat till all data are separable
• Note, we are looking at a 2-class case.
• This easily extends to multiple classes
• Project all the document vectors into the space that has the discriminant vectors as its basis vectors
• Now, induce a decision tree on this projected representation
  • The number of attributes is highly reduced
  • Since this representation nicely “separates” the data points (documents), accuracy increases

**Web Text Mining**
• The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges
• Apart from the text itself, this graph structure carries a lot of information about the “usefulness” of the “nodes”
• For example
  • 10 random, average people on the streets say Mr. T. Ache is a good dentist
  • 5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist
  • Who would you choose?

**Kleinberg’s HITS**
• HITS: Hypertext Induced Topic Selection
• Nodes on the web can be categorized into two types: hubs and authorities
  • Authorities are nodes which one refers to for definitive information about a topic
  • Hubs point to authorities
• HITS computes the hub and authority scores on a sub-universe of the web
  • How does one collect this ‘sub-universe’?

**HITS (contd.)**
• The basic steps
  • $A_u = \sum_{v \to u} H_v$, summing over all $v$ pointing to $u$
  • $H_u = \sum_{u \to v} A_v$, summing over all $v$ pointed to by $u$
• Repeat the above till convergence
• Nodes with high A scores are “relevant”
  • Relevant to what?
  • Can we use this for efficient retrieval for a query?

**PageRank**
• Similar to HITS, but all pages have only one score: a Rank
• $R(u) = c \sum_{v \to u} R(v) / N_v$
  • $v$ ranges over the pages linking to $u$, and $N_v$ is the number of links in $v$. $c$ is a scaling factor (< 1)
• The higher the rank of the pages linking to a page, the higher is its own rank!
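The basic rank formula above can be tried out on a toy graph. The following is a minimal sketch, not the algorithm as published: the function name `simple_pagerank`, the `links` dictionary and the three-page graph are hypothetical, and rescaling the ranks to sum to 1 after each round is an assumption that stands in for the scaling factor c.

```python
def simple_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum(R(v) / N_v) over all pages v linking to u.

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}  # start from uniform ranks
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            if not targets:
                continue  # a page without outgoing links passes on no rank
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        # Assumption: rescaling to a total of 1 plays the role of c.
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_pagerank(example_links)
print({page: round(score, 3) for page, score in ranks.items()})
```

On this graph the ranks settle near 0.4 for A and C and 0.2 for B: C is linked from both other pages, and A inherits all of C's rank, so both end up above B.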
• To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as
  • $R'(u) = c \sum_{v \to u} R'(v) / N_v + c\,E(u)$
  • E(u) is a set of some pages, and acts as a rank source (what kind of pages?)

**Some more topics which we haven’t touched**
• Using external dictionaries
  • WordNet
• Using language specific techniques
  • Computational linguistics
  • Use grammar for judging the “sense” of a query in the “information retrieval” scenario
• Other interesting techniques
  • Latent Semantic Indexing
  • Finding the latent information in documents using Linear Algebra Techniques

**Some more comments**
• Some “purists” do not consider most of the current activities in the text mining field as real text mining
• For example, see Marti Hearst’s write-up, “Untangling Text Data Mining”

**Some more comments (contd.)**
• One example that she mentions
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet aggregability
  • magnesium can suppress platelet aggregability
• The above was inferred from a set of documents, with some human help

**References**
• Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
• Principles of Data Mining, by David J. Hand et al.
• Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam et al.
• Fast and Accurate Text Classification via Multiple Linear Discriminant Projections, by S. Chakrabarti et al.
• Frequent Term-Based Text Clustering, by Florian Beil et al.
• The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page and Sergey Brin
• Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
• And others …