Pre-Processing Techniques for Document Retrieval

This article discusses various pre-processing techniques for document retrieval, including dictionary-based linguistic processing, removal of stop words, stemming, lemmatization, and normalization. The article highlights the importance of these techniques in improving search quality and reducing information loss.

Uploaded on May 27, 2023 | 1 Views
kimber

About Pre-Processing Techniques for Document Retrieval

PowerPoint presentation about 'Pre-Processing Techniques for Document Retrieval'. This presentation describes the topic on This article discusses various pre-processing techniques for document retrieval, including dictionary-based linguistic processing, removal of stop words, stemming, lemmatization, and normalization. The article highlights the importance of these techniques in improving search quality and reducing information loss.. The key topics included in this slideshow are document retrieval, pre-processing, stop words, stemming, lemmatization, normalization,. Download this presentation absolutely free.

Presentation Transcript

1. Dictionaries and Tolerant Retrieval Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata

2. Pre-processing of a document 2 document text, word, XML, simple text sequence of characters ASCII, UTF-8 sequence of tokens ASCII, UTF-8 decoding tokenizing sequence of processed tokens ASCII, UTF-8 linguistic processing The dictionary

3. Pre-processing of a document Removal of stopwords: of, the, and, Modern search does not completely remove stopwords Such words add meaning to sentences as well as queries Stemming: words stem (root) of words Statistics, statistically, statistical statistic (same root) Loss of slight information (the form of the word also matters) But unifies differently expressed queries on the same topic Lemmatization: words morphological root saw see, not saw s Normalization: unify equivalent words as much as possible U.S.A, USA Windows, windows We will cover details of these later in this course Left for you to read the book 3

4. The dictionary User sends the query The engine Determine the query terms Determine if each query term is present in the dictionary Dictionary: Lookup Search trees or hashing 4 Query Query Dictionary Dictionary Posting lists Posting lists Posting lists Posting lists Posting lists Posting lists Posting lists Posting lists User Search engine

5. Binary search trees Binary search tree Each node has two children O(log M ) comparisons if the tree is balanced Problem Balancing the tree when terms are inserted and deleted 5 Root Root 0-9, a-k 0-9, a-k l-z l-z aaai aaai zzzz zzzz .. .. M = number of terms log M

6. B-tree B-tree Number of children for each node is between a and b for some predetermined a and b O(log a M ) comparisons Very few rebalancing required B+ tree Similar to B-tree All data (pointers to posting lists) are in leaf nodes Linear scan of data easier 6 Root Root 0-7 0-7 x-z x-z aaai aaai zzzz zzzz .. .. M = number of terms log a M ..

7. WILDCARD QUERIES 7

8. Wildcard queries Wildcard: is a character that may be substituted for any of a defined subset of all possible characters Wildcard queries: queries with wildcards Sydney/Sidney: s*dney Sankhadeep/Shankhadeep/Sankhadip: s*ankhad*p Judicial/Judiciary: judicia* Trailing wildcard queries Simplest: search trees work well Determine the node(s) which correspond to the range of terms specified by the query Retrieve the posting lists for the set W of all the terms in the entire sub-tree under those nodes 8 Trailing wildcard query Trailing wildcard query

9. Queries with a single * Leading wildcard queries: *ata Matching kolkata, bata, Use a reverse B-tree B-tree obtained by considering terms backwards Consider the leading wildcard query backwards, it becomes a trailing wildcard query Lookup as before for a trailing wildcard query on a normal B-tree Queries with a single * Queries of the form: s*dney Matching sydney, sidney, Use a B-tree and a reverse B-tree Use the B-tree to get the set W of all terms matching s* Use the reverse B-tree to get the set R of all terms matching *dney Intersect W and R 9 Root Root 0-7 0-7 x-z x-z iaaa iaaa zzzz zzzz .. .. M = number of terms log a M .. ataklok ataklok

10. General wildcard queries The permuterm index Special character $ as end of term The term sydney sydney$ Enumerate all rotations of sydney$ and have all of them in the B-tree, finally pointing to sydney Wildcard queries A single *: sy*ey Query ey$sy* to the B-tree One rotation of sydney$ will be a match General: s*dn*y Query y$s* to the B-tree Works equivalent to s*y, not all of the matches would have dn in the middle Filter the others by exhaustive checks Problems Blows up the dictionary Empirically seen to be 10 times for English 10 sydney sydney sydney$ sydney$ ydney$s ydney$s dney$sy dney$sy ney$syd ney$syd ey$sydn ey$sydn y$sydne y$sydne $sydney $sydney

11. k- gram index for wildcard queries k- gram: sequence of k characters k- gram index: < k- gram> words in which the k- gram appears, sorted lexicographically Consider all words with the beginning and ending marker $ 11 etr etr beetroot metric retrieval symmetry on$ on$ aviation son xeon $bo $bo book box boy k is predetermined

12. Wildcard queries with k -gram index User query: re*ve Send the Boolean query $re AND ve$ to the k -gram index Will return terms such as revive, remove, Proceed with those terms and retrieve from inverted index User query: red* Send the query $re AND red to the 3 - gram index Returns all results starting with re and containing red Post-processing to keep only the ones matching red* Exercise: more general wildcard query s*dn*y Can we do this using the k- gram index (assume 3-gram)? 12

13. Discussion on wildcard queries Semantics What does re*d AND fe*ri mean? (Any term matching re*d) AND (Any term matching fe*ri) Once the terms are identified, the operation on posting lists ( Union ) Intersection ( Union ) Expensive operations, particularly if there are many matching terms Expensive even without Boolean combinations Hidden functionality in search engines Otherwise users would play around even when not necessary For example, a query s* produces huge number of terms for which the union of posting lists need to be computed 13

14. Why search trees are better than hashing? Possible hash collision Prefix queries cannot be performed red and re may be hashed to entirely different range of values Almost similar terms do not hash to almost similar integers A hash function designed now may not be suitable if the data grows to a much larger size 14

15. SPELLING CORRECTIONS Did you mean? 15

16. Misspelled queries People type a lot of misspelled queries britian spears, britneys spears, brandy spears, prittany spears britney spears What to do? 1. Among the possible corrections, choose the nearest one 2. Among the possible near corrections, choose the most frequent one (probability of that being the users intention is the highest) 3. Context sensitive correction 4. The query may not be actually incorrect. Retrieve results for the original as well as possible correction of the query debapriyo majumder returns results for debapriyo majumdar and majumder both Approaches for spelling correction Edit distance k- gram overlap 16

17. Edit distance Edit distance E(A,B) = minimum number of operations required to obtain B from A Operations allowed: insertion, deletion, substitution Example: E(food, money) = 4 food mood mond moned money Computing edit distance in O(|A| . |B|) time Spelling correction Given a (possibly misspelled) query term, need to find other terms (in the dictionary) with very small edit distance Precomputing edit distance for all pairs of terms absurd Use several heuristics to limit possible pairs Only consider pairs of terms starting with same letter 17

18. Computing edit distance Observation: E(food, money) = 4 One sequence: food mood mond moned money E(food, moned) = Why? If E(food, moned) < 3, then E(food, money) < 4 Prefix property: If we remove the last step of an optimal edit sequence then the remaining steps represent an optimal edit sequence for the remaining substrings 18 3

19. Computing edit distance Fix the strings A and B. Let |A| = m , |B| = n. Define: E( i , j ) = E(A[1, , i ], B[1, , j ]) That is, edit distance between the length i prefix of A and length j prefix of B Note: E( m , n ) = E(A,B) Recursive formulation (a) E( i, 0) = i (b) E(0, j ) = j The last step: 4 possibilities Insertion: E( i, j ) = E( i, j 1) + 1 Deletion: E( i, j ) = E( i 1 , j ) + 1 Substitution: E( i, j ) = E( i 1 , j 1) + 1 No action: E( i, j ) = E( i 1 , j 1) 19

20. Computing edit distance: dynamic programming The recursion 20 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 1 M 2 O 3 N 4 E 5 Y

21. Computing edit distance: dynamic programming The recursion 21 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 2 O 2 3 N 3 4 E 4 5 Y 5 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

22. Computing edit distance: dynamic programming The recursion 22 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 O 2 3 N 3 4 E 4 5 Y 5 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

23. Computing edit distance: dynamic programming The recursion 23 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 O 2 2 3 N 3 4 E 4 5 Y 5 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

24. Computing edit distance: dynamic programming The recursion 24 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 3 4 2 O 2 2 3 N 3 3 4 E 4 4 5 Y 5 5 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

25. Computing edit distance: dynamic programming The recursion 25 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 3 4 2 O 2 2 1 3 N 3 3 4 E 4 4 5 Y 5 5 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

26. Computing edit distance: dynamic programming The recursion 26 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 3 4 2 O 2 2 1 2 3 3 N 3 3 2 4 E 4 4 3 5 Y 5 5 4 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

27. Computing edit distance: dynamic programming The recursion 27 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 3 4 2 O 2 2 1 2 3 3 N 3 3 2 2 3 4 E 4 4 3 3 3 5 Y 5 5 4 4 4 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from

28. Computing edit distance: dynamic programming The recursion 28 P is an indicator variable P = 1 if A[i] B[j], 0 otherwise 0 1 2 3 4 F O O D 0 0 1 2 3 4 1 M 1 1 2 3 4 2 O 2 2 1 2 3 3 N 3 3 2 2 3 4 E 4 4 3 3 3 5 Y 5 5 4 4 4 Backtrace: Compute E( i, j ) and also keep track of where E( i, j ) came from Backtrace to find an optimal edit path Backtrace to find an optimal edit path

29. Spelling correction using k- gram index The k- grams are small portions of words Misspelled word would still have some k- grams intact Misspelled query: bord 29 bo bo or or rd rd aboard boardroom border boring border lord morbid north aboard boardroom border hard Intersect the list of words for k- grams Problem: long words which contain the k -grams but are not good corrections

30. Phonetic correction Some users misspell because they dont know the spelling Types as it sounds Approach for correction: use a phonetic hash function Hash similarly sounding terms to the same hash value Soundex algorithm Several variants 30

31. Soundex algorithm 1. Retain the first letter of the term 2. Change all A, E, I, O, U, H, W, Y 0 B, F, P, V 1. C, G, J, K, Q, S, X, Z 2. D,T to 3. L to 4. M, N to 5. R to 6. 3. Repeat: remove one of each pair of same digit 4. Remove all 0s. Pad the result with trailing 0s. Return the first 4 positions: one letter, 3 digits Example: Hermann H065055 H06505 H655 31

32. References and acknowledgements Primarily: IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/ The part on edit distance: lectures notes by John Reif, Duke University: https://www.cs.duke.edu/courses/fall08/cp s230/Lectures/L-04.pdf 32