Applications of Finite State Transducers in Natural Language Processing


This paper explores the use of Finite State Transducers (FSTs) in Natural Language Processing (NLP) applications. It covers the operations, properties, and differences between FSAs and FSTs, including their use in computational morphology and their place in Chomsky's hierarchy. The author presents her contributions to Estonian finite-state morphology and various morphology-based NLP applications, and concludes with a discussion of the potential of FSTs in NLP.



Presentation Transcript


1. Finite-State Transducers: Applications in Natural Language Processing. Heli Uibo, Institute of Computer Science, University of Tartu, Heli.Uibo@ut.ee

2. Outline FSA and FST: operations, properties. Natural languages vs. Chomsky's hierarchy. FST-s: application areas in NLP. Finite-state computational morphology. Author's contribution: Estonian finite-state morphology. Different morphology-based applications. Conclusion

3. FSA-s and FST-s

4. Operations on FSTs: concatenation, union, iteration (Kleene star and plus), *complementation, composition, reverse, inverse, *subtraction, *intersection, containment, substitution, cross-product, projection
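Several of these operations have a direct set-theoretic reading if an FST is viewed as denoting a regular relation, i.e. a set of (input, output) string pairs. The following is a toy Python sketch over enumerated pairs, invented purely for illustration; real FST toolkits such as Xerox xfst or OpenFst operate on the automata themselves, not on enumerated pairs.

```python
# An FST denotes a regular relation: a set of (input, output) string
# pairs. For tiny finite relations, the algebraic operations can be
# illustrated directly on the pairs (toy sketch, not a real FST library).

def union(r, s):
    # a pair belongs to the union iff it belongs to r or to s
    return r | s

def concat(r, s):
    # pairwise concatenation on both tapes
    return {(x1 + x2, y1 + y2) for (x1, y1) in r for (x2, y2) in s}

def compose(r, s):
    # (x, z) is in the composition iff some middle string y links r and s
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

def invert(r):
    # swap the input and output tapes
    return {(y, x) for (x, y) in r}

r = {("a", "b")}
s = {("b", "c")}
print(compose(r, s))  # {('a', 'c')}
print(invert(r))      # {('b', 'a')}
```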

5. Algorithmic properties of FSTs: epsilon-free, deterministic, minimized

6. Natural languages vs. Chomsky's hierarchy English is not a finite state language. (Chomsky, Syntactic Structures, 1957) Chomsky's hierarchy: finite-state, context-free, context-sensitive, Turing machine

7. Natural languages vs. Chomsky's hierarchy Chomsky's claim was about syntax (sentence structure). It is proved by (theoretically unbounded) recursive processes in syntax: embedded subclauses (I saw a dog, who chased a cat, who ate a rat, who ...) and the adding of free adjuncts: S → NP (AdvP)* VP (AdvP)*

8. Natural languages vs. Chomsky's hierarchy Attempts to use more powerful formalisms. Syntax: phrase structure grammars (PSG) and unification grammars (HPSG, LFG). Morphology: context-sensitive rewrite rules (not reversible)

9. Natural languages vs. Chomsky's hierarchy Generative phonology by Chomsky & Halle (1968) used context-sensitive rewrite rules, applied in a certain order, to convert the abstract phonological representation to the surface representation (wordform) through intermediate representations. General form of rules: x → y / z _ w, where x, y, z, w are arbitrarily complex feature structures
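For plain-string contexts, a rule of the form x → y / z _ w can be simulated with regex lookaround assertions. The sketch below is a deliberately simplified toy (the rule and the words are invented, and real SPE rules operate on feature bundles, not letters):

```python
import re

def rewrite(word, x, y, z, w):
    # apply the rule  x -> y / z _ w : replace x by y only when it is
    # preceded by context z and followed by context w (plain strings)
    pattern = f"(?<={re.escape(z)}){re.escape(x)}(?={re.escape(w)})"
    return re.sub(pattern, y, word)

# toy rule t -> d / a _ a (intervocalic voicing, invented example word)
print(rewrite("matar", "t", "d", "a", "a"))  # madar
print(rewrite("tatra", "t", "d", "a", "a"))  # tatra (contexts don't match)
```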

10. Natural languages vs. Chomsky's hierarchy BUT: writing large-scale, practically usable context-sensitive grammars even for well-studied languages such as English turned out to be a very hard task. Finite-state devices have been "rediscovered" and widely used in language technology during the last two decades.

11. Natural languages vs. Chomsky's hierarchy Finite-state methods have been especially successful for describing morphology. The usability of FSA-s and FST-s in computational morphology relies on the following results: D. Johnson, 1972: phonological rewrite rules are not context-sensitive in nature, but they can be represented as FST-s. Schützenberger, 1961: if we apply two FST-s sequentially, there exists a single FST which is the composition of the two FST-s.
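Schützenberger's closure result can be illustrated on toy relations (the word forms below are invented, not real phonology): applying two transductions one after the other yields exactly what a single composed machine yields.

```python
def apply_rel(rel, word):
    # all outputs a relation assigns to `word`
    return {y for (x, y) in rel if x == word}

def compose(r, s):
    # relational composition: route through a shared middle string
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

r1 = {("kaNpa", "kanpa")}   # toy rule 1: deep N realised as n
r2 = {("kanpa", "kampa")}   # toy rule 2: n assimilates to m before p

# sequential application through the intermediate form...
sequential = {z for y in apply_rel(r1, "kaNpa") for z in apply_rel(r2, y)}
# ...equals a single composed transduction, with no intermediate form:
composed = apply_rel(compose(r1, r2), "kaNpa")
print(sequential == composed == {"kampa"})  # True
```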

12. Natural languages vs. Chomsky's hierarchy Generalization to n FST-s: we can manage without intermediate representations: the deep representation is converted to the surface representation by a single FST! In 1980 the result was rediscovered by R. Kaplan and M. Kay (Xerox PARC)

13. Natural languages vs. Chomsky's hierarchy [Diagram: deep representation → Rule 1 → Rule 2 → ... → Rule n → surface representation; equivalently, deep representation → one big rule (a single composed FST) → surface representation]

14. Applications of FSA-s and FST-s in NLP Lexicon (word list) as FSA: compression of data! Bilingual dictionary as lexical transducer. Morphological transducer (may be combined with rule transducer(s), e.g. Koskenniemi's two-level rules or Karttunen's replace rules, via composition of transducers). Each path from the initial state to a final state represents a mapping between a surface form and its lemma (lexical form).

15. Finite-state computational morphology [Diagram: a morphological analyzer/generator mediates between wordforms and morphological readings]

16. Morphological analysis by lexical transducer Morphological analysis = lookup. The paths in the lexical transducer are traversed until one finds a path where the concatenation of the lower labels of the arcs is equal to the given wordform. The output is the concatenation of the upper labels of the same path (lemma + grammatical information). If no path succeeds (the transducer rejects the wordform), then the wordform does not belong to the language described by the lexical transducer.

17. Morphological synthesis by lexical transducer Morphological synthesis = lookdown. The paths in the lexical transducer are traversed until one finds a path where the concatenation of the upper labels of the arcs is equal to the given lemma + grammatical information. The output is the concatenation of the lower labels of the same path (a wordform). If no path succeeds (the transducer rejects the given lemma + grammatical information), then either the lexicon does not contain the lemma or the grammatical information is not correct.
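The lookup/lookdown procedures described on slides 16 and 17 can be sketched by enumerating a toy transducer's paths as lists of (upper, lower) arc labels. The tuba/toa entries below are schematic inventions for illustration, not the actual Estonian lexicon:

```python
# Each path of a lexical transducer pairs an upper string (lemma + tags)
# with a lower string (surface form), arc by arc; empty labels stand
# for epsilon. Toy paths, enumerated directly for illustration only.
PATHS = [
    [("t", "t"), ("u", "o"), ("b", ""), ("a", "a"), ("+SgGen", "")],   # toa
    [("t", "t"), ("u", "u"), ("b", "b"), ("a", "a"), ("+SgNom", "")],  # tuba
]

def side(path, i):
    # concatenate one side of a path: i=0 upper labels, i=1 lower labels
    return "".join(arc[i] for arc in path)

def lookup(wordform):
    # analysis: match the lower side, emit the upper side
    return [side(p, 0) for p in PATHS if side(p, 1) == wordform]

def lookdown(analysis):
    # synthesis: match the upper side, emit the lower side
    return [side(p, 1) for p in PATHS if side(p, 0) == analysis]

print(lookup("toa"))           # ['tuba+SgGen']
print(lookdown("tuba+SgNom"))  # ['tuba']
print(lookup("tobb"))          # [] -- rejected: not in the language
```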

18. Finite-state computational morphology In morphology, one usually has to model two principally different processes: 1. Morphotactics (how wordforms are combined from morphemes): prefixation and suffixation, compounding = concatenation; reduplication, infixation, interdigitation = non-concatenative processes

19. Finite-state computational morphology 2. Phonological/orthographical alternations: assimilation (hind : hinna), insertion (jooksma : jooksev), deletion (number : numbri), gemination (tuba : tuppa). All the listed morphological phenomena can be described by regular expressions.
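As a hint of how such alternations reduce to regular substitutions, here is a toy regex for the deletion pattern (number : numbri). The rule is a deliberately oversimplified invention for this one example, not part of the actual two-level grammar:

```python
import re

def genitive_stem(nominative):
    # toy deletion rule: drop an e before a single word-final consonant,
    # then attach the genitive ending -i (oversimplified on purpose)
    return re.sub(r"e(?=[bdgklmnprstv]$)", "", nominative) + "i"

print(genitive_stem("number"))  # numbri
```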

20. Estonian finite-state morphology In Estonian, different grammatical word forms are built using stem flexion: tuba - singular nominative (room), toa - singular genitive (of the room); and suffixes (e.g. plural features and case endings): tubadest - plural elative (from the rooms)

21. Estonian finite-state morphology Productive derivation, using suffixes: kiire (quick) → kiiresti (quickly). Compounding, using concatenation: piiri + valve + väe + osa = piirivalveväeosa, border (Gen) + guarding (Gen) + force (Gen) + part = a unit of border guards

22. Estonian finite-state morphology Two-level model by K. Koskenniemi: LexiconFST .o. RuleFST. Three types of two-level rules: <=>, <=, => (formally regular expressions). E.g. the two-level rule a:b => L _ R is equivalent to the regular expression ~[ [ ~[ ?* L ] a:b ?* ] | [ ?* a:b ~[ R ?* ] ] ]. Linguists are used to rules of the type a -> b || L _ R

23. Estonian finite-state morphology Phenomena handled by lexicons: noun declension, verb conjugation, comparison of adjectives, derivation, compounding, stem end alternations (ne-se, 0-da, 0-me etc.), choice of stem end vowel (a, e, i, u). Appropriate suffixes are added to a stem according to its inflection type.

24. Estonian finite-state morphology Handled by rules: stem flexion (kägu : käo, hüpata : hüppan), phonotactics (lumi : *lumd → lund), morphophonological distribution (seis+da → seista), orthography (*kirj → kiri, kristall+ne → kristalne)

25. Estonian finite-state morphology Problem with derivation from verbs with weakening stems: every stem occurs twice at the upper side of the lexicon, a waste of space!
LEXICON Verb
  lika:liKa V2; ..
LEXICON Verb-Deriv
  liga VD0; ..
LEXICON VD0
  tud+A:tud #; tu+S:tu S1; nud+A:nud #; nu+S:nu S1;

26. Estonian finite-state morphology My own scientific contribution: a solution to the problem of weak-grade verb derivatives: the primary form, belonging to the level of morphological information, also has a lexical (or deep) representation. That is, two-levelness has been extended to the upper side of the lexical transducer (only for verbs).
LEXICON Verb
  liKa:liKa V2; .
No stem doubling for productively derived forms!

27. Estonian finite-state morphology Result: the morphological transducer for Estonian is composed as follows: ((LexiconFST)^-1 .o. RulesFST1)^-1 .o. RulesFST, where RulesFST1 ⊆ RulesFST (a subset of the whole rule set, containing grade alternation rules only). Operations used: composition, inversion

28. Estonian finite-state morphology The experimental two-level morphology for Estonian has been implemented using the Xerox finite-state tools lexc and twolc: 45 two-level rules; the root lexicons include 2000 word roots; over 200 small lexicons describe the stem end alternations, conjugation, declension, derivation and compounding.

29. Estonian finite-state morphology To-do list: avoid overgeneration of compound words (solution: compose the transducer with other transducers which constrain the generation process); guess the analysis of unknown words, i.e. words not in the lexicon (solution: use a regexp in the lexicon which stands for any root, e.g. [Alpha*])

30. Language technological applications: requirements Different approaches to building the morphological transducer may be suitable for different language technological applications. Speller: is the given wordform correct? (= accepted by the morphological transducer) Important to avoid overgeneration! Improved information retrieval: find all the documents where the given keyword occurs in an arbitrary form and sort the documents by relevance. Weighted FST-s may be useful; morphological disambiguation is also recommended; overgeneration is not such a big problem.
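For the speller use case, only the accept/reject decision of the transducer matters: a wordform is correct iff it is in the transducer's lower-side (surface) language. A schematic sketch with an enumerated toy surface lexicon standing in for that language (invented forms; a real speller would run transducer lookup instead):

```python
# Toy stand-in for the lower-side language of a morphological
# transducer: the set of all surface forms it accepts.
SURFACE_FORMS = {"tuba", "toa", "tubadest", "kiire", "kiiresti"}

def is_correct(wordform):
    # accept iff the transducer has a path whose lower side equals it
    return wordform in SURFACE_FORMS

print(is_correct("tuba"))  # True
print(is_correct("tubb"))  # False
```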

31. Full NLP with FST-s? Description of a natural language = one big transducer. [Diagram: Speech-Text FST, Morph-FST, Syntax-FST and Semantics-FST stacked by composition; analysis runs in one direction through the stack, generation in the other]