PaNoLa: Integrating Constraint Grammar research in Nordic Languages

The PaNoLa project aims to integrate Constraint Grammar research in the Nordic countries, to support internet-based grammar teaching with the VISL model, and to provide morphologically and syntactically annotated corpus data.

Presentation Transcript

1. PaNoLa: Parsing Nordic Languages (Eckhard Bick)

2. PaNoLa Goals
   (1) Integrate existing and stimulate new Constraint Grammar research in the Nordic countries
   (2) Internet-based grammar teaching, applying the VISL model to different Nordic languages
   (3) Morphologically and syntactically annotated corpus data

3. Participants
   -- University of Southern Denmark (Eckhard Bick, Anette Wulff): Danish CG as well as CGs for 6 other languages
   -- Oslo University (Janne Bondi Johannessen, Kristin Hagen): Bokmål and Nynorsk CGs
   -- Helsinki University (Fred Karlsson): Finnish and Swedish CGs
   -- Göteborg University (Torbjörn Lager): µ-TBL system (corpus-trained automatic CG)
   -- Tartu University (Heli Uibo, Kaili Müürisep): Estonian CG
   -- Tromsø University (Trond Trosterud): Sami CG
   -- The Greenlandic Language Secretariat Oqaasileriffik (Per Langgård)
   -- Iceland University of Education (Jóhanna Karlsdottir)
   -- University of the Faroe Islands (Zakaris Hansen)

4. Project framework
   -- Funding: Nordic Council of Ministers
   -- Funded project periods:
      PaNoLa: January 2002 to December 2003 (da, no, sv, fi)
      PaNoLa-addon: 2004 (is, fo, smi, kl)
      PaNoLa-plus: 2005 (-2006) (is, fo, smi, kl)
      planned: PaNoLa-neighbour: 2005/6 (-2007) (lit, lav, ru)
   -- Historical basis and ongoing cooperation [timeline diagram: PaNoLa, PaNoLa-addon, PaNoLa-plus, PaNoLa-neighbour]

5. Project framework
   -- Network aspect: 4 workshops in Denmark, Norway, Iceland and Sweden
      Odense, 19-21 May 2002
      Ustaoset, 25-27 October 2002
      Reykjavik, 1-2 June 2003
      Göteborg, 24-25 October 2003
      Odense, 23-26 October 2004
      Fefor, 11-13 March 2005
      (Tallinn, 1-3 April 2005)
      planned: Tórshavn, 16-19 September 2005
   -- Administration, web server, data integration: VISL/ISK, University of Southern Denmark
   -- Satellite projects: e.g. Arboretum, GREI, Arborest

6. Constraint Grammar
   -- Rule- and lexicon-based robust parsing (Karlsson et al. 1995), methodological paradigm
   -- Shared conceptual and notational conventions, allowing productive research transfer
   -- Language-dependent differences: lexicon, rules (inter-Scandinavian comparative payoff?)
   -- Compiler and rule type differences
   -- Focus differences: Tagging? Parsing? Semantics? Teaching? Corpus annotation? QA? NER? ...
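
The rule-based, robust disambiguation described on this slide can be illustrated with a toy sketch. It is not the cg1/cg2/visl-cg formalism: the `Cohort` class, the `remove_if_left` helper, the tag names and the English sample words are all assumptions made for illustration. It only shows the core idea that each word starts with a set of candidate readings, that contextual rules discard readings, and that the last reading of a word is never removed.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """One word form with its set of candidate readings (here: PoS tags only)."""
    wordform: str
    readings: set = field(default_factory=set)


def remove_if_left(cohorts, target, left_context):
    """Toy rule: remove `target` if the word to the left is unambiguously `left_context`.

    The last remaining reading is never removed, so every word keeps
    at least one analysis (the robustness property).
    """
    for i in range(1, len(cohorts)):
        left = cohorts[i - 1].readings
        here = cohorts[i].readings
        if left == {left_context} and target in here and len(here) > 1:
            here.discard(target)
    return cohorts


# Hypothetical lexicon output for "the walk": "walk" is noun/verb ambiguous.
sentence = [Cohort("the", {"DET"}), Cohort("walk", {"N", "V"})]
remove_if_left(sentence, target="V", left_context="DET")
print([(c.wordform, sorted(c.readings)) for c in sentence])
```

A real grammar contains thousands of such rules, with richer context conditions than a single left neighbour.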

7. Rule formalism and architecture
   -- cg1 compiler (Swe CG, Fin CG): Lingsoft-compatible; needs more rules than cg2
   -- cg2 compiler (Oslo-Bergen tagger): sets as targets; barrier conditions
   -- visl-cg compiler (DanGram, Sami, other VISL languages): cg2-like, plus a substitute operator for correcting hybrid input
   -- µ-TBL (Swedish or language-independent trained CG): automatic learning, local context, rule ordering
   -- cgx compiler (Est CG)
   -- Levels of analysis: PoS, syntax, case roles
   -- Languages: da, smi, no, est, sv, fi

8. The Lexical Base [diagram labels]
   -- TWOL; core lexicon + morphological analyser
   -- Swe CG; Fin CG; Oslo-Bergen tagger; DanGram; µ-TBL; Samic CG; Est CG
   -- Corpus dependent
   -- Valency potential (especially for verbs)
   -- Semantic sets
   -- NER
   -- Full semantic prototype lexicon

9. Theoretical Framework (Syntax) [diagram labels]
   -- Traditional CG: flat dependency; word-based form and function tags
   -- Conversion tools: cg2tree (MC) (visl-psg); dependency filter (SH); visl2penn (EB); visl2tiger (LN, EB, ..)
   -- Formats: treebank format; TIGER format; PENN format
   -- PSG grammar: Danish, Norwegian
   -- Editing tools; search interfaces
   -- Corpora and treebanks: Korpus90/2000, Oslo-Bergen Corpus, Arboretum, Redwood

10. Treebank data compatibility [conversion matrix between treebank formats]
   -- Formats: CG, CG-dep, VISL, VISL-dep, TIGER, TIGER-dep, MALT-dep, DTAG-dep
   -- Conversion tools named in the matrix: cg2dep, depsplicator, cg2visl (visl-psg + grammar), tree2cg, visldep2malt, tigerdep2malt, NTN tools

11. Accessibility
   -- Strong focus on making tools and corpora freely accessible on the internet
   -- Provide notational and complexity filters to bridge differences between different research and teaching traditions
   -- VISL's open source philosophy for reconciling academic and commercial use: free compilers and corpora, but allowing for the protection (i.e. commercializability) of grammars, lexica and end-user applications

12. Related applicative CG projects
   -- CG spell/grammar checking (No, Da): Lingsoft / Microsoft
   -- Named Entity Recognition (Da, No): Nomen Nescio (Nordic Network), 2001-2003
   -- Treebanks (Da Arboretum, Norwegian plans): Nordic Treebank Network, 2003-2004
   -- Question Answering systems (Da): Aminova
   -- Dialogue systems
   -- Teaching (e.g. VISL-GYM, VISL-HHX, GREI)

13. PaNoLa's other leg: CALL
   -- Integrating and strengthening Nordic languages in the VISL grammar teaching system
   -- A unified system of grammatical categories and structural analysis for 22 languages (Dienhart 2000 and Bick 2001)
   -- Color codes and symbolic notation
   -- Systematic focus on form & function
   -- Preexisting server and programming infrastructure
   -- School and university teaching contacts at all levels
   -- Internet-based games and exercises
   -- Graded complexity filters

14. Notational harmonization vs. linguistic differences: the Greenlandic example

QUE:par
CJT:cl
=S:pron Suumuna #'Which/What'
=fA:icl
==Od:g
===D:n naasut #'the plants''
===H:n qorsuttaat #'their green (matter)'
==P:v-pcp1 kiilorpassuakkaarlugu #doing it by the kilo
=A:g
==H:n nunamut #'the soil'
==D:n uumassuseqanngitsumut #'on the lifeless'
=P:v siaruartilertaraa #makes it spread
CJT:cl-
=fA:cl-
==S:n apullu #and the snow
CO:conj _lu
-CJT:cl
=-fA:cl
==P:v aanniariaraangat #whenever it begins to melt
=P:v siaruaatipallatsittarlugu #makes it pour forth
?

KAL22a) Suumuna naasut qorsuttaat kiilorpassuakkaarlugu nunamut uumassuseqanngitsumut siaruartilertaraa apullu aanniariaraangat siaruaatipallatsittarlugu?
(What was it that made kilo after kilo of green plant matter pour forth from the lifeless soil as soon as the weather turned warm enough and the last remnants of snow were gone?)

Word-internal analysis:

==H:n nunamut #on the soil
===R:n('nuna') nuna-
===D:in('mut',fleksiver) -mut
==D:n uumassuseqanngitsumut
===R:v('uuma') uuma-
===D:in('ssusiq') -ssuse-
===D:iv('qar') -qa-
===D:iv('ngngit') -nngit-
===D:in('Tuq') -su-
===D:in('mut',fleksiver) -mut
==P:v aanniariaraangat
===R:v('aak') aan-
===D:iv('niar') -nia-
===D:iv('riar') -riar-
===D:iv('gaangat',fleksiver) -aangat
=P:v siaruaatipallatsittarlugu
==R:v('siaruar') siarua-
==D:iv('ute') -ati-
==D:iv('pallak') -pallat-
==D:iv('tit') -sit-
==D:iv('Tar') -tar-
==D:iv('lugu',fleksiver) -lugu
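
The notation in the Greenlandic example marks tree depth with leading `=` signs, followed by a `function:form` pair, an optional word form, and an optional gloss after `#`. As a rough sketch of how such lines could be read into a data structure, the following assumes exactly that line layout. The `Node` type, the `parse_visl` function and the regular expression are illustrative names and choices, not part of any VISL tool, and the sample glosses are given in English.

```python
import re
from typing import NamedTuple, Optional


class Node(NamedTuple):
    depth: int             # number of leading '=' signs
    function: str          # e.g. "S", "fA", "P"
    form: str              # e.g. "pron", "icl", "v"
    word: Optional[str]    # terminal word form, None for non-terminals
    gloss: Optional[str]   # text after '#', if any


# depth markers, function ':' form, optional word, optional '#' gloss
LINE = re.compile(r"^(=*)([^:\s]+):(\S+)(?:\s+([^#\s]+))?(?:\s*#\s*(.*))?$")


def parse_visl(text):
    """Return one Node per well-formed line; other lines (e.g. '?') are skipped."""
    nodes = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if m is None:
            continue
        depth, function, form, word, gloss = m.groups()
        nodes.append(Node(len(depth), function, form, word, gloss))
    return nodes


sample = """CJT:cl
=S:pron Suumuna #'Which/What'
=A:g
==H:n nunamut #'the soil'
?"""

for node in parse_visl(sample):
    print(node)
```

Because depth is stored explicitly, the flat node list can be turned into a nested tree later by tracking the most recent node at each depth.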

15. Greenlandic word-internal tree structures

16. Teaching corpora
   -- Pedagogically structured XML markup for teaching topic and didactic progression
   -- Finnish and Swedish modelled on Danish and Norwegian example files (comparative possibilities)
   -- Compatibility with, and importability for, research treebanks (e.g. Sofie)

17. Interactive teaching trees

18. Grammar games: Labyrinth

19. Grammar Games: Word Fall

20. Integrating the CG and CALL legs
   -- Nordic CG expertise is used to provide live analyses as input for the teaching modules, if necessary by CGI communication between university servers, e.g. Oslo-SDU
   -- Descriptive harmonization issues (e.g. word class)
   -- Determine matching complexity (e.g. subclause analysis?)

21. CG leg evaluation
   CG grammars improve incrementally, so evaluation is less definite than for probabilistic systems and can change over time. Results depend on tag granularity and test genre.
   Some numbers:
   -- DanGram: F-score 98.65 for PoS, 94.9 for function (Bick 2003)
   -- DanGram NER: 5% typing errors, 2% chunking errors
   -- Bokmål CG: 97.2% lexical F-score (Hagen & Johannessen 2003)
   -- Nynorsk CG: 96.2% lexical F-score
   -- SWECG 1.0: recall 99.7% at a precision of 95% (pre-PaNoLa)
   -- µ-TBL CG for Swedish: 98.1% lexical accuracy when allowing for 1.04 tags per word (Lager 1999)
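
The figures above mix F-scores with separate recall and precision values. To put the SWECG numbers on the same scale, one can compute a balanced F-score from them. The slides do not say which F-score variant the other systems report, so the balanced (beta = 1) harmonic mean used here is an assumption, and `f_score` is an illustrative helper name.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# SWECG 1.0 figures from the slide, in percent.
swecg = f_score(precision=95.0, recall=99.7)
print(round(swecg, 1))  # 97.3
```

Since the grammars differ in tag granularity and test genre, such a figure is only a rough point of comparison.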

22. Teaching leg evaluation
   -- GREI evaluation: improvement of grammatical skills after using VISL tools (104 children, 7th and 8th grade)
   -- Same-level tests before and after using VISL/GREI, with test and control groups
   -- Subjective results: all users thought VISL was more fun (games more than trees), and that their grammatical skills had improved
   -- Objective results: the test group performed 14.5% better than the control group in 7th grade, 7% better in 8th grade, and 12% better at the secondary level. Differences were positive for both PoS and sentence analysis, but more marked for the latter

23. Teaching corpora differences across PaNoLa languages
   -- Preposition frequency: 11% (Bokmål), 11.4% (Danish), 13.4% (Nynorsk), 0.5% (Finnish)
   -- PoS: the particles in "klappe i", "tage på", "skrive noget om" are tagged as ADV in Danish, as PRP in Norwegian samples
   -- Danish infinitive markers ('at') are tagged as CONJ in Norwegian
   -- Subclass solutions: e.g. Da/Fi distinction between adjunct and argument adverbials, not made by No/Se (fA/As/Ao vs. A)
   -- Tradition interference: the Swedish analysis had zero constituents, because it was annotated according to the English VISL model

24. Outlook
   -- Continued development of Nordic Constraint Grammars and CG applications
   -- Ongoing CALL service for schools
   -- Presence of the CG paradigm in other Nordic networks
   -- Post-PaNoLa: VISL adaptations for other minor Nordic languages (Faroese, Icelandic, Samic, Estonian ...)