About the Oslo-Bergen Tagger
The Oslo-Bergen Tagger consists of the following modules:
- Preprocessor with compound analyser and multitagger for Bokmål and Nynorsk: The module finds sentence boundaries, and identifies and analyses words that are not in the lexicon, such as compounds. Each word is then assigned every grammatical tag that the Norwegian lexicon Norsk ordbank (see below) allows for it. The module is written in Python by the Text Laboratory.
- Grammar modules for morphological disambiguation of Bokmål and Nynorsk: The modules remove redundant morphological tags using constraint-based rules (Constraint Grammar). The Constraint Grammar rules are written in CG3 by the Text Laboratory, University of Oslo. The CG3 compiler is developed at the University of Southern Denmark in Odense.
- Additional statistical module for Bokmål: The module uses a HunPos tagger to resolve the ambiguity that remains after the Constraint Grammar module. This remaining ambiguity is of two types: unintended ambiguity, which results from inadequate coverage by the CG rules, and ambiguity that has deliberately been left in place. Examples of the latter are semantic ambiguity and indefinite singular nouns for which it is impossible to determine whether the noun is masculine or feminine. The HunPos tagger was trained on a fully disambiguated version of the training corpus made for the Oslo-Bergen Tagger. This work was done by the Text Laboratory, University of Oslo.
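The three stages above can be illustrated with a minimal Python sketch. It is not the tagger's actual code. The lexicon entries, tag names, scores and function names are all hypothetical. The single rule only imitates the spirit of a Constraint Grammar REMOVE rule. The final step is a simple per-tag score lookup standing in for a trained statistical tagger, not HunPos itself.

```python
# Hypothetical toy lexicon: word form -> possible tags.
# Real Norsk ordbank entries and tagger output are far richer than this.
TOY_LEXICON = {
    "en": {"det"},
    "fisker": {"noun", "verb"},
    "han": {"pron"},
    "dør": {"noun", "verb"},
}

# Hypothetical tag scores standing in for a trained statistical model.
TOY_TAG_SCORE = {"verb": 0.6, "noun": 0.4}


def multitag(tokens):
    """Stage 1: give each token every tag the lexicon allows."""
    return [(tok, set(TOY_LEXICON.get(tok.lower(), {"unknown"})))
            for tok in tokens]


def remove_verb_after_det(cohorts):
    """Stage 2: one constraint-style rule. After an unambiguous determiner,
    drop the verb reading, but never remove the last remaining reading."""
    result = []
    for i, (tok, tags) in enumerate(cohorts):
        prev_tags = cohorts[i - 1][1] if i > 0 else set()
        if prev_tags == {"det"} and "verb" in tags and len(tags) > 1:
            tags = tags - {"verb"}
        result.append((tok, tags))
    return result


def pick_remaining(cohorts):
    """Stage 3: resolve leftover ambiguity by choosing the highest-scoring tag."""
    return [(tok, max(sorted(tags), key=lambda t: TOY_TAG_SCORE.get(t, 0.0)))
            for tok, tags in cohorts]


def tag(tokens):
    """Run the three toy stages in sequence."""
    return pick_remaining(remove_verb_after_det(multitag(tokens)))
```

In this sketch, `tag(["en", "fisker"])` resolves "fisker" through the rule, while `tag(["han", "dør"])` leaves "dør" ambiguous after the rule stage, so the score-based fallback decides.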
Lexicon
The Oslo-Bergen Tagger uses the Norwegian lexicon Norsk ordbank for multitagging. Norsk ordbank is an electronic database of inflected forms and other morphological information. It is based on:
- words and information about inflection from the dictionaries Bokmålsordboka and Nynorskordboka (versions written at ILN, UiO)
- word lists and inflectional codes for Bokmål and Nynorsk from IBM Norway
- argument structure codes made by the NorKompLeks Project at NTNU