The NDC Treebank and Parser
The NDC Treebank contains 4637 speech segments and 66 042 words/tokens from the Bokmål transcriptions in the Norwegian part of the Nordic Dialect Corpus. The treebank is manually corrected and annotated with morphological and dependency-style syntactic analyses. The NDC Treebank is available in two versions:
- A searchable version in the Glossa search interface:
Search the NDC Treebank
- A downloadable version in conllx format:
Download the treebank from Github
License
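The downloadable version can be read with a few lines of Python. The following is a minimal sketch, assuming the files follow the standard CoNLL-X layout: one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between segments. The names `Token`, `read_conllx` and `SAMPLE` are hypothetical, and the sample segments, tags and labels are invented for illustration; they are not taken from the treebank.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """The first eight CoNLL-X columns (PHEAD and PDEPREL are ignored)."""
    idx: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # index of the head token, 0 for the root
    deprel: str    # dependency label


def read_conllx(text: str) -> Iterator[List[Token]]:
    """Yield one list of tokens per segment from CoNLL-X formatted text."""
    segment: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current segment.
            if segment:
                yield segment
                segment = []
            continue
        cols = line.split("\t")
        segment.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], int(cols[6]), cols[7])
        )
    if segment:
        # Last segment when the text does not end with a blank line.
        yield segment


# Invented sample data (assumption): not actual treebank content.
SAMPLE = (
    "1\tdet\tdet\tpron\tpron\t_\t2\tSUBJ\t_\t_\n"
    "2\ter\tvære\tverb\tverb\t_\t0\tFINV\t_\t_\n"
    "3\tfint\tfin\tadj\tadj\t_\t2\tSPRED\t_\t_\n"
    "\n"
    "1\tja\tja\tinterj\tinterj\t_\t0\tINTERJ\t_\t_\n"
)

segments = list(read_conllx(SAMPLE))
for seg in segments:
    print([(t.form, t.head, t.deprel) for t in seg])
```

To read a downloaded file, pass its contents to `read_conllx`, for example `read_conllx(open(path, encoding="utf-8").read())`, where `path` is wherever the file was saved.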
The segments in the treebank are taken from 30 transcribed NDC recordings from 17 different places in Norway:
Bergen, Bømlo, Flå, Herøy in Møre and Romsdal, Hjartdal, Hyllestad, Jevnaker, Jølster, Kirkesdalen, Lierne, Lyngdal, Rømskog, Sokndal, Stamsund, Trondheim, Vardø and Ål in Hallingdal.
As far as possible, the annotations follow the guidelines for Norsk Dependenstrebank and Retningslinjer for morfologisk og syntaktisk annotasjon i Norsk dependenstrebank (Kari Kinn, Per Erik Solberg and Pål Kristian Eriksen, 2013). For certain speech-specific features that are not mentioned in the NDT guidelines, the annotations follow the guidelines written for the LIA Treebank.
The treebank was first annotated automatically and then manually corrected by at least one person using the annotation tool ConlluEditor.
Read about the treebank in:
Andre Kåsen, Kristin Hagen, Anders Nøklestad, Joel Priestley, Per Erik Solberg and Dag Trygve Truslew Haug. 2022. The Norwegian Dialect Corpus Treebank. In Nicoletta Calzolari et al.: Proceedings of the Thirteenth Language Resources and Evaluation Conference.
The NDC Parser
The NDC Parser is an instance of UUparser, a transition-based dependency parser developed at Uppsala University. The parser is trained on the annotated Bokmål transcriptions from the NDC Treebank. A corresponding parser for Nynorsk, the LIA parser, has also been developed.
The parsers were evaluated using 5-fold cross-validation:
Treebank | UAS | LAS |
---|---|---|
LIA | 85.23 | 80.01 |
NDC | 84.11 | 78.43 |
(UAS: unlabelled attachment score, LAS: labelled attachment score)
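The two scores can be illustrated with a short Python sketch. UAS is the percentage of tokens that receive the correct head, and LAS is the percentage that receive both the correct head and the correct dependency label. The function name `attachment_scores` and the toy analyses are hypothetical, and the sketch counts every token; whether the evaluation reported above excluded any tokens, or how the scores were combined across folds, is not stated here.

```python
from typing import List, Tuple

# One (head, deprel) pair per token, in token order.
Analysis = List[Tuple[int, str]]


def attachment_scores(gold: Analysis, pred: Analysis) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    total = len(gold)
    # UAS: only the head has to match.
    heads_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head and label both have to match.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * heads_ok / total, 100.0 * both_ok / total


# Invented four-token example (assumption): not actual parser output.
gold = [(2, "SUBJ"), (0, "FINV"), (2, "SPRED"), (2, "ADV")]
pred = [(2, "SUBJ"), (0, "FINV"), (2, "ADV"), (3, "ADV")]

uas, las = attachment_scores(gold, pred)
print(f"UAS: {uas:.2f}  LAS: {las:.2f}")  # UAS: 75.00  LAS: 50.00
```

In the example, three of the four tokens have the correct head, but only two of them also carry the correct label, so LAS is lower than UAS. LAS can never exceed UAS, which matches the pattern in the table.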
The LIA Treebank
The LIA Treebank includes 7536 speech segments and 77 701 tokens from the speech corpus LIA Norwegian, which is transcribed in Nynorsk. The treebank was the first treebank for Norwegian speech, and the NDC Treebank is built up the same way with transcriptions from the same dialect areas.
The LIA Treebank is available in three formats: searchable in Glossa, and downloadable from Github in both conllx and conllu formats.
Read more about the LIA Treebank
Contact
Andre Kåsen has worked on morphological tagging and parsing for both treebanks and has also written his master's thesis on the topic. Feel free to contact him at andre.kasen at the National Library of Norway, nb.no.
Financing
The development of the NDC Treebank and parser is financed by the infrastructure project Clarino+.