LIA Norwegian comprises 3.5 million words from 1374 informants in 226 local areas in Norway. The material is transcribed both (quasi-)phonetically and orthographically (in Nynorsk), and is morphologically tagged with the LIA tagger, a newly developed spoken-language tagger for Nynorsk.
LIA Norwegian is accessible via the corpus search interface Glossa.
The recordings and transcriptions were provided by four universities: NTNU, UiB, UiO and UiT. There is also material from Målførearkivet (the dialect archive at the University of Oslo) that was previously available in the Nordic Dialect Corpus.
Audio files, transcriptions and metadata from the corpus are available at a file depot, along with audio that has not been transcribed in the project. The transcriptions can be downloaded in ELAN format from the depot, while the audio can be streamed.
Downloadable transcriptions and audio files
All transcriptions can be downloaded in plain text format. In addition, a folder containing 553 transcriptions from LIA Norwegian in ELAN format, along with their corresponding audio, is available for download. These recordings contain no sensitive information and can be used freely by linguists or for technological purposes. (Many other LIA recordings contain material that has been deemed sensitive. That material has not been transcribed, so the recordings can still be used in the corpus, but they are not available for download.)
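For readers who want to work with the downloaded ELAN files programmatically, the following is a minimal sketch, not part of the LIA distribution. ELAN files (.eaf) are XML, so the Python standard library is enough to list the time-aligned annotations. The function name read_eaf_segments and the Segment type are hypothetical, and nothing is assumed about which tiers the LIA files contain: the sketch simply reports every tier it finds.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    tier: str                 # TIER_ID as given in the file
    start_ms: Optional[int]   # None if the time slot has no TIME_VALUE
    end_ms: Optional[int]
    text: str


def read_eaf_segments(source):
    """Return the time-aligned annotations of an ELAN (.eaf) file.

    `source` may be a file path or an open binary file object.
    Hypothetical helper: no LIA-specific tier layout is assumed.
    """
    root = ET.parse(source).getroot()

    # Map each time slot id to its offset in milliseconds.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    segments = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID", "")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            text = ann.findtext("ANNOTATION_VALUE", default="") or ""
            segments.append(
                Segment(
                    tier_id,
                    slots.get(ann.get("TIME_SLOT_REF1")),
                    slots.get(ann.get("TIME_SLOT_REF2")),
                    text.strip(),
                )
            )
    return segments
```

Calling read_eaf_segments on the path of a downloaded .eaf file should return one Segment per aligned annotation. Annotations that depend on other annotations rather than on the timeline (REF_ANNOTATION elements) are deliberately skipped in this sketch.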
The LIA Treebank
The LIA Treebank includes 5250 speech segments and 55 410 tokens from LIA Norwegian. The treebank is annotated with morphological and dependency-style syntactic analyses, which have been manually corrected. The treebank is available in two formats:
tekstlab-post at iln.uio.no