LIA Norwegian

LIA Norwegian - Corpus of historical dialect recordings

LIA Norwegian comprises 3.5 million words from historical dialect recordings, elicited from 1382 informants in 227 local areas in Norway. The material is transcribed both (quasi-)phonetically and orthographically (Nynorsk), and it is morphologically tagged with the LIA tagger, a newly developed spoken-language tagger for Nynorsk.

LIA Norwegian is accessible via the corpus search interface Glossa.

The recordings and transcriptions were provided by four universities: NTNU, UiB, UiO and UiT. There is also material from Målførearkivet (the dialect archive at the University of Oslo) that was previously available in the Nordic Dialect Corpus.

Search the corpus
Read the user manual for LIA Norwegian
Read about the LIA project

How to refer to the corpus:
Hagen, Kristin & Vangsnes, Øystein A. (2023). LIA-korpusa – eldre talemålsopptak for norsk og samisk gjort tilgjengelege.
Nordlyd, 47(2), 119-130. https://doi.org/10.7557/12.7157

Please also add the corpus handle:
LIA Norwegian - Corpus of historical dialect recordings:
https://hdl.handle.net/11538/0000-000C-368B-B

File depot

Audio files, transcriptions and metadata from the corpus are available in a file depot, along with audio that has not been transcribed in the project. The transcriptions can be downloaded in ELAN format from the depot, while the audio can be streamed.

Search the LIA file depot
Read about the depot

Downloadable transcriptions and audio files

All transcriptions are downloadable in plain text format. In addition, a folder containing 553 transcriptions from LIA Norwegian in ELAN format, along with their corresponding audio, can be downloaded. These recordings contain no sensitive information and can be used freely by linguists or for language technology purposes. (Many of the other LIA recordings have content that has been deemed sensitive. Such content has not been transcribed, so that the recordings can still be used in the corpus, but these recordings are not available for download.)
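For readers who want to work with the ELAN transcriptions programmatically, the following is a minimal sketch of reading one file with the Python standard library. ELAN files (.eaf) are XML; the element and attribute names used here (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) come from the general ELAN file format. Which tiers the LIA files contain is not stated on this page, so the sketch simply returns every tier it finds, and it skips annotations that are not directly time-aligned.

```python
import xml.etree.ElementTree as ET


def read_eaf(source):
    """Return {tier_id: [(start_ms, end_ms, text), ...]} for one ELAN file.

    `source` is a path or an open file object. Only time-aligned
    annotations are read; reference annotations are ignored.
    """
    root = ET.parse(source).getroot()

    # TIME_ORDER maps time-slot ids to millisecond offsets in the audio.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }

    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="") or ""
            rows.append((start, end, text.strip()))
        tiers[tier.get("TIER_ID")] = rows
    return tiers
```

Calling `read_eaf("some_recording.eaf")` (a hypothetical file name) would give one list of timed segments per tier, which can then be matched against the corresponding audio file.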

User license

Download selected audio files and transcriptions from LIA Norwegian

Download all transcriptions from LIA in plain text format from Github

The LIA Treebank

The LIA Treebank includes 7536 speech segments and 77 701 tokens from LIA Norwegian. The treebank is annotated with morphological and dependency-style syntactic analyses, which have been manually corrected. The treebank is available in three versions:

- A downloadable version in CoNLL-X format.

Download the treebank from Github
License

- A searchable version in the search interface Glossa.

Search the treebank

- A downloadable version in CoNLL-U format. This version was automatically converted to Universal Dependencies by Lilja Øvrelid, University of Oslo. The CoNLL-U version contains 5250 speech segments and 55 410 tokens.

Download the CoNLL-U version
License
Read more in Norwegian
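
As an illustration of how the Universal Dependencies version can be read, here is a minimal sketch in plain Python. It relies only on the general layout of such files: ten tab-separated columns per token, comment lines starting with `#`, and a blank line after each sentence. The function names are made up for this example, and the file name in the usage note is hypothetical.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_sentences(lines):
    """Yield each sentence (speech segment) as a list of token dicts."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def count_segments_and_tokens(lines):
    """Return (number of sentences, number of tokens)."""
    segments = tokens = 0
    for sent in read_sentences(lines):
        segments += 1
        tokens += len(sent)
    return segments, tokens
```

With a downloaded file, `count_segments_and_tokens(open("lia.conllu", encoding="utf-8"))` gives a quick sanity check against the segment and token figures quoted above, assuming tokens are counted the same way.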

How to refer to the treebank:
The LIA Treebank: https://hdl.handle.net/11538/bab5d1e1-2

The LIA Parser

The LIA parser is a dependency parser trained on the LIA Treebank. It uses UUParser, a transition-based dependency parser developed at Uppsala University.
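
To give an idea of what "transition-based" means, the sketch below runs a textbook arc-standard transition system over a sentence whose correct heads are already known. It is not UUParser's code, and UUParser's own transition system and scoring model are not described on this page; in a trained parser, a classifier picks each transition, whereas here the known tree stands in for that classifier. The function name is made up for this example.

```python
def arc_standard_oracle(heads):
    """Derive the transition sequence for a projective dependency tree.

    heads[i] is the head of token i + 1; 0 denotes the artificial root.
    Returns (transitions, arcs), where arcs holds (head, dependent) pairs.
    """
    n = len(heads)
    stack = [0]                      # starts with the root
    buffer = list(range(1, n + 1))   # tokens still to be read
    arcs = set()
    transitions = []

    def has_all_children(tok):
        # A token may only be attached once its own dependents are attached.
        return all((tok, d) in arcs
                   for d in range(1, n + 1) if heads[d - 1] == tok)

    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            # LEFT-ARC: the top of the stack is the head of the item below it.
            if below != 0 and heads[below - 1] == top:
                arcs.add((top, below))
                stack.pop(-2)
                transitions.append("LEFT-ARC")
                continue
            # RIGHT-ARC: the item below is the head of the top of the stack.
            if heads[top - 1] == below and has_all_children(top):
                arcs.add((below, top))
                stack.pop()
                transitions.append("RIGHT-ARC")
                continue
        if not buffer:
            raise ValueError("tree is not projective")
        # SHIFT: move the next token from the buffer onto the stack.
        stack.append(buffer.pop(0))
        transitions.append("SHIFT")
    return transitions, arcs
```

For a three-token sentence in which token 2 is the root and tokens 1 and 3 depend on it, `arc_standard_oracle([2, 0, 2])` yields the sequence SHIFT, SHIFT, LEFT-ARC, SHIFT, RIGHT-ARC, RIGHT-ARC.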

Read more and download the parser.
Read more in Norwegian

How to refer to the parser:
The LIA parser: https://hdl.handle.net/11538/25934B3F-4

Contact:

tekstlab-post at iln.uio.no