CLARINO is a Norwegian infrastructure project jointly funded by the Research Council of Norway and a consortium of Norwegian universities and research institutions. Its goal is to implement the Norwegian part of CLARIN. The ultimate aim is to make existing and future language resources easily accessible for researchers and to bring eScience to humanities disciplines. The CLARINO project is coordinated by the University of Bergen.
The CLARINO Text Laboratory Centre is a C centre in the CLARIN infrastructure.
The table below shows Text Laboratory resources with a signed CLARIN agreement. More resources will be added. Go to the Text Laboratory homepage to view all resources from the Text Laboratory.
Corpora:
The Big Brother Corpus | (2007) 440 300 tokens. Speech. Norwegian TV show from 2001. Accessible through Glossa. Licence: —Licence conditions —Download metadata —Search the corpus |
Corpus of American Nordic Speech v.3 | (2019) 746 000 tokens. Speech. American Norwegian/Swedish. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
Corpus of Doctor-Patient Consultations from Ahus | (2015) 950 000 tokens. Speech. Transcriptions without audio files. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
The Lexicographic Corpus for Norwegian Bokmål | (2013) 100 mill tokens. Written text. Norwegian Bokmål. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
(2018) 3.5 mill tokens. Speech. Norwegian dialects from 1937-1996. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
The LIA Treebank | (2022) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Search the treebank —Download the treebank |
LIA Sápmi - Sámegiela hállangiellakorpus | (2018) 190 000 tokens. Speech. Sami dialects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
Nordic Dialect Corpus v. 4.0 | (2013) 2.75 mill tokens. Speech. Nordic dialects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
The NDC Treebank | (2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Search the corpus —Download the treebank |
Nordic Syntax Database | (2013) 924 sentence judgments by Nordic dialect speakers. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the database |
The NORINT Corpus | (2017) Speech (110 000 tokens) and written text (53 000 tokens). Norwegian as second language. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
The NORM Corpus | (2017) 1.17 mill tokens. Written pupil texts. Norwegian Bokmål and Nynorsk. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
Norwegian Words | (2013) Lexical database with 1650 Norwegian Bokmål nouns, adjectives and verbs. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the database |
NoTa-Oslo Norsk talespråkskorpus - Oslodelen | (2006) 957 000 tokens. Speech. Oslo sociolects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
NoWaC - Norwegian Web as Corpus v1.0 | (2010) 700 million tokens. Written text. Bokmål. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Download the corpus —Search the corpus |
Frequency lists from NoWaC | (2010) Frequency lists. Bokmål. Licence: —Licence conditions —Download metadata —Download Frequency lists |
The SKRIV Corpus | (2016) 112 000 tokens. Written texts by students in upper secondary vocational education programs. Norwegian Bokmål. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
TAUS - Talemålsundersøkelsen i Oslo v.3 | (2007, 2020) 388 000 tokens. Speech. Oslo sociolect from 1971-1973. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus |
Downloadable transcriptions (and audio files) from corpora:
The Big Brother Corpus - downloadable transcriptions | (2007) 440 300 tokens. Transcriptions of dialogs from the Norwegian TV show from 2001. Licence: —Licence conditions —Download metadata —Download transcriptions |
Corpus of American Nordic Speech v.3 - downloadable transcriptions | (2019) 746 000 tokens. Speech. Transcriptions of American Norwegian/Swedish interviews and dialogs. Licence: —Licence conditions —Download metadata —Download transcriptions |
LIA: Transcriptions and selected audio files from LIA Norwegian for download | (2021) 553 transcriptions with corresponding audio files from LIA Norwegian. Speech. Licence: —Licence conditions —Download metadata —Download audio files and transcriptions |
The LIA Treebank | (2021) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Licence: —Licence conditions —Download metadata —Download conllx-format —Download conllu-format (version with 5250 segments and 55 410 tokens) |
Nordic Dialect Corpus v. 4.0 - downloadable transcriptions | (2013) 2.75 mill tokens. Speech. Transcriptions of interviews and dialogs with Nordic dialects. Licence: —Licence conditions —Download metadata —Download transcriptions |
The NDC Treebank | (2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Licence: —Licence conditions —Download metadata —Download conllx-format |
NoTa-Oslo Norsk talespråkskorpus - Oslodelen - downloadable transcriptions | (2006) 957 000 tokens. Speech. Transcriptions of interviews and dialogs with Oslo sociolects. Licence: —Licence conditions —Download metadata —Download transcriptions |
TAUS - Talemålsundersøkelsen i Oslo - downloadable transcriptions | (2007, 2020) 388 000 tokens. Speech. Transcriptions of interviews with Oslo sociolect from 1971-1973. Licence: —Licence conditions —Download metadata —Download transcriptions |
Tools:
Glossa | (2023) Search and post-processing tool for text and speech corpora. Licence: —MIT Licence —Download metadata —Download Glossa |
The Oslo-Bergen Tagger | (2023) Morphological tagger for Norwegian Bokmål and Nynorsk. Licence: —GPL —Download metadata —Download OBT |
The LIA parser | (2023) Dependency parser for spoken Norwegian dialects transcribed to Nynorsk. Licence: —Licence conditions —Download metadata —Download the parser |
The NDC parser | (2023) Dependency parser for spoken Norwegian dialects transcribed to Bokmål. Licence: —Licence conditions —Download metadata —Download the parser |
More language resources from the Text Laboratory/Humit.
Contact: tekstlab-post at iln.uio.no
Clarino Consortium partners: