CLARINO is a Norwegian infrastructure project jointly funded by the Research Council of Norway and a consortium of Norwegian universities and research institutions. Its goal is to implement the Norwegian part of CLARIN. The ultimate aim is to make existing and future language resources easily accessible for researchers and to bring eScience to humanities disciplines. The CLARINO project is coordinated by the University of Bergen.
CLARINO Humit Text Laboratory Centre is a C centre in the CLARIN infrastructure.
The table below shows Humit Text Laboratory resources with a signed CLARIN agreement. More resources will come. Go to the Humit homepage to view all resources from Humit/Text Laboratory.
Corpora:
The Big Brother Corpus | (2007) 440 300 tokens. Speech. Norwegian TV show from 2001. Accessible through Glossa. Licence: ![]() |
Corpus of American Nordic Speech v.3 | (2019) 746 000 tokens. Speech. American Norwegian/Swedish. Accessible through interface. Licence: ![]() |
Corpus of Doctor-Patient Consultations from Ahus | (2015) 950 000 tokens. Speech. Transcriptions without audio files. Accessible through interface. Licence: ![]() |
The Lexicographic Corpus for Norwegian Bokmål | (2013) 100 mill tokens. Written text. Norwegian Bokmål. Accessible through interface. Licence: ![]() |
(2018) 3.5 mill tokens. Speech. Norwegian dialects from 1937-1996. Accessible through interface. Licence: ![]() |
The LIA Treebank | (2022) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: ![]() ![]() ![]() |
LIA Sápmi - Sámegiela hállangiellakorpus | (2018) 190 000 tokens. Speech. Sami dialects. Accessible through interface. Licence: ![]() |
Nordic Dialect Corpus v. 4.0 | (2013) 2.75 mill tokens. Speech. Nordic dialects. Accessible through interface. Licence: ![]() |
The NDC Treebank | (2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: ![]() ![]() ![]() |
Nordic Syntax Database | (2013) 924 sentence judgments by Nordic dialect speakers. Accessible through interface. Licence: ![]() ![]() |
The NORINT Corpus | (2017) Speech (110 000 tokens) and written text (53 000 tokens). Norwegian as second language. Accessible through interface. Licence: ![]() |
The NORM Corpus | (2017) 1.17 mill tokens. Written pupil texts. Norwegian Bokmål and Nynorsk. Accessible through interface. Licence: ![]() |
Norwegian Words | (2013) Lexical database with 1650 Norwegian Bokmål nouns, adjectives and verbs. Accessible through interface. Licence: ![]() |
NoTa-Oslo Norsk talespråkskorpus - Oslodelen | (2006) 957 000 tokens. Speech. Oslo sociolects. Accessible through interface. Licence: ![]() |
NoWaC - Norwegian Web as Corpus v1.0 | (2010) 700 million tokens. Written text. Bokmål. Accessible through interface or download. Licence: ![]() ![]() ![]() |
Frequency lists from NoWaC | (2010) Frequency lists. Bokmål. Licence: ![]() ![]() |
The SKRIV Corpus | (2016) 112 000 tokens. Written texts by students in upper secondary vocational education programs. Norwegian Bokmål. Accessible through interface. Licence: ![]() |
TAUS - Talemålsundersøkelsen i Oslo v.3 | (2007, 2020) 388 000 tokens. Speech. Oslo sociolect from 1971-1973. Accessible through interface. Licence: ![]() |
Downloadable transcriptions (and audio files) from corpora:
The Big Brother Corpus - downloadable transcriptions | (2007) 440 300 tokens. Transcriptions of dialogs from the Norwegian TV show from 2001. Licence: ![]() |
Corpus of American Nordic Speech v.3 - downloadable transcriptions | (2019) 746 000 tokens. Speech. Transcriptions of American Norwegian/Swedish interviews and dialogs. Licence: ![]() |
LIA: Transcriptions and selected audio files from LIA Norwegian for download | (2021) 553 transcriptions with corresponding audio files from LIA Norwegian. Speech. Licence: ![]() |
The LIA Treebank | (2021) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Licence: ![]() |
Nordic Dialect Corpus v. 4.0 - downloadable transcriptions | (2013) 2.75 mill tokens. Speech. Transcriptions of interviews and dialogs with Nordic dialects. Licence: ![]() |
The NDC Treebank | (2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Licence: ![]() |
NoTa-Oslo Norsk talespråkskorpus - Oslodelen - downloadable transcriptions | (2006) 957 000 tokens. Speech. Transcriptions of interviews and dialogs with Oslo sociolects. Licence: ![]() |
TAUS - Talemålsundersøkelsen i Oslo - downloadable transcriptions | (2007, 2020) 388 000 tokens. Speech. Transcriptions of interviews with Oslo sociolect from 1971-1973. Licence: ![]() |
Tools:
Glossa | (2023) Search and post-processing tool for text and speech corpora. Licence: ![]() |
The Humit Tagger | (2024) Morphological AI tagger for Norwegian Bokmål and Nynorsk. Licence: ![]() |
The Oslo-Bergen Tagger | (2023) Morphological tagger for Norwegian Bokmål and Nynorsk. Licence: ![]() |
The LIA Parser | (2023) Dependency parser for spoken Norwegian dialects transcribed to Nynorsk. ![]() |
The NDC Parser | (2023) Dependency parser for spoken Norwegian dialects transcribed to Bokmål. ![]() |
More language resources from the Humit/Text Laboratory.
Contact: humit@hf.uio.no
CLARINO Consortium partners: