CLARINO Humit Text Laboratory Centre

Welcome to CLARINO Humit Text Laboratory Centre

CLARINO is a Norwegian infrastructure project jointly funded by the Research Council of Norway and a consortium of Norwegian universities and research institutions. Its goal is to implement the Norwegian part of CLARIN. The ultimate aim is to make existing and future language resources easily accessible to researchers and to bring eScience to the humanities. The CLARINO project is coordinated by the University of Bergen.

CLARINO Humit Text Laboratory Centre is a C centre in the CLARIN infrastructure.
The table below shows Humit Text Laboratory resources with a signed CLARIN agreement. More resources will be added. Go to the Humit homepage to view all resources from Humit/Text Laboratory.

Corpora:

The Big Brother Corpus	(2007) 440 300 tokens. Speech. Norwegian TV show from 2001. Accessible through Glossa. Licence: —Licence conditions —Download metadata —Search the corpus
Corpus of American Nordic Speech v.3	(2019) 746 000 tokens. Speech. American Norwegian/Swedish. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
Corpus of Doctor-Patient Consultations from Ahus	(2015) 950 000 tokens. Speech. Transcriptions without audio files. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
The Lexicographic Corpus for Norwegian Bokmål	(2013) 100 mill tokens. Written text. Norwegian Bokmål. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
LIA Norwegian - Corpus of historical dialect recordings	(2018) 3.5 mill tokens. Speech. Norwegian dialects from 1937-1996. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
The LIA Treebank	(2022) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Search the treebank —Download the treebank
LIA Sápmi - Sámegiela hállangiellakorpus	(2018) 190 000 tokens. Speech. Sami dialects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
Nordic Dialect Corpus v. 4.0	(2013) 2.75 mill tokens. Speech. Nordic dialects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
The NDC Treebank	(2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Search the corpus —Download the treebank
Nordic Syntax Database	(2013) 924 sentence judgments by Nordic dialect speakers. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the database
The NORINT Corpus	(2017) Speech (110 000 tokens) and written text (53 000 tokens). Norwegian as second language. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
The NORM Corpus	(2017) 1.17 mill tokens. Written pupil texts. Norwegian Bokmål and Nynorsk. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
Norwegian Words	(2013) Lexical database with 1650 Norwegian Bokmål nouns, adjectives and verbs. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the database
NoTa-Oslo Norsk talespråkskorpus - Oslodelen	(2006) 957 000 tokens. Speech. Oslo sociolects. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
NoWaC - Norwegian Web as Corpus v1.0	(2010) 700 million tokens. Written text. Bokmål. Accessible through interface or download. Licence: —Licence conditions —Licence for download: —Licence conditions —Download metadata —Download the corpus —Search the corpus
Frequency lists from NoWaC	(2010) Frequency lists. Bokmål. Licence: —Licence conditions —Download metadata —Download Frequency lists
The SKRIV Corpus	(2016) 112 000 tokens. Written texts by students in upper secondary vocational education programs. Norwegian Bokmål. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus
TAUS - Talemålsundersøkelsen i Oslo v.3	(2007, 2020) 388 000 tokens. Speech. Oslo sociolect from 1971-1973. Accessible through interface. Licence: —Licence conditions —Download metadata —Search the corpus

Downloadable transcriptions (and audio files) from corpora:

The Big Brother Corpus - downloadable transcriptions	(2007) 440 300 tokens. Transcriptions of dialogs from the Norwegian TV show from 2001. Licence: —Licence conditions —Download metadata —Download transcriptions
Corpus of American Nordic Speech v.3 - downloadable transcriptions	(2019) 746 000 tokens. Speech. Transcriptions of American Norwegian/Swedish interviews and dialogs. Licence: —Licence conditions —Download metadata —Download transcriptions
LIA: Transcriptions and selected audio files from LIA Norwegian for download	(2021) 553 transcriptions with corresponding audio files from LIA Norwegian. Speech. Licence: —Licence conditions —Download metadata —Download audio files with transcriptions —Download all LIA transcriptions
The LIA Treebank	(2021) 7536 speech segments and 77 701 tokens from LIA Norwegian annotated with morphological and dependency-style syntactic analysis. Licence: —Licence conditions —Download metadata —Download conllx-format —Download conllu-format (version with 5250 segments and 55 410 tokens)
Nordic Dialect Corpus v. 4.0 - downloadable transcriptions	(2013) 2.75 mill tokens. Speech. Transcriptions of interviews and dialogs with Nordic dialects. Licence: —Licence conditions —Download metadata —Download transcriptions
The NDC Treebank	(2022) 4637 speech segments and 66 042 tokens from the Norwegian part of Nordic Dialect Corpus annotated with morphological and dependency-style syntactic analysis. Licence: —Licence conditions —Download metadata —Download conllx-format
NoTa-Oslo Norsk talespråkskorpus - Oslodelen - downloadable transcriptions	(2006) 957 000 tokens. Speech. Transcriptions of interviews and dialogs with Oslo sociolects. Licence: —Licence conditions —Download metadata —Download transcriptions
TAUS - Talemålsundersøkelsen i Oslo - downloadable transcriptions	(2007, 2020) 388 000 tokens. Speech. Transcriptions of interviews with Oslo sociolect from 1971-1973. Licence: —Licence conditions —Download metadata —Download transcriptions

Tools:

Glossa	(2023) Search and post-processing tool for text and speech corpora. Licence: —MIT Licence —Download metadata —Download Glossa
The Humit Tagger	(2024) Morphological AI tagger for Norwegian Bokmål and Nynorsk. Licence: —MIT Licence —Download metadata —Use the Humit Tagger online —Download the Humit Tagger
The Oslo-Bergen Tagger	(2023) Morphological tagger for Norwegian Bokmål and Nynorsk. Licence: —GPL —Download metadata —Download OBT
The LIA Parser	(2023) Dependency parser for spoken Norwegian dialects transcribed to Nynorsk. Licence: —Licence conditions —Download metadata —Download the parser
The NDC Parser	(2023) Dependency parser for spoken Norwegian dialects transcribed to Bokmål. Licence: —Licence conditions —Download metadata —Download the parser

More language resources from the Humit/Text Laboratory.

Contact: humit@hf.uio.no
