CANS - Corpus of American Nordic Speech v.3.1


CANS - Corpus of American Nordic Speech v.3.1 (latest version published 27 January 2021) consists of interviews and conversations with 246 American Norwegian informants from 57 locations in the USA and Canada, in total almost 773,000 words. CANS v.3.1 includes recordings and transcriptions from Janne Bondi Johannessen et al. (2010-2016) together with older recordings and transcriptions from Didrik Arup Seip and Ernst W. Selmer (1931), Einar Haugen (1942) and Arnstein Hjelde (1987, 1990, 1992).

In September 2017 the corpus was expanded with American Swedish: nearly 46,000 words spoken by 22 informants from seven locations in the USA. The Swedish recordings were collected by Ida Larsson et al. (2011-2014).

The corpus is freely available for research; log in with Feide or CLARIN. (Contact us if you need another login alternative.)

The interviews and conversations in the corpus are transcribed in two ways: a phonetic transcription and an orthographic transcription. The two transcriptions are linked to each other and to the original audio and video files.

Download the transcriptions
The transcriptions are downloadable, some in HTML format and some in plain-text format.

Read or download the transcriptions from GitHub:

Please refer to the corpus with this reference:
Johannessen, Janne Bondi. 2015. The Corpus of American Norwegian Speech (CANS). In Beáta Megyesi (ed.): Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. NEALT Proceedings Series 23.

Please also add the corpus URL:
CANS - Corpus of American Nordic Speech v.3.1: https://tekstlab.uio.no/norskiamerika/english/corpus.html

Tools

Phonetic transcription: The first recordings were transcribed with Transcriber. At present we use the transcription tool ELAN.

Orthographic transcription: The Oslo Transliterator, a semi-automatic dialect transliterator developed at the Text Laboratory, is used to produce orthographic transcriptions from the phonetic transcriptions, for both Norwegian and Swedish. The orthographic transcriptions are proofread against the audio files.

Morphosyntactic tagging, Norwegian: The transcriptions are tagged with morphosyntactic categories by a statistical tagger (TreeTagger) first developed for the NoTa-Oslo corpus. The tagger achieved an accuracy of 96.9% in 10-fold cross-validation.

Nøklestad, Anders and Åshild Søfteland (2007). Tagging a Norwegian Speech Corpus. NODALIDA 2007 Conference Proceedings.
TreeTagger
The Oslo-Bergen Tagger
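
For readers unfamiliar with the evaluation method, 10-fold cross-validation can be sketched in Python as follows. This is only an illustration under stated assumptions: the tagger here is a most-frequent-tag baseline standing in for TreeTagger, the function names are hypothetical, the tags are toy labels rather than the corpus tag set, and the data is split per token, whereas a real evaluation would normally split by sentence or recording.

```python
from collections import Counter, defaultdict

def most_frequent_tag_model(train):
    """Baseline tagger (a stand-in, not TreeTagger): most frequent tag per word."""
    by_word = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in train:
        by_word[word][tag] += 1
        all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback for unseen words
    return lambda w: by_word[w].most_common(1)[0][0] if w in by_word else default

def cross_validate(data, k=10):
    """Mean accuracy over k folds; fold i holds out every k-th item starting at i."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [item for j, item in enumerate(data) if j % k != i]
        tag = most_frequent_tag_model(train)
        scores.append(sum(tag(w) == t for w, t in test) / len(test))
    return sum(scores) / k

# Toy data with invented tags, repeated so that every fold is non-empty.
toy = [("det", "pron"), ("er", "verb"), ("hard", "adj")] * 10
print(cross_validate(toy))
```

Each token is held out exactly once, so the reported figure is an average over ten models that never saw the tokens they were scored on.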

Morphosyntactic tagging, Swedish: The Swedish tagger is a TnT tagger; see Kokkinakis (2003). The tagger is trained on the Swedish PAROLE corpus and on manually tagged orthographic Övdalian transcriptions from the Nordic Dialect Corpus.

Kokkinakis, Sofie Johansson. (2003). En studie över påverkande faktorer i ordklasstaggning. Baserad på taggning av svensk text med EPOS. Göteborg University.

The technical solutions were originally developed for The Nordic Dialect Corpus and financed by NorDiaSyn and NordForsk.

Search tool: CANS is searchable through Glossa, a search tool developed at the Text Laboratory. Glossa offers a modern, user-friendly and functional interface. The work is financed by the CLARINO project.

More about the transcriptions

Phonetic transcription: A phonetic transcription shows dialect features clearly in writing, whether they are phonological, morphological, syntactic or lexical. A written representation of speech also helps the linguist get a quick overview of the material.

The phonetic transcription standard is based on Papazian and Helleland's Norsk talemål. Lokal og sosial variasjon (2005), but with no special characters, only the Norwegian/Swedish alphabet. Also, the transcription is quite broad. The transcription standard in CANS is basically the same as that used for Norwegian in the Nordic Dialect Corpus. The standard is also used for the American Swedish transcription.

Orthographic transcription: An orthographic transcription is important because it generalizes over all the variation. It makes general searches possible, as well as automated methods such as grammatical tagging. The orthographic transcription is much faster to produce than the phonetic one, thanks to the semi-automatic dialect transliterator, which converts the phonetic transcription to Norwegian Bokmål or Swedish orthography.
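
The semi-automatic workflow can be pictured as a lookup with a manual fallback. The Python sketch below is an assumption made for illustration, not the Oslo Transliterator's actual algorithm: it suggests the orthographic form most often paired with a phonetic form in earlier transcriptions and marks unseen forms for a human transcriber. The function names and the `???` marker are hypothetical; the word pairs are taken from the examples on this page.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count how often each phonetic form was transcribed as each orthographic form."""
    counts = defaultdict(Counter)
    for phon, orth in pairs:
        counts[phon][orth] += 1
    return counts

def transliterate(tokens, counts, unknown="???"):
    """Suggest the most frequent orthographic form; mark unseen forms for review."""
    return [counts[t].most_common(1)[0][0] if t in counts else unknown
            for t in tokens]

# Previously transcribed (phonetic, orthographic) pairs.
seen = [("d", "det"), ("e", "er"), ("vi", "vi"),
        ("nå", "noe"), ("tå", "av"), ("å", "og")]
model = train(seen)
print(transliterate(["vi", "e", "haRd"], model))  # ['vi', 'er', '???']
```

Forms marked `???` are the ones a transcriber would fill in by hand, after which the new pair can be added to the training pairs.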

Transcription and translation Guidelines for Norwegian in America (in Norwegian)
Transcription Guidelines for Swedish in America
Transcription Guidelines (for the Nordic Dialect Corpus, in Norwegian)
Translation to orthographic transcription - Guidelines (for Nordic Dialect Corpus, in Norwegian)

The two transcriptions exemplified

Phon.:	d	e	haRd	tu	finn
Orthogr.:	det	er	hard	to	finne
Transl.:	it	is	hard	to	find

Phon.:	vi	sellt	nå	tå	ri	å	rennta	ut	resst'n
Orthogr.:	vi	solgte	noe	av	det	og	renta	ut	resten
Transl.:	we	sold	some	of	it	and	let	out	the rest
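
Because the tiers are aligned token by token, examples like these can be read into parallel records. The following is a minimal Python sketch, assuming each tier is a tab-separated line with its label in the first field, as in the examples above. The function name `align_tiers` is hypothetical, and this layout is not necessarily that of the downloadable files.

```python
def align_tiers(lines):
    """Zip tab-separated tiers into one tuple per token position."""
    tiers = []
    for line in lines:
        _label, *tokens = line.rstrip("\n").split("\t")  # first field is the tier label
        tiers.append(tokens)
    lengths = {len(tokens) for tokens in tiers}
    if len(lengths) != 1:
        raise ValueError(f"tiers are not aligned: token counts {sorted(lengths)}")
    return list(zip(*tiers))

example = [
    "Phon.:\td\te\thaRd\ttu\tfinn",
    "Orthogr.:\tdet\ter\thard\tto\tfinne",
    "Transl.:\tit\tis\thard\tto\tfind",
]
for phon, orth, gloss in align_tiers(example):
    print(f"{phon:6} {orth:6} {gloss}")
```

Raising an error on unequal token counts catches tiers that have drifted out of alignment before they are used for searching or tagging.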

Sweet welcome for the Norwegian researchers in Blair. Photo: K. M. Eide.


Janne and Signe with informants in Sunburg.

Contact:
tekstlab-post@iln.uio.no