CANS (Corpus of American Nordic Speech) presently consists of interviews and conversations with 50 American Norwegian informants from 22 locations in the USA and Canada, nearly 200 000 words in all.
In September 2017 the corpus was expanded with American Swedish: nearly 46 000 words spoken by 19 informants from 3 locations in the USA. The corpus will be expanded further as more transcriptions are finished.
The corpus is freely available for research; log in with Feide, eduGAIN or CLARIN. (Contact us if you need another login alternative.)
The interviews and conversations in the corpus are transcribed in two ways: a phonetic transcription and an orthographic transcription. The two transcriptions are linked to each other and to the original audio and video files.
Please refer to the corpus with this reference:
Johannessen, Janne Bondi. 2015. The Corpus of American Norwegian Speech (CANS). In Beáta Megyesi (ed.): Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. NEALT Proceedings Series 23.
Please also add the corpus URL:
CANS - Corpus of American Nordic Speech: http://tekstlab.uio.no/norskiamerika/english/corpus.html
Orthographic transcription: The Oslo Transliterator, a semi-automatic dialect transliterator developed at the Text Laboratory, is used to produce orthographic transcriptions from the phonetic transcriptions, for both Norwegian and Swedish. The orthographic transcriptions are proofread against the audio files.
Morphosyntactic tagging, Norwegian: The transcriptions are tagged with morphosyntactic categories by a statistical tagger (TreeTagger) first developed for the NoTa-Oslo corpus. The tagger scored 96.9 % in 10-fold cross-validation.
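As a rough illustration of what a 10-fold cross-validation score means, the Python sketch below splits a tagged data set into ten folds, trains on nine, tests on the held-out one, and averages the results. It is not the TreeTagger evaluation itself: the baseline tagger, the example words and tag names, and the assumption that the score is token-level accuracy are all hypothetical stand-ins.

```python
from collections import Counter, defaultdict


def k_folds(items, k):
    """Yield (train, test) splits; fold i holds out every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_baseline(sentences):
    """Toy stand-in for a real tagger: remember each word's most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback tag for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word, default)


def cross_validate(sentences, k=10):
    """Return the mean token-level accuracy over k held-out folds.

    Assumes len(sentences) >= k so that no fold is empty.
    """
    scores = []
    for train, test in k_folds(sentences, k):
        tagger = train_baseline(train)
        tokens = [(w, t) for sent in test for w, t in sent]
        correct = sum(tagger(w) == t for w, t in tokens)
        scores.append(correct / len(tokens))
    return sum(scores) / len(scores)


# Hypothetical miniature data set: 20 identical two-word sentences.
SENTENCES = [[("eg", "pron"), ("snakkar", "verb")] for _ in range(20)]
```

Calling `cross_validate(SENTENCES, k=10)` returns the average accuracy across the ten held-out folds. Because every sentence is tested exactly once by a tagger that never saw it during training, the figure estimates performance on unseen material.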
Morphosyntactic tagging, Swedish: The Swedish tagger is a TnT tagger; see Kokkinakis (2003). The tagger is trained on the Swedish PAROLE corpus and on manually tagged orthographic Övdalian transcriptions from the Nordic Dialect Corpus.
Search tool: CANS is searchable through Glossa, a new search tool developed at the Text Laboratory. Glossa offers a modern, user-friendly and functional interface. The work is financed by the CLARINO project.
The first version of CANS with only American Norwegian speech is still available in the old Glossa search interface:
Phonetic transcription: In a phonetic transcription, dialect features are clearly represented in writing, whether they are phonological, morphological, syntactic or lexical. A written representation of speech helps the linguist get a quick overview of the material.
The phonetic transcription standard is based on Papazian and Helleland's Norsk talemål. Lokal og sosial variasjon (2005), but uses no special characters, only the letters of the Norwegian/Swedish alphabet. The transcription is also quite broad. The transcription standard in CANS is essentially the same as the one used for Norwegian in the Nordic Dialect Corpus. The same standard is used for the American Swedish transcription.
Orthographic transcription: An orthographic transcription is important because it generalizes over all the variation. It enables general searches and automated methods such as grammatical tagging. Orthographic transcription is much faster to produce than phonetic transcription, thanks to the semi-automatic dialect transliterator, which converts the phonetic transcription into Norwegian Bokmål or Swedish orthography.
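To make the idea of semi-automatic transliteration concrete, here is a minimal Python sketch. It does not describe how the Oslo Transliterator actually works: the lookup-table approach, the function name `transliterate`, and the example word forms are assumptions chosen for illustration. Unambiguous phonetic forms are converted automatically, while unknown or ambiguous forms are flagged for a human to check.

```python
def transliterate(tokens, lexicon):
    """Map phonetic tokens to orthographic forms where the lexicon is unambiguous.

    Returns (output_tokens, review_indices), where review_indices lists the
    positions a human proofreader should check.
    """
    output, review = [], []
    for i, tok in enumerate(tokens):
        candidates = lexicon.get(tok, [])
        if len(candidates) == 1:
            output.append(candidates[0])
        else:
            # Unknown form: keep it unchanged. Ambiguous form: take the first
            # candidate as a guess. Either way, flag the position for review.
            output.append(candidates[0] if candidates else tok)
            review.append(i)
    return output, review


# Hypothetical lexicon: phonetic form -> possible Bokmål forms.
LEXICON = {
    "je": ["jeg"],
    "e": ["jeg", "er"],  # ambiguous, so it needs a human decision
    "itte": ["ikke"],
    "snakke": ["snakker"],
}
```

For example, `transliterate(["je", "snakke", "itte"], LEXICON)` converts all three tokens without flagging any, whereas a token such as `"e"` receives a guess and is marked for review. Generalizing many dialect forms to one written form in this way is what makes general searches over the orthographic layer possible.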
Sweet welcome for the Norwegian researchers in Blair. Photo: K. M. Eide.
Janne and Signe with informants in Sunburg.