User Manual for Nordic Dialect Corpus (NDC) v. 3.0

Go to the NDC homepage

This user manual is written by Kristin Hagen and Ingvild Røsok. The manual is based on The Nordic Dialect Corpus - Search Interface Documentation written by Eirik Olsen.

1. Nordic Dialect Corpus

Nordic Dialect Corpus v. 3.0 is a corpus of Norwegian, Swedish, Danish, Faroese, Icelandic and Övdalian spoken language. It consists of spontaneous speech data and contains in excess 3 million words from conversations and interviews by dialect speakers. The recordings are transcribed and linked to audio and video. The corpus has a map function and can be searched in a variety of ways.

NOTE: In this version (v. 3.0) old recordings and transcriptions from Målførearkivet (Oslo Old Dialect Archive) are included. The same transcriptions translated to Nynorsk are now also searchable in LIA Norwegian - Corpus of Old Dialect Recordings.

In Nordic Dialect Corpus version 4.0, published 15. September 2019, Nordic Dialect Corpus contains dialect recordings and transcriptions from 1998 - 2015 only. Read the user manual for version 4.

On this page:

1.1 Transcriptions in Nordic Dialect corpus

1.2 Main search page

1.2.1 Simple search and examples of results

1.2.2 Extended Search

1.2.2.1 Search multiple words
1.2.2.2 Search for Lemma, Start, End, Phonetic, Segment initial or Segment final
1.2.2.3 Search for word class or morphological features
1.2.2.4 Search other tags (words not in the dictionary, laughter, yawning etc.)
1.2.2.5 Specify or exclude lemma or wordform

1.2.3 CQP Search Expression (CQP Query)

1.2.4 Or search

1.3 Metadata
1.3.1 Metadata search
1.3.2 Show speakers
1.3.3 Metadata categories

1.4 Random selection of search results

1.5 Statistics

1.6 Geographical map

1.7 Download data

1.8 Sort the search results


1.1 Transcriptions in Nordic Dialect corpus
The recordings are transcribed using the transcription program Transcriber. Using this software the transcriptions are connected to the audio and video files.

All recordings are orthographically transcribed following the standard of each language. In addition, all Norwegian recordings are transcribed semi-phonetically, and the Övdalian recordings are transcribed according to Övdalian orthography. For more details about the transcriptions themselves, cf. the transcription page on the ScanDiaSyn site.

1.2 Main search page
Figure 1 shows the main search page for Nordic Dialect Corpus:





Figure 1: Main search page of Nordic Dialect Corpus

On the left hand side are all the searchable metadata categories. In Nordic Dialect Corpus, these categories are a selection of different features relating to the informants including, Informant code, recording year, birth year, gender, age, age group, place, area, region, country, genre. The number of selected informants (including the number of tokens) are indicated above the metadata categories.

The Show speakers button gives you an overview of all the informants or of the selection of informants you have chosen. Read more in section 1.3.

At the top of the page, you will find two buttons. The Hide filters button hides the metadata menu to the left, while the Reset form button gives you a blank search page. The rest of the search page is about the searched keyword(s) or its properties. You can read more below.

 

1.2.1 Simple search and examples of results
In Simple search you can search for individual words and phrases in the search field.

The results are shown as a concordance (see figure 2). The orthographic transcription is on the first result line with the semi-phonetic transcription below.

You can see the number of hits over the search results on the right. There are 50 search results displayed per page. If there are more than 50, the results will be displayed across multiple pages, which you can access by clicking on the arrows.

Above the search results you will find buttons for download and sorting. For more information, see sections 1.7. and 1.8.The Concordance search result view is the pre-selected view, but you can also get different statistical views of the search result (see 1.5) or view the results on a geographical map (see section 1.6.)

If you hover your mouse over a word from the search result, a small window will pop up with information about lemma, word class, other morphological information and tags, see figure2a. Read more about word classes and tags in sections 1.2.2.3 and 1.2.2.4.

In the left-hand column of the search results there are three icons. Click on the video icon to watch a video of the search result (see figure 3). Be aware that not all recordings have the video possibility. Click the audio icon to get audio only (see figure 4). In the audio and video media player, you can get more context by moving the square buttons on the slider bar, located below the box, to the left and / or right.

Click on the last icon to bring up the sound wave and spectrogram of the search result (see figure 5). Click on the informant code to view the metadata about the informant (see figure 6).


Figure 2: Search results for individual words in the corpus.

 

 

Figure 2a: If you hover the mouse over a word from the search result, a small window of information about word class, other morphological information and tags will be displayed.




Figure 3: Video view of the search result. Drag the squares below the video player to get more context



Figure 4: Audio playback of the search result



Figure 5: Sound wave and spectrogram

 



Figure 6: Metadata about the informant

 

1.2.2 Extended Search
An extended search (see figure 7) provides you with more search options. You can search both on individual words and phrases, filtering your search by lemma, start or end of words, or at the beginning or end of a segment (segment initial, segment final). The phonetic option gives you the opportunity to search in the semi-phonetic part of the transcription only. Furthermore, you can do a search for word classes, morphological features or other tags.

 



Figure 7: Extended search

 

1.2.2.1 Search multiple words
If you click on the blue plus sign to the right of the search box (see figure 8). You can create as many search boxes as you like. Between the search boxes, you will find two boxes labeled min and max that you can use to define the minimum or maximum word limit you want between keywords. To remove a search box, click the grey minus sign on the right-hand side of the box.

In figure 8, there has been a search for the words for meg (for me). Note that there are 304 results presented over 7 pages. Click the arrows to navigate in the search results.


Figure 8: Search multiple words.

 

1.2.2.2 Search for Lemma, Start, End, Phonetic, Segment initial or Segment final
Below the search window there are six boxes that you can select by ticking: Lemma, Start, End, Phonetic, Segment initial or Segment final. If you tick the box Lemma, you get all the inflectional forms of a word as a result. If you search for the word bok (book), you get all the different forms (bok, boka, boken, bøker and bøkene) as a result if the words exist in the corpus. If you tick Start or End, you will get all the words that begin with the word or letters that are typed in the search box. A search for a bok where the Start box is ticked can give results like boksamlinga (book collection) or bokstav (letter). If the End box is ticked, the results may be words like a skrivebok (copy book) or notisbok (notebook).

By default, any text entered in the search box will be searched for in the orthographic transcription part of the corpus. The Norwegian and Övdalian transcriptions also have alternative transcriptions in addition to the orthographic ones and these transcriptions may be queried by ticking the Phonetic box. This may be helpful, for instance, if you are looking for specific phonetic variants in Norwegian like itte or ikkje, variants of the orthographic ikke (not).

The transcriptions in Nordic Dialect Corpus consist of segments, not sentences in a written language sense. The segments are separated from each other, not by punctuation, but with time codes that indicate where in the video or audio file the segment starts or stops. The segments will often match written language sentences, but since this is speech, there may also be incomplete sentences without subject and verbal.

If you select Segment initial, you specify that the search term must come first in a segment. Ticking the Segment final box means doing a search for the last word. Figure 9 shows a search for the word etterpå (afterwards) in Segment initial position.


Figure 9: Etterpå (afterwards) in Segment initial position

 

1.2.2.3 Search for word class or morphological features
In an extented search, you can search the word class by using the drop-down menu to the left in the search box, see figure 10. Clicking the button left of the arrow will open the box, as seen in figure 11. If you select a word class under Parts-of-speech, you also get access to the selections under Morphosyntactic features for the word class you have chosen. Your selections will appear as small blue boxes. Figure 10 shows a search for noun plural.
The other options in figure 11 are explained in the sections below.

Figure 10: Buttons for searching word class and other morphological features

 

Figure 11: Searching for word class and other morphological information

If you click on multiple words simultaneously, such as noun and pronoun, you will get all the words that are either nouns or pronouns. In the same way, you can click multiple values within a category, such as both feminine and masculine in the gender category for nouns to find words that are either feminine, masculine or both.


 1.2.2.4 Search other tags (words not in the dictionary, laughter, yawning etc.)

In the Description category of the search box described in section 1.2.2.3 above (figure 11) you can search for the following two tags:

X: The x-tag has been used in instances where lexical words are not found in the orthographic written standard.

O: The o-tag has been used in instances where, in the Norwegian and Övdalian transcriptions, grammatical words are not found in the orthographic standard, and are "translated" to their standard equivalent.

In the Non-lexical category of the search box you can apply the following tags:

back-click, breathing, coughing, draws breath, front-click, groaning, hawking, interruption, labial fricative, labial vibrant, laughing, laughter, onomatopoetic, sibilant, sighing, sniffing, spelled, sucking sound, unclear, whistling and yawning.

Most of the tags are sounds or non-lexical utterances like laughter and coughing, mostly annotated in the Norwegian transcriptions.

Interruption is applied whenever an informant interrupts himself / herself or is interrupted by the interviewer or another informant.

Spelled refers to words that are spelled out by the speaker.

Unclear is applied for words that are unclear or otherwise uncertain for the transcribers that worked with the transcriptions.

The X and O tags and the non-lexical tags are associated with one or more words. If you tick one or more of the tags, you will get the words that are linked to them as a result.

In figure 12 it has been searched for X, words that are not found in dictionaries with written standard. The words hostility and hospitality are among the results, as they are English words. Note that the dialect word sjø, a modal particle from trøndelag in Norway, is tagged as X.


Figure 12: Search for words not in the dictionary

1.2.2.5 Specify or exclude lemma or wordform
At the bottom of the morphological search box in figure 11 there is a field where you can further specify your search (box labeled Specify word form). So if you want to search for verb in the morphological search box, but only want auxiliaries, you can simply select Specify lemma and add the auxiliaries one by one in the Specify word form box, pressing OK after each word.

If you have chosen verb, but do not wish to include the auxiliaries, you can do what is described above, except you use the arrow in the Specify word form box to select a different option from the drop down menu. Choose Exclude wordform or Exclude lemma.

If you have searched for an ortographic form in the main search box, you can specify the phonetic form(s) that you like to see here. Choose Specify phonetic form and add the phonetic forms in the box one by one, pressing OK for each form. You can also choose the other way around, by clicking the phonetic box beneath the main search window and searching for the phonetic form there. In the specify menu you choose Specify word form and write the orthographic form in the field to the right.

NB! Remember to click OK once you have entered a word into the box! Words that are excluded will then appear on the right side in red with an exclamation point in front, see figure 13 and figure 15. Words that are specified, will appear in blue.

Figure 13: Specify or exclude lemma and word form.


1.2.3 CQP Search Expression (CQP Query)
CQP queries can be used for advanced searches that are not possible in a simple or extended search. To use this option, you will need to be familiar with the CQP query language. If you need help with an advanced search, you can contact the Text Laboratory. Figure 14 shows an example of how a search for the words siden av (side of) followed by noun or pronoun displays in either Extended search or CQP Query. If you have used the options in Extended search and wonder what this search looks like in the CQP query language, then click CQP query and you will get the search expression, as seen in figure 14.

 

Figure 14: An example of the same search in Extended search and CQP query.

 

1.2.4 Or search
Clicking on the Or box will open a new search window below the other. Searches this box provides an or search. That is, you search the word in the main box or in the Or box. You can create as many Or boxes as you want. You delete them by clicking the red cross to the left of the box.

Figure 15 shows an advanced search for words in past tense ending with either -a or-et. The verbs sa (said), ga (gave) and het (was called) are excluded.


Figure 15: Or search.


1.3 Metadata Search and Show speakers
As shown earlier in figure 1, all the meta-categories in the search form are listed in a column to the left in the search form. In Nordic Dialect Corpus, these categories are as follows: Informant code, Recording year, Birth year, Gender, Age, Age group, Place, Area, Region, Country, Genre.

1.3.1 Metadata Search
Clicking on one of the links in the left menu will bring up the different values in each category. You can click and select one or more, and the results will be displayed in a box below the category. If you click on the red cross, the choice will be reset. Figure 16 shows how the metadata menu looks when you have clicked the Country category.



Figure 16: The metadata menu where the chosen category is Country

The choice you make restricts the possibilities you have for further searches. For example, if you choose Nord-Norge in the region category, you will only be able to select values associated with informants from Nord-Norge. Consequently, you will not get Alvdal as an option in the Area category because Alvdal is in region Østlandet. In figure 17, Nord-Norge and Bodø are ticked under the categories Region and Place.


Figure 17: Nord-Norge and Bodø are chosen under the categories Region and Place.

Above the metadata category menu, there is a counter that shows you how many informants (speakers) you have chosen and how many tokens your selection consists of. In this version, Nordic Dialect Corpus contains 874 informants and 3 113 388 tokens (words and punctuation marks) as shown in the figures above. When only informants who come from Bodø is selected, the selection will be confined to 4 informants and 15 042 tokens as figure 17 shows.

1.3.2 Show speakers
If you want to see a summary of the informants you have chosen, click the Show speakers button below the search box next to the Or button, see figure 15. The result will be presented as in figure 18 based on the selection from figure 17.


Figure 18: The Show speakers box

 

1.3.3 Metadata categories

The various metadata categories are briefly described below.

Informant code:
Each informant are given a unique code. The format differ from country to country.

Geographic Location: place, area, region and country.
The category geographic location is divided into four subcategories: country, region, area and place. These subcategories are used differently for the different languages. In table 1 below, the use of the region, area and place subcategories in the different languages are summarized.

Subcategory Norwegian Swedish Danish Faroese Icelandic
region region (landsdel) region (landsdel) not used region (sýsla) or island region
area county (fylke) province (landskap) not used island or part of island region
place municipality (kommune) or urban area municipality (kommun) or urban area island - but also some Jutish urban areas and regions + Copenhagen municipality (kommuna) or urban area municipality or urban area

Table 1: Geographic Variables

A map of recording locations is shown here.

Age and Sex: birth year, age, age group, gender:
The age category is divided into two variables, age group <40 and age group >40. Age group <40 consists of young people, mainly between 15 and 30 years of age, while age group >40 consists of adults and elderly people, most of them over 50 years of age. You can read more about the distinction on the NorDiaSyn homepage. The gender category naturally has two variables: female (F) and male (M).

Recording: recording year and genre:
The recording years span from 1951 to 2012. The recordings from 1951 to 1984 are old Norwegian recordings from Målførearkivet (Oslo Old Dialect Archive), which are meant to complement the modern recordings and give the possibility for longitudinal studies. The modern ScanDiaSyn recordings are from 1998 to 2012. Exact year spans for each language are given in the list below:

The genre category has two main variables: interview and conversation. The latter has been divided into several subgenres, depending on the relationship between the informants. You can read more about the distinction between interview and conversation on the NorDiaSyn homepage. The following is a complete list of genres used in the corpus:

 

1.4 Random selection of search results
If you have a search that will give numerous hits, you can choose to see only a certain number of random selections. Specify the number of hits in the box next to Show texts (see figure 19).

If you want to reproduce this specific result later, select a number and insert it into the with seed box. In figure 19, there has been done a search for all nouns in Danish in the corpus with 200 randomly selected hits showing at a time. Here the number 5 has been put in the with seed box. Each time you do the same search and type the same number into the box, you get the same random selection of search results. If you type another number, you will get another random selection.

It is possible to select a random selection of search results for searches that are Extended searches or CQP queries.

Figure 19: Checkbox to get a randomized selection of search results. Here shown with 200 random results.


1.5 Statistics
The Concordance search result view is the default setting from which all the previous examples are taken. If you select Statistics, as shown in figure 20 below, you can see different frequency counts and statistics. Currently, the boxes above theUpdate stats button are the ones that can be selected. Click on what you want to see and press Update stats. Figure 20 displays frequencies from the search results shown in figure 15, which showed a search for verbs in past tense that end with either -a or -et in Norwegian. The verbs sa (said), ga (gave) and het (was called) are excluded. The frequencies are listed in the left-hand column, and the word forms are listed to right together with the phonetic form.


Figure 20: Statistical display for the search results shown in figure 15.

 

1.6 Geographical map
The Concordance view is the pre-selected search result and all the above examples are taken from here. If you select Map, as shown in figure 21 below, you can get an overview of the distribution of a word and its variants in the corpus. For the Norwegian transcriptions that are transcribed both orthographically and semi-phonetically, this is a good way to look for phonetic variation. You can zoom in and out of the map to get a better overview.

In figure 21 you can see the distribution of variants for the Norwegian word ikke (not), which are shown on the map as red dots.

By clicking on a colour and then on a variant, you can also see the variation of the chosen variant on the map. In figure 21 below, five of the variants are marked with the color yellow. To remove the color, click the box once or twice until the box color returns to grey.


Figure 21: Search result shown as a map for the Norwegian word ikke (not). The variants ennte, ente, nnte, nte and rrt are marked with yellow.

By hovering the mouse over a variant, as shown in figure 22, you get a window with details about the locations where the variant is used, the number of times it is used in each location, and the number of hits for the variant total. The number of hits for all variants combined can be found in the upper right-hand corner under Found, see figure 21.

Figure 22: The distribution of a word's variants and the number recorded at each location.

 

1.7 Download data
If you click on the Download button above the search results (see figure 2), a dialoge box will pop up where you can choose multiple download formats such as Excel file, tab-separated text file or comma-separated text file. You can also choose which informartion to download (see figure 23).

Figure 23: Download options box

 

1.8 Sort the search results
The search results can be sorted in various ways as shown in figure 24. If you want to sort by the search word, select Sort by match. You can also sort by the word directly to the left (sort by immediate left context) or the word directly to the right (sort by immediate right context). Note that punctuation marks are alphabetized before a and b, etc.


Figure 24: Search results can be sorted in different ways.


Figure 25: Punctuation marks or symbols are alphabetized before a and b, etc.

 

Go to the NDC homepage