User Manual for LIA Sápmi - Sámegiela hállangiellakorpus

Go to homepage

This user manual is written by Kristin Hagen and Ingvild Røsok. The manual is based on The Nordic Dialect Corpus - Search Interface Documentation written by Eirik Olsen.

 

1. LIA Sápmi - Sámegiela hállangiellakorpus

This first version of LIA Sápmi - the corpus of Sami dialects - consists of interviews and conversations with 56 informants. The audio recordings range from the 1960s up to around 1990.

On this page:

1.1 Transcriptions in LIA Sápmi

1.2 Main search page of LIA Sápmi

1.2.1 Simple search and examples of results

1.2.2 Extended search

1.2.2.1 Search multiple words
1.2.2.2 Search Lemma, Start, End, Segment initial or Segment final
1.2.2.3 Search for word class or morphological features
1.2.2.4 Search different tags (laughter, words not in the dictionary, etc.)
1.2.2.5 Specify or exlude Lemma or Word form

1.2.3 CQP Search Expression (CQP query)

1.2.4 'Or' search

1.3 Metadata search and Show speakers

1.4 Random selection of search results

1.5 Geographical Map

1.6 Statistics

1.7 Download data

1.8 Sort the search results


1.1 Transcriptions in LIA Sápmi
The recordings of LIA Sápmi have been transcribed orthographicly using the transciption program Elan. The recordings are trancribed word for word without altering the word order. There are two transcription manuals for LIA Sápmi, one in Norwegian with general rules for transcription and one in Sami with details. See here:

Transkripsjonsrettleiing for LIA - samisk
Davvisámegiela transkripšuvdna ortografiija mielde - LIA

Giellatekno has tagged the trancriptions by word class.

1.2 Main search page of LIA Sápmi
Figure 1 shows the main search page of LIA Sápmi:





Figure 1: Main search page of LIA Sápmi

 

To the left in the green frame are all the searchable metadata categories. In LIA Sápmi, these categories are a selection of different features relating to the informants including, Informant code, Alias, Agegroup, Age, Born (year), Gender, Place, Báiki, District 1960, District 1978, Gielda-Suohkan 1960, Gielda-Suohkan 1978, County, Fylka, Occupation, Recorded (year) and Interviewer.

The number of selected informants (including the number of tokens) are indicated above the metadata categories. The Show speakers button gives you an overview of all the informants or the selection of informants you have chosen. Read more in section 1.3.

At the top of the page, there are two buttons. The Hide filters button hides the metadata tabs to the left, while the Reset form button gives you a blank search page.

The rest of the search page is about the searched keyword(s) or its properties. You can read more below.

1.2.1Simple search and examples of results
In Simple search you can search for individual words and phrases in the search field. The results are shown as a concordance (see figure 2). The number of matches can be seen above the search results on the right. There are 50 search results presented per page. If there are more than 50, they will be presented across multiple pages, which you can access by clicking on the arrows (see figure 12).

Above the search results you will find buttons for sorting and downloading. For more information, see sections 1.7. and 1.8. The Concordance search result views the pre-selected view, but you can also get different statistical views of the search result (see section 1.6), or view the results in a geographical map, see section 1.5.

If you hover your mouse over a word from the search result, a small window will pop up with information about lemma, word class, other morphological information and tags (see figure 3). Read more about word class and tags in sections 1.2.2.3and 1.2.2.4.

In the left-hand column of the search results there are two icons (see figure 2). Click the audio icon (first icon) to access audio of a segment (see figure 4). Within the audio media player, more context can be accessed by moving the square buttons below the box to the left and / or the right. Click on the second icon to view a sound wave and spectrogram for the search result (see figure 5). Click on the informant number to view the metadata about the informant (see figure 6).

 


 

Figure 2: Search results for individual words in the corpus

 

 

 

Figure 3: If you hover the mouse over a word from the search result, a small window will be displayed with information about word class, other morphological information and tags




Figure 4: Audio playback of the search result

 


Skjermbilde 2016-12-14 kl. 13.31.29.png

Figure 5: Sound wave and spectrogram

 



 

Figure 6: Metadata about the informant

 

1.2.2 Extended search
An extended search (see figure 7) provides more search options. You can search individual words and phrases, filtering your search by lemma, start or end of words, or the beginning or end of a segment (Segment initial / Segment final). Furthermore, you can do a search on word classes, morphological features or other tags.

 



Figure 7: Extended search

 

1.2.2.1 Search multiple words
If you click on the blue plus sign to the right of the search box (see figure 8), a second search box will pop up. You can create as many search boxes as you like. You can define the limit of how many words can be between the search keywords using the min and max boxes situated between the search boxes. To remove a search box, click the grey minus sign on the right side of the box.
Figure 8 shows a search for the phrase birra jagi ('year around'). There are 7 matches presented over 1 page. When a search results in more than one page, you can click the arrows to navigate the search results (see figure 12).


Figure 8: Search multiple words

 

1.2.2.2 Search Lemma, Start, End, Segment initial or Segment final
Below the search window there are five boxes that you can select by ticking off for Start, End, Segment initial or Segment final. If you tick the box Lemma, you get all the inflectional forms of a word as a result. If you search for the word jahki (year), you get all the different forms (grammatical number, case, etc.) as a result, if the words exist in the corpus. If you tick Start or End, you will get all the words that begin with the word or letters that are typed in the search box. So a search for jahki where End is ticked off, can result in words like válgajahki (election year) or čuohtejagi (century).

The transcriptions in LIA Sápmi consist of segments, not sentences in a written language sense. The segments are separated from each other, not by punctuation, but with time codes that indicate where in the audio file the segment starts or stops. The segments will often match written (language) sentences, but since this is speech, there may also be incomplete sentences without subject and verbal.

If you select Segment initial, you specify that the search term must come first in a segment. Ticking the Segment final box specifies that you want the search term to come last. Figure 9 shows a search for the word dalle (then) in Segment initial position.


Figure 9: Dalle in Segment initial postion

 

1.2.2.3 Search for word class or morphological features
In an extended search, you can search for word class by using the drop-down menu to the left of the search box (see figure 10). Clicking the button to the left of the arrow opens the box, as seen in figure 11. If you select a word class under Parts-of-speech, you also get access to the options under Morphosyntactic features for the word class you have chosen. Your selections will appear as small blue boxes. Figure 10 shows a search for noun plural.
The other options in figure 11 are explained in the sections below.


Figure 10: Buttons for word class searches and other morphological features


Figure 11: Search for word class and other morphological information

 

If you click on multiple word classes simultaneously, such as both noun and pronoun, you will get all the words that are either nouns or pronouns. Correspondingly, you can click multiple values within a category, such as both feminine and masculine in the gender category for nouns to find words that are either feminine, masculine or both.

 1.2.2.4 Search different tags (laughter, words not in the dictionary, etc.)
Under the categories description and non-lexical, described in 1.2.2.3 above (figure 11), you can search for tags that either describe a word or that are independent events in speech, suck as coughing or laughing for instance.

In the Description category of the search box you can search for the following tags:

Dialect: Used in segments where the primary language is Sami, but has instances of dialectal words or words that are not found in Sami dictionaries.
Compound with loan word: Used in instances where there is a compound word consisting of Sami / other language.
Loan word: Word (or segment) which is a loan from a Nordic language.
Derivation: Word tagged as a derivated word form
Unclear:
Word that is unclear or otherwise ambiguous to the transcribers who worked with the transcriptions.
Whispering/laughter/yawning/groaning: Words that are whispered, or uttered while the informant is laughing, yawning or groaning.

If you tick one or more of the tags above, you will get the words that are linked to them as a result.


In the Non-lexical category of the search box you can apply the following tags:

Hawking (coughing), laughter, onomatopoetic, groaning and unclear.

These are independent events within the speech that are sounds or non-lexical utterances, such as laughter or groaning.
The unclear tag marks, in this instance, a word or segment that is incomprehensible to the transcriber.

In figure 12 you see an example of a search using the Loan word tag. The words nei (no) and kilometer are among the results.



Figure 12: Search for nordic loan words

 

1.2.2.5 Specify or exclude Lemma or Word form
At the bottom of the morphological search box in figure 11 there is a field where you can further specify your search (box labeled Specify word form). So if you want to search for verb, but only want auxiliaries, you can simply select Specify lemma and add the auxiliaries one by one into the Specify word form box on the right, clicking OK after each word.

If you have chosen verb, but do not want the auxiliaries, you can follow the same procedure as above, but instead select Exclude word form or Exclude lemma.

NB! Remember to click the OK button when you have added a word. Words that are excluded will appear in red with an exclamation point to the right of the box, see figure 13. Words that are specified, appears in blue, see figure 14.


Figure 13: Specify or exlude Lemma or Word form.


1.2.3 CQP Search Expression (CQP query)
CQP queries can be used for advanced searches that are not possible in single or extended searches. To use this option, you will need to be familiar with the CQP query language. If you need help with an advanced search, you can contact the Text Laboratory. Figure 14 shows an example of how searches for the words birra jagi ('year around') followed by adverb or verb appear in either Extended search and CQP query. If you have used the options in Extended search and wonder how this search looks in the CQP search language, click CQP query to get the search expression as seen in figure 14.




Figure 14: Example of same search in Extended search and CQP query

 

1.2.4 Or search
Clicking on the Or box will open a new search window below the original one. Searches in this box provides an or search. That is, you search the word in the main box or in the Or box. You can create as many Or boxes as you want. You delete them by clicking the red cross sign to the left of the box.

Figure 15 shows a complex search for verbs in past tense that end with either -it or -at. The verbs boahtit (come), mannat (go) and hállat (talk) are excluded.


Figure 15: 'Or' search


1.3 Metadata search and Show speakers
To the left in the search form are all the metacategories. In LIA Sápmi, these categories are as follows: Informant code, Alias, Agegroup, Age, Born (year), Gender, Place, Báiki, District 1960, District 1978, Gielda-Suohkan 1960, Gielda-Suohkan 1978, County, Fylka, Occupation, Recorded (year) and Interviewer.

Clicking on one of the links will bring up different values in each category. You can click and select one or more, and the results of your choice will be displayed in a box below the category. If you click on the red cross sign, the choice will be reset. Figure 16 shows what the metadata menu looks like when you click on the Agegroup category.



Figure 16: Metadata menu, where the chosen category is Agegroup.


The choice you make restricts the options for further searches. For example, if you chose M in the Gender category, you will only be able to select values that are associated with the male informants. Therefore, you will not get Bekkarfjord under Place because there are only female informants from this area in the corpus. In figure 17, Bekkarfjord has been selected under Place.


Figure 17: Bekkarfjord is selected under Place.

 

Above the metadata category menu, there is a counter that shows you how many informants you have chosen, and how many words the selection consists of at any time. In this version, LIA Sápmi contains 55 informants and 62 656 tokens (words and punctuations), as shown in the figure above (figure 16). When only informants from Bekkarfjord are chosen, the selection is limited to 1 speakers and 1284 tokens as shown in figure 17.

If you want to see an overview of the informants you have selected, click the Show speakers button next to the Or button (see figure 15). The result will be displayed as seen in figure 18 based on the selection from figure 17.


Figure 18: The Show speakers window.

 

The different metadata categories are briefly described below:

Informant code: Each informant has been given a code instead of their real name. The informant code consists of a place name, acronym for the university who own the recordings, number of the audiofile and two digits that represent the informant. If these two digits are more than 01, this indicates that there is more than one informant in that recording. You can search one or multiple informant codes.

Alias: If the informant is on two or more different recordings, you can see this by choosing an informant code and see what the alias(es) might be.

Gender and age: Indicates the gender and age of the informant

Agegroup: Indicates the age group of the informant, sorted by decade

Born (year): Provides the birth year

Place: Provides the place the informant is from. This does not make up the place name in the informant code.

District 1960: Indicates the Norwegian name of informant's home district in 1960 (In Norwegian: kommune).

District 1978: Indicates the Norwegian name of informant's hometown in 1978 (In Norwegian: kommune).

Gielda-Suohkan 1960: Indicates the Sami name of the informant's home district in 1960.

Gielda-Suohkan 1978: Indicates the Sami name of the informant's home district in 1978.

County: Indicates the Norwegian name of the county the informant is from

Fylka: Indicates the Sami name of county the informant is from

Occupation: Provides the informant's occupation

Recorded (year): Provides the year the audio was recorded.

Interviewer: Provides the name of the interviewer

 

1.4 Random selection of search results
If you have a search that will result in many hits, you can choose to see only a certain number of randomly selected hits. Specify the number of hits in the box next to Show speakers (see figure 19).

If you want to reproduce this specific result later, select a number and insert it into the box called with seed. In figure 19 the search is extended to all nouns in the corpus, with a display of 200 randomly selected hits, and the number 5 in the with seed box. Each time you do the same search and type the same number in the box, you get the same random selection of search results. If you type another number, you will get another random selection.

It is possible to select a random selection of search results for searches that are Extended searches or CQP queries.


Figure 19: Checkbox that gives a random selection of search results. Here shown with 200 possible results

 

1.5 Geographical map
The map function lets you see your hits on a map. In corpora where there are both orthographic and phonetic transcriptions, this map can be used to get an overview of the distribution of a word and its variants in the corpus. In LIA Sápmi, the map can be used as a visual aid to get a geographical overview of the informants and the words they use. You can zoom in and out to get your preferred overview.

In figure 20, there has been a search for the word mun (I). The red dots on the map mark the locations of informants who have used this word.


Figure 20: The red dots mark the locations of informants using the search word.

By hovering the mouse over a variant, as shown in figure 21, you get a window with details about the locations where the variant is used, the number of times it is used in each location, and the number of hits for all variants.


Figure 21: Shows the distribution of a word and the number recorded at each location

 

By using the metadata categories to the left, you can limit the result to informants with certain attributes. In figure 22 below, the result has been limited to male informants in recordings made in 1976.



Figure 22: Use the metadata categories to limit the overview of the result.

You can also use the map function to view the results of complex searches. By clicking on a color and then on a variant, you can see the different locations of informants using the word on the map. In figure 23 below, there has been a search for verbs ending with -a. Three of the results are marked with the colors red, yellow and green. To remove the color, click the box once or twice until the box color returns to grey.


Figure 23: An overview of the locations where the marked words have been used.

1.6 Statistics
The Concordance search result view is the default setting from which all the previous examples are taken. If you select Statistics as shown in figure 24 below, you can see different frequency counts and statistics. Currently, the boxes above the Update stats button are the ones that can be selected. Click on what you want to see and press Update stats. Figure 24 displays frequencies from the search results shown in figure 15, which is searches for verbs in past tense that end with either -it or -at. The verbs boahtit (come), mannat (go) and hállat (talk) are excluded. The frequencies are listed in the left-hand column, and the word forms are listed to the right.


Figure 24: Statistical display for the results shown in figure 15


1.7 Download data
If you click on the Download button above the search results (see figure 2), a dialogue box will pop up where you can choose multiple download formats, such as: Excel, tab-separated or comma-separated text file. You can also choose which information to download (see figure 25).



Figure 25: Download options window

1.8 Sort the search results
The search results can be sorted in various ways as shown in figur 26. If you want to sort by search word, select Sort by match. You can also sort by the word directly to the left (Sort by immediate left context) or the word directly to the right (Sort by immediate right context). Note that punctuation marks or symbols are alphabetized before a and b, etc (see figur 27).


Figure 26: Search results can be sorted in different ways


Figure 27: Punctuation marks or symbols are alphabetized before a and b. Note that _DC_ indicates a short break.