User Manual for Ethiopian Speech Corpora

1. Ethiopian Speech Corpora

The NORHED project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia has so far produced six small speech corpora.

The corpora are all accessible through the corpus search tool Glossa, developed at the Text Laboratory, University of Oslo.

In this user manual we describe how to use the speech corpora, using the Muher Speech Corpus as an example. All the speech corpora have the same design and search possibilities, except for small differences in the metadata menu on the left.

On this page:

1.1 Transcriptions in the Ethiopian corpora

1.2 Main search page

1.2.1 Simple search and examples of results

1.2.2 Extended search

1.2.2.1 Search multiple words
1.2.2.2 Search Start, End, Segment initial or Segment final
1.2.2.3 Specify or exclude Word form

1.2.3 CQP Search Expression (CQP query)

1.2.4 'Or' search

1.3 Metadata search and Show speakers

1.4 Random selection of search results

1.5 Statistics

1.6 Download data

1.7 Sort the search results

1.1 Transcriptions in the Ethiopian corpora
The recordings have been transcribed with the transcription program ELAN. They are transcribed word for word, without altering the word order. The Amharic, Hadiyya and Oromo speech corpora are orthographically transcribed, Amharic in the Fidäl script.

The recordings in the Gumer, Hamar and Muher speech corpora are transcribed in IPA.

1.2 Main search page
Figure 1 shows the main search page of Muher:

Figure 1: Main search page of the Muher Speech Corpus

On the left, in the green frame, are all the searchable metadata categories. In the Muher Speech Corpus, these categories cover features of the speakers (Informant code, Gender, Age, Place and Informant languages) as well as the Genre of the recording. Please note that the selection of metadata can differ from corpus to corpus.

The number of selected speakers and the corresponding number of tokens are shown above the metadata categories. The Show speakers button gives you an overview of all the speakers, or of the selection of speakers you have chosen. Read more in section 1.3.

At the top of the page, there are two buttons. The Hide filters button hides the metadata categories on the left, while the Reset form button gives you a blank search page.

The rest of the search page concerns the search term(s) and their properties, which are described in the sections below.

1.2.1 Simple search and examples of results
In Simple search you can search for individual words and phrases in the search field. The results are shown as a concordance (see figure 2). The number of matches can be seen above the search results on the right. There are 50 search results presented per page. If there are more than 50, they will be presented across multiple pages, which you can access by clicking on the arrows.

Above the search results you will find buttons for sorting and downloading. For more information, see sections 1.6 and 1.7. The Concordance view is the default view of the search results, but you can also get different statistical views (see section 1.5).

In the left-hand column of the search results there are two or three icons (see figure 2). If video is available, click the video icon (the first icon) to access the video of a segment (see figure 3). All results have an audio icon. Click this icon to access the audio of a segment (see figure 4). Within the video and audio media players, more context can be accessed by dragging the square buttons below the player to the left and/or the right.

Click on the last icon to view a sound wave and spectrogram for the search result (see figure 5). Click on the informant number to view the metadata about the speaker (see figure 6).

Figure 2: Search results for individual words in the corpus

Figure 3: Video view of the search result. Drag the squares below the video player to get more context

Figure 4: Audio playback of the search result

Figure 5: Sound wave and spectrogram

Figure 6: Metadata about the speaker

1.2.2 Extended search
An extended search (see figure 7) provides more search options. You can search individual words and phrases, filtering your search by start or end of words, or the beginning or end of a segment (Segment initial / Segment final).

Figure 7: Extended search

1.2.2.1 Search multiple words
If you click on the blue plus sign to the right of the search box (see figure 8), a second search box will pop up. You can create as many search boxes as you like. You can define the limit of how many words can be between the search keywords using the min and max boxes situated between the search boxes. To remove a search box, click the grey minus sign on the right side of the box.

Figure 8 shows a search for the words bəgərəd bet ('in-the-young-girls house/family'). There are 5 matches presented over 1 page. When a search results in more than one page, you can click the arrows to navigate the search results (see figure 2).

Figure 8: Search multiple words
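
The min and max boxes can be read as bounds on the number of words allowed between two search words. The Python sketch below illustrates that idea on a plain list of tokens. It is not Glossa's implementation: the function name find_with_gap is hypothetical, and the tokens "x" and "y" are placeholders rather than corpus data.

```python
def find_with_gap(tokens, first, second, min_gap=0, max_gap=0):
    """Return (i, j) index pairs where `first` is at position i, `second`
    is at position j, and min_gap..max_gap tokens lie between them."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # j - i - 1 is the number of tokens strictly between the two words.
        start = i + 1 + min_gap
        stop = min(i + 2 + max_gap, len(tokens))
        for j in range(start, stop):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


# "x" and "y" are placeholder tokens, not attested Muher words.
tokens = ["bəgərəd", "bet", "x", "bəgərəd", "y", "bet"]

adjacent = find_with_gap(tokens, "bəgərəd", "bet")
up_to_one = find_with_gap(tokens, "bəgərəd", "bet", min_gap=0, max_gap=1)
print(adjacent)
print(up_to_one)
```

With both bounds at 0 only directly adjacent words match; raising max to 1 also admits pairs with one word in between.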

1.2.2.2 Search Start, End, Segment initial or Segment final
Below the search box there are four checkboxes: Start, End, Segment initial and Segment final. If you tick Start, you will get all the words that begin with the word or letters typed in the search box; if you tick End, you will get all the words that end with them. A search for bet with Start ticked can therefore return words like betwe, betən, betəsəb, betɨm, betə, betwɨta and betɨn.

The transcriptions in the Ethiopian speech corpora consist of segments, not sentences in the written-language sense. The segments are separated from each other not by punctuation, but by time codes that indicate where in the audio file each segment starts and stops. The segments will often match written-language sentences, but since this is speech, there may also be incomplete sentences that lack a subject or a verb.
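
Conceptually, Start and End behave like prefix and suffix matching on word forms. The following Python sketch mimics that behaviour on the word forms listed above; it only illustrates the idea and is not how Glossa is implemented. The helper names match_start and match_end are hypothetical.

```python
# Word forms taken from the example above, plus the bare form "bet".
words = ["bet", "betwe", "betən", "betəsəb", "betɨm", "betə", "betwɨta", "betɨn"]


def match_start(term, candidates):
    """Keep the words that begin with `term` (like ticking Start)."""
    return [w for w in candidates if w.startswith(term)]


def match_end(term, candidates):
    """Keep the words that end with `term` (like ticking End)."""
    return [w for w in candidates if w.endswith(term)]


print(match_start("betw", words))
print(match_end("ən", words))
```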

If you select Segment initial, you specify that the search term must come first in a segment. Ticking the Segment final box specifies that you want the search term to come last. Figure 9 shows a search for the word bet ('house') in Segment initial position.

Figure 9: Bet ('house') in Segment initial position

1.2.2.3 Specify or exclude Word form
If you click on the menu icon to the left in the search box (see figure 9), you will find a window where you can further specify your search (the box labeled Specify word form). Specify word form is perhaps not very useful in a corpus without morphological tags, but Exclude word form can be used as follows: if you have ticked, for example, Start or End and searched for a specific combination of characters, you can choose Exclude word form to exclude frequent words that are not of interest to you from the result.

NB! Remember to click the OK button when you have added a word. Words that are excluded will appear in red with an exclamation point to the right of the box, see figure 10.

Figure 10: Specify or exclude Lemma or Word form.

1.2.3 CQP Search Expression (CQP query)
CQP queries can be used for advanced searches that are not possible in Simple or Extended search. To use this option, you need to be familiar with the CQP query language. If you need help with an advanced search, you can contact the Text Laboratory. Figure 11 shows how a search for the words bəgərəd bet ('in-the-young-girls house/family') followed by a word beginning with j appears in both Extended search and CQP query. If you have used the options in Extended search and wonder how the search looks in the CQP query language, click CQP query to get the search expression, as seen in figure 11.

Figure 11: Example of same search in Extended search and CQP query

Figure 12 shows a search for [word="ə[bdg]+ə"], which matches words that consist of ə, followed by one or more of the consonants b, d or g, and then ə again.

Figure 12: Example of a CQP query search for [word="ə[bdg]+ə"]

In figure 13 you can see a search for the consonant k followed by ' and one unspecified character, then the consonant n. The character combination may be followed by yet another unspecified character: [word="k'.n.?"].

Figure 13: Example of a CQP query search for [word="k'.n.?"]
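
The two regular expressions in figures 12 and 13 can be tried out offline. The sketch below uses Python's re module as a stand-in, on the assumption that CQP compares a pattern with the whole token, which re.fullmatch approximates. The helper name matches_token is hypothetical, and the test strings are made-up character sequences, not attested corpus words.

```python
import re


def matches_token(pattern, token):
    """Return True if `pattern` matches the entire token."""
    return re.fullmatch(pattern, token) is not None


# Pattern from figure 12: ə, one or more of b/d/g, then ə.
for token in ["əbə", "əbdə", "əkə", "əbət"]:
    print(token, matches_token("ə[bdg]+ə", token))

# Pattern from figure 13: k, ', any character, n, optionally one more character.
for token in ["k'ən", "k'əna", "k'n", "k'ənat"]:
    print(token, matches_token("k'.n.?", token))
```

Regular expression dialects differ in detail, so results for more complex patterns should be checked in Glossa itself.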

1.2.4 'Or' search
Clicking on the Or box opens a new search box below the original one. Searches in this box provide an 'or' search. That is, you search for the word in the main box or the word in the Or box. You can create as many Or boxes as you want. You delete them by clicking the red cross to the left of the box.

Figure 14 shows a search for words starting with bet- or words ending with -bet.

Figure 14: 'Or' search

1.3 Metadata search and Show speakers
On the left of the search form are all the metadata categories. In the Muher Speech Corpus, these categories are as follows: Informant code, Gender, Age, Place, Informant languages and Genre.

Clicking on one of the links will bring up different values in each category. You can click and select one or more, and the results of your choice will be displayed in a box below the category. If you click on the red cross sign, the choice will be reset. Figure 15 shows what the metadata menu looks like when you click on the Age category.

Figure 15: Metadata menu, where the chosen category is Age.

The choice you make restricts the options for further searches. For example, if you choose Wolkite in the Place category, you will only be able to select values that are associated with the Wolkite speakers. You will therefore not get 70 as an option under Age, as the only speakers from Wolkite in the corpus are in their thirties. In figure 16, Wolkite has been selected under Place.

Figure 16: Wolkite is selected under Place.

Above the metadata category menu, there is a counter that shows how many speakers you have chosen and how many tokens the current selection contains. In this version, the Muher Speech Corpus contains 8 speakers and 40 352 tokens (words and punctuation marks), as shown in figure 15. When only speakers from Wolkite are chosen, the selection is limited to 5 speakers and 34 024 tokens, as shown in figure 16.

If you want to see an overview of the speakers you have selected, click the Show speakers button next to the Or button (see figure 14). The result will be displayed as seen in figure 17 based on the selection from figure 16.

Figure 17: The Show speakers window.

The different metadata categories are briefly described below:

Informant code: Each speaker has been given a code instead of their real name. In the Muher Speech Corpus the informant code consists of the first letters of the language name (mu), followed by an underscore (_), the place name, another underscore and a number.

Gender and age: Indicates the gender and age of the speaker

Place: Provides the place the speaker comes from

Informant languages: Which language(s) the speakers can speak

Name of eaf file: The original name of the transcription file

Genre: How the recording can be categorised: conversation, dialog/narration and tales

Description: More detailed information about the topics in the recordings
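
If you work with downloaded results, the informant code described above can be split into its parts. The Python sketch below assumes the pattern language_place_number; the example code mu_wolkite_01 is hypothetical (the exact spelling and number format used in the corpus may differ), and parse_informant_code is an illustrative helper, not part of Glossa.

```python
import re

# Assumed shape: lowercase language prefix, underscore, place name, underscore, digits.
CODE_RE = re.compile(r"(?P<language>[a-z]+)_(?P<place>.+)_(?P<number>\d+)")


def parse_informant_code(code):
    """Split an informant code into language prefix, place name and number."""
    m = CODE_RE.fullmatch(code)
    if m is None:
        raise ValueError(f"unexpected informant code: {code!r}")
    return {
        "language": m["language"],
        "place": m["place"],
        "number": int(m["number"]),
    }


# Hypothetical code, for illustration only.
print(parse_informant_code("mu_wolkite_01"))
```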

1.4 Random selection of search results
If you have a search that will result in many hits, you can choose to see only a certain number of randomly selected hits. Specify the number of hits in the box next to Show speakers (see figure 18).

If you want to reproduce this specific result later, choose a number and enter it in the box called with seed. In figure 18 the search covers all words in the corpus starting with m, with a display of 200 randomly selected hits and the number 5 in the with seed box. Each time you do the same search and type the same number in the box, you get the same random selection of search results. If you type another number, you will get another random selection.

Random selection of search results is available for Extended searches and CQP queries.

Figure 18: Checkbox that gives a random selection of search results. Here shown with 200 possible results
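
The role of the seed can be illustrated with a seeded random number generator: the same seed always yields the same sample. The Python sketch below shows the general principle only. How Glossa draws its sample internally is not documented here, and the function name sample_hits and the numbered stand-in hits are assumptions.

```python
import random


def sample_hits(hits, n, seed):
    """Draw up to n hits at random; the same seed gives the same selection."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return rng.sample(hits, min(n, len(hits)))


# Stand-in for a large result list: hits are simply numbered 0..999.
all_hits = list(range(1000))

first_run = sample_hits(all_hits, 200, seed=5)
second_run = sample_hits(all_hits, 200, seed=5)
other_seed = sample_hits(all_hits, 200, seed=6)

print(first_run == second_run)
print(first_run == other_seed)
```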

1.5 Statistics
The Concordance search result view is the default setting, from which all the previous examples are taken. If you select Statistics, as shown in figure 19 below, you can see different frequency counts and statistics. Currently, only the boxes above the Update stats button can be selected. Click on what you want to see and press Update stats. Figure 19 displays frequencies for the search results shown in figure 14, a search for words starting with bet- or words ending with -bet. The frequencies are listed in the left-hand column, and the word forms are listed to the right.

You can download the result in three different formats: Excel, tab-separated and comma-separated.

Figure 19: Statistical display for the results shown in figure 14
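
A frequency list like the one in the Statistics view can also be computed from a list of matched word forms. The Python sketch below shows one way to do this with collections.Counter. The input list and its counts are invented for illustration and do not reflect actual corpus frequencies; frequency_list is a hypothetical helper.

```python
from collections import Counter


def frequency_list(matches):
    """Return (frequency, word form) pairs, most frequent first."""
    counts = Counter(matches)
    # Sort by descending frequency, then by word form for a stable order.
    return sorted(((n, w) for w, n in counts.items()), key=lambda p: (-p[0], p[1]))


# Invented list of matched word forms, for illustration only.
matches = ["bet", "betən", "bet", "betɨm", "betən", "bet"]

for frequency, word_form in frequency_list(matches):
    print(frequency, word_form)
```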

1.6 Download data
If you click on the Download button above the search results (see figure 2), a dialogue box will pop up where you can choose among several download formats: Excel, tab-separated text or comma-separated text (see figure 20).

Figure 20: Download options window
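
A tab-separated download can be read with standard tools. The Python sketch below parses tab-separated text with the csv module. The column names (informant, left, match, right) and the sample row are assumptions made for illustration; check the header of your own downloaded file for the actual layout.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical file content; real downloads may use other columns.
sample = "informant\tleft\tmatch\tright\nmu_wolkite_01\tx\tbet\ty\n"

rows = read_tsv(sample)
print(rows[0]["match"])
```

For a file on disk, the same reader can be applied to open(path, encoding="utf-8", newline="") instead of io.StringIO, where path is the location of your download.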

1.7 Sort the search results
The search results can be sorted in various ways, as shown in figure 21. If you want to sort by search word, select Sort by match. You can also sort by the word directly to the left (Sort by immediate left context) or the word directly to the right (Sort by immediate right context). Note that punctuation marks and symbols are sorted before letters (a, b, etc.).

Figure 21: Search results can be sorted in different ways
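
The three sort options can be mimicked on downloaded results. The Python sketch below sorts concordance lines by the match, by the word immediately to the left, or by the word immediately to the right. It relies on Python's default ordering by code point, in which ASCII punctuation also sorts before letters; Glossa's own collation may differ in detail. The dictionary keys, the helper names and the placeholder context words are assumptions.

```python
def last_word(text):
    """Word closest to the match in the left context ('' if empty)."""
    words = text.split()
    return words[-1] if words else ""


def first_word(text):
    """Word closest to the match in the right context ('' if empty)."""
    words = text.split()
    return words[0] if words else ""


def sort_hits(hits, by="match"):
    """Sort concordance lines by 'match', 'left' or 'right'."""
    keys = {
        "match": lambda h: h["match"],
        "left": lambda h: last_word(h["left"]),
        "right": lambda h: first_word(h["right"]),
    }
    return sorted(hits, key=keys[by])


# Placeholder contexts; only the match column uses word forms from this manual.
hits = [
    {"left": "x b", "match": "bet", "right": "c y"},
    {"left": "x .", "match": "betən", "right": "a y"},
    {"left": "x a", "match": "betɨm", "right": ". y"},
]

print([h["match"] for h in sort_hits(hits, by="left")])
```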