Ethiopian Language Technology

Language Technology for Ethiopia

This page gathers central language technology resources for Ethiopian languages, both those developed by participants in the NORHED project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia (2014-2020) and those developed by others.

Sidaama Dictionary
In August 2019, Kjell Magne Yri of the NORHED project published his Sidaama Dictionary in cooperation with Steve Pepper. The dictionary also contains a complete grammar and a short grammatical summary.
Read more and use the dictionary.
Eight Ethiopian speech corpora
The NORHED project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia has so far produced eight small speech corpora:
- Amharic Speech Corpus 154 000 tokens, 82 speakers. Linguist: Professor Baye Yimam.
- Gumer Speech Corpus 37 250 tokens, 22 speakers. Linguist: Dr. Fekede Menuta.
- Hadiyya Speech Corpus 13 000 tokens, 39 speakers. Linguists: Dr. Shimelis Mazengia and Dr. Zelealem Leyew.
- Hamar Speech Corpus 16 900 tokens, 2 speakers. Linguists: Dr. Binyam Sisay and Dr. Moges Yigezu.
- Kambata Speech Corpus 139 600 tokens, 69 speakers. Linguist: Dr. Derib Ado.
- Muher Speech Corpus 40 500 tokens, 8 speakers. Linguist: Dr. Ronny Meyer.
- Oromo Speech Corpus 266 500 tokens, 88 speakers. Linguists: Dr. Derib Ado and Dr. Feda Neggese.
- Tigrinya Speech Corpus 138 600 tokens, 45 speakers. Linguists: Dr. Derib Ado and Dr. Feda Neggese.
Five Ethiopian web corpora
In cooperation with the project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia, the HaBiT project has developed web corpora for Amharic, Oromo, Somali and Tigrinya:

Corpus Amharic WaC [2013 + 2015 + 2016]
Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015 and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
20,287,250 tokens / 17,320,000 words
Amharic WIC
Amharic WIC is the tagged corpus described in Argaw and Asker (2005), Gambäck and Asker (2010) and Gambäck (2012), made searchable in SketchEngine. The SERA (System for Ethiopic Representation in ASCII; Yacob, 1997) version of the corpus is encoded in the sera attribute, while the Ethiopian script (Fidel) version is encoded in the word attribute.
Corpus Oromo WaC [2016]
Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
5,091,696 tokens / 4,249,953 words
Corpus Somali WaC [2016]
Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
79,741,231 tokens / 71,871,585 words
Corpus Tigrinya WaC [2016]
Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
2,531,443 tokens / 2,087,613 words

HornMorpho
Morphological analysis and generation of Amharic and Oromo verbs and nouns and of Tigrinya verbs, developed by Michael Gasser.
- Download analysers

Resources collected by Daniel Yacob
- See the resources

Linguistic Capacity Building – Tools for the inclusive development of Ethiopia
Read about the project on the Text Laboratory homepage.
Addis Ababa University Research

References

Atelach Alemu Argaw and Lars Asker. 2005. Web mining for an Amharic-English bilingual corpus. In 1st Int. Conf. on Web Information Systems and Technologies, pp. 239–246, Deauville Beach, Florida, May.

Björn Gambäck and Lars Asker. 2010. Experiences with Developing Language Processing Tools and Corpora for Amharic. In P. Cunningham and M. Cunningham, editors, Proceedings of IST-Africa 2010, the 5th Conference on Regional Impact of Information Society Technologies in Africa, Durban, South Africa, May. IIMC. Read the pdf.

Björn Gambäck. 2012. Tagging and Verifying an Amharic News Corpus. In Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL8/AfLaT2012). Read the pdf.

Janne Bondi Johannessen. 2012. The Corpus Search and Results Handling System Glossa. Chung-hua Buddhist Journal, Volume 25, pp. 87-104.

Janne Bondi Johannessen, Lars Nygaard, Joel Priestley and Anders Nøklestad. 2008. Glossa: a Multilingual, Multimodal, Configurable User Interface. In Proceedings of the 6th International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 2-9517408-4-0. Read the article.
