Language Technology for Ethiopia

On this page we have gathered some central language technology resources for Ethiopian languages, developed either by participants in the NORHED project Linguistic Capacity Building – Tools for the inclusive development of Ethiopia (2014-2018) or by others.

    • Corpus Amharic WaC [2013 + 2015 + 2016]
      Amharic web corpus. Crawled by SpiderLing in August 2013, October 2015, and January 2016. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger trained on the Amharic WIC corpus.
      20,287,250 tokens / 17,320,000 words
    • Corpus Oromo WaC [2016]
      Oromo web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
      5,091,696 tokens / 4,249,953 words
    • Corpus Somali WaC [2016]
      Somali web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
      79,741,231 tokens / 71,871,585 words
    • Corpus Tigrinya WaC [2016]
      Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
      2,531,443 tokens / 2,087,613 words
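
The size figures above give two counts per corpus. As a rough illustration of how they relate, the following Python sketch computes, for each corpus, the difference between the two counts and the share of tokens that are counted as words. The numbers are copied from the list above; the function names are hypothetical, and the reading that the remaining tokens are non-word items such as punctuation is an assumption, since this page does not define the two terms.

```python
# Token and word counts copied from the corpus list above.
CORPORA = {
    "Amharic WaC": (20_287_250, 17_320_000),
    "Oromo WaC": (5_091_696, 4_249_953),
    "Somali WaC": (79_741_231, 71_871_585),
    "Tigrinya WaC": (2_531_443, 2_087_613),
}


def word_share(tokens: int, words: int) -> float:
    """Return the fraction of tokens that are counted as words."""
    return words / tokens


def summarize(corpora):
    """Return one (name, tokens, words, non_words, share) row per corpus.

    non_words is simply tokens minus words; treating it as punctuation
    and similar items is an assumption, not something the page states.
    """
    rows = []
    for name, (tokens, words) in corpora.items():
        rows.append(
            (name, tokens, words, tokens - words, word_share(tokens, words))
        )
    return rows


if __name__ == "__main__":
    for name, tokens, words, non_words, share in summarize(CORPORA):
        print(f"{name:<14}{tokens:>12,}{words:>12,}{non_words:>11,}{share:>8.1%}")
```

Running the script prints one line per corpus, which makes it easy to compare the four corpora by size and by the proportion of word tokens.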

  • A searchable corpus of Amharic
    Based on the tagged corpus described in Argaw and Asker (2005), Gambäck and Asker (2010) and Gambäck (2012). Two versions have been installed in the corpus search system Glossa (Johannessen et al. 2008, Johannessen 2012): one in SERA, the 'System for Ethiopic Representation in ASCII' (Yacob, 1997), and one in the Ethiopic script (Fidel).


  • HornMorpho
    Morphological analysis and generation of Amharic and Oromo verbs and nouns and of Tigrinya verbs, developed by Michael Gasser.


References

Atelach Alemu Argaw and Lars Asker. 2005. Web mining for an Amharic-English bilingual corpus. In 1st Int. Conf. on Web Information Systems and Technologies, pp. 239–246, Deauville Beach, Florida, May.

Björn Gambäck and Lars Asker. 2010. Experiences with Developing Language Processing Tools and Corpora for Amharic. In P. Cunningham and M. Cunningham, editors, Proceedings of IST-Africa 2010, the 5th Conference on Regional Impact of Information Society Technologies in Africa, Durban, South Africa, May. IIMC.

Björn Gambäck. 2012. Tagging and Verifying an Amharic News Corpus. In Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL8/AfLaT2012).

Janne Bondi Johannessen. 2012. The Corpus Search and Results Handling System Glossa. Chung-hua Buddhist Journal, volume 25, pp. 87-104.

Janne Bondi Johannessen, Lars Nygaard, Joel Priestley and Anders Nøklestad. 2008. Glossa: a Multilingual, Multimodal, Configurable User Interface. In Proceedings of the 6th International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 2-9517408-4-0.