HaBiT: Corpora collaboration with Czech university
HaBiT (Harvesting big text data for under-resourced languages) was a collaboration project that ran from 2014 to 2017 with Masarykova univerzita in Brno and NTNU in Trondheim, together with the Text Laboratory at the University of Oslo, Addis Ababa University and Hawassa University.
The goals of the HaBiT project were:
- Build large annotated corpora for Norwegian (tentatively with a size of at least 1 billion tokens, and with the aim of 5 billion tokens). For Czech, a corpus larger than 5 billion tokens will be compiled. For Amharic, Tigrinya, Oromo, and Somali, corpora of at least a few million tokens will be built (aiming at 20 million, at least for Amharic).
- Develop a parallel Czech-Norwegian corpus (with a size of up to 10 million tokens).
- Develop software modules such as taggers, parsers, and Sketch Grammars for the participating languages (Norwegian, and at least Amharic among the Ethiopian languages), and improve results for the already developed Czech modules.
- Give presentations at international conferences and workshops, with corresponding papers in relevant journals.
- Organize a workshop related to the under-resourced languages (e.g., within the TSD – Text, Speech and Dialogue – conference framework).
HaBiT meeting in Oslo, September 5-6, 2015.
Feda Negesse, Pavel Rychlý, Björn Gambäck, Anders Nøklestad, Aleš Horák, Derib Ado, Vít Suchomel, Kristin Hagen, Janne Bondi Johannessen, Lars Bungum, Joel Priestley.