logo tekstlab arkiv

HaBiT: Corpora collaboration with Czech university

HaBiT (Harvesting big text data for under-resourced languages) was a collaborative project that ran from 2014 to 2017. The partners were Masarykova univerzita in Brno, NTNU in Trondheim, the Text Laboratory at the University of Oslo, Addis Ababa University and Hawassa University.

The project was financed by the Czech-Norwegian Research Programme (EEA and Norway Grants).
Read more about the project here.

See outcomes from the project.
Read more about two large web corpora for Norwegian Nynorsk and Bokmål.

The goals of the HaBiT project were:

Build large annotated corpora for Norwegian (tentatively with a size of at least 1 billion tokens, and with the aim of 5 billion tokens). For Czech, a corpus larger than 5 billion tokens will be compiled. For Amharic, Tigrinya, Oromo, and Somali, corpora of at least a few million tokens will be built (aiming at 20 million, at least for Amharic).
Develop a parallel Czech-Norwegian corpus (with a size of up to 10 million tokens).
Develop software modules such as taggers, parsers, and Sketch Grammars for the participating languages (Norwegian, and at least Amharic among the Ethiopian languages), and improve the results of the already developed Czech modules.
Give presentations at international conferences and workshops, with corresponding papers in the relevant journals.
Organize a workshop related to the under-resourced languages (e.g., within the TSD – Text, Speech and Dialogue – conference framework).

HaBiT meeting in Oslo, September 5-6, 2015.
Feda Negesse, Pavel Rychlý, Björn Gambäck, Anders Nøklestad, Aleš Horák, Derib Ado, Vít Suchomel, Kristin Hagen, Janne Bondi Johannessen, Lars Bungum, Joel Priestley.

