Workshop on Advanced Corpus Solutions

Topics

Background

One of the main uses of language corpora is to show linguists and language technologists what counts as correct or representative language data within a certain domain. Linguists, unlike language technologists, cannot be expected to be computationally advanced, and yet their research needs data as complex as that of the technological group. This is not least true because the output of linguistic research will often, as a next step, be input to language technology systems.

While corpora used to be relatively simple and straightforward, perhaps varying along one dimension, such as written text genre (the Brown Corpus, the LOB corpus), and annotated for one type of information, such as part of speech, the needs of linguists have risen in accordance with the possibilities that the technology offers.

Linguists have higher expectations nowadays: they would like corpora to contain, for example, audio of spoken language, dialects, or videos, or to be multilingual. This content should also be searchable, with possibilities for searching the spoken language files for ways of expressing some word or grammatical category, the video files for types of gestures, or the multilingual files to see how a word or category in one language corresponds to one in a different language.

With more advanced corpora, users also expect annotations to go with them. Part of speech is still an issue, but syntactically parsed corpora are also desired, as are annotations relating to gestures, speech events, emotions etc. Spoken corpora should be transcribed, and there are expectations as to the type of transcription (orthographic, phonetic). Sociolinguistic, geographical and historical variables are also on the agenda: sex, age and education are background variables that may distinguish linguistic types, and it is therefore important that they be searchable.

The human-machine interface is important. Few linguists accept having to write search expressions as regular expressions. The options should be clickable or be presented as choices from a menu. For larger areas, querying via maps is a desirable option.

Corpus search issues are not the only important ones. Researchers also want results handling: the results should be exportable straight to a database, statistics should be calculated, further annotations should be possible, and maps should illustrate the geographic distribution of hits.

While the list of desiderata is long, it turns out that few of the points are fulfilled in actual corpora. For example, spoken language corpora are often represented by transcriptions (even orthographic ones), but very rarely come with audio or video possibilities. Maps are still uncommon in connection with corpora.

In addition to the need for advanced individual corpora, there is also a growing interest in interoperability between corpora (as stated explicitly, for example, by the European CLARIN initiative).

Papers

We welcome papers on corpora that address one or more of the issues above, either because they provide principled solutions to some of the challenges, or because they have implemented exciting solutions to specific topics mentioned above within the following areas:

• Corpus tools: corpus search, results presentation, results handling, linguistic annotation, text annotation
• Corpus types: monolingual corpora, parallel corpora, spoken language corpora, multimedia corpora

We invite papers on any language including, but not limited to, Asian languages.

    General Chair:
    Janne Bondi Johannessen,
    University of Oslo, Norway

    Co-chairs:
    Eckhard Bick, University of Southern Denmark
    Lars Borin, Gothenburg University, Sweden
    Jan Pieter Kunst, Meertens Institute, Netherlands

    Program Committee:
    Wirote Aroonmanakun, Chulalongkorn University, Thailand
    Emily M. Bender, University of Washington, USA
    Francis Bond, Nanyang Technological University, Singapore
    Ying Chen, China Agriculture University
    Stefan Evert, University of Osnabrück, Germany
    Stefan Th. Gries, UCSB, Santa Barbara, USA
    Dag Haug, University of Oslo, Norway
    Shoushan Li, Suzhou University
    Kikuo Maekawa, The National Institute for Japanese Language, Japan
    Adam Przepiórkowski, Polish Academy of Sciences, Poland
    Bert Vaux, University of Cambridge, UK
    Franca Wesseling, Meertens Institute, Netherlands

The Text Laboratory, University of Oslo