The Oslo Corpus of Tagged Norwegian Texts
(bokmål and nynorsk parts)
The bokmål part of the Oslo Corpus contains about
18.5 million words, while the nynorsk part contains about
3.8 million words. The corpora have been coded according to the
IMS Corpus Workbench standard (Institut für Maschinelle
Sprachverarbeitung, University of Stuttgart). The search interface has been
developed at the Text Laboratory.
[Search the bokmål corpus]
[Search the nynorsk corpus]
[Text Laboratory home page]
The corpus consists of the texts that were available at the Text
Laboratory in January 1999. It is composed of texts from three genres: fiction
(bokmål: 1.7 million words; nynorsk: 2.1 million),
newspapers/magazines (bokmål: 9.6 million; nynorsk: 1 million),
and factual prose (bokmål: 7.1 million; nynorsk: 700,000).
All fiction comes from ECI
(European Corpus Initiative) and Norsk Tekstarkiv (Norwegian Text Archive),
Bergen (now: HIT-senteret). The texts from
newspapers and magazines have been collected by the Text Laboratory with kind
permission from the various editorial offices. The factual prose consists
mainly of NOU reports (Norwegian Official Reports) and Norwegian laws and
regulations. A detailed survey of the texts, with source annotation codes, is
given here.
The corpus is not meant to be representative in any sense, although
it contains texts from a variety of genres. The main purpose of the corpus is
to offer a large amount of text which researchers can use for searching.
However, since it is possible to restrict the search to specific sources, the
corpus can be used as a tailored corpus: you could choose to search in all
newspaper texts, all of the fiction, all of the factual prose, single
texts, or any combination of these. (Cf. also ENPC.)
The corpus project, which includes gathering of texts, grammatical
tagging, source annotation, IMS coding, and development of the web interface,
has been led by Janne Bondi Johannessen. Diana Santos developed the original
web interface for regular expressions (for The Oslo Corpus of Bosnian Texts),
while Sigurd Schiøth and Anders Nøklestad extended the interface to
support searching by clicking check boxes. Tore Bjertnes Pedersen and Anders
Nøklestad created the codes for source annotation, based on similar
work at Seksjon for leksikografi og målføregransking (Section for lexicography
and dialect research). The grammatical tagging was mainly done by Kristin Hagen
(the morphological part) and Anders Nøklestad (the syntactic part); a
complete list of the persons involved is given here. Certain parts of the tagger
(viz. the multi-tagger) have been developed in collaboration with
Dokumentasjonsprosjektet (the Documentation Project, led by Christian-Emil Ore),
and the programming has been performed by Lars-Jørgen Tvedt and partly by
Helge Hauglin.
A lot of work has gone into the grammatical tagging of the corpus.
The development of the tagger itself has involved
six person-years of work, mainly financed by the Norwegian Research Council, the
Documentation Project and the Text Laboratory. We have used software developed
by Lingsoft, Finland, which is based on a
kind of dependency grammar called Constraint Grammar. It is
possible to search the corpus for specific tags.
Morphological tags
The morphological tags are, strictly speaking, morphosyntactic tags.
They indicate part of speech along with all common categories and their
features, such as gender (masculine, feminine, neuter), number (singular,
plural), definiteness (definite, indefinite) and tense (present tense, past
tense), to mention just a few. A full survey is given here. As
far as possible, we have followed Norsk Referansegrammatikk (Norwegian
Reference Grammar) in our choice of parts of speech and morphosyntactic
features. This has led to some untraditional classifications; for instance, all
words that were earlier called locative adverbs are now classified as
prepositions.
Syntactic tags
The syntactic tags indicate common syntactic functions like Subject
and Object. All syntactic tags are preceded by an at sign (@). Since we
are using a kind of dependency grammar, where every word is labelled either as
a head or as a modifier, there are also quite a few less traditional tags,
e.g. @<SBU (SUBORDINATING CONJUNCTION modifying something to the left),
@DET> (DETERMINER modifying something to the right), and @KON (COORDINATING
CONJUNCTION). An arrow (< or >) on the syntactic tag means that the word is a
modifier of a head which is found in the direction of the arrow. A full survey
of the syntactic tags is given here.
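The arrow convention can be illustrated with a small Python sketch. It is only an illustration based on the three example tags above: the function name is hypothetical, and the assumption that an arrow always appears directly after the @ or at the very end of the tag has not been checked against the full tag survey.

```python
def modifier_direction(tag: str) -> str:
    """Return where the head of a tagged word is found: 'left', 'right' or 'none'.

    Assumption (from the examples in the text only): a '<' right after
    the '@' points left, a '>' at the end of the tag points right, and a
    tag without an arrow carries no direction.
    """
    if not tag.startswith("@"):
        raise ValueError(f"not a syntactic tag: {tag!r}")
    label = tag[1:]
    if label.startswith("<"):
        return "left"
    if label.endswith(">"):
        return "right"
    return "none"


for example in ("@<SBU", "@DET>", "@KON"):
    print(example, modifier_direction(example))
```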
Survey of source annotations
The codes for source annotation are based on similar work done at
Seksjon for leksikografi og målføregransking (Section for
lexicography and dialect research), University of Oslo. Here is an example:
Allbjart, Gunnar 'Flukten til livet' flukt.syn SK/AlGu/01
The source annotation is the code at the end of the line. SK means
fiction ("skjønnlitteratur"); the codes for the other genres are
AV = newspaper/magazine ("avis/ukeblad") and SA = factual prose
("sakprosa"). The four letters in the middle field indicate the name of the
author, or the name and year of the newspaper/magazine, while the final number
is a file index in case there is more than one work by the same author or more
than one file from the same newspaper/magazine. Note that for the
newspapers/magazines there is no relationship between the number of files and
the number of issues. For instance, AV/Af94/01 contains 26 issues of
Aftenposten from 1994. A full survey of source annotations is given here.
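As an illustration of the three-field layout, the following Python sketch splits a source annotation into its parts. The names SourceCode, GENRES and parse_source_code are hypothetical, and the sketch assumes that every code has exactly three fields separated by slashes, as in the two examples above.

```python
from typing import NamedTuple

# Genre codes as described in the text.
GENRES = {
    "SK": "fiction",
    "AV": "newspaper/magazine",
    "SA": "factual prose",
}


class SourceCode(NamedTuple):
    genre: str       # spelled-out genre name
    source: str      # author, or newspaper/magazine name and year
    file_index: int  # running number of the file for this source


def parse_source_code(code: str) -> SourceCode:
    """Split a code such as 'SK/AlGu/01' into genre, source and file index."""
    parts = code.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three fields in {code!r}")
    genre_code, source, index = parts
    if genre_code not in GENRES:
        raise ValueError(f"unknown genre code {genre_code!r}")
    return SourceCode(GENRES[genre_code], source, int(index))


print(parse_source_code("SK/AlGu/01"))
print(parse_source_code("AV/Af94/01"))
```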
It is possible to search for one, two or three words or parts of
words (beginnings or endings), and the words can either be adjacent or
separated by a specified number of other words. One or more of the words may be
specified, to different degrees, with regard to grammatical category, and you
can also specify what kind of text you want to search. It is even possible to
search on grammatical category alone, without naming any part of the words.
Note! Remember to clear the search form before each new
search.
Examples of the major types of queries
- Single words. Find all instances of jente: Write
jente in the field Første ord. Click Søk i
korpuset.
- Prefixes. Find all words beginning with be-:
Write be in the field Første ord. Click the checkbox
marked Begynnelse av ord. Click Søk i korpuset (examples:
bena, bestemt).
- Suffixes. Find all words ending in -else: Write
else in the field Første ord. Click in the checkbox marked
Endelse av ord. Click Søk i korpuset (examples:
forbauselse, forskrekkelse).
- Word sequences. Find all sequences of adjacent words
where the first one ends in -r and the second one begins with
be-: Write r in the field Første ord, and click the
checkbox marked Endelse av ord, select maks 0 ord
mellom, write be in the field Andre ord, and click the
checkbox marked Begynnelse av ord. Click Søk i korpuset
(examples: eller begynne, har bestemt).
- Broken sequence - with intervening words. Find all
instances of the word jeg followed by the word og with no more
than seven words in between: Write jeg in Første ord,
select maks 7 ord mellom, write og in Andre
ord. Click Søk i korpuset (example: ...jeg var ute i samme
ærend og ble glad...)
- Restrict the search to certain kinds of text. Find all
words beginning with be- in the fiction material: Write be in
Første ord, click the checkbox marked Begynnelse av ord,
click on Velg tekster, and select Alle in the
"Skjønnlitteratur" menu and click on Ingen below the newspaper
and factual prose menus. Click Søk i korpuset (examples:
bena, bestemt).
- Restrict the search with regard to grammatical category.
Find all verbs in the present tense that are not compounds: Do not write
anything in the fields Første ord, Andre ord or Tredje
ord. Select Verb from the Grammatiske kategorier menu below
Første ord, click Morfosyntaktiske trekk and then on the
left radio button for Presens in the window that appears. Click
OK. Select Annet in the Utelukk kategori(er) menu below
Første ord and click on Sammensetning in the window that
appears. Click OK and Søk i korpuset (examples:
puster, bestemmer, but not pustet, bestemt,
massekopierer).
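For readers who prefer to think in terms of query expressions, the word-based query types above can be approximated in CQP-style notation. The Python sketch below only builds such strings. The helper names are hypothetical, the attribute name word is an assumption about how the corpus is encoded, and the expressions the interface actually generates may differ. No escaping is done, so the input is assumed to consist of plain letters.

```python
def word_pattern(text: str, beginning: bool = False, ending: bool = False) -> str:
    """Build a CQP-style token pattern for a whole word or a part of a word.

    beginning=True mirrors the "Begynnelse av ord" checkbox (text + anything);
    ending=True mirrors the "Endelse av ord" checkbox (anything + text).
    """
    body = text
    if beginning:
        body = body + ".*"
    if ending:
        body = ".*" + body
    return f'[word="{body}"]'


def sequence(first: str, second: str, max_between: int = 0) -> str:
    """Join two token patterns, allowing at most max_between intervening words."""
    if max_between > 0:
        # [] matches any token; {0,n} repeats it between 0 and n times.
        return f"{first} []{{0,{max_between}}} {second}"
    return f"{first} {second}"


print(word_pattern("jente"))                    # single word
print(word_pattern("be", beginning=True))       # words beginning with be-
print(word_pattern("else", ending=True))        # words ending in -else
print(sequence(word_pattern("r", ending=True),
               word_pattern("be", beginning=True)))   # adjacent words
print(sequence(word_pattern("jeg"), word_pattern("og"), max_between=7))
```

Restrictions on text type and grammatical category are left out of the sketch, since the attribute names they would need are not given in this document.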
Examples of combinations of the search criteria above
- Find all words beginning with be- that are verbs in the
fiction material: Write be in Første ord, click the
checkbox marked Begynnelse av ord, select Verb in the
Grammatiske kategorier menu, click on Velg tekster, select
Alle in the "Skjønnlitteratur" menu and click on
Ingen below the newspaper and factual prose menus. Click OK and
Søk i korpuset (examples: bestemt, begynner, but
not bena, begynnelse).
- Find all words beginning with be- that are verbs in the
present tense, in fiction and factual prose: Write be in
Første ord, click the checkbox marked Begynnelse av ord,
select Verb in the Grammatiske kategorier menu, click on
Morfosyntaktiske trekk and then on the left radio button for
Presens in the window that appears, click Velg tekster, select
Alle in the "Skjønnlitteratur" menu and Alle in the
"Sakprosa" menu, and click on Ingen below the newspaper menu. Click
OK and Søk i korpuset (examples: bestemmer,
begynner, but not bena, begynnelse, bestemt).
- Find all words beginning with be- that are verbs in the
present tense in Aftenposten: Write be in Første ord,
click in the checkbox marked Begynnelse av ord, select Verb in
the Grammatiske kategorier menu, click on Morfosyntaktiske trekk
and then on the left radio button for Presens in the window that
appears, click Velg tekster, select Aftenposten in the "Aviser" menu and
click on Ingen below the factual prose and fiction menus. Click
OK and Søk i korpuset (examples: bestemmer,
begynner, but not bena, begynnelse, bestemt).
- Find all verbs in Aftenposten that are not in the past tense:
Do not write anything in the fields Første ord, Andre ord
or Tredje ord. Select Verb in the Grammatiske kategorier
menu, click on Morfosyntaktiske trekk and then on the right radio button
for Presens in the window that appears, click Velg tekster,
select Aftenposten in the "Aviser" menu and click on Ingen below the
factual prose and fiction menus. Click OK and Søk i
korpuset (examples: pustet, bestemmer).
- Find all verbs that are followed by a preposition in the
fiction material: Do not write anything in the fields Første ord,
Andre ord or Tredje ord. Select Verb in the Grammatiske
kategorier menu below Første ord and Preposisjon in
the corresponding menu below Andre ord, click on Velg tekster,
select Alle in the "Skjønnlitteratur" menu and click on
Ingen below the newspaper and factual prose menus. Click OK and
Søk i korpuset (examples: pustet ut, bestemmer
for).
The corpus is freely available for research using login with Feide or eduGAIN. (Contact the Text Laboratory if you need another login alternative.)
Technical information
This is a front-end to CQP, the Corpus Query Processor of the
IMS Corpus Workbench developed by Oliver Christ and Bruno Maximilian Schulze
at the Institut für Maschinelle Sprachverarbeitung at the University of
Stuttgart. Its Frequently Asked Questions list is available at
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.
We gratefully acknowledge permission to use CQP for research
purposes.
Those acquainted with the CQP
query syntax can use (almost) all of its power. Particular restrictions are
described below.
The corpus is encoded in the ISO-8859-1 character set. It is also
possible to have the search results shown in pure ASCII format.
The corpus consists of the electronic
Norwegian material that was available at the Text Laboratory by January
1999. We have received most of this material in electronic form, either
directly from newspapers, authors or publishers, or by way of text collectors
such as Humanistisk datasenter in Bergen (now: HIT-senteret) and
ECI (European Corpus
Initiative). We have also downloaded governmental information bulletins (NOU
reports) from the Internet. We are very grateful to have been given permission
by newspapers, publishers and authors to use their texts in this first Oslo
corpus. We have not made any changes to the texts, except for deleting certain
numerical tables in some of the texts. We have not removed headlines, captions
and other elements which might have been thought to create problems for a
tagger. Instead, we designed the tagger to be able to handle such elements,
albeit within limits.
The corpus has been tagged with a multitagger (developed
by the Text Laboratory and the Documentation Project in
collaboration), and then with our disambiguating tagger,
developed by the Text Laboratory (using software by
Lingsoft, Finland). The corpus has been
automatically converted to CQP format from pure text files with meta
information in the header and from an index containing the correct text
identifier.
The corpus is not proofread.
Finally, there are a few differences between our corpus and the
Corpus Workbench standard:
- The structure of the corpus does not permit formal units like
paragraphs and sentences to be included in the query.
- Each word in the corpus has its own source annotation. We have
arranged for the source to be shown for each line in the concordance.
- Capital and non-capital letters have different encodings.
- Punctuation marks have been encoded as separate characters, so
that it is possible to search for e.g. commas.
The current search interface makes it possible
- to search by clicking and writing
- to have the results shown in Latin 1 or in pure ASCII
- to select the amount of context that is to be shown in the
concordance
- to have only a specified number of randomly chosen hits
shown
- to select the kind of search result (concordance, distribution
of forms or sources, or a combination)
- to select concordance without tags, with tags only on the
search item, or with tags both on the search item and on the context
- to sort the concordance by source, search string, or the
preceding or following word or punctuation mark.
The search result is shown together with a regular expression form
of the query, the date, and the number of hits.
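To make the result types more concrete, the Python sketch below computes a distribution of forms, a distribution of sources and a concordance sorted by the following word. The hits are invented sample data (only the two source codes are taken from this page), and the Hit record is a hypothetical stand-in for whatever the interface uses internally.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    left: str    # preceding word
    match: str   # the form that matched the search expression
    right: str   # following word
    source: str  # source annotation of the hit


# Invented sample hits for a search on verbs beginning with be-.
hits = [
    Hit("han", "bestemmer", "seg", "SK/AlGu/01"),
    Hit("hun", "begynner", "nå", "AV/Af94/01"),
    Hit("de", "bestemmer", "alt", "AV/Af94/01"),
]

# Distribution of forms and of sources.
form_counts = Counter(hit.match for hit in hits)
source_counts = Counter(hit.source for hit in hits)

# Concordance sorted by the word following the search item.
by_following_word = sorted(hits, key=lambda hit: hit.right)

print(form_counts.most_common())
print(source_counts.most_common())
for hit in by_following_word:
    print(f"{hit.source}: {hit.left} [{hit.match}] {hit.right}")
```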
In some cases a warning or a helpful message is given. For example:
- Do not ask for a distribution of forms when the search
expression only corresponds to a single form
- Do not use * instead of .* (a* means a
number of a's, not a followed by something else; to get that you would
have to write a.*)
- Do not use spaces in the middle of a search expression. If you
want two words, you have to enclose them in quotation marks.
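The difference between * and .* can be tried out with Python's re module, used here only as a stand-in on the assumption that it treats these two operators the same way as the corpus search does. The word list is an arbitrary sample, and each pattern has to match the whole word.

```python
import re

words = ["a", "aaa", "avis", "bena"]

# a* means "zero or more a's": only words made up entirely of a's match.
only_as = [w for w in words if re.fullmatch(r"a*", w)]

# a.* means "a followed by anything": every word beginning with a matches.
starts_with_a = [w for w in words if re.fullmatch(r"a.*", w)]

print(only_as)
print(starts_with_a)
```

The first list contains only "a" and "aaa", while the second also includes "avis".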
Important restrictions
In order to prevent users from downloading entire texts to their own
machine, we have implemented the following restrictions:
- You are not allowed to request a context larger than 500
characters. No matter how large the number entered, the maximum context you'll
see will be 500 characters long.
- You are not allowed to get sequences longer than 200 words
(from the beginning of the search expression to the end). Longer expressions
will be reduced to 200 words.
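In effect, both limits act as a cap on what is returned rather than as an error. A minimal Python sketch of that behaviour follows. The function and constant names are hypothetical, and the sketch makes no claim about how the server actually enforces the limits.

```python
MAX_CONTEXT_CHARS = 500   # largest context that will be shown
MAX_SEQUENCE_WORDS = 200  # longest matched sequence that will be returned


def effective_context(requested_chars: int) -> int:
    """Return the context size actually used for a requested size."""
    return min(max(requested_chars, 0), MAX_CONTEXT_CHARS)


def truncate_sequence(words: list[str]) -> list[str]:
    """Cut a matched sequence down to the maximum number of words."""
    return words[:MAX_SEQUENCE_WORDS]


print(effective_context(80))
print(effective_context(10_000))
print(len(truncate_sequence(["ord"] * 250)))
```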
Publications
Publications where the corpus has been used
- Helle Asmussen. 2000. Korpus 2000 - En undersøgelse af brugergrupper og
korpusværktøjer. Prosjektoppgave, Institut for Datalingvistik, Handelshøjskolen i København.
(HTML, Postscript)
- Philipp Conzett. 2002. Frå einskap til ulikskap? Ei gransking av genustilhøvet
ved avleiingar på -skap i skandinavisk. Term paper, University of Tromsø.
- Hanne Ragnhild Eliassen. 2002. Frekvens og norske verb. Hvordan kan verb klassifiseres, og hvordan påvirker frekvens verbene? Cand.philol. thesis, University of Oslo.
- Elisabet Engdahl. 1999. Valet av passivform i modern svenska.
Lecture given at Svenskans beskrivning 24 in Linköping.
- Elisabet Engdahl. 1999. The choice between bli-passive
and s-passive in Danish, Norwegian and Swedish.
NORDSEM-report
no. 3. (Postscript)
- Martin Hilpert. 2002. Semantik und Syntax von Verben der Meinungsäusserung im Dänischen, Norwegischen und Schwedischen. Eine kompararative, korpusbasierte Fallstudie. Universität Hamburg.
- Janne Bondi Johannessen. 1998. Negasjonen ikke: Kategori og
syntaktisk posisjon. MONS 7. Utvalde artiklar frå det 7. Møtet om
Norsk Språk i Trondheim 1997. ISBN 82-7099-307-7
- Fredrik Andersen Kavli. 2001. Korpusargumenter. Cand.philol. thesis, University of Bergen. (HTML)
- Arild Lian, Paul J. Karlsen, and Bendik Winswold. 2001. A re-evaluation of the phonological similarity effect in adults' short-term-memory of words and nonwords. Memory, 9 (4,5,6), 281-299.
- Arne Martinus Lindstad. 1999. Issues in the Syntax of Negation
and Polarity in Norwegian. A Minimalist Analysis. Cand.philol. thesis,
University of Oslo.
- Victoria Rosén. 2000. Er norsk et naturlig språk? In: Øivin Andersen, Kjersti Fløttum
and Torodd Kinn (eds.), Menneske, språk og fellesskap. Festskrift til Kirsti Koch
Christensen på 60-årsdagen, 1. desember 2000, Oslo, Novus forlag.
- Grete Seland. 2001. The Norwegian Reflexive Caused Motion Construction. A Construction Grammar Approach. Cand.philol. thesis, University of Oslo.
- Henrik Stiansen. 2001. Indirekte objekt i norsk. Cand.philol. thesis, University of Oslo.
- Ingebjørg Tonne. 2001. Progressives in Norwegian and the Theory of Aspectuality. Dr.art thesis, University of Oslo, Acta Humaniora, Unipub/Gnist-Akademika.
(Postscript)
- Øystein Alexander Vangsnes. 2001. Distributiv possessiv - en binominal konstruksjon. In Inger Moen (et al.), Mons 9: Utvalgte artikler fra Det niende møtet om norsk språk i Oslo 2001, 230-243. Oslo: Novus.
If you use the corpus for lectures or written work, please tell us
about it. We would like to extend the list of such work, since it is valuable
for all of us.
About tagging
Scientific journals and anthologies:
- Kristin Hagen, Janne Bondi Johannessen and Anders
Nøklestad. 2000. A Web-Based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts. In Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis and G. Stainhaouer (eds.): Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece 31 May - 2 June 2000.
- Kristin Hagen, Janne Bondi Johannessen and Anders
Nøklestad. 2000. A Constraint-Based Tagger for Norwegian. In Lindberg, C.-E. and S. Nordahl Lund (eds.): 17th Scandinavian Conference of Linguistics, vol. I. Odense: Odense Working Papers in Language and Communication, No. 19, vol I.
- Kristin Hagen, Janne Bondi Johannessen and Anders
Nøklestad. 2000. The shortcomings of a tagger. In Proceedings from the 12th "Nordiske datalingvistikkdager", Trondheim 9-10 December, 1999. Trondheim: Lingvistisk institutt, NTNU.
- Janne Bondi Johannessen. 1998. Tagging and the case of
pronouns. Computers and the Humanities. ISSN 0010-4817
- Janne Bondi Johannessen. 1998. Elektroniske hjelpemidler -
leksikografisk fornying. Norskrift. ISSN 0800-7764
- Kristin Hagen and Janne Bondi Johannessen. 1998. Disambiguering
uten syntaks. MONS 7. Utvalde artiklar frå det 7. Møtet om Norsk
Språk i Trondheim 1997. ISBN 82-7099-307-7
- Anders Nøklestad. 1998. Statistisk disambiguerende
tagging av norsk. MONS 7. Utvalde artiklar frå det 7. Møtet om
Norsk Språk i Trondheim 1997. ISBN 82-7099-307-7
- Janne Bondi Johannessen and Helge Hauglin. 1998. An Automatic
Analysis of Norwegian Compounds. Papers from the 16th Scandinavian Conference
of Linguistics, Turku/Åbo, Finland. ISBN 951-29-1327-5
Unpublished:
- Kristin Hagen, Janne Bondi Johannessen and Kristian Emil
Kristoffersen. 1997. Problemer ved bruk av andres lister til
taggerformål. Paper presented at Møte om norsk
språk 7, University of Trondheim.
This is version 2 of the corpus, tagged using version 2 of the
multitagger and version 2 of the disambiguating tagger.
We are planning to make some improvements, hopefully in the near
future.
- Collocations. We will offer collocations for the search
word.
- Frequency lists. We will create frequency lists for all
of the text types.
- Random selection with even distribution among text
types. We will offer the opportunity to search for a certain number of
randomly selected instances where the instances are evenly distributed among
the various text types.
- Remove articles etc. in the wrong language variety. We
will continue to remove extensive nynorsk texts from the
bokmål material and vice versa.
- The layout on the click-and-write pages will be continuously
evaluated and improved.
We want to continue to improve the Oslo Corpus. Therefore, we will
appreciate all suggestions for improvements, either to
tekstlab-post@iln.uio.no or to the corpus
discussion list, oktnt-list@iln.uio.no. We would like to
thank Stig Johansson, Elisabet Engdahl, Johan Laurits Tønnesson, and
Carl Vikner for their valuable suggestions.
Norwegian document created by Janne Bondi Johannessen,
translated into English by Anders Nøklestad.
Last updated 7 May 2007 by AN.