The Oslo Corpus of Bosnian Texts

[Korpus bosanskih tekstova na Univerzitetu u Oslu]

The Oslo Corpus of Bosnian Texts is a corpus of approximately 1.5 million words, encoded with the IMS Corpus Workbench developed at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart, to which a suitable interface was added at the Text Laboratory.

Contents of the corpus
Types of queries available
How to get permission to use the corpus
How to see and produce the right fonts
Technical information
The most frequent 1,000 wordforms
Available publications on the corpus
Version
Acknowledgement
Users of the corpus
How to contact us

[Query the Corpus] [Go to the Text Laboratory's Home Page]

Contents of the corpus

This corpus has been compiled at the University of Oslo as a joint project between the Department for East European and Oriental Studies and the Text Laboratory. The corpus contains approximately 1.5 million words and comprises several different genres: fiction (novels and short stories), essays, children's stories, folklore, Islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, were for the most part published in the 1990s. The corpus provides a new and different basis for research into the language of Bosnia and Herzegovina.

The project has been supervised by Assistant Professor Janne Bondi Johannessen, while Professor Svein Mønnesland was responsible for the selection and compilation of the texts. Gordana Vranic and Kemila Basic made the texts electronically available (by scanning and adaptation) as simple text files. Diana Santos built the corpus from those files in the format required by the corpus tools used (see below for more information), and also wrote the Web interface.

The holders of the copyrights for all the texts have kindly granted permission for the use of the texts in this corpus. In the event that a text is taken from a book, it never covers more than three quarters of that book.

For a detailed overview of the contents in terms of size and source, see the "Sadrzzaj" page.

Types of queries available

When querying the corpus, one can ask for a concordance (KWIC, KeyWord In Context, style - the default), or one can ask for the distribution of the results, in terms of forms, or in terms of text source. In addition, one can, in the very same query, ask for both the concordance and (one of) the distributions.

Even though we plan to provide, eventually, a simpler and fully menu-based query form, for the moment we rely almost completely on the CQP query syntax. It allows quite complex choices to be expressed compactly, using regular expressions.

Examples of Bosnian queries are:

"sebi" All occurrences of the word sebi. Follow this link to see a possible display.
"kak.*" All words that begin with the letters kak. Follow this link to see a possible display.
".*ovati" All words that end with the sequence ovati (=infinitives, e.g. kritikovati). Follow this link to see a possible display.
".*t" "ch.*" All sequences of two adjacent words where the first ends in t and the second starts with ch (=non-contracted form of the future tense, e.g. vidjet chess). Follow this link to see a possible display.
"da" []{0,7} "se" The word da followed by se having at most seven words in between. Follow this link to see a possible display.
"u" []* "u" []* "u" within p Paragraphs having at least three occurrences of the word u.
Not applicable yet, since the corpus is not structured in paragraphs, i.e. there is no structural attribute named p.
Follow this link to see a possible display without the within restriction.
See also a set of further examples specially prepared for Bosnian linguists.
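
The matching behaviour behind these examples can be approximated in a few lines of Python. This is only a rough sketch and not how CQP itself is implemented: it assumes that each quoted regular expression must match one whole token, that []{m,n} skips between m and n arbitrary tokens, and it keeps only the shortest match for each start position. The function name match_sequence and the token list are invented for illustration; the list is not a passage from the corpus.

```python
import re

def match_sequence(tokens, pattern):
    """Return (start, end) token spans where `pattern` matches.

    `pattern` is a list whose items are either a regex string, which must
    match one whole token, or a tuple (lo, hi) standing for []{lo,hi},
    i.e. between lo and hi arbitrary tokens.
    """
    def walk(i, j):
        # i: position in tokens, j: position in pattern; yields end positions
        if j == len(pattern):
            yield i
            return
        item = pattern[j]
        if isinstance(item, tuple):
            lo, hi = item
            for skip in range(lo, hi + 1):
                if i + skip <= len(tokens):
                    yield from walk(i + skip, j + 1)
        elif i < len(tokens) and re.fullmatch(item, tokens[i]):
            yield from walk(i + 1, j + 1)

    spans = []
    for start in range(len(tokens)):
        ends = list(walk(start, 0))
        if ends:
            spans.append((start, min(ends)))  # simplification: shortest match only
    return spans

# Toy token list (assumption: invented, not corpus text)
tokens = ["da", "on", "sebi", "kako", "se", "vidjet", "chess", "kritikovati"]

print(match_sequence(tokens, ["sebi"]))              # [(2, 3)]
print(match_sequence(tokens, ["kak.*"]))             # [(3, 4)]
print(match_sequence(tokens, [".*t", "ch.*"]))       # [(5, 7)]
print(match_sequence(tokens, ["da", (0, 7), "se"]))  # [(0, 5)]
```

An unbounded []* could be imitated in this sketch by passing (0, len(tokens)) as the gap.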

It is important to be aware that, in addition to formal properties of the text, one can also make queries with parameters such as text type, author, date, or even a particular work. For an overview of the possibilities offered by our classification of the texts, see the "Sadrzzaj" page. Some examples:

[word="kak.*" & ori="PU.*"] Words that begin with the letters kak in newspapers or periodicals (codes starting with PU). Follow this link to see a random selection of 20 of these words.
[word=".*t" & ori=".*94"] []* "ch.*" Sequences of (not necessarily adjacent) words where the first ends in t and the second starts with ch in works published in 1994. Follow this link to see a random selection of 10 of these sequences.
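
A condition such as [word="kak.*" & ori="PU.*"] can be read as a filter over tokens that carry both a word form and a source identifier. The following Python sketch illustrates that reading under simplifying assumptions: the Token type, the select function and the source identifiers are all hypothetical, and the real identifiers are the ones listed on the "Sadrzzaj" page.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    ori: str  # source identifier (the values used below are made up)

def select(tokens, word=".*", ori=".*"):
    """Keep tokens whose word AND ori both match their regex as a whole."""
    return [
        t for t in tokens
        if re.fullmatch(word, t.word) and re.fullmatch(ori, t.ori)
    ]

# Hypothetical sample: two tokens from a periodical, one from another source
sample = [
    Token("kako", "PUxx94"),
    Token("kakav", "ROxx93"),
    Token("sebi", "PUxx94"),
]

print(select(sample, word="kak.*", ori="PU.*"))
print(select(sample, ori=".*94"))
```

The first call keeps only the token kako, since kakav has a source code that does not start with PU; the second keeps both tokens whose code ends in 94.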

How to see and produce the right fonts

In order to show the results with Bosnian characters, the computer on which you are running your browser must support ISO-8859-2. If the results of your search are not displayed correctly, you can

follow the instructions to make them look better,
or choose all-ASCII display in the query form.

If you cannot type Bosnian characters directly, you can use their octal codes, their standard "alongations", or the corresponding ISO-8859-1 character (Latin 1) instead. Here are the possibilities:

Octal codes	Alongation	Latin 1
\306	Ch	Æ
\346	ch	æ
\310	CC	È
\350	cc	è
\320	Djj	Ð
\360	djj	ð
D\256	Dz	D®
d\276	dz	d¾
\251	SS	©
\271	ss	¹
\256	ZZ, Zh	®
\276	zz, zh	¾

Along with some examples:

"stra\271no", "\276ivim", "\271aljivd\276ija"
stra¹no, ¾ivim, ¹aljivd¾ija
strassno, zzivim, ssaljivdzija
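
The relationship between the octal codes and the two character sets can be checked with a short Python sketch. It relies only on Python's standard iso8859_2 and latin_1 codecs; the helper name decode_both is our own. The same byte yields the Bosnian letter when read as ISO-8859-2 and the substitute symbol when read as Latin 1.

```python
# Octal codes from the table above, in table order
# (the D\256 and d\276 rows are D or d followed by one of these codes)
OCTAL_CODES = [0o306, 0o346, 0o310, 0o350, 0o320, 0o360,
               0o251, 0o271, 0o256, 0o276]

def decode_both(raw: bytes):
    """Return the same bytes read as ISO-8859-2 and as ISO-8859-1."""
    return raw.decode("iso8859_2"), raw.decode("latin_1")

# In a Python bytes literal, \271 is an octal escape for a single byte
for raw in (b"stra\271no", b"\276ivim", b"\271aljivd\276ija"):
    print(decode_both(raw))

print(decode_both(bytes(OCTAL_CODES)))
```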

Please note that

in order to input octal codes, you have to enclose the words in quotes.
in order to make the character encodings unambiguous, we have changed the standard representation of Đ and đ (Dj and dj) to Djj and djj. This does not apply to the display of the results, which follows the standard. In other words, you look for Djje in your query, but you'll see Dje if you selected all-ASCII mode.
even if you input them as sequences of characters, the Bosnian characters are considered to be one character long, except for Dž (dž), which is regarded as D (d) followed by Ž (ž). Given that "." stands for any character in the CQP syntax, this means that e.g. stra.no will match strassno, but .amijskih will not match dzamijskih.
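
The last point can be illustrated with a small Python sketch. It assumes that a query typed with "alongations" is first converted to the corresponding characters and only then matched, which is our reading of the notes above rather than a description of the actual program; the names ALONGATIONS and from_alongation are invented. The mapping follows the table, so Dz and dz become two characters each.

```python
import re

# Mapping taken from the table above (Dz/dz stay two characters long)
ALONGATIONS = {
    "Ch": "Ć", "ch": "ć", "CC": "Č", "cc": "č",
    "Djj": "Đ", "djj": "đ", "Dz": "DŽ", "dz": "dž",
    "SS": "Š", "ss": "š", "ZZ": "Ž", "Zh": "Ž", "zz": "ž", "zh": "ž",
}

# Try longer sequences first so that "djj" wins over any shorter alternative
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, ALONGATIONS), key=len, reverse=True))
)

def from_alongation(text: str) -> str:
    """Replace every alongation in `text` by its Bosnian character(s)."""
    return _PATTERN.sub(lambda m: ALONGATIONS[m.group(0)], text)

print(from_alongation("strassno"))    # strašno
print(from_alongation("dzamijskih"))  # džamijskih

# "." stands for exactly one character
print(bool(re.fullmatch("stra.no", from_alongation("strassno"))))       # True
print(bool(re.fullmatch(".amijskih", from_alongation("dzamijskih"))))   # False
print(bool(re.fullmatch("..amijskih", from_alongation("dzamijskih"))))  # True
```

A blind replacement like this would also rewrite any ordinary word that happens to contain one of these letter sequences, so it is suitable only as an illustration.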

Technical information

The IMS Corpus Workbench

This is a front-end to CQP, the Corpus Query Processor of the IMS Corpus Workbench developed by Oliver Christ and Bruno Maximilian Schulze at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. Its Frequently Asked Questions list is available at http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.

We gratefully acknowledge permission to use CQP for research purposes.

Those acquainted with the CQP query syntax can use (almost) all of its power. Particular restrictions are described below.

Corpus structure and encoding

The corpus is encoded in the ISO-8859-2 character set. Instructions on how to configure your browser for some of the most common platforms can be read here.

Since it cannot be expected that every user will have access to a browser which allows the correct display of ISO-8859-2 encoded documents, the all-ASCII display option is available in the query form. It provides the standard two-character display of Bosnian-specific characters, as described above.

The corpus was created by scanning books and other printed material with an optical character recognizer (OCR); in some rare cases, the material obtained was already in electronic format. A few editorial alterations were made:

Some obvious misspellings were corrected.
Non-textual material, such as figures, tables, tables of contents, and layout indications, was discarded.
In some cases (references, citations in English or Russian, results of football matches, etc.), text fragments were elided. Whenever such fragments have been removed from inside running text, elision is indicated by "/.../" in the corpus, in order to prevent incoherent text sequences.
In the cases where a significant portion of the text was capitalized for stylistic reasons, we have converted it to lower case, adopting the capitalization conventions that are used in Bosnian.
The same was done in the cases where capitalization of words was used to introduce concepts or people in children's literature.
Editorial information from newspapers, such as author of the particular article, place, news agency, name of the newspaper section, editorial comments like "To be continued", and the like, were removed.
No typesetting information, such as bold, italics, etc., was included. The only exception is the cases where the authors had used spaces within words, as in h e r e, to emphasize them. Since changing this would require major editing of all files, we decided simply to warn the users, at least in the present version.

The corpus was automatically derived in CQP format from Word text-only files with meta-information as header, and from a table of contents including the correct text identifier, which was created as a Word file by Gordana Vranic.

The corpus was not manually revised after the conversion, so it is possible that some problems will appear. Please report any such problems, as well as general problems, suggestions for improvement, etc., to us.

Finally, there are a number of points users should take into consideration when querying the system. These concern the way the corpus is stored inside the Corpus Workbench itself:

The corpus is currently divided only into parts; division into paragraphs and sentences is expected in a later version.
The corpus is annotated, for each word, with the unique identifier associated with its source. The tag is called ori CQP-internally.
Later on, we hope to be able to display, for each concordance line, the identification of its source. For the moment we only have the possibility to restrict the query in relation to the identification of its source.
Capitals and non-capitals are distinctly encoded.
Punctuation marks are encoded as distinct tokens, so that one can look, for example, for words followed by comma.
In order to distinguish between opening and closing quotes, which have no distinct encoding in ISO-8859-2, these punctuation marks are internally encoded as bq and eq respectively. This allows a user to search for bq. The output, however, will display them as ordinary double quotes. The same is true for single quotes, which are encoded as bsq and esq.
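
Two of the points above, punctuation stored as separate tokens and quotes stored internally as bq, eq, bsq and esq, can be pictured with the following Python sketch. It only mimics the behaviour described here on an invented token list; the helper names words_before and display are hypothetical, and the way the corpus itself was tokenized is not shown.

```python
# Internal quote tokens and the characters shown in the output
DISPLAY = {"bq": '"', "eq": '"', "bsq": "'", "esq": "'"}

def words_before(tokens, mark=","):
    """Return every token that is immediately followed by `mark`."""
    return [tokens[i] for i in range(len(tokens) - 1) if tokens[i + 1] == mark]

def display(tokens):
    """Render tokens for output, showing bq/eq/bsq/esq as ordinary quotes."""
    return " ".join(DISPLAY.get(t, t) for t in tokens)

# Invented token sequence (assumption: not corpus text)
tokens = ["sebi", ",", "bq", "da", "se", "eq", "."]

print(words_before(tokens))  # ['sebi']
print("bq" in tokens)        # True: an opening quote can be searched for
print(display(tokens))       # sebi , " da se " .
```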

Information on the search interface

The current search interface allows you

to specify the query in CQP-style,
to choose whether the output is produced in Latin 2 or lower ASCII
to choose the amount of context displayed in the concordance
to select a random sample of the matches (0 means that no sampling will be done)
to select the kind of output required (concordance, distribution of forms, distribution of sources, concordance plus either distribution).

The output is returned with an indication of the query issued by the user, the date, and the number of matches.

If there was at least one match and a concordance was requested, the number of instances found and the number of instances that will actually be displayed are shown, followed by the instances themselves, with the actual match emphasized. If a distribution was requested, it is output in a simple table format, in decreasing order of frequency.
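
As an illustration of the two kinds of output, here is a minimal Python sketch of a KWIC line and of a distribution of forms. It is not the code of our interface: the function names, the angle brackets used to mark the match, and the toy data are assumptions made for the example.

```python
from collections import Counter

def kwic_line(tokens, start, end, context_chars=40):
    """Format one concordance line, marking the match with < >."""
    left_text = " ".join(tokens[:start])
    right_text = " ".join(tokens[end:])
    # Keep at most `context_chars` characters on each side of the match
    left = left_text[max(0, len(left_text) - context_chars):]
    right = right_text[:context_chars]
    match = " ".join(tokens[start:end])
    return f"{left} <{match}> {right}".strip()

def form_distribution(tokens, hits):
    """Count the matched forms, most frequent first."""
    counts = Counter(" ".join(tokens[s:e]) for s, e in hits)
    return counts.most_common()

# Toy data: the hits are the (start, end) spans of a query such as "kak.*"
tokens = ["kako", "se", "kakav", "da", "kako"]
hits = [(0, 1), (2, 3), (4, 5)]

for s, e in hits:
    print(kwic_line(tokens, s, e, context_chars=7))
print(form_distribution(tokens, hits))  # [('kako', 2), ('kakav', 1)]
```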

In some cases, warning or help messages are issued. The latter are meant to give some help to a first-time user. For example,

Don't ask for a distribution of forms when the query corresponds to only ONE form
Don't use * instead of .* (a* means a number of a's, not a followed by something else: for that you have to write a.*)
Don't ask for within X when X is not a valid structural attribute
Don't use spaces inside tokens. If you want to look for two words you have to enclose them inside quotes
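
The difference between * and .* is the same in most regular-expression dialects, so it can be tried out in Python. The sketch below uses Python's re module, not CQP, and a made-up word list; re.fullmatch is used because a query expression has to match the whole word.

```python
import re

def matching(pattern, words):
    """Return the words that `pattern` matches as a whole."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["a", "aa", "ako", "ali", "da"]

print(matching("a*", words))   # ['a', 'aa']
print(matching("a.*", words))  # ['a', 'aa', 'ako', 'ali']
```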

Limitations

In order to prevent users from downloading the whole texts onto their own machines, the following restrictions were implemented:

You are not allowed to request a context larger than 500 characters. No matter how large the number entered, the maximum context you'll see will be 500 characters long.
You are not allowed to search in sequences for more than 2 paragraphs. So, even if you send a query having e.g. the request within 3 p, it will be changed to within 2 p by the program.
Not applicable since the corpus is not structured in paragraphs yet.
You are not allowed to get sequences longer than 200 words (from the beginning of the search expression to the end). This means that even if you send a query containing the request within 2500, it will be changed to within 200 by the program.
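
The effect of these restrictions on a request can be sketched in Python as follows. The limits are the ones stated above, but the function names and the regular expression are our own illustration and should not be taken as the code that actually runs on the server.

```python
import re

MAX_CONTEXT_CHARS = 500     # largest context that will be shown
MAX_WITHIN_TOKENS = 200     # largest N accepted in "within N"
MAX_WITHIN_PARAGRAPHS = 2   # largest N accepted in "within N p"

def clamp_context(requested: int) -> int:
    """Limit the requested context size to the allowed range."""
    return min(max(requested, 0), MAX_CONTEXT_CHARS)

def clamp_within(query: str) -> str:
    """Lower the number in a trailing 'within N' or 'within N p' clause."""
    def fix(m):
        n, unit = int(m.group(1)), m.group(2)
        if unit is None:
            limit = MAX_WITHIN_TOKENS
        elif unit == "p":
            limit = MAX_WITHIN_PARAGRAPHS
        else:
            return m.group(0)  # other structural attributes are left alone
        clause = f"within {min(n, limit)}"
        return clause + (f" {unit}" if unit else "")
    return re.sub(r"within\s+(\d+)(?:\s+([A-Za-z_]\w*))?\s*$", fix, query)

print(clamp_context(100000))                      # 500
print(clamp_within('"da" []* "se" within 2500'))  # "da" []* "se" within 200
print(clamp_within('"u" []* "u" within 3 p'))     # "u" []* "u" within 2 p
```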

Comparison with using CQP directly

Compared with using CQP on your own machine, in addition to slower performance, some features are missing, most significantly:

the use of subcorpora
the showing of a larger context if necessary

The restrictions described above do not hold if you have direct access to CQP and the corpus on your machine.

However, the display of the source identification, together with each example, is an improvement relative to the CQP and Xkwic programs.

Planned improvements

In the future, we plan to add the following capabilities to the Web interface:

the possibility of sorting the concordance according to several different criteria (now it is displayed in corpus order, or in a random fashion)
the possibility to issue case-insensitive queries
the possibility to have cross-distribution, i.e., have distribution of forms distributed by source
the possibility to have a relative, instead of absolute, distribution, possibly also weighted by the amount of text in different text types.
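
To make the last item more concrete, a relative distribution could be computed by dividing the number of matches in each text type by the number of words of that type. The Python sketch below shows one possible way of doing this; the function name is invented and all figures are placeholders, not actual counts from the corpus.

```python
def relative_distribution(hits_by_type, words_by_type, per=1_000_000):
    """Matches per `per` words for each text type."""
    return {
        text_type: hits_by_type.get(text_type, 0) * per / size
        for text_type, size in words_by_type.items()
    }

# Placeholder figures (assumption: NOT real corpus counts)
hits_by_type = {"fiction": 30, "newspapers": 10}
words_by_type = {"fiction": 600_000, "newspapers": 100_000}

print(relative_distribution(hits_by_type, words_by_type))
# {'fiction': 50.0, 'newspapers': 100.0}
```

With these placeholder numbers the form is three times as frequent in fiction in absolute terms, but twice as frequent in newspapers once the amount of text is taken into account.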

Suggestions for other capabilities, as well as constructive complaints, are always welcome.

Available publications on the corpus

Browne 98: Browne, Wayles. Agreement with infinitive subjects in Slavic; with a note on Corbett's notion of `real distance'. (Paper given at workshop on Comparative Slavic Morphosyntax, Bloomington, Indiana, 5-7 June 1998)
Jakopin 99: Jakopin, Primoz. Upper bound of entropy in Slovenian literary texts (paper written in Slovenian; English abstract here). Ph.D. thesis, Ljubljana University.
Leko 98a: Leko, Nedzad. Compiling word frequency lists: problems of homonymy. Ms. University of Sarajevo and University of Oslo.
Leko 98b: Leko, Nedzad. Some lexical doublets in the Oslo Corpus of Bosnian Texts: A comparison with a previous study of doublets. Ms. University of Sarajevo and University of Oslo.
Leko 98c: Leko, Nedzad. Some problems in compiling a frequency dictionary from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
Leko 98d: Leko, Nedzad. Polarity Items in Bosnian. Ms. University of Sarajevo and University of Oslo.
Leko 98e: Leko, Nedzad. Recent changes in the Bosnian language as reflected by and documented from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
Santos 98: Santos, Diana. Providing access to language resources through the World Wide Web: the Oslo Corpus of Bosnian Texts. Proceedings of The First International Conference on Language Resources and Evaluation (Granada, 28-30 May 1998), rtf
Szucsich 2002: Szucsich, Luka. Nominale Adverbiale im Russischen. Syntax, Semantik und Informationsstruktur. Otto Sagner Verlag: München (Munich).
Hellman 2005: Hellman, Matias. Znati and um(j)eti in Serbian, Croatian and Bosnian. Grammaticalisation of Habitual Auxiliaries. Slavica Helsingiensia 25. PDF

We would like to know about any further publications using material from our corpus, and eventually make them available from this page.

Version

This is Version 1.1 of the corpus and Version 2.1 of the interface, released on 20 April 1998.

Acknowledgement

We gratefully acknowledge Helge Hauglin's help in debugging CGI programs, Kjetil Rå Hauge's information on fonts and general feedback from an informed user's perspective, and the people at the University of Stuttgart for general technical support concerning CQP.

Our largest debt goes to Nedzad Leko, who was an enthusiastic first user, and provided us with documentation, feedback, and the frequency lists, as well as with the first papers using our corpus.

How to contact us

In Bosnian, please contact Professor Svein Mønnesland, svein.monnesland@ilos.uio.no:

Svein Mønnesland
ILOS
Postboks 1003 Blindern
0315 Oslo
Norway

+47-90918960

In English, you can contact the Text Laboratory by sending mail to tekstlab-post@iln.uio.no. You can also look at the Text Laboratory's home page for more detailed information.

Last modified in 2023 by KH.