[Korpus bosanskih tekstova na Univerzitetu u Oslu]
The Oslo Corpus of Bosnian Texts is a corpus of approximately
1.5 million words, encoded with the IMS Corpus Workbench
developed at the Institut für Maschinelle Sprachverarbeitung
at the University of Stuttgart, to which a suitable
interface was added at the Text Laboratory.
- Contents of the corpus
- Types of queries available
- How to get permission to use the corpus
- How to get and produce the right fonts
- Technical information
- The most frequent 1,000 wordforms
- Available publications on the corpus
- Version
- Acknowledgement
- Users of the corpus
- How to contact us
This corpus has been compiled at the University of Oslo as a joint project between the Department for East European and Oriental Studies and the Text Laboratory. The corpus contains approximately 1.5 million words, and comprises several different genres: fiction (novels and short stories), essays, children's stories, folklore, Islamic texts, legal texts, and newspapers
and journals. The texts, written by authors from Bosnia and
Herzegovina, have for the most part been published in the 1990s. The corpus
provides a new and different basis for research into the language of
Bosnia and Herzegovina.
The project has been supervised by assistant professor Janne Bondi
Johannessen, while professor Svein Mønnesland was
responsible for the selection and compilation of the texts. Gordana Vranic and Kemila Basic have made the texts electronically available (by scanning and adaptation) in simple text files. Diana Santos has built the corpus based on those files in the format required by the corpus tools used (see below for more information), and has also written the Web interface.
The holders of the copyrights for all the texts have kindly granted permission for the use of the texts in this corpus. In the event that a text is taken from a book, it never covers more than three quarters of that book.
For a detailed overview of the contents in terms of size and
source, see the "Sadrzzaj" (Contents) page.
When querying the corpus, one can ask for a concordance (KWIC, KeyWord In Context, style - the default), or one can ask for the distribution of the results, in terms of forms, or in terms of text source. In addition, one can, in the very same query, ask for both the concordance and (one of) the distributions.
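As an illustration of what a KWIC concordance looks like, here is a minimal Python sketch over a list of tokens. It is not the code behind the Web interface: the function name `kwic`, the square brackets marking the match, and the sample sentence are all made up for this example.

```python
def kwic(tokens, keyword, context=3):
    """Return one concordance line per occurrence of keyword.

    Each line shows up to `context` tokens on either side of the match,
    with the match itself in square brackets.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text; punctuation is kept as a separate token.
tokens = "ja živim u Sarajevu , a ti živiš u Oslu".split()
for line in kwic(tokens, "u", context=2):
    print(line)
```

The `context` parameter plays the same role as the amount of context one can choose in the query form.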
Even though we plan to provide - eventually - a simpler and fully menu-based query form, for the moment we rely almost completely on the CQP query syntax.
It allows one to express in a compact way quite complex choices, using regular expressions.
Examples of Bosnian queries are:
Note that, in addition to formal properties of the
text, one can also restrict queries by parameters such as text type,
author, date, or even a particular work. For an overview of the possibilities offered by our classification of the texts, see the "Sadrzzaj" (Contents) page.
Some examples:
In order to show the results with Bosnian characters, you need
support for ISO-8859-2 on the computer on which you are running your
browser.
If the results of your search look ugly, you can
- follow the instructions to make them look better,
- or choose all-ASCII display in the query form.
If you cannot type Bosnian characters directly, you can use their octal
codes, their standard "alongations", or the corresponding
ISO-8859-1 character (Latin 1) instead. Here are the possibilities:
Bosnian | Octal codes | Alongation | Latin 1
---|---|---|---
Ć | \306 | Ch | Æ
ć | \346 | ch | æ
Č | \310 | CC | È
č | \350 | cc | è
Đ | \320 | Djj | Ð
đ | \360 | djj | ð
DŽ | D\256 | Dz | D®
dž | d\276 | dz | d¾
Š | \251 | SS | ©
š | \271 | ss | ¹
Ž | \256 | ZZ, Zh | ®
ž | \276 | zz, zh | ¾
Some examples, in each of the three notations:
- "stra\271no", "\276ivim", "\271aljivd\276ija"
- stra¹no, ¾ivim, ¹aljivd¾ija
- strassno, zzivim, ssaljivdzija
Please note that
- in order to input octal codes, you have to enclose the
words in quotes.
- in order to make the character encodings unambiguous,
we have changed the standard representation of Đ and đ
(Dj and dj) to Djj and djj. This does not apply to the display of
the results, which follows the standard. In other words, you look for
Djje in your query, but you will see Dje if you selected
all-ASCII mode.
- even if you input them as sequences of characters, the Bosnian
characters are considered to be one character long, except for dž,
which is regarded as d followed by ž. Given that "." stands for any
character in the CQP syntax, this means that, e.g., stra.no will match
strassno, but .amijskih will not match dzamijskih.
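The conversions in the table can be sketched in Python as follows. This is only an illustration of the mapping, not part of the query interface: the names `ALONGATION`, `DIGRAPHS`, `to_alongation` and `to_octal_query` are invented here, and treating title-case "Dž" like "DŽ" is an assumption, since the table lists only one capital form.

```python
# Alongations for single characters, copied from the table above.
# Where the table gives two variants (ZZ, Zh), the first one is used.
ALONGATION = {
    "Ć": "Ch", "ć": "ch",
    "Č": "CC", "č": "cc",
    "Đ": "Djj", "đ": "djj",
    "Š": "SS", "š": "ss",
    "Ž": "ZZ", "ž": "zz",
}

# The digraph is written Dz/dz rather than D + zz.
# Mapping title-case "Dž" to "Dz" is an assumption (see lead-in).
DIGRAPHS = {"DŽ": "Dz", "Dž": "Dz", "dž": "dz"}


def to_alongation(word: str) -> str:
    """Rewrite a Bosnian word using plain ASCII alongations."""
    for digraph, plain in DIGRAPHS.items():
        word = word.replace(digraph, plain)
    return "".join(ALONGATION.get(ch, ch) for ch in word)


def to_octal_query(word: str) -> str:
    """Rewrite a Bosnian word as a quoted string with octal escapes.

    The octal code is the character's ISO-8859-2 byte value.
    """
    parts = []
    for ch in word:
        if ch in ALONGATION:
            byte = ch.encode("iso-8859-2")[0]
            parts.append("\\" + format(byte, "o"))
        else:
            parts.append(ch)
    return '"' + "".join(parts) + '"'


print(to_alongation("šaljivdžija"))   # ssaljivdzija
print(to_octal_query("šaljivdžija"))  # "\271aljivd\276ija"
```

The octal form is wrapped in quotes because octal codes are only accepted inside quoted words.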
Technical information
This is a front-end to CQP, the Corpus Query Processor of the IMS
Corpus Workbench developed by Oliver Christ and Bruno Maximilian
Schulze at the Institut für Maschinelle Sprachverarbeitung at
the University of Stuttgart.
Its Frequently Asked Questions list is available at
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/OldDocus/FAQ.html.
We gratefully acknowledge permission to use
CQP for research purposes.
Those acquainted with the CQP query syntax can
use (almost) all of its power. Particular restrictions
are described below.
The corpus is encoded in the ISO-8859-2 character set. Instructions
on how to configure your browser for some of the most common platforms
can be read here.
Since it cannot be expected that every user will have access to a
browser which allows the correct display of ISO-8859-2 encoded
documents, an all-ASCII display option is available in the query
form. It uses the standard two-character display of
Bosnian-specific characters, as described above.
The corpus was created by scanning books and other printed
material with an optical character recognizer (OCR); in some rare
cases, the material obtained was already in
electronic format. A few editorial alterations were made:
- Some obvious misspellings were corrected.
- Non-textual material, such as figures, tables, tables of contents, and
layout indications, was discarded.
- In some cases (references,
citations in English or Russian, results of football matches, etc.), text fragments were elided. Whenever
such fragments have been removed from inside running text, elision is
indicated by "/.../" in the corpus, in order to prevent
incoherent text sequences.
- In the cases where a significant portion of the text was capitalized,
for stylistic reasons, we have converted to lower case, adopting the
capitalization conventions that are used in Bosnian.
- The same was done in the cases where capitalization of words was
used to introduce concepts or people in children's literature.
- Editorial information from newspapers, such as the author of the particular
article, place, news agency, name of the newspaper section, and editorial
comments like "To be continued", was removed.
- No typesetting information, such as bold, italics, etc., was included,
with the exception of the cases where the authors had used spaces
within words, as in h e r e, to emphasize them. Since changing this
would need a major editing of all files, we decided simply to warn the
users, at least in the present version.
The corpus was automatically derived in CQP format from Word text-only
files with meta-information as header, and from a table of contents
including the correct text identifier, which was created as a Word file by Gordana Vranic.
The corpus was not manually revised after the conversion, so it is
possible that some problems will appear.
Please report any such problems, as well as general problems, suggestions
for improvement, etc., to us.
Finally, there are a number of points users should take into consideration
when querying the system. These concern the way the corpus is stored inside the Corpus Workbench itself:
- The corpus is currently divided only into parts; divisions into
paragraphs and sentences are expected in a later version.
- The corpus is annotated, for each word, with the unique identifier
associated with its source. Within CQP, this tag is called ori.
Later on, we hope to be able to display, for each
concordance line, the identification of its source. For the moment we
only have the possibility to restrict the query in relation to the
identification of its source.
- Capitals and non-capitals are distinctly encoded.
- Punctuation marks are encoded as distinct tokens, so that one can
look, for example, for words followed by comma.
- In order to distinguish between opening and closing quotes, which
have no distinct encoding in ISO-8859-2, these punctuation marks are internally encoded
as bq and eq respectively. This allows a user to search for bq. The
output, however, will display them as ordinary double quotes. The same is true for single quotes, which are encoded as bsq and esq.
The current search interface allows you
- to specify the query in CQP-style,
- to choose whether the output is produced in Latin 2 or lower ASCII,
- to choose the amount of context displayed in the concordance,
- to select a random sample of the matches (0 means that no sampling will
be done),
- to select the kind of output required (concordance, distribution of forms, distribution of sources, concordance plus either distribution).
The output is returned with an indication of the query issued by the
user, the date, and the number of matches.
If there was at least one match and a concordance was requested, the number of instances found and the number of instances that will actually be displayed are shown, followed by the instances themselves, with
the actual match emphasized. If a distribution was requested, it is output in a simple table format, in decreasing order of frequency.
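A distribution of forms can be pictured with the following Python sketch, which counts the matched wordforms and lists them in decreasing order of frequency. The function name `distribution`, the alphabetical tie-breaking, and the list of matches are assumptions made for this example; the interface's actual table layout may differ.

```python
from collections import Counter


def distribution(matches):
    """Count each matched form and sort by decreasing frequency.

    Ties are broken alphabetically so that the order is deterministic.
    """
    counts = Counter(matches)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented matches, as a query might return them in corpus order.
matches = ["živim", "živiš", "živim", "živi", "živim", "živi"]
for form, count in distribution(matches):
    print(f"{count:5d}  {form}")
```

A distribution by text source would work the same way, counting source identifiers instead of wordforms.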
In some cases, warning or help messages are issued. The latter are meant to give some help to a first-time user. For example,
- Don't ask for a distribution of forms when the query corresponds to only ONE form
- Don't use * instead of .* (a* means a number of a's, not a followed by something else: for that you have to write a.*)
- Don't ask for within X when X is not a valid structural attribute
- Don't use spaces inside tokens. If you want to look for two words you have to enclose them inside quotes
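The difference between * and .* can be tried out with Python's `re` module, used here only as a stand-in: CQP's regular expression dialect is not identical to Python's, and the sketch assumes that a pattern has to match the whole token. The helper name `token_matches` is made up for this example.

```python
import re


def token_matches(pattern: str, token: str) -> bool:
    """True if the pattern matches the whole token, not just a part of it."""
    return re.fullmatch(pattern, token) is not None


print(token_matches("a*", "aaa"))   # True: a* is a run of a's
print(token_matches("a*", "ako"))   # False: k and o are not a's
print(token_matches("a.*", "ako"))  # True: a followed by anything
```

The same helper shows why a single "." covers one Bosnian character but not the two-character dž: `token_matches("stra.no", "strašno")` holds, while `token_matches(".amijskih", "džamijskih")` does not.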
Limitations
In order to prevent users from downloading the whole texts onto their own
machines, the following restrictions were implemented:
Comparison with using CQP directly
Compared with using CQP on your own machine, the Web interface is
slower and lacks some features,
most significantly:
- the use of subcorpora
- the showing of a larger context if necessary
The restrictions described above do not hold if you have
direct access to CQP and the corpus on your own machine.
However, the display of the source identification, together with each example, is an improvement relative to the CQP and Xkwic programs.
Planned improvements
In the future, we plan to add the
following capabilities to the Web interface:
- the possibility of sorting the concordance according to several different
criteria (now it is displayed in corpus order, or in a random fashion)
- the possibility to issue case-insensitive queries
- the possibility to have cross-distribution, i.e., have distribution of forms distributed by source
- the possibility to have a relative, instead of absolute, distribution, possibly also weighted by the amount of text in different text types.
Suggestions for other capabilities, as well as constructive complaints,
are always welcome.
- Browne 98
- Browne, Wayles. Agreement with infinitive subjects in Slavic; with a note on Corbett's notion of `real distance'.
(Paper given at workshop on Comparative Slavic Morphosyntax, Bloomington, Indiana, 5-7 June 1998)
- Jakopin 99
- Jakopin, Primoz. Upper bound of entropy in Slovenian literary texts (paper written in Slovenian, with an English abstract). Ph.D. thesis, Ljubljana University.
- Leko 98a
- Leko, Nedzad. Compiling word frequency lists: problems of homonymy. Ms. University of Sarajevo and University of Oslo.
- Leko 98b
- Leko, Nedzad. Some lexical doublets in the Oslo Corpus of Bosnian Texts: A comparison with a previous study of doublets. Ms. University of Sarajevo and University of Oslo.
- Leko 98c
- Leko, Nedzad. Some problems in compiling a frequency dictionary from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
- Leko 98d
- Leko, Nedzad. Polarity Items in Bosnian. Ms. University of Sarajevo and University of Oslo.
- Leko 98e
- Leko, Nedzad. Recent changes in the Bosnian language as reflected by and documented from the Oslo Corpus of Bosnian Texts. Ms. University of Sarajevo and University of Oslo.
- Santos 98
- Santos, Diana. Providing access to language
resources through the World Wide Web: the Oslo Corpus of Bosnian
Texts. Proceedings of The First International Conference on
Language Resources and Evaluation (Granada, 28-30 May 1998).
- Szucsich 2002
- Szucsich, Luka. Nominale Adverbiale im Russischen. Syntax,
Semantik und Informationsstruktur. Otto Sagner Verlag: München
(Munich).
- Hellman 2005
- Hellman, Matias. Znati and um(j)eti in Serbian, Croatian and Bosnian. Grammaticalisation of Habitual Auxiliaries. Slavica Helsingiensia 25.
We would like to know about any further publications using material
from our corpus, and would eventually like to make them available from this page.
This is Version 1.1 of the corpus and Version 2.1 of the interface,
released on 20 April 1998.
We gratefully acknowledge Helge Hauglin's help in debugging CGI
programs, Kjetil Rå Hauge's information on fonts and general feedback
from an informed user's perspective, and the people at the University
of Stuttgart for general technical support concerning CQP.
Our largest debt goes to Nedzad Leko, who was an enthusiastic first
user, and provided us with documentation, feedback, and the frequency
lists, as well as with the first papers using our corpus.
For enquiries in Bosnian, please contact Professor Svein Mønnesland (svein.monnesland@ilos.uio.no):
Svein Mønnesland
ILOS
Postboks 1003 Blindern
0315 Oslo
Norway
+47-90918960
In English, you can contact the Text Laboratory by sending mail to tekstlab-post@iln.uio.no. You can
also look at the Text Laboratory's
home page for more detailed information.
Last modified in 2023 by KH.