Universitetet i Oslo
UniDigital
Syddansk universitet

 Tekstlaboratoriet

Punctuation marks

The tagger marks clause boundaries with clb (for example at commas and conjunctions). Complete sentence boundaries are additionally marked with <<<.
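
As a rough sketch of how these two markers might be used downstream, the Python below groups already-parsed tokens into sentences at <<< and lists the tokens that carry clb. The (form, tags) tuple layout, the function names, and the sample tokens are assumptions made for illustration; they are not part of the tagger itself.

```python
def split_sentences(tokens):
    """Group (form, tags) pairs into sentences, closing one at each '<<<'."""
    sentences, current = [], []
    for form, tags in tokens:
        current.append((form, tags))
        if "<<<" in tags:
            sentences.append(current)
            current = []
    if current:  # trailing tokens that never reached a '<<<'
        sentences.append(current)
    return sentences


def clause_boundaries(sentence):
    """Return the forms in one sentence whose tags include 'clb'."""
    return [form for form, tags in sentence if "clb" in tags]


# Hypothetical, hand-made input; word tags are left empty for brevity.
tokens = [
    ("været", []), ("blir", []), ("bedre", []),
    (".", ["clb", "<<<", "<punkt>"]),
    ("været", []), ("blir", []), ("bedre", []),
    (",", ["clb", "<komma>"]),
    ("blir", []), ("bedre", []),
    ("!", ["clb", "<<<", "<utrop>"]),
]

sentences = split_sentences(tokens)
boundaries = clause_boundaries(sentences[1])
```

Here the comma closes a clause but not a sentence, so it stays inside the second sentence.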

Quotation marks and dots are kept together with the word to their left or right if the character is assumed to be part of that word, e.g. when the dot is part of an abbreviation: A. subst prop fork, or when only a single word is in quotes: <"exit">. (When several words are enclosed in quotation marks, each quotation mark is split off as a separate token <">.)

The tagger also tries to identify headlines. A headline is followed by a separate token | meaning headline, which carries a complete sentence boundary tag:

 

<word>Været</word> 'the weather'
"<været>"
      "vær" subst appell nøyt be ent
<word>blir</word> 'gets'
"<blir>"
     "bli" verb pres i2 tr5 a5 pa4/til pr1 pr2 <aux1/perf_part>
<word>bedre</word> 'better'
"<bedre>"
     "god" adj komp
"<|>" 'headline'
"$|" clb <overskrift> <<< 'sentence boundary'

Note that all uppercase letters are converted to lowercase, but the original text is retained between the tags <word> and </word>.
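
The layout of the example above can be read with a small parser. The following Python is a minimal sketch, assuming the output consists of a <word> line, a quoted "<form>" line, and one or more quoted base-form lines followed by tags, exactly as in the example (the English glosses are left out because they are annotations on this page). The class name Cohort and the function name parse_cohorts are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

WORD_RE = re.compile(r"^<word>(.*)</word>$")         # original spelling
FORM_RE = re.compile(r'^"<(.*)>"$')                  # lowercased word form
READING_RE = re.compile(r'^"(.*?)"(?:\s+(.*))?$')    # base form plus tags


@dataclass
class Cohort:
    original: Optional[str]   # text between <word> and </word>, if present
    form: str                 # lowercased form from the "<...>" line
    readings: List[Tuple[str, List[str]]] = field(default_factory=list)


def parse_cohorts(text):
    """Collect one Cohort per "<form>" line, attaching the lines that follow."""
    cohorts = []
    original = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = WORD_RE.match(line)
        if m:
            original = m.group(1)
            continue
        m = FORM_RE.match(line)
        if m:
            cohorts.append(Cohort(original, m.group(1)))
            original = None
            continue
        m = READING_RE.match(line)
        if m and cohorts:
            tags = (m.group(2) or "").split()
            cohorts[-1].readings.append((m.group(1), tags))
    return cohorts


SAMPLE = """\
<word>Været</word>
"<været>"
      "vær" subst appell nøyt be ent
<word>blir</word>
"<blir>"
     "bli" verb pres i2 tr5 a5 pa4/til pr1 pr2 <aux1/perf_part>
<word>bedre</word>
"<bedre>"
     "god" adj komp
"<|>"
"$|" clb <overskrift> <<<
"""

cohorts = parse_cohorts(SAMPLE)
```

With this sketch, the capitalised spelling is available as cohorts[0].original while cohorts[0].form holds the lowercased form, and the headline token has no original text because no <word> line precedes it.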


Punctuation mark      Word form   Tag line
Ellipsis              <...>       $... clb <ellipse>
Quotation mark        <">         $" <anf>
Colon                 <:>         $: clb <kolon>
Comma                 <,>         $, clb <komma>
Headline              <|>         $| clb <overskrift> <<<
Opening parenthesis   <(>         $( <parentes-beg>
Closing parenthesis   <)>         $) <parentes-slutt>
Dot                   <.>         $. clb <<< <punkt>
Semicolon             <;>         $; clb <semi>
Question mark         <?>         $? clb <spm>
Dash                  <->         $- <strek>
Exclamation mark      <!>         $! clb <<< <utrop>
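
For quick lookups, the table above can be transcribed into a dictionary. The Python below is only a sketch: the names PUNCT_TAGS and marks_with are hypothetical, and the tag lists are copied from the table rather than taken from the tagger's source.

```python
# Tags per punctuation mark, transcribed from the table above.
PUNCT_TAGS = {
    "...": ["clb", "<ellipse>"],
    '"': ["<anf>"],
    ":": ["clb", "<kolon>"],
    ",": ["clb", "<komma>"],
    "|": ["clb", "<overskrift>", "<<<"],
    "(": ["<parentes-beg>"],
    ")": ["<parentes-slutt>"],
    ".": ["clb", "<<<", "<punkt>"],
    ";": ["clb", "<semi>"],
    "?": ["clb", "<spm>"],
    "-": ["<strek>"],
    "!": ["clb", "<<<", "<utrop>"],
}


def marks_with(tag):
    """Return, in sorted order, the marks whose tag list contains `tag`."""
    return sorted(mark for mark, tags in PUNCT_TAGS.items() if tag in tags)


# Marks listed with a complete sentence boundary.
sentence_final = marks_with("<<<")

# Marks listed without any boundary tag at all.
no_boundary = sorted(set(PUNCT_TAGS) - set(marks_with("clb")))
```

According to the table as transcribed, only the headline marker, the dot, and the exclamation mark carry <<<, while quotation marks, parentheses, and the dash carry no boundary tag.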