Punctuation marks
The tagger marks clause boundaries with clb (for example at commas and conjunctions). Complete sentence boundaries are marked with <<<.
Quotation marks and dots are attached to the word to their left or right when the character is assumed to be part of that word, e.g. when the dot belongs to an abbreviation (A. subst prop fork) or when only a single word is in quotes (<"exit">). When quotation marks enclose several words, each quotation mark is treated as a separate token: <">.
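The quotation-mark rule can be illustrated with a small sketch. This is not the tagger's own tokenizer: the function name split_quotes and the regular expression are hypothetical, only straight double quotes are handled, and the abbreviation case is left out because it would need an abbreviation lexicon.

```python
import re

# Hypothetical helper illustrating the rule described above;
# it is an assumption, not the tagger's actual implementation.
QUOTED = re.compile(r'"([^"]*)"')


def split_quotes(text):
    """Split on whitespace, keeping quotes attached to single quoted words."""
    tokens = []
    pos = 0
    for match in QUOTED.finditer(text):
        # Plain text before the quoted span.
        tokens.extend(text[pos:match.start()].split())
        inner = match.group(1).split()
        if len(inner) == 1:
            # A single quoted word keeps its quotation marks.
            tokens.append('"' + inner[0] + '"')
        else:
            # Several words: each quotation mark becomes its own token.
            tokens.append('"')
            tokens.extend(inner)
            tokens.append('"')
        pos = match.end()
    # Plain text after the last quoted span.
    tokens.extend(text[pos:].split())
    return tokens


print(split_quotes('velg "exit" eller "gå ut" her'))
```

In this sketch a single quoted word stays one token, while the quotation marks around a multi-word span come out as separate tokens.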
The tagger also tries to identify headlines. It gives each one a complete sentence boundary tag and adds the character |, which stands for headline:
<word>Været</word> 'the weather'
"<været>"
"vær" subst appell nøyt be ent
<word>blir</word> 'gets'
"<blir>"
"bli" verb pres i2 tr5 a5 pa4/til pr1 pr2
<aux1/perf_part>
<word>bedre</word> 'better'
"<bedre>"
"god" adj komp
"<|>"
'headline'
"$|" clb <overskrift> <<< ' sentence boundary'
Note that all uppercase letters are converted to lowercase, but the original text is retained between the tags <word> and </word>.
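As a rough illustration of how output in this shape could be read back, here is a minimal parsing sketch. It assumes the layout shown in the example (a <word> line, a quoted word form, then one or more reading lines), assumes the English glosses have been removed, and treats any other line as extra tags for the previous reading. The names Cohort and parse are hypothetical and not part of the tagger.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cohort:
    original: Optional[str]  # text between <word> and </word>, if present
    form: str                # lowercased word form taken from "<...>"
    readings: list = field(default_factory=list)  # (lemma, [tags]) pairs


WORD = re.compile(r'^<word>(.*)</word>$')
FORM = re.compile(r'^"<(.*)>"$')
READING = re.compile(r'^"(.+)"(?:\s+(.*))?$')


def parse(text):
    """Parse tagger-style output into a list of Cohort objects."""
    cohorts = []
    original = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if m := WORD.match(line):
            original = m.group(1)
        elif m := FORM.match(line):
            cohorts.append(Cohort(original, m.group(1)))
            original = None
        elif (m := READING.match(line)) and cohorts:
            tags = (m.group(2) or "").split()
            cohorts[-1].readings.append((m.group(1), tags))
        elif cohorts and cohorts[-1].readings:
            # Assumed continuation line: extra tags for the previous reading.
            cohorts[-1].readings[-1][1].extend(line.split())
    return cohorts


SAMPLE = '''<word>Været</word>
"<været>"
"vær" subst appell nøyt be ent
<word>blir</word>
"<blir>"
"bli" verb pres i2 tr5 a5 pa4/til pr1 pr2
<aux1/perf_part>
<word>bedre</word>
"<bedre>"
"god" adj komp
"<|>"
"$|" clb <overskrift> <<<
'''

cohorts = parse(SAMPLE)
for cohort in cohorts:
    ends_sentence = any("<<<" in tags for _, tags in cohort.readings)
    print(cohort.original, cohort.form, ends_sentence)
```

Keeping the original spelling and the lowercased form side by side makes it possible to restore the casing of the input after tagging.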
Punctuation | Word form , reading and tags |
Ellipsis | <...> , $... clb <ellipse> |
Quotation mark | <"> , $ <anf> |
Colon | <:> , $: clb <kolon> |
Comma | <,> , $, clb <komma> |
Headline | <|> , $| clb <overskrift> <<< |
Opening parenthesis | <(> , $( <parentes-beg> |
Closing parenthesis | <)> , $) <parentes-slutt> |
Period | <.> , $. clb <<< <punkt> |
Semicolon | <;> , $; clb <semi> |
Question mark | <?> , $? clb <spm> |
Dash | <-> , $- <strek> |
Exclamation mark | <!> , $! clb <<< <utrop> |
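The table can be held as a simple lookup, sketched below. The tag lists are copied from the table. The names PUNCTUATION, boundary_tag and reading_line are hypothetical, and reading_line assumes that the reading is always $ followed by the character itself, which the table shows for every row except the quotation mark.

```python
# Tags per punctuation character, copied from the table above.
PUNCTUATION = {
    "...": ["clb", "<ellipse>"],
    '"': ["<anf>"],
    ":": ["clb", "<kolon>"],
    ",": ["clb", "<komma>"],
    "|": ["clb", "<overskrift>", "<<<"],
    "(": ["<parentes-beg>"],
    ")": ["<parentes-slutt>"],
    ".": ["clb", "<<<", "<punkt>"],
    ";": ["clb", "<semi>"],
    "?": ["clb", "<spm>"],
    "-": ["<strek>"],
    "!": ["clb", "<<<", "<utrop>"],
}


def boundary_tag(char):
    """Return the strongest boundary tag for a character, or None."""
    tags = PUNCTUATION.get(char, [])
    if "<<<" in tags:
        return "<<<"
    if "clb" in tags:
        return "clb"
    return None


def reading_line(char):
    """Build a reading line, assuming the '$' + character pattern."""
    return '"$' + char + '" ' + " ".join(PUNCTUATION[char])


print(boundary_tag("."))   # complete sentence boundary
print(boundary_tag(","))   # clb only
print(reading_line("|"))
```

According to the table, parentheses, quotation marks and the dash carry no boundary tag, so boundary_tag returns None for them.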