-
Notifications
You must be signed in to change notification settings - Fork 3
/
Pirinen-2009-fsmnlp.html
1524 lines (1372 loc) · 108 KB
/
Pirinen-2009-fsmnlp.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html><html lang="en">
<head>
<title>Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009.</title>
<!--Generated on Fri Oct 13 18:33:09 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last modification: October 13, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Weighting Finite-State Morphological Analyzers
using <span class="ltx_text ltx_font_smallcaps">HFST</span> Tools
<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>The official publication was in the Proceedings of
FSMNLP 2009.</span></span></span>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi Pirinen
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Helsinki, Finland
<br class="ltx_break">{krister.linden,tommi.pirinen}@helsinki.fi
<br class="ltx_break">
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last modification: October 13, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">In a language with very productive compounding and a rich
inflectional system, e.g. Finnish, new words are to a large extent
formed by compounding. In order to disambiguate between the possible
compound segmentations, a probabilistic strategy has been found
effective by Lindén and Pirinen <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="" class="ltx_ref">7</a>]</cite>. In this
article, we present a method for implementing the probabilistic
framework as a separate process which can be combined through
composition with a lexical transducer to create a weighted
morphological analyzer. To implement the analyzer, we use the
<span class="ltx_text ltx_font_smallcaps">HFST-LexC</span> and related command line tools which are part of
the open source <em class="ltx_emph">Helsinki Finite-State Technology</em> package.
Using Finnish as a test language, we show how to use the weighted
finite-state lexicon for building a simple unigram tagger with 97 %
precision for Finnish words and word segments belonging to the
vocabulary of the lexicon.
</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">In English the received wisdom is that traditional morphological
analysis is too complex for statistical taggers to deal with; a
simplified tagging scheme is needed. The disambiguation accuracy will
otherwise be too low even with an n-gram tagger because there is not
enough training material. However, currently training material for
morphological disambiguators is abundantly available. At the same
time, one could argue that the interest in tagging has disappeared,
because we can do more complex things such as syntactic dependency
analysis and get the morphological disambiguation as a side effect. As
a matter of curiosity, we will still pursue statistical tagging,
because there is also the initial result often attributed to Ken
Church that approximately 90 % of the readings in English will be
correct if one simply gives each word its most frequent
morphosyntactic tag. We wish to derive a similar baseline for Finnish.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">In addition, a morphologically complex language like Finnish is
different from English. In English there are hardly any inflectional
endings and applying traditional morphological analysis to English
necessarily creates massive ambiguity that can only be resolved by
context, whereas morphologically complex languages like Finnish in
each word most often carry the morphemes referred to by the
morphological tags. As the morphological tags have a physical
correspondence in the strings, it should be possible to use much less
context, or perhaps none at all, to disambiguate the traditional
morphological analysis of languages like Finnish. After all, the
reduced tag sets of English statistical taggers can be viewed as an
attempt to simplify the tag set to refer only to the visible surface
morphemes in a locally constrained context.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">There are some initial encouraging results by Lindén and Pirinen
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="" class="ltx_ref">7</a>]</cite> for disambiguating Finnish compounds using
unigram statistics for the parts in a productive compound process.
Unigram statistics for compounds is essentially the same as taking the
most likely morpheme segmentation and the most frequent reading of
each compound word. Similar results for disambiguating compounds using
a slightly different basis for estimating the probabilities have been
demonstrated for German by Schiller <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib11" title="" class="ltx_ref">11</a>]</cite> and by Marek
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib9" title="" class="ltx_ref">9</a>]</cite>. These results further encourage us to pursue the
topic of full morphological tagging for a complex language like
Finnish using only a lexicon and unigram statistics for the words and
their compound parts.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">In <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="" class="ltx_ref">7</a>]</cite>, Lindén and Pirinen suggest a method which
essentially requires the building of a full form lexicon and an
estimate for each separate word form. This is not particularly
convenient, instead we introduce a simplified way to weight the
different parts of the lexicon with frequency data from a corpus by
using weighted finite-state transducer calculus. We use the open
source software tools of
<span class="ltx_text ltx_font_smallcaps">HFST<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><a href="https://hfst.sourceforge.net" title="" class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright">hfst.sourceforge.net</a></span></span></span></span>, which contains
<span class="ltx_text ltx_font_smallcaps">HFST-LexC</span> similar to the Xerox LexC tool
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib2" title="" class="ltx_ref">2</a>]</cite>. In addition to compiling LexC-style lexicons,
<span class="ltx_text ltx_font_smallcaps">HFST-LexC</span> has a mechanism for adding weights to compound
parts and morphological analyses. The <span class="ltx_text ltx_font_smallcaps">HFST</span> tools also contain
a set of command line tools that are convenient for creating the final
weighted morphological analyzer using transducer calculus.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">We apply the weighted morphological analyzer to the task of
morphologically tagging Finnish text. As expected, it turns out that a
highly inflecting and compounding language with a free word order like
Finnish solves many of its linguistic ambiguities during word
formation. This pays back in the form of 97 % tagger precision using
only a very simple unigram tagger in the form of a weighted
morphological lexicon for the words and word parts that are in the
lexicon. For words that contain unknown parts, the lexicalized
strategy is, however, rather toothless. For such words it seems, we
may, after all, need a traditional guesser and n-gram statistics for
morphological disambiguation.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">The remainder of the article is structured as follows. In
Section <a href="#S2" title="2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, we briefly present some aspects of Finnish
morphology that may be problematic for statistical tagging. In
Section <a href="#S3" title="3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>, we introduce the probabilistic formulation of how
to weight lexical entries. In Section <a href="#S4" title="4 Data Sets ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, we introduce the
test and training corpora. In Section <a href="#S5" title="5 Tests and Results ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, we evaluate the
weighted lexicon on tagging Finnish text. Finally, in
Sections <a href="#S6" title="6 Discussion and Further Research ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a> and <a href="#S7" title="7 Conclusions ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>, we discuss the results and draw
the conclusions.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Finnish Morphology</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">We present some aspects of Finnish inflectional and compounding
morphology that may be problematic for statistical tagging in
Sections <a href="#S2.SS1" title="2.1 Inflection in Finnish ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a> and <a href="#S2.SS2" title="2.2 Compounding in Finnish ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2</span></a>. For a more thorough
introduction to Finnish morphology, see Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib5" title="" class="ltx_ref">5</a>]</cite>,
and for an implementation of computational morphology, see Koskenniemi
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="" class="ltx_ref">6</a>]</cite>. In Section <a href="#S2.SS2" title="2.2 Compounding in Finnish ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2</span></a>, we present an outline
of how to implement the morphology in sublexicons which are useful for
weighting.</p>
</div>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Inflection in Finnish</h3>
<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">In Finnish morphology, the inflection of typical nouns produces
several thousands of forms for the productive inflection. E.g. a noun
has more than 12 cases in singular and plural as well as possessive
suffixes and clitic particles resulting in more than 2000 forms for
every noun.</p>
</div>
<div id="S2.SS1.p2" class="ltx_para">
<p class="ltx_p">Mostly the traditional linguistically motivated morphological analysis
of Finnish is based on visible morphemes. However, for illustrational
purposes we will discuss two prototypical cases where the analysis
needs context. One such case is where a possessive suffix overrides
the case ending to create ambiguity: <span class="ltx_text ltx_font_italic">taloni</span> ’my house/of my
house/my houses’, i.e. either <span class="ltx_text ltx_font_italic">talo</span> ’house’ nominative singular,
<span class="ltx_text ltx_font_italic">talon</span> ’of the house’ genitive singular or <span class="ltx_text ltx_font_italic">talot</span> ’houses’
nominative plural followed by a possessive suffix. This ambiguity is
systematic, so either the distinctions can be left out or one can
create a complex underspecified tag <span class="ltx_text ltx_font_italic">+Sg+Nom/+Sg+Gen/+Pl+Nom</span> for
this case.</p>
</div>
<div id="S2.SS1.p3" class="ltx_para">
<p class="ltx_p">Another case, which is common in most languages, is the distinction
between nouns or adjectives and participles of verbs. This often
affects the choice of baseform for the word, i.e. the baseform of
’writing’ is either a verb such as ’write’ or a noun such as
’writing’. In Finnish, we have words like <span class="ltx_text ltx_font_italic">taitava</span> ’skillful
Adjective’ or ’know Verb Present Participle’ and <span class="ltx_text ltx_font_italic">kokenut</span>
’experienced Adjective’ or ’experience Verb Past Participle’. Since
the two readings have different baseforms, it is not possible to
defer the ambiguity to be resolved later by using underspecification.
In some cases, one of the forms is rare and can perhaps be ignored
with a minimal loss of information, but sometimes both occur regularly
and in overlapping contexts, in which case both forms should be
postulated and eventually disambiguated. However, sufficient
information for doing this reliably may not be available before some
degree of syntactic or semantic analysis.</p>
</div>
<div id="S2.SS1.p4" class="ltx_para">
<p class="ltx_p">In Sections <a href="#S5" title="5 Tests and Results ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> and <a href="#S6" title="6 Discussion and Further Research ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>, we will return to the
significance of these problems in Finnish and their impact on the
morphological disambiguation.</p>
</div>
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>Compounding in Finnish</h3>
<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">Finnish compounding theoretically allows nominal compounds of
arbitrary length to be created from initial parts of certain noun
forms. The final part may be inflected in all possible forms.</p>
</div>
<div id="S2.SS2.p2" class="ltx_para">
<p class="ltx_p">Normal inflected Finnish noun compounds correspond to prepositional
phrases in English, e.g. <span class="ltx_text ltx_font_italic">ostoskeskuksessa</span> ’in the shopping
center’. The morphological analysis in Finnish of the previous phrase
into <span class="ltx_text ltx_font_italic">ostos#keskus+N+Sg+Ine</span> corresponds in English to noun
chunking and case analysis into ’shopping center +N+Sg+Loc:In’.</p>
</div>
<div id="S2.SS2.p3" class="ltx_para">
<p class="ltx_p">In extreme cases, such as the compounds describing ancestors, nouns
are compounded from zero or more of <em class="ltx_emph">isän</em> ‘father
<span class="ltx_text ltx_font_smallcaps">singular genitive</span>’ and <em class="ltx_emph">äidin</em> ‘mother <span class="ltx_text ltx_font_smallcaps">singular
genitive</span>’ and then one of the inflected forms of <em class="ltx_emph">isä</em> or
<em class="ltx_emph">äiti</em> creating forms such as <em class="ltx_emph">äidinisälle</em> ‘to (maternal)
grandfather’ or <em class="ltx_emph">isänisänisänisä</em> ‘great great grandfather’. As
for the potential ambiguity, Finnish also has the noun <em class="ltx_emph">nisä</em>
‘udder’, which creates ambiguity for any paternal grandfather,
e.g. <em class="ltx_emph">isän#isän#isän#isä</em>, <em class="ltx_emph">isän#isä#nisän#isä</em>,
<em class="ltx_emph">isä#nisä#nisä#nisä</em>, …</p>
</div>
<div id="S2.SS2.p4" class="ltx_para">
<p class="ltx_p">Finnish compounding also includes forms of compounding where all parts
of the word are inflected in the same form, but this is limited to a
small fraction of adjective initial compounds and to the numbers if
they are spelled out with letters. In addition, some inflected verb
forms may appear as parts of compounds. These are much more rare than
nominal compounds <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib4" title="" class="ltx_ref">4</a>]</cite> so they do not interfere with the
regular compounding.</p>
</div>
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Finnish Computational Morphology</h3>
<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">Pirinen <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib10" title="" class="ltx_ref">10</a>]</cite> presented an open source implementation of
a finite state morphological analyzer for Finnish, which has been
reimplemented with the <span class="ltx_text ltx_font_smallcaps">HFST</span> tools and extended with data
collected and classified by Listenmaa <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib8" title="" class="ltx_ref">8</a>]</cite>. We use the
reimplemented and extended version as our unweighted lexicon.
Pirinen’s analyzer has a fully productive noun compounding
mechanism. Fully productive noun compounding means that it allows
compounds of arbitrary length with any combination of nominative
singulars, genitive singulars, or genitive plurals in the initial part
and any inflected form of a noun as the final part.</p>
</div>
<div id="S2.SS3.p2" class="ltx_para">
<p class="ltx_p">The morphotactic combination of morphemes is achieved by combining
sublexicons as defined in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib2" title="" class="ltx_ref">2</a>]</cite>. We use the open source
software called <span class="ltx_text ltx_font_smallcaps">HFST-LexC</span> with a similar interface as the
Xerox LexC tool. The interested reader is referred to
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib2" title="" class="ltx_ref">2</a>]</cite> for an exposition of the LexC syntax. The
<span class="ltx_text ltx_font_smallcaps">HFST-LexC</span> tool extends the syntax with support for adding
weights on the lexical entries.</p>
</div>
<div id="S2.SS3.p3" class="ltx_para">
<p class="ltx_p">We note that the noun compounding can be decomposed into two
concatenatable lexicons separated by a word boundary marker, i.e. any
number of noun prefixes <em class="ltx_emph">CompoundNonFinalNoun</em><math id="S2.SS3.p3.m1" class="ltx_Math" alttext="{}^{*}" display="inline"><msup><mi></mi><mo>*</mo></msup></math> in
Figure <a href="#S2.F1" title="Figure 1 ‣ 2.3 Finnish Computational Morphology ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> separated by ’#’ and from the inflected
noun forms <em class="ltx_emph">CompoundFinalNoun</em> in
Figure <a href="#S2.F2" title="Figure 2 ‣ 2.3 Finnish Computational Morphology ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. Similar decompositions can be achieved
for other parts of speech as needed. For a further discussion of the
structure of the lexicon, see <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="" class="ltx_ref">7</a>]</cite>.</p>
</div>
<figure id="S2.F1" class="ltx_figure"><pre class="ltx_verbatim ltx_centering ltx_font_typewriter" style="font-size:70%;">
LEXICON Root
## CompoundNonFinalNoun ;
## #;
LEXICON Compound
#:0 CompoundNonFinalNoun;
#:0 #;
LEXICON CompoundNonFinalNoun
isä Compound "weight: 0, gloss: father" ;
isän Compound "weight: 0, gloss: father's" ;
äiti Compound "weight: 0, gloss: mother" ;
äidin Compound "weight: 0, gloss: mother's" ;
</pre>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Unweighted fragment for
{<em class="ltx_emph">CompoundNonFinalNoun</em>}<math id="S2.F1.m2" class="ltx_Math" alttext="{}^{*}" display="inline"><msup><mi></mi><mo>*</mo></msup></math> i.e. <em class="ltx_emph">noun
prefixes</em>.</figcaption>
</figure>
<figure id="S2.F2" class="ltx_figure"><pre class="ltx_verbatim ltx_centering ltx_font_typewriter" style="font-size:70%;">
LEXICON Root
CompoundFinalNoun ;
LEXICON CompoundFinalNoun
isä:isä+sg+nom ## "weight: 0, gloss: father" ;
isän:isä+sg+gen ## "weight: 0, gloss: father's" ;
isälle:isä+sg+all ## "weight: 0, gloss: to the father" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Unweighted fragment for <em class="ltx_emph">CompoundFinalNoun</em>, i.e.
<em class="ltx_emph">noun forms</em>.</figcaption>
</figure>
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Methodology</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Assume that we want to know the probability of a morphological
analysis with a morpheme segmentation <em class="ltx_emph">A</em> given the token
<em class="ltx_emph">a</em>, i.e. <math id="S3.p1.m1" class="ltx_Math" alttext="\mathrm{P}(A|a)" display="inline"><mrow><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">|</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow></math>. According to Bayes&#8217; rule, we get
Equation <a href="#S3.E1" title="(1) ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<div id="S3.p2" class="ltx_para">
<table id="S3.E1" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E1.m1" class="ltx_Math" alttext="\mathrm{P}(A|a)=\mathrm{P}(A,a)/\mathrm{P}(a)=\mathrm{P}(a|A)\mathrm{P}(A)/%
\mathrm{P}(a)" display="block"><mrow><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">|</mo><mi>a</mi><mo stretchy="false">)</mo></mrow><mo>=</mo><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo>,</mo><mi>a</mi><mo stretchy="false">)</mo></mrow><mo>/</mo><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow><mo>=</mo><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">|</mo><mi>A</mi><mo stretchy="false">)</mo></mrow><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">)</mo></mrow><mo>/</mo><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">We wish to retain only the most likely analysis and its segmentation
<em class="ltx_emph">A</em>. We know that <math id="S3.p3.m1" class="ltx_Math" alttext="\mathrm{P}(a|A)" display="inline"><mrow><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">|</mo><mi>A</mi><mo stretchy="false">)</mo></mrow></mrow></math> is almost always 1, i.e. a
word form is known when its analysis is given. Additionally,
<em class="ltx_emph">P(a)</em> is constant during the maximization, so the expression
simplifies to finding the most likely global analysis <em class="ltx_emph">A</em> as
shown by Equation <a href="#S3.E2" title="(2) ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, i.e. we only need to estimate the
output language model.</p>
</div>
<div id="S3.p4" class="ltx_para">
<table id="S3.E2" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E2.m1" class="ltx_Math" alttext="\arg\max_{A}\mathrm{P}(A|a)=\arg\max_{A}\mathrm{P}(a|A)\mathrm{P}(A)/\mathrm{P%
}(a)=\arg\max_{A}\mathrm{P}(A)" display="block"><mrow><mi>arg</mi><munder><mi>max</mi><mi>A</mi></munder><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">|</mo><mi>a</mi><mo stretchy="false">)</mo></mrow><mo>=</mo><mi>arg</mi><munder><mi>max</mi><mi>A</mi></munder><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">|</mo><mi>A</mi><mo stretchy="false">)</mo></mrow><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">)</mo></mrow><mo>/</mo><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow><mo>=</mo><mi>arg</mi><munder><mi>max</mi><mi>A</mi></munder><mi mathvariant="normal">P</mi><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr>
</table>
</div>
<div id="S3.p5" class="ltx_para">
<p class="ltx_p">In order to find the most likely segmentation of <em class="ltx_emph">A</em>, we can make
the additional assumption that the probability <em class="ltx_emph">P(A)</em> is
proportional to the product of the probabilities <math id="S3.p5.m1" class="ltx_Math" alttext="\mathrm{P}(s_{i})" display="inline"><mrow><mi mathvariant="normal">P</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi>s</mi><mi>i</mi></msub><mo stretchy="false">)</mo></mrow></mrow></math> of
the segments of <em class="ltx_emph">A</em>, where <math id="S3.p5.m2" class="ltx_Math" alttext="A=s_{1}s_{2}...s_{n}" display="inline"><mrow><mi>A</mi><mo>=</mo><mrow><msub><mi>s</mi><mn>1</mn></msub><mo></mo><msub><mi>s</mi><mn>2</mn></msub><mo></mo><mi mathvariant="normal">…</mi><mo></mo><msub><mi>s</mi><mi>n</mi></msub></mrow></mrow></math>, defined by
Equation <a href="#S3.E3" title="(3) ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. This assumption based on a unigram
language model of compounding has been demonstrated by Lindén and
Pirinen <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="" class="ltx_ref">7</a>]</cite> to work well in practice.</p>
</div>
<div id="S3.p6" class="ltx_para">
<table id="S3.E3" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E3.m1" class="ltx_Math" alttext="\mathrm{P}(A)\propto\prod_{s_{i}}\mathrm{P}(s_{i})" display="block"><mrow><mrow><mi mathvariant="normal">P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>A</mi><mo stretchy="false">)</mo></mrow></mrow><mo>∝</mo><mrow><munder><mo largeop="true" movablelimits="false" symmetric="true">∏</mo><msub><mi>s</mi><mi>i</mi></msub></munder><mrow><mi mathvariant="normal">P</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi>s</mi><mi>i</mi></msub><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td>
</tr>
</table>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>Estimating probabilities</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The estimated probability of a token, <em class="ltx_emph">a</em>, to occur in the corpus
is proportional to the count, <em class="ltx_emph">c(a)</em>, divided by the corpus size,
<em class="ltx_emph">cs</em>. The probability <em class="ltx_emph">p(a)</em> of a token in the corpus is
defined by Equation <a href="#S3.E4" title="(4) ‣ 3.1 Estimating probabilities ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>. We also note that the corpus
estimate for <em class="ltx_emph">p(a)</em> is in fact an estimate of the sum of the
probabilities of all the possible analyses and segmentations of
<em class="ltx_emph">a</em> in the corpus.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<table id="S3.E4" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E4.m1" class="ltx_Math" alttext="\mathrm{p}(a)=\mathrm{c}(a)/\mathrm{cs}" display="block"><mrow><mrow><mi mathvariant="normal">p</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mrow><mi mathvariant="normal">c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow><mo>/</mo><mi>cs</mi></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td>
</tr>
</table>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">Tokens <em class="ltx_emph">x</em> known to the original lexicon but unseen in the corpus
need to be assigned a small probability mass different from 0, so they
get <em class="ltx_emph">c(x) = 1</em>, i.e. we define the count of a token as its corpus
frequency plus 1 as in Equation <a href="#S3.E5" title="(5) ‣ 3.1 Estimating probabilities ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, also known as Laplace
smoothing.</p>
</div>
<div id="S3.SS1.p4" class="ltx_para">
<table id="S3.E5" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E5.m1" class="ltx_Math" alttext="\mathrm{c}(a)=1+\mathrm{frequency}(a)" display="block"><mrow><mrow><mi mathvariant="normal">c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mn>1</mn><mo>+</mo><mrow><mi>frequency</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td>
</tr>
</table>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Weighting the Lexicon</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">In order to use the probabilities as weights in the lexicon, we
implement them in the tropical semiring, which means that we use the
negative log-probabilities as defined by Equation <a href="#S3.E6" title="(6) ‣ 3.2 Weighting the Lexicon ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>.</p>
</div>
<div id="S3.SS2.p2" class="ltx_para">
<table id="S3.E6" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S3.E6.m1" class="ltx_Math" alttext="\mathrm{w}(a)=-\mathrm{log}(p(a))" display="block"><mrow><mrow><mi mathvariant="normal">w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mo>-</mo><mrow><mi>log</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>p</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>a</mi><mo stretchy="false">)</mo></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td>
</tr>
</table>
</div>
<div id="S3.SS2.p3" class="ltx_para">
<p class="ltx_p">In the tropical semiring, probability multiplication corresponds to
weight addition and probability addition corresponds to weight
maximization. In <span class="ltx_text ltx_font_smallcaps">HFST-LexC</span>, we use OpenFST <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib1" title="" class="ltx_ref">1</a>]</cite> as
the software library for weighted finite-state transducers.</p>
</div>
<figure id="S3.F3" class="ltx_figure"><pre class="ltx_verbatim ltx_centering ltx_font_typewriter" style="font-size:70%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;
LEXICON Compound
0:# CompoundNonFinalNoun ;
0:# CompoundFinalNoun ;
LEXICON CompoundNonFinalNoun
isä Compound "weight: -log(c(isä)/cs)" ;
isän Compound "weight: -log(c(isän)/cs)" ;
äiti Compound "weight: -log(c(äiti)/cs)" ;
äidin Compound "weight: -log(c(äidin)/cs)" ;
LEXICON CompoundFinalNoun
isä+sg+nom ## "weight:-log(c(isä+sg+nom)/cs)" ;
isä+sg+gen ## "weight:-log(c(isä+sg+gen)/cs)" ;
isä+sg+all ## "weight:-log(c(isä+sg+all)/cs)" ;
isä+pl+ins ## "weight:-log(c(isä+pl+ins)/cs)" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Structure weighting scheme using token penalties on the
output language. Note that the functions in the comment field are
placeholders for the actual weights.</figcaption>
</figure>
<div id="S3.SS2.p4" class="ltx_para">
<p class="ltx_p">For short, we call our unweighted compounding lexicon, <em class="ltx_emph">Lex</em>, and
the decomposed noun compounding lexicon parts, i.e. the noun prefixes
<em class="ltx_emph">CompoundNonFinalNoun</em><math id="S3.SS2.p4.m1" class="ltx_Math" alttext="{}^{*}" display="inline"><msup><mi></mi><mo>*</mo></msup></math> in Figure <a href="#S2.F1" title="Figure 1 ‣ 2.3 Finnish Computational Morphology ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> and
the inflected noun forms <em class="ltx_emph">CompoundFinalNoun</em> in
Figure <a href="#S2.F2" title="Figure 2 ‣ 2.3 Finnish Computational Morphology ‣ 2 Finnish Morphology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, <em class="ltx_emph">Pref</em> and <em class="ltx_emph">Final</em>,
respectively.</p>
</div>
<div id="S3.SS2.p5" class="ltx_para">
<p class="ltx_p">For an illustration of how the weighting scheme can be implemented in
the weighted output language model, <math id="S3.SS2.p5.m1" class="ltx_Math" alttext="WLex" display="inline"><mrow><mi>W</mi><mo></mo><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></math>, of the noun compounding
lexicon, see Figure <a href="#S3.F3" title="Figure 3 ‣ 3.2 Weighting the Lexicon ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. There is an obvious extension
of the weighting scheme to the output models of the decomposed
unweighted lexicons, <em class="ltx_emph">Pref</em> and <em class="ltx_emph">Final</em>. We call these
weighted output language models <em class="ltx_emph">WPref</em> and <em class="ltx_emph">WFinal</em>,
respectively.</p>
</div>
</section>
<section id="S3.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.3 </span>Back Off Model</h3>
<div id="S3.SS3.p1" class="ltx_para">
<p class="ltx_p">The original lexicon, <math id="S3.SS3.p1.m1" class="ltx_Math" alttext="Lex" display="inline"><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></math>, can be weighted by composing it with the
weighted output language, <math id="S3.SS3.p1.m2" class="ltx_Math" alttext="WLex" display="inline"><mrow><mi>W</mi><mo></mo><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></math>, as in
Equation <a href="#S3.E7" title="(7) ‣ 3.3 Back Off Model ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>. However, there are a number of word
forms and compound segments in the lexicon, for which no estimate is
available in the corpus. We wish to assign a large weight to these
forms and segments, i.e. a weight <em class="ltx_emph">M</em> which is greater than any
of the weights estimated from the corpus, e.g. <math id="S3.SS3.p1.m3" class="ltx_Math" alttext="M=log(1+\mathrm{cs})" display="inline"><mrow><mi>M</mi><mo>=</mo><mrow><mi>l</mi><mo></mo><mi>o</mi><mo></mo><mi>g</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mn>1</mn><mo>+</mo><mi>cs</mi></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></math>. To calculate the missing words, we first use the
homomorphism <math id="S3.SS3.p1.m4" class="ltx_Math" alttext="uw" display="inline"><mrow><mi>u</mi><mo></mo><mi>w</mi></mrow></math> to map the <math id="S3.SS3.p1.m5" class="ltx_Math" alttext="WPref" display="inline"><mrow><mi>W</mi><mo></mo><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mi>f</mi></mrow></math> to an unweighted automaton, which we
subtract from <math id="S3.SS3.p1.m6" class="ltx_Math" alttext="\Sigma^{*}" display="inline"><msup><mi mathvariant="normal">Σ</mi><mo>*</mo></msup></math> and give the output model the final weight
<math id="S3.SS3.p1.m7" class="ltx_Math" alttext="M" display="inline"><mi>M</mi></math> using the homomorphism <math id="S3.SS3.p1.m8" class="ltx_Math" alttext="mw" display="inline"><mrow><mi>m</mi><mo></mo><mi>w</mi></mrow></math>.</p>
</div>
<div id="S3.SS3.p2" class="ltx_para">
<p class="ltx_p">We create the following new sublexicons using automata difference and
composition with the original decomposed transducers in
Equations <a href="#S3.E8" title="(8) ‣ 3.3 Back Off Model ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">8</span></a> and <a href="#S3.E9" title="(9) ‣ 3.3 Back Off Model ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">9</span></a>.</p>
</div>
<div id="S3.SS3.p3" class="ltx_para">
<table id="A0.EGx1" class="ltx_equationgroup ltx_eqn_eqnarray ltx_eqn_table">
<tbody id="S3.E7"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math id="S3.E7.m1" class="ltx_Math" alttext="\displaystyle KnownAndSeenWords" display="inline"><mrow><mi>K</mi><mo></mo><mi>n</mi><mo></mo><mi>o</mi><mo></mo><mi>w</mi><mo></mo><mi>n</mi><mo></mo><mi>A</mi><mo></mo><mi>n</mi><mo></mo><mi>d</mi><mo></mo><mi>S</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>W</mi><mo></mo><mi>o</mi><mo></mo><mi>r</mi><mo></mo><mi>d</mi><mo></mo><mi>s</mi></mrow></math></td>
<td class="ltx_td ltx_align_center ltx_eqn_cell"><math id="S3.E7.m2" class="ltx_Math" alttext="\displaystyle=" display="inline"><mo>=</mo></math></td>
<td class="ltx_td ltx_align_left ltx_eqn_cell"><math id="S3.E7.m3" class="ltx_Math" alttext="\displaystyle Lex~{}o~{}WLex" display="inline"><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mpadded width="+3.3pt"><mi>x</mi></mpadded><mo></mo><mpadded width="+3.3pt"><mi>o</mi></mpadded><mo></mo><mi>W</mi><mo></mo><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td>
</tr></tbody>
<tbody id="S3.E8"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math id="S3.E8.m1" class="ltx_Math" alttext="\displaystyle MaxUnseenPref" display="inline"><mrow><mi>M</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>U</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mi>f</mi></mrow></math></td>
<td class="ltx_td ltx_align_center ltx_eqn_cell"><math id="S3.E8.m2" class="ltx_Math" alttext="\displaystyle=" display="inline"><mo>=</mo></math></td>
<td class="ltx_td ltx_align_left ltx_eqn_cell"><math id="S3.E8.m3" class="ltx_Math" alttext="\displaystyle Pref~{}o~{}(mw(\Sigma^{*}-uw(WPref)))" display="inline"><mrow><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mpadded width="+3.3pt"><mi>f</mi></mpadded><mo></mo><mpadded width="+3.3pt"><mi>o</mi></mpadded><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>m</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><msup><mi mathvariant="normal">Σ</mi><mo>*</mo></msup><mo>-</mo><mrow><mi>u</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>W</mi><mo></mo><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mi>f</mi></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(8)</span></td>
</tr></tbody>
<tbody id="S3.E9"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math id="S3.E9.m1" class="ltx_Math" alttext="\displaystyle MaxUnseenFinal" display="inline"><mrow><mi>M</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>U</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>F</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi></mrow></math></td>
<td class="ltx_td ltx_align_center ltx_eqn_cell"><math id="S3.E9.m2" class="ltx_Math" alttext="\displaystyle=" display="inline"><mo>=</mo></math></td>
<td class="ltx_td ltx_align_left ltx_eqn_cell"><math id="S3.E9.m3" class="ltx_Math" alttext="\displaystyle Final~{}o~{}(mw(\Sigma^{*}-uw(WFinal)))" display="inline"><mrow><mi>F</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mpadded width="+3.3pt"><mi>l</mi></mpadded><mo></mo><mpadded width="+3.3pt"><mi>o</mi></mpadded><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>m</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><msup><mi mathvariant="normal">Σ</mi><mo>*</mo></msup><mo>-</mo><mrow><mi>u</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>W</mi><mo></mo><mi>F</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(9)</span></td>
</tr></tbody>
</table>
</div>
<div id="S3.SS3.p4" class="ltx_para">
<p class="ltx_p">These sublexicons can be combined as specified in
Equation <a href="#S3.Ex1" title="3.3 Back Off Model ‣ 3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.3</span></a> to cover the whole of the original
lexicon.</p>
</div>
<div id="S3.SS3.p5" class="ltx_para">
<table id="A0.EGx2" class="ltx_equationgroup ltx_eqn_eqnarray ltx_eqn_table">
<tr id="S3.Ex1" class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math id="S3.Ex1.m1" class="ltx_Math" alttext="\displaystyle WeightedLexicon~{}=~{}KnownAndSeenWords~{}|~{}Pref~{}MaxUnseenFinal" display="inline"><mrow><mi>W</mi><mi>e</mi><mi>i</mi><mi>g</mi><mi>h</mi><mi>t</mi><mi>e</mi><mi>d</mi><mi>L</mi><mi>e</mi><mi>x</mi><mi>i</mi><mi>c</mi><mi>o</mi><mpadded width="+3.3pt"><mi>n</mi></mpadded><mo rspace="5.8pt">=</mo><mi>K</mi><mi>n</mi><mi>o</mi><mi>w</mi><mi>n</mi><mi>A</mi><mi>n</mi><mi>d</mi><mi>S</mi><mi>e</mi><mi>e</mi><mi>n</mi><mi>W</mi><mi>o</mi><mi>r</mi><mi>d</mi><mpadded width="+3.3pt"><mi>s</mi></mpadded><mo rspace="5.8pt" stretchy="false">|</mo><mi>P</mi><mi>r</mi><mi>e</mi><mpadded width="+3.3pt"><mi>f</mi></mpadded><mi>M</mi><mi>a</mi><mi>x</mi><mi>U</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>e</mi><mi>n</mi><mi>F</mi><mi>i</mi><mi>n</mi><mi>a</mi><mi>l</mi></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
</tr>
<tbody id="S3.E10"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math id="S3.E10.m1" class="ltx_Math" alttext="\displaystyle|~{}MaxUnseenPref~{}Final~{}|~{}MaxUnseenPref~{}MaxUnseenFinal" display="inline"><mrow><mrow><mo rspace="5.8pt" stretchy="false">|</mo><mrow><mi>M</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>U</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mpadded width="+3.3pt"><mi>f</mi></mpadded><mo></mo><mi>F</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mpadded width="+3.3pt"><mi>l</mi></mpadded></mrow><mo rspace="5.8pt" stretchy="false">|</mo></mrow><mo></mo><mi>M</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>U</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>P</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mpadded width="+3.3pt"><mi>f</mi></mpadded><mo></mo><mi>M</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>U</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mi>e</mi><mo></mo><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>F</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(10)</span></td>
</tr></tbody>
</table>
</div>
<div id="S3.SS3.p6" class="ltx_para">
<p class="ltx_p">The <math id="S3.SS3.p6.m1" class="ltx_Math" alttext="WeightedLexicon" display="inline"><mrow><mi>W</mi><mo></mo><mi>e</mi><mo></mo><mi>i</mi><mo></mo><mi>g</mi><mo></mo><mi>h</mi><mo></mo><mi>t</mi><mo></mo><mi>e</mi><mo></mo><mi>d</mi><mo></mo><mi>L</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi><mo></mo><mi>i</mi><mo></mo><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi></mrow></math> will assign the lowest corpus weight to the most
likely reading and the highest corpus weight to the most unlikely
reading of the original lexical transducer.</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Data Sets</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">As training and test data, we use a compilation of three years,
1995-1997, of daily issues of Helsingin Sanomat, which is the most
widespread Finnish newspaper. We disambiguated the corpus using
Machinese for Finnish<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>Machinese is available from Connexor
Ltd., www.connexor.com</span></span></span> which provided one reading in context for
each word using syntactic parsing. This provided us with a
mechanically derived standard and not a human controlled gold
standard.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>Training Data</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">The training data actually spanned 2.5 years with 1995 and 1996 of
equal size and 1997 only half of this. This collection contained
approximately 2.4 million different words, i.e. types, corresponding
to approximately 70 million words of Finnish, i.e. tokens, divided
into 29 million tokens for 1995, 29 for 1996 and 11 for 1997. We used
the training data to count the non-compound tokens and their analyses.</p>
</div>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Test Data</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">From the three years of training data we extracted running text from
comparable sections of the newspaper data. We chose articles from the
section reporting on general news with normal running text (as a
contrast to e.g. the economy or sports section with significant
amounts of numbers and tables). The extracted test data sets contained
118 838, 134 837 and 193 733 tokens for 1995, 1996 and 1997,
respectively. We used the test data to verify the result of the
disambiguation.</p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Baseline</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">As a baseline method, we use the training data as such to create
statistical unigram taggers as outlined in Section <a href="#S3" title="3 Methodology ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. In
Table <a href="#S4.T1" title="Table 1 ‣ 4.3 Baseline ‣ 4 Data Sets ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>, we show the baseline result for the test
data samples with a given training data tagger, the number of tokens
with 1st correct reading, the number of tokens with some other correct
reading, the number of tokens with some readings but no correct and
the number of tokens with no reading.</p>
</div>
<figure id="S4.T1" class="ltx_table">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Baseline of the tagger test data.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Train</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Test</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S4.T1.m1" class="ltx_Math" alttext="1^{st}" display="inline"><msup><mn mathsize="70%">1</mn><mrow><mi mathsize="70%">s</mi><mo></mo><mi mathsize="70%">t</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S4.T1.m2" class="ltx_Math" alttext="n^{th}" display="inline"><msup><mi mathsize="70%">n</mi><mrow><mi mathsize="70%">t</mi><mo></mo><mi mathsize="70%">h</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Comment</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Analysis (%)</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">96.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">3.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">92.2</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">4.1</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">91.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">4.6</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">91.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">3.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">4.5</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">96.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">92.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.2</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">4.1</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">89.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">3.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.5</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">6.6</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">90.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.2</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">6.2</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">96.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_b"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
</tbody>
</table>
</figure>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Tests and Results</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">We created two versions of the weighted lexicon for disambiguating
running text. One weights the lexicon using the current corpus and
tests the result using only the weighted lexicon data. The second test
adds the baseline tagger to the lexicon in order to ensure some
additional domain specific data for lack of a guesser.</p>
</div>
<section id="S5.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">5.1 </span>Lexicon-based Unigram Tagger</h3>
<div id="S5.SS1.p1" class="ltx_para">
<p class="ltx_p">We did our first tagging experiment using a full year of newspaper
articles as training data for the lexicon and testing with the test
data from the other two years. The first correct results are
consistently at 97 % of the words with some correct
analysis. However, the coverage is totally dependent on the fairly
restricted lexicon as shown in Table <a href="#S5.T2" title="Table 2 ‣ 5.1 Lexicon-based Unigram Tagger ‣ 5 Tests and Results ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. We also
include the results for testing and training on the same year as an
upper limit or reference.</p>
</div>
<figure id="S5.T2" class="ltx_table">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>Lexicon-based unigram tagger results for Finnish.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Train</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Test</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S5.T2.m1" class="ltx_Math" alttext="1^{st}" display="inline"><msup><mn mathsize="70%">1</mn><mrow><mi mathsize="70%">s</mi><mo></mo><mi mathsize="70%">t</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S5.T2.m2" class="ltx_Math" alttext="n^{th}" display="inline"><msup><mi mathsize="70%">n</mi><mrow><mi mathsize="70%">t</mi><mo></mo><mi mathsize="70%">h</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Comment</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Analysis (%)</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">68.2</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1.2</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">18.5</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.3</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">11.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.5</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">67.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">18.5</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.3</span></td>
<td class="ltx_td ltx_align_center"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">11.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.5</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">67.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">18.5</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.4</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">12.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.3</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">69.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">11.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">17.5</span></td>
<td class="ltx_td ltx_align_center ltx_border_b"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
</tbody>
</table>
</figure>
</section>
<section id="S5.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">5.2 </span>Extended Lexicon-based Unigram Tagger</h3>
<div id="S5.SS2.p1" class="ltx_para">
<p class="ltx_p">We did our second tagging experiment as the first with the addition of
using the full year of newspaper data for extending the lexicon.
Again, we tested with the test data from the other two years. The
first correct results are consistently at 98 % of the words with some
correct analysis and the coverage is now considerably better as shown
in Table <a href="#S5.T3" title="Table 3 ‣ 5.2 Extended Lexicon-based Unigram Tagger ‣ 5 Tests and Results ‣ Weighting Finite-State Morphological Analyzers using HFST Tools The official publication was in the Proceedings of FSMNLP 2009." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. We also include the results for
testing and training on the same year as an upper limit or reference.</p>
</div>
<figure id="S5.T3" class="ltx_table">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span>Extended lexicon-based unigram tagger results for Finnish.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Train</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Test</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S5.T3.m1" class="ltx_Math" alttext="1^{st}" display="inline"><msup><mn mathsize="70%">1</mn><mrow><mi mathsize="70%">s</mi><mo></mo><mi mathsize="70%">t</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t">
<span class="ltx_text" style="font-size:70%;"> </span><math id="S5.T3.m2" class="ltx_Math" alttext="n^{th}" display="inline"><msup><mi mathsize="70%">n</mi><mrow><mi mathsize="70%">t</mi><mo></mo><mi mathsize="70%">h</mi></mrow></msup></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;"> No</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Comment</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Year</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Correct (%)</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;"> Analysis (%)</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">95.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">4.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_t"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">93.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">4.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">2.0</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">93.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">4.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">2.3</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">92.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">4.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">0.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">2.2</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">96.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">93.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">1.9</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1995</span></th>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">91.6</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">4.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">1.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:70%;">3.2</span></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r"><span class="ltx_text" style="font-size:70%;">1996</span></th>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">92.1</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.9</span></td>
<td class="ltx_td ltx_align_center ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.1</span></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">1997</span></th>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">96.3</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">3.7</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:70%;">0.0</span></td>
<td class="ltx_td ltx_align_center ltx_border_b"><span class="ltx_text" style="font-size:70%;"> Max.</span></td>
</tr>
</tbody>
</table>
</figure>
</section>