
<!DOCTYPE html><html>
<head>
<title>Weighted Finite-State Morphological Analysis of Finnish Inflection and Compounding. The official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017.</title>
<!--Generated on Fri Oct 13 18:35:50 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last modification: October 13, 2017.-->

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Weighted Finite-State Morphological Analysis
<br class="ltx_break">of Finnish Inflection and Compounding<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>The official
publication was in Nodalida 2009 organised in Odense,
<a href="http://beta.visl.sdu.dk/nodalida2009/" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://beta.visl.sdu.dk/nodalida2009/</a>, the electronic publication was
available at <a href="http://dspace.utlib.ee/dspace/handle/10062/9206" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://dspace.utlib.ee/dspace/handle/10062/9206</a> on
October 13, 2017.</span></span></span>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Helsinki, Finland
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">Krister.Linden@helsinki.fi</span> 
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi Pirinen
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Helsinki, Finland
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">Tommi.Pirinen@helsinki.fi</span> 
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last modification: October 13, 2017</div>

<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
    
<p class="ltx_p">Finnish has very productive compounding and a rich inflectional
system, which together cause ambiguity in the morphological segmentation of
compounds made with finite-state transducer methods. In order to
disambiguate the compound segmentations, we compare three different
strategies, which we cast in a probabilistic framework. We present a
method for implementing the probabilistic framework as part of the
building process of lexc-style morpheme sub-lexicons creating
weighted lexical transducers. To implement the structurally
disambiguating morphological analyzer, we use the <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>
tool which is part of the open source <em class="ltx_emph">Helsinki Finite-State
Technology</em>. This is the first time all three principles are cast
in a probabilistic framework and compared on the same corpus using
one tool. On our Finnish test corpus, the best method succeeds with
99.98 % precision and recall.</p>
  
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>

<div id="S1.p1" class="ltx_para">
<p class="ltx_p">In languages with productive multipart compounding, such as Finnish,
German and Swedish, approximately 9-10 % of the word tokens in a
corpus are compounds <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">hedlund/2002</span>]</cite> and approximately 2/3 of the
dictionary entries are compounds, cf. a publicly available Finnish
dictionary <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">kotus/2007</span>]</cite>.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">There have been various attempts at curbing the potential
combinatorial explosion of segmentations that a prolific compounding
mechanism produces. Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite> showed that for
Swedish the most significant factor in disambiguating compounds was
the number of parts in the analysis: the analysis with the fewest
parts was almost always the best candidate. This has later been
corroborated by others. In particular, it was the main disambiguation
criterion formulated by
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> for German compounding. In addition, Schiller used
frequency information for disambiguating between compounds with an
equal number of parts. Schiller estimated her figures from compound
part frequencies, which requires a considerable amount of manual
labour in order to create the training corpora consisting of attested
compound words and their correct segmentations.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">We suggest two modifications to the strategies of Karlsson and
Schiller. First, we suggest that the word segment probabilities can be
estimated from non-compound word frequencies in the corpus. The
motivation for our approach is that compounds are formed in order to
distinguish between instances of frequently occurring phenomena and
therefore compounds are more often formed for more frequently
discussed phenomena. We assume that the frequency by which phenomena
are discussed is reflected in the non-compound word frequencies,
i.e. high-frequency words should in general have more compounds.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">In addition, we suggest that the special penalty suggested by Karlsson
and maintained by Schiller is unnecessary when framing the problem in
a probabilistic framework. This has also been suggested by others, see
e.g. Marek <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">marek/2006</span>]</cite>. However, this is the first time the
disambiguation principles of Karlsson and of Schiller are compared
with a fully probabilistic approach on the same corpus.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">Previously, there has been no publicly available general framework that
conveniently integrates a full-fledged morphological description
with probabilities for general morphological compound
and inflectional analysis. Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite>
applied a post-processing phase to count the parts, and Schiller
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> used the proprietary weighted finite-state compiler
of Xerox <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">kempe/2003</span>]</cite>, which compiles regular expressions. We
therefore introduce the open source software tool
<span class="ltx_text ltx_font_smallcaps">hfst-lexc<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><a href="http://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstLexC" title="" class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright">http://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstLexC</a></span></span></span></span>,
which is similar to the Xerox lexc tool <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. In
addition to the fact that <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span> compiles lexc-style
lexicons, it also has a mechanism for adding weights to compound parts
and morphological analyses.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">The remainder of the article is structured as follows. In
Sections <a href="#S2" title="2 Inflection and Compounding in Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> and <a href="#S3" title="3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>, we introduce a version of
Finnish morphology for compounding. In Section <a href="#S4" title="4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, we
introduce the probabilistic formulation of the methods for weighting
the lexical entries. In Section <a href="#S5" title="5 Training and Test Data ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, we briefly introduce the
test and training corpora. In Section <a href="#S6" title="6 Tests and Results ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>, we present the
results. Finally, in Sections <a href="#S7" title="7 Implementation Note ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>, <a href="#S8" title="8 Discussion and Further Research ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">8</span></a> and
<a href="#S9" title="9 Conclusions ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">9</span></a>, we give some notes on the implementation, discuss the
results and draw the conclusions.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Inflection and Compounding in Finnish</h2>

<div id="S2.p1" class="ltx_para">
<p class="ltx_p">In Finnish morphology, productive inflection yields several thousand
forms for a typical noun. Finnish
compounding theoretically allows nominal compounds of arbitrary length
to be created from initial parts in certain noun forms, while the
final part can take any inflected form.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">For example, the compounds describing ancestors are formed from
zero or more instances of <em class="ltx_emph">isän</em> ‘father <span class="ltx_text ltx_font_smallcaps">singular genitive</span>’ and
<em class="ltx_emph">äidin</em> ‘mother <span class="ltx_text ltx_font_smallcaps">singular genitive</span>’ followed by any one
inflected form of <em class="ltx_emph">isä</em> or <em class="ltx_emph">äiti</em>, creating forms such as
<em class="ltx_emph">äidinisälle</em> ‘grandfather (maternal) <span class="ltx_text ltx_font_smallcaps">singular allative</span>’
or <em class="ltx_emph">isänisänisänisä</em> ‘great great grandfather <span class="ltx_text ltx_font_smallcaps">singular
nominative</span>’. As for the potential ambiguity, Finnish also has the
noun <em class="ltx_emph">nisä</em> ‘udder’, which creates ambiguity for any paternal
grandfather, e.g. <em class="ltx_emph">isän#isän#isän#isä</em>,
<em class="ltx_emph">isän#isä#nisän#isä</em>, <em class="ltx_emph">isä#nisä#nisä#nisä</em>, …</p>
</div>
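<div class="ltx_para">
<p class="ltx_p">To make the ambiguity concrete, the following minimal Python sketch
enumerates every segmentation of <em class="ltx_emph">isänisänisänisä</em> over a toy lexicon.
The function name <span class="ltx_text ltx_font_typewriter">segmentations</span> and the two word lists are
assumptions made for illustration only. They are not part of the analyzer
described in this paper, which is a finite-state transducer rather than a
recursive search.</p>
</div>

```python
def segmentations(word, non_final, final):
    """Return every split of `word` into zero or more non-final parts
    followed by exactly one final part."""
    results = []
    if word in final:
        results.append([word])
    for i in range(1, len(word)):
        head = word[:i]
        if head in non_final:
            for rest in segmentations(word[i:], non_final, final):
                results.append([head] + rest)
    return results


# Toy lexicon (hypothetical): nominative and genitive singulars of
# isä 'father' and nisä 'udder' as non-final parts, nominatives as final parts.
NON_FINAL = {"isä", "isän", "nisä", "nisän"}
FINAL = {"isä", "nisä"}

analyses = segmentations("isänisänisänisä", NON_FINAL, FINAL)
for parts in analyses:
    print("#".join(parts))
```

<div class="ltx_para">
<p class="ltx_p">With this toy lexicon, each of the three word-internal <em class="ltx_emph">n</em> letters
can attach either to the part on its left or to the part on its right, so the
sketch yields eight segmentations for a single four-part compound.</p>
</div>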
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">However, much of the ambiguity in Finnish compounds is aggravated by
the ambiguity of the inflected forms of the head words. For example,
<em class="ltx_emph">isän</em> has several possible analyses,
e.g. <span class="ltx_text ltx_font_smallcaps">isä+sg+gen</span>, <span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span> and <span class="ltx_text ltx_font_smallcaps">isä+sg+ins</span>.</p>
</div>
<div id="S2.p4" class="ltx_para">
<p class="ltx_p">Finnish also has a type of compounding in which all parts of the
word are inflected in the same form, but this is limited to a subset of
adjective-initial compounds. Similarly, some inflected verb forms may appear as parts
of compounds. Both types are rarer than nominal compounds <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib132" title="Iso suomen kielioppi" class="ltx_ref">2</a>]</cite>
and are not considered in this paper.</p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Morphological analysis of Finnish</h2>

<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Pirinen <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite> presented an open source
implementation of a finite state morphological analyzer for Finnish.
We use that implementation as a baseline for the compounding analysis
as Pirinen’s analyzer has a fully productive compounding
mechanism. Fully productive compounding means that it allows compounds
of arbitrary length with any combination of nominative singulars,
genitive singulars, or genitive plurals in the initial part and any
inflected form of a noun as the final part.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">The morphotactic combination of morphemes is achieved with sublexicon
combinatorics as defined in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. We use the open source
software called <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, whose interface is similar to that of the
Xerox lexc tool. The <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span> tool includes preliminary
support for weights on the lexical entries.</p>
</div>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">In this implementation, each lexical entry constitutes one full word
form, i.e., we create a full form lexicon using the previously
mentioned analyzer <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite>. This creates a text file of 22
GB for the purely inflectional morphology of approximately 40 000
non-compound lexical entries for Finnish, which were stored in a
single CompoundFinalNoun lexicon as shown in
Figure <a href="#S3.F1" title="Figure 1 ‣ 3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. The figure demonstrates an unweighted
lexicon and also shows how we model the compounding by dividing the
word forms into two categories: compound non-final (i.e., nominative
singular, genitive singular, and genitive plural) and compound final
forms, allowing us to give weights to each form or compound part as
needed.</p>
</div>
<figure id="S3.F1" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;

LEXICON Compound
#:0 CompoundNonFinalNoun "weight: 0" ;
#:0 CompoundFinalNoun "weight: 0" ;

LEXICON CompoundNonFinalNoun
isä   Compound  "weight: 0" ;
isän  Compound  "weight: 0" ;
äiti  Compound  "weight: 0" ;
äidin Compound  "weight: 0" ;

LEXICON CompoundFinalNoun
isä:isä+sg+nom     ## "weight: 0" ;
isän:isä+sg+gen    ## "weight: 0" ;
isälle:isä+sg+all  ## "weight: 0" ;

LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Unweighted lexicon.
</figcaption>
</figure>
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">Compounding implemented with the unweighted sublexicons in
Figure <a href="#S3.F1" title="Figure 1 ‣ 3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> is equivalent to the original baseline
analyzer. The root sublexicon specifies that we can start directly
from compound final noun forms, forming single part words, or start from
compound initial forms, forming multiword compounds. The compound initial
lexicon is a listing of all singular nominatives, singular genitives and
plural genitives, each of which is followed by a compound boundary marker in a separate
sublexicon and then by another word from either the compound initial sublexicon or the compound
final sublexicon. The compound final sublexicon contains the long listing of all
possible forms of all words and their analyses.</p>
</div>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Methodology</h2>

<div id="S4.p1" class="ltx_para">
<p class="ltx_p">We define the weight of a token through its probability of occurring in
the corpus, i.e. we use the count, <em class="ltx_emph">c</em>, which is proportional to
the frequency with which a token appears in a corpus, divided by the
corpus size, <em class="ltx_emph">cs</em>. The probability, <em class="ltx_emph">p(a)</em>, for a token,
<em class="ltx_emph">a</em>, is defined by Equation <a href="#S4.E1" title="(1) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<div id="S4.p2" class="ltx_para">
<table id="S4.E1" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E1.m1" class="ltx_Math" alttext="\mathrm{p}(a)=\mathrm{c}(a)/\mathrm{cs}" display="block"><mrow><mrow><mi mathsize="90%" mathvariant="normal">p</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">/</mo><mi mathsize="90%">cs</mi></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S4.p3" class="ltx_para">
<p class="ltx_p">Tokens known to the lexicon but unseen in the corpus need to be
assigned a small probability mass different from 0, so they get
<em class="ltx_emph">c(x) = 1</em>, i.e. we define the count of a token as its corpus
frequency plus 1 as in Equation <a href="#S4.E2" title="(2) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<div id="S4.p4" class="ltx_para">
<table id="S4.E2" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E2.m1" class="ltx_Math" alttext="\mathrm{c}(a)=1+\mathrm{frequency}(a)" display="block"><mrow><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mn mathsize="90%">1</mn><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">frequency</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr>
</table>
</div>
<div id="S4.p5" class="ltx_para">
<p class="ltx_p">If a token, e.g. <em class="ltx_emph">isän</em>, has several possible analyses, e.g.
<span class="ltx_text ltx_font_smallcaps">isä+sg+gen</span> and <span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span>, the total count for
<em class="ltx_emph">isän</em> will be divided among the analyses in a disambiguated
training corpus. If disambiguation removes every occurrence of the reading
<span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span> from the training corpus, the count for this
reading is still 1 according to Equation <a href="#S4.E2" title="(2) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. We need the
total probability mass of all the tokens in the lexicon to sum up to
1, so we define the corpus size as the sum of all lexical token
counts according to Equation <a href="#S4.E3" title="(3) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>.</p>
</div>
<div id="S4.p6" class="ltx_para">
<table id="S4.E3" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E3.m1" class="ltx_Math" alttext="\mathrm{cs}=\sum_{x}\mathrm{c}(x)" display="block"><mrow><mi mathsize="90%">cs</mi><mo mathsize="90%" stretchy="false">=</mo><mrow><munder><mo largeop="true" mathsize="90%" movablelimits="false" stretchy="false" symmetric="true">∑</mo><mi mathsize="90%">x</mi></munder><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td>
</tr>
</table>
</div>
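<div class="ltx_para">
<p class="ltx_p">Equations 1 to 3 can be sketched in a few lines of Python. The function
name <span class="ltx_text ltx_font_typewriter">token_probabilities</span>, the five-entry lexicon and the
seven-token corpus are hypothetical and serve only to show the arithmetic.
The sketch also assumes that corpus tokens unknown to the lexicon are simply
ignored, which the text above does not specify.</p>
</div>

```python
from collections import Counter


def token_probabilities(lexicon, corpus_tokens):
    """Add-one counts, corpus size and probabilities for all lexicon tokens."""
    freq = Counter(corpus_tokens)
    # Equation 2: c(a) = 1 + frequency(a); unseen tokens get a count of 1.
    counts = {tok: 1 + freq[tok] for tok in lexicon}
    # Equation 3: cs is the sum of the counts of all lexical tokens.
    cs = sum(counts.values())
    # Equation 1: p(a) = c(a) / cs.
    probs = {tok: c / cs for tok, c in counts.items()}
    return counts, cs, probs


# Hypothetical toy data; "nisä" is in the lexicon but unseen in the corpus.
lexicon = ["isä", "isän", "äiti", "äidin", "nisä"]
corpus = ["isä", "isä", "isän", "äiti", "isä", "äidin", "isän"]

counts, cs, probs = token_probabilities(lexicon, corpus)
print(counts["nisä"], cs, probs["nisä"])
```

<div class="ltx_para">
<p class="ltx_p">Because the corpus size is defined as the sum of the smoothed counts
rather than the raw number of corpus tokens, the probabilities of all lexical
tokens sum to 1 by construction.</p>
</div>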
<div id="S4.p7" class="ltx_para">
<p class="ltx_p">To use the probabilities as weights in the lexicon we implement them in
the tropical semiring, which means that we use the negative
log-probabilities as defined by Equation <a href="#S4.E4" title="(4) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<div id="S4.p8" class="ltx_para">
<table id="S4.E4" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E4.m1" class="ltx_Math" alttext="w(a)=-\mathrm{log}(p(a))" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mo mathsize="90%" stretchy="false">-</mo><mrow><mi mathsize="90%">log</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">p</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td>
</tr>
</table>
</div>
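<div class="ltx_para">
<p class="ltx_p">In the tropical semiring, weights along a path are added and competing
paths are compared by taking the minimum, so the lowest-weight analysis is the
most probable one. The Python sketch below illustrates this for two competing
segmentations. The probabilities, the helper names
<span class="ltx_text ltx_font_typewriter">weight</span> and <span class="ltx_text ltx_font_typewriter">path_weight</span>, and the use
of the natural logarithm are assumptions for illustration. The sketch also
simplifies by using one probability per surface token for both non-final and
final parts.</p>
</div>

```python
import math

# Hypothetical token probabilities p(a) = c(a) / cs with cs = 12.
probs = {"isä": 4 / 12, "isän": 3 / 12, "nisä": 1 / 12}


def weight(p):
    """Equation 4: the weight is the negative log-probability."""
    return -math.log(p)


def path_weight(parts):
    """Tropical 'times' is ordinary addition of the weights along a path."""
    return sum(weight(probs[part]) for part in parts)


# Two competing segmentations of the same surface string "isänisä".
candidates = [["isän", "isä"], ["isä", "nisä"]]

# Tropical 'plus' is min: the analysis with the lowest weight wins.
best = min(candidates, key=path_weight)
print("#".join(best), round(path_weight(best), 4))
```

<div class="ltx_para">
<p class="ltx_p">Adding negative log-probabilities corresponds to multiplying the
underlying probabilities, so the path weight of a segmentation is the negative
logarithm of the product of its part probabilities.</p>
</div>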
<div id="S4.p9" class="ltx_para">
<p class="ltx_p">For an illustration of how the weighting scheme is implemented in the
lexicon, see Figure <a href="#S4.F2" title="Figure 2 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<figure id="S4.F2" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;

LEXICON Compound
0:# CompoundNonFinalNoun "weight: 0" ;
0:# CompoundFinalNoun "weight: 0" ;

LEXICON CompoundNonFinalNoun
isä   Compound  "weight: -log(c(isä)/cs)" ;
isän  Compound  "weight: -log(c(isän)/cs)" ;
äiti  Compound  "weight: -log(c(äiti)/cs)" ;
äidin Compound  "weight: -log(c(äidin)/cs)" ;

LEXICON CompoundFinalNoun
isä:isä+sg+nom     ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen    ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all  ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins    ## "weight:-log(c(isä+pl+ins)/cs)" ;

LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Structure weighting scheme using token penalties.
</figcaption>
</figure>
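<div class="ltx_para">
<p class="ltx_p">The symbolic weights in Figure 2 have to be replaced by numbers when
the lexicon is built. The following Python sketch shows one way such entries
could be generated from a table of counts. The function
<span class="ltx_text ltx_font_typewriter">lexicon_block</span>, the counts and the corpus size are
hypothetical, and the output merely mirrors the notation of Figure 2. It has
not been checked against the input syntax of any particular
<span class="ltx_text ltx_font_smallcaps">hfst-lexc</span> release.</p>
</div>

```python
import math


def lexicon_block(name, entries, continuation, counts, cs):
    """Format one sublexicon in the notation of Figure 2.

    `entries` is a list of (entry_text, count_key) pairs, where count_key
    selects the count used for the weight of that entry.
    """
    lines = ["LEXICON " + name]
    for entry, key in entries:
        w = -math.log(counts[key] / cs)  # Equation 4 applied to c(key) / cs
        lines.append('{}  {}  "weight: {:.4f}" ;'.format(entry, continuation, w))
    return lines


# Hypothetical counts; cs = 12 is assumed to cover tokens not listed here.
counts = {"isä": 4, "isän": 3}
cs = 12

non_final = lexicon_block(
    "CompoundNonFinalNoun",
    [("isä", "isä"), ("isän", "isän")],
    "Compound",
    counts,
    cs,
)
print("\n".join(non_final))
```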
<div id="S4.p10" class="ltx_para">
<p class="ltx_p">According to Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite> and
Schiller <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>, we may need to ensure that the
weight of the compound segmentation <em class="ltx_emph">ab</em> of a word is always
greater than the weight of a non-compound analysis <em class="ltx_emph">c</em> of the
same word, so for compounds we use Equation <a href="#S4.E5" title="(5) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>,
where <em class="ltx_emph">a</em> is the first part of the compound and <em class="ltx_emph">x</em> is the
remaining part, which may be split into additional parts by applying the
equation recursively.</p>
</div>
<div id="S4.p11" class="ltx_para">
<table id="S4.E5" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E5.m1" class="ltx_Math" alttext="w(ax)=w(a)+M+w(x)" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">a</mi><mo>⁢</mo><mi mathsize="90%">x</mi></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">+</mo><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td>
</tr>
</table>
</div>
<div id="S4.p12" class="ltx_para">
<p class="ltx_p">In particular, it is true that <math id="S4.p12.m1" class="ltx_Math" alttext="w(ab)&gt;w(c)" display="inline"><mrow><mrow><mi>w</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mrow><mi>a</mi><mo>⁢</mo><mi>b</mi></mrow><mo stretchy="false">)</mo></mrow></mrow><mo>&gt;</mo><mrow><mi>w</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mi>c</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></math> if <em class="ltx_emph">M</em> is
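defined as in Equation <a href="#S4.E6" title="(6) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>: <em class="ltx_emph">M</em> alone then exceeds -log(1/cs), the largest weight a single word observed in the corpus can receive, so any compound <em class="ltx_emph">ab</em> is weighted higher than any single word <em class="ltx_emph">c</em>.</p>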
</div>
<div id="S4.p13" class="ltx_para">
<table id="S4.E6" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E6.m1" class="ltx_Math" alttext="M=-\mathrm{log}(1/(\mathrm{cs}+1))" display="block"><mrow><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">=</mo><mrow><mo mathsize="90%" stretchy="false">-</mo><mrow><mi mathsize="90%">log</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mn mathsize="90%">1</mn><mo mathsize="90%" stretchy="false">/</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">cs</mi><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">1</mn></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td>
</tr>
</table>
</div>
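<div class="ltx_para">
<p class="ltx_p">As a rough numerical check of Equations 5 and 6, the following Python sketch computes the weights directly. The corpus size and the counts are hypothetical values chosen for illustration only, and the function names are not part of any tool described here; word weights are assumed to be -log(c(word)/cs), as in the lexicon entries.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
import math

def word_weight(count, corpus_size):
    """Weight of a single word: -log of its relative frequency."""
    return -math.log(count / corpus_size)

def compound_penalty(corpus_size):
    """Equation 6: M = -log(1/(cs + 1))."""
    return -math.log(1.0 / (corpus_size + 1))

def compound_weight(part_counts, corpus_size):
    """Equation 5 applied recursively: w(ax) = w(a) + M + w(x)."""
    first, rest = part_counts[0], part_counts[1:]
    weight = word_weight(first, corpus_size)
    if rest:
        weight += compound_penalty(corpus_size)
        weight += compound_weight(rest, corpus_size)
    return weight

# Hypothetical corpus size, for illustration only.
cs = 1000
# The largest weight an observed single word can get (count 1).
rarest_single_word = word_weight(1, cs)
# The smallest weight a two-part compound can get (both parts maximally frequent).
most_common_compound = compound_weight([cs, cs], cs)
print(rarest_single_word, most_common_compound)
</pre>
<p class="ltx_p">Even in this extreme case the compound weight stays above the single-word weight, because the penalty log(cs + 1) is larger than log(cs).</p>
</div>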
<div id="S4.p14" class="ltx_para">
<p class="ltx_p">For an illustration of how a structure weighting scheme with compound
penalties is implemented in the lexicon, see
Figure <a href="#S4.F3" title="Figure 3 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>.</p>
</div>
<figure id="S4.F3" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;

LEXICON Compound
0:# CompoundNonFinalNoun "weight: -log(1/(cs+1))" ;
0:# CompoundFinalNoun "weight: -log(1/(cs+1))" ;

LEXICON CompoundNonFinalNoun
isä   Compound  "weight: -log(c(isä)/cs)" ;
isän  Compound  "weight: -log(c(isän)/cs)" ;
äiti  Compound  "weight: -log(c(äiti)/cs)" ;
äidin Compound  "weight: -log(c(äidin)/cs)" ;

LEXICON CompoundFinalNoun
isä:isä+sg+nom     ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen    ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all  ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins    ## "weight:-log(c(isä+pl+ins)/cs)" ;

LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Structure weighting scheme using token and compound penalties.
</figcaption>
</figure>
<div id="S4.p15" class="ltx_para">
<p class="ltx_p">In order to compare with the original principle suggested by
Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite>, we create a third lexicon for which
structural weights are placed on the compound borders only, so for
compounds we use Equation <a href="#S4.E7" title="(7) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>.</p>
</div>
<div id="S4.p16" class="ltx_para">
<table id="S4.E7" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E7.m1" class="ltx_Math" alttext="w(ax)=M+w(x)" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">a</mi><mo>⁢</mo><mi mathsize="90%">x</mi></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">w</mi><mo>⁢</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td>
</tr>
</table>
</div>
<div id="S4.p17" class="ltx_para">
<p class="ltx_p">For an illustration of how a weighting scheme with the compound
penalty suggested by Karlsson is implemented in the lexicon, see
Figure <a href="#S4.F4" title="Figure 4 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<figure id="S4.F4" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;

LEXICON Compound
0:# CompoundNonFinalNoun "weight: -log(1/(cs+1))" ;
0:# CompoundFinalNoun "weight: -log(1/(cs+1))" ;

LEXICON CompoundNonFinalNoun
isä   Compound  "weight: 0" ;
isän  Compound  "weight: 0" ;
äiti  Compound  "weight: 0" ;
äidin Compound  "weight: 0" ;

LEXICON CompoundFinalNoun
isä:isä+sg+nom     ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen    ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all  ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins    ## "weight:-log(c(isä+pl+ins)/cs)" ;

LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Structure weighting scheme using compound penalties.
</figcaption>
</figure>
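<div class="ltx_para">
<p class="ltx_p">To make the difference between the weighting schemes concrete, the following Python sketch scores two competing segmentations of one word under each scheme and picks the reading with the lowest weight. All counts and the corpus size are hypothetical, and the function names are illustrative only; this is a sketch of the arithmetic, not of the lexicon compilation.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
import math

CS = 1_000_000  # hypothetical corpus size
COUNTS = {      # hypothetical word counts
    "aktivointi": 40,
    "akti": 300,
    "vointi": 25,
    "mahdollisuuksien": 900,
}

M = -math.log(1.0 / (CS + 1))  # compound penalty, Equation 6

def w(word):
    """Word weight: -log of the relative frequency."""
    return -math.log(COUNTS[word] / CS)

def token_and_penalty(parts):
    """Equation 5: word weights plus one penalty per compound border."""
    return sum(w(p) for p in parts) + M * (len(parts) - 1)

def penalty_only(parts):
    """Equation 7: penalties on the borders, weight on the final part only."""
    return M * (len(parts) - 1) + w(parts[-1])

def token_only(parts):
    """Word weights only, without any compound penalty."""
    return sum(w(p) for p in parts)

readings = [
    ["aktivointi", "mahdollisuuksien"],
    ["akti", "vointi", "mahdollisuuksien"],
]
best = {
    scheme.__name__: min(readings, key=scheme)
    for scheme in (token_and_penalty, penalty_only, token_only)
}
print(best)
</pre>
<p class="ltx_p">With these made-up counts all three schemes prefer the two-part reading; under Equation 7 the two readings differ by exactly one penalty <em class="ltx_emph">M</em>.</p>
</div>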
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Training and Test Data</h2>

<div id="S5.p1" class="ltx_para">
<p class="ltx_p">For training and testing purposes, we use a compilation of three
years, 1995-1997, of daily issues of Helsingin Sanomat, which is the
most widely read Finnish newspaper. This collection contained
approximately 2.4 million different words, i.e. types. We
disambiguated the corpus using Machinese for
Finnish<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>Machinese is available from Connexor Ltd.,
www.connexor.com</span></span></span> which provided one reading in context for each
word based on syntactic parsing.</p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">To create the test material from the corpus, we selected all word
forms with more than 20 characters for which our baseline analyzer
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite> gave a compound analysis, i.e. 53 270 types. Of
these, we selected the types which had a structural ambiguity and
found 4 721 such words, i.e. approximately 8.9 % of all the compound
words analyzed by our baseline analyzer. Of the remaining compounds
longer than 20 characters, 63.7 % contained no ambiguities or only
inflectional ambiguities. At most, the combination of structural and
inflectional ambiguities amounted to 30 readings, which occurred in three
different words; this is still a fairly moderate number. On average, the
structural and inflectional ambiguity amounts to 2.79 readings per
word. Examples of structurally ambiguous words are
<em class="ltx_emph">aktivointimahdollisuuksien</em> with the ambiguity
<em class="ltx_emph">aktivointi#mahdollisuus</em> ’of the opportunities to activate’ vs.
<em class="ltx_emph">akti#vointi#mahdollisuus</em> ’of the opportunities to act health’
and <em class="ltx_emph">hiihtoharjoittelupaikassa</em> with the ambiguity
<em class="ltx_emph">hiihto#harjoittelu#paikka</em> ’in the ski training location’
vs. <em class="ltx_emph">hiihto#harjoittelu#pai#kassa</em> ’ski training pie cashier’.</p>
</div>
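<div class="ltx_para">
<p class="ltx_p">The selection of test words can be sketched in Python as follows. The sketch assumes that each analysis is a string in which <span class="ltx_text ltx_font_typewriter">#</span> marks a compound border and <span class="ltx_text ltx_font_typewriter">+</span> introduces inflectional tags; the tag names, the helper names and the sample data are illustrative assumptions, not the output of the baseline analyzer.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
MIN_LENGTH = 20  # keep word forms with more than 20 characters

def segmentation(analysis):
    """Keep only the compound parts, dropping the '+tag' material."""
    return tuple(part.split("+")[0] for part in analysis.split("#"))

def is_structurally_ambiguous(analyses):
    """True if the analyses disagree on the compound segmentation."""
    return len({segmentation(a) for a in analyses}) > 1

def select_test_words(analyzed):
    """analyzed maps a word form to its list of analysis strings."""
    return sorted(
        word
        for word, analyses in analyzed.items()
        if len(word) > MIN_LENGTH and is_structurally_ambiguous(analyses)
    )

# Illustrative data only; the tags are hypothetical.
analyzed = {
    "aktivointimahdollisuuksien": [
        "aktivointi#mahdollisuus+pl+gen",
        "akti#vointi#mahdollisuus+pl+gen",
    ],
    "hiihtoharjoittelupaikassa": [
        "hiihto#harjoittelu#paikka+sg+ine",
        "hiihto#harjoittelu#pai#kassa+sg+nom",
    ],
    # Long, but only one segmentation: not selected.
    "elokuvateatteritukityöryhmä": [
        "elo#kuva#teatteri#tuki#työ#ryhmä+sg+nom",
    ],
    # Too short: not selected.
    "isänisälle": ["isän#isä+sg+all"],
}
print(select_test_words(analyzed))
</pre>
</div>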
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">The characteristics of all the compounds in the corpus are presented in
Table <a href="#S5.T1" title="Table 1 ‣ 5 Training and Test Data ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<figure id="S5.T1" class="ltx_table">
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" colspan="3"><span class="ltx_text" style="font-size:90%;"># of Characters</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" colspan="3"><span class="ltx_text" style="font-size:90%;"># of Segments</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Min.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Max.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Avg.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Min.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Max.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Avg.</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">44</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">15.34</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">6</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2.19</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_table">Table 1: </span>Length of the compounds in the corpus, in characters and in segments.
</figcaption>
</figure>
<div id="S5.p4" class="ltx_para">
<p class="ltx_p">Examples of six-part compounds are:
</p>
<ul id="I1" class="ltx_itemize">
<li id="I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span> 
<div id="I1.i1.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">elo#kuva#teatteri#tuki#työ#ryhmä</em> 
<br class="ltx_break">’movie theater support workgroup’</p>
</div>
</li>
<li id="I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span> 
<div id="I1.i2.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">jatko#koulutus#yhteis#työ#toimi#kunta</em> 
<br class="ltx_break">’higher education cooperation committee’</p>
</div>
</li>
<li id="I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span> 
<div id="I1.i3.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">lähi#alue#yhteis#työ#määrä#raha</em> 
<br class="ltx_break">’regional cooperation reserve’</p>
</div>
</li>
</ul>
</div>
<div id="S5.p5" class="ltx_para">
<p class="ltx_p">The longest compound found in the corpus is
<em class="ltx_emph">liikenne#turvallisuus#asiain#neuvottelu#kunnassa</em> ’in the road
safety issue negotiating committee’.</p>
</div>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Tests and Results</h2>

<div id="S6.p1" class="ltx_para">
<p class="ltx_p">We estimate the probabilities for the non-compound words in the 1995
part of the corpus. Since we do not use the compounds for training, we
can test on the compounds of all three years.</p>
</div>
<div id="S6.p2" class="ltx_para">
<p class="ltx_p">We evaluated the weighting schemes described in Section <a href="#S4" title="4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>,
i.e. the probabilistic method without compound boundary weighting, the
probabilistic method combined with compound weighting and the
traditional pure compound weighting. The precision and recall are
presented in Table <a href="#S6.T2" title="Table 2 ‣ 6 Tests and Results ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. Since we only took the first of
the best results, the precision is equal to recall.</p>
</div>
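<div class="ltx_para">
<p class="ltx_p">Why precision and recall coincide here can be seen from a small Python sketch. It assumes one correct reading per word and uses made-up readings; the function name and the data are illustrative only. When exactly one reading is returned per word, the number of returned readings equals the number of words, so both ratios share the same denominator.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
def precision_recall(predicted, gold):
    """predicted: word -> list of returned readings.
    gold: word -> the single correct reading."""
    returned = sum(len(readings) for readings in predicted.values())
    correct = sum(
        1 for word, readings in predicted.items() if gold[word] in readings
    )
    return correct / returned, correct / len(gold)

# Made-up data for illustration.
gold = {"w1": "a#b", "w2": "c#d", "w3": "e#f", "w4": "g#h"}

first_best = {
    "w1": ["a#b"],
    "w2": ["c#d"],
    "w3": ["e#f"],
    "w4": ["g#hh"],  # wrong reading ranked first
}
two_best = {
    "w1": ["a#b", "ab"],
    "w2": ["c#d", "cd"],
    "w3": ["e#f", "ef"],
    "w4": ["g#hh", "g#h"],
}

print(precision_recall(first_best, gold))  # equal values
print(precision_recall(two_best, gold))    # precision and recall diverge
</pre>
</div>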
<figure id="S6.T2" class="ltx_table">
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Parameters</span></th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Precision</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Only compound penalty</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">99.94 %</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r"><span class="ltx_text" style="font-size:90%;">Compound penalty and prefix weights</span></td>
<td class="ltx_td ltx_align_left ltx_border_r"><span class="ltx_text" style="font-size:90%;">99.98 %</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r"><span class="ltx_text" style="font-size:90%;">No compound penalty and prefix weights</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:90%;">99.98 %</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_table">Table 2: </span>Precision equals recall for the test results when we use
only the first result.
</figcaption>
</figure>
</section>
<section id="S7" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7 </span>Implementation Note</h2>

<div id="S7.p1" class="ltx_para">
<p class="ltx_p">In <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, we use OpenFST <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">openfst/2007</span>]</cite> as the underlying
finite-state software library for handling weighted finite-state
transducers. The estimated probabilities are encoded as weights in the
tropical semiring, see <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">mohri/1997</span>]</cite>. To extract the n-best
results, we use a single-source n-best paths algorithm, see
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">mohri/2002</span>]</cite>.</p>
</div>
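<div class="ltx_para">
<p class="ltx_p">The role of the tropical semiring can be illustrated with a small Python sketch: weights are added along a path and the minimum is taken between competing paths. The n-best routine below is a plain best-first enumeration over a toy graph with non-negative weights; it is not the algorithm of the cited work nor the OpenFST implementation, and the arc weights are hypothetical.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
import heapq

def tropical_times(a, b):
    """Extend a path: weights (negative log probabilities) add up."""
    return a + b

def tropical_plus(a, b):
    """Choose between alternative paths: keep the smaller weight."""
    return min(a, b)

def n_best(arcs, start, final, n):
    """arcs maps a state to a list of (label, weight, next_state).
    Returns up to n (weight, labels) pairs, lightest first.
    Assumes non-negative weights."""
    results = []
    queue = [(0.0, (), start)]
    while queue and len(results) != n:
        weight, labels, state = heapq.heappop(queue)
        if state == final:
            results.append((weight, labels))
            continue
        for label, arc_weight, target in arcs.get(state, []):
            heapq.heappush(
                queue,
                (tropical_times(weight, arc_weight), labels + (label,), target),
            )
    return results

# Toy graph with hypothetical weights; state 3 is final.
arcs = {
    0: [("aktivointi", 2.0, 1), ("akti", 1.0, 2)],
    2: [("#vointi", 3.0, 1)],
    1: [("#mahdollisuus", 1.5, 3)],
}
paths = n_best(arcs, 0, 3, 2)
best_weight = tropical_plus(paths[0][0], paths[1][0])
print(paths, best_weight)
</pre>
</div>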
</section>
<section id="S8" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">8 </span>Discussion and Further Research</h2>

<div id="S8.p1" class="ltx_para">
<p class="ltx_p">Previous results for structural compound disambiguation for German
using word probabilities and compound penalties <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> or
using only word probabilities <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">marek/2006</span>]</cite> also achieved results
with precision and recall in the region of 97-99 %. In German, the
ambiguities of long compounds may produce as many as 120 readings, but on
average the ambiguity in compounds is between 2 and 3 readings
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>, which is on par with the ambiguity of 2.8
readings found for long Finnish compounds. As pointed out initially
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">hedlund/2002</span>]</cite>, the number of compounds occurring in Finnish,
Swedish and German texts is also on a comparable level.</p>
</div>
<div id="S8.p2" class="ltx_para">
<p class="ltx_p">If a disambiguated corpus is not available for calculating the word
probabilities, using only the structural penalties may still be an
acceptable replacement in Finnish. However, we note that a
similar strategy in German, i.e. using only compound penalties on all
compound prefixes, did not seem to perform as well
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>. This may be due to the fact that German contains a
high number of very short one-syllable words which interfere with the
compounding, whereas Finnish is more restricted in the number of short
words. Scandinavian languages are similar to German in that they have
a number of short one-syllable nouns. A probabilistic approach to
Swedish compound disambiguation is demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">sjobergh/2004</span>]</cite>,
which reports 86 % accuracy in compound segmentation when using
compound component frequencies and 90 % when using the number of compound components.
However, it remains a question for
further research whether a purely probabilistic approach could fare as
well for Scandinavian languages.</p>
</div>
</section>
<section id="S9" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">9 </span>Conclusions</h2>

<div id="S9.p1" class="ltx_para">
<p class="ltx_p">For Finnish, weighting compound complexity gives excellent results,
around 99.9 % precision, almost regardless of the approach. However, from a
theoretical point of view, we can still verify the two hypotheses we
postulated initially. Most importantly, there seems to be no need to
extract the counts from lists of disambiguated compounds, i.e., it is
quite feasible to use general word occurrence probabilities for
structurally disambiguating compounds. In addition, we can also
corroborate the observation that when using word probabilities, it is
possible to forego a specific structural penalty and rely only on the
word probabilities. From a practical point of view, we introduced the
open source tool, <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, and demonstrated how it can be
successfully used to encode various compound weighting schemes.
</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Acknowledgments</h2>

<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">This research was funded by the Finnish Academy and the Finnish
Ministry of Education. We are also grateful to the HFST-Helsinki
Finite State Technology research team and to the anonymous reviewers.</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>

<ul id="L1" class="ltx_biblist">
<li id="bib.bib47" class="ltx_bibitem ltx_bib_book">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://kaino.kotus.fi/visk" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib132" title="Iso suomen kielioppi" class="ltx_ref">2</a>.
</span>
</li>
<li id="bib.bib132" class="ltx_bibitem ltx_bib_book">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p4" title="2 Inflection and Compounding in Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.
</span>
</li>
</ul>
</section>
</article>
</div>
</div>
</body>
</html>