<!DOCTYPE html><html>
<head>
<title>Weighted Finite-State Morphological Analysis of Finnish Inflection and Compounding. The official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017.</title>
<!--Generated on Fri Oct 13 18:35:50 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last modification: October 13, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Weighted Finite-State Morphological Analysis
<br class="ltx_break">of Finnish Inflection and Compounding<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>The official
publication was in Nodalida 2009 organised in Odense,
<a href="http://beta.visl.sdu.dk/nodalida2009/" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://beta.visl.sdu.dk/nodalida2009/</a>, the electronic publication was
available at <a href="http://dspace.utlib.ee/dspace/handle/10062/9206" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://dspace.utlib.ee/dspace/handle/10062/9206</a> on
October 13, 2017.</span></span></span>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Helsinki, Finland
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">[email protected]</span>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi Pirinen
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Helsinki, Finland
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">[email protected]</span>
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last modification: October 13, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">Finnish has highly productive compounding and a rich inflectional
system, which causes ambiguity in the morphological segmentation of
compounds made with finite-state transducer methods. In order to
disambiguate the compound segmentations, we compare three different
strategies, which we cast in a probabilistic framework. We present a
method for implementing the probabilistic framework as part of the
building process of lexc-style morpheme sub-lexicons, creating
weighted lexical transducers. To implement the structurally
disambiguating morphological analyzer, we use the <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>
tool, which is part of the open source <em class="ltx_emph">Helsinki Finite-State
Technology</em>. This is the first time all three principles are cast
in a probabilistic framework and compared on the same corpus using
one tool. On our Finnish test corpus, the best method succeeds with
99.98 % precision and recall.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">In languages with productive multipart compounding, such as Finnish,
German and Swedish, approximately 9-10 % of the word tokens in a
corpus are compounds <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">hedlund/2002</span>]</cite> and approximately 2/3 of the
dictionary entries are compounds, cf. a publicly available Finnish
dictionary <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">kotus/2007</span>]</cite>.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">There have been various attempts at curbing the potential
combinatorial explosion of segmentations that a prolific compounding
mechanism produces. Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite> showed that for
Swedish the most significant factor in disambiguating compounds was
the number of parts in the analysis: the
analysis with the fewest parts was almost always the best
candidate. This has since been corroborated by others. In particular,
it was the main disambiguation criterion formulated by
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> for German compounding. In addition, Schiller used
frequency information for disambiguating between compounds with an
equal number of parts. Schiller estimated her figures from compound
part frequencies, which requires a considerable amount of manual
labour in order to create the training corpora consisting of attested
compound words and their correct segmentations.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">We suggest two modifications to the strategies of Karlsson and
Schiller. First, we suggest that the word segment probabilities can be
estimated from non-compound word frequencies in the corpus. The
motivation for our approach is that compounds are formed in order to
distinguish between instances of frequently occurring phenomena, and
therefore compounds are more often formed for more frequently
discussed phenomena. We assume that the frequency with which phenomena
are discussed is reflected in the non-compound word frequencies,
i.e. high-frequency words should in general have more compounds.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">In addition, we suggest that the special penalty suggested by Karlsson
and maintained by Schiller is unnecessary when framing the problem in
a probabilistic framework. This has also been suggested by others, see
e.g. Marek <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">marek/2006</span>]</cite>. However, this is the first time the
disambiguation principles of Karlsson and of Schiller are compared
with a fully probabilistic approach on the same corpus.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">Previously, there has been no publicly available general framework for
conveniently integrating a full-fledged morphological description
with probabilities for general morphological compound
and inflectional analysis. Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite>
applied a post-processing phase to count the parts, and Schiller
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> used the proprietary weighted finite-state compiler
of Xerox <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">kempe/2003</span>]</cite>, which compiles regular expressions. We
therefore introduce the open source software tool
<span class="ltx_text ltx_font_smallcaps">hfst-lexc<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><a href="http://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstLexC" title="" class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright">http://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstLexC</a></span></span></span></span>,
which is similar to the Xerox lexc tool <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. In
addition to the fact that <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span> compiles lexc-style
lexicons, it also has a mechanism for adding weights to compound parts
and morphological analyses.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">The remainder of the article is structured as follows. In
Sections <a href="#S2" title="2 Inflection and Compounding in Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> and <a href="#S3" title="3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>, we introduce a version of
Finnish morphology for compounding. In Section <a href="#S4" title="4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, we
introduce the probabilistic formulation of the methods for weighting
the lexical entries. In Section <a href="#S5" title="5 Training and Test Data ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, we briefly introduce the
test and training corpora. In Section <a href="#S6" title="6 Tests and Results ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>, we present the
results. Finally, in Sections <a href="#S7" title="7 Implementation Note ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>, <a href="#S8" title="8 Discussion and Further Research ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">8</span></a> and
<a href="#S9" title="9 Conclusions ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">9</span></a>, we give some notes on the implementation, discuss the
results and draw the conclusions.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Inflection and Compounding in Finnish</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">In Finnish morphology, the productive inflection of a typical noun
yields several thousand forms. Finnish
compounding theoretically allows nominal compounds of arbitrary length
to be created from initial parts consisting of certain forms of nouns, while the
final part inflects in all possible forms.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">For example, the compounds describing ancestors are formed from
zero or more of <em class="ltx_emph">isän</em> ‘father <span class="ltx_text ltx_font_smallcaps">singular genitive</span>’ and
<em class="ltx_emph">äidin</em> ‘mother <span class="ltx_text ltx_font_smallcaps">singular genitive</span>’, followed by any one
inflected form of <em class="ltx_emph">isä</em> or <em class="ltx_emph">äiti</em>, creating forms such as
<em class="ltx_emph">äidinisälle</em> ‘grandfather (maternal) <span class="ltx_text ltx_font_smallcaps">singular allative</span>’
or <em class="ltx_emph">isänisänisänisä</em> ‘great great grandfather <span class="ltx_text ltx_font_smallcaps">singular
nominative</span>’. As for the potential ambiguity, Finnish also has the
noun <em class="ltx_emph">nisä</em> ‘udder’, which creates ambiguity for any paternal
grandfather, e.g. <em class="ltx_emph">isän#isän#isän#isä</em>,
<em class="ltx_emph">isän#isä#nisän#isä</em>, <em class="ltx_emph">isä#nisä#nisä#nisä</em>, …</p>
</div>
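<div class="ltx_para">
<p class="ltx_p">The ambiguity described above can be made concrete with a small sketch. It
assumes a toy lexicon containing only the forms of <em class="ltx_emph">isä</em> and <em class="ltx_emph">nisä</em>
mentioned in the text; the function name <code>segmentations</code> and the two sets are
illustrative and are not part of the analyzer. The sketch enumerates every split of a word
into non-final parts followed by one final part.</p>
</div>

```python
def segmentations(word, non_final, final):
    """Return every split of word into non-final parts plus one final part."""
    results = []
    # The whole remaining string may itself be a compound-final form.
    if word in final:
        results.append([word])
    # Otherwise try every proper prefix as a compound non-final form.
    for i in range(1, len(word)):
        head = word[:i]
        if head in non_final:
            for rest in segmentations(word[i:], non_final, final):
                results.append([head] + rest)
    return results


# Toy lexicon (an assumption for illustration only).
NON_FINAL = {"isä", "isän", "nisä", "nisän"}
FINAL = {"isä", "nisä"}

if __name__ == "__main__":
    for parts in segmentations("isänisänisänisä", NON_FINAL, FINAL):
        print("#".join(parts))
```

<div class="ltx_para">
<p class="ltx_p">With this toy lexicon, each of the three occurrences of <em class="ltx_emph">n</em> can attach
either to the preceding part or to the following one, so the number of candidate
segmentations doubles with every additional part.</p>
</div>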
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">However, much of the ambiguity in Finnish compounds is aggravated by
the ambiguity of the inflected forms of the head words. For example,
<em class="ltx_emph">isän</em> has several possible analyses,
e.g. <span class="ltx_text ltx_font_smallcaps">isä+sg+gen</span>, <span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span> and <span class="ltx_text ltx_font_smallcaps">isä+sg+ins</span>.</p>
</div>
<div id="S2.p4" class="ltx_para">
<p class="ltx_p">Finnish compounding also includes compounds in which all parts
are inflected in the same form, but this is limited to a subset of adjective-initial
compounds. Similarly, some inflected verb forms may appear as parts
of compounds. Both are rarer than nominal compounds <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib132" title="Iso suomen kielioppi" class="ltx_ref">2</a>]</cite>
and are not considered in this paper.</p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Morphological analysis of Finnish</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Pirinen <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite> presented an open source
implementation of a finite state morphological analyzer for Finnish.
We use that implementation as a baseline for the compounding analysis
as Pirinen’s analyzer has a fully productive compounding
mechanism. Fully productive compounding means that it allows compounds
of arbitrary length with any combination of nominative singulars,
genitive singulars, or genitive plurals in the initial part and any
inflected form of a noun as the final part.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">The morphotactic combination of morphemes is achieved with sublexicon
combinatorics as defined in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. We use the open source
software called <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, whose interface is similar to that of the
Xerox lexc tool. The <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span> tool includes preliminary
support for weights on the lexical entries.</p>
</div>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">In this implementation, each lexical entry constitutes one full word
form, i.e., we create a full form lexicon using the previously
mentioned analyzer <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite>. This creates a text file of 22
GB for the purely inflectional morphology of approximately 40 000
non-compound lexical entries for Finnish, which were stored in a
single CompoundFinalNoun lexicon as shown in
Figure <a href="#S3.F1" title="Figure 1 ‣ 3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. The figure demonstrates an unweighted
lexicon and also shows how we model the compounding by dividing the
word forms into two categories: compound non-final (i.e., nominative
singular, genitive singular, and genitive plural) and compound final
forms, allowing us to give weights to each form or compound part as
needed.</p>
</div>
<figure id="S3.F1" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;
LEXICON Compound
#:0 CompoundNonFinalNoun "weight: 0" ;
#:0 CompoundFinalNoun "weight: 0" ;
LEXICON CompoundNonFinalNoun
isä Compound "weight: 0" ;
isän Compound "weight: 0" ;
äiti Compound "weight: 0" ;
äidin Compound "weight: 0" ;
LEXICON CompoundFinalNoun
isä:isä+sg+nom ## "weight: 0" ;
isän:isä+sg+gen ## "weight: 0" ;
isälle:isä+sg+all ## "weight: 0" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Unweighted lexicon.
</figcaption>
</figure>
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">Compounding implemented with the unweighted sublexicons in
Figure <a href="#S3.F1" title="Figure 1 ‣ 3 Morphological analysis of Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> is equivalent to the original baseline
analyzer. The root sublexicon specifies that we can start directly
from compound final noun forms, forming single-part words, or start from
compound initial forms, forming multi-part compounds. The compound initial
lexicon is a listing of all singular nominatives, singular genitives and
plural genitives, each of which is followed by a compound boundary marker in a separate
sublexicon and then by another word from either the compound initial sublexicon or the compound
final sublexicon. The compound final sublexicon contains the long listing of all
possible forms of all words and their analyses.</p>
</div>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Methodology</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">We define the weight of a token through its probability of occurring in
the corpus, i.e. we use the count, <em class="ltx_emph">c</em>, which is proportional to
the frequency with which the token appears in a corpus, divided by the
corpus size, <em class="ltx_emph">cs</em>. The probability, <em class="ltx_emph">p(a)</em>, of a token,
<em class="ltx_emph">a</em>, is defined by Equation <a href="#S4.E1" title="(1) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<div id="S4.p2" class="ltx_para">
<table id="S4.E1" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E1.m1" class="ltx_Math" alttext="\mathrm{p}(a)=\mathrm{c}(a)/\mathrm{cs}" display="block"><mrow><mrow><mi mathsize="90%" mathvariant="normal">p</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">/</mo><mi mathsize="90%">cs</mi></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S4.p3" class="ltx_para">
<p class="ltx_p">Tokens known to the lexicon but unseen in the corpus need to be
assigned a small probability mass different from 0, so they get
<em class="ltx_emph">c(x) = 1</em>, i.e. we define the count of a token as its corpus
frequency plus 1 as in Equation <a href="#S4.E2" title="(2) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<div id="S4.p4" class="ltx_para">
<table id="S4.E2" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E2.m1" class="ltx_Math" alttext="\mathrm{c}(a)=1+\mathrm{frequency}(a)" display="block"><mrow><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mn mathsize="90%">1</mn><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">frequency</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr>
</table>
</div>
<div id="S4.p5" class="ltx_para">
<p class="ltx_p">If a token, e.g. <em class="ltx_emph">isän</em>, has several possible analyses, e.g.
<span class="ltx_text ltx_font_smallcaps">isä+sg+gen</span> and <span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span>, the total count for
<em class="ltx_emph">isän</em> will be divided among the analyses in a disambiguated
training corpus. If disambiguation removes every
<span class="ltx_text ltx_font_smallcaps">isä+sg+acc</span> reading from the training corpus, the count for this
reading is still 1 according to Equation <a href="#S4.E2" title="(2) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. We need the
total probability mass of all the tokens in the lexicon to sum to
1, so we define the corpus size as the sum of all lexical token
counts according to Equation <a href="#S4.E3" title="(3) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>.</p>
</div>
<div id="S4.p6" class="ltx_para">
<table id="S4.E3" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E3.m1" class="ltx_Math" alttext="\mathrm{cs}=\sum_{x}\mathrm{c}(x)" display="block"><mrow><mi mathsize="90%">cs</mi><mo mathsize="90%" stretchy="false">=</mo><mrow><munder><mo largeop="true" mathsize="90%" movablelimits="false" stretchy="false" symmetric="true">∑</mo><mi mathsize="90%">x</mi></munder><mrow><mi mathsize="90%" mathvariant="normal">c</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td>
</tr>
</table>
</div>
<div id="S4.p7" class="ltx_para">
<p class="ltx_p">To use the probabilities as weights in the lexicon, we implement them in
the tropical semiring, in which weights are added along a path and the path
with the lowest total weight is preferred. This means that we use the negative
log-probabilities as defined by Equation <a href="#S4.E4" title="(4) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<div id="S4.p8" class="ltx_para">
<table id="S4.E4" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E4.m1" class="ltx_Math" alttext="w(a)=-\mathrm{log}(p(a))" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mo mathsize="90%" stretchy="false">-</mo><mrow><mi mathsize="90%">log</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">p</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td>
</tr>
</table>
</div>
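<div class="ltx_para">
<p class="ltx_p">Equations (1) to (4) can be sketched in a few lines. The example below is only an
illustration under stated assumptions: the function name <code>token_weights</code>, the toy
lexicon and the corpus frequencies are invented, and the natural logarithm is assumed because
the text does not fix a base.</p>
</div>

```python
import math


def token_weights(lexicon, corpus_freq):
    """Map each lexicon token to its weight -log(c(a) / cs)."""
    # Equation (2): every token known to the lexicon gets its frequency plus 1.
    counts = {tok: 1 + corpus_freq.get(tok, 0) for tok in lexicon}
    # Equation (3): the corpus size is the sum of all smoothed counts.
    cs = sum(counts.values())
    # Equations (1) and (4): probability c(a)/cs, stored as a negative log.
    return {tok: -math.log(c / cs) for tok, c in counts.items()}


# Invented toy data; "äidin" and "nisä" are unseen in the corpus.
LEXICON = ["isä", "isän", "äiti", "äidin", "nisä"]
FREQ = {"isä": 5, "isän": 2, "äiti": 4}

weights = token_weights(LEXICON, FREQ)

if __name__ == "__main__":
    for tok in LEXICON:
        print(tok, round(weights[tok], 4))
```

<div class="ltx_para">
<p class="ltx_p">Because the counts are normalised by their own sum, the probabilities recovered from the
weights sum to 1, and unseen tokens receive the largest weight rather than an infinite one.</p>
</div>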
<div id="S4.p9" class="ltx_para">
<p class="ltx_p">For an illustration of how the weighting scheme is implemented in the
lexicon, see Figure <a href="#S4.F2" title="Figure 2 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<figure id="S4.F2" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;
LEXICON Compound
0:# CompoundNonFinalNoun "weight: 0" ;
0:# CompoundFinalNoun "weight: 0" ;
LEXICON CompoundNonFinalNoun
isä Compound "weight: -log(c(isä)/cs)" ;
isän Compound "weight: -log(c(isän)/cs)" ;
äiti Compound "weight: -log(c(äiti)/cs)" ;
äidin Compound "weight: -log(c(äidin)/cs)" ;
LEXICON CompoundFinalNoun
isä:isä+sg+nom ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins ## "weight:-log(c(isä+pl+ins)/cs)" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Structure weighting scheme using token penalties.
</figcaption>
</figure>
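<div class="ltx_para">
<p class="ltx_p">The symbolic weights in Figure 2 have to be replaced by numbers when the lexicon file is
generated. The following sketch shows one way this could be scripted. The function name
<code>lexc_entries</code>, the counts and the four-decimal formatting are assumptions made for
illustration; the entry layout simply mimics the figure and has not been checked against the
syntax accepted by any particular version of <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>.</p>
</div>

```python
import math


def lexc_entries(counts, continuation, cs):
    """Format one lexicon line per entry, weighted by -log(count / cs)."""
    lines = []
    for entry, count in sorted(counts.items()):
        weight = -math.log(count / cs)
        lines.append('{} {} "weight: {:.4f}" ;'.format(entry, continuation, weight))
    return lines


# Invented smoothed counts c(a); cs would normally come from the whole lexicon.
NON_FINAL_COUNTS = {"isä": 6, "isän": 3, "äiti": 5, "äidin": 2}
CORPUS_SIZE = sum(NON_FINAL_COUNTS.values())

if __name__ == "__main__":
    print("LEXICON CompoundNonFinalNoun")
    for line in lexc_entries(NON_FINAL_COUNTS, "Compound", CORPUS_SIZE):
        print(line)
```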
<div id="S4.p10" class="ltx_para">
<p class="ltx_p">According to Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite> and
Schiller <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>, we may need to ensure that the
weight of a compound segmentation <em class="ltx_emph">ab</em> of a word is always
greater than the weight of a non-compound analysis <em class="ltx_emph">c</em> of the
same word. For compounds we therefore use Equation <a href="#S4.E5" title="(5) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>,
where <em class="ltx_emph">a</em> is the first part of the compound and <em class="ltx_emph">x</em> is the
remaining part, which may be split into additional parts by applying the
equation recursively.</p>
</div>
<div id="S4.p11" class="ltx_para">
<table id="S4.E5" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E5.m1" class="ltx_Math" alttext="w(ax)=w(a)+M+w(x)" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">a</mi><mo></mo><mi mathsize="90%">x</mi></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">a</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">+</mo><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td>
</tr>
</table>
</div>
<div id="S4.p12" class="ltx_para">
<p class="ltx_p">In particular, it is true that <math id="S4.p12.m1" class="ltx_Math" alttext="w(ab)>w(c)" display="inline"><mrow><mrow><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mi>a</mi><mo></mo><mi>b</mi></mrow><mo stretchy="false">)</mo></mrow></mrow><mo>></mo><mrow><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>c</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></math> if <em class="ltx_emph">M</em> is
defined as in Equation <a href="#S4.E6" title="(6) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>.</p>
</div>
<div id="S4.p13" class="ltx_para">
<table id="S4.E6" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E6.m1" class="ltx_Math" alttext="M=-\mathrm{log}(1/(\mathrm{cs}+1))" display="block"><mrow><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">=</mo><mrow><mo mathsize="90%" stretchy="false">-</mo><mrow><mi mathsize="90%">log</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mn mathsize="90%">1</mn><mo mathsize="90%" stretchy="false">/</mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">cs</mi><mo mathsize="90%" stretchy="false">+</mo><mn mathsize="90%">1</mn></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td>
</tr>
</table>
</div>
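<div class="ltx_para">
<p class="ltx_p">One way to see why the penalty of Equation 6 is sufficient, assuming that every analysed word has been observed at least once and that token weights follow Equation 4: the rarest possible word has count 1 and thus weight log(cs), whereas <em class="ltx_emph">M</em> equals log(cs+1), which is strictly larger. Since the weights of the parts are non-negative, a compound analysis always weighs at least <em class="ltx_emph">M</em>. The Python sketch below checks this numerically; the corpus size and the function names are illustrative assumptions, not values from the experiments.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
import math


def compound_penalty(cs: int) -> float:
    """Return M = -log(1 / (cs + 1)), which equals log(cs + 1)."""
    return -math.log(1 / (cs + 1))


def max_token_weight(cs: int) -> float:
    """Largest Equation 4 weight: a word seen once has weight log(cs)."""
    return -math.log(1 / cs)


cs = 1_000_000  # hypothetical corpus size
M = compound_penalty(cs)
worst_simple_word = max_token_weight(cs)

# M alone already exceeds the weight of the rarest non-compound word,
# and the part weights w(a) and w(b) can only add to it.
print(M - worst_simple_word)
```
</pre>
</div>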
<div id="S4.p14" class="ltx_para">
<p class="ltx_p">For an illustration of how a structure weighting scheme with compound
penalties is implemented in the lexicon, see
Figure <a href="#S4.F3" title="Figure 3 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>.</p>
</div>
<figure id="S4.F3" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;
LEXICON Compound
0:# CompoundNonFinalNoun "weight: -log(1/(cs+1))" ;
0:# CompoundFinalNoun "weight: -log(1/(cs+1))" ;
LEXICON CompoundNonFinalNoun
isä Compound "weight: -log(c(isä)/cs)" ;
isän Compound "weight: -log(c(isän)/cs)" ;
äiti Compound "weight: -log(c(äiti)/cs)" ;
äidin Compound "weight: -log(c(äidin)/cs)" ;
LEXICON CompoundFinalNoun
isä:isä+sg+nom ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins ## "weight:-log(c(isä+pl+ins)/cs)" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Structure weighting scheme using token and compound penalties.
</figcaption>
</figure>
<div id="S4.p15" class="ltx_para">
<p class="ltx_p">In order to compare with the original principle suggested by
Karlsson <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1992</span>]</cite>, we create a third lexicon for which
structural weights are placed on the compound borders only, so for
compounds we use Equation <a href="#S4.E7" title="(7) ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a>.</p>
</div>
<div id="S4.p16" class="ltx_para">
<table id="S4.E7" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S4.E7.m1" class="ltx_Math" alttext="w(ax)=M+w(x)" display="block"><mrow><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mrow><mi mathsize="90%">a</mi><mo></mo><mi mathsize="90%">x</mi></mrow><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow><mo mathsize="90%" stretchy="false">=</mo><mrow><mi mathsize="90%">M</mi><mo mathsize="90%" stretchy="false">+</mo><mrow><mi mathsize="90%">w</mi><mo></mo><mrow><mo maxsize="90%" minsize="90%">(</mo><mi mathsize="90%">x</mi><mo maxsize="90%" minsize="90%">)</mo></mrow></mrow></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td>
</tr>
</table>
</div>
<div id="S4.p17" class="ltx_para">
<p class="ltx_p">For an illustration of how a weighting scheme with the compound
penalty suggested by Karlsson is implemented in the lexicon, see
Figure <a href="#S4.F4" title="Figure 4 ‣ 4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<figure id="S4.F4" class="ltx_figure"><pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
LEXICON Root
## CompoundNonFinalNoun ;
## CompoundFinalNoun ;
LEXICON Compound
0:# CompoundNonFinalNoun "weight: -log(1/(cs+1))" ;
0:# CompoundFinalNoun "weight: -log(1/(cs+1))" ;
LEXICON CompoundNonFinalNoun
isä Compound "weight: 0" ;
isän Compound "weight: 0" ;
äiti Compound "weight: 0" ;
äidin Compound "weight: 0" ;
LEXICON CompoundFinalNoun
isä:isä+sg+nom ## "weight:-log(c(isä+sg+nom)/cs)" ;
isän:isä+sg+gen ## "weight:-log(c(isä+sg+gen)/cs)" ;
isälle:isä+sg+all ## "weight:-log(c(isä+sg+all)/cs)" ;
isin:isä+pl+ins ## "weight:-log(c(isä+pl+ins)/cs)" ;
LEXICON ##
## # ;
</pre>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Structure weighting scheme using compound penalties.
</figcaption>
</figure>
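<div class="ltx_para">
<p class="ltx_p">To make the difference between the three lexicons concrete, the following Python sketch scores a list of compound parts under each scheme by unrolling the recursive Equations 5 and 7. The counts, the function names and the example parts are hypothetical, and natural logarithms are assumed.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
import math

# Hypothetical counts, invented for this illustration only.
counts = {"isä": 50, "isän": 30, "äiti": 15, "äidin": 5}
cs = sum(counts.values())
M = -math.log(1 / (cs + 1))  # compound penalty of Equation 6


def w(part: str) -> float:
    """Token weight of Equation 4."""
    return -math.log(counts[part] / cs)


def token_only(parts: list[str]) -> float:
    """Figure 2: sum of token weights, boundaries cost nothing."""
    return sum(w(p) for p in parts)


def token_and_compound(parts: list[str]) -> float:
    """Figure 3 and Equation 5: token weights plus M per boundary."""
    return sum(w(p) for p in parts) + M * (len(parts) - 1)


def compound_only(parts: list[str]) -> float:
    """Figure 4 and Equation 7: M per boundary plus the final part's weight."""
    return M * (len(parts) - 1) + w(parts[-1])


parts = ["äidin", "isä"]  # a two-part toy analysis
for scheme in (token_only, token_and_compound, compound_only):
    print(f"{scheme.__name__}\t{scheme(parts):.4f}")
```
</pre>
<p class="ltx_p">In all three schemes the analysis with the smallest total weight is preferred; the schemes differ only in which terms contribute to that total.</p>
</div>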
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Training and Test Data</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">For training and testing purposes, we use a compilation of three
years, 1995-1997, of daily issues of Helsingin Sanomat, which is the
most widespread Finnish newspaper. This collection contained
approximately 2.4 million different words, i.e. types. We
disambiguated the corpus using Machinese for
Finnish<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>Machinese is available from Connexor Ltd.,
www.connexor.com</span></span></span> which provided one reading in context for each
word based on syntactic parsing.</p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">To create the test material from the corpus, we selected all word
forms with more than 20 characters for which our baseline analyzer
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite> gave a compound analysis, i.e. 53 270 types. Of
these, we selected the types which had a structural ambiguity and
found 4 721 such words, i.e. approximately 8.9 % of all the compound
words analyzed by our baseline analyzer. Of the remaining compounds
longer than 20 characters, 63.7 % contained no ambiguities or only
inflectional ambiguities. At most, the combination of structural and
inflectional ambiguities amounted to 30 readings, which occurred in three different
words; this is a fairly moderate number. On average, the
structural and inflectional ambiguity amounts to 2.79 readings per
word. Examples of structurally ambiguous words are
<em class="ltx_emph">aktivointimahdollisuuksien</em> with the ambiguity
<em class="ltx_emph">aktivointi#mahdollisuus</em> ’of the opportunities to activate’ vs.
<em class="ltx_emph">akti#vointi#mahdollisuus</em> ’of the opportunities to act health’
and <em class="ltx_emph">hiihtoharjoittelupaikassa</em> with the ambiguity
<em class="ltx_emph">hiihto#harjoittelu#paikka</em> ’in the ski training location’
vs. <em class="ltx_emph">hiihto#harjoittelu#pai#kassa</em> ’ski training pie cashier’.</p>
</div>
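<div class="ltx_para">
<p class="ltx_p">The structural ambiguity described above can be reproduced with a small sketch that enumerates every way of splitting a word form into known parts. The Python code below is only an illustration: the toy lexicon of surface forms is an assumption taken from the second example, and the function is not part of the baseline analyzer.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
def segmentations(word: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split the word into parts found in the lexicon."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched head.
            for tail in segmentations(word[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon of surface forms, assumed for this illustration.
lexicon = {"hiihto", "harjoittelu", "paikassa", "pai", "kassa"}
analyses = segmentations("hiihtoharjoittelupaikassa", lexicon)
for parts in analyses:
    print("#".join(parts))
```
</pre>
<p class="ltx_p">With this lexicon the word receives exactly the two segmentations discussed above, and a weighting scheme is needed to choose between them.</p>
</div>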
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">The characteristics of all the compounds in the corpus are presented in
Table <a href="#S5.T1" title="Table 1 ‣ 5 Training and Test Data ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<figure id="S5.T1" class="ltx_table">
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" colspan="3"><span class="ltx_text" style="font-size:90%;"># of Characters</span></th>
<th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" colspan="3"><span class="ltx_text" style="font-size:90%;"># of Segments</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Min.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Max.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Avg.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Min.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Max.</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Avg.</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">44</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">15.34</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">6</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">2.19</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_table">Table 1: </span>Evaluation of compounds, segments and readings.
</figcaption>
</figure>
<div id="S5.p4" class="ltx_para">
<p class="ltx_p">Examples of six-part compounds are:
</p>
<ul id="I1" class="ltx_itemize">
<li id="I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span>
<div id="I1.i1.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">elo#kuva#teatteri#tuki#työ#ryhmä</em>
<br class="ltx_break">’movie theater support workgroup’</p>
</div>
</li>
<li id="I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span>
<div id="I1.i2.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">jatko#koulutus#yhteis#työ#toimi#kunta</em>
<br class="ltx_break">’higher education cooperation committee’</p>
</div>
</li>
<li id="I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize">•</span>
<div id="I1.i3.p1" class="ltx_para">
<p class="ltx_p"><em class="ltx_emph">lähi#alue#yhteis#työ#määrä#raha</em>
<br class="ltx_break">’regional cooperation reserve’</p>
</div>
</li>
</ul>
</div>
<div id="S5.p5" class="ltx_para">
<p class="ltx_p">The longest compound found in the corpus is
<em class="ltx_emph">liikenne#turvallisuus#asiain#neuvottelu#kunnassa</em> ’in the road
safety issue negotiating committee’.</p>
</div>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Tests and Results</h2>
<div id="S6.p1" class="ltx_para">
<p class="ltx_p">We estimate the probabilities for the non-compound words in the 1995
part of the corpus. Since we do not use the compounds for training we
can test on the compounds of all three years.</p>
</div>
<div id="S6.p2" class="ltx_para">
<p class="ltx_p">We evaluated the weighting schemes described in Section <a href="#S4" title="4 Methodology ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>,
i.e. the probabilistic method without compound boundary weighting, the
probabilistic method combined with compound weighting and the
traditional pure compound weighting. The precision and recall are
presented in Table <a href="#S6.T2" title="Table 2 ‣ 6 Tests and Results ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. Since we only took the first of
the best results, the precision is equal to recall.</p>
</div>
<figure id="S6.T2" class="ltx_table">
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Parameters</span></th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Precision</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">Only compound penalty</span></td>
<td class="ltx_td ltx_align_left ltx_border_r ltx_border_t"><span class="ltx_text" style="font-size:90%;">99.94 %</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_l ltx_border_r"><span class="ltx_text" style="font-size:90%;">Compound penalty and prefix weights</span></td>
<td class="ltx_td ltx_align_left ltx_border_r"><span class="ltx_text" style="font-size:90%;">99.98 %</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r"><span class="ltx_text" style="font-size:90%;">No compound penalty and prefix weights</span></td>
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_r"><span class="ltx_text" style="font-size:90%;">99.98 %</span></td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_table">Table 2: </span>Precision equals recall for the test results when we use
only the first result.
</figcaption>
</figure>
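<div class="ltx_para">
<p class="ltx_p">The reason precision and recall coincide can be sketched as follows: when exactly one analysis is returned for each word and each word has exactly one correct analysis, the number of predictions equals the number of gold analyses, so both measures share the same numerator and denominator. The Python fragment below illustrates this with two invented test items; the data and the function name are hypothetical and do not reproduce the figures in Table 2.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
def precision_recall(predicted: dict[str, str], gold: dict[str, str]) -> tuple[float, float]:
    """Compute precision and recall when each word has one predicted analysis."""
    correct = sum(1 for word, analysis in predicted.items() if gold.get(word) == analysis)
    precision = correct / len(predicted)  # correct / number of predictions
    recall = correct / len(gold)          # correct / number of gold analyses
    return precision, recall


# Hypothetical gold segmentations and first-best outputs.
gold = {
    "hiihtoharjoittelupaikassa": "hiihto#harjoittelu#paikassa",
    "aktivointimahdollisuuksien": "aktivointi#mahdollisuuksien",
}
predicted = {
    "hiihtoharjoittelupaikassa": "hiihto#harjoittelu#paikassa",
    "aktivointimahdollisuuksien": "akti#vointi#mahdollisuuksien",
}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```
</pre>
</div>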
</section>
<section id="S7" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7 </span>Implementation Note</h2>
<div id="S7.p1" class="ltx_para">
<p class="ltx_p">In <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, we use OpenFST <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">openfst/2007</span>]</cite> as the underlying
finite-state software library for handling weighted finite-state
transducers. The estimated probabilities are encoded as weights in the
tropical semiring, see <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">mohri/1997</span>]</cite>. To extract the n-best
results, we use a single-source n-best paths algorithm, see
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">mohri/2002</span>]</cite>.</p>
</div>
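<div class="ltx_para">
<p class="ltx_p">In the tropical semiring, weights along a path are combined by addition and alternative paths are combined by taking the minimum, so the best analysis is the path with the smallest total weight. The following Python sketch illustrates this on a small acyclic lattice. It is a simple best-first enumeration written for illustration only: it does not use the OpenFST API and is not the n-best algorithm cited above, and the arc weights are invented.</p>
<pre class="ltx_verbatim ltx_font_typewriter" style="font-size:90%;">
```python
import heapq
import math

ZERO = math.inf  # identity of oplus (no path)
ONE = 0.0        # identity of otimes (empty path)


def otimes(x: float, y: float) -> float:
    """Tropical 'times': extend a path by adding weights."""
    return x + y


def oplus(x: float, y: float) -> float:
    """Tropical 'plus': choose the cheaper of two alternatives."""
    return min(x, y)


def n_best(arcs, start, final, n):
    """Return up to n cheapest (weight, labels) paths in an acyclic graph."""
    heap = [(ONE, start, ())]
    found = []
    while heap and len(found) != n:
        weight, state, labels = heapq.heappop(heap)
        if state == final:
            found.append((weight, labels))
            continue
        for label, arc_weight, target in arcs.get(state, []):
            heapq.heappush(heap, (otimes(weight, arc_weight), target, labels + (label,)))
    return found


# Toy lattice with invented, non-negative weights: state -> (label, weight, target).
arcs = {
    0: [("hiihto", 1.0, 1)],
    1: [("harjoittelu", 2.0, 2)],
    2: [("paikassa", 3.0, 4), ("pai", 6.0, 3)],
    3: [("kassa", 5.0, 4)],
}
paths = n_best(arcs, 0, 4, 2)
for weight, labels in paths:
    print(weight, "#".join(labels))
```
</pre>
<p class="ltx_p">Because the weights are non-negative, complete paths leave the priority queue in order of increasing weight, so the first path found is the best analysis.</p>
</div>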
</section>
<section id="S8" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">8 </span>Discussion and Further Research</h2>
<div id="S8.p1" class="ltx_para">
<p class="ltx_p">Previous work on structural compound disambiguation for German
using word probabilities and compound penalties <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite> or
using only word probabilities <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">marek/2006</span>]</cite> also achieved
precision and recall in the region of 97-99 %. In German, the
ambiguities of long compounds may produce as many as 120 readings, but on
average the ambiguity in compounds is 2-3 readings
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>, which is on par with the ambiguity of 2.8
readings found for long Finnish compounds. As pointed out initially
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">hedlund/2002</span>]</cite>, the number of compounds occurring in Finnish,
Swedish and German texts is also on a comparable level.</p>
</div>
<div id="S8.p2" class="ltx_para">
<p class="ltx_p">If a disambiguated corpus is not available for calculating the word
probabilities, using only the structural penalties may still be an
acceptable replacement in Finnish. However, we should note that a
similar strategy in German, i.e. using only compound penalties on all
compound prefixes, did not seem to perform as well
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">schiller/2005</span>]</cite>. This may be because German contains a
high number of very short one-syllable words which interfere with the
compounding, whereas Finnish is more restricted in the number of short
words. Scandinavian languages are similar to German in that they have
a number of short one-syllable nouns. A probabilistic approach to
Swedish compound disambiguation is demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">sjobergh/2004</span>]</cite>,
which reports 86 % accuracy in compound segmentation when using
compound component frequencies and 90 % when using the number of compound components.
However, it remains a question for
further research whether a purely probabilistic approach could fare as
well for Scandinavian languages.</p>
</div>
</section>
<section id="S9" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">9 </span>Conclusions</h2>
<div id="S9.p1" class="ltx_para">
<p class="ltx_p">For Finnish, weighting compound complexity gives excellent results,
around 99.9 %, almost regardless of the approach. However, from a
theoretical point of view, we can still verify the two hypotheses we
postulated initially. Most importantly, there seems to be no need to
extract the counts from lists of disambiguated compounds, i.e., it is
quite feasible to use general word occurrence probabilities for
structurally disambiguating compounds. In addition, we can also
corroborate the observation that when using word probabilities, it is
possible to forego a specific structural penalty and rely only on the
word probabilities. From a practical point of view, we introduced the
open source tool, <span class="ltx_text ltx_font_smallcaps">hfst-lexc</span>, and demonstrated how it can be
successfully used to encode various compound weighting schemes.
</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Acknowledgments</h2>
<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">This research was funded by the Finnish Academy and the Finnish
Ministry of Education. We are also grateful to the HFST-Helsinki
Finite State Technology research team and to the anonymous reviewers.</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul id="L1" class="ltx_biblist">
<li id="bib.bib47" class="ltx_bibitem ltx_bib_book">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://kaino.kotus.fi/visk" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib132" title="Iso suomen kielioppi" class="ltx_ref">2</a>.
</span>
</li>
<li id="bib.bib132" class="ltx_bibitem ltx_bib_book">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p4" title="2 Inflection and Compounding in Finnish ‣ Weighted Finite-State Morphological Analysis of Finnish Inflection and CompoundingThe official publication was in Nodalida 2009 organised in Odense, http://beta.visl.sdu.dk/nodalida2009/, the electronic publication was available at http://dspace.utlib.ee/dspace/handle/10062/9206 on October 13, 2017." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.
</span>
</li>
</ul>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated on Fri Oct 13 18:35:50 2017 by <a href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a>
</div></footer>
</div>
</body>
</html>