<!DOCTYPE html><html>
<head>
<title>Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources</title>
<!--Generated on Fri Sep 29 15:32:55 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on September 29, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Using weighted finite state morphology with VISL CG-3—Some experiments
with free open source Finnish resources</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen
<br class="ltx_break">Ollscoil Chathair Bhaile Átha Cliath
<br class="ltx_break">CNGL—School of Computing
<br class="ltx_break">Dublin City University, Dublin 9
<br class="ltx_break"><span class="ltx_text ltx_font_typewriter">[email protected]</span>
</span></span>
</div>
<div class="ltx_date ltx_role_creation">September 29, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">Traditionally, the coupling of finite-state morphology and constraint
grammar has been strictly rule-based, making binary distinctions
between allowed and disallowed readings. In recent years, however, much
of the research on finite-state morphologies has adopted the
contemporary paradigm of statistically weighted analysis. This is reflected
in the finite-state morphology component of current versions of omorfi, the
free and open source morphology of Finnish. In this paper we examine two strategies
for making use of the weights as part of a VISL CG-3 pipeline.
We evaluate the results intrinsically on a small sample of analyses that we
have disambiguated by hand, and extrinsically on the effect they
have on rule-based machine translation of the same text using the freely
available open source translator apertium-fin-eng.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">In recent years, the use of statistical information in computational
linguistics has gained much interest, with systems such as hunpos <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib10" title="HunPos: an open source trigram tagger" class="ltx_ref">4</a>]</cite>
and moses <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib7" title="Moses: open source toolkit for statistical machine translation" class="ltx_ref">6</a>]</cite>
being the main points of interest of most research in the field. In finite-state
morphology as well as in constraint grammars, extensions to handle
probabilities are recent and scarcely
documented <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib15" title="Weighting finite-state morphological analyzers using HFST tools" class="ltx_ref">9</a>, <a href="#bib.bib2" title="Introducing probabilistic information in constraint grammar parsing" class="ltx_ref">2</a>]</cite>. In this paper we
experiment with an existing weighted finite-state morphology of
Finnish <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib16" title="Modularisation of Finnish finite-state language description: towards wide collaboration in open source development of morphological analyser" class="ltx_ref">10</a>]</cite><span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><a href="https://github.com/flammie/omorfi/" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://github.com/flammie/omorfi/</a></span></span></span>
with VISL CG-3. For CG we have adapted Fred Karlsson’s Finnish CG rules to
omorfi’s tag set. However, the rules were written for a completely different
analyser, which results in relatively low quality and a high level of remaining
ambiguity at the current stage. We estimate that salvaging these rules for the current
version of the analyser would require a substantial rewriting effort. In the
meantime, there are many easy targets that a correctly trained statistical
analyser can already deal with without extra effort. For example, one large
difference we assume between the analyser these CG rules were written for and
omorfi is that omorfi contains a huge number of proper nouns, dialectal
and sub-standard forms, and rare names of languages, animals and the like, which are left
ambiguous. It is obvious to a human reader that these readings are very
unlikely, and given most corpora we expect them to be highly penalised as
well.
</p>
</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">The main goal of this experiment is to create a functional
pipeline out of weighted finite-state analysis and the current version of the
constraint grammar. There are obvious conflicts between the statistically
driven approach of ranking hypotheses and the strictly deleting approach of the
current constraint grammar, which may limit the usefulness of our current method
of combining these two information sources.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The rest of the paper is structured as follows: In section <a href="#S2" title="2 Background ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>
we explain our starting point and current pipelines for morphological analysis,
disambiguation and machine translation. In section <a href="#S3" title="3 Methods ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> we
explain various approaches we tried to include and combine weight data from
the weighted finite-state analysers into VISL CG-3 and finally into machine
translation. In section <a href="#S4" title="4 Experimental Setup ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> we describe our
experiments and how we measured the workability of our approach. In
section <a href="#S5" title="5 Evaluation ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> we show the results of the experiment. In
section <a href="#S6" title="6 Discussion ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a> we perform error analysis, compare our work with
other existing approaches and lay out the future work. Finally in
section <a href="#S7" title="7 Conclusion ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a> we summarise the conclusion of the experiments.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Background</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">Our starting point for this experiment was a modern, weighted
finite-state morphology <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib3" title="Finite-state morphology: xerox tools and techniques" class="ltx_ref">1</a>, <span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen2008weighting</span>]</cite> implementation of Finnish morphology in
omorfi <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen2015omorfi</span>]</cite>. This morphology has rudimentary support for
probabilistic weighting of surface forms or analyses using a corpus-based
unigram training approach. However, the lack of high-quality free
and open source corpora compatible with omorfi analyses means that it is
distributed with very basic linguist-written weights on the analysis side.
For the main purpose of this experiment we deemed this sufficient
to get the weights working through the pipeline at all.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">On the other hand we had a free and open source, mature and large CG grammar by
Fred Karlsson, which needed conversion to an omorfi-compatible tagging format, as
well as some changes from CG-1 syntax to VISL CG-3.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>Even though CG-1
and VISL CG-3 are possibly mostly compatible, we found that some things
may have started working better when changing to more conventional VISL CG-3
constructions.</span></span></span></p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">The fact that the CG rules from Karlsson were built using a very different
analyser from ours also played a large role in our decision to combine the
weighted approach with the pure constraint grammar approach: the rule-writers
of the original grammar had not seen a large portion of the ambiguities
introduced by the larger, more varied lexicon of omorfi, including things like
dialects, large inventories of proper nouns and unlikely but attested
readings like plural cases of singular personal pronouns. For example, in
the story we use for reference in our translation experiments,
sentence-initial common words like “Mutta” (but) and “Koira” (dog) also have
proper noun readings, and conversely proper nouns like “Mari” have been given a
common noun reading (slang for marijuana). Obviously these are not dealt with
in the original ruleset, as they did not appear as ambiguities to the
writers of the rules.</p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Methods</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">To first convert the original CG-1 ruleset to omorfi-format analyses, we went
through the rules by hand from beginning to end. This resulted in a ruleset
where only a subset of the rules matched any constructs in the analysed texts.
To further improve the quality and fix many conversion errors we made use
of the new VISL CG-3 feature <span class="ltx_text ltx_font_typewriter">no-inline-sets</span>. With the help of this
feature we got most of the ambiguous word-forms to match at least some of the
rules, which hopefully means that the conversion does not have too many tag formatting
mismatch errors at the very least. The resulting ruleset, with the weight-based
rule integrated, is available from the omorfi git repository.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">To feed omorfi analyses into VISL CG-3 we have extended the Python interface
of omorfi to output analyses in the CG stream format, with omor-style
<span class="ltx_text ltx_font_typewriter">[FEATURE=VALUE]</span> tags mapped into more conventional CG-style tags,
mostly of the form <span class="ltx_text ltx_font_typewriter">VALUE</span>. There are a number of deviations from this, of
course, the most notable being the <span class="ltx_text ltx_font_typewriter">WEIGHT=</span> feature, which is turned into a
VISL CG-3 <em class="ltx_emph">numeric tag</em>. Other special conversions include
usage, dialect and similar lexical information, which are all included in angle
bracket tags following VISL CG-3 conventions. The omorfi Python interface also
performs some case mangling (uppercasing, lowercasing, title-casing and
removing title case) that seems to be similar to what the CG-1 rules expect to
appear in some angle-bracketed tags, so we have tried to map these to the
readings in the original ruleset, with limited success.</p>
</div>
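<div class="ltx_para">
<p class="ltx_p">The tag mapping described above can be illustrated with a minimal Python sketch. It is not the code of the omorfi Python interface: the function name <span class="ltx_text ltx_font_typewriter">omor_to_cg</span>, the regular expression, the feature names <span class="ltx_text ltx_font_typewriter">STYLE</span> and <span class="ltx_text ltx_font_typewriter">DIALECT</span> and the example analysis string are all assumptions made for illustration, and the weight feature is simply skipped here.</p>
</div>

```python
import re

# Matches one omor-style tag such as [NUM=SG]; the pattern is an assumption.
OMOR_TAG = re.compile(r"\[([A-Z_]+)=([^\]]*)\]")

# Hypothetical feature names whose values go into angle-bracket tags.
LEXICAL_FEATURES = {"STYLE", "DIALECT"}


def omor_to_cg(analysis):
    """Turn one omor-style analysis string into one CG reading line."""
    lemma = ""
    tags = []
    for feature, value in OMOR_TAG.findall(analysis):
        if feature == "WORD_ID":
            lemma = value
        elif feature == "WEIGHT":
            continue  # weights need scaling first, so they are left out here
        elif feature in LEXICAL_FEATURES:
            tags.append("<%s>" % value)
        else:
            tags.append(value)
    # CG stream readings are indented and start with the quoted lemma.
    return '\t"%s" %s' % (lemma, " ".join(tags))


print(omor_to_cg("[WORD_ID=koira][UPOS=NOUN][NUM=SG][CASE=NOM][WEIGHT=0.1]"))
```

<div class="ltx_para">
<p class="ltx_p">Keeping the list of angle-bracketed features in one set makes it easy to adjust when a converted rule turns out to expect a different tag shape.</p>
</div>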
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">The probabilities in omorfi are provided by the underlying HFST <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Hfst: framework for compiling and applying morphologies" class="ltx_ref">7</a>]</cite>
system as floating point weights, based on the finite-state implementation of the
tropical semiring. A weight can be based on negative logarithms of
probabilities of the word-forms, lemmas, analyses etc., as well as on
linguist-defined arbitrary values. For the purposes of this experiment we only
used the linguist-defined values, which fall neatly in the range 0.1 to 32. This
simplifies the scaling of the weights introduced by VISL CG-3 processing, as we
only have to scale against a known range instead of e.g. combinations of
negative logarithms’ maxima. As noted earlier in section <a href="#S2" title="2 Background ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>,
we use the default setting which is based on linguist-approximated tag
likelihoods. Since VISL CG-3 does
not support floating point numbers, e.g. 0.1, we multiply each weight
by 100, round it down and output it as
a numeric tag of the form <span class="ltx_text ltx_font_typewriter">&lt;W=<em class="ltx_emph">weight</em>&gt;</span>, where
<em class="ltx_emph">weight</em> is the scaled weight. This is sufficient for the
coarse weights that the default analyser produces, and in line with what e.g.
<span class="ltx_text ltx_font_typewriter">cg-conv</span> does when it converts stream formats containing decimal data
into numeric tags.</p>
</div>
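<div class="ltx_para">
<p class="ltx_p">As a rough illustration of the scaling step, the following Python sketch multiplies a weight by 100, rounds it down and formats it as a numeric tag. The function name <span class="ltx_text ltx_font_typewriter">weight_to_numeric_tag</span> and the use of the <span class="ltx_text ltx_font_typewriter">decimal</span> module are assumptions of this sketch, not a description of the omorfi code.</p>
</div>

```python
from decimal import Decimal, ROUND_FLOOR


def weight_to_numeric_tag(weight, scale=100):
    """Scale a weight, round it down and wrap it in a <W=n> numeric tag."""
    # Going through str() and Decimal avoids binary floating point artefacts
    # when multiplying, so that 0.29 scales to 29 rather than just below it.
    scaled = Decimal(str(weight)) * scale
    return "<W=%d>" % int(scaled.to_integral_value(rounding=ROUND_FLOOR))


for w in (0.1, 0.29, 32):
    print(w, weight_to_numeric_tag(w))
```

<div class="ltx_para">
<p class="ltx_p">The decimal arithmetic is a design choice of the sketch: in IEEE double precision a product such as 0.29 * 100 evaluates slightly below 29, so a plain multiply-and-floor can come out one unit too low.</p>
</div>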
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">The basic support for numeric tag processing in VISL CG-3 is provided by the
<span class="ltx_text ltx_font_typewriter">SELECT (&lt;W=MIN&gt;)</span> statement. If applied as the sole rule to the result of the
omorfi to VISL CG-3 conversion, it behaves exactly like a traditional weighted finite-state
morphology producing the 1-best analysis. When combining it with the existing
ruleset, we add it in a last, separate <span class="ltx_text ltx_font_typewriter">SECTION</span>, in order to
integrate some weight handling into the CG iterations.</p>
</div>
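<div class="ltx_para">
<p class="ltx_p">The effect of the weight-based select rule on a single cohort can be approximated outside VISL CG-3 with a few lines of Python. This is a simulation under assumptions, not the VISL CG-3 implementation: the function name <span class="ltx_text ltx_font_typewriter">select_min_weight</span> and the example readings are hypothetical, and readings are represented as plain strings.</p>
</div>

```python
import re

W_TAG = re.compile(r"<W=(\d+)>")


def reading_weight(reading):
    """Return the integer in the reading's <W=n> tag, or None if absent."""
    match = W_TAG.search(reading)
    return int(match.group(1)) if match else None


def select_min_weight(readings):
    """Keep only the readings that carry the smallest <W=n> value."""
    weights = [reading_weight(r) for r in readings]
    known = [w for w in weights if w is not None]
    if not known:
        # No reading has a weight tag, so the rule has nothing to select.
        return list(readings)
    best = min(known)
    return [r for r, w in zip(readings, weights) if w == best]


cohort = ['"koira" NOUN SG NOM <W=10>', '"Koira" PROPN SG NOM <W=3200>']
print(select_min_weight(cohort))
```

<div class="ltx_para">
<p class="ltx_p">Readings that tie on the smallest weight are all kept in this sketch, so the result is a 1-best analysis only when the weights are distinct.</p>
</div>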
<div id="S3.p5" class="ltx_para">
<p class="ltx_p">One long-term goal of this experimentation was to use VISL CG-3 also as a
part of a morphological analysis pipeline that produces n-best lists in the same
manner as a weighted finite-state analyser does. To make this work, we take
the output of VISL CG-3’s cg-proc in trace mode and convert it back
to an n-best list. There are multiple possible strategies for turning the trace information
on deleted readings back into weights. In this experiment, we have simply
added the line number of the rule that deleted the reading; this reflects the fact that
later rules in the file are more risky and less ambiguous. Ideally, however,
we would like to annotate the rules using rule name labels, such as
“usually” or “dangerous”, to denote e.g. multipliers for such rules. Furthermore,
it is likely that it is not exactly the line number but rather the section
number that is relevant for the rule likelihood, due to the way linguists and
rule-writers organise rules within sections into blocks of related rules,
where ordering within and between blocks may not be important.</p>
</div>
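<div class="ltx_para">
<p class="ltx_p">A minimal Python sketch of the re-weighting idea follows. It assumes that, in trace output, deleted readings start with <span class="ltx_text ltx_font_typewriter">;</span> and carry a trace tag such as <span class="ltx_text ltx_font_typewriter">SELECT:120</span> or <span class="ltx_text ltx_font_typewriter">REMOVE:120</span> giving the line number of the responsible rule; the function name <span class="ltx_text ltx_font_typewriter">rerank</span> and the example cohort are hypothetical, and the actual conversion script may differ.</p>
</div>

```python
import re

W_TAG = re.compile(r"<W=(\d+)>")
# Assumed shape of a trace tag: rule type, colon, line number of the rule.
TRACE_TAG = re.compile(r"\b(?:SELECT|REMOVE):(\d+)")


def rerank(cohort_lines):
    """Return (penalty, reading) pairs for one cohort, best first."""
    ranked = []
    for line in cohort_lines:
        stripped = line.strip()
        deleted = stripped.startswith(";")
        weight = W_TAG.search(stripped)
        penalty = int(weight.group(1)) if weight else 0
        if deleted:
            trace = TRACE_TAG.search(stripped)
            if trace:
                # Deleted readings are kept, penalised by the rule's line number.
                penalty += int(trace.group(1))
        ranked.append((penalty, stripped.lstrip(";").strip()))
    return sorted(ranked)


cohort = [
    '\t"koira" NOUN SG NOM <W=10> SELECT:120',
    ';\t"Koira" PROPN SG NOM <W=30> SELECT:120',
]
for penalty, reading in rerank(cohort):
    print(penalty, reading)
```

<div class="ltx_para">
<p class="ltx_p">The sketch adds the raw line number to the scaled weight without any normalisation; how the two scales should relate to each other is left open here.</p>
</div>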
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Experimental Setup</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">For analysis we use the Python API of omorfi version 20150326 to turn the
analyses into the format understood by VISL CG-3. We use a version of Fred
Karlsson’s Finnish CG found in apertium’s
repository,<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup><a href="http://sourceforge.net/p/apertium/svn/HEAD/tree/nursery/apertium-fin-eng/apertium-fin-eng.fin-eng.rlx" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://sourceforge.net/p/apertium/svn/HEAD/tree/nursery/apertium-fin-eng/apertium-fin-eng.fin-eng.rlx</a></span></span></span>
with the tag set manually converted to match
omorfi’s.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><a href="https://github.com/flammie/omorfi/tree/master/src/vislcg3" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://github.com/flammie/omorfi/tree/master/src/vislcg3</a></span></span></span>
However, given the number of ambiguous names of tags, sets and lists in the
grammar, there may be some conversion errors left. The system is tested with
VISL CG-3 version 0.9.9.10730, compiled from Gentoo
packaging.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><a href="https://github.com/flammie/flammie-overlay/tree/master/sci-misc/vislcg3" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://github.com/flammie/flammie-overlay/tree/master/sci-misc/vislcg3</a></span></span></span></p>
</div>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">To test the functionality of our combination of a weighted
finite-state analyser and VISL CG-3, we analyse a short text that we have
manually disambiguated and measure the quality of the analyses. The source of the
text is found in apertium’s SVN
repository.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup><a href="http://sourceforge.net/p/apertium/svn/HEAD/tree/nursery/apertium-fin-eng/texts/tarina.fin.text" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://sourceforge.net/p/apertium/svn/HEAD/tree/nursery/apertium-fin-eng/texts/tarina.fin.text</a></span></span></span>
For the purpose of this experiment, we have manually tokenised the text
before processing it with omorfi.
In addition to analysis we use the results of analyses in apertium’s
Finnish-English machine translator, and measure the translation quality. This
way we can ensure that the gold annotation has not been selected to best fit
our results but is actually the semantically most fitting one. The gold
annotations can also be found in the omorfi git repository.</p>
</div>
<div id="S4.p3" class="ltx_para">
<p class="ltx_p">To perform the evaluations we used a simple Python script that performs string
comparisons against the gold file on the reading lines, i.e. the lines between those starting with
<span class="ltx_text ltx_font_typewriter">"&lt;</span>, ignoring empty lines and the ADDed CLB tags. The machine
translation analysis was performed against the current apertium-fin-eng ruleset
and the reference translation in their SVN, with standard machine translation
metrics as computed by NIST’s <span class="ltx_text ltx_font_typewriter">mteval-13a.pl</span>, which implements the standard
BLEU metric of machine translation <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">bleu</span>]</cite>.</p>
</div>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Evaluation</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">We first evaluated the analysers against the gold standard in
table <a href="#S5.T1" title="Table 1 ‣ 5 Evaluation ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. We use simple metrics of <span class="ltx_text ltx_font_italic">Recall</span> and
<span class="ltx_text ltx_font_italic">Precision</span>, defined as <math id="S5.p1.m1" class="ltx_Math" alttext="\mathrm{Recall}=\frac{\mathrm{Correct}}{\mathrm{Gold}}" display="inline"><mrow><mi>Recall</mi><mo>=</mo><mfrac><mi>Correct</mi><mi>Gold</mi></mfrac></mrow></math>, where Correct is the number of correct
readings and Gold is the number of gold readings, and <math id="S5.p1.m2" class="ltx_Math" alttext="\mathrm{Precision}=\frac{\mathrm{Correct}}{\mathrm{All}}" display="inline"><mrow><mi>Precision</mi><mo>=</mo><mfrac><mi>Correct</mi><mi>All</mi></mfrac></mrow></math>, where All is the number of all readings
given by the disambiguation scheme. The row <span class="ltx_text ltx_font_italic">Weights</span> stands for CG with
only the weight-based select rule applied, the row <span class="ltx_text ltx_font_italic">Rules</span> stands for only the
converted CG ruleset applied, and the row <span class="ltx_text ltx_font_italic">Combination</span> uses both.</p>
</div>
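<div class="ltx_para">
<p class="ltx_p">The two metrics can be computed as in the following Python sketch, which is not the evaluation script itself. It assumes that the system output and the gold standard are already aligned word-form by word-form and that each word-form is represented as a set of reading strings; the function name <span class="ltx_text ltx_font_typewriter">precision_recall</span> and the toy data are hypothetical.</p>
</div>

```python
def precision_recall(system, gold):
    """Compute (precision, recall) over aligned lists of reading sets."""
    # Correct: system readings that also appear in the gold standard.
    correct = sum(len(s & g) for s, g in zip(system, gold))
    # All: every reading the disambiguation scheme left in place.
    n_all = sum(len(s) for s in system)
    # Gold: every reading in the gold standard.
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_all if n_all else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


system = [{"koira NOUN", "Koira PROPN"}, {"juosta VERB"}]
gold = [{"koira NOUN"}, {"juosta VERB"}]
print(precision_recall(system, gold))
```

<div class="ltx_para">
<p class="ltx_p">In the toy data the system keeps one superfluous reading, which lowers precision while recall stays at its maximum, the same pattern as in the row with weights only.</p>
</div>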
<figure id="S5.T1" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_bold">System</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Precision</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Recall</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Weights</span></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">60</td>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">99</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Rules</span></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">78</td>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">91</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Combination</span></th>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_r ltx_border_t">80</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_r ltx_border_t">90</td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Precision and recall of different combinations of
weighted morphology and rules.</figcaption>
</figure>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">As expected, the precision of the combination is greater than that of the converted rules alone,
which in turn is greater than that of the weight-based rule alone, which currently does not
have a large disambiguating effect. Recall, conversely, is highest for the weights alone, as the
weights let most readings through.</p>
</div>
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">The resulting analyses are then converted to the format expected by apertium for
machine translation and evaluated for machine translation quality in
table <a href="#S5.T2" title="Table 2 ‣ 5 Evaluation ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<figure id="S5.T2" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_bold">System</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">BLEU</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">PWER</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Weights</span></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">6.86</td>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">87.11</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Rules</span></th>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">5.76</td>
<td class="ltx_td ltx_align_right ltx_border_r ltx_border_t">88.67</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_l ltx_border_r ltx_border_t"><span class="ltx_text ltx_font_italic">Combination</span></th>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_r ltx_border_t">5.60</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_r ltx_border_t">88.89</td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>BLEU and PWER scores of machine translation with different combinations of
weighted morphology and rules.</figcaption>
</figure>
<div id="S5.p4" class="ltx_para">
<p class="ltx_p">As can be seen from the scores, the translator is still quite far from usable
quality, and thus the comparison may not be very interesting. However, we can see
that the scores are systematically better with the systems deemed worse by
precision and better by recall.</p>
</div>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Discussion</h2>
<div id="S6.p1" class="ltx_para">
<p class="ltx_p">First of all, we note that the quality difference gained by adding weights has
diminished between the version prior to the conference and the current version. This is
largely due to a newer version released at the workshop containing features that
greatly improved the tag matching of the converted ruleset. Following this
result we can say that the weights are most useful when the rules do not yet
have high coverage, i.e. in the early stages of development or, as in this case,
of a conversion process.</p>
</div>
<div id="S6.p2" class="ltx_para">
<p class="ltx_p">Nevertheless, combining the weights with the rules still brings improvements
to exactly the shortcomings noted in the introduction as problems of the
mismatching morphologies. In error evaluation, the cases that are affected
by rules are mostly in derivation and productive compounding, but also some
marginal cases that are not covered by rules.</p>
</div>
<div id="S6.p3" class="ltx_para">
<p class="ltx_p">For future work we are aiming to use the n-best list version of the result in
a real-world application pipeline.</p>
</div>
</section>
<section id="S7" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7 </span>Conclusion</h2>
<div id="S7.p1" class="ltx_para">
<p class="ltx_p">We have implemented VISL CG-3 output on top of the existing weighted finite-state
analysis of Finnish and tested that it works in combination with VISL CG-3.
We have successfully included this combination as a part of the apertium machine
translation pipeline. We note that weighted finite-state analysis can be
easily combined with VISL CG-3 and results in an increased accuracy.</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Acknowledgments</h2>
<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">The research leading to these results has received
funding from the European Union Seventh Framework
Programme FP7/2007-2013 under grant agreement
PIAP-GA-2012-324414 (Abu-MaTran).</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul id="L1" class="ltx_biblist">
<li id="bib.bib3" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. R. Beesley and L. Karttunen</span><span class="ltx_text ltx_bib_year"> (2003)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Finite-state morphology: xerox tools and techniques</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">CSLI, Stanford</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p1" title="2 Background ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.
</span>
</li>
<li id="bib.bib2" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Bick</span><span class="ltx_text ltx_bib_year"> (2009)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Introducing probabilistic information in constraint grammar parsing</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of Corpus Linguistics 2009</span>,
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://ucrel.lancs.ac.uk/publications/cl2009/" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib11" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[3]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Halácsy, A. Kornai and C. Oravecz</span><span class="ltx_text ltx_bib_year"> (2007)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">HunPos: an open source trigram tagger</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 209–212</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib10" title="HunPos: an open source trigram tagger" class="ltx_ref">4</a>.
</span>
</li>
<li id="bib.bib10" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[4]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Halácsy, A. Kornai and C. Oravecz</span><span class="ltx_text ltx_bib_year"> (2007)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">HunPos: an open source trigram tagger</span>.
</span>
<span class="ltx_bibblock">See <span class="ltx_text ltx_bib_crossref"><cite class="ltx_cite"><a href="#bib.bib11" title="HunPos: an open source trigram tagger" class="ltx_ref">HunPos: an open source trigram tagger, Halácsy<span class="ltx_text ltx_bib_etal"> et al.</span></a></cite></span>,
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib8" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[5]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran and R. Zens</span><span class="ltx_text ltx_bib_year"> (2007)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Moses: open source toolkit for statistical machine translation</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 177–180</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib7" title="Moses: open source toolkit for statistical machine translation" class="ltx_ref">6</a>.
</span>
</li>
<li id="bib.bib7" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[6]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran and R. Zens</span><span class="ltx_text ltx_bib_year"> (2007)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Moses: open source toolkit for statistical machine translation</span>.
</span>
<span class="ltx_bibblock">See <span class="ltx_text ltx_bib_crossref"><cite class="ltx_cite"><a href="#bib.bib8" title="Moses: open source toolkit for statistical machine translation" class="ltx_ref">Moses: open source toolkit for statistical machine translation, Koehn<span class="ltx_text ltx_bib_etal"> et al.</span></a></cite></span>,
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib12" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[7]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Lindén, E. Axelson, S. Hardwick, T. A. Pirinen and M. </span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">HFST–framework for compiling and applying morphologies</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Systems and Frameworks for Computational Morphology</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S3.p3" title="3 Methods ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>.
</span>
</li>
<li id="bib.bib13" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[8]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Lindén, E. Axelson, S. Hardwick, T. A. Pirinen and M. </span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">HFST–framework for compiling and applying morphologies</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Systems and Frameworks for Computational Morphology</span>, <span class="ltx_text ltx_bib_pages"> pp. 67–85</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib12" title="HFST–framework for compiling and applying morphologies" class="ltx_ref">7</a>.
</span>
</li>
<li id="bib.bib15" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[9]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Lindén and T. Pirinen</span><span class="ltx_text ltx_bib_year"> (2009-07)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Weighting finite-state morphological analyzers using HFST tools</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">FSMNLP 2009</span>, <span class="ltx_text ltx_bib_editor">B. Watson, D. Courie, L. Cleophas and P. Rautenbach (Eds.)</span>,
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text isbn ltx_bib_external">ISBN 978-1-86854-743-2</span>,
<a href="http://www.ling.helsinki.fi/klinden/pubs/fsmnlp2009weighting.pdf" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib16" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[10]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">T. A. Pirinen</span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Modularisation of Finnish finite-state language description–towards wide collaboration in open source development of morphological analyser</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of Nodalida</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_series">NEALT proceedings</span>, Vol. <span class="ltx_text ltx_bib_volume">18</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://www.helsinki.fi/%7Etapirine/publications/Pirinen-nodalida-2011-omorfi.pdf" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Using weighted finite state morphology with VISL CG-3—Some experiments with free open source Finnish resources" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
</ul>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated on Fri Sep 29 15:32:55 2017 by <a href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a>
</div></footer>
</div>
</body>
</html>