-
Notifications
You must be signed in to change notification settings - Fork 3
/
Pirinen-2011-nodalida-omorfi.html
424 lines (407 loc) · 27.7 KB
/
Pirinen-2011-nodalida-omorfi.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
<!DOCTYPE html><html>
<head>
<title>Modularisation of Finnish Finite-State Language Description—Towards Wide Collaboration in Open Source Development of Morphological Analyser\footnotepubrightsPublished originally in Nodalida 2011 in RÄ«ga. Official version may be at http://dspace.utlib.ee/dspace/handle/10062/16955.</title>
<!--Generated on Fri Sep 29 12:57:01 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="LaTeXML.css" type="text/css">
<link rel="stylesheet" href="ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Modularisation of Finnish Finite-State Language Description—Towards
Wide Collaboration in Open Source Development of Morphological
Analyser<span class="ltx_ERROR undefined">\footnotepubrights</span>Published originally in Nodalida 2011 in RÄ«ga.
Official version may be at <a href="http://dspace.utlib.ee/dspace/handle/10062/16955" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://dspace.utlib.ee/dspace/handle/10062/16955</a>.</h1>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">In this paper we present an open source implementation for Finnish
morphological parser. We shortly evaluate it against contemporary criticism
towards monolithic and unmaintainable finite-state language description. We
use it to demonstrate way of writing finite-state language description that is
used for varying set of projects, that typically need morphological analyser,
such as POS tagging, morphological analysis, hyphenation, spell checking and
correction, rule-based machine translation and syntactic analysis. The
language description is done using available open source methods for building
finite-state descriptions coupled with autotools-style build system, which is
de facto standard in open source projects.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">Writing maintainable language descriptions for finite-state systems has
traditionally been a laborious task. Even though finite-state technology has
been de facto standard for writing computational language descriptions for more
than two decades now <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>, it has some recognised flaws and
problems both caused by shortcomings of actual implementations and
background technology <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">wintner/2008</span>]</cite>. Commonly language description is
performed by a single linguist or language technologist. The descriptions
typically wind up being complex enough that modifying them requires a great
amount of studying and understanding before one is able to do the smallest of
modifications to the system. In the current times that all proper, healthy,
scientific projects should be open source and globally developed, this poses a
challenge for such project’s internal structure. Another source of problem in
such collaboration is that background of contributors for language description
varies from computer scientists to linguists <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">maxwell/2008</span>]</cite> to
computer-savvy native language speakers, all of whom should be able to
contribute to the project. The solutions we propose for this is to embrace
proper modularisation in language descriptions to allow multiple specific
entry points to contributors.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">In this paper we describe a new implementation of the Finnish language
description called omorfi<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>http://home.gna.org/omorfi</span></span></span>, made to support
large variety of NLP applications and different audiences. While the
background theory for implementing finite-state description of Finnish was laid
out already in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">koskenniemi/1983</span>]</cite>, and morphophonological system
doesn’t have significant changes, the actual system was rewritten from the
scratch. The rewriting was originally done by single linguist as usual, in a
master’s thesis project <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite>, but afterwards it has been
extended as full-fledged open source project and used in various contexts. This
extended development has made necessary to create a better modularised
framework to allow people of different level of knowledge and familiarity with
finite-state technology and Finnish to contribute on their prospective parts of
the description without causing problems for other applications of the
finite-state analysers.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The projects that have used and use omorfi as language description include
spell checking and correction <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2010/lrec</span>]</cite>, lemmatizing for IR
applications (e.g. <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">kurola/2010</span>]</cite>), named entity recognition, rule-based
machine translation (e.g. Finnish—Northern
Sámi<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><a href="http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-fin/" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-fin/</a></span></span></span>)
and syntactic disambiguation and analysis. The demands for even the basic
morphology with all these different applications are very different with
regards to productivity; lexical coverage and accuracy as well as depth of
tagging, so it has became obvious that no one lexical automaton will work for
everyone. For this reason the modularisation has to provide easily
configurable options and modifiability for all end-points.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">The modularisation of omorfi was based primarily on technological development,
which is the reason why it still maintains some traditional finite-state
morphology distinctions, such as strict separation off morphological
combinatorics (i.e. lexc code) and morphophonological phenomena (i.e.
twolc code). The same is similarly true for new code base, such as one for
training weighted finite-state models; weighting was split to its one
module.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">One of the key points in modular structure here is that we ensure that modifying
will not typically break already working parts, so contributors adding new words
or moving hyphens will not cause problems in other parts of description
as much as possible.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Modularisation of Finite-State Language Description</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The modularisation scheme we ended up with in finite-state description of
Finnish has grown organically around rather standard description of
finite-state morphology not significantly different from the work of
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">koskenniemi/1983</span>]</cite>. The further developments of modularisation more or
less have followed from development of finite-state technology along years
from initial implementation of omorfi at publication of <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2008</span>]</cite>
to its current open source community developed version.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">In figure <a href="#S2.F1" title="Figure 1 ‣ 2 Modularisation of Finite-State Language Description ‣ Modularisation of Finnish Finite-State Language Description—Towards Wide Collaboration in Open Source Development of Morphological Analyser\footnotepubrightsPublished originally in Nodalida 2011 in RÄ«ga. Official version may be at http://dspace.utlib.ee/dspace/handle/10062/16955." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> we give a hierarchical list of abstract
modules implemented in omorfi at the time of writing. As mentioned, the
classical modules of morphotactic combinatorics (i.e. Xerox compatible lexc
language description) and morphophonology (i.e. Xerox compatible twolc
description) is still present. The morphotactic combinatorics has already been
split to sub modules for two reasons. First is primarily practical fact
that code base for morphotactic combinatorics for words of Finnish is huge.
Second and perhaps the more important distinction is the fact that central
and integral part of the life force of morphophonological description of
the all languages is to keep up with constant influx of new lexical items
to the language; neologisms, proper nouns and other coinages. From further
new modules, orthographical variations was implemented to create detached
support for certain obvious variations of Finnish written data, e.g. the
typewriter and SF7-ASCII era digraphs like sh and zh in stead of š and ž
respectively. The hyphenation and syllabification of Finnish language is also
one obvious service for morphological dictionary to provide; for Finnish the
compound boundaries cannot typically be discriminated without a
dictionary <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">linden/2009/nodalida</span>]</cite>. One of the features that has become
rather obvious along years for morphological parsers is the fact that all
computational linguistic applications must require their own very special
version of morphological analysis, so in omorfi we’ve chosen to avoid lock-ins
for any specific type of <em class="ltx_emph">tag sets</em> or morphological analyses so to say,
and instead go for one version of analysis to contain certain superset of all
needed forms and implement finite-state descriptions (i.e. simple rewriting
rules) to provide all of the different analysis styles. The statistical models
is one of recent developments of finite-state technology, and there’s a lot
to offer for language models here so the whole family of weighted finite-state
training and models is also implemented in omorfi as a separate module here,
which for most intents and purposes does work independently of any other part
of the language description.</p>
</div>
<figure id="S2.F1" class="ltx_figure">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Hierarchy of modules in modern finite-state implementation Finnish languege
</figcaption>
<ul id="I1" class="ltx_itemize">
<li id="I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">morphotactics</span></p>
<ul id="I1.I1" class="ltx_itemize">
<li id="I1.I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text ltx_font_bold" style="font-size:70%;">–</span></span>
<div id="I1.I1.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">lexical data</span></p>
</div>
</li>
<li id="I1.I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text ltx_font_bold" style="font-size:70%;">–</span></span>
<div id="I1.I1.i2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">stem morphophonology</span></p>
</div>
</li>
<li id="I1.I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text ltx_font_bold" style="font-size:70%;">–</span></span>
<div id="I1.I1.i3.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">inflection, derivation and compounding
</span></p>
</div>
</li>
<li id="I1.I1.i4" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text ltx_font_bold" style="font-size:70%;">–</span></span>
<div id="I1.I1.i4.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">lexical-syntactic data</span></p>
</div>
</li>
</ul>
</div>
</li>
<li id="I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i2.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">morphophonology</span></p>
</div>
</li>
<li id="I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i3.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">analysis models</span></p>
</div>
</li>
<li id="I1.i4" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i4.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">orthographical variations</span></p>
</div>
</li>
<li id="I1.i5" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i5.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">hyphenation and syllabification</span></p>
</div>
</li>
<li id="I1.i6" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i6.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">error models</span></p>
</div>
</li>
<li id="I1.i7" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i7.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">statistical models</span></p>
</div>
</li>
<li id="I1.i8" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_itemize"><span class="ltx_text" style="font-size:70%;">•</span></span>
<div id="I1.i8.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text" style="font-size:70%;">lexicographic filtering</span></p>
</div>
</li>
</ul>
</figure>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Implementation</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Here we briefly discuss implementation of modules, mainly to discuss
about practices that help the cooperation. Naturally full discussion of the
modules is found in the documentation of the
system<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup><a href="http://home.gna.org/omorfi/" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://home.gna.org/omorfi/</a></span></span></span>. The system
implementation is harnessing the autotools framework and unix style tools of
HFST to incrementally build the finite-state automata using finite-state
algebra, such as composition to extend them, originally noted even in
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. The crucial thing for this modular approach is that it
can be applied incrementally, each module can be replaced or disabled entirely
at needs of end application, and with autotools framework all this
can be performed by simple command-line switch to <code class="ltx_verbatim ltx_font_typewriter">./configure</code>.</p>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>Lexical Data—Lexicon and Features of Lexical Items</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The initial part of morphotactics deals with lexicographical data. This is
the part where most modification and cooperation can be used, the lexical items
in language change all the time in introduction of new word forms, and the
expertise needed to extend the lexicon does not require significant expertise
beyond understanding of the language. For this case we provide different entry
formats for new lexical data; csv, XML and so on. The minimal data to enter for
new word form is morphological part of speech, defined in Finnish grammar
as distinction between verb, nominal or non-inflecting particle, additionally a
paradigm classification is typically needed for working inflection and
derivation. While this is facilitated as much as possible, it is still seen as
problematic by part of contributors, so further research for easy lexicon
management is still required.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">The other practical example as to why easy modification of lexical data is
crucial is that for example for rule-based machine translation—in our case
apertium—it is useful to establish mapping between lexical units of source
language and target language. For this reason easy access to lexical units
is required for developers of rule-based machine translation units.</p>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">Another extension scheme needed is for projects working syntactic parsers based
on morphological language description—an example in Finnish
cases is yet another forthcoming constraint grammar<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">karlsson/1990</span>]</cite> based
analyser. In this case extensions can be also based on purely lexical data,
such as argument structures and ordering for verbs and adpositions. The method
to implement this is currently at preprocessing phase, however it could be
arguably a separate phase, since lexical data can be trivially composed
afterwards.</p>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Traditional Morphotactics and Phonology—The Lexc and Twolc Model</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">The various lexical data sources are joined back to traditional lexc format,
which is combined with word stem variation definitions and inflectional data to
produce lexical automaton. This is compose-intersected with morphophonology
descriptions to produce the analyser already; as these parts rarely need
changes beyond bug-fixes and are unlikely to benefit from open source
cooperation beyond initial linguist work, they are still in same form as
traditional finite-state morphology by <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>, even if it was
deemed monolithic and fragile for such collaboration.</p>
</div>
</section>
<section id="S3.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.3 </span>Analysis Formats and Sets</h3>
<div id="S3.SS3.p1" class="ltx_para">
<p class="ltx_p">Another thing that is quickly obvious for interoperability is that all projects
using morphological analyser, for whatever purpose, require their own analysis
format. Instead of converging to standard we have temporarily solved this by
making our analyser to contain superset of required features at all times, and
providing rules to rewrite the tagsets. The rulesets can be compiled to
finite-state networks and composed like usual. Typical rules are of course
relatively simple contextless rewriting, for example the annotation for
singular nominative is <code class="ltx_verbatim ltx_font_typewriter">+sg+nom</code>, <code class="ltx_verbatim ltx_font_typewriter"> SG NOM</code> or <code class="ltx_verbatim ltx_font_typewriter"><sg><nom></code>, for
different applications, so a simple composition of <code class="ltx_verbatim ltx_font_typewriter">NUM=SG:+sg</code> rule and
<code class="ltx_verbatim ltx_font_typewriter">CASE=NOM:nom</code> is enough for providing the singular nominatives to that
analysis style. Ideally of course this would be solved by using more suitable
abstract data type for the annotations than character string
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">wintner/2008</span>]</cite>, ideally derived from standardized set of features, such as
ISOCat as is also suggested by <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">maxwell/2008</span>]</cite>.</p>
</div>
</section>
<section id="S3.SS4" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.4 </span>Orthographical variations</h3>
<div id="S3.SS4.p1" class="ltx_para">
<p class="ltx_p">When dealing with data from various sources, such as old literature or
<em class="ltx_emph">spoken</em> standard language found in instant messaging and perhaps
transcribed spoken corpora, there are certain variations on spelling rules to
take care of. These has also been implemented as independent rule set compiled
to composable finite-state automaton. Incidentally both mapping of typewriter
digraphs sh and zh to š and ž correspondingly and omission of final component
of i-final diphthong of spoken language are both definable as rule working on
morphological analyser as an independent unit.</p>
</div>
</section>
<section id="S3.SS5" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.5 </span>Hyphenation and syllabification</h3>
<div id="S3.SS5.p1" class="ltx_para">
<p class="ltx_p">Hyphenation is in practice also one of the applications of the language, It has
been defined as a rule set over half-build morphological analyser, since it can
neatly abuse build-time information of the analyser. such as word and morpheme
boundaries. The syllables could also conversely be used by other parts of the
description if needed, e.g. the morphophonological alteration a:e depends more
on syllable count than traditional stem paradigms classes of Finnish</p>
</div>
</section>
<section id="S3.SS6" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.6 </span>Error models</h3>
<div id="S3.SS6.p1" class="ltx_para">
<p class="ltx_p">Error model is a crucial part of spell-checking system, for the correction
task. This is implemented as finite-state filter that can be
applied with on-the-fly composition <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">pirinen/2010/cla</span>]</cite> to perform the
error correction for spell checking, or for example error-tolerant analysis.</p>
</div>
</section>
<section id="S3.SS7" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.7 </span>Statistical models</h3>
<div id="S3.SS7.p1" class="ltx_para">
<p class="ltx_p">Statistical models provide for disambiguating language models and spell-checking
tasks for example. The statistical models used are simple finite-state automata
or training sets combinable to the language description by use of
composition <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">linden/2009/nodalida</span>, <span class="ltx_ref ltx_missing_citation ltx_ref_self">linden/2009/fsmnlp</span>]</cite>.</p>
</div>
</section>
<section id="S3.SS8" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.8 </span>Filtering the Analyser</h3>
<div id="S3.SS8.p1" class="ltx_para">
<p class="ltx_p">The models needed for different task may need widely different dictionaries and
allowed word-forms, and not always the statistical models are sufficient to
discriminate between good word forms. So we also provide filter rule sets, to
limit features, such as derivation and compounding, and lexical units, such as
archaic or dialectal words. For example for the spell-checker’s error detection
lexicon or information retrieval task compounding and derivation can be largely
allowed, whereas in the spelling correction the suggestions should be relatively
conservative for plausible but non-existing compounds and derivations.</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Discussion and Future Work</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">In this article we have showed that finite-state description can be implemented
in modularised manner enabling wide cooperation in the open source context for
people with varying background. Especially</p>
</div>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">Furthermore we have demonstrated the ease of
proper abstraction in finite-state language description using easily available
open source tools while still providing open source community with the de facto
standard build system of autotoolset for wide distribution, packaging and
deployment.</p>
</div>
<div id="S4.p3" class="ltx_para">
<p class="ltx_p">What we did not address here is the easy way of coupling up-to-date
documentation with our modularised language description. The next step to
research is to see into integrating the notion of literate programming in this
framework. This topic has already been widely researched by
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">maxwell/2008</span>]</cite>, specifically in case of finite-state language
descriptions.
</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul id="L1" class="ltx_biblist">
</ul>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated on Fri Sep 29 12:57:01 2017 by <a href="http://dlmf.nist.gov/LaTeXML/">LaTeXML <img src="" alt="[LOGO]"></a>
</div></footer>
</div>
</body>
</html>