<!DOCTYPE html><html>
<head>
<title>Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata</title>
<!--Generated on Thu Sep 28 18:46:55 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last modification: September 28, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-listings.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Building and Using Existing Hunspell Dictionaries and TeX Hyphenators
as Finite-State Automata<span class="ltx_ERROR undefined">\footnotepubrights</span>The official publication was in
Computational Linguistics – Applications 2010 in Wisła, Poland.
<span class="ltx_ERROR undefined">\url</span>https://fedcsis.org/2010/pg/395/305.</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
<br class="ltx_break">University of Helsinki
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">
<br class="ltx_break">Department of Modern Languages
<br class="ltx_break">Unionkatu 40
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname"> FI-00014 University of Helsinki
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname"> Finland
<br class="ltx_break">Email: {tommi.pirinen
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">krister.linden}@helsinki.fi
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last modification: September 28, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">There are numerous formats for writing spell-checkers for
open-source systems and there are many descriptions for languages
written in these formats. Similarly, for word hyphenation by
computer there are TeX rules for many languages. In this paper we
demonstrate a method for converting these spell-checking lexicons
and hyphenation rule sets into finite-state automata, and present a
new finite-state based system for writer’s tools used in current
open-source software such as Firefox, OpenOffice.org and enchant via
the spell-checking library voikko.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">Currently there is a wide range of different free open-source
solutions for spell-checking and hyphenation by computer. For
hyphenation the ubiquitous solution is the original TeX algorithm
described in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">liang/1983</span>]</cite>. The most popular of the spelling
dictionaries are the various instances of *spell software, i.e.
ispell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><a href="http://www.lasr.cs.ucla.edu/geoff/ispell.html" class="ltx_ref ltx_url">http://www.lasr.cs.ucla.edu/geoff/ispell.html</a></span></span></span>,
aspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><a href="http://aspell.net" class="ltx_ref ltx_url">http://aspell.net</a></span></span></span>, myspell and
hunspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup><a href="http://hunspell.sf.net" class="ltx_ref ltx_url">http://hunspell.sf.net</a></span></span></span> and other *spell
derivatives. The TeX hyphenation patterns are readily available on
the Internet to cover some 49 languages. The hunspell dictionaries
provided with the OpenOffice.org suite cover 98 languages.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">The program-based spell-checking methods have their limitations
because they are based on specific program code that is extensible
only by coding new features into the system and getting all users to
upgrade. E.g. hunspell has limitations on what affix morphemes you can
attach to word roots with the consequence that not all languages with
rich inflectional morphologies can be conveniently implemented in
hunspell. This has already resulted in multiple new pieces of software
for a few languages with implementations to work around the
limitations, e.g. emberek (Turkish), hspell (Hebrew), uspell
(Yiddish) and voikko (Finnish). What we propose is to use a generic
framework of finite-state automata for these tasks. With finite-state
automata it is possible to implement the spell-checking functionality
as a one-tape weighted automaton containing the language model and a
two-tape weighted automaton containing the error model. This also
allows simple use of unigram training for optimizing spelling suggestion
results <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/lrec/Pirinen2010</span>]</cite>. With this model, extensions to
context-based n-gram models for real-word spelling error problems
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">DBLP:conf/cicling/Wilcox-OHearnHB08</span>]</cite> are also possible.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">We also provide a method for integrating the finite-state spell-checking
and hyphenation into applications using an open-source spell-checking
library voikko<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><a href="http://voikko.sf.net" class="ltx_ref ltx_url">http://voikko.sf.net</a></span></span></span>, which provides a
connection to typical open-source software, such as Mozilla Firefox,
OpenOffice.org and the Gnome desktop via enchant.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Definitions</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">In this article we use weighted two-tape finite-state automata—or
weighted finite-state transducers—for all processing. We use the
following symbol conventions to denote the parts of a weighted
finite-state automaton: a transducer <math id="S2.p1.m1" class="ltx_Math" alttext="T=(\Sigma,\Gamma,Q,q_{0},Q_{f},\delta,\rho)" display="inline"><mrow><mi>T</mi><mo>=</mo><mrow><mo stretchy="false">(</mo><mi mathvariant="normal">Σ</mi><mo>,</mo><mi mathvariant="normal">Γ</mi><mo>,</mo><mi>Q</mi><mo>,</mo><msub><mi>q</mi><mn>0</mn></msub><mo>,</mo><msub><mi>Q</mi><mi>f</mi></msub><mo>,</mo><mi>δ</mi><mo>,</mo><mi>ρ</mi><mo stretchy="false">)</mo></mrow></mrow></math> with a semi-ring <math id="S2.p1.m2" class="ltx_Math" alttext="(S,\oplus,\otimes,\overline{0},\overline{1})" display="inline"><mrow><mo stretchy="false">(</mo><mi>S</mi><mo>,</mo><mo>⊕</mo><mo>,</mo><mo>⊗</mo><mo>,</mo><mover accent="true"><mn>0</mn><mo>¯</mo></mover><mo>,</mo><mover accent="true"><mn>1</mn><mo>¯</mo></mover><mo stretchy="false">)</mo></mrow></math> for weights. Here <math id="S2.p1.m3" class="ltx_Math" alttext="\Sigma" display="inline"><mi mathvariant="normal">Σ</mi></math> is a set with
the input tape alphabet, <math id="S2.p1.m4" class="ltx_Math" alttext="\Gamma" display="inline"><mi mathvariant="normal">Γ</mi></math> is a set with the output tape
alphabet, <math id="S2.p1.m5" class="ltx_Math" alttext="Q" display="inline"><mi>Q</mi></math> a finite set of states in the transducer, <math id="S2.p1.m6" class="ltx_Math" alttext="q_{0}\in Q" display="inline"><mrow><msub><mi>q</mi><mn>0</mn></msub><mo>∈</mo><mi>Q</mi></mrow></math> is
an initial state of the transducer, <math id="S2.p1.m7" class="ltx_Math" alttext="Q_{f}\subset Q" display="inline"><mrow><msub><mi>Q</mi><mi>f</mi></msub><mo>⊂</mo><mi>Q</mi></mrow></math> is a set of final
states, <math id="S2.p1.m8" class="ltx_Math" alttext="\delta:Q\times\Sigma\times\Gamma\times S\rightarrow Q" display="inline"><mrow><mi>δ</mi><mo>:</mo><mrow><mrow><mi>Q</mi><mo>×</mo><mi mathvariant="normal">Σ</mi><mo>×</mo><mi mathvariant="normal">Γ</mi><mo>×</mo><mi>S</mi></mrow><mo>→</mo><mi>Q</mi></mrow></mrow></math>
is a transition relation, <math id="S2.p1.m9" class="ltx_Math" alttext="\rho:Q_{f}\rightarrow S" display="inline"><mrow><mi>ρ</mi><mo>:</mo><mrow><msub><mi>Q</mi><mi>f</mi></msub><mo>→</mo><mi>S</mi></mrow></mrow></math> is a final weight
function. A successful path is a list of transitions from an initial
state to a final state with a weight different from <math id="S2.p1.m10" class="ltx_Math" alttext="\overline{0}" display="inline"><mover accent="true"><mn>0</mn><mo>¯</mo></mover></math>
collected from the transition function and the final state function in
the semi-ring <math id="S2.p1.m11" class="ltx_Math" alttext="S" display="inline"><mi>S</mi></math> by the operation <math id="S2.p1.m12" class="ltx_Math" alttext="\otimes" display="inline"><mo>⊗</mo></math>. We typically denote a
successful path as a concatenation of input symbols, a colon and a
concatenation of output symbols. The weight of the successful path is
indicated as a subscript in angle brackets,
<math id="S2.p1.m13" class="ltx_Math" alttext="\textit{input:output}_{<w>}" display="inline"><msub><mtext mathvariant="italic">input:output</mtext><mrow><mo><</mo><mi>w</mi><mo>></mo></mrow></msub></math>. A path transducer is denoted by
subscripting a transducer with the path. If the input and output
symbols are the same, the colon and the output part can be omitted.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The finite-state formulation we use in this article is based on Xerox
formalisms for finite-state methods in natural language processing
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>, in practice lexc is a formalism for writing right linear
grammars using morpheme sets called lexicons. Each morpheme in a lexc grammar
can define their right follower lexicon, creating a finite-state network called
a <em class="ltx_emph">lexical transducer</em>. In formulae, we denote a lexc style lexicon named <math id="S2.p2.m1" class="ltx_Math" alttext="X" display="inline"><mi>X</mi></math> as
<math id="S2.p2.m2" class="ltx_Math" alttext="Lex_{X}" display="inline"><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mi>X</mi></msub></mrow></math> and use the shorthand notation <math id="S2.p2.m3" class="ltx_Math" alttext="Lex_{X}\cup\textit{input:output Y}" display="inline"><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mi>X</mi></msub></mrow><mo>∪</mo><mtext mathvariant="italic">input:output Y</mtext></mrow></math> to
denote the addition of a lexc string or morpheme, <span class="ltx_text ltx_font_typewriter">input:output Y ;</span> to
the <span class="ltx_text ltx_font_typewriter">LEXICON X</span>. In the same framework, the twolc formalism is used to
describe context restrictions for symbols and their realizations in the form of
parallel rules as defined in the appendix of <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>]</cite>. We use
<math id="S2.p2.m4" class="ltx_Math" alttext="Twol_{Z}" display="inline"><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>Z</mi></msub></mrow></math> to denote the rule set <math id="S2.p2.m5" class="ltx_Math" alttext="Z" display="inline"><mi>Z</mi></math> and use the shorthand notation <math id="S2.p2.m6" class="ltx_Math" alttext="Twol_{Z}\cap\textit{a:b}\leftrightarrow\textit{l e f t}\_\textit{r i g h t}" display="inline"><mrow><mrow><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>Z</mi></msub></mrow><mo>∩</mo><mtext mathvariant="italic">a:b</mtext></mrow><mo>↔</mo><mrow><mtext mathvariant="italic">l e f t</mtext><mo></mo><mi mathvariant="normal">_</mi><mo></mo><mtext mathvariant="italic">r i g h t</mtext></mrow></mrow></math> to denote
the addition of a rule string <span class="ltx_text ltx_font_typewriter">a:b <=> l e f t _ r i g h t ;</span> to the
rule set <math id="S2.p2.m7" class="ltx_Math" alttext="Z" display="inline"><mi>Z</mi></math>, effectively saying that <span class="ltx_text ltx_font_italic">a:b</span> only applies in
the specified context.</p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">A spell-checking dictionary is essentially a single-tape finite-state
automaton or a language model <math id="S2.p3.m1" class="ltx_Math" alttext="T_{L}" display="inline"><msub><mi>T</mi><mi>L</mi></msub></math>, where the alphabet <math id="S2.p3.m2" class="ltx_Math" alttext="\Sigma_{L}=\Gamma_{L}" display="inline"><mrow><msub><mi mathvariant="normal">Σ</mi><mi>L</mi></msub><mo>=</mo><msub><mi mathvariant="normal">Γ</mi><mi>L</mi></msub></mrow></math> are characters of a natural language. The successful paths
define the correctly spelled word-forms of the language
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/lrec/Pirinen2010</span>]</cite>. If the spell-checking automaton is
weighted, the weights may provide additional information on a word’s
correctness, e.g. the likelihood of the word being correctly spelled
or the probability of the word in some reference corpus. The
spell-checking of a word <math id="S2.p3.m3" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math> is performed by creating a path automaton
<math id="S2.p3.m4" class="ltx_Math" alttext="T_{s}" display="inline"><msub><mi>T</mi><mi>s</mi></msub></math> and composing it with the language model, <math id="S2.p3.m5" class="ltx_Math" alttext="T_{s}\circ T_{L}" display="inline"><mrow><msub><mi>T</mi><mi>s</mi></msub><mo>∘</mo><msub><mi>T</mi><mi>L</mi></msub></mrow></math>. A
result with the successful path <math id="S2.p3.m6" class="ltx_Math" alttext="s_{<W>}" display="inline"><msub><mi>s</mi><mrow><mo><</mo><mi>W</mi><mo>></mo></mrow></msub></math>, where <math id="S2.p3.m7" class="ltx_Math" alttext="W" display="inline"><mi>W</mi></math> is greater than
some threshold value, means that the word is correctly spelled. As the
result is not needed for further processing as an automaton and as the language model
automaton is free of epsilon cycles, the spell-checking can be
optimized by performing a simple traversal (lookup) instead, which
gives a significant speed-advantage over full composition
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/fsmnlp/Silfverberg2009</span>]</cite>.</p>
</div>
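<div class="ltx_para">
<p class="ltx_p">As a concrete illustration of the lookup optimization, the following is a minimal
Python sketch, not the finite-state implementation used in the paper: the language
model is a hand-written, epsilon-free weighted automaton stored in two dictionaries
(the names and layout are our own), and weights are combined by addition as in the
tropical semiring.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Minimal sketch: spell-checking as a traversal (lookup) of an epsilon-free
# weighted automaton instead of full composition. Hypothetical data layout:
# transitions[state][symbol] = (next_state, weight); finals[state] = final weight.

def lookup(word, transitions, finals, start=0):
    """Return the tropical weight of the word if it is accepted, else None."""
    state, weight = start, 0.0
    for symbol in word:
        arc = transitions.get(state, {}).get(symbol)
        if arc is None:
            return None          # no transition: the word is not in the model
        state, arc_weight = arc
        weight += arc_weight     # tropical semiring: multiplication is addition
    if state in finals:
        return weight + finals[state]
    return None

# Toy language model accepting "cat" and "cats" with small weights.
transitions = {
    0: {"c": (1, 0.0)},
    1: {"a": (2, 0.0)},
    2: {"t": (3, 0.0)},
    3: {"s": (4, 1.0)},
}
finals = {3: 0.0, 4: 0.0}

print(lookup("cat", transitions, finals))   # 0.0, i.e. correctly spelled
print(lookup("cta", transitions, finals))   # None, i.e. a spelling error
</pre>
</div>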
<div id="S2.p4" class="ltx_para">
<p class="ltx_p">A spelling correction model or an error model <math id="S2.p4.m1" class="ltx_Math" alttext="T_{E}" display="inline"><msub><mi>T</mi><mi>E</mi></msub></math> is a two-tape
automaton mapping the input text strings of the text to be
spell-checked into strings that may be in the language model. The
input alphabet <math id="S2.p4.m2" class="ltx_Math" alttext="\Sigma_{E}" display="inline"><msub><mi mathvariant="normal">Σ</mi><mi>E</mi></msub></math> is the alphabet of the text to be
spell-checked and the output alphabet is <math id="S2.p4.m3" class="ltx_Math" alttext="\Gamma_{E}=\Sigma_{L}" display="inline"><mrow><msub><mi mathvariant="normal">Γ</mi><mi>E</mi></msub><mo>=</mo><msub><mi mathvariant="normal">Σ</mi><mi>L</mi></msub></mrow></math>. For
practical applications, the input alphabet needs to be extended by a
special any symbol with the semantics of a character not belonging to
the alphabet of the language model in order to account for input text
containing typos outside the target natural language alphabet. The
error model can be composed with the language model, <math id="S2.p4.m4" class="ltx_Math" alttext="T_{L}\circ T_{E}" display="inline"><mrow><msub><mi>T</mi><mi>L</mi></msub><mo>∘</mo><msub><mi>T</mi><mi>E</mi></msub></mrow></math>,
to obtain an error model that only produces strings of the target
language. For space efficiency, the composition may be carried out
during run-time using the input string to limit the search space. The
weights of an error model may be used as an estimate for the likelihood of the
combination of errors. The error model is applied as a filter between
the path automaton <math id="S2.p4.m5" class="ltx_Math" alttext="T_{s}" display="inline"><msub><mi>T</mi><mi>s</mi></msub></math> compiled from the erroneous string, <math id="S2.p4.m6" class="ltx_Math" alttext="s\notin T_{L}" display="inline"><mrow><mi>s</mi><mo>∉</mo><msub><mi>T</mi><mi>L</mi></msub></mrow></math>, and the language model, <math id="S2.p4.m7" class="ltx_Math" alttext="T_{L}" display="inline"><msub><mi>T</mi><mi>L</mi></msub></math>, using two compositions, <math id="S2.p4.m8" class="ltx_Math" alttext="T_{s}\circ T_{E}\circ T_{L}" display="inline"><mrow><msub><mi>T</mi><mi>s</mi></msub><mo>∘</mo><msub><mi>T</mi><mi>E</mi></msub><mo>∘</mo><msub><mi>T</mi><mi>L</mi></msub></mrow></math>. The resulting transducer consists of a
potentially infinite set of paths relating an incorrect string with
correct strings from <math id="S2.p4.m9" class="ltx_Math" alttext="L" display="inline"><mi>L</mi></math>. The paths, <math id="S2.p4.m10" class="ltx_Math" alttext="s:s^{i}_{<w_{i}>}" display="inline"><mrow><mi>s</mi><mo>:</mo><msubsup><mi>s</mi><mrow><mo><</mo><msub><mi>w</mi><mi>i</mi></msub><mo>></mo></mrow><mi>i</mi></msubsup></mrow></math>, are weighted by
the error model and language model using the semi-ring multiplication
operation, <math id="S2.p4.m11" class="ltx_Math" alttext="\otimes" display="inline"><mo>⊗</mo></math>. If the error model and the language model
generate an infinite number of suggestions, the best suggestions may
be efficiently enumerated with some variant of the n-best-paths
algorithm <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">mohri/2002</span>]</cite>. For automatic spelling corrections, the
best path may be used. If either the error model or the language model
is known to generate only a finite set of results, the suggestion
generation algorithm may be further optimized.</p>
</div>
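<div class="ltx_para">
<p class="ltx_p">The effect of composing the path automaton, the error model and the language model
can be mimicked without transducers. The sketch below is ours, not the paper’s
method: a toy weighted word list stands in for the language model, edit-distance-one
candidate generation stands in for the error model, and suggestions are ranked by
the combined tropical weight, much as an n-best search would rank them.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch of applying an error model to an input string and filtering the
# results through a language model. All names and weights are made up.
LEXICON = {"cat": 0.0, "cats": 0.5}               # word: tropical weight
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word, weight=1.0):
    """Yield (candidate, error_weight) pairs at edit distance one."""
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            yield head + tail[1:], weight                          # deletion
            if len(tail) != 1:
                yield head + tail[1] + tail[0] + tail[2:], weight  # transposition
            for c in ALPHABET:
                yield head + c + tail[1:], weight                  # substitution
        for c in ALPHABET:
            yield head + c + tail, weight                          # insertion

def suggest(word, n=5):
    """Rank corrections by error weight combined with language-model weight."""
    scored = {}
    for cand, err_w in edits1(word):
        if cand in LEXICON:
            total = err_w + LEXICON[cand]     # semiring multiplication (tropical)
            scored[cand] = min(scored.get(cand, total), total)
    return sorted(scored.items(), key=lambda kv: kv[1])[:n]

print(suggest("cta"))   # [('cat', 1.0)]
</pre>
</div>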
<div id="S2.p5" class="ltx_para">
<p class="ltx_p">A hyphenation model <math id="S2.p5.m1" class="ltx_Math" alttext="T_{H}" display="inline"><msub><mi>T</mi><mi>H</mi></msub></math> is a two-tape automaton mapping input text
strings of the text to be hyphenated to possibly hyphenated strings of
the text, where the input alphabet, <math id="S2.p5.m2" class="ltx_Math" alttext="\Sigma_{E}" display="inline"><msub><mi mathvariant="normal">Σ</mi><mi>E</mi></msub></math>, is the alphabet of the
text to be hyphenated and the output alphabet, <math id="S2.p5.m3" class="ltx_Math" alttext="\Gamma_{E}" display="inline"><msub><mi mathvariant="normal">Γ</mi><mi>E</mi></msub></math>, is
<math id="S2.p5.m4" class="ltx_Math" alttext="\Sigma_{E}\cup H" display="inline"><mrow><msub><mi mathvariant="normal">Σ</mi><mi>E</mi></msub><mo>∪</mo><mi>H</mi></mrow></math>, where <math id="S2.p5.m5" class="ltx_Math" alttext="H" display="inline"><mi>H</mi></math> is the set of symbols marking hyphenation
points. For simple applications, this equals hyphens or discretionary
(soft) hyphens <math id="S2.p5.m6" class="ltx_Math" alttext="H={-}" display="inline"><mrow><mi>H</mi><mo>=</mo><mo>-</mo></mrow></math>. For more fine-grained control over
hyphenation, it is possible to use several different hyphens or
weighted hyphens. Hyphenation of the word <math id="S2.p5.m7" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math> is performed with the
path automaton <math id="S2.p5.m8" class="ltx_Math" alttext="T_{s}" display="inline"><msub><mi>T</mi><mi>s</mi></msub></math> by composing, <math id="S2.p5.m9" class="ltx_Math" alttext="T_{s}\circ T_{H}" display="inline"><mrow><msub><mi>T</mi><mi>s</mi></msub><mo>∘</mo><msub><mi>T</mi><mi>H</mi></msub></mrow></math>, which results in
an acyclic path automaton containing a set of strings mapped to the
hyphenated strings with weights <math id="S2.p5.m10" class="ltx_Math" alttext="s:s^{h}_{<w_{h}>}" display="inline"><mrow><mi>s</mi><mo>:</mo><msubsup><mi>s</mi><mrow><mo><</mo><msub><mi>w</mi><mi>h</mi></msub><mo>></mo></mrow><mi>h</mi></msubsup></mrow></math>. Several alternative hyphenations
may be correct according to the hyphenation rules. A conservative hyphenation algorithm
should only suggest the hyphenation points agreed on by all the alternatives.</p>
</div>
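<div class="ltx_para">
<p class="ltx_p">The conservative strategy of the last sentence can be made concrete with a small
sketch; the alternative hyphenations below are invented and only serve to show the
intersection of hyphenation points.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch of conservative hyphenation: keep only the hyphenation points on
# which all alternative hyphenations of the word agree.

def hyphen_points(hyphenated, marker="-"):
    """Offsets in the unhyphenated word after which a hyphen may be inserted."""
    points, offset = set(), 0
    for ch in hyphenated:
        if ch == marker:
            points.add(offset)
        else:
            offset += 1
    return points

def conservative(word, alternatives, marker="-"):
    """Insert a hyphen only where every alternative has one."""
    agreed = set.intersection(*(hyphen_points(a, marker) for a in alternatives))
    out = []
    for i, ch in enumerate(word, start=1):
        out.append(ch)
        if i in agreed and i != len(word):
            out.append(marker)
    return "".join(out)

alternatives = ["hy-phen-a-tion", "hy-phe-na-tion"]   # invented alternatives
print(conservative("hyphenation", alternatives))      # hy-phena-tion
</pre>
</div>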
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Material</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">In this article we present methods for converting the hunspell and
TeX dictionaries and rule sets for use with open-source finite-state
writer’s tools. As concrete dictionaries we use the repositories of
free implementations of these dictionaries and rule sets found on the
internet, e.g. for the hunspell dictionary files found on the
OpenOffice.org spell-checking
site<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><a href="http://wiki.services.openoffice.org/wiki/Dictionaries" class="ltx_ref ltx_url">http://wiki.services.openoffice.org/wiki/Dictionaries</a></span></span></span>.
For hyphenation, we use the TeX hyphenation patterns found on the
TeXhyphen page<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup><a href="http://www.tug.org/tex-hyphen/" class="ltx_ref ltx_url">http://www.tug.org/tex-hyphen/</a></span></span></span>.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">In this section we describe the parts of the file formats we are
working with. All of the information of the hunspell format specifics
is derived from the
<span class="ltx_text ltx_font_typewriter">hunspell(4)<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup><span class="ltx_ERROR undefined">\url</span><span class="ltx_text ltx_font_serif">http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html</span></span></span></span></span>
man page, as that is the only normative documentation of hunspell we
have been able to locate. For TeX hyphenation patterns, the
reference documentation is Frank Liang’s doctoral thesis
<cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">liang/1983</span>]</cite> and the TeXbook <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">knuth/1986</span>]</cite>.</p>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>Hunspell File Format</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">A hunspell spell-checking dictionary consists of two files: a
dictionary file and an affix file. The dictionary file contains only
root forms of words with information about morphological affix classes
to combine with the roots. The affix file contains lists of affixes
along with their context restrictions and effects, but the affix file
also serves as a settings file for the dictionary, containing all
meta-data and settings as well.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">The dictionary file starts with a number that is intended to be the
number of lines of root forms in the dictionary file, but in practice
many of the files have numbers different from the actual line count,
so it is safer to just treat it as a rough estimate. Following the
initial line is a list of strings containing the root forms of the
words in the morphology. Each word may be associated with an arbitrary
number of classes separated by a slash. The classes are encoded in one
of the three formats shown in the examples of
Figure <a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>: a list of binary octets
specifying classes from 1–255 (minus octets for CR, LF etc.), as in
the Swedish example on lines
<a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>–<a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>, a list of
binary words, specifying classes from 1–65,535 (again ignoring octets
with CR and LF) or a comma separated list of numbers written in digits
specifying classes 1–65,535 as in the North Sámi examples on lines
<a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>–<a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. We refer to
all of these as continuation classes encoded by their numeric decimal
values, e.g. ’abakus’ on line <a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> would have
continuation classes 72, 68 and 89 (the decimal values of the ASCII
code points for H, D and Y respectively). In the Hungarian example,
you can see the affix compression scheme, which refers to the line
numbers in the affix file containing the continuation class listings,
i.e. the part following the slash character in the previous two
examples. The lines of the Hungarian dictionary also contain some
extra numeric values separated by a tab which refer to the morphology
compression scheme that is also mentioned in the affix definition
file; this is used in the hunmorph morphological analyzer
functionality which is not implemented nor described in this paper.</p>
</div>
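<div class="ltx_para">
<p class="ltx_p">The three encodings can be illustrated with a small parsing sketch. The function
below is only a simplified reading of the format described above: tab-separated
hunmorph fields are discarded, and the numeric decoding of two-character flags is
one possible interpretation, not hunspell’s specification.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: decode the continuation classes of one .dic entry into numbers.
# flag_type "single" = one octet per class, "long" = two octets per class,
# "num" = comma-separated decimal numbers (the formats shown in Figure 1).

def parse_dic_line(line, flag_type="single"):
    entry = line.split("\t")[0]          # drop hunmorph data fields, if any
    if "/" not in entry:
        return entry, []
    root, flags = entry.split("/", 1)
    if flag_type == "single":
        classes = [ord(c) for c in flags]
    elif flag_type == "long":
        pairs = [flags[i:i + 2] for i in range(0, len(flags), 2)]
        classes = [ord(p[0]) * 256 + ord(p[1]) for p in pairs]
    elif flag_type == "num":
        classes = [int(n) for n in flags.split(",")]
    else:
        raise ValueError("unknown flag type: " + flag_type)
    return root, classes

print(parse_dic_line("abakus/HDY"))            # ('abakus', [72, 68, 89])
print(parse_dic_line("guokte/1,3", "num"))     # ('guokte', [1, 3])
</pre>
</div>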
<figure id="S3.F1" class="ltx_figure">
<div class="ltx_listing ltx_lst_numbers_left ltx_lstlisting ltx_align_center ltx_listing">
<div class="ltx_listing_data"><a href="data:text/plain;base64,ICAgICMgU3dlZGlzaAogICAgYWJha3VzL0hEWXwtXGxhYmVse3ZyYjpodW5zcGVsbGRpY3N2c3Rh%0AcnR9LXwKICAgIGFiYWxpZW5hdGlvbi9BSER2WQogICAgYWJhbGllbmVyYS9NWXwtXGxhYmVse3Zy%0AYjpodW5zcGVsbGRpY3N2ZW5kfS18CiAgICAjIE5vcnRoZXJuIFN8LVwne2F9LXxtaQogICAgb2t0%0AYS8xfC1cbGFiZWx7dnJiOmh1bnNwZWxsZGljc2VzdGFydH0tfAogICAgZ3Vva3RlLzEsMwogICAg%0AZ29sYm1hLzEsM3wtXGxhYmVse3ZyYjpodW5zcGVsbGRpY3NlZW5kfS18CiAgICAjIEh1bmdhcmlh%0AbgogICAgfC1cInt1fS18enwtXCd7ZX0tfHIvMSAgMXwtXGxhYmVse3ZyYjpodW5zcGVsbGRpY2h1%0Ac3RhcnR9LXwKICAgIHwtXCJ7dX0tfHpsZXR8LVwne2F9LXxnLzIgICAgICAgMgogICAgfC1cInt1%0AfS18emxldHZlemV0fC1cIntvfS18LzMgICAxCiAgICB8LVwie3V9LXx6bGV0c3plcnp8LVwie299%0ALXwvNCAgIDF8LVxsYWJlbHt2ZXJiOmh1bnNwZWxsZGljaHVlbmR9LXw=%0A">⬇</a></div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">1</span><span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Swedish</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">abakus</span>/<span class="ltx_text ltx_lst_identifier">HDY</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">3</span><span class="ltx_text ltx_lst_identifier">abalienation</span>/<span class="ltx_text ltx_lst_identifier">AHDvY</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">abalienera</span>/<span class="ltx_text ltx_lst_identifier">MY</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">5</span>#<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Northern</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">S</span>á<span class="ltx_text ltx_lst_identifier">mi</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">okta</span>/1
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">7</span><span class="ltx_text ltx_lst_identifier">guokte</span>/1,3
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">golbma</span>/1,3
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">9</span>#<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Hungarian</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>ü<span class="ltx_text ltx_lst_identifier">z</span>é<span class="ltx_text ltx_lst_identifier">r</span>/1<span class="ltx_text ltx_lst_space"> </span>1
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">11</span>ü<span class="ltx_text ltx_lst_identifier">zlet</span>á<span class="ltx_text ltx_lst_identifier">g</span>/2<span class="ltx_text ltx_lst_space"> </span>2
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>ü<span class="ltx_text ltx_lst_identifier">zletvezet</span>ö/3<span class="ltx_text ltx_lst_space"> </span>1
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">13</span>ü<span class="ltx_text ltx_lst_identifier">zletszerz</span>ö/4<span class="ltx_text ltx_lst_space"> </span>1
</div>
</div>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Excerpts of Swedish, Northern Sámi and Hungarian dictionaries</figcaption>
</figure>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">The second file in the hunspell dictionaries is the affix file,
containing all the settings for the dictionary, and all non-root
morphemes. Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> shows parts of
the Hungarian affix file that we use for describing different setting
types. The settings are typically given on a single line composed of
the setting name in capitals, a space and the setting values, like the
NAME setting on line 6 of Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. The hunspell files
have some values encoded in UTF-8, some in the ISO 8859 encoding, and
some using both binary and ASCII data at the same time. Note that in the examples in this article, we have
transcribed everything into UTF-8 format or the nearest relevant encoded
character with a displayable code point.</p>
</div>
<div id="S3.SS1.p4" class="ltx_para">
<p class="ltx_p">The settings we have used for building the spell-checking automata can
be roughly divided into the following four categories: meta-data, error
correction models, special continuation classes, and the actual
affixes. An excerpt of the parts that we use in the Hungarian affix file
is given in Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<div id="S3.SS1.p5" class="ltx_para">
<p class="ltx_p">The meta-data section contains, e.g., the name of the dictionary on
line <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, the character set encoding on
line <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, and the type of parsing used for
continuation classes, which is omitted from the Hungarian lexicon
indicating 8-bit binary parsing.</p>
</div>
<div id="S3.SS1.p6" class="ltx_para">
<p class="ltx_p">The error model settings each contain a small part of the actual error
model, such as the characters to be used for edit distance, their weights,
confusion sets and phonetic confusion sets. The list of word
characters in order of popularity, given by the TRY setting on
lines 11–12 of
Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, is used for the edit distance
model. The keyboard layout, i.e. neighboring key sets, is specified
for the substitution error model by the KEY setting on
lines 9–10. Each set of the characters, separated by
vertical bars, is regarded as a possible slip-of-the-finger typing
error. The ordered confusion set of possible spelling error pairs is
given on the REP lines starting at line 20,
where each line is a pair of a ‘mistyped’ and a ‘corrected’ word
separated by whitespace.</p>
</div>
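<div class="ltx_para">
<p class="ltx_p">As an illustration only, the following sketch collects this error-model material
from such settings lines: characters sharing a KEY neighbor set are treated as
mutually substitutable, following the description above, and REP lines are gathered
into an ordered confusion list. The parsing is deliberately simplified and is not
hunspell’s own.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: extract error-model material from .aff settings.
# KEY gives bar-separated sets of neighboring keys (slip-of-the-finger errors),
# REP gives an ordered list of (mistyped, corrected) string pairs.

def key_substitutions(key_value):
    """Ordered pairs of distinct characters sharing a KEY neighbor set."""
    pairs = set()
    for group in key_value.split("|"):
        for a in group:
            for b in group:
                if a != b:
                    pairs.add((a, b))
    return pairs

def rep_pairs(aff_lines):
    """Collect REP confusion pairs in the order they are listed."""
    pairs = []
    for line in aff_lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] == "REP":
            pairs.append((fields[1], fields[2]))
    return pairs

print(("t", "z") in key_substitutions("qwertzuiop"))     # True
print(rep_pairs(["REP 125", "REP cs ts", "REP cs ds"]))  # [('cs', 'ts'), ('cs', 'ds')]
</pre>
</div>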
<div id="S3.SS1.p7" class="ltx_para">
<p class="ltx_p">The compounding model is defined by special continuation classes,
i.e. some of the continuation classes in the dictionary or affix file
may not lead to affixes, but are defined in the compounding section of
the settings in the affix file. In
Figure <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, the compounding rules are
specified on
lines <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>–<a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. The
flags in these settings are the same as in the affix definitions, so
the words in class 118 (corresponding to lower case v) would be
eligible as compound initial words, the words with class 120 (lower
case x) occur at the end of a compound, and words with 117 only occur
within a compound. Similarly, special flags are given to word forms
needing affixes that are used only for spell checking but not for the
suggestion mechanism, etc.</p>
</div>
<div id="S3.SS1.p8" class="ltx_para">
<p class="ltx_p">The actual affixes are defined in three different parts of the file:
the compression scheme part on the
AF lines 1–4 of Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, the
suffix definitions on the
SFX lines, and
the prefix definitions on the
PFX lines.</p>
</div>
<div id="S3.SS1.p9" class="ltx_para">
<p class="ltx_p">The compression scheme is a grouping of frequently co-occurring
continuation classes. This is done by having the first AF line list a
set of continuation classes which is referred to as continuation
class 1 in the dictionary, the second line is referred to as
continuation class 2, and so forth. This means that, for example,
continuation class 1 in the Hungarian dictionary refers to the classes
on line 2 of Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, starting from 86 (V) and ending
with 108 (l).</p>
</div>
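<div class="ltx_para">
<p class="ltx_p">A small sketch of this numbering, assuming, as in Figure 2, that the very first AF
line only announces how many AF lines follow; the packed flag strings are kept as
they are, since their decoding depends on the flag type.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: build the AF compression table mapping a numeric continuation class
# to the flag string it abbreviates.

def af_table(aff_lines):
    table, n = {}, 0
    for line in aff_lines:
        fields = line.split()
        if fields and fields[0] == "AF":
            if n == 0:
                n = 1                    # "AF 1263" only announces the count
            else:
                table[n] = fields[1]     # class n abbreviates this flag string
                n += 1
    return table

aff = ["AF 1263", "AF VK # 1", "AF Um # 2"]   # shortened, made-up flag strings
print(af_table(aff))                          # {1: 'VK', 2: 'Um'}
</pre>
</div>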
<div id="S3.SS1.p10" class="ltx_para">
<p class="ltx_p">The prefix and suffix definitions use the same structure. The prefixes
define the left-hand side context and deletions of a dictionary entry
whereas the suffixes deal with the right-hand side. The first line of
an affix set contains the class name, a boolean value defining whether
the affix participates in the prefix-suffix combinatorics and the
count of the number of morphemes in the continuation class, e.g. the
first PFX line of Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> defines the prefix continuation
class attaching to morphemes of class 114 (r), and it combines with
other affixes as defined by the Y instead of N in the third field. The
following lines describe the prefix morphemes as triplets of removal,
addition and context descriptions, e.g., one of the affix lines
in Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> defines removal of ’ö’, addition of
’ős’ with continuation classes from AF line 1108, in case the
previous morpheme ends in ’ö’. The context description may also
contain bracketed expressions for character classes or a full stop
indicating any character (i.e. a wild-card) as in POSIX regular
expressions, e.g. one context description in Figure <a href="#S3.F2" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> matches any Hungarian vowel except a, e or
ö, and another matches any context. The
deletion and addition parts may also consist of a sole ‘0’ meaning a
zero-length string. As can be seen in the Hungarian example, the lines
may also contain an additional number at the end which is used for the
morphological analyzer functionalities.</p>
</div>
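<div class="ltx_para">
<p class="ltx_p">To make the removal/addition/context triplets concrete, here is a sketch that
applies one suffix rule of this shape to a root form. The rule mirrors the example
just discussed (remove ’ö’, add ’ős’ when the root ends in ’ö’), continuation
classes are ignored, and the resulting word form is purely illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: apply one affix triplet (removal, addition, context) as a suffix rule.
# The context is a POSIX-style character-class expression anchored at the end
# of the root; a sole '0' means the empty string for removal or addition.
import re

def apply_suffix(root, removal, addition, context):
    """Return the suffixed form, or None if the context does not match."""
    if re.search(context + "$", root) is None:
        return None
    stem = root if removal == "0" else root[:-len(removal)]
    return stem + ("" if addition == "0" else addition)

# The rule discussed above: remove 'ö', add 'ős', only after 'ö'.
print(apply_suffix("üzletvezetö", "ö", "ős", "ö"))   # üzletvezetős
print(apply_suffix("abakus", "ö", "ős", "ö"))        # None, context fails
</pre>
</div>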
<figure id="S3.F2" class="ltx_figure">
<div class="ltx_listing ltx_lst_numbers_left ltx_lstlisting ltx_align_center ltx_listing">
<div class="ltx_listing_data"><a href="data:text/plain;base64,ICAgIEFGIDEyNjMgfC1cbGFiZWx7dnJiOmh1bnNwZWxsYWZmYWZzdGFydH0tfAogICAgQUYgVnwt%0AXCJ7RX0tfC1qeExufC1cJ3tPfS18fC1cJ3tlfS18fC1cYHtlfS18M3wtXCJ7QX0tfHwtXCJ7YX0t%0AfFR0WWMsNGwgIyAxIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZmFmb25lfS18CiAgICBBRiBVbXwt%0AXCJ7T30tfHlpWWN8LVxje0N9LXwgIyAyCiAgICBBRiB8LVwie099LXxDV1J8LVwne0l9LXwtanwt%0Aw7AtfHwtXCd7T30tfHwtXCd7aX0tfHl8LVwne0V9LXx8LVwne0F9LXx8LVwie3l9LXxZYzIgIyAz%0AIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZmFmZW5kfS18CgogICAgTkFNRSBNYWd5YXIgSXNwZWxs%0AIGhlbHllc3wtXCd7aX0tfHJ8LVwne2F9LXxzaSBzenwtXCd7b30tfHR8LVwne2F9LXxyICAgICB8%0ALVxsYWJlbHt2cmI6aHVuc3BlbGxhZmZuYW1lfS18CiAgICBMQU5HIGh1X0hVICAgICB8LVxsYWJl%0AbHt2cmI6aHVuc3BlbGxhZmZsYW5nfS18CiAgICBTRVQgVVRGLTggICAgIHwtXGxhYmVse3ZyYjpo%0AdW5zcGVsbGFmZnNldH0tfAogICAgS0VZIHwtXCJ7b30tfHwtXCJ7dX0tfHwtXCd7b30tfHxxd2Vy%0AdHp1aW9wfC1cSHtvfS18fC1cJ3t1fS18fCAjIHdyYXAKICAgICAgICBhc2RmZ2hqa2x8LVwne2V9%0ALXx8LVwne2F9LXx8LVxIe3V9LXx8LVwne2l9LXx5eGN2Ym5tICAgICAgICB8LVxsYWJlbHt2cmI6%0AaHVuc3BlbGxhZmZrZXl9LXwKICAgIFRSWSB8LVwne2l9LXx8LVwne299LXx8LVwne3V9LXx0YWVz%0AbHp8LVwne2F9LXxub3JoZ2tpfC1cJ3tlfS18ICMgd3JhcAogICAgICAgIGRteXwtXEh7b30tfHB2%0AfC1cIntvfS18YnVjZmp8LVwie3V9LXx5eHdxLS58LVwne2F9LXwgICAgICB8LVxsYWJlbHt2cmI6%0AaHVuc3BlbGxhZmZ0cnl9LXwKCiAgICBDT01QT1VOREJFR0lOIHYgICAgIHwtXGxhYmVse3ZyYjpo%0AdW5zcGVsbGFmZmNvbXBvdW5kc3RhcnR9LXwKICAgIENPTVBPVU5ERU5EIHgKICAgIE9OTFlJTkNP%0ATVBPVU5EIHwgICAgIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZmNvbXBvdW5kZW5kfS18CiAgICBO%0ARUVEQUZGSVggdQoKICAgIFJFUCAxMjUgICAgICAgIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZnJl%0AcHN0YXJ0fS18CiAgICBSRVAgfC1cJ3tpfS18IGkKICAgIFJFUCBpIHwtXCd7aX0tfAogICAgUkVQ%0AIHwtXCd7b30tfCBvCiAgICBSRVAgb2xpZXJlIG9saXwtXCd7ZX0tfHJlCiAgICBSRVAgY2MgZ3lz%0AegogICAgUkVQIGNzIHRzCiAgICBSRVAgY3MgZHMKICAgIFJFUCBjY3MgdHMgICAgICAgfC1cbGFi%0AZWx7dnJiOmh1bnNwZWxsYWZmcmVwZW5kfS18CiAgICAjIDExNiBtb3JlIFJFUCBsaW5lcwoKICAg%0AIFNGWCA/IFkgMyB8LVxsYWJlbHt2cmI6aHVuc3BlbGxhZmZzZnhzdGFydH0tfAogICAgU0ZYID8g%0AfC1cIntvfS18IHwtXEh7b30tfHMvMTEwOCB8LVwie299LXwgMjA5NzMgfC1cbGFiZWx7dnJiOmh1%0AbnNwZWxsYWZmc2Z4b25lfS18CiAgICBTRlggPyAwIHwtXCJ7b30tfHMvMTEwOCBbXmF8LVwne2F9%0ALXxlfC1cJ3tlfS18aXwtXCd7aX0tfG98LVwne299LXx8LVwie299LXx8LVxIe299LXx1fC1cInt1%0AfS18fC1cSHt1fS18XSAyMDk3MwogICAgU0ZYID8gMCBzLzExMDggW3wtXCd7YX0tfHwtXCd7ZX0t%0AfGl8LVwne2l9LXxvfC1cJ3tvfS18fC1cJ3t1fS18fC1cSHtvfS18dXwtXCd7dX0tfHwtXCJ7dX0t%0AfHwtXEh7dX0tfC1dIDIwOTczIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZnNmeGVuZH0tfAoKICAg%0AIFBGWCByIFkgMTk1IHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZnBmeHN0YXJ0fS18CiAgICBQRlgg%0AciAwIGxlZ3wtXCd7dX0tfGpyYS8xMjYyIC4gMjI1NTEKICAgIFBGWCByIDAgbGVnfC1cJ3t1fS18%0Aamp8LVwne2F9LXwvMTI2MiAuIDIyNTUyIHwtXGxhYmVse3ZyYjpodW5zcGVsbGFmZnBmeGVuZH0t%0AfAogICAgIyAxOTMgbW9yZSBQRlggciBsaW5lcw==%0A">⬇</a></div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">1</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">AF</span><span class="ltx_text ltx_lst_space"> </span>1263<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">AF</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">V</span>Ë-j<span class="ltx_text ltx_lst_identifier">xLn</span>Óéè3Ää<span class="ltx_text ltx_lst_identifier">TtYc</span>,4<span class="ltx_text ltx_lst_identifier">l</span><span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span>1<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">3</span><span class="ltx_text ltx_lst_identifier">AF</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Um</span>Ö<span class="ltx_text ltx_lst_identifier">yiYc</span>Ç<span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span>2
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">AF</span><span class="ltx_text ltx_lst_space"> </span>Ö<span class="ltx_text ltx_lst_identifier">CWR</span>Í-jðÓí<span class="ltx_text ltx_lst_identifier">y</span>ÉÁÿ<span class="ltx_text ltx_lst_identifier">Yc2</span><span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span>3<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">5</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">NAME</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Magyar</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Ispell</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">helyes</span>í<span class="ltx_text ltx_lst_identifier">r</span>á<span class="ltx_text ltx_lst_identifier">si</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">sz</span>ó<span class="ltx_text ltx_lst_identifier">t</span>á<span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">7</span><span class="ltx_text ltx_lst_identifier">LANG</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">hu_HU</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">SET</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">UTF</span>-8<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">9</span><span class="ltx_text ltx_lst_identifier">KEY</span><span class="ltx_text ltx_lst_space"> </span>öüó|<span class="ltx_text ltx_lst_identifier">qwertzuiop</span>őú|<span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">wrap</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">asdfgh</span>j<span class="ltx_text ltx_lst_identifier">kl</span>éáűí<span class="ltx_text ltx_lst_identifier">yxcvbnm</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">11</span><span class="ltx_text ltx_lst_identifier">TRY</span><span class="ltx_text ltx_lst_space"> </span>íóú<span class="ltx_text ltx_lst_identifier">taeslz</span>á<span class="ltx_text ltx_lst_identifier">norhgki</span>é<span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">wrap</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">dmy</span>ő<span class="ltx_text ltx_lst_identifier">pv</span>ö<span class="ltx_text ltx_lst_identifier">bucf</span>jü<span class="ltx_text ltx_lst_identifier">yxwq</span>-.á<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">13</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">COMPOUNDBEGIN</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">v</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">15</span><span class="ltx_text ltx_lst_identifier">COMPOUNDEND</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">x</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">ONLYINCOMPOUND</span><span class="ltx_text ltx_lst_space"> </span>|<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">17</span><span class="ltx_text ltx_lst_identifier">NEEDAFFIX</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">u</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">19</span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span>125<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span>í<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">i</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">21</span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">i</span><span class="ltx_text ltx_lst_space"> </span>í
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span>ó<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">o</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">23</span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">oliere</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">oli</span>é<span class="ltx_text ltx_lst_identifier">re</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">cc</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">gysz</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">25</span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">cs</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">ts</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">cs</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">ds</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">27</span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">ccs</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">ts</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>#<span class="ltx_text ltx_lst_space"> </span>116<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">more</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">REP</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">lines</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">29</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">SFX</span><span class="ltx_text ltx_lst_space"> </span>?<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Y</span><span class="ltx_text ltx_lst_space"> </span>3<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">31</span><span class="ltx_text ltx_lst_identifier">SFX</span><span class="ltx_text ltx_lst_space"> </span>?<span class="ltx_text ltx_lst_space"> </span>ö<span class="ltx_text ltx_lst_space"> </span>ő<span class="ltx_text ltx_lst_identifier">s</span>/1108<span class="ltx_text ltx_lst_space"> </span>ö<span class="ltx_text ltx_lst_space"> </span>20973<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">SFX</span><span class="ltx_text ltx_lst_space"> </span>?<span class="ltx_text ltx_lst_space"> </span>0<span class="ltx_text ltx_lst_space"> </span>ö<span class="ltx_text ltx_lst_identifier">s</span>/1108<span class="ltx_text ltx_lst_space"> </span>[^<span class="ltx_text ltx_lst_identifier">a</span>á<span class="ltx_text ltx_lst_identifier">e</span>é<span class="ltx_text ltx_lst_identifier">i</span>í<span class="ltx_text ltx_lst_identifier">o</span>óöő<span class="ltx_text ltx_lst_identifier">u</span>üű]<span class="ltx_text ltx_lst_space"> </span>20973
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">33</span><span class="ltx_text ltx_lst_identifier">SFX</span><span class="ltx_text ltx_lst_space"> </span>?<span class="ltx_text ltx_lst_space"> </span>0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">s</span>/1108<span class="ltx_text ltx_lst_space"> </span>[áé<span class="ltx_text ltx_lst_identifier">i</span>í<span class="ltx_text ltx_lst_identifier">o</span>óúő<span class="ltx_text ltx_lst_identifier">u</span>úüű-]<span class="ltx_text ltx_lst_space"> </span>20973<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">35</span><span class="ltx_text ltx_lst_identifier">PFX</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Y</span><span class="ltx_text ltx_lst_space"> </span>195<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">PFX</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">leg</span>új<span class="ltx_text ltx_lst_identifier">ra</span>/1262<span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_space"> </span>22551
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">37</span><span class="ltx_text ltx_lst_identifier">PFX</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">leg</span>újjá/1262<span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_space"> </span>22552<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>#<span class="ltx_text ltx_lst_space"> </span>193<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">more</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">PFX</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">lines</span>
</div>
</div>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Excerpts from Hungarian affix file</figcaption>
</figure>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>TeX Hyphenation Files</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">The TeX hyphenation scheme is described in Frank Liang’s
dissertation <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">liang/1983</span>]</cite>, which provides a packed suffix tree
structure for storing the hyphenation patterns, which is a special
optimized finite-state automaton. This paper merely reformulates the
finite-state form of the patterns, for the purpose of obtaining a
general finite-state transducer version of the rules to be combined
with other pieces of the finite-state writer’s tools. In principle,
the TeX hyphenation files are like any TeX source files, they may
contain arbitrary TeX code, and the only requirement is that they
have the ‘patterns’ command and/or the ‘hyphenation’ command. In
practice, it is a convention that they do not contain anything else
than these two commands, as well as a comment section describing the
licensing and these conventions. The patterns section is a whitespace
separated list of hyphenation pattern strings. The pattern strings are
simple strings containing characters of the language as well as
numbers marking hyphenation points, as shown in
Figure <a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. The odd numbers add a potential
hyphenation point in the context specified by non-numeric characters,
and the even numbers remove one, e.g. on line <a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>,
the hyphen with left context ‘f’ and right context ‘fis’ would be
removed, and a hyphen with left context ‘ff’ and right context ‘is’ is
added. The numbers are applied in ascending order. The full-stop
character is used to signify a word boundary so the rule on
line <a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> will apply to ‘ache’ but not to
‘headache’. The hyphenation command on
lines <a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>–<a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> is just a list
of words with all hyphenation points marked by hyphens. It has higher
precedence than the rules and it is used for fixing mistakes made
by the rule set.</p>
</div>
<figure id="S3.F3" class="ltx_figure">
<div class="ltx_listing ltx_lst_numbers_left ltx_lstlisting ltx_align_center ltx_listing">
<div class="ltx_listing_data"><a href="data:text/plain;base64,ICAgIFxwYXR0ZXJuc3sKICAgICAgLmFjaDQgfC1cbGFiZWx7dnJiOnRleGh5cGhwYXRzdGFydH0t%0AfAogICAgICAuYWQ0ZGVyCiAgICAgIC5hZjF0CiAgICAgIC5hbDN0CiAgICAgIC5hbTVhdAogICAg%0AICBmNWZpbi4KICAgICAgZjJmNWlzIHwtXGxhYmVse3ZyYjp0ZXhoeXBocGF0MjV9LXwKICAgICAg%0AZjRmbHkKICAgICAgZjJmeSB8LVxsYWJlbHt2cmI6dGV4aHlwaHBhdGVuZH0tfAogICAgfQogICAg%0AXGh5cGhlbmF0aW9uewogICAgICBhcy1zby1jaWF0ZSB8LVxsYWJlbHt2cmI6dGV4aHlwaHlwc3Rh%0AcnR9LXwKCiAgICAgIHByb2plY3QKICAgICAgdGEtYmxlIHwtXGxhYmVse3ZyYjp0ZXhoeXBoeXBl%0AbmR9LXwKICAgIH0K%0A">⬇</a></div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">1</span><span class="ltx_text ltx_lst_space"> </span>\<span class="ltx_text ltx_lst_identifier">patterns</span>{
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_identifier">ach4</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">3</span><span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_identifier">ad4der</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_identifier">af1t</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">5</span><span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_identifier">al3t</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span>.<span class="ltx_text ltx_lst_identifier">am5at</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">7</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">f5fin</span>.
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">f2f5is</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">9</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">f4fly</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">f2fy</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">11</span>}
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>\<span class="ltx_text ltx_lst_identifier">hyphenation</span>{
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">13</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">as</span>-<span class="ltx_text ltx_lst_identifier">so</span>-<span class="ltx_text ltx_lst_identifier">ciate</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">15</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">pro</span>j<span class="ltx_text ltx_lst_identifier">ect</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">ta</span>-<span class="ltx_text ltx_lst_identifier">ble</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">17</span>}
</div>
</div>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Excerpts from English TeX hyphenation patterns</figcaption>
</figure>
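<div class="ltx_para">
<p class="ltx_p">To make the pattern mechanics concrete, the following is a minimal
Python sketch of Liang-style pattern application (it is ours, not the
finite-state reformulation of this paper and not the actual TeX implementation;
the function name and the left_min/right_min parameters are invented here).
It collects the digits of every matching pattern, lets the highest digit win at
each position, and hyphenates where that digit is odd:</p>
<pre class="ltx_verbatim">
def liang_hyphenate(word, patterns, left_min=2, right_min=2):
    """Toy application of TeX-style hyphenation patterns to one word."""
    work = "." + word.lower() + "."        # full stops mark the word boundary
    points = [0] * (len(work) + 1)         # points[i]: value just before work[i]
    for pat in patterns:
        chars = "".join(c for c in pat if not c.isdigit())
        vals = [0] * (len(chars) + 1)      # digit in front of each pattern letter
        k = 0
        for c in pat:
            if c.isdigit():
                vals[k] = int(c)
            else:
                k += 1
        for i in range(len(work) - len(chars) + 1):
            if work[i:i + len(chars)] == chars:
                for j, v in enumerate(vals):
                    points[i + j] = max(points[i + j], v)   # highest value wins
    out = []
    for m, c in enumerate(word):
        out.append(c)
        # a break between word[m] and word[m+1] corresponds to points[m + 2];
        # an odd value licenses a hyphen, and no breaks too close to either edge
        if left_min &lt;= m + 1 &lt;= len(word) - right_min and points[m + 2] % 2 == 1:
            out.append("-")
    return "".join(out)

# With the patterns of Figure 3, 'offish' comes out as 'off-ish':
# the even 2 of 'f2f5is' blocks 'of-fish' while the odd 5 allows 'off-ish'.
</pre>
</div>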
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Methods</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">This article presents methods for converting the existing spell-checking
dictionaries with error models, as well as hyphenators to finite-state
automata. As our toolkit we use the free open-source HFST
toolkit<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup><span class="ltx_ERROR undefined">\url</span>http://HFST.sf.net</span></span></span>, which is a general purpose API for
finite-state automata, and a set of tools for using legacy data, such as Xerox
finite-state morphologies. For this reason this paper presents the algorithms
as formulae such that they can be readily implemented using finite-state
algebra and the basic HFST tools.</p>
</div>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">The lexc lexicon model is used by the tools for describing parts of the
morphotactics. It is a simple right-linear grammar for specifying finite-state
automata, described in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/2003</span>, <span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/sfcm/Linden2009</span>]</cite>. The twolc rule
formalism is used for defining context-based rules with two-level automata; it
is described in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">koskenniemi/1983</span>, <span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/sfcm/Linden2009</span>]</cite>.</p>
</div>
<div id="S4.p3" class="ltx_para">
<p class="ltx_p">This section presents both a pseudo-code presentation for the conversion
algorithms, as well as excerpts of the final converted files from the material
given in Figures <a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> and <a href="#S3.F3" title="Figure 3 ‣ 3.2 TeX Hyphenation Files ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> of
Section <a href="#S3" title="3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. The converter code is available in the HFST SVN
repository<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup><span class="ltx_ERROR undefined">\url</span>http://hfst.svn.sourceforge.net/viewvc/hfst/trunk/conversion-scripts/</span></span></span>
, for those who wish to see the specifics of the implementation in
lex, yacc, c and python.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>Hunspell Dictionaries</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">The hunspell dictionaries are transformed into a finite-state transducer
language model by a finite-state formulation consisting of two parts: a lexicon
and one or more rule sets. The root and affix dictionaries are turned into
finite-state lexicons in the lexc formalism. The lexc formalism models the part
of the morphotactics concerning the root dictionary and the adjacent suffixes. The
rest is encoded by injecting special symbols, called flag diacritics, into the
morphemes, restricting morpheme co-occurrence by implicit rules that have
been outlined in <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">beesley/1998</span>]</cite>; the flag diacritics are denoted in lexc
by at-sign-delimited substrings. The affix definitions in hunspell also define
deletions and context restrictions, which are turned into explicit two-level
rules.</p>
</div>
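<div class="ltx_para">
<p class="ltx_p">As a rough illustration of the dictionary side of this conversion, here is a
simplified Python sketch, not the actual converter: it assumes single-character
continuation flags, and the fallback continuation class HUNSPELL_end for entries
without flags is our own choice. One .dic entry is rewritten into one lexc
root-lexicon entry by prepending a clearing flag diacritic for each continuation
class:</p>
<pre class="ltx_verbatim">
def dic_entry_to_lexc(line):
    """Turn a .dic entry such as 'abakus/HDY' into a lexc root-lexicon entry."""
    if "/" in line:
        morpheme, conts = line.split("/", 1)
    else:
        morpheme, conts = line, ""
    flags = "".join(f"@C.{c}@" for c in conts)     # clear each continuation flag
    cont_lexicon = conts if conts else "HUNSPELL_end"
    return f"{flags}{morpheme} {cont_lexicon} ;"

# dic_entry_to_lexc("abakus/HDY")
#   -> '@C.H@@C.D@@C.Y@abakus HDY ;'   (cf. the converted lexicon in Figure 4)
</pre>
</div>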
<div id="S4.SS1.p2" class="ltx_para">
<p class="ltx_p">The pseudo-code for the conversion of hunspell files is provided in
Algorithm <a href="#alg1" title="Algorithm 1 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> and excerpts from the conversion of the examples
in Figures <a href="#S3.F1" title="Figure 1 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> and <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>
can be found in Figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>. The dictionary file of
hunspell is almost identical to the lexc root lexicon, and the conversion is
straightforward. This is expressed on lines <a href="#alg1.l4" title="4 ‣ Algorithm 1 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>–<a href="#alg1.l9" title="9 ‣ Algorithm 1 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">9</span></a> as simply going
through all entries and adding them to the root lexicon, as in
lines <a href="#S4.F4" title="Figure 4 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>—<a href="#S4.F4" title="Figure 4 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> of the example result. The handling of
affixes is similar, with the exception of adding flag diacritics for
co-occurrence restrictions along with the morphemes. This is shown on
lines <a href="#alg1.l10" title="10 ‣ Algorithm 1 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">10</span></a>—<a href="#alg1.l28" title="28 ‣ Algorithm 1 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">28</span></a> of the pseudo-code, and applying it will create the
lines <a href="#S4.F4" title="Figure 4 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>—<a href="#S4.F4" title="Figure 4 ‣ 4.1 Hunspell Dictionaries ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> of the Swedish example, which does not contain further
restrictions on suffixes.
</p>
</div>
<div id="S4.SS1.p3" class="ltx_para">
<p class="ltx_p">To finalize the morpheme and compounding restrictions, the final lexicon in the
lexc description must be a lexicon checking that all prefixes with forward requirements
have their requiring flags turned off.</p>
</div>
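<div class="ltx_para">
<p class="ltx_p">A minimal sketch of that closing lexicon (the function name is ours, and the
@D flag is read here simply as ‘disallow if the feature is still set’) emits one
check per collected flag class before the end-of-word symbol:</p>
<pre class="ltx_verbatim">
def end_lexicon(collected_flags):
    """Emit the final lexc lexicon that rejects words with unresolved flags."""
    checks = "".join(f"@D.{f}@" for f in collected_flags)
    return "LEXICON HUNSPELL_end\n" + checks + " # ;"

# end_lexicon("HDYAvm") reproduces the HUNSPELL_end lexicon of Figure 4:
#   @D.H@@D.D@@D.Y@@D.A@@D.v@@D.m@ # ;
</pre>
</div>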
<figure id="alg1" class="ltx_float ltx_float_algorithm ltx_framed_top">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_float">Algorithm 1 </span>Extracting morphemes from hunspell dictionaries</figcaption>
<div class="ltx_listing ltx_framed_topbottom ltx_listing">
<div id="alg1.l1" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l1.m1" class="ltx_Math" alttext="finalflags\leftarrow\epsilon" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi><mo></mo><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mi>ϵ</mi></mrow></math>
</div>
<div id="alg1.l2" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">2:</span></span> <span class="ltx_text ltx_font_bold">for all</span> lines <math id="alg1.l2.m1" class="ltx_Math" alttext="morpheme/Conts" display="inline"><mrow><mrow><mrow><mi>m</mi><mo></mo><mi>o</mi><mo></mo><mi>r</mi><mo></mo><mi>p</mi><mo></mo><mi>h</mi><mo></mo><mi>e</mi><mo></mo><mi>m</mi><mo></mo><mi>e</mi></mrow><mo>/</mo><mi>C</mi></mrow><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></math> in dic <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l3" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l3.m1" class="ltx_Math" alttext="flags\leftarrow\epsilon" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mi>ϵ</mi></mrow></math>
</div>
<div id="alg1.l4" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">4:</span></span> <span class="ltx_text ltx_font_bold">for all</span> <math id="alg1.l4.m1" class="ltx_Math" alttext="cont" display="inline"><mrow><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi></mrow></math> in <math id="alg1.l4.m2" class="ltx_Math" alttext="Conts" display="inline"><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></math> <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l5" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l5.m1" class="ltx_Math" alttext="flags\leftarrow flags+\textit{@C.cont@}" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>+</mo><mtext mathvariant="italic">@C.cont@</mtext></mrow></mrow></math>
</div>
<div id="alg1.l6" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">6:</span></span> <math id="alg1.l6.m1" class="ltx_Math" alttext="Lex_{Conts}\leftarrow Lex_{Conts}\cup\textit{0:[<cont] cont}" display="inline"><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></msub></mrow><mo>←</mo><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></msub></mrow><mo>∪</mo><mtext mathvariant="italic">0:[¡cont] cont</mtext></mrow></mrow></math>
</div>
<div id="alg1.l7" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l8" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">8:</span></span> <math id="alg1.l8.m1" class="ltx_Math" alttext="Lex_{Root}\leftarrow Lex_{Root}\cup\textit{flags + morpheme Conts}" display="inline"><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>R</mi><mo></mo><mi>o</mi><mo></mo><mi>o</mi><mo></mo><mi>t</mi></mrow></msub></mrow><mo>←</mo><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>R</mi><mo></mo><mi>o</mi><mo></mo><mi>o</mi><mo></mo><mi>t</mi></mrow></msub></mrow><mo>∪</mo><mtext mathvariant="italic">flags + morpheme Conts</mtext></mrow></mrow></math>
</div>
<div id="alg1.l9" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l10" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">10:</span></span> <span class="ltx_text ltx_font_bold">for all</span> suffixes <math id="alg1.l10.m1" class="ltx_Math" alttext="lex,deletions,morpheme/Conts,context" display="inline"><mrow><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow><mo>,</mo><mrow><mi>d</mi><mo></mo><mi>e</mi><mo></mo><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>t</mi><mo></mo><mi>i</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi></mrow><mo>,</mo><mrow><mrow><mrow><mi>m</mi><mo></mo><mi>o</mi><mo></mo><mi>r</mi><mo></mo><mi>p</mi><mo></mo><mi>h</mi><mo></mo><mi>e</mi><mo></mo><mi>m</mi><mo></mo><mi>e</mi></mrow><mo>/</mo><mi>C</mi></mrow><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow><mo>,</mo><mrow><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi><mo></mo><mi>t</mi></mrow></mrow></math> in aff <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l11" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l11.m1" class="ltx_Math" alttext="flags\leftarrow\epsilon" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mi>ϵ</mi></mrow></math>
</div>
<div id="alg1.l12" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">12:</span></span> <span class="ltx_text ltx_font_bold">for all</span> <math id="alg1.l12.m1" class="ltx_Math" alttext="cont" display="inline"><mrow><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi></mrow></math> in <math id="alg1.l12.m2" class="ltx_Math" alttext="Conts" display="inline"><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></math> <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l13" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l13.m1" class="ltx_Math" alttext="flags\leftarrow flags+\textit{@C.cont@}" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>+</mo><mtext mathvariant="italic">@C.cont@</mtext></mrow></mrow></math>
</div>
<div id="alg1.l14" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">14:</span></span> <math id="alg1.l14.m1" class="ltx_Math" alttext="Lex_{Conts}\leftarrow Lex_{Conts}\cup\textit{0 cont}" display="inline"><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></msub></mrow><mo>←</mo><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>C</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow></msub></mrow><mo>∪</mo><mtext mathvariant="italic">0 cont</mtext></mrow></mrow></math>
</div>
<div id="alg1.l15" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l16" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">16:</span></span> <math id="alg1.l16.m1" class="ltx_Math" alttext="Lex_{lex}\leftarrow Lex_{lex}\cup\textit{flags}+[<lex]+\textit{morpheme Conts}" display="inline"><mrow><mi>L</mi><mi>e</mi><msub><mi>x</mi><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></msub><mo>←</mo><mi>L</mi><mi>e</mi><msub><mi>x</mi><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></msub><mo>∪</mo><mtext>𝑓𝑙𝑎𝑔𝑠</mtext><mo>+</mo><mrow><mo stretchy="false">[</mo><mo><</mo><mi>l</mi><mi>e</mi><mi>x</mi><mo stretchy="false">]</mo></mrow><mo>+</mo><mtext mathvariant="italic">morpheme Conts</mtext></mrow></math>
</div>
<div id="alg1.l17" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">for all</span> <math id="alg1.l17.m1" class="ltx_Math" alttext="del" display="inline"><mrow><mi>d</mi><mo></mo><mi>e</mi><mo></mo><mi>l</mi></mrow></math> in <math id="alg1.l17.m2" class="ltx_Math" alttext="deletions" display="inline"><mrow><mi>d</mi><mo></mo><mi>e</mi><mo></mo><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>t</mi><mo></mo><mi>i</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi></mrow></math> <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l18" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">18:</span></span> <math id="alg1.l18.m1" class="ltx_Math" alttext="lc\leftarrow context+deletions\textrm{ before del}" display="inline"><mrow><mrow><mi>l</mi><mo></mo><mi>c</mi></mrow><mo>←</mo><mrow><mrow><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi><mo></mo><mi>t</mi></mrow><mo>+</mo><mrow><mi>d</mi><mo></mo><mi>e</mi><mo></mo><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>t</mi><mo></mo><mi>i</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi><mo></mo><mtext> before del</mtext></mrow></mrow></mrow></math>
</div>
<div id="alg1.l19" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l19.m1" class="ltx_Math" alttext="rc\leftarrow deletions\textrm{ after del}+[<lex]+morpheme" display="inline"><mrow><mi>r</mi><mi>c</mi><mo>←</mo><mi>d</mi><mi>e</mi><mi>l</mi><mi>e</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mi>s</mi><mtext> after del</mtext><mo>+</mo><mrow><mo stretchy="false">[</mo><mo><</mo><mi>l</mi><mi>e</mi><mi>x</mi><mo stretchy="false">]</mo></mrow><mo>+</mo><mi>m</mi><mi>o</mi><mi>r</mi><mi>p</mi><mi>h</mi><mi>e</mi><mi>m</mi><mi>e</mi></mrow></math>
</div>
<div id="alg1.l20" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">20:</span></span> <math id="alg1.l20.m1" class="ltx_Math" alttext="Twol_{d}\leftarrow Twol_{d}\cap\textit{del:0}\Leftrightarrow\textit{lc \_ rc}" display="inline"><mrow><mrow><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>d</mi></msub></mrow><mo>←</mo><mrow><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>d</mi></msub></mrow><mo>∩</mo><mtext mathvariant="italic">del:0</mtext></mrow></mrow><mo>⇔</mo><mtext mathvariant="italic">lc _ rc</mtext></mrow></math>
</div>
<div id="alg1.l21" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l22" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">22:</span></span> <math id="alg1.l22.m1" class="ltx_Math" alttext="Twol_{m}\leftarrow Twol_{m}\cap[<lex]:0\Leftrightarrow\textit{context \_ morpheme}" display="inline"><mrow><mrow><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>m</mi></msub></mrow><mo>←</mo><mrow><mrow><mrow><mi>T</mi><mo></mo><mi>w</mi><mo></mo><mi>o</mi><mo></mo><msub><mi>l</mi><mi>m</mi></msub></mrow><mo>∩</mo></mrow><mspace width="veryverythickmathspace"></mspace><mrow><mo stretchy="false">[</mo><mrow><mi></mi><mo><</mo><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow></mrow><mo stretchy="false">]</mo></mrow></mrow></mrow><mo>:</mo><mn>0</mn><mo>⇔</mo><mtext mathvariant="italic">context _ morpheme</mtext></mrow></math>
</div>
<div id="alg1.l23" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l24" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">24:</span></span> <span class="ltx_text ltx_font_bold">for all</span> prefixes <math id="alg1.l24.m1" class="ltx_Math" alttext="lex,deletions,morpheme/conts,context" display="inline"><mrow><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow><mo>,</mo><mrow><mi>d</mi><mo></mo><mi>e</mi><mo></mo><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>t</mi><mo></mo><mi>i</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>s</mi></mrow><mo>,</mo><mrow><mrow><mrow><mi>m</mi><mo></mo><mi>o</mi><mo></mo><mi>r</mi><mo></mo><mi>p</mi><mo></mo><mi>h</mi><mo></mo><mi>e</mi><mo></mo><mi>m</mi><mo></mo><mi>e</mi></mrow><mo>/</mo><mi>c</mi></mrow><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>s</mi></mrow><mo>,</mo><mrow><mi>c</mi><mo></mo><mi>o</mi><mo></mo><mi>n</mi><mo></mo><mi>t</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi><mo></mo><mi>t</mi></mrow></mrow></math> in aff <span class="ltx_text ltx_font_bold">do</span>
</div>
<div id="alg1.l25" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l25.m1" class="ltx_Math" alttext="flags\leftarrow\textit{@P.lex@}" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mtext mathvariant="italic">@P.lex@</mtext></mrow></math>
</div>
<div id="alg1.l26" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">26:</span></span> <math id="alg1.l26.m1" class="ltx_Math" alttext="finalflags\leftarrow finalflags+\textit{@D.lex@}" display="inline"><mrow><mrow><mi>f</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi><mo></mo><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>←</mo><mrow><mrow><mi>f</mi><mo></mo><mi>i</mi><mo></mo><mi>n</mi><mo></mo><mi>a</mi><mo></mo><mi>l</mi><mo></mo><mi>f</mi><mo></mo><mi>l</mi><mo></mo><mi>a</mi><mo></mo><mi>g</mi><mo></mo><mi>s</mi></mrow><mo>+</mo><mtext mathvariant="italic">@D.lex@</mtext></mrow></mrow></math>
</div>
<div id="alg1.l27" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l27.m1" class="ltx_Math" alttext="lex\rightarrow prefixes" display="inline"><mrow><mrow><mi>l</mi><mo></mo><mi>e</mi><mo></mo><mi>x</mi></mrow><mo>→</mo><mrow><mi>p</mi><mo></mo><mi>r</mi><mo></mo><mi>e</mi><mo></mo><mi>f</mi><mo></mo><mi>i</mi><mo></mo><mi>x</mi><mo></mo><mi>e</mi><mo></mo><mi>s</mi></mrow></mrow></math>
{otherwise as with suffixes, swapping left and right}
</div>
<div id="alg1.l28" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"><span class="ltx_text" style="font-size:80%;">28:</span></span> <span class="ltx_text ltx_font_bold">end</span> <span class="ltx_text ltx_font_bold">for</span>
</div>
<div id="alg1.l29" class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span> <math id="alg1.l29.m1" class="ltx_Math" alttext="Lex_{end}\leftarrow Lex_{end}\cup\textit{finalflags \#}" display="inline"><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>d</mi></mrow></msub></mrow><mo>←</mo><mrow><mrow><mi>L</mi><mo></mo><mi>e</mi><mo></mo><msub><mi>x</mi><mrow><mi>e</mi><mo></mo><mi>n</mi><mo></mo><mi>d</mi></mrow></msub></mrow><mo>∪</mo><mtext mathvariant="italic">finalflags #</mtext></mrow></mrow></math>
</div>
</div>
</figure>
<figure id="S4.F4" class="ltx_figure">
<div class="ltx_listing ltx_lst_numbers_left ltx_lstlisting ltx_align_center ltx_listing">
<div class="ltx_listing_data"><a href="data:text/plain;base64,ICAgIExFWElDT04gUm9vdAogICAgICAgIEhVTlNQRUxMX3BmeCA7CiAgICAgICAgSFVOUEVMTF9k%0AaWMgOwoKICAgICEgc3dlZGlzaCBsZXhjCiAgICBMRVhJQ09OIEhVTlNQRUxMX2RpYyB8LVxsYWJl%0AbHtjfS18CiAgICBAQy5IQEBDLkRAQEMuWUBhYmFrdXMgSERZIDsKICAgIEBDLkFAQEMuSEBAQy5E%0AQEBDLnZAQEMuWUBhYmFsaWVuYXRpb24KICAgICAgICBIVU5TUEVMTF9BSER2WSA7CiAgICBAQy5N%0AQEBDLllAYWJhbGllbmVyYSBNWSA7IHwtXGxhYmVse2R9LXwKCiAgICBMRVhJQ09OIEhEWQogICAg%0AMDpbPEhdICAgIEggOwogICAgMDpbPERdICAgIEQgOwogICAgMDpbPFldICAgIFkgOwoKICAgIExF%0AWElDT04gSCB8LVxsYWJlbHtlfS18CiAgICBlciAgSFVOU1BFTExfZW5kIDsKICAgIGVycyAgSFVO%0AU1BFTExfZW5kIDsKICAgIGVyICBIVU5TUEVMTF9lbmQgOwogICAgZXJzICBIVU5TUEVMTF9lbmQg%0AOyB8LVxsYWJlbHtmfS18CgogICAgTEVYSUNPTiBIVU5TUEVMTF9lbmQKICAgIEBELkhAQEQuREBA%0ARC5ZQEBELkFAQEQudkBARC5tQCAjIDsKCiAgICAhIHN3ZWRpc2ggdHdvbGMgZmlsZQogICAgUnVs%0AZXMKICAgICJTdWZmaXggSCBhbGxvd2VkIGNvbnRleHRzIgogICAgJVslPEglXTogMCA8PT4gXCBh%0AIF8gZSByIDsKICAgICAgICBcIGEgXyBlIHIgcyA7CiAgICAgICAgYTowIF8gZSByIDsKICAgICAg%0AICBhOjAgXyBlIHIgcyA7CgogICAgImEgZGVsZXRpb24gY29udGV4dHMiCiAgICBhOjAgPD0+IF8g%0AJVslPEglXTowIGUgciA7CiAgICAgICAgXyAlWyU8SCVdOiBlIHIgcyA7%0A">⬇</a></div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">1</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">LEXICON</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Root</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_pfx</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">3</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNPELL_dic</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">5</span>!<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">swedish</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">lexc</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">LEXICON</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_dic</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">7</span><span class="ltx_text ltx_lst_identifier">@C</span>.<span class="ltx_text ltx_lst_identifier">H@@C</span>.<span class="ltx_text ltx_lst_identifier">D@@C</span>.<span class="ltx_text ltx_lst_identifier">Y@abakus</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HDY</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">@C</span>.<span class="ltx_text ltx_lst_identifier">A@@C</span>.<span class="ltx_text ltx_lst_identifier">H@@C</span>.<span class="ltx_text ltx_lst_identifier">D@@C</span>.<span class="ltx_text ltx_lst_identifier">v@@C</span>.<span class="ltx_text ltx_lst_identifier">Y@abalienation</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">9</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_AHDvY</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">@C</span>.<span class="ltx_text ltx_lst_identifier">M@@C</span>.<span class="ltx_text ltx_lst_identifier">Y@abalienera</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">MY</span><span class="ltx_text ltx_lst_space"> </span>;<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">11</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">LEXICON</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HDY</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">13</span>0:[<<span class="ltx_text ltx_lst_identifier">H</span>]<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">H</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>0:[<<span class="ltx_text ltx_lst_identifier">D</span>]<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">D</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">15</span>0:[<<span class="ltx_text ltx_lst_identifier">Y</span>]<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">Y</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">17</span><span class="ltx_text ltx_lst_identifier">LEXICON</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">H</span><span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">er</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_end</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">19</span><span class="ltx_text ltx_lst_identifier">ers</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_end</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">er</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_end</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">21</span><span class="ltx_text ltx_lst_identifier">ers</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_end</span><span class="ltx_text ltx_lst_space"> </span>;<span class="ltx_text ltx_lst_space"> </span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">23</span><span class="ltx_text ltx_lst_identifier">LEXICON</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">HUNSPELL_end</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_identifier">@D</span>.<span class="ltx_text ltx_lst_identifier">H@@D</span>.<span class="ltx_text ltx_lst_identifier">D@@D</span>.<span class="ltx_text ltx_lst_identifier">Y@@D</span>.<span class="ltx_text ltx_lst_identifier">A@@D</span>.<span class="ltx_text ltx_lst_identifier">v@@D</span>.<span class="ltx_text ltx_lst_identifier">m@</span><span class="ltx_text ltx_lst_space"> </span>#<span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">25</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>!<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">swedish</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">twolc</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">file</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">27</span><span class="ltx_text ltx_lst_identifier">Rules</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>”<span class="ltx_text ltx_lst_identifier">Suffix</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">H</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">allowed</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">contexts</span>”
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">29</span>%[%<<span class="ltx_text ltx_lst_identifier">H</span>%]:<span class="ltx_text ltx_lst_space"> </span>0<span class="ltx_text ltx_lst_space"> </span><=><span class="ltx_text ltx_lst_space"> </span>\<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">a</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span>\<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">a</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">s</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">31</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">a</span>:0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">a</span>:0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">s</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">33</span>
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span>”<span class="ltx_text ltx_lst_identifier">a</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">deletion</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">contexts</span>”
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing">35</span><span class="ltx_text ltx_lst_identifier">a</span>:0<span class="ltx_text ltx_lst_space"> </span><=><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span>%[%<<span class="ltx_text ltx_lst_identifier">H</span>%]:0<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
<div class="ltx_listingline">
<span class="ltx_tag ltx_tag_listing"></span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">_</span><span class="ltx_text ltx_lst_space"> </span>%[%<<span class="ltx_text ltx_lst_identifier">H</span>%]:<span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">e</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">r</span><span class="ltx_text ltx_lst_space"> </span><span class="ltx_text ltx_lst_identifier">s</span><span class="ltx_text ltx_lst_space"> </span>;
</div>
</div>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Converted dic and aff lexicons and rules governing the deletions</figcaption>
</figure>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Hunspell Error Models</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">The hunspell dictionary configuration file, i.e. the affix file, contains
several parts that need to be combined to achieve a similar error correction
model as in the hunspell lexicon.</p>
</div>
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">The error model part defined in the KEY section allows for one slip of
the finger in any of the keyboard neighboring classes. This is implemented by
creating a simple homogeneously weighted crossproduct of each class, as given
on lines <a href="#alg2.l1" title="1 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>–<a href="#alg2.l7" title="7 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a> of
Algorithm <a href="#alg2" title="Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. For the first part of the example on
line <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> of Figure <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, this
results in the lexc lexicon on
lines <a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>–<a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> in
Figure <a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.</p>
</div>
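<div class="ltx_para">
<p class="ltx_p">As an illustrative sketch only (not the converter's actual code), the homogeneously
weighted crossproduct of the KEY classes can be enumerated in a few lines of Python;
the weight constant, the function name and the tiny example KEY string are our own assumptions:</p>
<pre class="ltx_verbatim">
# Hypothetical sketch: expand a hunspell KEY string into uniformly weighted
# substitution pairs, one pair per "slip of the finger" inside a class.
KEY_WEIGHT = 100.0  # assumed uniform weight for a single keyboard slip

def key_substitutions(key_string, weight=KEY_WEIGHT):
    """Yield (typed, intended, weight) triples for every ordered pair of
    distinct characters that share a KEY class (classes are |-separated)."""
    for cls in key_string.split("|"):
        for typed in cls:
            for intended in cls:
                if typed != intended:
                    yield typed, intended, weight

# Illustrative KEY classes only, not taken from any particular affix file:
for typed, intended, w in key_substitutions("qwe|asd"):
    print(f"{typed}:{intended}\t{w}")
</pre>
</div>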
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p">The error model part defined in the REP section is an arbitrarily long ordered
confusion set. This is implemented by simply encoding them as increasingly
weighted paths, as shown in lines
<a href="#alg2.l9" title="9 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">9</span></a>–<a href="#alg2.l12" title="12 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">12</span></a> of the pseudo-code in
Algorithm <a href="#alg2" title="Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
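<div class="ltx_para">
<p class="ltx_p">A minimal sketch of this step, with names and the base weight chosen by us rather
than taken from the converter, could look as follows:</p>
<pre class="ltx_verbatim">
# Hypothetical sketch: turn the ordered REP list of the .aff file into
# replacement pairs whose weight grows with their position, so that pairs
# listed earlier count as more likely corrections.
def rep_paths(rep_pairs, base_weight=1000.0, step=1.0):
    """rep_pairs: (wrong, right) string pairs in REP order."""
    for rank, (wrong, right) in enumerate(rep_pairs):
        yield wrong, right, base_weight + rank * step

# Illustrative pairs only, not taken from any particular affix file:
for wrong, right, w in rep_paths([("shun", "tion"), ("ph", "f")]):
    print(f"{wrong}:{right}\t{w}")
</pre>
</div>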
<div id="S4.SS2.p4" class="ltx_para">
<p class="ltx_p">The TRY section such as the one on line <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> of
Figure <a href="#S3.F2" title="Figure 2 ‣ 3.1 Hunspell File Format ‣ 3 Material ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, defines characters to be tried as the
edit distance grows in descending order. For a more detailed formulation of a
weighted edit distance transducer, see e.g. <cite class="ltx_cite ltx_citemacro_cite">[<span class="ltx_ref ltx_missing_citation ltx_ref_self">conf/lrec/Pirinen2010</span>]</cite>). We
created an edit distance model with the sum of the positions of the characters
in the TRY string as the weight, which is defined on
lines <a href="#alg2.l14" title="14 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">14</span></a>–<a href="#alg2.l21" title="21 ‣ Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">21</span></a> of the pseudo-code in
Algorithm <a href="#alg2" title="Algorithm 2 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. The initial part of the converted example
is displayed on lines <a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>–<a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> of
Figure <a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.</p>
</div>
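<div class="ltx_para">
<p class="ltx_p">Under one reading of this weighting scheme, and with function names and an example TRY
string of our own choosing, the per-character costs could be derived as follows:</p>
<pre class="ltx_verbatim">
# Hypothetical sketch: each character's edit cost is its 1-based position
# in the TRY string; a substitution is weighted by the sum of the two
# positions involved, while an insertion or deletion uses one position.
def try_costs(try_string):
    """Map each TRY character to its 1-based position (its edit cost)."""
    return {c: i + 1 for i, c in enumerate(try_string)}

def substitution_cost(a, b, costs):
    """Weight for confusing character a with character b."""
    return costs[a] + costs[b]

costs = try_costs("esian")                 # illustrative TRY string only
print(substitution_cost("e", "a", costs))  # 1 + 4 = 5
</pre>
</div>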
<div id="S4.SS2.p5" class="ltx_para">
<p class="ltx_p">Finally to attribute different likelihood to different parts of the error
models we use different weight magnitudes on different types of errors, and
to allow only correctly written substrings, we restrict the result by the
root lexicon and morfotax lexicon, as given on
lines <a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>–<a href="#S4.F5" title="Figure 5 ‣ 4.2 Hunspell Error Models ‣ 4 Methods ‣ Building and Using Existing Hunspell Dictionaries and TeX Hyphenators as Finite-State Automata\footnotepubrightsThe official publication was in Computational Linguistics – Applications 2010 in Wisła, Poland. \urlhttps://fedcsis.org/2010/pg/395/305." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> of