<!DOCTYPE html><html>
<head>
<title>State-of-the-Art in Weighted Finite-State Spell-Checking</title>
<!--Generated on Fri Sep 29 14:26:40 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last Modification: September 29, 2017.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">State-of-the-Art in Weighted Finite-State Spell-Checking</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen and Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Department of Modern Languages
<br class="ltx_break"><span class="ltx_ERROR undefined">\url</span>[email protected] and <span class="ltx_ERROR undefined">\url</span>[email protected]
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last Modification: September 29, 2017</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">The following claims can be made about finite-state methods for spell-checking:
1) Finite-state language models provide support for morphologically complex
languages that word lists, affix stripping and similar approaches do not
provide; 2) Weighted finite-state models have expressive power equal to other,
state-of-the-art string algorithms used by contemporary spell-checkers; and 3)
Finite-state models are at least as fast as other string algorithms for lookup
and error correction. In this article, we use some contemporary
non-finite-state spell-checking methods as a baseline and perform tests in
light of the claims, to evaluate state-of-the-art finite-state spell-checking
methods. We verify that finite-state spell-checking systems outperform the
traditional approaches for English. We also show that the models for
morphologically complex languages can be made to perform on par with English
systems.</p>
<p class="ltx_p"><span class="ltx_text ltx_font_bold" style="font-size:90%;">Keywords:<span class="ltx_text ltx_font_medium"> spell-checking, weighted finite-state technology, error models</span></span></p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<span class="ltx_ERROR undefined">\footnotepubrights</span>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">This version is fully reformatted from original which
was made during serious time constraints due to my graduation.
Spell-checking and correction is a traditional and well-researched part of
computational linguistics. Finite-state methods for language models are
widely recognized as a good way to handle languages which are morphologically
more complex <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite>. In this article, we evaluate weighted, fully finite-state
spell-checking systems for morphologically complex languages. We use existing
finite-state models and algorithms and describe some necessary additions to
bridge the gaps and surpass state-of-the-art in non-finite-state
spell-checking. For the set of languages, we have chosen to study North Sámi
and Finnish from the complex, agglutinative group of languages, Greenlandic
from the complex poly-agglutinative group, and English to confirm that our
finite-state formulations of traditional spelling correction applications are
working as described in the literature.
</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">As contemporary spell-checkers are increasingly using statistical approaches
for the task, weighted finite-state models provide the equivalent expressive
power, even for the morphologically more complex languages, by encoding the
probabilities as weights in the automata. As the programmatic noisy channel
models <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib67" title="An improved error model for noisy channel spelling correction" class="ltx_ref">6</a>]</cite> can encode the error probabilities when making
the corrections, so can the weighted finite-state automata encode these
probabilities.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The task of spell-checking is split into two parts, error detection and error
correction. Error detection by language model lookup is referred to as
non-word or isolated error detection. The task of detecting isolated errors is
often considered trivial or solved in many research papers dealing with
spelling correction, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib26" title="Contextual spelling correction" class="ltx_ref">20</a>]</cite>. More complex error
detection systems may be used to detect words that are correctly spelled, but
are unsuitable in the syntactic or semantic context. This is referred to as
real-word error detection in context <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>]</cite>.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">The task of error-correction is to generate the most likely correct word-forms
given a misspelled word-form. This can also be split in two different tasks:
generating suggestions and ranking them. Generating corrections is often
referred to as error modeling. The main point of error modeling is to correct
spelling errors accurately by observing the causes of errors and making
predictive models of them <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib22" title="Correcting spelling errors by modelling their causes" class="ltx_ref">10</a>]</cite>. This effectively
splits the error models into numerous sub-categories, each applicable to
correcting specific types of spelling errors. The most used model accounts for
typos, i.e. the slip of a finger on a keyboard. This model is nearly language
agnostic, although it can be tuned to each local keyboard layout. The other set
of errors is more language and user-specific; it stems from the lack of
knowledge or language competence, e.g., in non-phonemic orthographies, such as
English, learners and unskilled writers commonly make mistakes such as
writing their instead of there, as they are pronounced alike; similarly
competence errors will give rise to common confusable words in other languages,
such as missing an accent, writing a digraph instead of its unigraph variant,
or confusing one morph with another.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">A common source of the probabilities for ranking suggestions related to
competence errors are the neighboring words and word-forms captured in a
language model. For morphologically complex languages, part-of-speech
information is needed <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib26" title="Contextual spelling correction" class="ltx_ref">20</a>, <a href="#bib.bib29" title="Improving finite-state spell-checker suggestions with part of speech n-grams" class="ltx_ref">24</a>]</cite>, which
can be compared with the studies on isolating
languages <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>, <a href="#bib.bib59" title="Real-word spelling correction with trigrams: a reconsideration of the Mays, Damerau, and Mercer model" class="ltx_ref">30</a>]</cite>. Context-based models
like these are, however, considered to be out of scope for spell-checking,
rather being part of grammar-checking.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">Advanced language model training schemes, such as the use of morphological
analyses as error detection evidence <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib68" title="Context based spelling correction" class="ltx_ref">15</a>]</cite>, require large
manually verified morphologically analyzed and disambiguated corpora, which do
not exist as open, freely usable resources, if at all. In addition, for
polysynthetic languages like Greenlandic, even a gigaword corpus is usually not
nearly as complete as an English corpus with a million word-forms.</p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p">As we compare existing finite-state technologies with contemporary
non-finite-state string algorithm solutions, we use
Hunspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_ERROR undefined">\url</span>http://hunspell.sf.net</span></span></span> as setting the current de facto
standard in open-source spell-checking and the baseline for the quality to
achieve. Taken together, this paper demonstrates for the first time that using
weighted finite-state technology, spell-checking for morphologically complex
languages can be made to perform on par with English systems and surpass the
current de facto standard.</p>
</div>
<div id="S1.p8" class="ltx_para">
<p class="ltx_p">This article is structured as follows: In Subsection 1.1, we briefly describe
the history of spell-checking up to the finite-state formulation of the
problem. In Subsection 1.2, we revisit the notations behind the statistics we
apply to our language and error models. In Section 2, we present existing
methods for creating finite-state language and error models for spell-checkers.
In Section 3, we present the actual data, the language models, the error
models and the corpora we have used, and in Section 4, we show how different
languages and error models affect the accuracy, precision, and speed of
finite-state spell-checking. In Section 5, we discuss the results, and finally,
in Section 6, we conclude our findings.</p>
</div>
<section id="S1.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.1 </span>A Brief History of Automatic Spell-Checking and Correction</h3>
<div id="S1.SS1.p1" class="ltx_para">
<p class="ltx_p">Automatic spelling correction by computer is in itself, an old invention, with
the ini- tial work done as early as in the 1960âs. Beginning with the invention
of the generic error model for typing mistakes, the Levenshtein-Damerau
distance <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib53" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">8</a>, <a href="#bib.bib55" title="ÐвоиÑнÑе ÐºÐ¾Ð´Ñ Ñ Ð¸ÑпÑавлением вÑпадений, вÑÑавок и замеÑений Ñимволов" class="ltx_ref">9</a>]</cite> and the first
applications of the noisy channel model <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="A mathematical theory of communications, i and ii" class="ltx_ref">29</a>]</cite> to
spell-checking <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib27" title="Decision making in Markov chains applied to the problem of pattern recognition" class="ltx_ref">26</a>]</cite>, the early solutions treated the
dictionaries as simple word lists, or later, word-lists with up to a few
affixes with simple stem mutations and finally some basic compounding
processes. The most recent and widespread implementation with a word-list, stem
mutations, affixes and some compounding is Hunspell, which is in common use
in the open-source world of spell-checking and correction and must be regarded
as the reference implementation. The word-list approach, even with some affix
stripping and stem mutations, has sometimes been found insufficient for
morphologically complex languages. E.g. a recent attempt to utilize Hunspell
for Finnish was unsuccessful <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib24" title="Hunspell-in kesäkoodi 2006: final report" class="ltx_ref">25</a>]</cite>. In part, the
popularity of the finite-state methods in computational linguistics seen in the
1980s was driven by a need for the morphologically more complex languages to
get language models and morphological analyzers with recurring derivation and
compounding processes <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib23" title="Morphological analysis and generation: a first step in natural language processing" class="ltx_ref">3</a>]</cite>. They also provide an
opportunity to use arbitrary finite-state automata as language models without
modifying the runtime code, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib89" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">23</a>]</cite>.</p>
</div>
<div id="S1.SS1.p2" class="ltx_para">
<p class="ltx_p">Given the finite-state representation of the dictionaries and the expressive
power of the finite-state systems, the concept of a finite-state based
implementation for spelling correction was an obvious development. The earliest
approaches presented an algorithmic way to implement the finite-state network
traversal with error-tolerance <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction" class="ltx_ref">19</a>]</cite> in a fast and
effective manner <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib50" title="Fast approximate string matching with finite automata" class="ltx_ref">12</a>, <a href="#bib.bib48" title="Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction" class="ltx_ref">27</a>]</cite>. Schulz
and Mihov <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib51" title="Fast string correction with levenshtein-automata" class="ltx_ref">28</a>]</cite> presented the Levenshtein-Damerau distance in a
finite-state form such that the finite-state spelling correction could be
performed using standard finite-state algebraic operations with any existing
finite-state library. Furthermore, e.g., Pirinen and
Lindén <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite> have shown that the weighted finite-state
methods can be used to gain the same expressive power as the existing
statistical spellchecking software algorithms.</p>
</div>
</section>
<section id="S1.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.2 </span>Notations and some Statistics for Language and Error Models</h3>
<div id="S1.SS2.p1" class="ltx_para">
<p class="ltx_p">In this article, where the formulas of finite-state algebra are concerned, we
assume the standard notations from Aho et al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Compilers: principles, techniques, and tools" class="ltx_ref">1</a>]</cite>: a
finite-state automaton <math id="S1.SS2.p1.m1" class="ltx_Math" alttext="\mathcal{M}" display="inline"><mi class="ltx_font_mathcaligraphic">ℳ</mi></math> is a system <math id="S1.SS2.p1.m2" class="ltx_Math" alttext="(Q,\Sigma,\delta,Q_{s},Q_{f},W)" display="inline"><mrow><mo stretchy="false">(</mo><mi>Q</mi><mo>,</mo><mi mathvariant="normal">Σ</mi><mo>,</mo><mi>δ</mi><mo>,</mo><msub><mi>Q</mi><mi>s</mi></msub><mo>,</mo><msub><mi>Q</mi><mi>f</mi></msub><mo>,</mo><mi>W</mi><mo stretchy="false">)</mo></mrow></math>, where <math id="S1.SS2.p1.m3" class="ltx_Math" alttext="Q" display="inline"><mi>Q</mi></math> is the set of states, <math id="S1.SS2.p1.m4" class="ltx_Math" alttext="\Sigma" display="inline"><mi mathvariant="normal">Σ</mi></math> the alphabet, <math id="S1.SS2.p1.m5" class="ltx_Math" alttext="\delta" display="inline"><mi>δ</mi></math> the
transition mapping of form <math id="S1.SS2.p1.m6" class="ltx_Math" alttext="Q\times\Sigma\rightarrow Q" display="inline"><mrow><mrow><mi>Q</mi><mo>×</mo><mi mathvariant="normal">Σ</mi></mrow><mo>→</mo><mi>Q</mi></mrow></math>, and <math id="S1.SS2.p1.m7" class="ltx_Math" alttext="Q_{s}" display="inline"><msub><mi>Q</mi><mi>s</mi></msub></math> and <math id="S1.SS2.p1.m8" class="ltx_Math" alttext="Q_{f}" display="inline"><msub><mi>Q</mi><mi>f</mi></msub></math>
the initial and final states of the automaton, respectively. For weighted
automata, we extend the definition in the same way as
Mohri <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib19" title="Weighted automata algorithms" class="ltx_ref">16</a>]</cite> such that <math id="S1.SS2.p1.m9" class="ltx_Math" alttext="\delta" display="inline"><mi>δ</mi></math> is extended to the transition
mapping <math id="S1.SS2.p1.m10" class="ltx_Math" alttext="Q×\Sigma ×W\rightarrow Q" display="inline"><mrow><mrow><mi>Q</mi><mo></mo><mi>Ã</mi><mo></mo><mi mathvariant="normal"></mi><mo></mo><mi mathvariant="normal">Σ</mi><mo></mo><mi>Ã</mi><mo></mo><mi mathvariant="normal"></mi><mo></mo><mi>W</mi></mrow><mo>→</mo><mi>Q</mi></mrow></math>, where <math id="S1.SS2.p1.m11" class="ltx_Math" alttext="W" display="inline"><mi>W</mi></math> is the weight, and the system
additionally includes a final weight mapping <math id="S1.SS2.p1.m12" class="ltx_Math" alttext="\rho:Q_{f}\rightarrow W" display="inline"><mrow><mi>ρ</mi><mo>:</mo><mrow><msub><mi>Q</mi><mi>f</mi></msub><mo>→</mo><mi>W</mi></mrow></mrow></math>. The
structure we use for weights is systematically the tropical semiring
<math id="S1.SS2.p1.m13" class="ltx_Math" alttext="(\mathbb{R}_{+}\cup\infty,\min,+\infty,0)" display="inline"><mrow><mo stretchy="false">(</mo><mrow><msub><mi>ℝ</mi><mo>+</mo></msub><mo>∪</mo><mi mathvariant="normal">∞</mi></mrow><mo>,</mo><mi>min</mi><mo>,</mo><mrow><mo>+</mo><mi mathvariant="normal">∞</mi></mrow><mo>,</mo><mn>0</mn><mo stretchy="false">)</mo></mrow></math>, i.e. weights are positive real
numbers that are collected by addition. The tropical semiring models penalty
weighting.</p>
</div>
<div id="S1.SS2.p2" class="ltx_para">
<p class="ltx_p">For the finite-state spell-checking, we use the following common notations:
<math id="S1.SS2.p2.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{D}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub></math> is
a single tape weighted finite-state automaton used for detecting the spelling errors, <math id="S1.SS2.p2.m2" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></math>
is a single tape weighted finite-state automaton used as a language model when
suggesting correct words, where the weight is used for ranking the suggestions. On
many occasions, we consider the possibility that
<math id="S1.SS2.p2.m3" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{D}}=\mathcal{M}_{\mathrm{S}}" display="inline"><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub><mo>=</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></mrow></math>. The error models are
weighted two-tape automata commonly marked as <math id="S1.SS2.p2.m4" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{E}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub></math>
. A word automaton is generally marked as <math id="S1.SS2.p2.m5" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub></math>
. A misspelling is detected by composing the word automaton
with the detection automaton:</p>
</div>
<div id="S1.SS2.p3" class="ltx_para">
<table id="S1.E1" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S1.E1.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}\circ\mathcal{M}_{\mathrm{D}}," display="block"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub></mrow><mo>,</mo></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S1.SS2.p4" class="ltx_para">
<p class="ltx_p">which results in an empty automaton on a misspelling and a non-empty automaton
on a correct spelling. The weight of the result may represent the likelihood or
the correctness of the word-form. Corrections for misspelled words can be
obtained by composing a misspelled word, an error model and a model of
correct words:</p>
</div>
<div id="S1.SS2.p5" class="ltx_para">
<table id="S1.E2" class="ltx_equation ltx_eqn_table">
<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S1.E2.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{word}}\circ\mathcal{M}_{\mathrm{E}}\circ\mathcal{M}_{%
\mathrm{S}}," display="block"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>word</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub><mo>∘</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></mrow><mo>,</mo></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr>
</table>
<p class="ltx_p">which results in a two-tape automaton consisting of the misspelled word-form
mapped to the spelling corrections described by the error model
<math id="S1.SS2.p5.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{E}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">E</mi></msub></math>
and approved by the suggestion language model <math id="S1.SS2.p5.m2" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}" display="inline"><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub></math>
. Both models may be weighted and the weight is
collected by standard operations as defined by the effective semiring.</p>
</div>
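<div class="ltx_para">
<p class="ltx_p">To make Formulas 1 and 2 concrete, the following is a minimal Python sketch in which plain sets and dictionaries stand in for the weighted automata; all names and weights are illustrative and this is not the HFST implementation used in the experiments. Detection is a membership test against the detection model, and correction applies an error relation and then filters and ranks the candidates with the suggestion model, combining weights by addition as in the tropical semiring.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Illustrative stand-ins for the automata; a real system composes weighted FSTs.
M_D = {"cat", "cast", "chat", "coat"}                        # detection lexicon
M_S = {"cat": 1.2, "cast": 2.3, "chat": 3.1, "coat": 2.9}    # suggestion lexicon, -log P weights

def detect(word, lexicon):
    """Formula 1: the composition with the detection model is non-empty
    exactly when the word-form is in the language."""
    return word in lexicon

def error_model(word):
    """A toy stand-in for the error model: single-character substitutions, weight 1."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for i in range(len(word)):
        for c in alphabet:
            yield word[:i] + c + word[i + 1:], 1.0

def correct(word, lexicon_weights):
    """Formula 2: apply the error model, filter with the suggestion model,
    and combine weights by addition (the tropical multiplication)."""
    suggestions = {}
    for candidate, error_weight in error_model(word):
        if candidate in lexicon_weights:
            total = error_weight + lexicon_weights[candidate]
            suggestions[candidate] = min(suggestions.get(candidate, float("inf")), total)
    return sorted(suggestions.items(), key=lambda item: item[1])

print(detect("cst", M_D))    # False, so "cst" is flagged as a misspelling
print(correct("cst", M_S))   # [('cat', 2.2)]
</pre>
</div>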
<div id="S1.SS2.p6" class="ltx_para">
<p class="ltx_p">Where probabilities are used, the basic formula to estimate probabilities from
discrete frequencies of events (word-forms, mistyping events, etc.) is as
follows: <math id="S1.SS2.p6.m1" class="ltx_Math" alttext="P(x)=\frac{c(x)}{\mathrm{corpussize}}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mfrac><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mi>corpussize</mi></mfrac></mrow></math> , <math id="S1.SS2.p6.m2" class="ltx_Math" alttext="x" display="inline"><mi>x</mi></math> where is the event,
<math id="S1.SS2.p6.m3" class="ltx_Math" alttext="c(x)" display="inline"><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></math> is the count or frequency of the event, and <math id="S1.SS2.p6.m4" class="ltx_Math" alttext="\mathrm{corpussize}" display="inline"><mi>corpussize</mi></math> is
the sum of all event counts in the training corpus. The encoding of
probability as tropical weights in a finite-state automaton is done by setting
the end weight of the path, <math id="S1.SS2.p6.m5" class="ltx_Math" alttext="Q_{\pi_{x}}=-\log P(x)" display="inline"><mrow><msub><mi>Q</mi><msub><mi>π</mi><mi>x</mi></msub></msub><mo>=</mo><mrow><mo>-</mo><mrow><mrow><mi>log</mi><mo></mo><mi>P</mi></mrow><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math>, though in practice the
<math id="S1.SS2.p6.m6" class="ltx_Math" alttext="-\log P(x)" display="inline"><mrow><mo>-</mo><mrow><mrow><mi>log</mi><mo></mo><mi>P</mi></mrow><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></math> weight may be distributed along the path depending on the
specific implementation. As events not appearing in corpora should have a
larger probability than zero, we use additive smoothing <math id="S1.SS2.p6.m7" class="ltx_Math" alttext="P(\hat{x})=\frac{c(x)+\alpha}{\mathrm{corpussize}+\mathrm{dictionarysize}%
\times\alpha}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mover accent="true"><mi>x</mi><mo stretchy="false">^</mo></mover><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mfrac><mrow><mrow><mi>c</mi><mo></mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo>+</mo><mi>α</mi></mrow><mrow><mi>corpussize</mi><mo>+</mo><mrow><mi>dictionarysize</mi><mo>×</mo><mi>α</mi></mrow></mrow></mfrac></mrow></math>, so
for an unknown event, the probability will be counted as if it had <math id="S1.SS2.p6.m8" class="ltx_Math" alttext="\alpha" display="inline"><mi>α</mi></math>
appearances. Another approach would be to set <math id="S1.SS2.p6.m9" class="ltx_Math" alttext="P(\hat{x})<\frac{1}{\mathrm{corpussize}}" display="inline"><mrow><mrow><mi>P</mi><mo></mo><mrow><mo stretchy="false">(</mo><mover accent="true"><mi>x</mi><mo stretchy="false">^</mo></mover><mo stretchy="false">)</mo></mrow></mrow><mo><</mo><mfrac><mn>1</mn><mi>corpussize</mi></mfrac></mrow></math>, which makes the probability distribution leak
but may work under some conditions <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib20" title="Large language models in machine translation" class="ltx_ref">5</a>]</cite>.
</p>
</div>
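<div class="ltx_para">
<p class="ltx_p">The following is a small Python sketch of the weighting just described: corpus counts are turned into tropical weights by additive smoothing and a negative logarithm. The counts, the dictionary size and the smoothing constant below are made up for the example.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
import math

def tropical_weights(counts, dictionary_size, alpha=1.0):
    """Turn event counts into -log P weights with additive smoothing:
    P(x) = (c(x) + alpha) / (corpussize + dictionarysize * alpha)."""
    corpus_size = sum(counts.values())
    denominator = corpus_size + dictionary_size * alpha
    weights = {event: -math.log((count + alpha) / denominator)
               for event, count in counts.items()}
    unseen = -math.log(alpha / denominator)   # weight of any event not in the corpus
    return weights, unseen

# Toy counts; in this article the events are word-forms drawn from Wikipedia.
counts = {"talo": 50, "talossa": 10, "taloissa": 2}
weights, unseen = tropical_weights(counts, dictionary_size=100000)
print(weights["talo"], unseen)   # rarer events receive larger (worse) weights
</pre>
</div>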
</section>
<section id="S1.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">1.3 </span>Morphologically Complex Resource-Poor Languages</h3>
<div id="S1.SS3.p1" class="ltx_para">
<p class="ltx_p">One of the main reasons for going fully finite-state instead of relying on
word-form lists and affix stripping is the claim that morphologically complex
languages simply cannot be handled with sufficient coverage and quality using
traditional methods. While Hunspell has virtually 100 % domination of the
open-source spell-checking field, authors of language models for
morphologically complex languages such as Turkish (cf.
Zemberek<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>http://code.google.com/p/zemberek</span></span></span> ) and Finnish (cf.
Voikko<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>http://voikko.sf.net</span></span></span> ) have still opted to write separate
software, even though it makes the usage of their spell-checkers troublesome
and the coverage of supported applications much smaller.</p>
</div>
<div id="S1.SS3.p2" class="ltx_para">
<p class="ltx_p">Another aspect of the problems with morphologically complex languages is that
the amount of training data in terms of running word-forms is greater, as the
amount of unique word-forms in an average text is much higher compared with
morphologically less complex languages. In addition, the majority of
morphologically complex languages tend to have fewer resources to train the
models. For training spelling checkers, the data needed is merely correctly
written unannotated text, but even that is scarce when it comes to languages
like Greenlandic or North Sámi. Even a very simple probabilistic weighting
using a small corpus of unverified texts will improve the quality of
suggestions <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite>, so having a weighted language model is
more effective.
</p>
</div>
</section>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Weighting Finite-State Language and Error Models</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The task of spell-checking is divided into locating spelling errors, and
suggesting the corrections for the spelling errors. In finite-state
spell-checking, the former task requires a language model that can tell
whether or not a given string is correct. The error correction requires two
components: a language model and an error model.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The error model is a two-tape finite-state automaton that can encode the
relation between misspellings and the correctly typed words. This relation can
also be weighted with the probabilities of making a specific typo or error, or
arbitrary hand-made penalties as with many of the traditional non-finite-state
approaches, e.g. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="Hunspell manual" class="ltx_ref">17</a>]</cite>.</p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">The rest of this section is organized as follows. In Subsection 2.1, we
describe how finite-state language models are made. In Subsection 2.2, we
describe how finite-state error models are made. In Subsection 2.3, we describe
some methods for combining the weights in language models with the weights in
error models.</p>
</div>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Compiling Finite-State Language Models</h3>
<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">The baseline for any language model as realized by numerous spell-checking
systems and the literature is a word-list (or a word-form list). One of the
most popular examples of this approach is given by Norvig <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="How to write a spelling corrector" class="ltx_ref">18</a>]</cite>,
describing a toy spelling corrector being made during an intercontinental
flight. The finite-state formulation of this idea is equally simple; given a
list of word-forms, we compile each string as a path in an
automaton <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>. In fact, even the classical optimized data
structures used for efficiently encoding word lists, like tries and acyclic
deterministic finite-state automata, are usable as finite-state automata for
our purposes, without modifications. We have:
<math id="S2.SS1.p1.m1" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{S}}=\mathcal{M}_{\mathrm{D}}=\bigcup_{\mathrm{wf}\in%
\mathrm{corpus}}\mathrm{wf}" display="inline"><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo>=</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">D</mi></msub><mo>=</mo><mrow><msub><mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">⋃</mo><mrow><mi>wf</mi><mo>∈</mo><mi>corpus</mi></mrow></msub><mi>wf</mi></mrow></mrow></math>, where <math id="S2.SS1.p1.m2" class="ltx_Math" alttext="\mathrm{wf}" display="inline"><mi>wf</mi></math> is a word-form and <math id="S2.SS1.p1.m3" class="ltx_Math" alttext="\mathrm{corpus}" display="inline"><mi>corpus</mi></math> is a
set of word-forms in a corpus. These are already valid language models for
Formula 1 and 2, but in practice any finite-state
lexicon <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite> will suffice.</p>
</div>
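<div class="ltx_para">
<p class="ltx_p">As a sketch of this word-list baseline, the following Python builds a trie-shaped acceptor from a weighted word-list: each string becomes a path, and its weight (for example a -log P estimate as above) becomes the end weight of the path. This is an illustrative toy rather than the finite-state lexicons used in the experiments.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def compile_wordlist(word_weights):
    """Build a trie acceptor: each state is a dict keyed by the next character."""
    root = {}
    finals = {}                      # id(state) -> final weight
    for word, weight in word_weights.items():
        state = root
        for ch in word:
            state = state.setdefault(ch, {})
        finals[id(state)] = weight   # end weight of the path, e.g. -log P(word)
    return root, finals

def lookup(word, root, finals):
    """Return the end weight if the word is accepted, otherwise None."""
    state = root
    for ch in word:
        if ch not in state:
            return None
        state = state[ch]
    return finals.get(id(state))

root, finals = compile_wordlist({"cat": 1.2, "cast": 2.3, "chat": 3.1})
print(lookup("cast", root, finals))   # 2.3
print(lookup("cst", root, finals))    # None, i.e. not in the language model
</pre>
</div>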
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>Compiling Finite-State Versions of Error Models</h3>
<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The baseline error model for spell-checking is the Damerau-Levenshtein distance
measure. As the finite-state formulations of error models are the most recent
development in finite-state spell-checking, the earliest reference to a
finite-state error model in an actual spell-checking system is by Schulz and
Mihov <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib51" title="Fast string correction with levenshtein-automata" class="ltx_ref">28</a>]</cite>. It also contains a very thorough description of
building finite-state models for different edit distances. As error models,
they can be applied in Formula 2. The edit distance type error models used in
this article are all simple edit distance models.</p>
</div>
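<div class="ltx_para">
<p class="ltx_p">As an illustration of the relation such an error model encodes, the following Python sketch enumerates the Damerau-Levenshtein distance-1 neighbourhood of a word over a given alphabet, much in the spirit of Norvig's corrector cited in Section 2.1; a finite-state error model encodes the same relation as a two-tape automaton instead of enumerating the strings. The alphabet is illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at Damerau-Levenshtein distance 1 from the input word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if right[1:]]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

print(len(edits1("cast")))        # a few hundred candidate strings
print("cats" in edits1("cast"))   # True: a transposition of the last two letters
</pre>
</div>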
<div id="S2.SS2.p2" class="ltx_para">
<p class="ltx_p">One of the most popular modifications to speed up the edit distance algorithm
is to disallow modifications of the first character of the
word <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib30" title="Spelling error pattern analysis of punjabi typed text" class="ltx_ref">4</a>]</cite>. This modification pro- vides a measurable
speed-up at a low cost to recall. The finite-state implementation of it is
simple; we concatenate one unmodifiable character in front of the error model.</p>
</div>
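<div class="ltx_para">
<p class="ltx_p">Reusing the edits1 sketch above, the restriction can be illustrated by holding the first character fixed and generating the edits over the rest of the word only; the finite-state version, as described, simply concatenates one unmodifiable character in front of the error model.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def edits1_non_first(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Edit distance 1, with the first character of the word left untouched."""
    if not word:
        return set()
    return {word[0] + candidate for candidate in edits1(word[1:], alphabet)}

# "kast" is a distance-1 candidate for "cast" only if the first character may change.
print("kast" in edits1("cast"), "kast" in edits1_non_first("cast"))   # True False
</pre>
</div>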
<div id="S2.SS2.p3" class="ltx_para">
<p class="ltx_p">Hunspellâs implementation of the correction algorithm uses configurable
alphabets for the error types in the edit distance model. The errors that do
not come from regular typing mistakes are nearly always covered by specific
string transformations, i.e. confusion sets. Encoding a simple string
transformation as a finite-state automaton can be done as follows: for any
given transformation <math id="S2.SS2.p3.m1" class="ltx_Math" alttext="S:U" display="inline"><mrow><mi>S</mi><mo>:</mo><mi>U</mi></mrow></math> , we have a path <math id="S2.SS2.p3.m2" class="ltx_Math" alttext="\pi_{s:u}=S_{1}:U_{1}S_{2}:U_{2}\ldots S_{n}:U_{n}" display="inline"><mrow><mrow><msub><mi>π</mi><mrow><mi>s</mi><mo>:</mo><mi>u</mi></mrow></msub><mo>=</mo><msub><mi>S</mi><mn>1</mn></msub></mrow><mo>:</mo><mrow><msub><mi>U</mi><mn>1</mn></msub><mo></mo><msub><mi>S</mi><mn>2</mn></msub></mrow><mo>:</mo><mrow><msub><mi>U</mi><mn>2</mn></msub><mo></mo><mi mathvariant="normal">…</mi><mo></mo><msub><mi>S</mi><mi>n</mi></msub></mrow><mo>:</mo><msub><mi>U</mi><mi>n</mi></msub></mrow></math> , where <math id="S2.SS2.p3.m3" class="ltx_Math" alttext="n=\max(|S|,|U|)" display="inline"><mrow><mi>n</mi><mo>=</mo><mrow><mi>max</mi><mo></mo><mrow><mo stretchy="false">(</mo><mrow><mo stretchy="false">|</mo><mi>S</mi><mo stretchy="false">|</mo></mrow><mo>,</mo><mrow><mo stretchy="false">|</mo><mi>U</mi><mo stretchy="false">|</mo></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></math> and
the missing characters of the shorter word substituted with epsilons. The path
can be extended with arbitrary contexts <math id="S2.SS2.p3.m4" class="ltx_Math" alttext="L,R" display="inline"><mrow><mi>L</mi><mo>,</mo><mi>R</mi></mrow></math>, by concatenating those contexts
on the left and right, respectively. To apply these confusion sets on a word
using a language model, we use the following formula: <math id="S2.SS2.p3.m5" class="ltx_Math" alttext="\mathcal{M}_{\mathrm{confusion}}=\bigcup_{\mathrm{S:U}\in\mathrm{CP}}S:U" display="inline"><mrow><mrow><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi>confusion</mi></msub><mo>=</mo><mrow><msub><mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">⋃</mo><mrow><mi mathvariant="normal">S</mi><mo>:</mo><mrow><mi mathvariant="normal">U</mi><mo>∈</mo><mi>CP</mi></mrow></mrow></msub><mi>S</mi></mrow></mrow><mo>:</mo><mi>U</mi></mrow></math>, where <math id="S2.SS2.p3.m6" class="ltx_Math" alttext="\mathrm{CP}" display="inline"><mi>CP</mi></math> is a set
of confused string pairs. The error model can be applied in a standard manner
in Formula 2. For a more detailed description of a finite-state
implementation of Hunspell error models, see Pirinen and
Linden <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib6" title="Creating and weighting Hunspell dictionaries as finite-state automata" class="ltx_ref">22</a>]</cite>.</p>
</div>
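<div class="ltx_para">
<p class="ltx_p">The following hedged Python sketch applies such confusion pairs as plain string transformations: each pair S:U is tried at every position of the word, and the resulting candidates would then be filtered and ranked against the suggestion model as in Formula 2. The pairs shown are illustrative, not the actual pairs of the Hunspell English dictionary.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def apply_confusion_pairs(word, pairs):
    """Generate candidates by replacing one occurrence of S with U for each S:U pair."""
    candidates = set()
    for source, target in pairs:
        start = 0
        while True:
            i = word.find(source, start)
            if i == -1:
                break
            candidates.add(word[:i] + target + word[i + len(source):])
            start = i + 1
    return candidates

# Illustrative English-style confusion pairs.
pairs = [("ei", "ie"), ("ie", "ei"), ("able", "ible")]
print(apply_confusion_pairs("recieve", pairs))   # {'receive'}
</pre>
</div>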
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Combining Weights from Different Sources and Different Models</h3>
<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">As both our language and error models are weighted automata, the weights need
to be combined when applying the error and the language models to a misspelled
string. Since the application performs what is basically a finite-state
composition as defined in Formula 2, the default outcome is a weight semiring
multiplication of the values; i.e., a real number addition in the tropical
semiring. This is a reasonable way to combine the models, which can be used
as a good baseline. In many cases, however, it is preferable to treat the
probabilities or the weights drawn from different sources as unequal in
strength. For example, in many of the existing spelling-checker systems, it is
preferable to first suggest all the corrections that assume only one spelling
error before the ones with two errors, regardless of the likelihood of the word
forms in the language model. To accomplish this, we scale the weights in the
error model to ensure that any weight in the error model is greater than or
equal to any weight in the language model: <math id="S2.SS3.p1.m1" class="ltx_Math" alttext="\hat{w_{e}}=w_{e}+maxw(\mathcal{M}_{\mathrm{S}})" display="inline"><mrow><mover accent="true"><msub><mi>w</mi><mi>e</mi></msub><mo stretchy="false">^</mo></mover><mo>=</mo><mrow><msub><mi>w</mi><mi>e</mi></msub><mo>+</mo><mrow><mi>m</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math>, where <math id="S2.SS3.p1.m2" class="ltx_Math" alttext="\hat{w_{e}}" display="inline"><mover accent="true"><msub><mi>w</mi><mi>e</mi></msub><mo stretchy="false">^</mo></mover></math> is the scaled
weight of error model weights, <math id="S2.SS3.p1.m3" class="ltx_Math" alttext="w_{e}" display="inline"><msub><mi>w</mi><mi>e</mi></msub></math> the original error model weight and
<math id="S2.SS3.p1.m4" class="ltx_Math" alttext="maxw(\mathcal{M}_{\mathrm{S}})" display="inline"><mrow><mi>m</mi><mo></mo><mi>a</mi><mo></mo><mi>x</mi><mo></mo><mi>w</mi><mo></mo><mrow><mo stretchy="false">(</mo><msub><mi class="ltx_font_mathcaligraphic">ℳ</mi><mi mathvariant="normal">S</mi></msub><mo stretchy="false">)</mo></mrow></mrow></math> the
maximum weight found in the language model used for error corrections.</p>
</div>
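<div class="ltx_para">
<p class="ltx_p">A small sketch of this scaling, assuming the largest weight of the suggestion model can simply be read off: every error-model weight is shifted above it, so suggestions are ordered first by the number of errors and only then by the likelihood of the word-form. All weights below are illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def scale_error_weights(error_weights, language_model_weights):
    """Shift every error weight above the largest language-model weight."""
    max_lm_weight = max(language_model_weights.values())
    return {error: weight + max_lm_weight
            for error, weight in error_weights.items()}

lm_weights = {"cat": 1.2, "cast": 2.3, "chat": 3.1}          # -log P word weights
error_weights = {"substitution": 1.0, "transposition": 1.5}  # illustrative error costs
print(scale_error_weights(error_weights, lm_weights))
# {'substitution': 4.1, 'transposition': 4.6}
</pre>
</div>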
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>The Language and Error Model Data Used For Evaluation</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">To evaluate the weighting schemes and the language and the error models, we
have selected two of the morphologically more complex languages with little to
virtually no corpus resources available: North Sámi and Greenlandic.
Furthermore, as a morphologically complex language with moderate resources,
we have used Finnish. As a comparative baseline for a morphologically simple
language with huge corpus resources, we use English. English is also used
here to reproduce the results of the existing models to verify functionality
of our selected approach.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">This section briefly introduces the data and methods to compile the models; for
the exact implementation, for any reproduction of results or for attempts to
implement the same approaches for another language, the reader is advised to
utilize the scripts, the programs and the makefiles available at our source
code repository.
<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><span class="ltx_ERROR undefined">\url</span>https://github.com/flammie/purplemonkeydishwasher/tree/master/2014-cicling</span></span></span></p>
</div>
<figure id="S3.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>The extent of Wikipedia data per language</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Data:</th>
<td class="ltx_td ltx_align_right">Train tokens</td>
<td class="ltx_td ltx_align_right">Train types</td>
<td class="ltx_td ltx_align_right">Test tokens</td>
<td class="ltx_td ltx_align_right">Test types</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Language</th>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English</th>
<td class="ltx_td ltx_align_right">276,730,786</td>
<td class="ltx_td ltx_align_right">3,216,142</td>
<td class="ltx_td ltx_align_right">111,882,292</td>
<td class="ltx_td ltx_align_right">1,945,878</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish</th>
<td class="ltx_td ltx_align_right">9,779,826</td>
<td class="ltx_td ltx_align_right">1,065,631</td>
<td class="ltx_td ltx_align_right">4,116,896</td>
<td class="ltx_td ltx_align_right">538,407</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi</th>
<td class="ltx_td ltx_align_right">183,643</td>
<td class="ltx_td ltx_align_right">38,893</td>
<td class="ltx_td ltx_align_right">33,722</td>
<td class="ltx_td ltx_align_right">8,239</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic</th>
<td class="ltx_td ltx_align_right">136,241</td>
<td class="ltx_td ltx_align_right">28,268</td>
<td class="ltx_td ltx_align_right">7,233</td>
<td class="ltx_td ltx_align_right">1,973</td>
</tr>
</tbody>
</table>
</figure>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">In Table 1, we show the statistics of the data we have drawn from Wikipedia for
training and testing purposes. In the case of English and Finnish, the data is
selected from a subset of Wikipedia test tokens. With North Sámi and
Greenlandic, we had no other choice but to use all Wikipedia test tokens.</p>
</div>
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">For the English language model, we use the data from Norvig <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="How to write a spelling corrector" class="ltx_ref">18</a>]</cite> and
Pirinen and Hardwick <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>, which is a basic language model
based on a frequency weighted wordlist extracted from freely available Internet
corpora such as Wikipedia and project Gutenberg. The language models for North
Sámi, Finnish and Greenlandic are drawn from the free/libre open-source
repository of finite-state language models managed by the University of
Tromsø.<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><span class="ltx_ERROR undefined">\url</span>http://giellatekno.uit.no</span></span></span> The language models are all
based on the morphological analyzers built in the finite-state
morphology <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib66" title="Finite state morphology" class="ltx_ref">2</a>]</cite> fashion. The repository also includes the
basic versions of finite-state spell-checking under the same framework that we
use in this article for testing. To compile our dictionaries, we have used the
makefiles available in the repository. The exact methods for this are also
detailed in the source code of the repository.</p>
</div>
<div id="S3.p5" class="ltx_para">
<p class="ltx_p">The error models for English are combined from a basic edit distance with
English alphabet a-z and the confusion set from Hunspell's English dictionary
containing 96 confusion pairs<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>as found in en-US.aff from Ubuntu Linux
LTS 12.04</span></span></span>. The error models for North Sámi, Finnish and Greenlandic are the
edit distances of English with the addition of åäöšžč and ŧ for North Sámi
and åäöšž for Finnish. For North Sámi we also use the actual
Hunspell parts from the divvun speller<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>http://divvun.no</span></span></span>; for
Greenlandic, we have no confusion sets or character likelihoods for
Hunspell-style data, so only the ordering of the Hunspell correction mechanisms
is retained. For English, the Hunspell phonemic folding scheme was not used.
This makes the English results easier to compare with those of other languages,
which do not even have any phonemic error sources.</p>
</div>
<div id="S3.p6" class="ltx_para">
<p class="ltx_p">The keyboard adjacency weighting and optimization for the English error models
is based on a basic qwerty keyboard. The keyboard adjacency values are taken
from the CLDR Version 22<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup>http://cldr.unicode.org</span></span></span>, modified to the
standard 101-104 key PC keyboard layout.
</div>
<div id="S3.p7" class="ltx_para">
<p class="ltx_p">The training corpora for each of the languages are based on Wikipedia. To
estimate the weights in the models, we have used the correct word-forms of the
first 90 % of Wikipedia for the language model and the non-words for the error
model. We used the remaining 10 % for extracting non-words for testing. The
error corpus was extracted with a script very similar to the one described by
Max and Wisniewski <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib57" title="Mining naturally-occurring corrections and paraphrases from wikipedia's revision history" class="ltx_ref">14</a>]</cite>. The script that performs fetching and
cleaning can be found in our repository.
<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup><span class="ltx_ERROR undefined">\url</span>https://github.com/flammie/purplemonkeydishwasher/tree/master/2014-cicling</span></span></span>
We have selected the
spelling corrections found in Wikipedia by only taking those, where the
incorrect version does not belong to the language model (i.e. is a non-word
error), and the corrected word-form does.</p>
</div>
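<div class="ltx_para">
<p class="ltx_p">A hedged sketch of this final selection step: from the (incorrect, corrected) pairs mined from Wikipedia revisions, only those are kept where the incorrect form is rejected and the corrected form is accepted by the language model. The pair list and lexicon below are placeholders; the actual mining script is the one in the repository referenced above.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def select_nonword_errors(correction_pairs, lexicon):
    """Keep pairs whose incorrect form is a non-word and whose correction is a word."""
    selected = []
    for incorrect, corrected in correction_pairs:
        if incorrect not in lexicon and corrected in lexicon:
            selected.append((incorrect, corrected))
    return selected

pairs = [("teh", "the"), ("form", "from"), ("qwe", "qwer")]
lexicon = {"the", "from", "form"}
print(select_nonword_errors(pairs, lexicon))   # [('teh', 'the')]
</pre>
</div>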
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>The Models Used For Evaluation</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The finite-state language and error models described in this article have a
number of adjustable settings. For weighting our language models, we have
picked a subset of corpus strings for estimating word form probabilities. As
both North Sámi and Greenlandic Wikipedia were quite limited in size, we used
all strings except those that appear only once (hapax legomena) whereas for
Finnish, we set the frequency threshold to 5, and for English, we set it to 20.
For English, we also used all word-forms in the material from Norvig's corpora,
as we believe that they are already hand-selected to some extent.</p>
</div>
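<div class="ltx_para">
<p class="ltx_p">The frequency cut-offs can be sketched as a simple filter over corpus counts; the thresholds come from the text (2, i.e. dropping hapax legomena, for North Sámi and Greenlandic; 5 for Finnish; 20 for English), while the token list below is only illustrative.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
from collections import Counter

def training_wordlist(tokens, threshold):
    """Keep word-forms that occur at least `threshold` times in the corpus."""
    counts = Counter(tokens)
    return {word: count for word, count in counts.items() if count >= threshold}

tokens = ["talo", "talo", "talossa", "taloissa", "talo", "talossa"]
print(training_wordlist(tokens, threshold=2))   # {'talo': 3, 'talossa': 2}
</pre>
</div>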
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">As error models, we have selected the following combinations of basic models:
the basic edit distance consisting of homogeneously weighted errors of the
Levenshtein-Damerau type, the same model limited to the non-first positions of
the word, and the Hunspell version of the edit distance errors (i.e. swaps only
apply to adjacent keys, and deletions and additions are only tried for a
selected alphabet).</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>The Speed and Quality of Different Finite-State Models and
Weighting Schemes</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">To evaluate the systems, we have used a modified version of the HFST
spell-checking tool <span class="ltx_text ltx_font_typewriter">hfst-ospell-survey</span> 0.2.4, otherwise using the
default options, but for the speed measurements we have used the
<span class="ltx_text ltx_font_typewriter">--profile</span> argument. The evaluation of speed and memory usage has been
performed by averaging over five test runs on a dedicated test server: an Intel
Xeon E5450 at 3 GHz, with 64 GB of RAM memory. The rest of the section is
organized as follows: in Subsection 4.1, we show naïve coverage baselines. In
Subsection 4.2, we measure the quality of spell-checking with real-world
spelling corrections found in Wikipedia logs. Finally in Subsections 4.3 and
4.4, we provide the speed and memory efficiency figures for these experiments,
respectively.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>Coverage Evaluation</h3>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">To show the starting point for spell-checking, we measure the coverage of the
language models. That is, we measure how much of the test data can be
recognized using only the language models, and how many of the word-forms are
beyond the reach of the models. The measurements in Table 2 are measured over
word-forms in running text that can be measured in reasonable time, i.e. no
more than the first 1,000,000 word-forms of each test corpus. As can be seen in
Table 2, the task is very different for languages like English compared with
morphologically more complex languages.</p>
</div>
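<div class="ltx_para">
<p class="ltx_p">The coverage figures of Table 2 correspond to the share of running word-forms accepted by the language model; the following is a hedged sketch of that measurement, with a plain set standing in for the automaton lookup and an illustrative token list.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
def coverage(tokens, lexicon, limit=1000000):
    """Share (in %) of the first `limit` running tokens recognized by the language model."""
    tokens = tokens[:limit]
    recognized = sum(1 for token in tokens if token in lexicon)
    return 100.0 * recognized / len(tokens)

tokens = ["the", "cat", "sat", "on", "teh", "mat"]
lexicon = {"the", "cat", "sat", "on", "mat"}
print(round(coverage(tokens, lexicon), 1))   # 83.3
</pre>
</div>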
<figure id="S4.T2" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 2: </span>The word-form coverage of the language models on test data (in %).</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English aspell</td>
<td class="ltx_td ltx_align_right">22.7</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English full automaton</td>
<td class="ltx_td ltx_align_right">80.1</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish full automaton</td>
<td class="ltx_td ltx_align_right">64.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">34.4</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi full automaton</td>
<td class="ltx_td ltx_align_right">48.5</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic full automaton</td>
<td class="ltx_td ltx_align_right">25.3</td>
</tr>
</tbody>
</table>
</figure>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Quality Evaluation</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">To measure the quality of spell-checking, we have run the list of misspelled
words through the language and error models of our spelling correctors,
extracting all the suggestions. The quality, in Table 3, is measured by the
proportion of correct sugges- tions appearing at a given position 1-5 and
finally the proportion appearing in any remaining positions.</p>
</div>
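<div class="ltx_para">
<p class="ltx_p">The sketch below shows one way such a precision-at-rank table can be computed
from pairs of a suggestion list and the intended correction; the function and
variable names are illustrative and not part of the evaluation tool.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# A sketch of the quality metric in Table 3: for each misspelling, find the
# position of the known correction in the suggestion list, then report the
# share of corrections found at each rank 1-5 and at any later rank ("rest").
from collections import Counter

def precision_at_ranks(cases, max_rank=5):
    """`cases` is an iterable of (suggestions, correct_form) pairs."""
    counts = Counter()
    total = 0
    for suggestions, correct in cases:
        total = total + 1
        if correct in suggestions:
            rank = suggestions.index(correct) + 1
            counts[min(rank, max_rank + 1)] += 1  # ranks past max_rank pool into "rest"
    if total == 0:
        return {}
    # Keys 1..max_rank are the ranked columns; key max_rank + 1 is the "rest" column.
    return {rank: 100.0 * counts[rank] / total for rank in range(1, max_rank + 2)}
</pre>
</div>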
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">On the rows indicated with error, in Table 3, we present the baselines for
using language and error models allowing one edit in any position of the word.
The rows with ânon-first errorâ show the same error models with the restriction
that the first letter of the word may not be changed.</p>
</div>
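<div class="ltx_para">
<p class="ltx_p">In the actual system this restriction is implemented inside the weighted
error-model transducer; the sketch below only illustrates, on plain strings,
which edit-distance-one candidates remain when the first letter must stay
intact. The function name and the explicit candidate generation are
illustrative assumptions, not the finite-state implementation.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Generate single-edit variants (deletion, substitution, insertion, adjacent
# swap) at every position except the first character of the word.
def non_first_edits(word, alphabet):
    """Return edit-distance-1 variants of `word` that leave the first letter untouched."""
    candidates = set()
    for i in range(1, len(word)):
        candidates.add(word[:i] + word[i + 1:])          # deletion at i
        for c in alphabet:
            candidates.add(word[:i] + c + word[i + 1:])  # substitution at i
            candidates.add(word[:i] + c + word[i:])      # insertion before i
    for c in alphabet:
        candidates.add(word + c)                         # insertion at the end
    for i in range(1, len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        candidates.add(swapped)                          # swap of positions i and i+1
    candidates.discard(word)
    return candidates
</pre>
</div>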
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p">Finally, in Table 3, we also compare the results of our spell-checkers with the
actual systems in everyday use, i.e. the Hunspell and aspell in practice.
When looking at this comparison, we can see that for English data, we actually
provide an overall im- provement already by allowing only one edit per word.
This is mainly due to the weighted language model which works very nicely for
languages like English. The data on North Sámi on the other hand shows no
meaningful improvement neither with the change from Hunspell to our weighted
language models nor with the restriction of the error models.</p>
</div>
<figure id="S4.T3" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 3: </span>The effect of different language and error models on correction quality (precision in % at a given suggestion position)</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Rank:</td>
<td class="ltx_td ltx_align_right">1st</td>
<td class="ltx_td ltx_align_right">2nd</td>
<td class="ltx_td ltx_align_right">3rd</td>
<td class="ltx_td ltx_align_right">4th</td>
<td class="ltx_td ltx_align_right">5th</td>
<td class="ltx_td ltx_align_right">rest</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Language and error models</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English aspell</td>
<td class="ltx_td ltx_align_right">55.7</td>
<td class="ltx_td ltx_align_right">5.7</td>
<td class="ltx_td ltx_align_right">8.0</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English Hunspell</td>
<td class="ltx_td ltx_align_right">59.3</td>
<td class="ltx_td ltx_align_right">5.8</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">2.3</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 error</td>
<td class="ltx_td ltx_align_right">66.7</td>
<td class="ltx_td ltx_align_right">7.0</td>
<td class="ltx_td ltx_align_right">5.2</td>
<td class="ltx_td ltx_align_right">1.8</td>
<td class="ltx_td ltx_align_right">1.8</td>
<td class="ltx_td ltx_align_right">1.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">66.7</td>
<td class="ltx_td ltx_align_right">8.8</td>
<td class="ltx_td ltx_align_right">7.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">1.8</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish aspell</td>
<td class="ltx_td ltx_align_right">21.1</td>
<td class="ltx_td ltx_align_right">5.8</td>
<td class="ltx_td ltx_align_right">3.8</td>
<td class="ltx_td ltx_align_right">1.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 error</td>
<td class="ltx_td ltx_align_right">54.8</td>
<td class="ltx_td ltx_align_right">19.0</td>
<td class="ltx_td ltx_align_right">7.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">54.8</td>
<td class="ltx_td ltx_align_right">21.4</td>
<td class="ltx_td ltx_align_right">4.8</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">9.4</td>
<td class="ltx_td ltx_align_right">3.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">3.1</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 error</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">6.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">3.5</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">6.9</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">0.0</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 error</td>
<td class="ltx_td ltx_align_right">13.3</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">6.7</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">8.9</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">13.3</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">6.7</td>
<td class="ltx_td ltx_align_right">2.2</td>
<td class="ltx_td ltx_align_right">0.0</td>
<td class="ltx_td ltx_align_right">8.9</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS2.p4" class="ltx_para">
<p class="ltx_p">Some of the trade-offs are efficiency versus quality. In Table 3, we measure
among other things the quality effect of limiting the search space in the error
model. It is important to contrast these results with the speed or memory gains
shown in the corresponding Tables 4 and 5. As we can see, the optimizations
that limit the search space will generally not have a big effect on the
results. Only the results that get cut out of the search space are moved. A few
of the results disappear or move to worse positions.</p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Speed Evaluation</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">For practical spell-checking systems, there are multiple levels of speed
requirements, so we measure the effects of our different models on speed to see
if the optimal mod- els can actually be used in interactive systems, off-line
corrections, or just batch pro- cessing. In Table 4, we show the speed of
different model combinations for spell- checkingâfor a more thorough evaluation
of the speed of the finite-state language and the error models we refer to
Pirinen et al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Effects of weighted finite-state language and error models on speed and efficiency of finite-state spell-checking" class="ltx_ref">21</a>]</cite>. We perform three different test sets:
startup time tests to see how much time is spent on startup alone; a running
cor- pus processing test to see how well the system fares when processing
running text;and a non-word correcting test, to see how fast the system is when
producing correc- tions for words. For each test, the results are averaged over
at least 5 runs.</p>
</div>
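<div class="ltx_para">
<p class="ltx_p">The correction-rate figures in Table 4 are plain words-per-second counts; a
minimal sketch of such a throughput measurement is given below, with the
hypothetical <span class="ltx_text ltx_font_typewriter">correct</span> callback standing in for one check-and-suggest call of the
spell-checker.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# A sketch of the throughput figures in Table 4: words per second over a list
# of input word-forms (all words of running text, or only the non-words).
import time

def words_per_second(words, correct):
    """`words` is a list of word-forms; `correct` runs one check-and-suggest call."""
    start = time.perf_counter()
    for w in words:
        correct(w)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed else float("inf")
</pre>
</div>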
<figure id="S4.T4" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 4: </span>The effect of different language and error models on speed of spelling correction (startup time in seconds, correction rate in words per second)</figcaption>
<table class="ltx_tabular ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Input: <math id="S4.T4.m1" class="ltx_Math" alttext="1^{\mathrm{st}}" display="inline"><msup><mn>1</mn><mi>st</mi></msup></math> word</td>
<td class="ltx_td ltx_align_right">all words</td>
<td class="ltx_td ltx_align_right">non-words</td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Language and error models</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English Hunspell</td>
<td class="ltx_td ltx_align_right">0.5</td>
<td class="ltx_td ltx_align_right">174</td>
<td class="ltx_td ltx_align_right">40</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 error</td>
<td class="ltx_td ltx_align_right">0.06</td>
<td class="ltx_td ltx_align_right">5,721</td>
<td class="ltx_td ltx_align_right">6,559</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">English w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">0.20</td>
<td class="ltx_td ltx_align_right">16,474</td>
<td class="ltx_td ltx_align_right">17,911</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish aspell</td>
<td class="ltx_td ltx_align_right">¡0.1</td>
<td class="ltx_td ltx_align_right">781</td>
<td class="ltx_td ltx_align_right">686</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 error</td>
<td class="ltx_td ltx_align_right">1.0</td>
<td class="ltx_td ltx_align_right">166</td>
<td class="ltx_td ltx_align_right">357</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Finnish w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">1.0</td>
<td class="ltx_td ltx_align_right">303</td>
<td class="ltx_td ltx_align_right">1,886</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi Hunspell</td>
<td class="ltx_td ltx_align_right">4.51</td>
<td class="ltx_td ltx_align_right">3</td>
<td class="ltx_td ltx_align_right">2</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 error</td>
<td class="ltx_td ltx_align_right">0.28</td>
<td class="ltx_td ltx_align_right">2,304</td>
<td class="ltx_td ltx_align_right">2,839</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">North Sámi w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">0.27</td>
<td class="ltx_td ltx_align_right">5,025</td>
<td class="ltx_td ltx_align_right">7,898</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 error</td>
<td class="ltx_td ltx_align_right">1.27</td>
<td class="ltx_td ltx_align_right">49</td>
<td class="ltx_td ltx_align_right">142</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Greenlandic w/ 1 non-first error</td>
<td class="ltx_td ltx_align_right">1.25</td>
<td class="ltx_td ltx_align_right">85</td>
<td class="ltx_td ltx_align_right">416</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p">In Table 4, we already notice an important aspect of finite-state spelling
correction: the speed is very predictable, and in the same ballpark regardless
of input data. Furthermore, we can readily see that the speed of a
finite-state system in general outperforms Hunspell with both of the language
models we compare. Furthermore, we show the speed gains achieved by cutting the
search space to disallow errors in the first character of a word. This is the
speed-equivalent of Table 3 of the previous section, which clearly shows the
trade-off between speed and quality.</p>
</div>
</section>
<section id="S4.SS4" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.4 </span>Memory Usage Evaluation</h3>
<div id="S4.SS4.p1" class="ltx_para">
<p class="ltx_p">Depending on the use case of the spell-checker, memory usage may also be a
limiting factor. To give an idea of the memory-speed trade-offs that different
finite-state models entail, in Table 5, we provide the memory usage values
when performing the evaluation tasks above. The measurements are performed
with the Valgrind utility and represent the peak memory usage. It needs to be
emphasized that this method, like all of the methods of measuring memory usage
of a program, has its flaws, and the figures can at best be considered rough
estimates.</p>
</div>
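<div class="ltx_para">
<p class="ltx_p">As a rough illustration, the sketch below shows how a peak-memory figure of
this kind could be obtained with Valgrind's massif tool; the choice of massif
and the wrapper code are assumptions made here for illustration, since the text
above only refers to the Valgrind utility in general.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Run a command under valgrind massif and report the largest snapshot
# (heap + heap overhead + stacks) recorded in the massif output file.
import re
import subprocess

def peak_memory_bytes(cmd, out_file="massif.out"):
    subprocess.run(["valgrind", "--tool=massif", "--stacks=yes",
                    "--massif-out-file=" + out_file] + cmd, check=True)
    peaks = []
    snapshot = {}
    with open(out_file) as fh:
        for line in fh:
            m = re.match(r"(mem_heap_B|mem_heap_extra_B|mem_stacks_B)=(\d+)", line)
            if m:
                snapshot[m.group(1)] = int(m.group(2))
                if len(snapshot) == 3:
                    peaks.append(sum(snapshot.values()))
                    snapshot = {}
    return max(peaks) if peaks else 0
</pre>
</div>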
<figure id="S4.T5" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 5: </span>The peak memory usage of processes checking and correcting word-forms with various language and error model combinations.</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Measurement:</th>
<td class="ltx_td ltx_align_right">Peak memory usage</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Language and error models</th>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English Hunspell</th>
<td class="ltx_td ltx_align_right">7.5 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English w/ 1 error</th>
<td class="ltx_td ltx_align_right">7.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">English w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">7.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish aspell</th>
<td class="ltx_td ltx_align_right">186 kB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish w/ 1 error</th>
<td class="ltx_td ltx_align_right">79.3 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Finnish w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">79.3 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi Hunspell</th>
<td class="ltx_td ltx_align_right">151.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi w/ 1 error</th>
<td class="ltx_td ltx_align_right">31.4 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">North Sámi w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">31.4 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic w/ 1 error</th>
<td class="ltx_td ltx_align_right">300.0 MB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Greenlandic w/ 1 non-first error</th>
<td class="ltx_td ltx_align_right">300.7 MB</td>
</tr>
</tbody>
</table>
</figure><span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">10</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">10</sup>Drobac & al. <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib95" title="Heuristic hyperminimization of finite-state lexicons" class="ltx_ref">11</a>]</cite> report on how to hyper-minimize finite-state lexicons keeping the Greenlandic lexicon at less than 20 MB at runtime which gives a considerable speed-up of loading
with only a small reduction of runtime speed.</span></span></span>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Discussion</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">The improvement of quality by using simple probabilistic features for
spell-checking is well-studied, e.g. by Church and
Gale <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib90" title="Probability scoring for spelling correction" class="ltx_ref">7</a>]</cite>. In our work, we describe introducing
probabilistic features into a finite-state spell-checking system, giving an
increase in the quality of the spell-checking suggestions similar to that seen in
previous approaches. The methods are usable for a morphologically varied set