2011-cicling-contextspell/Pirinen-2011-cicling-contextspell.html

<!DOCTYPE html><html>
<head>
<title>Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/.</title>
<!--Generated on Fri Sep 29 12:53:30 2017 by LaTeXML (version 0.8.2) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on Last Modification: September 29, 2017.-->

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Improving Finite-State Spell-Checker Suggestions with
Part of Speech N-Grams<span class="ltx_ERROR undefined">\footnotepubrights</span>Originally published in CICLING 2012
in Delhi <span class="ltx_ERROR undefined">\url</span>http://cicling.org/2012/.</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tommi A Pirinen
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Miikka Silfverberg
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Krister Lindén
<br class="ltx_break">University of Helsinki
<br class="ltx_break">Department of Modern Languages
<br class="ltx_break">University of Helsinki
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname"> 00014
<br class="ltx_break">tommi.pirinen
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">miikka.silfverberg
</span></span>
<span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">krister.linden@helsinki.fi
</span></span>
</div>
<div class="ltx_date ltx_role_creation">Last Modification: September 29, 2017</div>

<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>

<p class="ltx_p">In this paper we demonstrate a finite-state implementation of
context-aware spell checking utilizing an N-gram based part of speech (POS)
tagger to rerank the suggestions from a simple edit-distance based
spell-checker. We demonstrate the benefits of context-aware spell-checking for
English and Finnish and introduce modifications that are necessary to make
traditional N-gram models work for morphologically more complex languages, such
as Finnish.</p>

</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>

<div id="S1.p1" class="ltx_para">
<p class="ltx_p">Spell-checking by computer is perhaps one of the oldest and most researched
applications in the field of language technology starting from the mid 20th
century <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>. One of the crucial parts of spell-checking—both
from an interactive user-interface point of view and for unsupervised correction
of errors—is the production of spelling suggestions. In this article we test
various finite-state methods for using context and shallow morphological
analysis to improve the suggestions generated by traditional edit distance
measures or unigram frequencies such as simple weighted
finite-state dictionaries trained from word form frequencies as
in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">The spell-checking task can be split into two parts, i.e. <em class="ltx_emph">detection</em> and
actual <em class="ltx_emph">correction</em> of the spelling errors. The spelling errors can be
detected in text as word forms that are unlikely to belong to the
natural language in question, such as writing ‘cta’ instead of ‘cat’. This form
of spelling errors is commonly called <em class="ltx_emph">non-word (spelling) errors</em>.
Another form of spelling errors is word forms that do not belong to the given
context under certain syntactic or semantic requirements, such as writing
‘their’ instead of ‘there’. This form is correspondingly called <em class="ltx_emph">real-word
(spelling) errors</em>. The non-word type of spelling errors can easily be detected
using a dictionary, whereas the detection of the latter type of errors typically
requires syntactic analysis or probabilistic methods <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib8" title="Ordering the suggestions of a spellchecker without using context*" class="ltx_ref">8</a>]</cite>. For
the purpose of this article we do not distinguish between them, as the same
correction methods can be applied to both.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The correction of spelling errors usually means generating a list of word forms
belonging to the language for a user to chose from. The mechanism for
generating correction suggestions for the erroneous word-forms is an
<em class="ltx_emph">error-model</em>. The purpose of an error-model is to act as a filter to
revert the mistakes the user typing the erroneous word-form has made. The
simplest and most traditional model for making such corrections is the
Levenshtein-Damerau edit distance algorithm, attributed initially to
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib27" title="Binary codes capable of correcting deletions, insertions, and reversals" class="ltx_ref">6</a>]</cite> and especially in the context of spell-checking to
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>. The Levenshtein-Damerau edit distance assumes that
spelling errors are one of insertion, deletion or changing of a single
character to another, or swapping two adjacent characters, which models well
the spelling errors caused by an accidental slip of finger on a keyboard. It
was originally discovered that for most languages and spelling errors, this
simple method already covers 80 % of all spelling errors <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib25" title="A technique for computer detection and correction of spelling errors" class="ltx_ref">3</a>]</cite>.
This model is also language-independent, ignoring the differences in character
repertoires of a given language. Various other error models have also been
developed, ranging from confusion sets to phonemic folding <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Techniques for automatically correcting words in text" class="ltx_ref">5</a>]</cite>.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">In this paper, we evaluate the use of context for further fine-tuning of the
correction suggestions. The context is still not commonly used in
spell-checkers. According to <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib18" title="Techniques for automatically correcting words in text" class="ltx_ref">5</a>]</cite> it was lacking in the majority
of spell-checkers and while the situation may have improved slightly for some
commercial office suite products, the main spell-checkers for open source
environments are still primarily context-ignorant, such as
hunspell<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_ERROR undefined">\url</span>http://hunspell.sf.net</span></span></span> which is widely used in the
open source world. For English, the surface word-form trigrams model has been
demonstrated to be reasonably efficient both for non-word cases
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="Probability scoring for spelling correction" class="ltx_ref">2</a>]</cite> and for for real-word cases<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Context based spelling correction" class="ltx_ref">7</a>]</cite>. As an
additional way to improve the set of suggestions, we propose to use
morphosyntactically relevant analyses in context. In this article, we evaluate
a model with a statistical morphological tagger <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="Combining statistical models for pos tagging using finite-state calculus" class="ltx_ref">16</a>]</cite>. The
resulting system is in effect similar as described in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib1" title="Contextual spelling correction." class="ltx_ref">11</a>]</cite> for
Spanish<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>We are grateful for the anonymous reviewer on bringing this
previous work on same methods and similar systems to our knowledge.</span></span></span></p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">The system described is fully built on freely available tools and data,
available for download and use from
<span class="ltx_ERROR undefined">\url</span>http://hfst.svn.sourceforge.net/viewvc/hfst/trunk/cicling-2011-contextspell/.
The only exception to this is the training data for Finnish, since there is no
available morphological training data for Finnish as of yet, the download does
not contain the source material for training but only the trigram models
compiled into binary format automata.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">Furthermore, we test the context-based spelling methods using both English and
Finnish language materials to ensure the applicability of the method for
morphologically different languages. The reason for doing this is two-fold;
firstly the fact that English has rather low morphological productivity may
make it behave statistically differently from other languages. On the other
hand, English has the largest amount of freely available text corpora. For
other languages, the availability of free corpora, especially annotated
material, is often seen as a problem.</p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p">The article is laid out as follows: In Section <a href="#S2" title="2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>, we outline
the implementation of a finite-state context-aware spell-checker and describe
the statistical methods used. In Section <a href="#S3" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>, we introduce the
corpora and dictionaries used for spell-checking and training material as well
as the corpora used for obtaining the spelling errors with context. In
Section <a href="#S4" title="4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, we show how the created spelling correctors
improve the results and explain the errors left. In
Section <a href="#S5" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>, we compare our work with other current systems
and enumerate possible improvements for both.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Methods</h2>

<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The spelling correction in this article is performed in several phases:
assuming misspelled word <em class="ltx_emph">cta</em> for <em class="ltx_emph">cat</em>, we first apply the error
model to the already known incorrect string <em class="ltx_emph">cta</em> to produce candidates
for probable mistypings. For this purpose we use the Damerau-Levenshtein
edit-distance algorithm in finite-state form. When applied to <em class="ltx_emph">cta</em> we get
all strings with one or two typing mistakes, i.e. <em class="ltx_emph">ata</em>, <em class="ltx_emph">bta</em>,
…, <em class="ltx_emph">acta</em>, <em class="ltx_emph">bcta</em>, …, <em class="ltx_emph">ta</em>, <em class="ltx_emph">ca</em>, …,
<em class="ltx_emph">tca</em>, and the correct <em class="ltx_emph">cat</em>. This set of strings is simultaneously
matched against the language model, which will produce a set of corrections,
such as <em class="ltx_emph">cat</em>, <em class="ltx_emph">act</em> or <em class="ltx_emph">car</em>. Since both the error-model and
the language model contain information on likelihoods of errors and words
respectively, the resulting list will be sorted according to a combination of
the edit distance measure and the probability of the word in a reference
corpus. The rankings based on edit distance alone and the edit distance
combined with word probabilities form our two baseline models.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The context-based models we introduce here use the suggestion list gained from
a contextless spelling-checker and the context of the words as input to rerank
suggestions based on N-gram models. Each of the suggestions is tried against
the N-gram models, and the ones with higher likelihoods will be lifted. For
example when correcting the misspelling of ‘an’ as ‘anx’ in the sentence “this
is anx example sentence”, as shown in the Table <a href="#S2.T1" title="Table 1 ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>, we have the
surface trigrams {this, is, _}, {is, _, example}, {_, example,
sentence}, and corresponding analysis trigrams {DET, VVBZ, _}, {VVBZ, _,
NN}, {_, NN, NN}. The suggestions for anx at edit distance one include
‘ax’, ‘an’ (one deletion), ‘ant’, ‘and’, ‘any’ (one change) and so on. To rank
the possible suggestions, we substitute <math id="S2.p2.m1" class="ltx_Math" alttext="s_{3}" display="inline"><msub><mi>s</mi><mn>3</mn></msub></math> with the suggestions, and
calculate the likelihood of their analyses.
</p>
</div>
<figure id="S2.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>Example trigram combinations</figcaption>
<table class="ltx_tabular ltx_centering ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">this<math id="S2.T1.m1" class="ltx_Math" alttext="{}_{s_{1}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>1</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_left ltx_border_t">is<math id="S2.T1.m2" class="ltx_Math" alttext="{}_{s_{2}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>2</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_t">_<math id="S2.T1.m3" class="ltx_Math" alttext="{}_{s_{3}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>3</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_t">example<math id="S2.T1.m4" class="ltx_Math" alttext="{}_{s_{4}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>4</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_t">sentence<math id="S2.T1.m5" class="ltx_Math" alttext="{}_{s_{5}}" display="inline"><msub><mi></mi><msub><mi>s</mi><mn>5</mn></msub></msub></math>
</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b">DET<math id="S2.T1.m6" class="ltx_Math" alttext="{}_{a_{1}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>1</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_left ltx_border_b">VVBZ<math id="S2.T1.m7" class="ltx_Math" alttext="{}_{a_{2}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>2</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_center ltx_border_b">_ <math id="S2.T1.m8" class="ltx_Math" alttext="{}_{a_{3}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>3</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_b">NN<math id="S2.T1.m9" class="ltx_Math" alttext="{}_{a_{4}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>4</mn></msub></msub></math>
</td>
<td class="ltx_td ltx_align_right ltx_border_b">NN<math id="S2.T1.m10" class="ltx_Math" alttext="{}_{a_{5}}" display="inline"><msub><mi></mi><msub><mi>a</mi><mn>5</mn></msub></msub></math>
</td>
</tr>
</tbody>
</table>
</figure>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Weighted Finite-State Interpretation of the Method</h3>

<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">In this article we use a finite-state formulation of
spell-checking. We assume the standard notation for finite-state
algebra and define the language model as a weighted finite-state
automaton assigning a weight to each correctly spelled word-form of a
language, and an error model automaton mapping a misspelled string to
a set of corrected strings and their weights. The probabilistic
interpretation of the components is such that the weighted fsa as a
language model assigns weight <math id="S2.SS1.p1.m1" class="ltx_Math" alttext="w(s)" display="inline"><mrow><mi>w</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">)</mo></mrow></mrow></math> to word <math id="S2.SS1.p1.m2" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math> corresponding to the
probability <math id="S2.SS1.p1.m3" class="ltx_Math" alttext="P(s)" display="inline"><mrow><mi>P</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">)</mo></mrow></mrow></math> for the word to be a correct word in the
language. The error model assigns weight <math id="S2.SS1.p1.m4" class="ltx_Math" alttext="w(s:r)" display="inline"><mrow><mi>w</mi><mrow><mo stretchy="false">(</mo><mi>s</mi><mo>:</mo><mi>r</mi><mo stretchy="false">)</mo></mrow></mrow></math> to string pair <math id="S2.SS1.p1.m5" class="ltx_Math" alttext="s,r" display="inline"><mrow><mi>s</mi><mo>,</mo><mi>r</mi></mrow></math> corresponding to the probability <math id="S2.SS1.p1.m6" class="ltx_Math" alttext="P(s|r)" display="inline"><mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><mi>s</mi><mo stretchy="false">|</mo><mi>r</mi><mo stretchy="false">)</mo></mrow></mrow></math> of a user writing word r
when intending to write the word <math id="S2.SS1.p1.m7" class="ltx_Math" alttext="s" display="inline"><mi>s</mi></math>, and the context model assigns
weight <math id="S2.SS1.p1.m8" class="ltx_Math" alttext="w(s_{3}a_{3})" display="inline"><mrow><mi>w</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mrow><msub><mi>s</mi><mn>3</mn></msub><mo>⁢</mo><msub><mi>a</mi><mn>3</mn></msub></mrow><mo stretchy="false">)</mo></mrow></mrow></math> to word <math id="S2.SS1.p1.m9" class="ltx_Math" alttext="s_{3}" display="inline"><msub><mi>s</mi><mn>3</mn></msub></math> with associated POS tagging <math id="S2.SS1.p1.m10" class="ltx_Math" alttext="a_{3}" display="inline"><msub><mi>a</mi><mn>3</mn></msub></math>
corresponding to the standard HMM estimate <math id="S2.SS1.p1.m11" class="ltx_Math" alttext="P(a_{3}s_{3})" display="inline"><mrow><mi>P</mi><mo>⁢</mo><mrow><mo stretchy="false">(</mo><mrow><msub><mi>a</mi><mn>3</mn></msub><mo>⁢</mo><msub><mi>s</mi><mn>3</mn></msub></mrow><mo stretchy="false">)</mo></mrow></mrow></math> of the analysis being in
a 3-gram context given by equation (<a href="#S2.E1" title="(1) ‣ 2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>).</p>
<table id="S2.E1" class="ltx_equation ltx_eqn_table">

<tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math id="S2.E1.m1" class="ltx_Math" alttext="P(a_{3}s_{3})=\prod_{i=3}^{5}P(s_{i}|a_{i})P(a_{i}|a_{i-2},\ a_{i-1})" display="block"><mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>3</mn></msub><msub><mi>s</mi><mn>3</mn></msub><mo stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo largeop="true" movablelimits="false" symmetric="true">∏</mo><mrow><mi>i</mi><mo>=</mo><mn>3</mn></mrow><mn>5</mn></munderover><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>s</mi><mi>i</mi></msub><mo stretchy="false">|</mo><msub><mi>a</mi><mi>i</mi></msub><mo stretchy="false">)</mo></mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mi>i</mi></msub><mo stretchy="false">|</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>-</mo><mn>2</mn></mrow></msub><mo rspace="7.5pt">,</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>-</mo><mn>1</mn></mrow></msub><mo stretchy="false">)</mo></mrow></mrow></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td rowspan="1" class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr>
</table>
</div>
<div id="S2.SS1.p2" class="ltx_para">
<p class="ltx_p">In a weighted finite-state system, the probabilistic data needs to be converted
to the algebra supported by the finite-state weight structure. In this case we
use the tropical semi-ring by transforming the frequencies into penalty weights
with the formula <math id="S2.SS1.p2.m1" class="ltx_Math" alttext="-\log\frac{f}{CS}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo>⁡</mo><mfrac><mi>f</mi><mrow><mi>C</mi><mo>⁢</mo><mi>S</mi></mrow></mfrac></mrow></mrow></math>, where <math id="S2.SS1.p2.m2" class="ltx_Math" alttext="f" display="inline"><mi>f</mi></math> is the frequency and <math id="S2.SS1.p2.m3" class="ltx_Math" alttext="CS" display="inline"><mrow><mi>C</mi><mo>⁢</mo><mi>S</mi></mrow></math> the
corpus size in number of tokens. If the language model allows for words that
are not in the dictionary, a maximal weight is assigned to the unseen word
forms that may be in the language model but not in the training corpus, i.e.
any unseen word has a penalty weight of <math id="S2.SS1.p2.m4" class="ltx_Math" alttext="-\log\frac{1}{CS}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo>⁡</mo><mfrac><mn>1</mn><mrow><mi>C</mi><mo>⁢</mo><mi>S</mi></mrow></mfrac></mrow></mrow></math>.</p>
</div>
<div id="S2.SS1.p3" class="ltx_para">
<p class="ltx_p">The spelling corrections suggested by these unigram lexicon-based
spell-checkers are initially generated by composing an edit-distance
automaton <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib19" title="Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction" class="ltx_ref">15</a>]</cite> with an error weight corresponding to the
probability of the error estimated in a corpus, i.e. <math id="S2.SS1.p3.m1" class="ltx_Math" alttext="-\log\frac{f_{F}}{CS+1}" display="inline"><mrow><mo>-</mo><mrow><mi>log</mi><mo>⁡</mo><mfrac><msub><mi>f</mi><mi>F</mi></msub><mrow><mrow><mi>C</mi><mo>⁢</mo><mi>S</mi></mrow><mo>+</mo><mn>1</mn></mrow></mfrac></mrow></mrow></math>,
where <math id="S2.SS1.p3.m2" class="ltx_Math" alttext="f_{F}" display="inline"><msub><mi>f</mi><mi>F</mi></msub></math> is the frequency of the misspelling in a corpus. This weight is
attached to the edit distance type error. In practice, this typically still
means that the corrections are initially ordered primarily by the edit distance
of the correction, and secondarily by the unigram frequency of the word-form in
the reference corpus. This order is implicitly encoded in the weighted paths
of the resulting automaton; to list the corrections we use the n-best paths
algorithm <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib60" title="An efficient algorithm for the n-best-strings problem" class="ltx_ref">9</a>]</cite>. This ordering is also used as our second baseline.</p>
</div>
<div id="S2.SS1.p4" class="ltx_para">
<p class="ltx_p">For a context-based reordering of the corrections, we use the POS tagging
probabilities for the given suggestions. The implementation of the analysis
N-gram probability estimation is similar to the one described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib69" title="Combining statistical models for pos tagging using finite-state calculus" class="ltx_ref">16</a>]</cite> with the following adaptations for the spelling
correction. For the suggestion which gives the highest ranking, the most likely
analysis is selected. The N-gram probability is estimated separately for each
spelling suggestion and then combined with the baseline probability given by
the unigram probability and the edit distance weight. The ideal scaling for the
weights of unigram probabilities, i.e. edit distance probabilities with
respect to N-gram probabilities, can be acquired by e.g.g tuning the scaling
parameter on an automatically generated spelling error corpus.</p>
</div>
<div id="S2.SS1.p5" class="ltx_para">
<p class="ltx_p">The resulting finite-state system consists of three sets of automata, i.e. the
dictionary for spell-checking, the error-model as described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>, and the new N-gram model automata. The automata sizes
are given in Table <a href="#S2.T2" title="Table 2 ‣ 2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> for reference. The sizes also give an
estimate of the memory usage of the spell-checking system, although the actual
memory-usage during correction will rise depending on the actual extent of the
search space during the correction phase.</p>
</div>
<figure id="S2.T2" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 2: </span>Automata sizes.</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Automaton</th>
<td class="ltx_td ltx_align_right">States</td>
<td class="ltx_td ltx_align_right">Transitions</td>
<td class="ltx_td ltx_align_right">Bytes</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">English</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Dictionary</th>
<td class="ltx_td ltx_align_right ltx_border_t">25,330</td>
<td class="ltx_td ltx_align_right ltx_border_t">42,448</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.2 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Error model</th>
<td class="ltx_td ltx_align_right">1,303</td>
<td class="ltx_td ltx_align_right">492,232</td>
<td class="ltx_td ltx_align_right">5.9 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram lexicon</th>
<td class="ltx_td ltx_align_right">363,053</td>
<td class="ltx_td ltx_align_right">1,253,315</td>
<td class="ltx_td ltx_align_right">42 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram sequences</th>
<td class="ltx_td ltx_align_right">46,517</td>
<td class="ltx_td ltx_align_right">200,168</td>
<td class="ltx_td ltx_align_right">4.2 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">Finnish</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Dictionary</th>
<td class="ltx_td ltx_align_right ltx_border_t">179,035</td>
<td class="ltx_td ltx_align_right ltx_border_t">395,032</td>
<td class="ltx_td ltx_align_right ltx_border_t">16 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Error model</th>
<td class="ltx_td ltx_align_right">1,863</td>
<td class="ltx_td ltx_align_right">983,227</td>
<td class="ltx_td ltx_align_right">12 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-gram lexicon</th>
<td class="ltx_td ltx_align_right">70,665</td>
<td class="ltx_td ltx_align_right">263,298</td>
<td class="ltx_td ltx_align_right">8.0 MiB</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b">N-gram sequences</th>
<td class="ltx_td ltx_align_right ltx_border_b">3,325</td>
<td class="ltx_td ltx_align_right ltx_border_b">22,418</td>
<td class="ltx_td ltx_align_right ltx_border_b">430 KiB</td>
</tr>
</tbody>
</table>
</figure>
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>English-Specific Finite-State Weighting Methods</h3>

<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The language model for English was created as described in
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite><span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>The finite-state formulation of this is informally
described in
<span class="ltx_ERROR undefined">\url</span>http://blogs.helsinki.fi/tapirine/2011/07/21/how-to-write-an-hfst-spelling-corrector/</span></span></span>.
It consists of the word-forms and their probabilities in the corpora. The edit
distance is composed of the standard English alphabet with an estimated error
likelihood of 1 in 1000 words. Similarly for the English N-gram material, the
initial analyses found in the WSJ corpus were used in the finite-state tagger
as such. The scaling factor between the dictionary probability model and the
edit distance model was acquired by estimating the optimal multipliers using
the automatic misspellings and corrections of a Project Gutenberg
Ebook<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/cache/epub/11/pg11.txt</span></span></span>
<em class="ltx_emph">Alice’s Adventures in Wonderland</em>. In here the estimation simply means
trying out factors until results are stable and picking the best one.</p>
</div>
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Finnish-Specific Finite-State Weighting Methods</h3>

<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">The Finnish language model was based on a readily-available morphological
weighted analyser of Finnish language <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib9" title="Modularisation of finnish finite-state language description—towards wide collaboration in open source development of a morphological analyser" class="ltx_ref">14</a>]</cite>. We further
modified the automaton to penalize suggestions with newly created compounds and
derivations by adding a weight greater than the maximum to such suggestions,
i.e. <math id="S2.SS3.p1.m1" class="ltx_Math" alttext="-A\log\frac{1}{CS+1}" display="inline"><mrow><mo>-</mo><mrow><mi>A</mi><mo>⁢</mo><mrow><mi>log</mi><mo>⁡</mo><mfrac><mn>1</mn><mrow><mrow><mi>C</mi><mo>⁢</mo><mi>S</mi></mrow><mo>+</mo><mn>1</mn></mrow></mfrac></mrow></mrow></mrow></math>, where <math id="S2.SS3.p1.m2" class="ltx_Math" alttext="A" display="inline"><mi>A</mi></math> is the scaling factor acquired from the
training material. This has nearly the same effect as using a separate
dictionary for suggestions that excludes the heavily weighted forms without
requiring the extra space. Also for Finnish, a scaling factor was estimated by
using automatic misspellings and corrections of a Project Gutenberg
Ebook<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/cache/epub/10863/pg10863.txt</span></span></span>
<em class="ltx_emph">Juha</em>.</p>
</div>
<div id="S2.SS3.p2" class="ltx_para">
<p class="ltx_p">In the initial Finnish tagger, there was a relatively large tagset, all of
which did not contain information necessary for the task of
spell-checking, such as discourse particles, which are relatively
context-agnostic <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib50" title="Iso suomen kielioppi" class="ltx_ref">4</a>]</cite>, so we opted to simplify the tagging in these
cases. Furthermore, the tagger used for training produced heuristic readings for
unrecognized word-forms, which we also removed. Finally, we needed to add
some extra penalties to the word forms unknown to the dictionary in the
N-gram model, since this phenomenon was more frequent and diverse for Finnish
than English. The extra penalties were acquired by iterative testing on the
correction material using generated errors.</p>
</div>
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Material</h2>

<div id="S3.p1" class="ltx_para">
<p class="ltx_p">To train the spell-checker lexicons, word-form probabilities can be acquired
from arbitrary running text. By using unigram frequencies, we can assign
all word-forms some initial probabilities in isolation, i.e. with no spell-checking context.
The unigram-trained models we used were acquired from existing
spell-checker systems <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>, <a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>, but we briefly
describe the used corpora here as well.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">To train the various N-gram models, corpora are required. For the surface-form
training material, it is sufficient to capture running N-grams in the text.
For training the statistical tagger with annotations, we also require
disambiguated readings. Ideally of course this means hand-annotated
tree banks or similar gold standard corpora.</p>
</div>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">The corpora used are summarized in Table <a href="#S3.T3" title="Table 3 ‣ 3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. The sizes are
provided to make it possible to reconstruct the systems. In practice, they are the newest
available versions of the respective corpora at the time of testing. In the
table, the first row is the training material used for the finite-state
lexicon, i.e. the extracted surface word-forms without the analyses for unigram
training. The second row is for the analyzed and disambiguated material for the
N-gram based taggers for suggestion improvement. The third line is the corpora
of spelling errors used only for the evaluation of the systems. As we can see
from the figures of English compared with Finnish, there is a significant
difference in freely available corpora such as Wikipedia. When going further to
lesser resourced languages, the number will drop enough to make such statistical
approaches less useful, e.g. Northern Sámi in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<figure id="S3.T3" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 3: </span>Sizes of training and evaluation corpora.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_t"></th>
<td class="ltx_td ltx_align_center ltx_border_t">Sentences</td>
<td class="ltx_td ltx_align_center ltx_border_t">Tokens</td>
<td class="ltx_td ltx_align_center ltx_border_t">Word-forms</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">English</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigrams</th>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_align_center ltx_border_t">2,110,728,338</td>
<td class="ltx_td ltx_align_center ltx_border_t">128,457</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-grams</th>
<td class="ltx_td ltx_align_center">42,164</td>
<td class="ltx_td ltx_align_center">969,905</td>
<td class="ltx_td ltx_align_center">39,690</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Errors</th>
<td class="ltx_td ltx_align_center">85</td>
<td class="ltx_td ltx_align_center">606</td>
<td class="ltx_td ltx_align_center">217</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" colspan="4"><span class="ltx_text ltx_font_bold">Finnish</span></th>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigrams</th>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_align_center ltx_border_t">17,479,297</td>
<td class="ltx_td ltx_align_center ltx_border_t">968,996</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">N-grams</th>
<td class="ltx_td ltx_align_center">98,699</td>
<td class="ltx_td ltx_align_center">1,027,514</td>
<td class="ltx_td ltx_align_center">144,658</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b">Errors</th>
<td class="ltx_td ltx_align_center ltx_border_b">333</td>
<td class="ltx_td ltx_align_center ltx_border_b">4,177</td>
<td class="ltx_td ltx_align_center ltx_border_b">2,762</td>
</tr>
</tbody>
</table>
</figure>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>English corpora</h3>

<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The English dictionary is based on a frequency weighted word-form list of
the English language as proposed in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite>. The word-forms were
collected from the English Wiktionary<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup><span class="ltx_ERROR undefined">\url</span>http://en.wiktionary.org</span></span></span>,
the English EBooks from the project
Gutenberg<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/browse/languages/en</span></span></span> and the
British National
Corpus<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup><span class="ltx_ERROR undefined">\url</span>http://www.kilgarriff.co.uk/bnc-readme.html</span></span></span>. This
frequency weighted word-list is in effect used as a unigram lexicon for spell-checking.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">To train an English morphosyntactic tagger, we use the WSJ corpus. In
this corpus each word is annotated by a single tag that encodes some
morphosyntactic information, such as part-of-speech and inflectional
form. The total number of tags in this corpus is <math id="S3.SS1.p2.m1" class="ltx_Math" alttext="77" display="inline"><mn>77</mn></math>.</p>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">The spelling errors of English were acquired by extracting the ones with
context from the Birkbeck error
corpus<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">9</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">9</sup><span class="ltx_ERROR undefined">\url</span>http://ota.oucs.ox.ac.uk/headers/0643.xml</span></span></span>. In this
corpus, the errors are from a variety of sources, including errors made by
children and language-learners. For the purpose of this experiment we picked
the subset of errors which had context and also removed the cases of word
joining and splitting to simplify the implementation of parsing and suggestion.
When interpreting results it should be noted that many of these English errors
are competence errors while the baseline algorithm used to model errors here is
for typing errors.</p>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Finnish Corpora</h3>

<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">As the Finnish dictionary, we selected the freely available open source
finite-state implementation of a Finnish morphological
analyser<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">10</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">10</sup><span class="ltx_ERROR undefined">\url</span>http://home.gna.org/omorfi</span></span></span>. The analyser had the
frequency-weighted word-form list based on Finnish
Wikipedia<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">11</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">11</sup><span class="ltx_ERROR undefined">\url</span>http://download.wikipedia.org/fiwiki/</span></span></span> making it in
practice an extended unigram lexicon for Finnish. The Finnish morphological
analyser, however, is capable of infinite compounding and derivation, which
makes it a notably different approach to spell checking than the English finite
word-form list.</p>
</div>
<div id="S3.SS2.p2" class="ltx_para">
<p class="ltx_p">The Finnish morphosyntactic N-gram model was trained using a
morphosyntactically analyzed Finnish
Newspaper<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">12</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">12</sup><span class="ltx_ERROR undefined">\url</span>http://www.csc.fi/english/research/software/ftc</span></span></span>. In
this format, the annotation is based on a sequence of tags, encoding
part of speech and inflectional form. The total number of different
tag sequences for this annotation is 747.</p>
</div>
<div id="S3.SS2.p3" class="ltx_para">
<p class="ltx_p">For Finnish spelling errors, we ran the Finnish unigram spell-checker through
Wikipedia, europarl and a corpus of Finnish EBooks from the project
Gutenberg<span class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">13</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">13</sup><span class="ltx_ERROR undefined">\url</span>http://www.gutenberg.org/browse/languages/fi</span></span></span> to
acquire the non-word spelling errors, and picked at random the errors having
frequencies in range 1 to 8 instances; a majority of higher frequency non-words
were actually proper nouns or neologisms missing from the dictionary. Using all of
Wikipedia, europarl and Gutenberg provides a reasonable variety of both
contemporary and old texts in a wide range of styles.</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Tests and Evaluation</h2>

<div id="S4.p1" class="ltx_para">
<p class="ltx_p">The evaluation of the correction suggestion quality is described in
Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>. The Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> contains the
precision for the spelling errors. The precision is measured by ranked
suggestions. In the tables, we give the results separately for ranks 1—5, and
then for the accumulated ranks 1—10. The rows of the table represent
different combinations of the N-gram models. The first row is a baseline score
achieved by the weighted edit distance model alone, and the second is with
unigram-weighted dictionary over edit-distance 2. The last two columns are the
traditional word-form N-gram model and our POS tagger based extension to it.</p>
</div>
<figure id="S4.T4" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 4: </span>Precision of suggestion algorithms with real spelling errors.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Algorithm</td>
<td class="ltx_td ltx_align_right ltx_border_t">1</td>
<td class="ltx_td ltx_align_right ltx_border_t">2</td>
<td class="ltx_td ltx_align_right ltx_border_t">3</td>
<td class="ltx_td ltx_align_right ltx_border_t">4</td>
<td class="ltx_td ltx_align_right ltx_border_t">5</td>
<td class="ltx_td ltx_align_right ltx_border_t">1—10</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_center ltx_border_t" colspan="7"><span class="ltx_text ltx_font_bold">English</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 (baseline)</td>
<td class="ltx_td ltx_align_right ltx_border_t">25.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">94.1 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Unigrams</td>
<td class="ltx_td ltx_align_right ltx_border_t">28.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">5.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">29.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5%</td>
<td class="ltx_td ltx_align_right ltx_border_t">28.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">97.6 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Word N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">29.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">10.6 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">34.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">5.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">14.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">97.7 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with POS N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">68.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">18.8 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">0.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">92.9 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_center ltx_border_t" colspan="7"><span class="ltx_text ltx_font_bold">Finnish</span></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 (baseline)</td>
<td class="ltx_td ltx_align_right ltx_border_t">66.5 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">8.7 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">4.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">4.7 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.9 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">89.8 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Unigrams</td>
<td class="ltx_td ltx_align_right ltx_border_t">61.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">13.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">1.6 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">88.2 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Edit distance 2 with Word N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_t">65.0 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">14.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.8 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">3.1 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">2.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">90.6 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_b ltx_border_t">Edit distance 2 with POS N-grams</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">71.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">9.3 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">1.2 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">3.4 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">0.3 %</td>
<td class="ltx_td ltx_align_right ltx_border_b ltx_border_t">85.7 %</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">It would appear that POS N-grams will in both cases give a significant boost to
the results, whereas the word-form N-grams will merely give a slight increase to
the results. In the next subsections we further dissect the specific changes to
results the different approaches give.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>English Error-Analysis</h3>

<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">In <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib44" title="How to write a spelling corrector" class="ltx_ref">10</a>]</cite>, the authors identify errors that are not solved using
simple unigram weights, such as correcting <em class="ltx_emph">rember</em> to <em class="ltx_emph">remember</em>
instead of <em class="ltx_emph">member</em>. Here, our scaled POS N-gram context-model as well
as the simpler word N-gram model, which can bypass the
edit distance model weight will select the correct suggestion. However, when
correcting e.g. <em class="ltx_emph">ment</em> to <em class="ltx_emph">meant</em> in stead of <em class="ltx_emph">went</em> or
<em class="ltx_emph">met</em> the POS based context reranking gives no help as the POS stays the
same.</p>
</div>
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Finnish Error-Analysis</h3>

<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">In Finnish results we can easily notice that variation within the first
position in the baseline results and reference system is more sporadic. This
can be traced back to the fuzz factor caused by a majority of probabilities
falling into the same category in our tests. The same edit-distance and unigram
probability leaves the decision to random factors irrelevant to this
experiment, such as alphabetical ordering that comes from data structures
backing up the program code. The N-gram based reorderings are the only methods
that can tiebreak the results here.</p>
</div>
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">An obvious improvement for Finnish with POS N-grams comes from correcting
agreeing NP’s towards case agreement, such as <em class="ltx_emph">yhdistetstÃ¤</em> to
<em class="ltx_emph">yhdisteistÃ¤</em> (‘of compounds’ <span class="ltx_text ltx_font_smallcaps">Pl Ela</span>) instead of the statistically
more common <em class="ltx_emph">yhdisteestÃ¤</em> (‘of compound’ <span class="ltx_text ltx_font_smallcaps">Sg Ela</span>). However, as
with English, the POS information does fail to rerank cases where two equally
rare word-forms with the same POS occur at the same edit distance, which seems
to be common with participles, such as correcting <em class="ltx_emph">varustunut</em> to
<em class="ltx_emph">varautunut</em> in stead of <em class="ltx_emph">varastanut</em>.</p>
</div>
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p">Furthermore we note that the the discourse particles that were dropped from the
POS tagger’s analysis tag set in order to decrease the tag set size will cause
certain word forms in the dictionary to be incorrectly reranked, such as when
correcting the very common misspelling <em class="ltx_emph">muillekkin</em> into
<em class="ltx_emph">muillekokin</em> (‘for others as well?’ <span class="ltx_text ltx_font_smallcaps">Pl All Qst Kin</span>) instead of
the originally correct <em class="ltx_emph">muillekin</em> (‘for others as well’ <span class="ltx_text ltx_font_smallcaps">Pl All
Kin</span>), since the analyses <span class="ltx_text ltx_font_smallcaps">Qst</span> (for question enclitic) and <span class="ltx_text ltx_font_smallcaps">Kin</span>
(for additive enclitic) are both dropped from the POS analyses.</p>
</div>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Performance Evaluation</h3>

<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">We did not work on optimizing the N-gram analysis and selection,
but we found that the speed of the system is reasonable—even in its current
form, considering that the algorithm is applied only to incorrect words on
the user’s request. Table <a href="#S4.T5" title="Table 5 ‣ 4.3 Performance Evaluation ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> summarizes the average
speed of performing the experiments in Table <a href="#S4.T4" title="Table 4 ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<figure id="S4.T5" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 5: </span>The speed of ranking the errors.
</figcaption>
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Material</th>
<td class="ltx_td ltx_align_center ltx_border_t">English</td>
<td class="ltx_td ltx_align_center ltx_border_t">Finnish</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row">Algorithm</th>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">Unigram (baseline)</th>
<td class="ltx_td ltx_align_center ltx_border_t">10.0 s</td>
<td class="ltx_td ltx_align_center ltx_border_t">51.8 s</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row"></th>
<td class="ltx_td ltx_align_center">399.1 wps</td>
<td class="ltx_td ltx_align_center">6.2 wps</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t">POS N-grams</th>
<td class="ltx_td ltx_align_center ltx_border_t">377.4 s</td>
<td class="ltx_td ltx_align_center ltx_border_t">1616.2 s</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_th ltx_th_row ltx_border_b"></th>
<td class="ltx_td ltx_align_center ltx_border_b">10.6 wps</td>
<td class="ltx_td ltx_align_center ltx_border_b">0.14 wps</td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p">The performance penalty that is incurred on Finnish spell-checking but not so
much on English comes from the method of determining readings for words unknown
to the language model, i.e. from the guessing algorithm. The amount of words
unknown to the language model in Finnish was greater than for English due to
the training data sparseness and the morphological complexity of the language.</p>
</div>
</section>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Future Work and Discussion</h2>

<div id="S5.p1" class="ltx_para">
<p class="ltx_p">In this work we recreated the results of basic and context-based spelling
correction for English and implemented same system for Finnish. We have shown
that the POS based N-gram models are suitable for improving the spelling
corrections for both morphologically more complex languages such as Finnish and
for further improving languages with simpler morphologies like English. To
further verify the claim, the method may still need to be tested on a
typologically wider spectrum of languages.</p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">In this article, we used readily available and hand-made error corpora to test
our error correction method. A similar method as the one we use for error
correction should be possible in error detection as well, especially when
detecting real-word errors <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib43" title="Context based spelling correction" class="ltx_ref">7</a>]</cite>. In future research, an obvious
development is to integrate the N-gram system as a part of a real spell-checker
system for both detection and correction of spelling errors, as is already done
for the unigram based spell checker demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib12" title="Finite-state spell-checking with weighted language and error models" class="ltx_ref">13</a>]</cite>.</p>
</div>
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">The article discussed only the reranking over basic edit distance error
models, further research should include more careful statistical training for
the error model as well, such as one demonstrated in <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib42" title="An improved error model for noisy channel spelling correction" class="ltx_ref">1</a>]</cite>.</p>
</div>
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2>

<div id="S6.p1" class="ltx_para">
<p class="ltx_p">In this paper we have demonstrated the use of finite-state methods for trigram
based generation of spelling suggestions. We have shown that the basic
word-form trigram methods suggested for languages like English do not seem to
be as useful without modification for morphologically more complex languages
like Finnish. Instead a more elaborate N-gram scheme using POS n-grams is
successful for Finnish as well as English.</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Acknowledgements</h2>

<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">We are grateful to Sam Hardwick for making the spell checking software
available and the HFST research group for fruitful discussions. We also thank
the anonymous reviewers for useful suggestions and pointers.
</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>

<ul id="L1" class="ltx_biblist">
<li id="bib.bib42" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Brill and R. C. Moore</span><span class="ltx_text ltx_bib_year"> (2000)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">An improved error model for noisy channel spelling correction</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">ACL ’00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">Morristown, NJ, USA</span>, <span class="ltx_text ltx_bib_pages"> pp. 286–293</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://dx.doi.org/http://dx.doi.org/10.3115/1075218.1075255" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S5.p3" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.
</span>
</li>
<li id="bib.bib28" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Church and W. Gale</span><span class="ltx_text ltx_bib_year"> (1991)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Probability scoring for spelling correction</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Statistics and Computing</span>, <span class="ltx_text ltx_bib_pages"> pp. 93–103</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib25" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[3]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">F. J. Damerau</span><span class="ltx_text ltx_bib_year"> (1964)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">A technique for computer detection and correction of spelling errors</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Commun. ACM</span> (<span class="ltx_text ltx_bib_number">7</span>).
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib50" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[4]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Hakulinen, M. Vilkuna, R. Korhonen, V. Koivisto, Heinonen and I. Alho</span><span class="ltx_text ltx_bib_year"> (2008)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Iso suomen kielioppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Suomalaisen Kirjallisuuden Seura</span>.
</span>
<span class="ltx_bibblock">Note: <span class="ltx_text ltx_bib_note">referred on 31.12.2008, available from <span class="ltx_ERROR undefined">\url</span>http://kaino.kotus.fi/visk</span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS3.p2" title="2.3 Finnish-Specific Finite-State Weighting Methods ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.3</span></a>.
</span>
</li>
<li id="bib.bib18" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[5]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Kukich</span><span class="ltx_text ltx_bib_year"> (1992)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Techniques for automatically correcting words in text</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">ACM Comput. Surv.</span> <span class="ltx_text ltx_bib_volume">24</span> (<span class="ltx_text ltx_bib_number">4</span>), <span class="ltx_text ltx_bib_pages"> pp. 377–439</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 0360-0300</span>,
<a href="http://dx.doi.org/http://doi.acm.org/10.1145/146370.146380" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib27" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[6]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">V. I. Levenshtein</span><span class="ltx_text ltx_bib_year"> (1966)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Binary codes capable of correcting deletions, insertions, and reversals</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Soviet Physics—Doklady 10, 707â710. Translated from Doklady Akademii Nauk SSSR</span>, <span class="ltx_text ltx_bib_pages"> pp. 845–848</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p3" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib43" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[7]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Mays, F. J. Damerau and R. L. Mercer</span><span class="ltx_text ltx_bib_year"> (1991)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Context based spelling correction</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Inf. Process. Manage.</span> <span class="ltx_text ltx_bib_volume">27</span> (<span class="ltx_text ltx_bib_number">5</span>), <span class="ltx_text ltx_bib_pages"> pp. 517–522</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 0306-4573</span>,
<a href="http://dx.doi.org/http://dx.doi.org/10.1016/0306-4573(91)90066-U" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S5.p2" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.
</span>
</li>
<li id="bib.bib8" class="ltx_bibitem ltx_bib_article">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[8]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">R. Mitton</span><span class="ltx_text ltx_bib_year"> (2009)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Ordering the suggestions of a spellchecker without using context*</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Nat. Lang. Eng.</span> <span class="ltx_text ltx_bib_volume">15</span> (<span class="ltx_text ltx_bib_number">2</span>), <span class="ltx_text ltx_bib_pages"> pp. 173–192</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text issn ltx_bib_external">ISSN 1351-3249</span>,
<a href="http://dx.doi.org/http://dx.doi.org/10.1017/S1351324908004804" title="" class="ltx_ref doi ltx_bib_external">Document</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p2" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib60" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[9]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">M. Mohri and M. Riley</span><span class="ltx_text ltx_bib_year"> (2002)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">An efficient algorithm for the n-best-strings problem</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS1.p3" title="2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a>.
</span>
</li>
<li id="bib.bib44" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[10]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Norvig</span><span class="ltx_text ltx_bib_year"> (2010)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">How to write a spelling corrector</span>.
</span>
<span class="ltx_bibblock">Note: <span class="ltx_text ltx_bib_note">Web Page, Visited February 28th 2010, Available <span class="ltx_ERROR undefined">\url</span>http://norvig.com/spell-correct.html</span>
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://norvig.com/spell-correct.html" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS2.p1" title="2.2 English-Specific Finite-State Weighting Methods ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.2</span></a>,
<a href="#S3.SS1.p1" title="3.1 English corpora ‣ 3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.1</span></a>,
<a href="#S3.p1" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>,
<a href="#S4.SS1.p1" title="4.1 English Error-Analysis ‣ 4 Tests and Evaluation ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.1</span></a>.
</span>
</li>
<li id="bib.bib1" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[11]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">J. Otero, J. GraÃ±a and M. Vilares</span><span class="ltx_text ltx_bib_year"> (2008-10-19)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Contextual spelling correction.</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">EUROCAST</span>,  <span class="ltx_text ltx_bib_editor">R. Moreno-DÃ­az, F. Pichler and A. Quesada-Arencibia (Eds.)</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_series">Lecture Notes in Computer Science</span>, Vol. <span class="ltx_text ltx_bib_volume">4739</span>, <span class="ltx_text ltx_bib_pages"> pp. 290–296</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text isbn ltx_bib_external">ISBN 978-3-540-75866-2</span>,
<a href="http://dblp.uni-trier.de/db/conf/eurocast/eurocast2007.html#OteroGV07" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.
</span>
</li>
<li id="bib.bib35" class="ltx_bibitem ltx_bib_proceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[12]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_editor">B. S. Pedersen, G. NeÅ¡pore and I. Skadi na (Eds.)</span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Proceedings of 18th nordic conference of computational linguistics, nodalida 2011, riiga, latvia, may 111—13 2011, proceedings</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_series">NEALT Proceedings</span>, Vol. <span class="ltx_text ltx_bib_volume">11</span>,  <span class="ltx_text ltx_bib_publisher">NEALT</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text isbn ltx_bib_external">ISBN </span></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#bib.bib9" title="Modularisation of finnish finite-state language description—towards wide collaboration in open source development of a morphological analyser" class="ltx_ref">14</a>.
</span>
</li>
<li id="bib.bib12" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[13]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">T. A. Pirinen and K. Lindén</span><span class="ltx_text ltx_bib_year"> (2010)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Finite-state spell-checking with weighted language and error models</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the Seventh SaLTMiL workshop on creation and use of basic lexical resources for less-resourced languagages</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">Valletta, Malta</span>, <span class="ltx_text ltx_bib_pages"> pp. 13–18</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://siuc01.si.ehu.es/%5C%7Ejipsagak/SALTMIL2010_Proceedings.pdf" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p1" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S2.SS1.p5" title="2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a>,
<a href="#S3.p1" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>,
<a href="#S3.p3" title="3 Material ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>,
<a href="#S5.p2" title="5 Future Work and Discussion ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a>.
</span>
</li>
<li id="bib.bib9" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[14]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">T. A. Pirinen</span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Modularisation of finnish finite-state language description—towards wide collaboration in open source development of a morphological analyser</span>.
</span>
<span class="ltx_bibblock">See <span class="ltx_text ltx_bib_crossref"><cite class="ltx_cite"><a href="#bib.bib35" title="Proceedings of 18th nordic conference of computational linguistics, nodalida 2011, riiga, latvia, may 111—13 2011, proceedings" class="ltx_ref">Proceedings of 18th nordic conference of computational linguistics, nodalida 2011, riiga, latvia, may 111—13 2011, proceedings, Pedersen<span class="ltx_text ltx_bib_etal"> et al.</span></a></cite></span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 299–302</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="http://www.helsinki.fi/%5C%7Etapirine/publications/Pirinen-nodalida2011.pdf" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS3.p1" title="2.3 Finnish-Specific Finite-State Weighting Methods ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.3</span></a>.
</span>
</li>
<li id="bib.bib19" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[15]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Savary</span><span class="ltx_text ltx_bib_year"> (2002)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Typographical nearest-neighbor search in a finite-state lexicon and its application to spelling correction</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">CIAA ’01: Revised Papers from the 6th International Conference on Implementation and Application of Automata</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">London, UK</span>, <span class="ltx_text ltx_bib_pages"> pp. 251–260</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text isbn ltx_bib_external">ISBN 3-540-00400-9</span></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.SS1.p3" title="2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a>.
</span>
</li>
<li id="bib.bib69" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_bibtag ltx_bib_key ltx_role_refnum">[16]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">M. Silfverberg and K. Lindén</span><span class="ltx_text ltx_bib_year"> (2011)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Combining statistical models for pos tagging using finite-state calculus</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 18th Conference on Computational Linguistics,
NODALIDA 2011</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 183–190</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S1.p4" title="1 Introduction ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>,
<a href="#S2.SS1.p4" title="2.1 Weighted Finite-State Interpretation of the Method ‣ 2 Methods ‣ Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams\footnotepubrightsOriginally published in CICLING 2012 in Delhi \urlhttp://cicling.org/2012/." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2.1</span></a>.
</span>
</li>
</ul>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated  on Fri Sep 29 12:53:30 2017 by <a href="http://dlmf.nist.gov/LaTeXML/">LaTeXML <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg==" alt="[LOGO]"></a>
</div></footer>
</div>
</body>
</html>