<!DOCTYPE html><html>
<head>
<title>Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages</title>
<!--Generated on Fri Feb 4 08:41:29 2022 by LaTeXML (version 0.8.5) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on This version: February 4, 2022.-->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Recent advances in Apertium, a free / open-source rule-based machine
translation platform for low-resource languages
<span id="footnote1" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>
<span class="ltx_tag ltx_tag_note">1</span>
Springer Open Access publication. This version, generated from the pre-print
LaTeX source, does not contain some changes made in the editorial process.
Published version available:
<a href="https://link.springer.com/article/10.1007/s10590-021-09260-6" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://link.springer.com/article/10.1007/s10590-021-09260-6</a></span></span></span>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Tanmai Khanna
<br class="ltx_break">Language Technologies Research Centre
<br class="ltx_break">IIIT Hyderabad, Telangana India 500032
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Jonathan N Washington
<br class="ltx_break">Swarthmore College
<br class="ltx_break">Swarthmore, PA USA 19081
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Francis M Tyers
<br class="ltx_break">Indiana University
<br class="ltx_break">Bloomington, IN USA 47401
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Sevilay Bayatlı
<br class="ltx_break">Beykent Üniversitesi
<br class="ltx_break">İstanbul, Turkey
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Daniel G Swanson
<br class="ltx_break">Swarthmore College
<br class="ltx_break">Swarthmore, PA USA 19081
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Flammie A Pirinen
<br class="ltx_break">UiT—Norgga árktalaš universitehta
<br class="ltx_break">NO-9000, Romssa
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Irene Tang
<br class="ltx_break">University of Chicago
<br class="ltx_break">Chicago, IL USA 60637
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
<span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Hèctor Alòs i Font
<br class="ltx_break">Centre de Recerca en Sociolingüística i Comunicació
<br class="ltx_break">Universitat de Barcelona
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
</div>
<div class="ltx_dates">(This version: February 4, 2022)</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">This paper presents an overview of Apertium, a free and open-source
rule-based machine translation platform. Translation in Apertium happens
through a pipeline of modular tools, and the platform continues to be
improved as more language pairs are added. Several advances have been
implemented since the last publication, including some new optional modules:
a module that allows rules to process recursive structures at the structural
transfer stage, a module that deals with contiguous and discontiguous
multi-word expressions, and a module that resolves anaphora to aid
translation. Also highlighted is the hybridisation of Apertium through
statistical modules that augment the pipeline, and statistical methods that
augment existing modules. This includes morphological disambiguation,
weighted structural transfer, and lexical selection modules that learn from
limited data. The paper also discusses how a platform like Apertium can be
a critical part of access to language technology for so-called low-resource
languages, which might be ignored or deemed unapproachable by popular
corpus-based translation technologies. Finally, the paper presents some of
the released and unreleased language pairs, concluding with a brief look at
some supplementary Apertium tools that prove valuable to users as well as
language developers. All Apertium-related code, including language data, is
free/open-source and available at <a href="https://github.com/apertium" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://github.com/apertium</a>.
</p>
<p class="ltx_p">Keywords: machine translation, low-resource languages, rule-based
machine translation, hybrid machine translation</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<span id="footnote2" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>
<span class="ltx_tag ltx_tag_note">2</span>
Several of the advances described in this paper were supported by
Google Summer of Code funding, for which the authors are very grateful.</span></span></span>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">Apertium <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib60" title="Apertium: a free/open-source platform for rule-based machine translation platform" class="ltx_ref">10</a>]</cite> is a free/open-source platform for
rule-based machine translation (RBMT). It was designed around the
shallow-transfer approach to translation, and most modules in the pipeline work on
rules written by language developers and linguists. The platform provides an
accessible way to create language data and rules, such that apart from language
developers, speakers of a language with a limited understanding of programming
and/or linguistics can create decent translation systems for their languages as
well. This is a superior model for creating translation systems for low-resource
languages both because it involves stakeholders from the language communities,
and because the languages lack widely available corpora that would be needed for
fully data-driven approaches. Apart from developing RBMT systems for
low-resource languages, the Apertium open-source organisation also develops and
supports tools for the creation of RBMT systems.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">Several advances to the Apertium platform (Release version 3.6) have been
implemented since the previous publication <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib60" title="Apertium: a free/open-source platform for rule-based machine translation platform" class="ltx_ref">10</a>]</cite>. These
include organisational improvements, additional tools, additional methods to
augment RBMT with corpus-based methods, new modules for more precise
translation, a few additional tools not directly involved in the RBMT pipeline,
and resources for many more languages and translation pairs.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">Organisational changes include a migration of the codebase from Subversion
(hosted by SourceForge) to Git (hosted by GitHub), a switch from two-letter ISO
codes (ISO 639-1) to three-letter ISO codes (ISO 639-3), and a three-directory
model for translation pairs (one directory for the components specific to each
of the two languages, and one for the components common to the pair). Additionally, morphological transducers for a
number of languages make use of Helsinki Finite-State Technology (HFST)
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib116" title="Hfst—framework for compiling and applying morphologies" class="ltx_ref">20</a>]</cite>, morphological disambiguation has been improved in many
languages by using Visual Interactive Syntax Learning Constraint Grammar (VISL
CG-3) <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="CG-3 – beyond classical constraint grammar" class="ltx_ref">6</a>]</cite>, and several new features have been incorporated into
the lexical selection module.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">Section <a href="#S2" title="2 Overview of the Apertium platform ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> gives an overview of the design of the Apertium RBMT platform.
Section <a href="#S3" title="3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> discusses modules used by Apertium to
augment RBMT using corpus-based methods. Section <a href="#S4" title="4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a> introduces
the new modules in the pipeline: a module that allows rules to process recursive
structures at the structural transfer stage, a module that deals with contiguous
and discontiguous multi-word expressions, and one that resolves anaphors to aid
translation. Section <a href="#S5" title="5 Supporting minoritised languages ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">5</span></a> discusses Apertium’s
contribution to language revitalisation and reclamation efforts.
Section <a href="#S6" title="6 Supplementary tools ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a> introduces several supplementary Apertium
tools. Section <a href="#S7" title="7 Conclusion ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">7</span></a> concludes.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Overview of the Apertium platform</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The overall design of Apertium is a pipeline with a series of modules. Each
stage of the pipeline reads from and writes to text streams in a consistent
format so that modules can easily be added or removed according to the needs of
the languages in question.</p>
</div>
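<div class="ltx_para">
<p class="ltx_p">The pipeline design described above can be sketched in a few lines of Python. This is a minimal illustration only, under the assumption that each stage is a function from one text stream to another; the names <span class="ltx_text ltx_font_typewriter">run_pipeline</span>, <span class="ltx_text ltx_font_typewriter">lowercase</span> and <span class="ltx_text ltx_font_typewriter">bracket_tokens</span> are hypothetical and do not correspond to real Apertium modules, which run as separate processes connected by pipes.</p>
</div>

```python
from functools import reduce
from typing import Callable, List

# A stage is any function from one text stream to another.
Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], stream: str) -> str:
    """Pass the stream through each stage in order, like piped processes."""
    return reduce(lambda current, stage: stage(current), stages, stream)


# Hypothetical toy stages; they are NOT real Apertium modules.
def lowercase(stream: str) -> str:
    return stream.lower()


def bracket_tokens(stream: str) -> str:
    return " ".join("[" + token + "]" for token in stream.split())


# Because every stage shares one stream format, a stage can be dropped
# or added without changing the others.
full = run_pipeline([lowercase, bracket_tokens], "Hello World")
reduced = run_pipeline([bracket_tokens], "Hello World")
```

<div class="ltx_para">
<p class="ltx_p">The point of the sketch is the shared interface: since every stage consumes and produces the same kind of stream, the list of stages can differ from one language pair to another.</p>
</div>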
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">Apertium consists of both the management of the pipeline (the main
<span class="ltx_text ltx_font_typewriter">apertium</span> executable) and all the stages in this pipeline, except where
outside tools (such as HFST for morphological analysis and generation, or CG for
morphological disambiguation) are used. Each stage consists of a general
processor which modifies the stream based on hand-crafted “rules” (coded
linguistic generalisations) for a given language or language pair.
Figure <a href="#S2.F1" title="Figure 1 ‣ 2 Overview of the Apertium platform ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a> shows the entire pipeline, including optional modules.</p>
</div>
<figure id="S2.F1" class="ltx_figure"><img src="x1.png" id="S2.F1.g1" class="ltx_graphics ltx_centering" width="674" height="237" alt="The architecture of Apertium, a transfer-based machine translation
system. Each rounded box is a module available for language-specific or
pair-specific development. Broken lines show optional modules. Lines with
arrows represent the flow of data through the pipeline. The stages in the
pipeline are grouped by whether they are relevant to source-language
analysis, bilingual transfer, or target-language generation—the three
logical sections of the pipeline. The deformatter and reformatter are
language-agnostic and provided by Apertium.">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>The architecture of Apertium, a transfer-based machine translation
system. Each rounded box is a module available for language-specific or
pair-specific development. Broken lines show optional modules. Lines with
arrows represent the flow of data through the pipeline. The stages in the
pipeline are grouped by whether they are relevant to source-language
analysis, bilingual transfer, or target-language generation—the three
logical sections of the pipeline. The deformatter and reformatter are
language-agnostic and provided by Apertium.</figcaption>
</figure>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">A short overview of each of the stages of the pipeline is provided below. The
new ones are discussed in further detail in Section <a href="#S4" title="4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</p>
</div>
<div id="S2.p4" class="ltx_para">
<ul id="S2.I1" class="ltx_itemize">
<li id="S2.I1.ix1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I1.ix1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Deformatter:</span> Encapsulates any document formatting tags so
that they go through the rest of the translation pipeline untouched.
This is a language-agnostic part of Apertium.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p5" class="ltx_para">
<ul id="S2.I2" class="ltx_itemize">
<li id="S2.I2.ix1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I2.ix1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Source Language morphological analyser:</span> Segments the
surface form of text (words or multi-word lexical units) using a
finite-state transducer (FST) and delivers one or more lexical forms (or
“analyses”), each of which includes a lemma and a part-of-speech label
(encoded as a “tag”), as well as any relevant subcategory and
grammatical (e.g., inflectional) information (also encoded as tags).</p>
</div>
</li>
</ul>
</div>
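<div class="ltx_para">
<p class="ltx_p">As a rough illustration of what the analyser delivers, the sketch below maps a surface form to one or more analyses, each holding a lemma and a sequence of tags. It rests on loud assumptions: the dictionary <span class="ltx_text ltx_font_typewriter">LEXICON</span>, the class <span class="ltx_text ltx_font_typewriter">Analysis</span>, the function <span class="ltx_text ltx_font_typewriter">analyse</span> and the English entries and tag names are all hypothetical, and a plain dictionary lookup stands in for the finite-state transducer that a real analyser uses.</p>
</div>

```python
from typing import Dict, List, NamedTuple, Tuple


class Analysis(NamedTuple):
    lemma: str
    tags: Tuple[str, ...]  # part of speech first, then other grammatical tags


# Hypothetical entries with illustrative tag names. A real analyser compiles
# its lexicon into a finite-state transducer; a dict is only a stand-in.
LEXICON: Dict[str, List[Analysis]] = {
    "saw": [Analysis("see", ("vblex", "past")), Analysis("saw", ("n", "sg"))],
    "dogs": [Analysis("dog", ("n", "pl"))],
}


def analyse(surface: str) -> List[Analysis]:
    """Return every known analysis of a surface form (empty if unknown)."""
    return LEXICON.get(surface.lower(), [])


ambiguous = analyse("saw")  # two analyses: verb "see" and noun "saw"
```

<div class="ltx_para">
<p class="ltx_p">An ambiguous form such as the one above yields several analyses, which is why a disambiguation stage follows the analyser in the pipeline.</p>
</div>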
<div id="S2.p6" class="ltx_para">
<ul id="S2.I3" class="ltx_itemize">
<li id="S2.I3.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I3.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Source Language morphological disambiguator:</span> Tries to choose
the best sequence of morphological analyses for an ambiguous sentence.
<br class="ltx_break">The original Apertium disambiguator used a first-order hidden Markov
model (HMM). Other statistical models, such as averaged weighted
Perceptron, have since been added and are currently in use for various
languages. Additionally, CG <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="CG-3 – beyond classical constraint grammar" class="ltx_ref">6</a>]</cite> is often combined with a
statistical model for a two-step process. The different approaches are
discussed in Section <a href="#S3.SS1" title="3.1 Morphological disambiguation ‣ 3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.1</span></a>.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p7" class="ltx_para">
<ul id="S2.I4" class="ltx_itemize">
<li id="S2.I4.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I4.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Source Language retokenization:</span> Adjusts token boundaries for
multi-word expressions, which can be non-contiguous (such as separable
verbs in Germanic languages), in preparation for translation. Often
this consists of combining component parts into single multi-word
expressions. This module is discussed in more detail in
Section <a href="#S4.SS2" title="4.2 Processing multi-word expressions ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.2</span></a>.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p8" class="ltx_para">
<ul id="S2.I5" class="ltx_itemize">
<li id="S2.I5.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I5.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Lexical transfer:</span> Reads each source-language (SL) lexical
form and delivers a set of corresponding target-language (TL) lexical
forms by looking it up in a bilingual dictionary, implemented as an FST.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p9" class="ltx_para">
<ul id="S2.I6" class="ltx_itemize">
<li id="S2.I6.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I6.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Lexical selection:</span> Based on context rules, chooses the most
adequate translation of ambiguous SL lexical forms. The original
module <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib231" title="Flexible finite-state lexical selection for rule-based machine translation" class="ltx_ref">43</a>]</cite> has been extended with new features
like macros. This is discussed in more detail in
Section <a href="#S3.SS2" title="3.2 Lexical selection ‣ 3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.2</span></a>.</p>
</div>
</li>
</ul>
</div>
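<div class="ltx_para">
<p class="ltx_p">The interplay between lexical transfer and lexical selection can be sketched as follows. This is a toy illustration under stated assumptions, not Apertium's rule format or implementation: the dictionary <span class="ltx_text ltx_font_typewriter">BIDIX</span>, the table <span class="ltx_text ltx_font_typewriter">RULES</span>, both function names and the Spanish-English entries are hypothetical, a plain dictionary replaces the finite-state transducer, and the fallback to the first candidate and the pass-through of unknown lemmas are choices made for the sketch.</p>
</div>

```python
from typing import Dict, List, Tuple

# Hypothetical bilingual entries: one SL lemma may have several TL candidates.
BIDIX: Dict[str, List[str]] = {
    "estación": ["season", "station"],
    "tren": ["train"],
}

# Hypothetical context rules: (SL lemma, lemma seen in its context) maps to
# the TL candidate that should be chosen.
RULES: Dict[Tuple[str, str], str] = {("estación", "tren"): "station"}


def lexical_transfer(sl_lemma: str) -> List[str]:
    """Return all TL candidates; unknown lemmas pass through unchanged."""
    return BIDIX.get(sl_lemma, [sl_lemma])


def lexical_selection(sl_lemma: str, context: List[str]) -> str:
    """Pick a candidate using context rules, else fall back to the first."""
    candidates = lexical_transfer(sl_lemma)
    for neighbour in context:
        choice = RULES.get((sl_lemma, neighbour))
        if choice in candidates:
            return choice
    return candidates[0]


with_rule = lexical_selection("estación", ["de", "tren"])
default = lexical_selection("estación", ["del", "año"])
```

<div class="ltx_para">
<p class="ltx_p">Keeping the two steps separate mirrors the pipeline: transfer only enumerates candidates, while selection decides among them using context.</p>
</div>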
<div id="S2.p10" class="ltx_para">
<ul id="S2.I7" class="ltx_itemize">
<li id="S2.I7.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I7.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Source Language anaphora resolution:</span> Resolves references to
earlier items in discourse. Using saliency metrics, this module attaches
the lexical unit of the antecedent to its corresponding anaphor to aid
translation. This module is discussed in more detail in
Section <a href="#S4.SS3" title="4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3</span></a>.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p11" class="ltx_para">
<ul id="S2.I8" class="ltx_itemize">
<li id="S2.I8.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I8.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Shallow structural transfer:</span> Apertium’s shallow structural
transfer module applies a sequence of one or more finite-state
constraint rules to the output of the lexical selection module. It
generally consists of three sub-modules: a chunker mode, an interchunk
mode, and a postchunk mode.</p>
</div>
<div id="S2.I8.i1.p2" class="ltx_para">
<p class="ltx_p">Apertium 1.0 had one single structural transfer step. This was
considered enough for the translators between the closely related
Iberian Romance languages which constituted the first Apertium
translators. The one-step strategy is still used in the current released
versions of many of them, including the Catalan-Spanish translation
pair, which since then has been continuously improved and is widely
used.<span id="footnote3" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>
<span class="ltx_tag ltx_tag_note">3</span>
In 2020, the Softcatalà-hosted Apertium translators
served an average of 4.6 million requests per month from Spanish to
Catalan and 1.1 million from Catalan to Spanish (data kindly provided by
Xavier Ivars).</span></span></span>
<br class="ltx_break">Beginning with the implementation of the
Spanish-English and Catalan-English language pairs, a three-step
transfer architecture was developed, leading to the release of Apertium
2.0. The first step creates chunks in the source language and reorders
words inside the chunk as per the transfer rules. The second step
reorders chunks based on the target language syntax, and the final step
makes the stream ready for the generator. This is currently the standard
Apertium structural transfer architecture. Several pairs have additional
transfer steps, such as Catalan-Esperanto (5 steps) and French-Occitan
(4 steps).</p>
</div>
<div id="S2.I8.i1.p3" class="ltx_para">
<p class="ltx_p">In the Catalan-Esperanto translator there are three “interchunk” steps
aimed at a deeper syntactic analysis, with the overarching objective of
generating the correct case morphology on various types of nominals in
the target language (Esperanto), since the source language (Catalan)
lacks case morphology except in its pronominal system. The shallow
transfer system is used creatively in other ways as well.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p12" class="ltx_para">
<ul id="S2.I9" class="ltx_itemize">
<li id="S2.I9.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I9.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Recursive structural transfer:</span>
This module is a recently developed alternative to the shallow
structural transfer module (chunker, interchunk, and postchunk). Its
linguistic data is specified as context-free grammars (CFGs) and it uses
a Generalized LR (GLR) parser rather than
finite-state chunking to more effectively implement long-distance
reordering. This module is discussed further in
Section <a href="#S4.SS1" title="4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.1</span></a>.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p13" class="ltx_para">
<ul id="S2.I10" class="ltx_itemize">
<li id="S2.I10.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I10.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Target Language retokenization:</span> Adjusts token boundaries for
multi-word expressions, which can be non-contiguous (such as separable
verbs in Germanic languages), in preparation for target-language
morphological generation. Often this consists of separating multi-word
expressions into their component parts. This module is discussed in
more detail in Section <a href="#S4.SS2" title="4.2 Processing multi-word expressions ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.2</span></a>.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p14" class="ltx_para">
<ul id="S2.I11" class="ltx_itemize">
<li id="S2.I11.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I11.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Target Language morphological generator:</span> Delivers the
sequence of TL surface forms for each corresponding TL lexical form
received from earlier modules in the pipeline.</p>
</div>
</li>
</ul>
</div>
<div id="S2.p15" class="ltx_para">
<ul id="S2.I12" class="ltx_itemize">
<li id="S2.I12.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I12.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Target Language Post-generator:</span> Performs mainly orthographic
operations across tokens, for example elision (such as <span class="ltx_text ltx_font_italic">lo + òme
= l’òme</span> in Occitan), fusion (such as <span class="ltx_text ltx_font_italic">da + il = dal</span> in
Italian), epenthesis (such as <span class="ltx_text ltx_font_italic">a → an</span> in English, or <span class="ltx_text ltx_font_italic">с →
со</span> and <span class="ltx_text ltx_font_italic">о → об</span> in Russian), or dissimilation (such as
<span class="ltx_text ltx_font_italic">la + agua → el agua</span> in Spanish).</p>
</div>
</li>
</ul>
</div>
<div id="S2.p16" class="ltx_para">
<ul id="S2.I13" class="ltx_itemize">
<li id="S2.I13.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S2.I13.i1.p1" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_bold">Reformatter:</span> De-encapsulates any formatting information to
prepare the final formatted document in the target language. This is a
language-agnostic part of Apertium.</p>
</div>
</li>
</ul>
</div>
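<div class="ltx_para">
<p class="ltx_p">The cross-token operations of the post-generator described above can be illustrated with a minimal Python sketch. It assumes a hypothetical whole-token lookup table (<span class="ltx_text ltx_font_typewriter">FUSIONS</span>) and a hypothetical function name (<span class="ltx_text ltx_font_typewriter">post_generate</span>); the actual Apertium post-generator works from its own dictionary format, which is not reproduced here.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
# Hypothetical rule table: a pair of adjacent tokens maps to its surface form.
FUSIONS = {
    ("da", "il"): "dal",        # Italian fusion
    ("lo", "òme"): "l’òme",     # Occitan elision
    ("la", "agua"): "el agua",  # Spanish dissimilation
}


def post_generate(tokens):
    """Scan adjacent token pairs left to right and apply any matching rule."""
    output = []
    i = 0
    while i != len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in FUSIONS:
            # Both tokens are consumed and replaced by the fused form.
            output.append(FUSIONS[pair])
            i += 2
        else:
            output.append(tokens[i])
            i += 1
    return " ".join(output)


print(post_generate(["da", "il", "mare"]))  # dal mare
```
</pre>
<p class="ltx_p">A table keyed on whole tokens is only the simplest possible design; rules such as <span class="ltx_text ltx_font_italic">a → an</span> depend on the first sound of the following word and would need pattern-based conditions instead.</p>
</div>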
<div id="S2.p17" class="ltx_para">
<p class="ltx_p">The reader is referred to the Apertium
wiki<span id="footnote4" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup>
<span class="ltx_tag ltx_tag_note">4</span>
<a href="http://wiki.apertium.org/wiki/Pipeline" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://wiki.apertium.org/wiki/Pipeline</a></span></span></span> for more information
about file naming conventions, mode naming conventions, and dates of
introduction for each stage of the pipeline. Any further additions to the
pipeline will be documented on this wiki.</p>
</div>
<div id="S2.p18" class="ltx_para">
<p class="ltx_p">It should be added that a major difference in the organisation of Apertium
language pairs as compared to the original model is the three-directory
structure currently used for most (but not all) of the released translation
pairs. Initially, every translation pair was developed in a single
self-contained repository that included all relevant linguistic data. Currently,
monolingual data, such as morphological dictionaries, morphological
disambiguators and post-generators, are shared by different translators,
allowing much easier reuse of data and cooperation in the improvement of
linguistic resources <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib124" title="FST trimming: ending dictionary redundancy in Apertium" class="ltx_ref">24</a>]</cite>. Thus, for instance, compiling the
<span class="ltx_text ltx_font_typewriter">apertium-spa-cat</span> pair now depends on the <span class="ltx_text ltx_font_typewriter">apertium-spa</span> and
<span class="ltx_text ltx_font_typewriter">apertium-cat</span> modules, which are also used by other translation pairs.</p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Use of corpus-based approaches in Apertium modules</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Several methods of incorporating corpus-based approaches into Apertium RBMT
systems are available. These methods fall into the domains of morphological
disambiguation (Section <a href="#S3.SS1" title="3.1 Morphological disambiguation ‣ 3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.1</span></a>), lexical selection
(Section <a href="#S3.SS2" title="3.2 Lexical selection ‣ 3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.2</span></a>), and structural transfer
(Section <a href="#S3.SS3" title="3.3 Structural transfer module ‣ 3 Use of corpus-based approaches in Apertium modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3.3</span></a>).</p>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>Morphological disambiguation</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">The goal of morphological disambiguation is to choose the correct morphological
analysis if there are multiple possible analyses for a given lexical unit.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<p class="ltx_p">The oldest and most commonly used morphological disambiguation method in
Apertium <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib198" title="Speeding up target language driven part-of-speech tagger training for machine translation" class="ltx_ref">37</a>, <a href="#bib.bib199" title="Training part-of-speech taggers to build machine translation systems for less-resourced language pairs" class="ltx_ref">33</a>]</cite> is a module that relies on patterns
learned from a corpus. This bigram-based morphological disambiguator chooses one
analysis from among those returned by the morphological analyser based on a
probabilistic model of sequences of part-of-speech tags given the surrounding
context.</p>
</div>
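<div class="ltx_para">
<p class="ltx_p">As a rough illustration of bigram-based disambiguation, the following Python sketch selects the most probable tag sequence given the candidate tags returned for each word and a table of tag-to-tag transition probabilities. The function name, the tag inventory, the smoothing floor, and all probabilities are hypothetical; the actual Apertium tagger is trained from corpora and is not reproduced here.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def disambiguate(candidates, transition, start="BOS", floor=1e-6):
    """Return the most probable tag sequence under a bigram tag model.

    candidates: one list of possible tags per word (from the analyser).
    transition: dict mapping (previous_tag, tag) to a probability.
    """
    # best[tag] holds (score, path) for the best sequence ending in tag.
    best = {start: (1.0, [])}
    for options in candidates:
        step = {}
        for tag in options:
            score, path = max(
                (
                    (prev_score * transition.get((prev, tag), floor), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical transition probabilities for a three-word input.
TRANSITION = {
    ("BOS", "det"): 0.5,
    ("det", "n"): 0.8,
    ("det", "vblex"): 0.01,
    ("n", "n"): 0.2,
    ("n", "vblex"): 0.6,
}

print(disambiguate([["det"], ["n", "vblex"], ["n", "vblex"]], TRANSITION))
# ['det', 'n', 'vblex']
```
</pre>
</div>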
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">Some Apertium disambiguators are implemented instead using statistical methods
based on Hidden Markov Models (HMMs), which process the result of the
application of constraint-grammar rules <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib87" title="Constraint grammar: a language-independent system for parsing unrestricted text" class="ltx_ref">15</a>]</cite>. The
perceptron tagger <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib262" title="Syntactic processing using the generalized perceptron and beam search" class="ltx_ref">52</a>]</cite> in the English language
module (<span class="ltx_text ltx_font_typewriter">apertium-eng</span>) follows one such method.</p>
</div>
<div id="S3.SS1.p4" class="ltx_para">
<p class="ltx_p">Furthermore, VISL CG-3 <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib28" title="CG-3 – beyond classical constraint grammar" class="ltx_ref">6</a>]</cite> has become a popular method among
Apertium developers of implementing morphological disambiguation using
hand-crafted heuristics. For many languages, it is combined with one of the
other methods for a two-step disambiguation stage.</p>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Lexical selection</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">The goal of lexical selection is to choose an adequate translation in the target
language from among several possible translations for a given source-language
lexical unit. An FST-based module that allows the writing of rules has been in
use for some time <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib231" title="Flexible finite-state lexical selection for rule-based machine translation" class="ltx_ref">43</a>]</cite>.</p>
</div>
<div id="S3.SS2.p2" class="ltx_para">
<p class="ltx_p">Apart from manually written rules, a system has also been developed that learns
rules through a maximum-entropy model trained in an unsupervised
manner <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib233" title="Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation" class="ltx_ref">42</a>]</cite>. The training method requires only a source language
corpus, a statistical target-language language model, and the RBMT system
itself. All possible translations are scored against the TL language model, and
these scores are normalized to provide fractional counts to train
source-language maximum-entropy lexical selection models.</p>
</div>
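<div class="ltx_para">
<p class="ltx_p">The normalization step of this training method can be sketched in Python as follows. The sketch assumes that each candidate translation has already been scored by a target-language model; the function name and the example scores are hypothetical, and the maximum-entropy training that consumes the resulting fractional counts is not shown.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def fractional_counts(lm_scores):
    """Normalise language-model scores of candidate translations to sum to 1."""
    total = sum(lm_scores.values())
    return {candidate: score / total for candidate, score in lm_scores.items()}


# Hypothetical language-model probabilities for two translations of one
# source sentence that differ only in the choice for an ambiguous word.
scores = {"the train station": 0.006, "the train season": 0.002}
print(fractional_counts(scores))
```
</pre>
<p class="ltx_p">Each candidate thus contributes a count proportional to how plausible the language model finds it, rather than a hard count of 0 or 1.</p>
</div>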
</section>
<section id="S3.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.3 </span>Structural transfer module</h3>
<div id="S3.SS3.p1" class="ltx_para">
<p class="ltx_p">Structural transfer handles differences between the source and target languages
in terms of word order and morphological information by applying transfer rules.
In the chunker module, these transfer rules function by matching a source
language pattern of lexical items, creating chunks and applying a sequence of
actions to convert the word order and morphological properties of the chunk as
per the target language. There can, however, be more than one potential sequence
of actions for each source language pattern, as well as overlapping patterns. To
generate an accurate translation, transfer rules are applied to the input using
a left-right-longest match algorithm.</p>
</div>
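<div class="ltx_para">
<p class="ltx_p">A minimal Python sketch of left-to-right, longest-match rule application is given below. It treats a pattern as a tuple of part-of-speech tags; the pattern list and the function name are hypothetical, and the actions attached to real transfer rules are omitted.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
# Hypothetical source-language patterns; two of them overlap.
PATTERNS = [("det", "n"), ("det", "n", "adj"), ("vblex",)]


def chunk(tags, patterns=PATTERNS):
    """Segment a tag sequence, preferring the longest pattern at each position."""
    chunks = []
    i = 0
    while i != len(tags):
        matches = [p for p in patterns if tuple(tags[i:i + len(p)]) == p]
        if matches:
            longest = max(matches, key=len)
            chunks.append(longest)
            i += len(longest)
        else:
            # No rule applies: pass the single tag through unchanged.
            chunks.append((tags[i],))
            i += 1
    return chunks


print(chunk(["vblex", "det", "n", "adj"]))
# [('vblex',), ('det', 'n', 'adj')]
```
</pre>
<p class="ltx_p">In the example, both <span class="ltx_text ltx_font_typewriter">det n</span> and <span class="ltx_text ltx_font_typewriter">det n adj</span> match at the second position, and the longer pattern is selected.</p>
</div>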
<div id="S3.SS3.p2" class="ltx_para">
<p class="ltx_p">Work has been done to extract, or “learn”, chunking rules using Alignment
Templates <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib82" title="Using alignment templates to infer shallow-transfer machine translation rules" class="ltx_ref">36</a>, <a href="#bib.bib200" title="Automatic induction of shallow-transfer rules for open-source machine translation" class="ltx_ref">34</a>, <a href="#bib.bib127" title="Using unsupervised corpus-based methods to build rule-based machine translation systems" class="ltx_ref">23</a>, <a href="#bib.bib202" title="Inferring shallow-transfer machine translation rules from small parallel corpora" class="ltx_ref">35</a>, <a href="#bib.bib203" title="A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora" class="ltx_ref">32</a>]</cite>. A
parallel corpus is searched for sequences of lexical units that exhibit
differences in order or morphological information.</p>
</div>
<div id="S3.SS3.p3" class="ltx_para">
<p class="ltx_p">In addition, chunker rules can now be weighted so as to apply different rules in
different overlapping lexical environments. These weights can be learned using
an unsupervised maximum entropy approach <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib17" title="Unsupervised weighting of transfer rules in rule-based machine translation using maximum-entropy approach" class="ltx_ref">4</a>]</cite>.</p>
</div>
<div id="S3.SS3.p4" class="ltx_para">
<p class="ltx_p">The basic goal of this method is to choose between conflicting structural
transfer rules based on the lexical environment. For example, the Spanish
sentence <em class="ltx_emph ltx_font_italic">Encontré el pastel muy bueno</em> has (at least) two different
(hypothetical) translations to English depending on the syntactic parse in
Spanish: (a) “I found the cake very good” or (b) “I found the very good
cake”. That is, <em class="ltx_emph ltx_font_italic">muy bueno</em> may be parsed (a) as a complement to the verb
<em class="ltx_emph ltx_font_italic">encontré</em> or (b) as a modifier to the noun <em class="ltx_emph ltx_font_italic">el pastel</em>. These
parses correspond to different sets of transfer rules, each of which could be
matched: (a) a single verb phrase consisting of V DET N ADV ADJ, or (b) a verb
V, followed by a noun phrase DET N ADV ADJ. The noun phrase rule would specify
that the elements be output in a different order, DET ADV ADJ N, and both rules
that match the verb would add a lexical unit for the pronominal subject.</p>
</div>
<div id="S3.SS3.p5" class="ltx_para">
<p class="ltx_p">A model is produced by running SL text through all possible transfer rules,
comparing the potential translations that are output to a TL language model, and
dividing the scores by the series of SL lemmas that matched each transfer rule
pattern for a given potential translation. If the example above were part of
the training data, then the potential translation <em class="ltx_emph ltx_font_italic">I found the cake very
good</em> would score higher than <em class="ltx_emph ltx_font_italic">I found the very good cake</em> against an
English language model due to having a higher probability. These different
scores are then distributed as weights, along with the input lemmas, attached to
the rules that each translation is the result of. In this example, the weight
assigned to the V DET N ADV ADJ rule for the Spanish lemmas is higher than the
sum of the weights assigned to the V and DET N ADV ADJ rules for these same
lemmas, and hence the V DET N ADV ADJ rule will be selected.</p>
</div>
<div id="S3.SS3.p6" class="ltx_para">
<p class="ltx_p">During translation, when a string of SL text matches multiple transfer rules,
the system is able to choose between them (infer the “correct” one) based on
the weights associated with the rules that the SL lemmas trigger. For example,
if this same sentence were being translated, the V DET N ADV ADJ rule would be
matched, resulting in the output “I found the cake very good”.</p>
</div>
<div id="S3.SS3.p7" class="ltx_para">
<p class="ltx_p">A contrasting example, <em class="ltx_emph ltx_font_italic">Encontré un pastel muy bueno</em>, would match the same
two sets of rules, but would result in translation occurring through the other
rule set. This is because the lemmas of the potential translation <em class="ltx_emph ltx_font_italic">I found
a very good cake</em> would result in higher combined weights for the V and DET N
ADV ADJ rules than for the V DET N ADV ADJ rule. The reason for this is that
translations containing these lemmas would have scored higher against an English
language model than translations like <em class="ltx_emph ltx_font_italic">I found a cake very good</em>, resulting
in higher weights for this set of Spanish lemmas attached to this set of rules.</p>
</div>
<div id="S3.SS3.p8" class="ltx_para">
<p class="ltx_p">In both examples of Spanish inputs, using this approach and a suitable corpus to
train an English language model, the set of transfer rules that results in the
more likely English translation is chosen.</p>
</div>
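<div class="ltx_para">
<p class="ltx_p">The selection step for the two Spanish examples can be sketched in Python as a comparison of summed rule weights. All weight values below are hypothetical and chosen only to reproduce the outcomes described above; in the method itself the weights are learned and are associated with the SL lemmas, which this sketch approximates with one weight table per example sentence.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def choose_rules(analyses, weights):
    """Return the rule sequence whose weights have the highest sum."""
    return max(analyses, key=lambda rules: sum(weights[rule] for rule in rules))


# The two competing ways of covering V DET N ADV ADJ.
ANALYSES = [("V DET N ADV ADJ",), ("V", "DET N ADV ADJ")]

# Hypothetical weights for the lemmas of each example sentence.
WEIGHTS_EL_PASTEL = {"V DET N ADV ADJ": 0.7, "V": 0.2, "DET N ADV ADJ": 0.3}
WEIGHTS_UN_PASTEL = {"V DET N ADV ADJ": 0.3, "V": 0.3, "DET N ADV ADJ": 0.5}

print(choose_rules(ANALYSES, WEIGHTS_EL_PASTEL))  # ('V DET N ADV ADJ',)
print(choose_rules(ANALYSES, WEIGHTS_UN_PASTEL))  # ('V', 'DET N ADV ADJ')
```
</pre>
</div>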
<div id="S3.SS3.p9" class="ltx_para">
<p class="ltx_p">This method has been tested using the Kazakh–Turkish, Kyrgyz–Turkish, and
Spanish–English translation pairs, and it has been observed that the results
are better when there is a greater number of ambiguous rules. The module has
not yet been included in any released translation system.</p>
</div>
</section>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>New modules</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">Several previously unpublished modules are now available for the Apertium
pipeline. Discussed in this section are <span class="ltx_text ltx_font_typewriter">apertium-recursive</span>, which
provides for true recursive transfer; <span class="ltx_text ltx_font_typewriter">apertium-separable</span>, which enables
the processing of multi-word expressions; and <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span>, which
allows the resolution of anaphors in the source text.</p>
</div>
<section id="S4.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.1 </span>Recursive structural transfer</h3>
<figure id="S4.F4" class="ltx_figure"><span class="ltx_inline-para ltx_minipage ltx_align_middle" style="width:433.6pt;">
<span id="S4.F4.p1" class="ltx_para ltx_align_center"><pre class="ltx_verbatim ltx_font_typewriter">
NP -> det n { 2 + 1 } |
NP PP { 2 _ 1 } ;
PP -> pr NP { 2 + 1 } ;
</pre>
</span></span>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>A simple set of recursive rules translating a subset of noun
phrases and prepositional phrases from English to Basque. A noun phrase
(<span class="ltx_text ltx_font_typewriter">NP</span>) in the source language consists of a determiner (<span class="ltx_text ltx_font_typewriter">det</span>)
and a noun (<span class="ltx_text ltx_font_typewriter">n</span>), and may optionally include a prepositional phrase
(<span class="ltx_text ltx_font_typewriter">PP</span>), and a prepositional phrase consists of a preposition
(<span class="ltx_text ltx_font_typewriter">pr</span>) and a noun phrase. All three output rules reverse the order
of the two nodes: the order of a determiner and a noun is reversed, the
order of a noun phrase and a prepositional phrase is reversed, and the order
of an adposition (preposition/postposition) and a noun phrase is reversed.
The action part of the rules (building up the target translation) appears
between braces <span class="ltx_text ltx_font_typewriter">{…}</span>. The indices, <span class="ltx_text ltx_font_typewriter">1</span> and <span class="ltx_text ltx_font_typewriter">2</span>,
indicate the position of the unit matched in the input, <span class="ltx_text ltx_font_typewriter">_</span>
represents a space in the output, and <span class="ltx_text ltx_font_typewriter">+</span> indicates that the words on
either side of it will be conjoined without a space. </figcaption><span class="ltx_inline-para ltx_minipage ltx_align_middle" style="width:433.6pt;">
<span id="S4.F4.p2" class="ltx_para ltx_align_center">
<span class="ltx_p">[Missing Figure: forest: 1]
</span>
</span></span>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>A source language parse tree for the phrase <em class="ltx_emph ltx_font_italic">the house by
the side of that road</em> built using the rules in
Figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. When no further application of the rules is
possible, this tree will be transformed into the tree shown in
figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>.</figcaption><span class="ltx_inline-para ltx_minipage ltx_align_middle" style="width:433.6pt;">
<span id="S4.F4.p3" class="ltx_para ltx_align_center">
<span class="ltx_p">[Missing Figure: forest: 2]
</span>
</span></span>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>The target language tree resulting from applying the action
steps of the rules in Figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> to the tree in
Figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a>. The analyses yielded by this tree will
generate the Basque phrase <em class="ltx_emph ltx_font_italic">kale haren ertzeko etxea</em> ‘the house by
the side of that road’. The final step of combining definite articles and
postpositions with the immediately preceding words is not
shown.</figcaption>
</figure>
<div id="S4.SS1.p1" class="ltx_para">
<p class="ltx_p">Given the range of possible syntactic structures, it is common for any two
languages to have significantly different word orders. For example, in Welsh,
verbs tend to be at the beginning of a sentence; in English they tend to be in
the middle; and in Kyrgyz, they tend to be at the end.</p>
</div>
<div id="S4.SS1.p2" class="ltx_para">
<p class="ltx_p">These differences are problematic for Apertium’s finite-state chunking module,
which matches fixed sequences of words that must be contiguous. This limitation
means it is fairly easy to write rules which perform operations such as changing
the order of nouns and adjectives, since these are usually adjacent, but
changing larger structures is much harder. Switching the order of the subject
and the main verb, for instance, would generally require writing a rule for each
sequence of words that can make up each of those parts. The English-Spanish pair
has more than 30 chunking rules for handling noun phrases with different numbers
of determiners and adjectives, and those rules don’t attempt to deal with all
structures that may occur in noun phrases, such as relative clauses.</p>
</div>
<div id="S4.SS1.p3" class="ltx_para">
<p class="ltx_p">To deal with the limitations of finite-state chunking, the
<span class="ltx_text ltx_font_typewriter">apertium-recursive</span> module <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib220" title="A tree-based structural transfer module for the Apertium machine translation platform" class="ltx_ref">38</a>]</cite> was developed by
Daniel Swanson as part of Google Summer of Code
2019<span id="footnote5" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup>
<span class="ltx_tag ltx_tag_note">5</span>
<a href="https://summerofcode.withgoogle.com/archive/2019/projects/6746718069063680/" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://summerofcode.withgoogle.com/archive/2019/projects/6746718069063680/</a>.</span></span></span>
to apply structural transfer rules recursively using context-free grammars
(CFGs) and a Generalized Left-right Right-reduce (GLR) parser. This makes it
possible to process nested structures such as relative clauses or prepositional
phrases within prepositional phrases. An example of the latter is shown in
Figures <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">3</span></a> and <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4</span></a>, with the relevant
rules in Figure <a href="#S4.F4" title="Figure 4 ‣ 4.1 Recursive structural transfer ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. In this example, the word order of a set
of nested prepositional phrases needs to be completely reversed (or in
linguistic terms, the order of noun phrases (NPs) and adpositional phrases (PPs)
each needs to be reversed), regardless of the number of prepositional phrases
involved in order to translate from English to Basque.</p>
</div>
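<div class="ltx_para">
<p class="ltx_p">The effect of the three rules on nested phrases can be sketched in Python. The sketch assumes a hand-built parse tree rather than an actual GLR parse, uses English words in place of translated lexical units, and introduces a hypothetical <span class="ltx_text ltx_font_typewriter">transfer</span> function; it illustrates only the recursive reordering and is not the module's implementation.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
# Output joiner for each rule, keyed by (parent, (child 1, child 2)).
# Every rule emits child 2 before child 1; "+" conjoins, " " stands for "_".
RULES = {
    ("NP", ("det", "n")): "+",
    ("NP", ("NP", "PP")): " ",
    ("PP", ("pr", "NP")): "+",
}


def transfer(node):
    """Recursively emit the reordered output for a parsed node."""
    category, content = node
    if isinstance(content, str):  # leaf: (tag, word)
        return content
    first, second = content
    joiner = RULES[(category, (first[0], second[0]))]
    return transfer(second) + joiner + transfer(first)


# Hand-built parse of "the house by the side of that road".
TREE = ("NP", [
    ("NP", [("det", "the"), ("n", "house")]),
    ("PP", [("pr", "by"), ("NP", [
        ("NP", [("det", "the"), ("n", "side")]),
        ("PP", [("pr", "of"), ("NP", [("det", "that"), ("n", "road")])]),
    ])]),
])

print(transfer(TREE))  # road+that+of side+the+by house+the
```
</pre>
<p class="ltx_p">Because <span class="ltx_text ltx_font_typewriter">transfer</span> calls itself on each child, the same three rules reverse the phrase regardless of how many prepositional phrases are nested.</p>
</div>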
<div id="S4.SS1.p4" class="ltx_para">
<p class="ltx_p">A recursive approach to transfer can be helpful for translation pairs between
syntactically more similar languages as well. For example, in the case of the
English-Spanish noun phrase rules mentioned above, the more than 30 rules
required for handling determiners, adjectives, and nouns can be simplified to
fewer than 10 rules in <span class="ltx_text ltx_font_typewriter">apertium-recursive</span> because more complicated
structures can be handled by composing simpler ones. In fact, the majority of
these can be covered by just 3 rules saying that a noun phrase is composed of a
noun, or an adjective and a noun phrase, or a determiner and a noun phrase.
These 3 rules can immediately handle any number of determiners and adjectives.</p>
</div>
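<div class="ltx_para">
<p class="ltx_p">To make the composition concrete, the following Python sketch recognizes noun phrases using only the three productions just described (a noun; an adjective followed by a noun phrase; a determiner followed by a noun phrase). The function name and tag names are hypothetical, and the output actions of the real rules are left out.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def is_noun_phrase(tags):
    """Recognise NP, where NP is n, or adj NP, or det NP."""
    if tags == ["n"]:
        return True
    if tags and tags[0] in ("adj", "det"):
        # Strip one modifier and recurse on the remaining noun phrase.
        return is_noun_phrase(tags[1:])
    return False


print(is_noun_phrase(["det", "adj", "adj", "n"]))  # True
print(is_noun_phrase(["det", "adj"]))              # False
```
</pre>
</div>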
</section>
<section id="S4.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.2 </span>Processing multi-word expressions</h3>
<div id="S4.SS2.p1" class="ltx_para">
<p class="ltx_p">Multi-word expressions (MWEs) are compound expressions composed of two or more
words, such as phrasal verbs (<em class="ltx_emph ltx_font_italic">take out</em>, <em class="ltx_emph ltx_font_italic">wake up</em>, <em class="ltx_emph ltx_font_italic">make a
call</em>) and phrasal nouns (<em class="ltx_emph ltx_font_italic">telephone pole</em>). Separable multi-word
expressions are those that may be split by an intermediary word or phrase (such
as <em class="ltx_emph ltx_font_italic">take out</em> in <em class="ltx_emph ltx_font_italic">take the trash out</em>). This phenomenon can be seen in
a number of languages. In English, the multi-word expression “take away” can remain
unified, such as in <em class="ltx_emph ltx_font_bold ltx_font_italic">take away the item</em>, or be split up, such as
in <em class="ltx_emph ltx_font_bold ltx_font_italic">take the item away</em>—both phrasings have identical meanings.
This phenomenon can also be seen in some German verbs, where the separable
particle can detach from its lexical core, such as with the separable verb
<em class="ltx_emph ltx_font_italic">anrufen</em> ‘to call’: <em class="ltx_emph ltx_font_bold ltx_font_italic">rufe meine Freundin an</em> ‘call my
friend’. See <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib46" title="Multiword expression processing: a survey" class="ltx_ref">8</a>]</cite> for more on this phenomenon.</p>
</div>
<div id="S4.SS2.p2" class="ltx_para">
<p class="ltx_p">Separable MWEs are particularly problematic for Apertium’s rule-based
translation. Prior to the introduction of the <span class="ltx_text ltx_font_typewriter">apertium-separable</span>
module, the individual components of both non-separable and separable
multi-words were translated as individual tokens, often leading to suboptimal
translations. For example, during the English-to-Spanish translation of
<em class="ltx_emph ltx_font_bold ltx_font_italic">take the trash away</em>, the phrase’s individual components were
translated to produce <em class="ltx_emph ltx_font_italic">tomar la basura fuera</em>, which is not a phrase that
native speakers of Spanish would produce. A better solution is to
process <em class="ltx_emph ltx_font_italic">take away</em> as a single unit in order to obtain the correct
expression <em class="ltx_emph ltx_font_italic">sacar la basura</em>. Similarly, the Arpitan verbal expression
<em class="ltx_emph ltx_font_italic">tornar fâre</em> ‘to redo’ has negative forms of the type <em class="ltx_emph ltx_font_italic">tornar pas
fâre</em> which were not previously recognised nor correctly generated.</p>
</div>
<div id="S4.SS2.p3" class="ltx_para">
<p class="ltx_p"><span class="ltx_text ltx_font_typewriter">Apertium-separable</span> provides a framework to address mistranslations
arising from this sort of non-contiguous word ordering.
Section <a href="#S4.SS2.SSS1" title="4.2.1 The apertium-separable module ‣ 4.2 Processing multi-word expressions ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.2.1</span></a> describes the module and
section <a href="#S4.SS2.SSS2" title="4.2.2 Usage ‣ 4.2 Processing multi-word expressions ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.2.2</span></a> describes its usage.</p>
</div>
<section id="S4.SS2.SSS1" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">4.2.1 </span>The apertium-separable module</h4>
<div id="S4.SS2.SSS1.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">apertium-separable</span> module was developed by Irene Tang as part of
Google Summer of Code
2017<span id="footnote6" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>
<span class="ltx_tag ltx_tag_note">6</span>
<a href="https://summerofcode.withgoogle.com/archive/2017/projects/4690909727817728/" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://summerofcode.withgoogle.com/archive/2017/projects/4690909727817728/</a>.</span></span></span>
to handle both contiguous and discontiguous (or “separable”) MWEs. The module
accepts an XML-format dictionary as input, which contains a list of phrase types
and a list of mappings between MWEs and their component elements—and in the
case of non-contiguous MWEs, a specification of the possible phrase types that
might separate the elements of the MWE.
</p>
</div>
<div id="S4.SS2.SSS1.p2" class="ltx_para">
<p class="ltx_p">As an example, one phrase-type entry that the <span class="ltx_text ltx_font_typewriter">eng</span> dictionary might
include is the definition of a noun phrase (NP) as (among other patterns) any
sequence of words such that the first contains a <span class="ltx_text ltx_font_typewriter">&lt;det&gt;</span> tag, the second an
<span class="ltx_text ltx_font_typewriter">&lt;adj&gt;</span> tag, and the last an <span class="ltx_text ltx_font_typewriter">&lt;n&gt;</span> tag. The <span class="ltx_text ltx_font_typewriter">eng</span> dictionary should
then also include an entry specifying how <em class="ltx_emph ltx_font_italic">take away</em> as an MWE followed by
such a noun phrase may be mapped to its component elements. These phrase-type
and vocabulary entries work together as a framework for handling MWEs.</p>
</div>
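<div class="ltx_para">
<p class="ltx_p">The interplay between a phrase-type entry and an MWE entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not use the module’s XML dictionary format or its finite-state implementation, the token representation and the function names <span class="ltx_text ltx_font_typewriter">match_np</span> and <span class="ltx_text ltx_font_typewriter">assemble</span> are hypothetical, and the second NP pattern (determiner plus noun) and the verb and adverb tags are assumed for the sake of the example.</p>
</div>

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Tuple[str, ...]]  # (lemma, tags)

# Hypothetical phrase-type table. Only the det-adj-n pattern comes from the
# text; det-n is an assumed example of the "other patterns".
NP_PATTERNS = [("det", "adj", "n"), ("det", "n")]


def match_np(tokens: List[Token], start: int) -> Optional[int]:
    """Return the end index (exclusive) of an NP starting at `start`, if any."""
    for pattern in sorted(NP_PATTERNS, key=len, reverse=True):
        end = start + len(pattern)
        window = tokens[start:end]
        if len(window) == len(pattern) and all(
            tag in tok[1] for tag, tok in zip(pattern, window)
        ):
            return end
    return None


def assemble(tokens: List[Token], first: str, second: str) -> List[Token]:
    """Rewrite `first NP second` as the single unit `first second` plus the NP."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        lemma, tags = tokens[i]
        np_end = match_np(tokens, i + 1) if lemma == first else None
        if (
            np_end is not None
            and np_end < len(tokens)
            and tokens[np_end][0] == second
        ):
            out.append((f"{first} {second}", tags))  # MWE keeps the verb's tags
            out.extend(tokens[i + 1:np_end])         # the separating NP follows
            i = np_end + 1
        else:
            out.append(tokens[i])
            i += 1
    return out


# "take the trash away", with assumed (illustrative) tags
sentence = [
    ("take", ("vblex", "inf")),
    ("the", ("det", "def")),
    ("trash", ("n", "sg")),
    ("away", ("adv",)),
]
assembled = assemble(sentence, "take", "away")
```

<div class="ltx_para">
<p class="ltx_p">In this sketch, <span class="ltx_text ltx_font_typewriter">assembled</span> begins with the single unit <em class="ltx_emph ltx_font_italic">take away</em>, followed by the noun phrase, so that a bilingual dictionary entry for the multi-word verb could apply.</p>
</div>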
<div id="S4.SS2.SSS1.p3" class="ltx_para">
<p class="ltx_p">The XML dictionary is compiled into a finite state transducer. As a parser feeds
the input text into the transducer one character or tag at a time, it looks out
for sequences of characters and tags that match anything in the dictionary. If a
match is found, then the parser outputs the corresponding substitution.</p>
</div>
<div id="S4.SS2.SSS1.p4" class="ltx_para">
<p class="ltx_p">Processors for this module may be included in two places in the Apertium RBMT
pipeline: immediately following morphological tagging and preceding lexical
transfer, and immediately following structural transfer and preceding
morphological generation. The former use allows “assembly” of source-language
MWEs for transfer, and the latter “disassembles” transferred target-language
MWEs for morphological generation.</p>
</div>
</section>
<section id="S4.SS2.SSS2" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">4.2.2 </span>Usage</h4>
<div id="S4.SS2.SSS2.p1" class="ltx_para">
<p class="ltx_p">This module can handle both contiguous and discontiguous multi-word
expressions. Processing seemingly simple contiguous MWEs in this way allows for
more robust bilingual dictionary entries with fairly vanilla morphological
transducers. For example, it may not make sense to have an entry for
<em class="ltx_emph ltx_font_italic">little brother</em> in an English morphological transducer that already
contains the component words, but it is useful to have an entry like this in a
bilingual dictionary with a language like Kyrgyz, which has two words for
brother that differ in meaning according to the brother’s age relative to a
sibling. In this situation, the <span class="ltx_text ltx_font_typewriter">apertium-separable</span> module processes
the analysis of <em class="ltx_emph ltx_font_italic">little brother</em> as an adjective and a noun
(<span class="ltx_text ltx_font_typewriter">^little&lt;adj&gt;$</span> <span class="ltx_text ltx_font_typewriter">^brother&lt;n&gt;&lt;sg&gt;$</span>) and
retokenizes it as a multi-word noun (<span class="ltx_text ltx_font_typewriter">^little brother&lt;n&gt;&lt;sg&gt;$</span>).
Note that the assembly of the MWE (as described
here) would occur in the English-Kyrgyz translation direction before bilingual
dictionary lookup, and the disassembly of the MWE (the reverse) would occur in
the Kyrgyz-English translation direction before morphological generation.</p>
</div>
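<div class="ltx_para">
<p class="ltx_p">The retokenization of <em class="ltx_emph ltx_font_italic">little brother</em> can be mimicked directly on the stream representation. The following Python sketch is an assumption-laden illustration, not the module’s implementation: the regular expression, the function name <span class="ltx_text ltx_font_typewriter">merge_contiguous</span>, and the choice to keep the tags of the second word are all hypothetical, and any text between lexical units is simply normalised to a single space.</p>
</div>

```python
import re

# One lexical unit: ^lemma<tag><tag>...$  (simplified; no ambiguity handling)
LU = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def merge_contiguous(stream: str, first: str, second: str) -> str:
    """Merge adjacent units `first` and `second` into one multi-word unit."""
    units = LU.findall(stream)  # list of (lemma, tags) pairs
    out = []
    i = 0
    while i < len(units):
        lemma, tags = units[i]
        if lemma == first and i + 1 < len(units) and units[i + 1][0] == second:
            # The merged unit takes the tags of the second (head) word.
            out.append(f"^{first} {second}{units[i + 1][1]}$")
            i += 2
        else:
            out.append(f"^{lemma}{tags}$")
            i += 1
    return " ".join(out)


merged = merge_contiguous("^little<adj>$ ^brother<n><sg>$", "little", "brother")
print(merged)
```

<div class="ltx_para">
<p class="ltx_p">Disassembly in the opposite translation direction would be the inverse mapping, from the multi-word noun back to an adjective and a noun.</p>
</div>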
<div id="S4.SS2.SSS2.p2" class="ltx_para">
<p class="ltx_p">The module is used extensively in the French-Catalan pair, particularly for the
verbal phrases included in the dictionaries. For example, the dictionaries specify
that <em class="ltx_emph ltx_font_italic">faire appel</em> ‘to do appeal’ should be translated as <em class="ltx_emph ltx_font_italic">apel·lar</em>
‘to appeal’. However, there are often adverbs between the verb <em class="ltx_emph ltx_font_italic">faire</em> and
the noun <em class="ltx_emph ltx_font_italic">appel</em>, for example when negated: <em class="ltx_emph ltx_font_italic">ne fait pas appel</em>
‘does not appeal’. The module is used to reorder the phrase before lexical
transfer as <em class="ltx_emph ltx_font_italic">ne fait appel pas</em> (with <em class="ltx_emph ltx_font_italic">fait appel</em> as a single lexical
unit). Since the adverb now follows the multi-word verb instead of appearing
between its components, structural transfer does not need to treat such a
sentence any differently than sentences containing single-word verbs. Similar
examples are found in the [unreleased] Kazakh-Kyrgyz pair.</p>
</div>
</section>
</section>
<section id="S4.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">4.3 </span>Anaphora resolution</h3>
<div id="S4.SS3.p1" class="ltx_para">
<p class="ltx_p">The <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> module was developed by Tanmai Khanna as part of
Google Summer of Code
2019<span id="footnote7" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>
<span class="ltx_tag ltx_tag_note">7</span>
<a href="https://summerofcode.withgoogle.com/archive/2019/projects/5434868157120512/" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://summerofcode.withgoogle.com/archive/2019/projects/5434868157120512/</a>.</span></span></span>
to handle anaphora resolution in the Apertium pipeline. Anaphora resolution is
the process of resolving references to earlier items in the discourse. This is
necessary in a Machine Translation pipeline as languages have different ways of
using anaphors, and sometimes it is necessary to know the antecedent of an
anaphor to translate it correctly.</p>
</div>
<div id="S4.SS3.p2" class="ltx_para">
<p class="ltx_p">For example, in Catalan, the masculine singular possessive determiner is
<em class="ltx_emph ltx_font_italic">el seu</em>. Its gender and number are inflectional properties relating to
how it agrees with nouns, but its referent may be any gender or number. Hence
it could be translated to English as any of <em class="ltx_emph ltx_font_italic">his/her/its/their</em>, the
gender and number of which relate to the referent and not to a modified noun. To
pick the correct translation in English, then, it is necessary to know what
<em class="ltx_emph ltx_font_italic">el seu</em> refers to. Without a module in an Apertium translation pipeline
to do this, a default translation of the anaphor appears in the target language.
For instance, in the case of English possessive determiners, the default is
currently <em class="ltx_emph ltx_font_italic">his</em>.</p>
</div>
<div id="S4.SS3.p3" class="ltx_para">
<p class="ltx_p">While there are several statistical methods to resolve anaphors using machine
learning, Apertium is focused on supporting low-resource language pairs, which
usually don’t have enough data available for these methods to be viable. Common
rule-based approaches, on the other hand, often use parse trees
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib107" title="An algorithm for pronominal anaphora resolution" class="ltx_ref">18</a>, <a href="#bib.bib16" title="CogNIAC: high precision coreference with limited knowledge and linguistic resources" class="ltx_ref">2</a>, <a href="#bib.bib226" title="A rule-based pronoun resolution system for French" class="ltx_ref">41</a>, <a href="#bib.bib108" title="Deterministic coreference resolution based on entity-centric, precision-ranked rules" class="ltx_ref">19</a>, <a href="#bib.bib117" title="Rule-based pronominal anaphora treatment for machine translation" class="ltx_ref">21</a>, <a href="#bib.bib261" title="When annotation schemes change rules help: a configurable approach to coreference resolution beyond ontonotes" class="ltx_ref">51</a>]</cite>.
The <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> module uses a rule-based approach to anaphora
resolution which neither requires training data nor relies on parse trees.
Based on Mitkov’s algorithm <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib135" title="Multilingual anaphora resolution" class="ltx_ref">25</a>]</cite>, it assigns saliency
scores to candidate antecedents in the context (the current sentence and the previous three).
The scores come from <span class="ltx_text ltx_font_bold">saliency indicators</span>: syntactic or lexical
cues that are expected to correlate with a higher or lower likelihood that a
candidate antecedent is the correct one, and that contribute positive or negative scores
respectively. For example, indefinite nouns can be given a small negative score
and proper nouns can be given a small positive score, as it has been shown
empirically that they are less or more likely to be the antecedent of anaphors,
respectively <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib135" title="Multilingual anaphora resolution" class="ltx_ref">25</a>]</cite>. After the scores of all the
indicators are applied, the candidate with the highest score, hence considered
most salient, is chosen as the antecedent. A complete example of this is
presented in section <a href="#S4.SS3.SSS2" title="4.3.2 Example Usage ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>. These saliency indicators are
added to the <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> module as manually written rules. The
rules are written for, and applied to, source-language forms only.
Because of this, a ruleset can be reused for multiple translation pairs with the
same source language.</p>
</div>
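<div class="ltx_para">
<p class="ltx_p">The scoring scheme described above can be sketched as follows. This is a minimal Python illustration, not the module’s code or its rule format: the <span class="ltx_text ltx_font_typewriter">Candidate</span> class, the indicator functions, the tag names, and the score values of plus or minus one are all assumptions chosen to mirror the indefinite-noun and proper-noun example.</p>
</div>

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass
class Candidate:
    lemma: str
    tags: FrozenSet[str]
    score: int = 0


Indicator = Callable[[Candidate], int]


def indefinite(c: Candidate) -> int:
    """Small negative score for indefinite nouns (assumed tag: 'ind')."""
    return -1 if "ind" in c.tags else 0


def proper_noun(c: Candidate) -> int:
    """Small positive score for proper nouns (assumed tag: 'np')."""
    return 1 if "np" in c.tags else 0


def resolve(candidates: List[Candidate], indicators: List[Indicator]) -> Candidate:
    """Sum all indicator scores per candidate and return the most salient one."""
    if not candidates:
        raise ValueError("no candidate antecedents in the context window")
    for c in candidates:
        c.score = sum(indicator(c) for indicator in indicators)
    # max() returns the first candidate among ties; a real system would need
    # an explicit tie-breaking policy.
    return max(candidates, key=lambda c: c.score)


candidates = [
    Candidate("book", frozenset({"n", "ind"})),
    Candidate("Maria", frozenset({"np"})),
    Candidate("table", frozenset({"n", "def"})),
]
best = resolve(candidates, [indefinite, proper_noun])
print(best.lemma, best.score)
```

<div class="ltx_para">
<p class="ltx_p">Because the indicators inspect only source-language information in this sketch, the same list of indicators could in principle be reused for any pair sharing that source language, in line with the reuse property noted above.</p>
</div>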
<div id="S4.SS3.p4" class="ltx_para">
<p class="ltx_p">In addition to the manually written rules, there is one universal indicator: the Referential
Distance indicator. This indicator, which was also discovered empirically
<cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib135" title="Multilingual anaphora resolution" class="ltx_ref">25</a>]</cite>, tells the algorithm that as the distance
between the anaphor and candidate antecedent increases, the candidate is less
likely to be the correct antecedent of the anaphor. Candidates
that are further from the anaphor are penalised as follows: candidates in the
same sentence as the anaphor receive a <span class="ltx_text ltx_font_typewriter">+1</span> score, candidates in the preceding
sentence a <span class="ltx_text ltx_font_typewriter">+0</span> score, candidates in the sentence before that a
<span class="ltx_text ltx_font_typewriter">-1</span> score, and so on.</p>
</div>
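<div class="ltx_para">
<p class="ltx_p">Read literally, the Referential Distance scores follow the pattern of one minus the number of sentences separating the candidate from the anaphor. The Python sketch below makes this explicit. The function name and the use of sentence indices are assumptions for illustration, and the extrapolation of “and so on” to the third preceding sentence is an inference from the stated pattern.</p>
</div>

```python
def referential_distance_score(anaphor_sentence: int, candidate_sentence: int) -> int:
    """Score a candidate by how many sentences before the anaphor it occurs."""
    distance = anaphor_sentence - candidate_sentence
    if distance < 0:
        raise ValueError("a candidate antecedent must not follow the anaphor's sentence")
    return 1 - distance  # same sentence: +1, preceding: 0, one before that: -1, ...


# Anaphor in sentence 3; context window = current sentence and previous three.
scores = [referential_distance_score(3, s) for s in (3, 2, 1, 0)]
print(scores)
```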
<div id="S4.SS3.p5" class="ltx_para">
<p class="ltx_p">In the next few sections, some unique features of this module are discussed
(<a href="#S4.SS3.SSS1" title="4.3.1 Some unique features ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3.1</span></a>), an example highlighting the process and benefit of
having anaphora resolution in the Machine Translation pipeline is shown
(<a href="#S4.SS3.SSS2" title="4.3.2 Example Usage ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>), a preliminary evaluation of the module is presented
(<a href="#S4.SS3.SSS3" title="4.3.3 Preliminary evaluation ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3.3</span></a>), and future work for this module is outlined
(<a href="#S4.SS3.SSS4" title="4.3.4 Future Work ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">4.3.4</span></a>).</p>
</div>
<section id="S4.SS3.SSS1" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">4.3.1 </span>Some unique features</h4>
<div id="S4.SS3.SSS1.p1" class="ltx_para">
<p class="ltx_p">Unlike Mitkov’s original algorithm <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib135" title="Multilingual anaphora resolution" class="ltx_ref">25</a>]</cite>, this module is
highly customisable. The linguistic patterns to be detected and the scores
to be assigned are all defined in an XML file specific to each translation
direction. These patterns help identify and rank potential antecedents, and can
include references to various types of surrounding words and even the anaphor
whose antecedent is being resolved. The translation pair developer also has the
ability to define multiple types of anaphors—such as possessive determiners,
reflexive pronouns, zero anaphors, etc.—so as to be able to write separate
rules for the resolution of each of them.
</p>
</div>
</section>
<section id="S4.SS3.SSS2" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">4.3.2 </span>Example Usage</h4>
<figure id="S4.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>A Catalan-English translation example which highlights a use case for <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span>.</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_justify ltx_th ltx_th_column ltx_border_tt" style="width:160.4pt;"><span class="ltx_text ltx_wrap ltx_font_bold">Input sentence (Catalan)</span></th>
<th class="ltx_td ltx_align_justify ltx_th ltx_th_column ltx_border_tt" style="width:238.5pt;">Els grups del Parlament han mostrat aquest dimarts <span class="ltx_text ltx_font_bold">el seu</span> suport al batle d’Alaró.</th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:160.4pt;"><span class="ltx_text ltx_wrap ltx_font_bold">Reference translation (English)</span></td>
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:238.5pt;">Parliamentary groups showed <span class="ltx_text ltx_font_bold">their</span> support for the mayor of Alaró on Tuesday.</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:160.4pt;"><span class="ltx_text ltx_wrap ltx_font_bold">Apertium translation without <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> (English)</span></td>
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:238.5pt;">The bands of the Parliament have shown this Tuesday <span class="ltx_text ltx_font_bold">his</span> support at the mayor of Alaró.</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:160.4pt;"><span class="ltx_text ltx_wrap ltx_font_bold">Apertium translation with <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> (English)</span></td>
<td class="ltx_td ltx_align_justify ltx_border_t" style="width:238.5pt;">The bands of the Parliament have shown this Tuesday <span class="ltx_text ltx_font_bold">their</span> support at the mayor of Alaró.</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_justify ltx_border_tt" style="width:160.4pt;"></td>
<td class="ltx_td ltx_align_justify ltx_border_tt" style="width:238.5pt;"></td>
</tr>
</tbody>
</table>
</figure>
<div id="S4.SS3.SSS2.p1" class="ltx_para">
<p class="ltx_p">A sample translation which highlights the usage of <span class="ltx_text ltx_font_typewriter">apertium-anaphora</span>
has been given in Table <a href="#S4.T1" title="Table 1 ‣ 4.3.2 Example Usage ‣ 4.3 Anaphora resolution ‣ 4 New modules ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. The source sentence goes through a
series of modules in the translation pipeline, as described in
Section <a href="#S2" title="2 Overview of the Apertium platform ‣ Recent advances in Apertium, a free / open-source rule-based machine translation platform for low-resource languages 1 footnote 1 1 footnote 1 Springer Open Access publication. This version from pre-print latex form does not contain some changes made in the editorial process. Published version available: https://link.springer.com/article/10.1007/s10590-021-09260-6" class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>. The output of the lexical selection module
contains a stream of lexical units, including the morphological analysis and the
translation of each lexical unit. This is taken as the input to the
<span class="ltx_text ltx_font_typewriter">apertium-anaphora</span> module. The lexical unit of the example anaphor,
<span class="ltx_text ltx_font_italic">el seu</span>, at this stage in the stream is as follows:</p>
</div>
<div id="S4.SS3.SSS2.p2" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter"> ^el seu&lt;det&gt;&lt;pos&gt;&lt;m&gt;&lt;sg&gt;/his&lt;det&gt;&lt;pos&gt;&lt;m&gt;&lt;sg&gt;$
</pre>
</div>
<div id="S4.SS3.SSS2.p3" class="ltx_para">
<p class="ltx_p">The antecedent of the possessive determiner <span class="ltx_text ltx_font_italic">el seu</span> is <span class="ltx_text ltx_font_italic">els
grups</span> ‘the groups’, which is plural, and hence it should be translated as
<span class="ltx_text ltx_font_italic">their</span> in English and not <span class="ltx_text ltx_font_italic">his</span>. The anaphora resolution module
attempts to resolve this anaphor and identify its antecedent by applying all
rules that match the context. For instance, the <span class="ltx_text ltx_font_typewriter">First NP</span> rule gives a
positive score to the first noun of the sentence (<span class="ltx_text ltx_font_italic">grups</span>), as the first
noun of a sentence is more likely to be the antecedent of an anaphor. The
<span class="ltx_text ltx_font_typewriter">Preposition NP</span> rule gives a negative score to a noun that is part of a
prepositional phrase (<span class="ltx_text ltx_font_italic">Parlament</span>), as a noun inside a prepositional
phrase is less likely to be the antecedent of an anaphor. Both of these
tendencies have been observed empirically <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib135" title="Multilingual anaphora resolution" class="ltx_ref">25</a>]</cite>, and
have been implemented as language-specific rules.</p>
</div>
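<div class="ltx_para">
<p class="ltx_p">The effect of the two rules on the example sentence can be traced with a short Python sketch. It is an illustration only: the real rules are written in the module’s XML format, whereas here the function <span class="ltx_text ltx_font_typewriter">score_nouns</span>, the coarse part-of-speech labels, the simplified analysis of the sentence, and the score values of plus or minus one are all assumptions.</p>
</div>

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (lemma, coarse part-of-speech label)


def score_nouns(tokens: List[Token]) -> Dict[str, int]:
    """Apply simplified First NP and Preposition NP rules to each noun."""
    scores: Dict[str, int] = {}
    seen_noun = False
    for i, (lemma, pos) in enumerate(tokens):
        if pos != "n":
            continue
        score = 0
        if not seen_noun:
            score += 1  # First NP: first noun of the sentence
            seen_noun = True
        # Preposition NP: look left past any determiners for a preposition
        j = i - 1
        while j >= 0 and tokens[j][1] == "det":
            j -= 1
        if j >= 0 and tokens[j][1] == "pr":
            score -= 1
        scores[lemma] = score
    return scores


# Simplified analysis of "Els grups del Parlament han mostrat aquest dimarts",
# i.e. the part of the sentence that precedes the anaphor "el seu".
tokens = [
    ("el", "det"), ("grup", "n"), ("de", "pr"), ("el", "det"),
    ("Parlament", "n"), ("haver", "v"), ("mostrar", "v"),
    ("aquest", "det"), ("dimarts", "n"),
]
scores = score_nouns(tokens)
best = max(scores, key=scores.get)
print(scores, best)
```

<div class="ltx_para">
<p class="ltx_p">Under these assumed scores, <span class="ltx_text ltx_font_italic">grup</span> ends up ahead of <span class="ltx_text ltx_font_italic">Parlament</span> and <span class="ltx_text ltx_font_italic">dimarts</span>. Since all three candidates are in the same sentence as the anaphor, a distance-based score would not change their ranking.</p>
</div>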
<div id="S4.SS3.SSS2.p4" class="ltx_para">
<p class="ltx_p">After all the rules have been applied to all candidate antecedents, the one with
the highest score is considered the most salient antecedent for the anaphor. If
the rules are successful, then the correct antecedent should have the highest
score (in this case, <span class="ltx_text ltx_font_italic">grups</span>, translated as <span class="ltx_text ltx_font_italic">bands</span>). The anaphora resolution module then
attaches this antecedent (in the target language) to the lexical unit of the
anaphor:
</p>
</div>
<div id="S4.SS3.SSS2.p5" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
^el seu&lt;det&gt;&lt;pos&gt;&lt;m&gt;&lt;sg&gt;/his&lt;det&gt;&lt;pos&gt;&lt;m&gt;&lt;sg&gt;/band&lt;n&gt;&lt;pl&gt;$
</pre>
</div>
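<div class="ltx_para">
<p class="ltx_p">The attachment step amounts to appending one more slash-separated field to the anaphor’s lexical unit. A minimal Python sketch of that string operation follows. The function name <span class="ltx_text ltx_font_typewriter">attach_antecedent</span> is hypothetical and the sketch ignores escaping and any other details of the stream format.</p>
</div>

```python
def attach_antecedent(lexical_unit: str, antecedent: str) -> str:
    """Append the target-language antecedent as an extra field of the unit."""
    if not (lexical_unit.startswith("^") and lexical_unit.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^...$")
    return f"{lexical_unit[:-1]}/{antecedent}$"


anaphor = "^el seu<det><pos><m><sg>/his<det><pos><m><sg>$"
resolved = attach_antecedent(anaphor, "band<n><pl>")
print(resolved)
```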