<!DOCTYPE html><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>GiellaLT infrastructure: A multilingual infrastructure for rule-based NLP</title>
<!--Generated on Thu Apr 20 14:43:17 2023 by LaTeXML (version 0.8.6) http://dlmf.nist.gov/LaTeXML/.-->
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">GiellaLT infrastructure: A multilingual infrastructures for rule-based
NLP<span id="footnote1" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>
<span class="ltx_tag ltx_tag_note">1</span>
This is Flammieâs draft, official version may differ. #1</span></span></span>
This work is licensed under a Creative Commons Attribution–NonCommercial-NoDerivatives
4.0 International Licence. Licence details:
<a href="http://creativecommons.org/licenses/by-nc-nd/4.0/" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://creativecommons.org/licenses/by-nc-nd/4.0/</a>.
Find the Publisher’s version from:
<a href="https://dspace.ut.ee/handle/10062/89595" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://dspace.ut.ee/handle/10062/89595</a>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Sjur Nørstebø Moshagen, Flammie Pirinen, Lene Antonsen, Børre Gaup, Inga
Mikkelsen, Trond Trosterud, Linda Wiechetek, Katri Hiovain-Asikainen
<br class="ltx_break">Department of Language and Culture
<br class="ltx_break">NO-9019 UiT The Arctic University of Norway, Norway
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">This article gives an overview of the GiellaLT infrastructure, the main parts of it, and
how it has been and can be used to support a large number of indigenous and minority
languages, from keyboards to speech technology and advanced proofing tools. Special focus is given to languages with few or non-existent digital resources, and it is
shown that many tools useful to the daily digital life of language communities can be
created with reasonable effort, even when you start from nothing. A time estimate is
given to reach alpha, beta and final status for various tools, as a guide to interested
language communities.</p>
<p class="ltx_p">Keywords: Infrastructure, spelling checkers, keyboards, rule-based, machine translation,
finite state transducers, constraint grammar</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">You know your language, you are technically quite confident, you want to support your
community by making language tools. But where do you start? How do you go about it?
How do you make your tools work in Word or Google Docs? The GiellaLT infrastructure
is meant to be a possible answer to these and similar questions: it is a tool chest to take you
from initial steps all the way to the final products delivered on computers or mobile phones
in your language community. In this respect, the GiellaLT infrastructure is world leading
in its broad support for many languages and tools, and in making them possible to develop
irrespective of the size of your language community.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">As an example, Inari Sami with 450 speakers is one of the over 130 languages in the
GiellaLT. Within the GiellaLT infrastructure Inari Sami linguists have been able to support
the Inari Sami language community with high quality tools like keyboards, a spellchecker,
MT, a smart dictionary and are now developing an Inari Sami grammar checker. All these
tools are available for Inari Sami speakers in the app stores of various operating systems,
and through the GiellaLT distribution system. In a revitalization perspective these tools are
crucial, but without an infrastructure supporting and reusing resources this would have been
impossible for the language community to solve on its own.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">The GiellaLT infrastructure supports development of various tools in several operating
systems, as shown in Table 1 — the tools are further described later on. Infrastructure setup,
existing resources and tools are available on Github and documented on
<a href="https://giellalt.github.io" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://giellalt.github.io</a>.</p>
</div>
<figure id="S1.T1" class="ltx_table">
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">Table 1: </span>
Tools built by the GiellaLT infrastructure, and their support for various systems.
Supported systems: Windows, macOS, Linux, iOS/iPadOS, Android, ChromeOS, browser,
MS Word, Google Docs, LibreOffice, REST/GraphQL API.</figcaption>
<table class="ltx_tabular ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column">Tools</th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column">Supported systems</th>
<th class="ltx_td ltx_align_left ltx_th ltx_th_column">Further info</th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t">Keyboards</td>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_border_t"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Spelling checkers</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Hyphenators</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Text tokenisation and analysis</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Grammar checking</td>
<td class="ltx_td"></td>
<td class="ltx_td ltx_align_left">Includes speller, LO only Linux</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">+
Machine translation</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Text-to-speech</td>
<td class="ltx_td"></td>
<td class="ltx_td ltx_align_left">planned, not ready</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Dictionaries</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Language learning</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left">Automatic installation and updating</td>
<td class="ltx_td"></td>
<td class="ltx_td"></td>
</tr>
</tbody>
</table>
</figure>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">When adding a new language to the GiellaLT infrastructure, the system builds a toy
model and sets it up with all language independent build files needed in order to make the
tools shown in Table 1. The main advantage of GiellaLT is thus that the quite substantial
work that has gone into writing the files needed for compiling, testing and integrating the
language models into the various tools. The language models differ from tool to tool in
systematic ways, in some cases a descriptive model (accepting de facto usage) is needed,
in other cases a normative one (accepting only forms according to the norm) is called for.
Part of the language model, e.g. the treatment of Arabic numerals and non-linguistic symbols is also re-used from language to language. A large part of any practical language technology project is the integration of the tools in user programs in different operative systems.
For the GiellaLT infrastructure this has been done for a wide range of cases, as shown in
Table 1. This makes it possible to make language technology tools for new languages without spending several man-years on building a general infrastructure.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">GiellaLT is developed by two research groups at UiT. At present, it includes support
and tools for 130 different languages, most of them extremely low resource like Inari Sámi
and Plains Cree. Such languages have few or no text resources, where the few that exist
typically are too noisy to be able to represent a norm. These and similar languages must
thus rely on mostly rule-based methods like the ones described in this article. Another point
to consider is that the smaller the language community, the larger the need of tools to
support using the language in writing. Having such an infrastructure is thus of crucial importance for the future of many of the world’s languages.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">Not only does the GiellaLT infrastructure offer a pipeline for building tools, it also supports the process in the other end: when the language work is approaching production quality, there exists a delivery and update ecosystem, making it easy to distribute the tools to
the user community. The infrastructure also contains tools to develop the linguistic data
and verify its integrity and quality automatically.</p>
</div>
<div id="S1.p7" class="ltx_para">
<p class="ltx_p">What is needed to utilise GiellaLT is a linguistic description of the language and a
skilled linguist, a native speaker to fill in the gaps in the description, and a programmer to
help with configuring the infrastructure for a new language and support the linguist in the
daily work. A language project will also benefit from activists who need the tools: they will
tell what they need, test the tools, and thus ensure the linguistic quality of the output. In
practice, every new language added to the infrastructure for which practical tools are developed will also require adaptations and additions in GiellaLT, thereby contributing to the
strength of the GiellaLT infrastructure.</p>
</div>
<div id="S1.p8" class="ltx_para">
<p class="ltx_p">Throughout this chapter of the book, we try to give an estimate of the expected amount
of work needed to reach a certain maturity level for tools and resources. The information
is given in side-bar info boxes, as seen below. The symbols and abbreviations used are as
follows:</p>
</div>
<div id="S1.p9" class="ltx_para">
<p class="ltx_p">⏱
mm: ¼ ߙ
ߚ :1 mm
1.0: 3-4 m</p>
</div>
<div id="S1.p10" class="ltx_para">
<ul id="S1.I1" class="ltx_itemize">
<li id="S1.I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S1.I1.i1.p1" class="ltx_para">
<p class="ltx_p">ߙ – alpha version, first useful but clearly unfinished
version<span id="footnote2" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>
<span class="ltx_tag ltx_tag_note">2</span>
A more precise definition of what these labels mean within the GiellaLT infrastructure is available
at https://giellalt.github.io/MaturityClassification.html</span></span></span></p>
</div>
</li>
<li id="S1.I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S1.I1.i2.p1" class="ltx_para">
<p class="ltx_p">ߚ – beta version, getting ready, but some polish still needed</p>
</div>
</li>
<li id="S1.I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S1.I1.i3.p1" class="ltx_para">
<p class="ltx_p">1.0 – final version, released to the language community</p>
</div>
</li>
<li id="S1.I1.i4" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div id="S1.I1.i4.p1" class="ltx_para">
<p class="ltx_p">mm – man month, one month’s work</p>
</div>
</li>
</ul>
<p class="ltx_p">The estimates given are based on our experience, but a number of external factors will
influence the actual time a given project takes. We still believe that having an indication of
expected work estimates can be useful when planning language technology projects. It
should be emphasized that all estimates assume that the GiellaLT infrastructure and support system is being used.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>An overview of the infrastructure</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The GiellaLT infrastructure contains all the necessary pieces to go from linguistic source
code to install-ready products. Most of the steps are automated as part of a continuous
integration & continuous delivery (CI/CD) system, and language-independent parts
are included from various independent
repositories<span id="footnote3" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>
<span class="ltx_tag ltx_tag_note">3</span>
<a href="https://github.com/divvun" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://github.com/divvun</a></span></span></span>.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">The infrastructure is in practice a way of organising all the linguistic data and scripts in
a manner that is easily maintainable by humans who work on various aspects of the text
and then can be systematically built into ready-to-use software products by automated
tools. The way we approach this in practice changes from time to time as the software
engineering ecosystem develops; however, the organisation of the system aims to keep the
linguistic data more constant, as the linguistics do not change at the pace software engineering tools. The crux of the infrastructure in the large scale is though having the right
</p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">files of linguistic data in the right place. That is, standardisation of the folder and filenames
and standardisation of the analysis tags are main features of the infrastructure.
The data is stored in the version control system git<span id="footnote4" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup>
<span class="ltx_tag ltx_tag_note">4</span>
</span></span></span>, hosted by GitHub<span id="footnote5" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup>
<span class="ltx_tag ltx_tag_note">5</span>
</span></span></span>. In GitHub the
data is organised in repositories. Each repository is a unit of mostly self-standing tools and
source code.</p>
</div>
<div id="S2.p4" class="ltx_para">
<p class="ltx_p">There are a few different kinds of linguistic repositories in our infrastructure, mainly
ones for keyboard development (keyboard repositories) and ones for development of morphological dictionaries, grammars, and all other linguistic data (language repositories);
these repositories are language specific, i.e., there’s one repository for each
language<span id="footnote6" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>
<span class="ltx_tag ltx_tag_note">6</span>
</span></span></span>. The
different repository types and the content they contain, along with their structure, are described
in the following sections.</p>
</div>
<section id="S2.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>Keyboard repositories</h3>
<div id="S2.SS1.p1" class="ltx_para">
<p class="ltx_p">Keyboards enable us to enter text into digital devices. Without keyboards, no text. This is
a very real obstacle for the majority of the languages of the world, the ones with no keyboards. It thus needs to be easy to create and maintain keyboard definitions and make them
available to users.
In the GiellaLT infrastructure, keyboard definitions have their own repositories (using
the repository name pattern keyboard-*) that contain the linguistic data
defining the layout of the keyboards of the languages, and all metadata to build
the final installation packages. The repositories are organised as a bundle,
which is consumed by the tool kbdgen<span id="footnote7" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>
<span class="ltx_tag ltx_tag_note">7</span>
</span></span></span>.
</p>
</div>
<div id="S2.SS1.p2" class="ltx_para">
<p class="ltx_p">The bundle structure is as follows:</p>
<pre class="ltx_verbatim ltx_font_typewriter">
sma.kbdgen
  layouts             # actual layout definitions
    sma-NO.yaml       # desktop layout for Norway
    sma-SE.yaml       # ditto for Sweden; they are different
    sma.yaml          # mobile keyboard, identical for SE/NO
  project.yaml        # top-level metadata
  resources           # platform specific resources
    mac
      icon.sma-NO.png
      icon.sma-SE.png -> icon.sma-NO.png
  targets             # metadata for various platforms
    android.yaml
    ios.yaml
    mac.yaml
    win.yaml
</pre>
</div>
<div id="S2.SS1.p3" class="ltx_para">
<p class="ltx_p">The layout definitions are described in the next chapter.
Based on the layouts and metadata, kbdgen builds installers, packages, or suitable target files for the following systems: macOS, Windows, Linux (X11, IBus m17n), Android,
iOS/iPadOS, and ChromeOS. For macOS and Windows, installers are readily available via
Divvun’s package manager Páhkat, further described towards the end of this chapter. For
iOS and Android, layouts are included in one or both of two keyboard apps: Divvun Keyboards, and Divvun Dev Keyboards. Divvun Dev Keyboards functions as a testing and
development ground, whereas production ready keyboards go into the Divvun Keyboards
app. All of this is done automatically or semi-automatically using CI and CD servers (CI = continuous integration, CD = continuous delivery; Shahin et al. 2017).</p>
</div>
<div id="S2.SS1.p4" class="ltx_para">
<p class="ltx_p">The kbdgen tool also supports generating SVG files for debugging and documentation,
as well as exporting the layouts as xml files suitable for upload to CLDR. And finally, it
can also generate finite state error models for keyboard-based typing errors, giving suitable
penalties to neighbouring letters based on the layout.</p>
</div>
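<div class="ltx_para">
<p class="ltx_p">To illustrate the idea of layout-based error models, here is a minimal Python sketch (not the kbdgen implementation; the rows and the penalty value are invented for illustration) that derives neighbour-substitution penalties from keyboard rows:</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Derive neighbour-substitution penalties from keyboard rows.  The rows and
# the penalty value are invented for illustration; kbdgen's real output is a
# weighted finite-state error model, not a Python dictionary.
ROWS = [
    "qwertyuiop",
    "asdfghjkl",
    "zxcvbnm",
]

def neighbour_pairs(rows, penalty=0.8):
    """Return {(typed, intended): penalty} for horizontally adjacent keys."""
    pairs = {}
    for row in rows:
        for a, b in zip(row, row[1:]):
            pairs[(a, b)] = penalty
            pairs[(b, a)] = penalty
    return pairs

model = neighbour_pairs(ROWS)
print(model[("a", "s")])   # 0.8: typing s for a is a cheap, likely slip
</pre>
</div>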
<div id="S2.SS1.p5" class="ltx_para">
<p class="ltx_p">The Windows installer includes a tool to register unknown languages, so that even languages never seen on a Windows computer will be properly registered, and thus making
Windows ready to support proofing tools and other language processing tools for those
languages.</p>
</div>
</section>
<section id="S2.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>Language repositories</h3>
<div id="S2.SS2.p1" class="ltx_para">
<p class="ltx_p">The language repositories contain lexical data, grammar rules, morphophonology, phonetics, etc., anything linguistic and specific to the language or even specific to the tools is built
from the language data.</p>
</div>
<div id="S2.SS2.p2" class="ltx_para">
<p class="ltx_p">The language repositories, using a repository name pattern of lang-*, contain the
whole dictionaries of the languages, laid out in a format that can be compiled into the NLP
tools we provide. To achieve this, the lexical data has to be rich enough to achieve inflecting
dictionaries, that is, the words have to be added some information of their inflectional patterns for example. In practice, there is an unlimited amount of information that can be recorded per dictionary word that can be interesting. So in practice, this central part of the
language repository becomes like a lexical database of linguistic data. On top of that we
need different kinds of rules governing morphographemics, phonology, syntax, semantics
and so forth.
</p>
</div>
<div id="S2.SS2.p3" class="ltx_para">
<p class="ltx_p">In practice, writing a finite state (see 2.2.1) standardised language model in src/fst/ will provide
the user with the basis for all the NLP tools we can
build. To draw a parallel on how this works, if one is
familiar with java programming for example, this is
akin putting your maven-based project into
src/java/ or rust-based configurations in
Cargo.toml etc. would be a software engineering
interpretation of what an infrastructure is.</p>
</div>
<div id="S2.SS2.p4" class="ltx_para">
<p class="ltx_p">We also have some standards as to how to tag specific linguistic phenomena, as well as other lexical information. The linguistic software we write is in part
based on that similar phenomena are marked in same
manner in all languages. This ensures that components that are language-independent work the best. If
specific languages deviate from some standards, it practically can mean that for those languages specific exceptions need to be written for every application. This is especially clear
when working with such grammar-based machine translation, even a small mismatch in
marking the same structures makes the translation fail whereas systematic use of standard
annotations makes everything work automatically.
8 CI = continuous integration, CD = continuous delivery (Shahin et al, 2017)
The language repositories follow a specific template, structure, and practice to make
building everything easier:</p>
<pre class="ltx_verbatim ltx_font_typewriter">
lang-sme
  docs
  src
    cg3
    filters
    fst
  tools
    spell-checkers
    grammarcheckers
    mt
</pre>
</div>
<section id="S2.SS2.SSS1" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">2.2.1 </span>Morphological analysis</h4>
<div id="S2.SS2.SSS1.p1" class="ltx_para">
<p class="ltx_p">The underlying format for the linguistic models in the GiellaLT infrastructure is based on
finite state morphology (FST), combining the lexc and twolc programming languages
(Koskenniemi 1983, Beesley and Karttunen 2003). This is no accident: These programming
languages as well as the Constraint Grammar formalism presented in the next subsection
were all developed for Finnish, a language with a complex grammar and many skilled computational linguists. The contribution of the persons behind GiellaLT has been to port these
compilers into open formats and set them up in an integrated infrastructure. Most GiellaLT
languages are of no interest to commercial language technology companies, and the infrastructure thus contains pipelines for all aspects of language technology.</p>
</div>
<div id="S2.SS2.SSS1.p2" class="ltx_para">
<p class="ltx_p">The morphology is written as sets of entries,
where each entry contains two parts (divided by
space and terminated by semicolon). The first part
contains a pair of two symbol strings (to the left and
the right of the colon, called the upper and lower
level). The part after the space is a pointer to a lexicon containing a set of entries, so that the content of
all entries in this lexicon is concatenated to the content of all entries
pointing to it. The symbol <code class="ltx_verbatim ltx_font_typewriter">#</code> has
special status, denoting the end of the string. In the
text box to the right, the entry kana:kana is directed
to the lexicon n1. This lexicon contains 3 entries,
each pointing to the lexicon clitics. These 3 lexica
will then give rise to 3*3*4=36 distinct forms. The
upper level of the entries contains lemmas and morphosyntactic tags, whereas the lower level contains
stems and affixes. It may also contain archiphonemes, such as <code class="ltx_verbatim ltx_font_typewriter">^A</code> representing front ä and back a, and triggers for morphophonological
processes. In this example <code class="ltx_verbatim ltx_font_typewriter">^WG</code> triggers the weak grade of the consonant gradation process,
a process which in this example assimilates nt into nn in certain grammatical contexts (here:
genitive and inessive singular).</p>
</div>
<div id="S2.SS2.SSS1.p3" class="ltx_para">
<p class="ltx_p">The morphographemics is taken care of in a separate finite state transducer, written in a
separate language, twolc, in a separate file (src/fst/phonology.twolc):</p>
</div>
<div id="S2.SS2.SSS1.p4" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
Alphabet
a b c ... %^U:y %^A:ä %^O:ö ;
Sets
Vow = a e i o u y %^U %^A %^O ;
Rules
"vowel harmony"
%^A:a <=> [a|o|u] \[ä|ö|y]* %> (\[ä|ö|y]* %>) \[ä|ö|y]* _ ;
"Consonant Gradation nt:nn"
t:n <=> n _ Vow: %^WG: ;
</pre>
</div>
<div id="S2.SS2.SSS1.p5" class="ltx_para">
<p class="ltx_p">Twolc defines the alphabet and sets of the model. Also, this transducer has an upper and
lower level, so that the upper level of this transducer is identical to the lower level of the
morphological (lexc) transducer. The rule format <code class="ltx_verbatim ltx_font_typewriter">A:B <=> L _ R ;</code> denotes that there is
a relation between upper level A and lower level B when occurring between the contexts L
and R. The result is that non-concatenative processes such as vowel harmony and consonant
gradation are done in a morphographemic transducer separate from the morphological one.</p>
</div>
<div id="S2.SS2.SSS1.p6" class="ltx_para">
<p class="ltx_p">⏱
ߙ :1-2 mm
ߚ :6 mm
1.0: 12 mm</p>
</div>
<div id="S2.SS2.SSS1.p7" class="ltx_para">
<p class="ltx_p">An excerpt from the lexc file
for Kven nouns (the file
src/fst/stems/nouns.lexc
in the infrastructure):</p>
</div>
<div id="S2.SS2.SSS1.p8" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
LEXICON Nouns
kana:kana n1 "hen" ;
kynä:kynä n1 "pen" ;
hinta:hinta n1 "price" ;
LEXICON n1
+N+Sg+Nom: clitics ;
+N+Sg+Gen:^WG%>n clitics ;
+N+Sg+Ine:^WG%>ss^A clitics ;
...
LEXICON clitics
# ;
+Qst:%>k^O # ;
+Foc/han:%>h^An # ;
+Foc/ken:%>kin # ;
</pre>
</div>
<div id="S2.SS2.SSS1.p9" class="ltx_para">
<p class="ltx_p">These two transducers are then compiled into one transducer, containing string pairs like
for example hinta+N+Sg+Ine:hinnassa, where the intermediate representation
<code class="ltx_verbatim ltx_font_typewriter">hinta^WG>ss^A</code> is not visible in the resulting transducer. The result is a model containing
the pairs of all and only the grammatical wordforms in the language and their corresponding lemma and grammatical analyses.</p>
</div>
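<div class="ltx_para">
<p class="ltx_p">As an illustration of what the composition of the two transducers achieves for this example, the following Python sketch (a toy stand-in for the HFST compilation, covering only this one example) concatenates the lexicon entries and then applies the two morphographemic processes:</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Toy illustration (not the HFST compilation) of what the lexc and twolc
# sources above produce for hinta+N+Sg+Ine : hinnassa.
nouns   = {"hinta": "hinta"}            # upper:lower stem
n1      = {"+N+Sg+Ine": "^WG>ss^A"}     # upper:lower ending
clitics = {"": ""}                      # the empty clitic (#)

def realise(lower):
    """Apply the two morphographemic processes of the example."""
    if "^WG" in lower:                  # weak grade: nt -> nn in the stem
        lower = lower.replace("nt", "nn").replace("^WG", "")
    stem = lower.split(">")[0]          # vowel harmony: back a after a back stem
    lower = lower.replace("^A", "a" if any(v in stem for v in "aou") else "ä")
    return lower.replace(">", "")       # drop the morpheme boundary

for stem_up, stem_lo in nouns.items():
    for infl_up, infl_lo in n1.items():
        for cl_up, cl_lo in clitics.items():
            print(stem_up + infl_up + cl_up, ":",
                  realise(stem_lo + infl_lo + cl_lo))   # hinta+N+Sg+Ine : hinnassa
</pre>
</div>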
<div id="S2.SS2.SSS1.p10" class="ltx_para">
<p class="ltx_p">The actual amount of work needed to get to a reasonable quality will vary depending on
the complexity of the language, available electronic resources, existing documentation in
the form of grammars and dictionaries, and experience, but based on previous projects a
reasonable first version with decent coverage can be made in about six months. For good
coverage, one should estimate at least a year of work.</p>
</div>
</section>
<section id="S2.SS2.SSS2" class="ltx_subsubsection">
<h4 class="ltx_title ltx_title_subsubsection">
<span class="ltx_tag ltx_tag_subsubsection">2.2.2 </span>Morphosyntactic disambiguation</h4>
<div id="S2.SS2.SSS2.p1" class="ltx_para">
<p class="ltx_p">Most wordforms are grammatically ambiguous, such as the English verb and noun walks.
The correct analysis is in most cases clear from the context. The form walks may e.g. occur
after determiners (and be a noun) or after personal pronouns (and be a verb). More complex
grammars typically contain more homonymy, such as the South Saami leah ‘they/you are’,
which has four different analyses with the same lemma, three of them finite verb analyses and
one of them a non-finite analysis, the con-negative verb reading:</p>
</div>
<div id="S2.SS2.SSS2.p2" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
"<leah>"
"lea" V IV Ind Prs Pl3 <W:0.0>
"lea" V IV ConNeg <W:0.0>
"lea" V IV Imprt Sg2 <W:0.0>
"lea" V IV Ind Prs Sg2 <W:0.0>
</pre>
</div>
<div id="S2.SS2.SSS2.p3" class="ltx_para">
<p class="ltx_p">Within the GiellaLT infrastructure, disambiguation and further analysis of text is made
with constraint grammar (Karlsson 1990) compiled with the free open-source implementation VISLCG-3 (Bick 2015). The morphological output of the transducer feeds into a chain
of CG-modules that do disambiguation, parsing, and dependency analysis. The analysis
may be sent to applied CG modules, e.g., for grammar checking.
Syntactic rules of the parser disambiguate the morphological ambiguity and can add
syntactic function tags. In the following sentence the context shows that leah should be
con-negative based on the negation verb ij to the left.</p>
</div>
<div id="S2.SS2.SSS2.p4" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
Im leah.
NEG.1sg.prs be.con-negative
"I am not./No."
"<Im>"
"ij" V IV Neg Ind Prs Sg1 <W:0.0> @+FAUXV
:
"<leah>"
"lea" V IV ConNeg <W:0.0> @-FMAINV
"lea" V IV ConNeg <W:0.0> @-FMAINV
; "lea" V IV Imprt Sg2 <W:0.0>
; "lea" V IV Ind Prs Pl3 <W:0.0>
; "lea" V IV Ind Prs Sg2 <W:0.0>
IFF ConNeg (*-1 Neg BARRIER CC OR COMMA OR ConNeg);
</pre>
</div>
<div id="S2.SS2.SSS2.p5" class="ltx_para">
<p class="ltx_p">The rule that is responsible for removing all the other readings of leah and only picking the
con-negative (ConNeg) reading is an IFF rule that refers to the lemma lea “be” in its ConNeg form and a negation verb to the left, without any conjunction (CC) or comma or
another ConNeg form in between. The rule here is simplified; there are more constraints
for special cases. The IFF operator either selects the ConNeg reading if the constraints are
true, or removes it if the constraints are not true.</p>
</div>
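<div class="ltx_para">
<p class="ltx_p">As a rough illustration of the logic of this IFF rule, the following Python sketch (invented token structures, not VISLCG-3) selects or removes the ConNeg reading depending on whether a negation verb is found to the left before a barrier:</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Toy illustration of the IFF logic: select the ConNeg reading of "leah" iff a
# negation verb is found to the left with no CC, comma or other ConNeg in
# between; otherwise remove it.  Token structures are invented for this sketch.
sentence = [
    {"form": "Im",   "readings": [{"lemma": "ij",  "tags": {"V", "Neg", "Ind", "Prs", "Sg1"}}]},
    {"form": "leah", "readings": [
        {"lemma": "lea", "tags": {"V", "Ind", "Prs", "Pl3"}},
        {"lemma": "lea", "tags": {"V", "ConNeg"}},
        {"lemma": "lea", "tags": {"V", "Imprt", "Sg2"}},
        {"lemma": "lea", "tags": {"V", "Ind", "Prs", "Sg2"}},
    ]},
]
BARRIER = {"CC", "COMMA", "ConNeg"}

def iff_conneg(sent, i):
    condition = False
    for tok in reversed(sent[:i]):                       # scan leftwards (*-1)
        tags = set().union(*(r["tags"] for r in tok["readings"]))
        if "Neg" in tags:
            condition = True
            break
        if tags & BARRIER:                               # blocked by the barrier
            break
    keep = (lambda r: "ConNeg" in r["tags"]) if condition else (lambda r: "ConNeg" not in r["tags"])
    sent[i]["readings"] = [r for r in sent[i]["readings"] if keep(r)]

iff_conneg(sentence, 1)
print([r["tags"] for r in sentence[1]["readings"]])      # only the ConNeg reading remains
</pre>
</div>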
<div id="S2.SS2.SSS2.p6" class="ltx_para">
<p class="ltx_p">⏱
ߙ :3 mm
ߚ :8 mm
1.0: 18 m
</p>
</div>
<div id="S2.SS2.SSS2.p7" class="ltx_para">
<p class="ltx_p">As the analysis is moved further and further away from the language-specific morphology, the rules become increasingly language independent. Antonsen et al (2010) have
shown that whereas disambiguation should be done by language-specific grammars,
closely related languages may share the same function grammars (assigning roles like subject, object). The dependency grammar was shown to be largely language independent.</p>
</div>
</section>
</section>
<section id="S2.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Other repository types</h3>
<div id="S2.SS3.p1" class="ltx_para">
<p class="ltx_p">The GiellaLT infrastructure contains several other repository types, although not as structured as the keyboard and language repositories. Corpus repositories (repository name pattern corpus-*) contain corpus texts, mostly in original format, with metadata and conversion instructions in an accompanying xsl file. This is done to make it easy to rerun
conversions as needed. The corpus tools and processing are further described later on.</p>
</div>
<div id="S2.SS3.p2" class="ltx_para">
<p class="ltx_p">There are a few repositories with shared linguistic data (repository name pattern
shared-*). Typically, they contain proper names shared among many language repositories, definition of punctuation symbols, numbers, etc. The shared data is included in the
regular language repositories by reference.</p>
</div>
<div id="S2.SS3.p3" class="ltx_para">
<p class="ltx_p">Both keyboard and language repositories are structurally set up and maintained using a
templating system, for which we have two template repositories (repository name pattern
template-*). Updates to the build system, basic directory structure, and other shared
properties are propagated from the templates to all repositories using the tool gut. This
allows all supported repositories to grow in tandem when new features and technologies
are introduced. It ensures relatively low-cost scaling in terms of features and abilities
for each language, so that a new feature or tool with general usability can easily be introduced to all languages.</p>
</div>
<div id="S2.SS3.p4" class="ltx_para">
<p class="ltx_p">There are separate repositories for speech technology projects (repository name pattern
speech-*). As speech technology is a quite recent addition to the GiellaLT, these repositories are not standardised in their structure yet, and have no templates to support setup
and updates. As we gain more experience in this area, we expect to develop our support for
speech technologies.</p>
</div>
</section>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Linguistic tools and software</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">The main point of the GiellaLT infrastructure is to provide tools for language communities.
In this chapter we present the tools that are most prominent and have been most used and
useful for our users: keyboards, grammar and spell checking and correction, dictionaries,
linguistic analysis, and machine translation between or from the minority languages. Also,
language learning and speech synthesis are covered, and finally we show how to distribute
the tools to the language communities and to ensure that the tools stay up to date.
We also show what constitutes the starting point for building a software tool for your
language as well as the prerequisites needed for getting that far. Finally, we show some
ideas that are under construction or showing promising results, ideas for which we cannot
yet provide an exact recipe on how to build working systems.</p>
</div>
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">⏱
ߙ1/ :₂₀ mm
mm: ¼ ߚ
1.0: 1 mm
⏱
ߙ :1-3 mm
ߚ :6 mm
1.0: 12 mm
</p>
</div>
<section id="S3.SS1" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>Keyboards</h3>
<div id="S3.SS1.p1" class="ltx_para">
<p class="ltx_p">To be able to type and write a language, you need a keyboard. Using the tool kbdgen, one
can easily specify a keyboard layout in a YAML file, mimicking the actual layout of the
keyboard. The listing below shows the definition of the Android mobile keyboard layout
for Lule Sámi. The kbdgen tool takes this definition and a bit of metadata, combines it
with code for an Android keyboard app, compiles everything, signs the built artefact and
uploads it to the Google Play Store, ready for testing.</p>
</div>
<div id="S3.SS1.p2" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
modes:
  android:
    default: |
      w e r t y u i o p
      a s d f g h j k l
      z x c v b n m
</pre>
</div>
<div id="S3.SS1.p3" class="ltx_para">
<p class="ltx_p">Additional metadata is for example language name in the native language, names for some
of the keys, icons for easy recognition of the active keyboard, and references to speller
files, if available.</p>
</div>
<div id="S3.SS1.p4" class="ltx_para">
<p class="ltx_p">The YAML files can contain definitions for dead key sequences for desktop keyboards,
as well as long-press popup definitions for touch-screen keyboards. The newest version of
the kbdgen tool supports various physical layouts for desktop keyboards, to allow for non-ISO keyboard definitions. Further details for the layout specification can be found in the
kbdgen documentation<span id="footnote8" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup>
<span class="ltx_tag ltx_tag_note">8</span>
</span></span></span>.</p>
</div>
<div id="S3.SS1.p5" class="ltx_para">
<p class="ltx_p">The overall goal is to make it as easy as possible for linguists to write a keyboard definition and get it into the hands of the language community. A first draft can be created in
less than a day, and a good keyboard layout for most operating systems will take about a
week to develop, with a couple of weeks more for testing, feedback, and adjustments.</p>
</div>
<div id="S3.SS1.p6" class="ltx_para">
<p class="ltx_p">The keyboard infra in GiellaLT works well for any alphabetic and syllabary-based writing system, essentially everything except iconographic and similar systems.</p>
</div>
<div id="S3.SS1.p7" class="ltx_para">
<p class="ltx_p">Because of technical limitations by Apple and Google, it is not possible to create keyboard definitions for external, physical keyboards for Android tablets and iPads. Our onscreen keyboards work as they should even when a physical keyboard is attached to such
tablets.</p>
</div>
</section>
<section id="S3.SS2" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>Spell-checking and correction</h3>
<div id="S3.SS2.p1" class="ltx_para">
<p class="ltx_p">In GiellaLT, spell-checking and correction are built on a language model (in the form of a
finite-state transducer) for the language in question. A spell-checker is a mechanism that
recognises wordforms that do not belong to a dictionary of all acceptable wordforms in the
language and tries to suggest the most likely wordforms for the unknown ones. The
GiellaLT spellcheckers differ from this approach in not containing a list of wordforms, but
rather a list of stems with a pointer to concatenative morphology (see also the section 2.2.2
on language model repositories for specifics on what this looks like and is built) as well as
a separate morphophonological component. The resulting language model should then recognise all and only the correct wordforms of the language in question. Since the language
model is dynamic it is also able to recognise wordforms resulting from productive but unlexicalized morphological processes, such as dynamic compounding or derivation. In languages like German or Finnish, one can freely combine for example words like kissa ‘cat’,
bussi ‘bus’ and kauppias ‘salesman’ into kissabussikauppias that will be acceptable for
language users and thus should be supported by the spell-checker. The main challenge
when building a spell-checker and corrector is to have a good coverage in the lemma list,
and in our experience, several months of work in dictionary building or around 10,000 wellselected words will suffice for a good entry-level spell-checker. Since the FST also will be
used for analysing texts (chapter 2.2), it will be necessary to compile a normative version
of the FST, which excludes non-normative forms. They are excluded by means of tags.</p>
</div>
<div id="S3.SS2.p2" class="ltx_para">
<p class="ltx_p">The mechanism for correcting wrongly spelled words into correct ones is called error
correction. The most basic error correction model is based upon the so-called Levenshtein
distance (Levenshtein 1965): For an unknown wordform, suggest known wordforms resulting from one of the following operations: delete character, add character, exchange
character with another one, or swap the order of two characters. The benefit of this baseline
model is that it is language independent, so we can use it for all languages as a starting point.</p>
</div>
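<div class="ltx_para">
<p class="ltx_p">A minimal sketch of this baseline in Python (illustrative only; the toy word list stands in for the FST-based language model):</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Baseline error correction: generate all strings at Levenshtein distance 1
# from the misspelling and keep those accepted by the language model.
# Here the "language model" is a toy set of word forms; in GiellaLT it is the FST.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáäöå"
ACCEPTED = {"hinta", "hinnassa", "kana", "kanat"}       # invented toy lexicon

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = {a + b[1:]               for a, b in splits if b}
    swaps    = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:]           for a, b in splits if b for c in ALPHABET}
    inserts  = {a + c + b               for a, b in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts

def suggest(word):
    return sorted(edits1(word) & ACCEPTED)

print(suggest("hnta"))    # ['hinta']
</pre>
</div>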
<div id="S3.SS2.p3" class="ltx_para">
<p class="ltx_p">Spell-checking and correction can be improved on a per language basis drawing from
the knowledge of the language and its users. One of the most basic ways of improving the
quality of spelling corrections is improving the error modelling. If we know what kind of
errors users make and why, we can ensure that the relevant corrections are suggested more
commonly. Segment length, especially consonant length, may be hard to perceive and
hence likely to be misspelled. Doubling and omitting identical letters are thus built into
most error models. The same goes for parallel diphthongs such as North Saami uo/oa, ie/ea,
where L2 writers may be likely to pick the wrong one. By making the diphthong pairs part
of the error model the errors can easily be corrected. The position in the word may also be
relevant: For a language where final stop devoicing is not shown in the written language,
exchanging b, d, g with p, t, k may be given a penalty number lower than the default value.</p>
</div>
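<div class="ltx_para">
<p class="ltx_p">Such language-specific knowledge can be thought of as substitution weights; the following sketch is illustrative only, with invented pairs and penalty values (in GiellaLT the error model is a weighted finite-state transducer):</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Language-specific error model expressed as substitution weights: a lower
# weight means a cheaper, i.e. more likely, error.  Pairs and weights are
# invented for illustration; the real model is a weighted FST.
SUBSTITUTION_WEIGHT = {
    ("uo", "oa"): 0.5, ("oa", "uo"): 0.5,   # parallel diphthongs, easy to confuse
    ("ie", "ea"): 0.5, ("ea", "ie"): 0.5,
    ("b", "p"): 0.7, ("d", "t"): 0.7, ("g", "k"): 0.7,   # devoicing not written
}
DEFAULT_WEIGHT = 1.0    # plain Levenshtein operations

def correction_weight(typed, intended):
    """Weight of assuming the typed segment should have been the intended one."""
    return SUBSTITUTION_WEIGHT.get((typed, intended), DEFAULT_WEIGHT)

# Candidate corrections are ranked by the summed weight of the edits they need,
# so a diphthong swap (0.5) outranks an arbitrary substitution (1.0).
print(correction_weight("uo", "oa"), correction_weight("x", "y"))   # 0.5 1.0
</pre>
</div>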
<div id="S3.SS2.p4" class="ltx_para">
<p class="ltx_p">When the dictionary reaches sufficiently high coverage the problems caused by suggesting relatively rare words become more apparent, one way of dealing with this problem
is codifying the rare words on the lexical level. If there are corpora of correctly written
texts available, it is also possible to use statistical approaches to make sure that very rare
words are not suggested as corrections unless we are sure they are the most likely ones.</p>
</div>
<div id="S3.SS2.p5" class="ltx_para">
<p class="ltx_p">It is also possible to list common misspellings of whole words. The South Saami error
model contains the pair uvre:eevre (for eevre ”just, precisely”), where word forms like
muvre and duvre otherwise would have had a shorter Levenshtein distance.</p>
</div>
<div id="S3.SS2.p6" class="ltx_para">
<p class="ltx_p">Seen from a minority language perspective, the main point is that the GiellaLT infrastructure offers a ready-made way of building not only language models but also speller
error models and a possibility to integrate them into a wide range of word processors. And
having a mechanism for automatically separating normative from descriptive forms means
that the same source code and language model can be used and reused in many different
contexts.</p>
</div>
<div id="S3.SS2.p7" class="ltx_para">
<p class="ltx_p">⏱
mm: ¼ ߙ
ߚ :1 mm
1.0: 3-4 mm</p>
</div>
<div id="S3.SS2.p8" class="ltx_para">
<p class="ltx_p">⏱
mm: ¼ ߙ
ߚ :1 mm
1.0: 3 mm</p>
</div>
</section>
<section id="S3.SS3" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.3 </span>Automatic hyphenation</h3>
<div id="S3.SS3.p1" class="ltx_para">
<p class="ltx_p">The rules for proper hyphenation vary from language to language, but usually it is based
on the syllable structure of the words. Depending on the language, morphology may also
play a role, especially word boundaries in languages with compounding.
The GiellaLT infrastructure supports hyphenation using several mechanisms. The core
component is an FST rule component defining the syllable structure. Exceptions can be
specified in the lexicon, and both lexicalised and dynamic compounds will be used to override the rule-based hyphenation. The result is high-quality hyphenation, but it requires good
coverage by the lexicon. The lexical component is based on the morphological analyser
described in earlier sections.
Given that the morphological analyser is already done, adding hyphenation rules does
not take a lot of work. The most time-consuming work is testing and ensuring that the result
is as it should be. Getting or building hyphenated test data can be time consuming.</p>
</div>
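<div class="ltx_para">
<p class="ltx_p">As a rough illustration of syllable-based hyphenation, the following Python sketch applies a simplified rule that breaks before a consonant followed by a vowel (the real GiellaLT hyphenators are FSTs and also use lexical information):</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Toy hyphenation by syllable structure: break before a consonant that is
# directly followed by a vowel (a simplified CV rule; real GiellaLT
# hyphenators are FSTs and also consult the lexicon for compound boundaries).
VOWELS = set("aeiouyäöå")

def hyphenate(word):
    points = [i for i in range(1, len(word) - 1)
              if word[i] not in VOWELS and word[i + 1] in VOWELS]
    for i in reversed(points):
        word = word[:i] + "-" + word[i:]
    return word

print(hyphenate("hinnassa"))   # hin-nas-sa
</pre>
</div>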
</section>
<section id="S3.SS4" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.4 </span>Dictionaries</h3>
<div id="S3.SS4.p1" class="ltx_para">
<p class="ltx_p">The GiellaLT infrastructure includes a setup for combining dictionaries and language models. Most GiellaLT languages possess a rich morphology, with tens or even hundreds of
inflected forms for each lemma, often including complex stem-alternation processes, prefixing or dynamic compounding. Looking up unknown words may be a tedious endeavour,
since few of the instances of a lemma in running text may be the lemma form itself.
The GiellaLT dictionaries combine the dictionary with an FST-based lookup model that
finds the lemma form, sends it to the dictionary and presents the translation to the user.
This may be done via a click-in-text-function, as in Figure 1.
The FST may then also be used to generate paradigms of different sizes to the user, as well
as facilitate example extraction from corpora. The tags in the analysis can also be used as
triggers for additional information about the word, e.g. derivation, which can be linked to
information in an online-grammar (see Johnson et al 2013 for a presentation).
Dictionary source files are written in a simple XML format. If one has access to a bilingual machine-readable dictionary and the lemmas in the dictionary have the same form as
the lemmas in the FST, the dictionary may easily be turned into a morphology-enriched
dictionary in the GiellaLT infrastructure. Less structured dictionaries will require far more
work. Homonymous lemmas with different inflection should be distinguished by getting
different tags, both in lexc and in the XML schema. In this way we can ensure that the
words are presented with the correct inflectional paradigm to the dictionary user.</p>
<p class="ltx_p">Figure 1: The dictionary set-up. The word clicked in this image is both inflected and a compound; the analyser finds the base form of both parts, passes them to the dictionary, and the translation is then presented to the user.</p>
</div>
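<div class="ltx_para">
<p class="ltx_p">The lookup chain can be sketched as follows (illustrative Python; the hard-coded analyses and dictionary entries stand in for the FST output and the XML dictionary source):</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch of the dictionary lookup chain: analyse the surface form to get
# lemma(s), then look each lemma up in the bilingual dictionary.
# Both the analyses and the dictionary entries are invented stand-ins.
ANALYSES = {
    "hinnassa": [("hinta", "+N+Sg+Ine")],
}
DICTIONARY = {
    "hinta": ["price"],
}

def lookup(wordform):
    """Return (lemma, analysis, translations) triples for a running-text word."""
    results = []
    for lemma, analysis in ANALYSES.get(wordform, []):
        results.append((lemma, analysis, DICTIONARY.get(lemma, [])))
    return results

print(lookup("hinnassa"))   # [('hinta', '+N+Sg+Ine', ['price'])]
</pre>
</div>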
<div id="S3.SS4.p2" class="ltx_para">
<p class="ltx_p">⏱
ߙ :3 mm
ߚ :8 mm
1.0: 12 mm</p>
</div>
</section>
<section id="S3.SS5" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.5 </span>Grammar-checking and correction</h3>
<div id="S3.SS5.p1" class="ltx_para">
<p class="ltx_p">Everybody *make errors – both grammatical errors and typos. Expectations are higher as
to what grammar checkers should do compared to a spellchecker, even if we do not think
about it consciously. Whereas to find that *teh is a typo for the we only need to check the
word itself, to find that *their is a typo for there we need to actually look at the whole
sentence. So even for typos we need a more powerful tool that understands grammar.
These rather trivial errors make up a big part of English grammar checking (e.g., the
Grammarly program). Most minority languages, especially the circumpolar ones in the
GiellaLT infrastructure, are morphologically much more complex. They are rich in inflections and derivations, with complex wordforms that bear a lot of potential for errors. When
wordforms are pronounced close to each other or endings are omitted in speech they are
also frequently misspelled in writing. Typical errors are those that concern agreement between subject and verb or determiner and noun (cf. the North Sámi example), as well as
case marking errors in different parts of the sentence.</p>
</div>
<div id="S3.SS5.p2" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
Mii sámit maid *áigot gullot. > áigut
We Sámi.Nom.Pl want.3Pl listen > want.1Pl
We Sámi also (they) want to listen > we want to
</pre>
</div>
<div id="S3.SS5.p3" class="ltx_para">
<p class="ltx_p">When writing a rule-based grammar checker we first identify the morphological and
syntactic structure of a sentence similar to parsing. Each word is associated with a lemma
and a number of morphological tags. If the word is homonymous either in its form or already based on different lexemes (like address – 1. noun 2. verb infinitive 3. finite verb) it
is associated with more than one possible analysis. Disambiguation is not the same as in
parsing, as its goal is different. Instead of aiming for one remaining analysis, we
only want to remove as much ambiguity as is necessary to find a potential error. And since
we expect the sentence to contain errors, and the context therefore to be somewhat unreliable, we are a bit
more relaxed about disambiguation. At the same time, we do need reliable information as to
what the context of our error is. In case of address, we would need to identify it as a noun
before looking for an agreement error with a subsequent finite verb.</p>
</div>
<div id="S3.SS5.p4" class="ltx_para">
<p class="ltx_p">The analysis must be robust enough to recognise the grammatical pattern in question
even when the grammar is wrong. Constraint grammar differs from other rule-based grammar formalisms in being bottom-up, taking the words and their morphological analysis as
a starting point, removing irrelevant analyses, and adding grammatical functions. For robust rules we make use of all the potential of rule-based methods. We have access to semantic information (e.g. human, time, vehicle), valency information (case and semantic
role of the arguments of a verb), and pronunciation-related traits which make a form prone
to certain misspellings. We can also add dependencies and map semantic roles to their respective verbs in order to find valency-based case or pre-/postposition errors.
</p>
</div>
<div id="S3.SS5.p5" class="ltx_para">
<p class="ltx_p">GiellaLT’s first grammar checker was made for North Sámi and work started in 2011.
(Wiechetek 2012; Wiechetek 2017). Since 2019 grammar checkers are supported by GiellaLT for any language. Its setup includes various modules to ensure availability of the required information to find a grammatical error and generate suggestions and user feedback.</p>
</div>
<div id="S3.SS5.p6" class="ltx_para">
<p class="ltx_p">As Microsoft Word unfortunately does not open for integrating third-party solutions for
low-resource languages, we are forced to using its web-based plugin interface instead of
the usual blue line below the text. The grammar checker interface, detected errors and suggested corrections are presented in a sidebar to the right of the text, as seen in the screen
shot of a version of the North Sámi grammar checker in Figure 2.</p>
</div>
<div id="S3.SS5.p7" class="ltx_para">
<p class="ltx_p">When making a grammar checker from scratch without any previous study of which
grammatical errors are frequent for either L1 or L2 users, building the grammar checker becomes a study of errors at the same time as the tool is made. A well-functioning grammar
checker requires hand-written rules that are validated by a set of example sentences from
the corpus (newspaper texts, fiction, etc.) so exceptions to a certain grammatical construction are well covered. This set of examples should be included in a daily testing routine to
see if rules break when they are modified and to follow development of precision and recall.
One should start with a set of at least 50 examples including both positive and negative
examples of a certain error so that both precision and recall can be tested.</p>
</div>
<div id="S3.SS5.p8" class="ltx_para">
<p class="ltx_p">Error detection and correction rules that add error tags to a specific token and exchange
incorrect lemmata or morphological tag combinations with correct ones. The most complex
of these rules include very specific context conditions and numerous negative conditions
for exceptions of the rules. The rule in the following example detects an agreement error in
a word form that is expected to be third person plural and fails to be so.</p>
</div>
<div id="S3.SS5.p9" class="ltx_para">
<p class="ltx_p">The expectation is based on a preceding pronominal subject in third person plural or a
nominal subject in nominative plural. *-1 specifies its position to the left. The BARRIER
operator restricts the distance to the subjects by specifying word (forms) that may or may
not appear between the subject and its verb. In this case only adverbs or particles may
appear between them. The target of error detection is a verb in present or past tense.
The exceptions are specified in separate condition specifications afterwards. Some regard the target and exclude common homonyms like illative case forms or idiosyncratic
adverbs, common spelling mistakes of nouns (that then get a verbal analysis), and homonymous
non-finite verb forms. The rule also excludes 1st person plural present tense forms except
for those ending in –at (here specified by a regex), and infinitives that are preceded by</p>
</div>
<div id="S3.SS5.p10" class="ltx_para">
<p class="ltx_p">Figure 2 North Sámi Divvun grammar checker in MS Word in the right sidebar
instead of the blue underline
lemmata with infinitive valency tags. The exceptions specify also coordinated nounphrases, dual forms with coordinated human subjects or coordinations that involve first or
second person pronouns, to name some of them.</p>
</div>
<div id="S3.SS5.p11" class="ltx_para">
<p class="ltx_p">⏱
ߙ :1 mm
ߚ :3 mm
1.0: 12 mm</p>
</div>
<div id="S3.SS5.p12" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
ADD (&syn-number_congruence-subj-verb) TARGET (V Ind Prs) - ConNeg OR (V Ind Prt)
IF
(*-1 (Pron Sem/Hum Pers Pl3 Nom) OR (N Pl Nom) BARRIER NOT-ADV-PCLE)
(NEGATE 0 (Sg Ill))
(NEGATE 0 N + Err/Orth-any)
(NEGATE 0 Prs Pl1) - ("<.*at>r"))
(NEGATE 0 Inf LINK -1 <TH-Inf> OR <*Inf>)
(NEGATE 0 Du3 LINK *1 Sem/Human BARRIER NOT-ADV-PCLE LINK 1 CC LINK 1 Sem/Human)
(NEGATE 0 Pl1 LINK -1 Sem/Human + (Nom Pl) LINK -1 CC LINK -1 ("mun" Nom));
ADD rules cooccur with COPY rules which replace the incorrect tags (in this case any person
number that is not Pl3) with the correct one, i.e., Pl3 with the error tag added by the accompanying ADD rule.
COPY (Pl3 &SUGGEST) EXCEPT Sg1 OR Sg2 OR Sg3 OR Du1 OR Du2 OR Du3 OR Pl1 OR Pl2
TARGET (V Ind Prs &syn-number_congruence-subj-verb) ;
</pre>
</div>
</section>
<section id="S3.SS6" class="ltx_subsection">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.6 </span>Machine translation</h3>
<div id="S3.SS6.p1" class="ltx_para">
<p class="ltx_p">Machine translation (MT) in GiellaLT is handled by making our morphological analysers,
and syntactic parsers available to the Apertium MT infrastructure (see Khanna et al 2021).
This is basically performed by a series of conversions in order to convert the output of the
GiellaLT language models to adhere to Apertium conventions as well as by adding some
new components. The basic building block is the morphological and syntactic analyser for
the source language (see section 2.2.2) and a bilingual dictionary. This analyser must have
a good coverage, which also includes non-normative forms. If the source language and the
target language are not closely related, the syntactic tags in the analysis of the source language are very useful.</p>
</div>
<div id="S3.SS6.p2" class="ltx_para">
<p class="ltx_p">The output in the target language is generated by a morphological analyser which at
least covers the lemmas in the dictionary. In addition, one needs a grammatical description
of constructions and phrases where the grammars of source and target languages do not match. This grammar
handles all mismatches between source and target language, whatever they may be: dropping of pronouns, introduction of articles, gendering of pronouns, idioms and MWEs, etc.</p>
</div>
<div id="S3.SS6.p3" class="ltx_para">
<p class="ltx_p">We do the bilingual lexicography like this:</p>
</div>
<div id="S3.SS6.p4" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
<e><p><l>bivdi<s n="n"/></l><r>jeger<s n="n"/><s n="m"/></r></p></e>
<e><p><l>bivdi<s n="n"/></l><r>fisker<s n="n"/><s n="m"/></r></p></e>
<e><p><l>bivdin<s n="n"></l><r>jakt<s n="n"/><s n="f"/></r></p></e>
</pre>
</div>
<div id="S3.SS6.p5" class="ltx_para">
<p class="ltx_p">Inside the e (entry) and p (pair) node there are two parts, l(left) and r (right), with the
two languages as node content. Each word is followed by a set of s nodes, containing
grammatical information (here: POS (n = noun) and gender (m, f, nt = masculine, feminine and neuter). This is a basic XML format used with Apertium. The logic is simple, and
it benefits from the tags being systematic. If the dictionary contains two or more word pairs
with identic lemma on the left side, one must consider which word pair to choose, or one
can make lexical selection rules, based on context in the sentence. These rules make a lexical selection module, which can be made as XML, based on position and lemma and tags
in the context, or can be made with CG rules, see 3.2.2.
The system supports multi-words in both directions and idioms. The primary goal is to
get a good coverage of singleton words. Syntactic transfer is done as follows</p>
</div>
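<div class="ltx_para">
<p class="ltx_p">As a small illustration of how such entries can be processed, the following Python sketch reads bidix-style entries like the ones above with the standard library and collects left-to-right translation pairs (the embedded sample is taken from the listing above; this is not Apertium's own tooling):</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Parse bidix-style entries like the ones above and collect left-to-right
# translation pairs with their grammatical tags.  Illustration only.
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary><section>
<e><p><l>bivdi<s n="n"/></l><r>jeger<s n="n"/><s n="m"/></r></p></e>
<e><p><l>bivdi<s n="n"/></l><r>fisker<s n="n"/><s n="m"/></r></p></e>
</section></dictionary>"""

def side(node):
    """Return (lemma, [tags]) for an l or r node."""
    lemma = node.text or ""
    tags = [s.get("n") for s in node.findall("s")]
    return lemma, tags

pairs = {}
for e in ET.fromstring(SAMPLE).iter("e"):
    left, right = side(e.find("p/l")), side(e.find("p/r"))
    pairs.setdefault(left[0], []).append(right)

print(pairs["bivdi"])   # [('jeger', ['n', 'm']), ('fisker', ['n', 'm'])]
</pre>
</div>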
<div id="S3.SS6.p6" class="ltx_para">
<p class="ltx_p">⏱
ߙ :12 mm
ߚ :15 mm
1.0: 18 mm</p>
</div>
<div id="S3.SS6.p7" class="ltx_para">
<pre class="ltx_verbatim ltx_font_typewriter">
S -> VP NP { 1 _
*(maybe_adp)[case=2.case]
*(maybe_art)[number=2.number,
case=2.case,gender=2.gender
,def=ind] 2 } ;
V -> %vblex {1[person =
(if (1.tense = imp) "" else 1.person),
number = (if (1.number = du)
pl else 1.number)] } ;
</pre>
</div>
<div id="S3.SS6.p8" class="ltx_para">
<p class="ltx_p">In this simplified example from a real-world grammar for syntactic transfer from NorthSaami to a Germanic language we show that the machine translation syntax is quite like
the notation conventions from e.g., the Standard Theory (Chomsky 1965). Rules may thus
operate on either word or phrase level. Here we handle distribution of case, number and
gender for the generated adpositions and articles, which are based on the case and the position in the sentence, and translating the singular-dual-plural system of North Sámi into
the Germanic singular-plural system (Pirinen and Wiechetek 2022). After chunking words
together in the first two modules, the following modules change the word order, to the
extent it is necessary.</p>
</div>
<div id="S3.SS6.p9" class="ltx_para">
<p class="ltx_p">In our experience the systems start to be usable for understanding the idea of the text in
translation when they contain 5,000 well-selected translation pairs, if it is also possible to
transfer compounding and/or derivations from the source language to the target language. This
amounts to a few months of work. If the language pair requires large re-ordering of the grammar, it will demand more work.</p>
</div>
<div id="S3.SS6.p10" class="ltx_para">
<p class="ltx_p">Although there is a request for MT programs from the majority language to the minority
language, we have chosen to make systems the other way, from the minority language to