<!DOCTYPE html><html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<title>Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets</title>
<!--Generated on Mon Dec 16 16:47:52 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.-->
<!--Document created on December 16, 2024.-->
<link rel="stylesheet" href="../latexml/LaTeXML.css" type="text/css">
<link rel="stylesheet" href="../latexml/ltx-article.css" type="text/css">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
</head>
<body>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">Keeping Up Appearances—or how to get all Uralic languages included into
bleeding edge research and software: generate, convert, and LLM your way into
multilingual
datasets<span id="footnote1" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup>
<span class="ltx_tag ltx_tag_note">1</span>
This is Flammie's draft; the official version may differ. The ACL Anthology version is available at <a href="http://aclanthology.org/2024.iwclul-1.16" title="" class="ltx_ref ltx_url ltx_font_typewriter">http://aclanthology.org/2024.iwclul-1.16</a>. ACL is typically licensed CC-BY.</span></span></span>
</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">Flammie A Pirinen
<br class="ltx_break">Divvun
<br class="ltx_break">UiT—Norgga árktalaš universitehta
<br class="ltx_break">Tromsø, Norway
<br class="ltx_break"><a href="[email protected]" title="" class="ltx_ref ltx_url ltx_font_typewriter">[email protected]</a>
</span></span>
</div>
<div class="ltx_dates">(December 16, 2024)</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p">The current trends in natural language processing strongly favor large
language models and generative AIs as the basis for everything. For Uralic
languages that are not largely present in publically available data on the
Internet, this can be problematic. In the current computational linguistic
scene, it is very important to have representation of your language in
popular datasets. Languages that are included in well-known datasets are
also included in shared tasks, products by large technology corporations,
and so forth. This inclusion will become especially important for
under-resourced, under-studied minority, and Indigenous languages, which
will otherwise be easily forgotten. In this article, we present the
resources that are often deemed necessary for digital presence of a language
in the large language model -obsessed world of today. We show that there
are methods and tricks available to alleviate the problems with a lack of
data and a lack of creators and annotators of the data, some more successful
than others.</p>
</div>
<section id="S1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div id="S1.p1" class="ltx_para">
<p class="ltx_p">In recent years, the landscape of language technology has changed quite rapidly,
mainly with the advent large language models, but the overarching shift towards
big data has been ongoing for longer. The problem with this shift is, that it
is based on the big data for large majority languages, the inclusion of all the
smaller languages, including all of the Uralic languages, has come as an
afterthought if at all.</p>
</div>
<div id="S1.p2" class="ltx_para">
<p class="ltx_p">The expected solution for the continued sustainability of minority Uralic
languages in the landscape of modern languages in the time of large language
models is to “generate” more data. Ideally, by ‘generate’, the engineers in
large language model contexts mean, that authentic written (or spoken) data
needs to be created by native writers who should not make too many spelling or
grammar errors and write the most current normative form. This can be an
unreachable goal for a language that has fewer than million speakers and writers
who are not L1, as while the requirements for large language models are going
down over time, they are still orders of magnitude larger that can plausibly be
created by limited amount of writers and speakers in limited amount of time.</p>
</div>
<div id="S1.p3" class="ltx_para">
<p class="ltx_p">What we suggest in this paper is to carefully organise the initial work of
corpus curation and creation around materials that are of high importance to the
contemporary language technology community. We leverage existing resources and
language technologies to minimise unnecessary and repetitive work by linguists
and language professionals on the language data that is being worked on;
automating what can be automated and re-using linguists annotation efforts is a
key to efficient development of high-quality human verified gold data.</p>
</div>
<div id="S1.p4" class="ltx_para">
<p class="ltx_p">Our <span class="ltx_text ltx_font_italic">research question</span> is, going from existing langauge technology
resources: which tools are best suitable for launching and bootstrapping which
resources. If language has usable electronical dictionaries, morphological
analysers and generators, spell-checkers and so on, what can be used to
effectivise the dataset creation and corpus curation. The question is especially
interesting now, as there is a possibility to use contemporary multilingual
large language models, as well as traditional rule-based, statistical and hybrid
language models to perform various pre-processing and processing tasks.</p>
</div>
<div id="S1.p5" class="ltx_para">
<p class="ltx_p">Our <span class="ltx_text ltx_font_italic">key contributions</span> from this article are: <span class="ltx_text ltx_font_italic">the experimental
framework</span> for others to compare and combine methods of gold data annotation for
smaller languages, the <span class="ltx_text ltx_font_italic">pipelines</span> from traditional rule-based
annotations and LLM generations into concrete target formats, and the results of
comparing some of the approaches for a low resource Uralic language along with
recommendations of what is currently the most effective approach. As a side
product we have created, curated and annotated beginnings of <span class="ltx_text ltx_font_italic">several new
datasets</span> for an under-resourced Uralic language.</p>
</div>
<div id="S1.p6" class="ltx_para">
<p class="ltx_p">We have laid out experimental computational linguistics data creation and
annotation system that can use both existing rule-based tools as well as large
language models to aid the processs. One of the goals of this experiment and
the approach is that we want to promote inclusion of more Uralic languages in
all of the common language technology datasets. We are considering three
separate approaches to help creation of annotated gold data:</p>
</div>
<div id="S1.p7" class="ltx_para">
<ol id="S1.I1" class="ltx_enumerate">
<li id="S1.I1.i1" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">1.</span>
<div id="S1.I1.i1.p1" class="ltx_para">
<p class="ltx_p">rule-based generators and generative language models to generate a
starting point for a data set, to be proof-read and re-annotated by
humans,</p>
</div>
</li>
<li id="S1.I1.i2" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">2.</span>
<div id="S1.I1.i2.p1" class="ltx_para">
<p class="ltx_p">rule-based analysers creating annotated dataset in legacy and ad hoc
formats that are converted and organised into a starting point for human
re-annotation, and</p>
</div>
</li>
<li id="S1.I1.i3" class="ltx_item" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">3.</span>
<div id="S1.I1.i3.p1" class="ltx_para">
<p class="ltx_p">generative language models providing human annotators with starting
points or improvements during annotation process</p>
</div>
</li>
</ol>
</div>
<div id="S1.p8" class="ltx_para">
<p class="ltx_p">There are of course other possibilities as well, these are based on our previous
experience and iterations with different datasets and projects. It must be noted
that the goal here is to generate something comparable to human annotated gold
corpus, so we are not planning to automate data generation or annotation. This
has to be also contrasted to the reality of limited human resources for working
with smaller Uralic languages, we do not necessarily have a possiblity to hire 5
annotators to work on data full hours for several months, but to ask if the
language experts who have other main jobs as language experts can use hours or
two here and there on the task, this is one of the motivations of our experiment
as well.</p>
</div>
</section>
<section id="S2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Background</h2>
<div id="S2.p1" class="ltx_para">
<p class="ltx_p">The Uralic languages, especially besides the bigger national languages, are
relatively under-resourced; the size of freely available texts is measured in
millions of tokens or less. However, Uralic languages do have strong traditions
of rule-based language technology. Also, lately, the large language model
-based language technology has showed itself as a viable option for some use
cases. Our approach to resource creation to overcome some of the
under-resourcedness problem is thus to see if we can leverage the existing
technology to supplement the well-planned tactical selection of language dataset
resources. In this article, we suggest curating and creating data that are
highly relevant for the large language model building industry and also for the
researchers of languages in language technology and linguists as well. While
majority of industry and researchers concern themselves with basically English
and maybe handful of commercially plausible majority languages of the world, we
have discovered some related research both from the industry and the researchers
who specialise in minority and under-resourced languages.</p>
</div>
<div id="S2.p2" class="ltx_para">
<p class="ltx_p">As one reference point, we study what technology companies and central research
groups in LLM-based language technology have said about support for smaller
language in the recent years; One reason for writing this article and its
experiments is also inspired by these works: Meta and FAIR research group
(Facebook’s AI Research) have released resources and studies under the moniker
of <span class="ltx_text ltx_font_italic">No language left behind</span> (NLLB) <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib73" title="No language left behind: scaling human-centered machine translation" class="ltx_ref">3</a>]</cite>, also known for datasets
and evaluation schemes under <span class="ltx_text ltx_font_italic">Scaling neural machine translation to next
200 languages</span> (FLoRES) <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib74" title="Scaling neural machine translation to 200 languages" class="ltx_ref">10</a>]</cite>. Unsurprisingly, this data set has so far
included only Finnish, Estonian, and Hungarian when it comes to Uralic language
inclusion. Alphabet and Google research have also been active on extending the
range of languages supported under the name of <span class="ltx_text ltx_font_italic">next 1000
languages</span> <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib27" title="Building machine translation systems for the next thousand languages" class="ltx_ref">1</a>]</cite>. They have also
published several research papers listing exactly the sources they use to gather
information and data on the languages <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib290" title="LinguaMeta: unified metadata for thousands of languages" class="ltx_ref">7</a>]</cite>, this
is directly useful information to know that, if you want to be included in
Google’s considerations list of languages that might be supported or relevant,
perhaps you want to have data in the resources and datasets they use.</p>
</div>
<div id="S2.p3" class="ltx_para">
<p class="ltx_p">The resources that we use in this articles experiments here have also been used
for several years now in the academic community as the go-to resource to measure
if your tool works with the given language. For example, the <span class="ltx_text ltx_font_italic">Universal
Dependencies</span> (UD) treebanks <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib365" title="Universal dependencies 2.14" class="ltx_ref">12</a>]</cite>, are used in a huge number of papers
investigating computational linguistic methods in a large number of languages,
including the annual shared tasks in syntactic parsing. It would thus appear
that UD as a resource has passed the test of time. Secondly we have seen the
<span class="ltx_text ltx_font_italic">Unimorph</span> dataset, that concerns morphology of languages, has been used
widely in the research and applications. Namely with research of morphophonology
and machine learning there have been regular shared tasks.
We have explicitly left out parallel corpora and machine translations from this
article for two reasons: firstly it is already a main focus of the large
corporations and research groups working on the natural language engineering
tasks and secondly our corpus selection is based on aiming to have a large
subset of professionally human-translated texts as the source texts in these
datasets, we find these are much more valuable than machine translated or
post-edited texts, for the early phases of big data building we are in.
</p>
</div>
<div id="S2.p4" class="ltx_para">
<p class="ltx_p">For the experimentation of this article I have chosen Inari Sámi as a target
language; Inari Sámi is a Uralic language, that does not as of now have many of
the resources that we are about to create. It is a low-resource Indigenous
language with limited amount of speakers and written resources available, but an
active speaker community that writes new texts. We have existing tools in
rule-based language technology available from the well-known free and open
source repository<span id="footnote2" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup>
<span class="ltx_tag ltx_tag_note">2</span>
<a href="https://giellalt.github.io/lang-smn" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://giellalt.github.io/lang-smn</a></span></span></span>.
Furthermore, the most recent versions of large language model -based systems
have been seen to support Inari Sámi (instead of just refusing to handle it and
deferring to professionals as earlier versions did). Finally, we have a
computational linguist who is not a native speaker but is capable of working
with the language and has contacts with language experts; we find this
sufficient for initial experimentation, but of course for serious language data
building, more expert knowledge is needed.</p>
</div>
<div id="S2.p5" class="ltx_para">
<p class="ltx_p">For some the work on dataset creation there has been previous works, for example
in Universal Dependencies and rule-based analyser there are existing methods
that have been used for other existing uralic dependencies treebanks, such as
the North Sámi <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib355" title="Annotation schemes in North Sámi dependency parsing" class="ltx_ref">11</a>]</cite> and Karelian
treebanks <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib276" title="Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in karelian treebanking" class="ltx_ref">5</a>]</cite>. For generation of the UniMorph data, some
of the datasets are generated based on rule-based generators <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib366" title="UniMorph 4.0: universal morphology" class="ltx_ref">2</a>]</cite>,
strictly speaking Wiktionary can also be considered as rule-based morphological
generation, however, we have not found this mentioned explicitly in existimg
articles about unimorph.</p>
</div>
</section>
<section id="S3" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Methods</h2>
<div id="S3.p1" class="ltx_para">
<p class="ltx_p">Our experimentation concerns the use of existing language technology tools to
help the creation of the datasets while following the rules and ideals behind
the given datasets. For example, when Universal Dependencies guidelines
dictates that the dependency annotation must be manual or human made, we do not
use the tools to generate unchecked 1-best annotations that would pollute the
dataset. The most common strategy here is to give all plausible hypotheses from
the automatic analysis to the linguist to post-edit, but another option is that
the post-edited analyses are verified to be plausible analyses of the system
(our end goal is to have a gold standard that agrees with the analyser and
linguistic expertise).</p>
</div>
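<div class="ltx_para">
<p class="ltx_p">A minimal sketch of the latter verification step is given below, under the
assumption that the gold annotations and the analyser’s hypotheses are available
as simple form-to-analysis mappings; the names and data structures here are
illustrative, not the actual scripts used in the experiments.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: check that every post-edited (gold) analysis is among the hypotheses
# the rule-based analyser produces for the same form, so the final gold
# standard agrees with both the analyser and the linguist.
# Names and data structures are illustrative only.
def disagreements(gold, analyser_hypotheses):
    """gold maps a form to its chosen analysis; analyser_hypotheses maps a
    form to the set of analyses the analyser offers for it."""
    return {form: analysis
            for form, analysis in gold.items()
            if analysis not in analyser_hypotheses.get(form, set())}

if __name__ == "__main__":
    gold = {"tlust": "tlu+N+Sg+Loc"}
    hypotheses = {"tlust": {"tlu+N+Sg+Loc", "tlust+Adv"}}
    print(disagreements(gold, hypotheses))  # {} means analyser and gold agree
</pre>
</div>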
<div id="S3.p2" class="ltx_para">
<p class="ltx_p">For the existing rule-based systems, we have downloaded and installed well-known
GiellaLT softwares, which are freely available from the GitHub with an open
source
licence <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib370" title="GiellaLT—a stable infrastructure for nordic minority languages and beyond" class="ltx_ref">6</a>]</cite>.<span id="footnote3" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup>
<span class="ltx_tag ltx_tag_note">3</span>
<a href="https://giellalt.github.io" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://giellalt.github.io</a></span></span></span>
The LLM experimentation is performed using ChatGPT, a state-of-the-art
chatbot interface to a closed-source, commercial neural network.<span id="footnote4" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup>
<span class="ltx_tag ltx_tag_note">4</span>
The
version tested at the time of writing identifies itself as GPT-4, which was the
newest model at the time we began experimenting but has probably been outdated
by the time of the publication.</span></span></span> We have chosen ChatGPT since it is the most
popular one and has a freely usable version available to most Uralic language
researchers even without an expensive AI budget. An example of ChatGPT performing a
UniMorph dataset generation task can be seen in Figure <a href="#S3.F1" title="Figure 1 ‣ 3 Methods ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>.</p>
</div>
<figure id="S3.F1" class="ltx_figure"><img src="chatuit.png" id="S3.F1.g1" class="ltx_graphics ltx_centering ltx_img_square" width="598" height="517" alt="Refer to caption">
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>ChatGPT generating data for Inari Sámi UniMorph dataset.</figcaption>
</figure>
<div id="S3.p3" class="ltx_para">
<p class="ltx_p">When working with a preexisting computational linguistic, rule-based system, one
of the main engineering efforts lies on the conversion. Although it sounds
trivial, there is a lot of linguistic and engineering work to be taken into
account here: the actual format of the analyses is rarely exactly the same, so a
mapping needs to be devised, for example, converting “noun” analyses from
<span class="ltx_text ltx_font_typewriter">+N</span> to <span class="ltx_text ltx_font_typewriter">N;</span> or <span class="ltx_text ltx_font_typewriter">NOUN</span>. The mappings can also be 1:n or
m:1, merging and joining ‘tags’, as well as more involved re-writings. There
are a lot of other technical minor details related to such generations and
conversions that are beyond the scope of this article, for example, we needed an
algorithm that could remove duplicate forms that is aware of Unicode
normalisation forms and folding to avoid having the linguist read word forms
that look exactly the same several times. The topic of conversions in itself is
large enough to deserve its own article,<span id="footnote5" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">5</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">5</sup>
<span class="ltx_tag ltx_tag_note">5</span>
we have attempted to write one
such article; even in a very condensed format it easily exceeds the 8 pages that is
the maximum for an average conference article in language technology.</span></span></span> For the
purposes of this article we will point the readers to our GitHub repository
containing freely available scripts.<span id="footnote6" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">6</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">6</sup>
<span class="ltx_tag ltx_tag_note">6</span>
<a href="anonymised" title="" class="ltx_ref ltx_url ltx_font_typewriter">anonymised</a></span></span></span> Some examples of
conversions are given in the Figure <a href="#S3.F2" title="Figure 2 ‣ 3 Methods ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a>.</p>
</div>
<figure id="S3.F2" class="ltx_figure">
<div class="ltx_flex_figure">
<div class="ltx_flex_cell ltx_flex_size_1">
<p class="ltx_p ltx_figure_panel">E.g. <span class="ltx_text ltx_font_italic">Finite State Morphology</span> to <span class="ltx_text ltx_font_italic">Unimorph</span></p>
</div>
<div class="ltx_flex_break"></div>
<div class="ltx_flex_cell ltx_flex_size_1"><pre class="ltx_verbatim ltx_figure_panel ltx_font_typewriter">
tlu tlu+N+Sg+Nom <-> tlu tlu N;SG;NOM
tlust tlu+N+Sg+Loc <-> tlust tlu N;SG;LOC
tlustn tlu+N+Sg+Loc+PxSg1 <-> tlustn tlu N;SG;LOC;PSS1S
</pre></div>
<div class="ltx_flex_break"></div>
<div class="ltx_flex_cell ltx_flex_size_1">
<p class="ltx_p ltx_figure_panel">E.g. <span class="ltx_text ltx_font_italic">VISL CG 3</span> to <span class="ltx_text ltx_font_italic">Universal Dependencies</span></p>
</div>
<div class="ltx_flex_break"></div>
<div class="ltx_flex_cell ltx_flex_size_1"><pre class="ltx_verbatim ltx_figure_panel ltx_font_typewriter">
"<mun>"
"mun" Pron Pers Sg1 Nom @SUBJ> #1->2
:
"<juuhim>"
"juuh" <mv> V TV Ind Prt Sg1 @FMV #2->0
:
"<vuol>"
"vuol" N Sem/Drink Sg Acc @<OBJ #3->2
^^^
|||
vvv
# textid = example.1
# text = mun juuhim vuol
1ΨmunΨmunΨPRONΨPron PersΨCase=Nom|Number=Sing|Person=1|PronType=PersΨ2ΨnsubjΨ_ _
2ΨjuuhimΨjuuhΨVERBΨV TVΨMood=Ind|Number=Sing|Person=1|Tense=PastΨ0ΨrootΨ_ _
3ΨvuolΨvuolΨNOUNΨN Sem/DrinkΨCase=Acc|Number=SingΨ2ΨobjΨ_ _
</pre></div>
</div>
<figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Conversions between traditional rule-based analyses and target dataset formats
</figcaption>
</figure>
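<div class="ltx_para">
<p class="ltx_p">To illustrate the kind of conversion script involved, the following Python
sketch maps analyser output in the style of Figure 2 to UniMorph-style lines and
removes duplicates in a Unicode-aware way. The tag mapping and the helper names
are illustrative assumptions only, not the actual scripts in our repository.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Illustrative sketch: tag-mapping conversion in the style of Figure 2,
# plus Unicode-aware duplicate removal. Mapping table and function names
# are examples, not the scripts used in the experiments.
import unicodedata

TAG_MAP = {
    "N": "N", "V": "V", "A": "ADJ",            # parts of speech
    "Sg": "SG", "Pl": "PL",                    # number
    "Nom": "NOM", "Loc": "LOC", "Acc": "ACC",  # cases
    "PxSg1": "PSS1S",                          # possessive suffix
}

def fst_to_unimorph(form, analysis):
    """Turn 'tlu+N+Sg+Nom' into a tab-separated 'form lemma N;SG;NOM' line."""
    lemma, *tags = analysis.split("+")
    mapped = [TAG_MAP.get(tag, tag.upper()) for tag in tags]
    return "\t".join([form, lemma, ";".join(mapped)])

def dedupe(pairs):
    """Skip (form, analysis) pairs whose NFC-normalised, case-folded form and
    analysis were already seen, so the linguist never reads visually
    identical forms twice."""
    seen = set()
    for form, analysis in pairs:
        key = (unicodedata.normalize("NFC", form).casefold(), analysis)
        if key not in seen:
            seen.add(key)
            yield form, analysis

if __name__ == "__main__":
    analyses = [("tlu", "tlu+N+Sg+Nom"), ("tlust", "tlu+N+Sg+Loc"),
                ("tlust", "tlu+N+Sg+Loc")]  # duplicate to be removed
    for form, analysis in dedupe(analyses):
        print(fst_to_unimorph(form, analysis))
</pre>
</div>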
<div id="S3.p4" class="ltx_para">
<p class="ltx_p">The experiments with LLMs are based on the currently available free ChatGPT
interface prompted in English. We begin prompting with the most straightforward
requests, e.g. “can you generate a unimorph annotated list of all word-forms
Inari Sámi noun táálu?”, “create a CONLL-U annotated version of this
sentence”, etc.</p>
</div>
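<div class="ltx_para">
<p class="ltx_p">For readers who prefer to script such prompts rather than type them into the
web interface, a request of this kind could also be sent programmatically, for
example with the openai Python client as sketched below. This is only an
illustration under assumptions: our experiments were carried out manually in the
free ChatGPT web interface, and the model name in the sketch is a placeholder.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Illustration only: the experiments in this paper used the free ChatGPT web
# interface, not the API; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = ("Can you generate a UniMorph-annotated list of all word-forms "
          "of the Inari Sámi noun táálu?")

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
</pre>
</div>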
<div id="S3.p5" class="ltx_para">
<p class="ltx_p">It might be noteworthy, that since our goal is inclusion of our Uralic languages
in the relevant datasets, there is also a component of social engineering
involved in all of the dataset creations. Merely producing text files that
contain acceptable data is only a first step. The datasets we have selected to
experiment with, the selection has been also based on the openness and
documentation of the contribution process; all of the given datasets exist on
GitHub, and the contribution process is detailed in the documentation and
happens largely over GitHub only. This is in contrast to the commercially backed
datasets mentioned earlier; while it would be very valuable to have all Uralic
languages in the <span class="ltx_text ltx_font_italic">No Languages Left Behind</span> and <span class="ltx_text ltx_font_italic">Next Thousand
Languages</span>, the way to contribute here is not immediately so obvious and
available to larger audiences.</p>
</div>
</section>
<section id="S4" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Corpora and Data Selection</h2>
<div id="S4.p1" class="ltx_para">
<p class="ltx_p">The corpora available for low-resource Uralic languages are scarce and limited.
The whole corpora of publically available web crawl data is typically less than
the millions of tokens that is often advertised as minimum requirement of large
language models. Furthermore, the data that is available is limited by
licences, quality, and genres: While some argue that all data that can be
crawled is free to use for language technologies, in practice ethical use
requires selecting only the data that has explicitly been licenced with a
suitable licence, such as Wikipedia or data coming from governmental public
domain records—or that has been personally licenced with the author for the
specific use. That furthermore limits both quality—wikipedia data is written
by language learners—and genres—government’s publication are mainly
politics, healthcare and such.</p>
</div>
<div id="S4.p2" class="ltx_para">
<p class="ltx_p">In this experiment we have used primarily freely licenced data from Saami
international corpora (SIKOR), <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib369" title="SIKOR uit norgga árktalaš universitehta ja norgga sámedikki sámi teakstačoakkáldat, veršuvdna 06.11.2018" class="ltx_ref">8</a>]</cite> but we have also
performed a short experiment on self-created and self-translated data that large
language model should not contain from beforehands.</p>
</div>
</section>
<section id="S5" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Experimental results</h2>
<div id="S5.p1" class="ltx_para">
<p class="ltx_p">The main results of our experiment will be the actual datasets we can produce.
To quantify the usefulness of the language technology tools, we have measured
post-edit distances. We have also performed a linguistic error analysis to
quantify the errors made; the effect on the time/effort tradeoff is further
discussed in the Section <a href="#S6" title="6 Discussion ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">6</span></a>.</p>
</div>
<div id="S5.p2" class="ltx_para">
<p class="ltx_p">In our experiment in creating datasets for Unimorph, we used both the rule-based
system and the LLM to generate the full datasets, that can be read and corrected
by a human. The results of generating are shown in the
table <a href="#S5.T1" title="Table 1 ‣ 5 Experimental results ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">1</span></a>. The expected forms is based on the linguistic
grammars we have available <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib368" title="Inarinsaamen taivutusoppi" class="ltx_ref">4</a>]</cite>. We have measured
the number of forms generated, Coverage counted as the proportion of generated
unique forms out of the expected ones, and Accuracy as the proportion of fully
correct forms and analyses out of all generated ones. In general, the rule-based
approach is close to the gold standard, which is expected from rule-based systems;
the LLM generated a smaller subset of forms with lower accuracy.</p>
</div>
<figure id="S5.T1" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_tt">POS</th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">Expected</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">RB</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">RB</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">RB</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">LLM</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">LLM</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">LLM</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<td class="ltx_td"></td>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">forms</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">forms</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">Cov %</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">Acc %</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">forms</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">Cov %</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column"><span class="ltx_text ltx_font_bold">Acc %</span></th>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_t"><span class="ltx_text ltx_font_bold">Nouns</span></td>
<td class="ltx_td ltx_align_right ltx_border_t">58</td>
<td class="ltx_td ltx_align_right ltx_border_t">100<sup class="ltx_sup">∗</sup>
</td>
<td class="ltx_td ltx_align_right ltx_border_t">100 %</td>
<td class="ltx_td ltx_border_t"></td>
<td class="ltx_td ltx_align_right ltx_border_t">14</td>
<td class="ltx_td ltx_align_right ltx_border_t">15 %</td>
<td class="ltx_td ltx_align_right ltx_border_t">21 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left"><span class="ltx_text ltx_font_bold">Verbs</span></td>
<td class="ltx_td ltx_align_right">57</td>
<td class="ltx_td ltx_align_right">55</td>
<td class="ltx_td ltx_align_right">96 %</td>
<td class="ltx_td ltx_align_right">99 %</td>
<td class="ltx_td ltx_align_right">22</td>
<td class="ltx_td ltx_align_right">39 %</td>
<td class="ltx_td ltx_align_right">0 %</td>
</tr>
<tr class="ltx_tr">
<td class="ltx_td ltx_align_left ltx_border_bb"><span class="ltx_text ltx_font_bold">Adjectives</span></td>
<td class="ltx_td ltx_align_right ltx_border_bb">51</td>
<td class="ltx_td ltx_align_right ltx_border_bb">61</td>
<td class="ltx_td ltx_align_right ltx_border_bb">100 %</td>
<td class="ltx_td ltx_border_bb"></td>
<td class="ltx_td ltx_align_right ltx_border_bb">14</td>
<td class="ltx_td ltx_align_right ltx_border_bb">20 %</td>
<td class="ltx_td ltx_align_right ltx_border_bb">10 %</td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Unimorph dataset creation statistics. Expected forms is number of
forms based on the grammar, RB from rule-basd generator and LLM from large
language model, Coverage and Accuracy measured in % units. <sup class="ltx_sup">∗</sup> Some extra
forms in rule-based model are due to allomorphy which was not accounted for
expected forms.</figcaption>
</figure>
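<div class="ltx_para">
<p class="ltx_p">As a minimal sketch of how the Coverage and Accuracy figures in Table 1 can be
computed, assuming the expected paradigm and the generated output are available
as sets of form–analysis pairs; this is an illustration of the definitions
above, not the exact evaluation script we used.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: Coverage = unique generated forms found among the expected forms,
# out of all expected forms; Accuracy = fully correct (form, analysis) pairs
# out of all generated pairs. Data structures are illustrative.
def coverage_and_accuracy(expected, generated):
    """expected and generated are sets of (form, analysis) tuples."""
    expected_forms = {form for form, _ in expected}
    generated_forms = {form for form, _ in generated}
    coverage = len(generated_forms & expected_forms) / len(expected_forms)
    accuracy = len(generated & expected) / len(generated)
    return coverage, accuracy

if __name__ == "__main__":
    gold = {("tlu", "N;SG;NOM"), ("tlust", "N;SG;LOC")}
    llm = {("tlu", "N;SG;NOM"), ("tlu", "N;SG;ILL")}  # one correct, one wrong
    print(coverage_and_accuracy(gold, llm))           # (0.5, 0.5)
</pre>
</div>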
<div id="S5.p3" class="ltx_para">
<p class="ltx_p">In our experiments in Universal Dependencies annotation, we used the rule-based
system to generate ambiguous listing of all potential readings of the sentence
with annotations, according to the guidelines in previous works
by <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib276" title="Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in karelian treebanking" class="ltx_ref">5</a>]</cite>, and asked LLM to generate similar hypotheses
likewise. In Table <a href="#S5.T2" title="Table 2 ‣ 5 Experimental results ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">2</span></a> we measure the post edit distance of the
sentences fixed and re-annotated; the error rates are calculated as
<math id="S5.p3.m1" class="ltx_Math" alttext="E=\frac{S+I}{N}" display="inline"><mrow><mi>E</mi><mo>=</mo><mfrac><mrow><mi>S</mi><mo>+</mo><mi>I</mi></mrow><mi>N</mi></mfrac></mrow></math>, where <math id="S5.p3.m2" class="ltx_Math" alttext="E" display="inline"><mi>E</mi></math> is the error rate, <math id="S5.p3.m3" class="ltx_Math" alttext="S" display="inline"><mi>S</mi></math> is number of substitutions
made, <math id="S5.p3.m4" class="ltx_Math" alttext="I" display="inline"><mi>I</mi></math> is the insertions made, and <math id="S5.p3.m5" class="ltx_Math" alttext="N" display="inline"><mi>N</mi></math> number of readings (i.e. N is number
of CONLL-U lines with an index). We do not have <math id="S5.p3.m6" class="ltx_Math" alttext="D" display="inline"><mi>D</mi></math> for deletions since both
methods correctly generated one token per token in the input and there
are so far no retokenisation requirements (multi-word tokens, multi-token words,
etc.); however, the LLM missed some punctuation tokens, causing insertions to be
required. The full error rate basically counts whole CONLL-U lines when
matching, and the dep error rate only the dep field.</p>
</div>
<figure id="S5.T2" class="ltx_table">
<table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt">System</th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">Full WER</span></th>
<th class="ltx_td ltx_align_right ltx_th ltx_th_column ltx_border_tt"><span class="ltx_text ltx_font_bold">Dep WER</span></th>
</tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t"><span class="ltx_text ltx_font_bold">Rule-Based</span></th>
<td class="ltx_td ltx_align_right ltx_border_t">0.47</td>
<td class="ltx_td ltx_align_right ltx_border_t">0.22</td>
</tr>
<tr class="ltx_tr">
<th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb"><span class="ltx_text ltx_font_bold">LLM</span></th>
<td class="ltx_td ltx_align_right ltx_border_bb">1.00</td>
<td class="ltx_td ltx_align_right ltx_border_bb">0.52</td>
</tr>
</tbody>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>Caption</figcaption>
</figure>
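<div class="ltx_para">
<p class="ltx_p">The error rates in Table 2 follow the formula above; a minimal sketch of the
computation over CONLL-U token lines is given below. The position-based
alignment and the column index used for the dependency field are simplifying
assumptions for illustration, not the exact evaluation script.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
# Sketch: post-edit error rate E = (S + I) / N over CONLL-U token lines.
# S counts lines the annotator changed, I lines that had to be inserted
# (e.g. punctuation missed by the LLM), N the token lines in the gold file.
def token_lines(conllu_text):
    """Keep only lines whose first tab-separated field is a token index."""
    return [line for line in conllu_text.splitlines()
            if line and line.split("\t")[0].isdigit()]

def post_edit_error_rate(draft_text, gold_text, column=None):
    draft = token_lines(draft_text)
    gold = token_lines(gold_text)
    if column is not None:          # e.g. column=7 compares only the DEPREL field
        draft = [line.split("\t")[column] for line in draft]
        gold = [line.split("\t")[column] for line in gold]
    insertions = max(0, len(gold) - len(draft))
    substitutions = sum(1 for d, g in zip(draft, gold) if d != g)
    return (substitutions + insertions) / len(gold)
</pre>
</div>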
</section>
<section id="S6" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">6 </span>Discussion</h2>
<div id="S6.p1" class="ltx_para">
<p class="ltx_p">We have tested rule-based and LLM-based annotations as a help in linguistic
work. Currently, for morphology we get clearly better results with the
rule-based tools and the results are good enough that it makes work on dataset
creation more effective. If we analyse the errors that the systems make, we see
that rule-based system includes some results with linguistically motivated
potential errors, like wrong stem alternation or missing accent in a suffix.
The errors in LLM generated version are that it just uses seemingly random
suffixes with unchanged stem, it also uses some forms like cases that do not
exist in Inari Sámi (but for example exist in Finnish), all in all cleaning this
data would possible even be slower than writing the data by hand. When we
error-analyse the dependency analysis the results get more interesting, like
both starting points require quite a significant amount of work to get to
gold-standard state, but this is also to be expected if reference the past
experiences of UD annotation from converted or machine analysed starting point.
What is interesting is that the LLM can sometimes generate quite accurate
depdendency subgraphs of certain expressions, for example personal names, we
assume this is due to them appearing in very similar form in existing English
documentations, where high level dependency structure is the same even if there
are slight variations in the morphological level.</p>
</div>
<div id="S6.p2" class="ltx_para">
<p class="ltx_p">There are a large number of different large language models and generative
artificial intelligence that could possibly be used to experiment this and that
is a common feedback we get. We are using a version of a popular LLM that is
available to us, without excessive extra costs. This is also available to most
researchers who are the target audience of this paper.</p>
</div>
<div id="S6.p3" class="ltx_para">
<p class="ltx_p">A common feedback we get, that there are various techniques that should be used
for low resource setup, like fine-tunings, transfer learnings, in context
learnings, prompting techniques and so on. We are experimenting in a situation
where we start with zero data for the fine-tuning task, we are the ones who will
create these data initially, so the use of such data will generally be a future
research topic, after we have done the initial data creation. As the
methodology here is extremely fast moving and outdates itself in matter of
months, we try to begin by only importing either approaches that have been
proven and stabilised, perhaps in majority language context, into out lesser
resourced languages, or we can perform experimentation that does not tie up too
much valuable and scarce resources. Another interesting future research
question would be whether it is more beneficial and time-effective to fine-tune
early or on-goingly, given the constraints in data and human resources we face
in the processing of smaller Uralic languages. We have not found an easy enough
recipe to do transfer learning that would not take us more time than actually
working on the data creation as described by the approach of this article. Our
impression is furthermore that there is currently ongoing research on this topic
that we hope will yield some answers that are relevant to us as well.</p>
</div>
<div id="S6.p4" class="ltx_para">
<p class="ltx_p">It is exciting to see that, even if the large language models have rathar
disappointing accuracy in generating and annotation of smaller Uralic languages,
they are able to generate something that is relevant to the task and
occasionally some word-forms or annotations are even correct. This suggests
that maybe with further fine-tuning, prompting, in-context learning, transfer
learning, and so forth, there could be a usable version of LLM-aided language
data annotation and generation in the future.</p>
</div>
<div id="S6.p5" class="ltx_para">
<p class="ltx_p">One question for future work is of course how to integrate these findings to a
workflow and softwares for annotation. In this experiment we used normal text
editors and raw data formats for data annotation, which is suitable for
programmers and short experiments, for the full scale linguistic annotation this
would be integrated to a specific editor. And that raises the question of if
the ideal way to help linguistic jobs would bear a user interface similar to
what we get in the email post writing programs, office tools and programming
editors today with a so-called <span class="ltx_text ltx_font_italic">co-pilot</span>?</p>
</div>
</section>
<section id="S7" class="ltx_section">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">7 </span>Conclusion</h2>
<div id="S7.p1" class="ltx_para">
<p class="ltx_p">We performed several experiments to find out an efficient way of creating NLP
datasets for smaller Uralic languages. We have found that using both existing
rule-based technology and large language models can help rapid creation of the
data, but neither approach is without its caveats. The gold standard remains
fully human annotated data, but in lack of that it should be considered if we
can achieve reasonable amounts of resources with computer-aided annotation
modes.</p>
</div>
</section>
<section id="Sx1" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Limitations</h2>
<div id="Sx1.p1" class="ltx_para">
<p class="ltx_p">The experimentation on large language models is done using one closed source
commercial system and is not reproducible at all, however, this is a common
practice in the science of natural language processing in 2024.</p>
</div>
<div id="Sx1.p2" class="ltx_para">
<p class="ltx_p">The experiments were performed by language learner instead of native speaker or
expert, the qualitative results may differ when language experts are working on
the same pre-processed data.</p>
</div>
</section>
<section id="Sx2" class="ltx_section">
<h2 class="ltx_title ltx_title_section">Ethics</h2>
<div id="Sx2.p1" class="ltx_para">
<p class="ltx_p">The large language models used in this experimentation have wasted an estimated
several hundreds of litres of drinking
water <span id="footnote7" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">7</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">7</sup>
<span class="ltx_tag ltx_tag_note">7</span>
<a href="https://www.thetimes.com/uk/technology-uk/article/thirsty-chatgpt-uses-four-times-more-water-than-previously-thought-bc0pqswdr" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://www.thetimes.com/uk/technology-uk/article/thirsty-chatgpt-uses-four-times-more-water-than-previously-thought-bc0pqswdr</a></span></span></span>
and a not insignificant amount of
energy <cite class="ltx_cite ltx_citemacro_cite">[<a href="#bib.bib367" title="Energy and policy considerations for deep learning in NLP" class="ltx_ref">9</a>]</cite>.<span id="footnote8" class="ltx_note ltx_role_footnote"><sup class="ltx_note_mark">8</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">8</sup>
<span class="ltx_tag ltx_tag_note">8</span>
<a href="https://disconnect.blog/silicon-valley-is-sacrificing-the-climate-for-ai/" title="" class="ltx_ref ltx_url ltx_font_typewriter">https://disconnect.blog/silicon-valley-is-sacrificing-the-climate-for-ai/</a></span></span></span>
If the LLM method is taken into use in the development of annotated gold corpora
and datasets, this needs to be taken into consideration until the providers of
LLMs resolve the excessive use of natural resources.</p>
</div>
<div id="Sx2.p2" class="ltx_para">
<p class="ltx_p">No underpaid crowd-sourcers were involved in performing the linguistic tasks,
all annotations and evaluations were made by fully paid colleagues.</p>
</div>
</section>
<section id="bib" class="ltx_bibliography">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul id="bib.L1" class="ltx_biblist">
<li id="bib.bib27" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[1]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">A. Bapna, I. Caswell, J. Kreutzer, O. Firat, D. van Esch, A. Siddhant, M. Niu, P. Baljekar, X. Garcia, W. Macherey, T. Breiner, V. Axelrod, J. Riesa, Y. Cao, M. X. Chen, K. Macherey, M. Krikun, P. Wang, A. Gutkin, A. Shah, Y. Huang, Z. Chen, Y. Wu, and M. Hughes</span><span class="ltx_text ltx_bib_year"> (2022)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Building machine translation systems for the next thousand languages</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><span class="ltx_text ltx_bib_external">2205.03983</span>,
<a href="https://arxiv.org/abs/2205.03983" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p2" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib366" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[2]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">K. Batsuren, O. Goldman, S. Khalifa, N. Habash, W. Kieraś, G. Bella, B. Leonard, G. Nicolai, K. Gorman, Y. G. Ate, <span class="ltx_text ltx_bib_etal">et al.</span></span><span class="ltx_text ltx_bib_year"> (2022)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">UniMorph 4.0: universal morphology</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the Thirteenth Language Resources and Evaluation Conference</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 840–855</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p5" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib73" class="ltx_bibitem ltx_bib_article">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[3]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, <span class="ltx_text ltx_bib_etal">et al.</span></span><span class="ltx_text ltx_bib_year"> (2022)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">No language left behind: scaling human-centered machine translation</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">arXiv preprint arXiv:2207.04672</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p2" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib368" class="ltx_bibitem ltx_bib_book">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[4]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">P. Morottaja and M. Olthuis</span><span class="ltx_text ltx_bib_year"> (2023)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Inarinsaamen taivutusoppi</span>.
</span>
<span class="ltx_bibblock"> <span class="ltx_text ltx_bib_publisher">Sámediggi</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S5.p2" title="5 Experimental results ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§5</span></a>.
</span>
</li>
<li id="bib.bib276" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[5]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">F. A. Pirinen</span><span class="ltx_text ltx_bib_year"> (2019)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in karelian treebanking</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the Universal Dependencies Workshop 2019</span>,
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p5" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>,
<a href="#S5.p3" title="5 Experimental results ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§5</span></a>.
</span>
</li>
<li id="bib.bib370" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[6]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">F. Pirinen, S. Moshagen, and K. Hiovain-Asikainen</span><span class="ltx_text ltx_bib_year"> (2023)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">GiellaLT—a stable infrastructure for nordic minority languages and beyond</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_pages"> pp. 643–649</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S3.p2" title="3 Methods ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§3</span></a>.
</span>
</li>
<li id="bib.bib290" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[7]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">S. Ritchie, D. van Esch, U. Okonkwo, S. Vashishth, and E. Drummond</span><span class="ltx_text ltx_bib_year"> (2024-05)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">LinguaMeta: unified metadata for thousands of languages</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation (LREC-COLING
2024)</span>, <span class="ltx_text ltx_bib_editor">N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Eds.)</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">Torino, Italia</span>, <span class="ltx_text ltx_bib_pages"> pp. 10530–10538</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="https://aclanthology.org/2024.lrec-main.921" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p2" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib369" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[8]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">SIKOR</span><span class="ltx_text ltx_bib_year"> (2021)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">SIKOR uit norgga árktalaš universitehta ja norgga sámedikki sámi teakstačoakkáldat, veršuvdna 06.11.2018</span>.
</span>
<span class="ltx_bibblock">Note: <span class="ltx_text ltx_bib_note"><span class="ltx_ref ltx_nolink ltx_url ltx_font_typewriter ltx_ref_self">http://gtweb.uit.no/korp</span>Accessed: 2024-10-01</span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S4.p2" title="4 Corpora and Data Selection ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§4</span></a>.
</span>
</li>
<li id="bib.bib367" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[9]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">E. Strubell, A. Ganesh, and A. McCallum</span><span class="ltx_text ltx_bib_year"> (2019-07)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Energy and policy considerations for deep learning in NLP</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the 57th Conference of the Association for
Computational Linguistics</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">Florence, Italy</span>, <span class="ltx_text ltx_bib_pages"> pp. 3645–3650</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="https://www.aclweb.org/anthology/P19-1355" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#Sx2.p1" title="Ethics ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_title">Ethics</span></a>.
</span>
</li>
<li id="bib.bib74" class="ltx_bibitem ltx_bib_article">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[10]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">N. Team <span class="ltx_text ltx_bib_etal">et al.</span></span><span class="ltx_text ltx_bib_year"> (2024)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Scaling neural machine translation to 200 languages</span>.
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_journal">Nature</span> <span class="ltx_text ltx_bib_volume">630</span> (<span class="ltx_text ltx_bib_number">8018</span>), <span class="ltx_text ltx_bib_pages"> pp. 841</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p2" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib355" class="ltx_bibitem ltx_bib_inproceedings">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[11]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_author">F. M. Tyers and M. Sheyanova</span><span class="ltx_text ltx_bib_year"> (2017-01)</span>
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Annotation schemes in North Sámi dependency parsing</span>.
</span>
<span class="ltx_bibblock">In <span class="ltx_text ltx_bib_inbook">Proceedings of the Third Workshop on Computational Linguistics
for Uralic Languages</span>,
</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_place">St. Petersburg, Russia</span>, <span class="ltx_text ltx_bib_pages"> pp. 66–75</span>.
</span>
<span class="ltx_bibblock">External Links: <span class="ltx_text ltx_bib_links"><a href="https://dx.doi.org/10.18653/v1/W17-0607" title="" class="ltx_ref doi ltx_bib_external">Document</a>,
<a href="https://www.aclweb.org/anthology/W17-0607" title="" class="ltx_ref ltx_bib_external">Link</a></span>
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p5" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
<li id="bib.bib365" class="ltx_bibitem ltx_bib_misc">
<span class="ltx_tag ltx_bib_key ltx_role_refnum ltx_tag_bibitem">[12]</span>
<span class="ltx_bibblock"><span class="ltx_text ltx_bib_title">Universal dependencies 2.14</span>.
</span>
<span class="ltx_bibblock ltx_bib_cited">Cited by: <a href="#S2.p3" title="2 Background ‣ Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets 1 footnote 1 1 footnote 1 This is Flammieâs draft, official version may differ. The ACL anthology version is available at http://aclanthology.org/{2024.iwclul-1.16}. ACL is typically licenced CC-BY." class="ltx_ref"><span class="ltx_text ltx_ref_tag">§2</span></a>.
</span>
</li>
</ul>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated on Mon Dec 16 16:47:52 2024 by <a href="http://dlmf.nist.gov/LaTeXML/" class="ltx_LaTeXML_logo"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg==" alt="Mascot Sammy"></a>
</div></footer>
</div>
</body>
</html>