Investigate `ʽἑκάεργος` and `̓Ολυμπιιάς` #72

whoopsedesy · 2021-11-20T15:19:30Z

These "words" appear at the top of the expectancy output because they do not begin with a Greek letter, and they look like errors: ʽἑκάεργος starts with U+02BD MODIFIER LETTER REVERSED COMMA, and ̓Ολυμπιιάς starts with an unattached U+0313 COMBINING COMMA ABOVE.

lemma,sedes,x,z
ʽἑκάεργος,4,3,
̓Ολυμπιιάς,6.5,1,
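
A minimal sketch of how such lemmata could be flagged automatically, using only Python's standard `unicodedata` module. The function name `suspicious_first_char` is hypothetical and not part of SEDES. Note that U+02BD has general category Lm (a "modifier letter"), so a plain `str.isalpha()` check would not catch it; the sketch therefore also requires the character name to start with `GREEK`.

```python
import unicodedata


def suspicious_first_char(lemma):
    """Return (code point, Unicode name) of the first character of `lemma`
    if it is not a Greek letter; return None otherwise.

    Hypothetical helper for illustration only.
    """
    if not lemma:
        return None
    ch = lemma[0]
    name = unicodedata.name(ch, "<unnamed>")
    # A Greek letter has a general category starting with "L" and a
    # character name starting with "GREEK".
    is_greek_letter = (
        unicodedata.category(ch).startswith("L") and name.startswith("GREEK")
    )
    if is_greek_letter:
        return None
    return f"U+{ord(ch):04X}", name


# The two lemmata from this issue, plus a well-formed one for comparison.
# Leading characters are written as escapes so they stay visible.
for lemma in ["\u02bd\u1f11κάεργος", "\u0313Ολυμπιιάς", "\u1f19κάεργος"]:
    print(repr(lemma), suspicious_first_char(lemma))
```

Running a check like this over the `lemma` column would list every lemma whose first character needs a closer look, without relying on sort order.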


sasansom · 2021-12-18T15:57:20Z

I'm looking at ʽἑκάεργος. This lemma is odd, and I have no idea why the lemmatizer would output a lemma beginning with a comma and not a letter. The original words in the texts (ἑκάεργον at Iliad 1.147, 474; Hom.Hymn 4.239) are not preceded by commas.

I've tried to fix this in lemma.py by mapping ἑκάεργον to Ἑκάεργος (commit 7d9583c, which also fixes an error in the original beta code: a smooth instead of a rough breathing in three instances of "Far-shooting"):
("ἑκάεργον", "Ἑκάεργος"),
The funny thing is, it fixed only one of the three instances (Hom.Hymn 4.239). It turns out their beta code differs:

Hom.Hymn	4	239	3	ἑκάεργον	Ἑκάεργος

*(eka/ergon

Il.	1	147	3	ἑκάεργον	ʽἑκάεργος

e(ka/ergon

Il.	1	474	2	ἑκάεργον	ʽἑκάεργος

e(ka/ergon

The mapping fixed *(eka/ergon, but not e(ka/ergon. I'm wondering if the disparity comes from the fact that lemma.py uses Unicode. Perhaps something funny happens when going between the beta code of the text and the Unicode in lemma.py, but I don't know.
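
For illustration only, and not a diagnosis of what happened here: one common way that two visually identical Greek strings fail to match is a difference in Unicode encoding of the accented vowel. The sketch below builds ἑκάεργον twice, once with U+03AC (alpha with tonos) and once with U+1F71 (alpha with oxia), and shows that they compare equal only after normalization. The variable names are made up for this example.

```python
import unicodedata

tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA

# Two renderings of the same word that look identical on screen.
# "\u1f11" is GREEK SMALL LETTER EPSILON WITH DASIA (rough breathing).
word_a = "\u1f11κ" + tonos + "εργον"
word_b = "\u1f11κ" + oxia + "εργον"


def nfc(s):
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", s)


print(word_a == word_b)            # False: different code points
print(nfc(word_a) == nfc(word_b))  # True: canonically equivalent
```

If a mapping keyed on one form is looked up with the other, the lookup misses silently, so normalizing both sides before comparison is a cheap safeguard.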

Then I noticed another disparity between the lemmatization of words with the same beta code in the Hom. Hymns and Iliad:

Hom.Hymn	4	333	6	ἑκάεργος	Ἑκάεργος

e(ka/ergos

Il.	1	479	6	ἑκάεργος	ἑκάεργος

e(ka/ergos

Here, the beta code is the same in both texts (e(ka/ergos), but Hom.Hymn 4.333 produces the lemma mapped in lemma.py (Ἑκάεργος), whereas Iliad 1.479 produces what I'm assuming is the lemma from the backoff_lemmatizer (ἑκάεργος).

So, the original problem is still outstanding for the two instances in the Iliad, and I am not sure how to resolve it. The problem doesn't seem to be with the beta code or with lemma.py.

whoopsedesy · 2021-12-20T06:45:40Z

> So, the original problem is still outstanding for the two instances in the Iliad, and I am not sure how to resolve it. The problem doesn't seem to be with the beta code or with lemma.py.

Seems fixed to me in 7d9583c?

Before:

$ grep ʽἑκάεργος corpus/*.csv
corpus/homerichymns.csv:Hom.Hymn,4,239,3,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,auto,1
corpus/iliad.csv:Il.,1,147,3,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,manual,1
corpus/iliad.csv:Il.,1,474,2,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,auto,1

After:

$ make clean
$ make corpus/homerichymns.csv corpus/iliad.csv
$ grep ʽἑκάεργος corpus/*.csv

whoopsedesy · 2021-12-20T07:39:58Z

Both ʽἑκάεργος and ̓Ολυμπιιάς as lemmata originate in cltk/grc_models_cltk: cltk/grc_models_cltk#5. It's not something to do with SEDES code or the source texts. We can still work around it using our local list of lemmatization overrides, as you did already with ἑκάεργος.
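
As a rough sketch of the workaround idea (a local override consulted before the upstream lemmatizer), assuming a list of `(word, lemma)` tuples like the entry quoted earlier in this thread. All names here (`OVERRIDE_PAIRS`, `lemmatize`, `fallback`) are hypothetical and do not describe the actual structure of lemma.py.

```python
import unicodedata

EXAMPLE_WORD = "ἑκάεργον"
EXAMPLE_LEMMA = "Ἑκάεργος"

# Hypothetical local override list: (word, lemma) pairs.
OVERRIDE_PAIRS = [
    (EXAMPLE_WORD, EXAMPLE_LEMMA),
]

# Key the lookup table on NFC-normalized words so that canonically
# equivalent spellings of the same word hit the same entry.
OVERRIDES = {
    unicodedata.normalize("NFC", word): lemma for word, lemma in OVERRIDE_PAIRS
}


def lemmatize(word, fallback):
    """Return the local override for `word` if one exists; otherwise
    defer to `fallback`, a callable standing in for the upstream lemmatizer.
    """
    key = unicodedata.normalize("NFC", word)
    if key in OVERRIDES:
        return OVERRIDES[key]
    return fallback(word)


def faulty_upstream(word):
    """Stand-in for an upstream model that returns a malformed lemma."""
    return "\u02bd" + word


print(lemmatize(EXAMPLE_WORD, faulty_upstream))
```

With this arrangement, a malformed upstream lemma is never seen for any word that has a local override, while all other words still go through the upstream lemmatizer unchanged.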

sasansom referenced this issue Dec 18, 2021

Fix far-shooting

7d9583c

whoopsedesy mentioned this issue Dec 20, 2021

greek_lemmatized_sents.pickle contains lemmas that start with non-alphabetic characters cltk/grc_models_cltk#5

Open

sasansom closed this as completed in 8fe7296 Nov 24, 2023
