Investigate ʽἑκάεργος and ̓Ολυμπιιάς #72

Closed
whoopsedesy opened this issue Nov 20, 2021 · 3 comments

@whoopsedesy
Collaborator

These "words" appear at the top of the expectancy output (because they do not begin with a Greek letter). They look like errors: ʽἑκάεργος starts with U+02BD MODIFIER LETTER REVERSED COMMA, and ̓Ολυμπιιάς starts with an unattached U+0313 COMBINING COMMA ABOVE.

lemma,sedes,x,z
ʽἑκάεργος,4,3,
̓Ολυμπιιάς,6.5,1,
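
The stray leading characters are hard to see in most fonts, so it can help to print code points and Unicode names. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is made up for illustration, and the leading characters are written as escapes so they cannot be lost in copy and paste. The last lines show that both lemmata come first under plain code point sorting, which is consistent with where they appear in the output (whether the expectancy output is actually sorted that way is an assumption).

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for each character of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The two reported lemmata, with their first characters written as escapes.
REPORTED = ["\u02bd\u1f11κάεργος", "\u0313\u039fλυμπιιάς"]

for lemma in REPORTED:
    code_point, name = describe(lemma)[0]
    print(code_point, name)

# U+02BD and U+0313 both precede the Greek blocks (which start at U+0370),
# so these strings sort ahead of lemmata that start with a Greek letter.
print(sorted(REPORTED + ["\u1f11κάεργος"])[:2] == REPORTED)
```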
@sasansom
Owner

sasansom commented Dec 18, 2021

I'm looking at ʽἑκάεργος. This lemma is odd, and I have no idea why the lemmatizer would output a lemma beginning with a comma and not a letter. The original words in the texts (ἑκάεργον at Iliad 1.147, 474; Hom.Hymn 4.239) are not preceded by commas.

I've tried to fix this in lemma.py by mapping ἑκάεργον to Ἑκάεργος (commit 7d9583c, which also fixes an error in the original beta code: a smooth instead of a rough breathing in three instances of "Far-shooting"):
("ἑκάεργον", "Ἑκάεργος"),
The funny thing is, it fixed only one of the three instances (Hom.Hymn 4.239). It turns out their beta code differs:

Hom.Hymn 4 239 3 ἑκάεργον Ἑκάεργος

*(eka/ergon

Il. 1 147 3 ἑκάεργον ʽἑκάεργος

e(ka/ergon

Il. 1 474 2 ἑκάεργον ʽἑκάεργος

e(ka/ergon

The mapping fixed *(eka/ergon, but not e(ka/ergon. I wonder whether the disparity arises because lemma.py uses Unicode: perhaps something goes wrong in the conversion between the beta code of the text and the Unicode in lemma.py, but I don't know.
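
For reference, the two beta code forms differ only in capitalization: in Beta Code a leading `*` marks a capital letter, and the breathing of a capital is written before the letter rather than after it. The sketch below folds capitals away so the two forms can be compared. The function name `fold_capitals` is invented for this example, and nothing in the thread says that SEDES or the lemmatizer performs this kind of folding.

```python
import re

# Beta Code diacritics: ) smooth, ( rough, / acute, \ grave,
# = circumflex, + diaeresis, | iota subscript.
# A capital is written as '*', then any diacritics, then the letter.
_CAPITAL = re.compile(r"\*([()/\\=+|]*)([A-Za-z])")

def fold_capitals(beta):
    """Rewrite each capital as its lowercase form: letter first, then diacritics."""
    return _CAPITAL.sub(lambda m: m.group(2).lower() + m.group(1), beta)

hymn = "*(eka/ergon"   # Hom.Hymn 4.239
iliad = "e(ka/ergon"   # Il. 1.147 and Il. 1.474

print(hymn == iliad)                                # False
print(fold_capitals(hymn) == fold_capitals(iliad))  # True
```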

Then I noticed another disparity between the lemmatization of words with the same beta code in the Hom. Hymns and Iliad:

Hom.Hymn 4 333 6 ἑκάεργος Ἑκάεργος

e(ka/ergos

Il. 1 479 6 ἑκάεργος ἑκάεργος

e(ka/ergos

Here, the beta code is the same in both texts (e(ka/ergos), but Hom.Hymn 4.333 produces the lemma mapped in lemma.py (Ἑκάεργος), whereas Iliad 1.479 produces what I'm assuming is the lemma from the backoff_lemmatizer (ἑκάεργος).

So, the original problem is still outstanding for the two instances in the Iliad, and I am not sure how to resolve it. The problem doesn't seem to be with the beta code or with lemma.py.

@whoopsedesy
Collaborator Author

> So, the original problem is still outstanding for the two instances in the Iliad, and I am not sure how to resolve it. The problem doesn't seem to be with the beta code or with lemma.py.

Seems fixed to me in 7d9583c?

Before:

$ grep ʽἑκάεργος corpus/*.csv
corpus/homerichymns.csv:Hom.Hymn,4,239,3,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,auto,1
corpus/iliad.csv:Il.,1,147,3,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,manual,1
corpus/iliad.csv:Il.,1,474,2,ἑκάεργον,ʽἑκάεργος,4,⏑⏑–⏑,auto,1

After:

$ make clean
$ make corpus/homerichymns.csv corpus/iliad.csv
$ grep ʽἑκάεργος corpus/*.csv
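
The grep above looks for one specific bad lemma. A more general check, sketched here, reports any row whose lemma does not start with a Greek letter. It assumes the lemma is the sixth comma-separated field, as in the rows shown above; the function names and the inline sample are illustrative only, and the third sample row is a guess at how the Hom.Hymn row looks after the fix. A header row, if the files have one, would be reported too and can be skipped.

```python
import csv
import unicodedata

LEMMA_COLUMN = 5  # assumption: lemma is the sixth field, as in the grep output

def starts_with_greek_letter(text):
    """True if the first character is a letter whose Unicode name starts with GREEK."""
    if not text:
        return False
    first = text[0]
    return (unicodedata.category(first).startswith("L")
            and unicodedata.name(first, "").startswith("GREEK"))

def suspicious_rows(lines):
    """Return the CSV rows whose lemma field does not start with a Greek letter."""
    return [row for row in csv.reader(lines)
            if len(row) > LEMMA_COLUMN
            and not starts_with_greek_letter(row[LEMMA_COLUMN])]

# Two rows from the "Before" output, plus one hypothetical post-fix row.
SAMPLE = (
    "Il.,1,147,3,ἑκάεργον,\u02bdἑκάεργος,4,⏑⏑–⏑,manual,1\n"
    "Il.,1,474,2,ἑκάεργον,\u02bdἑκάεργος,4,⏑⏑–⏑,auto,1\n"
    "Hom.Hymn,4,239,3,ἑκάεργον,Ἑκάεργος,4,⏑⏑–⏑,auto,1\n"
)

for row in suspicious_rows(SAMPLE.splitlines()):
    print(row[0], row[1], row[2], describe_first := unicodedata.name(row[LEMMA_COLUMN][0]))
```

To run it over the real files, pass an open file object (opened with `newline=""` and `encoding="utf-8"`) instead of `SAMPLE.splitlines()`.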

@whoopsedesy
Collaborator Author

The lemmata ʽἑκάεργος and ̓Ολυμπιιάς both originate in cltk/grc_models_cltk (see cltk/grc_models_cltk#5); they have nothing to do with the SEDES code or the source texts. We can still work around them using our local list of lemmatization overrides, as you already did with ἑκάεργος.
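
As a rough illustration of the override approach, the sketch below consults a local table before falling back to the upstream lemmatizer. Only the `("ἑκάεργον", "Ἑκάεργος")` pair comes from this thread; the names `OVERRIDES`, `lemmatize`, and `fake_backoff` are hypothetical and do not describe how lemma.py is actually organized. Keys are NFC-normalized so that precomposed and decomposed spellings of the same word match.

```python
import unicodedata

# (word form, lemma) pairs, in the same shape as the lemma.py entry quoted earlier.
OVERRIDES = [
    ("ἑκάεργον", "Ἑκάεργος"),
]

def nfc(text):
    """Normalize to NFC so equivalent Unicode spellings compare equal."""
    return unicodedata.normalize("NFC", text)

OVERRIDE_MAP = {nfc(form): lemma for form, lemma in OVERRIDES}

def lemmatize(word, backoff):
    """Return the local override for word if there is one, else backoff(word)."""
    override = OVERRIDE_MAP.get(nfc(word))
    if override is not None:
        return override
    return backoff(word)

def fake_backoff(word):
    # Hypothetical stand-in for the upstream lemmatizer: it reproduces the
    # reported bad lemma for this one form and echoes every other word.
    if nfc(word) == nfc("ἑκάεργον"):
        return "\u02bdἑκάεργος"
    return word

print(lemmatize("ἑκάεργον", fake_backoff))
print(lemmatize("μῆνιν", fake_backoff))
```

The override wins whenever the word form is listed, so the upstream lemma never reaches the output for that form; anything not listed still goes through the backoff unchanged.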
