Fix unusual multiword-entries #34

Open
flammie opened this issue Jul 6, 2020 · 6 comments

@flammie
Member

flammie commented Jul 6, 2020

There are a large number of entries in the current apertium-eng.eng.dix that contain whitespace but are not typical lexical entries. These should be moved to the various -separable dictionaries or just removed altogether. Examples:

<e lm="a holiday">       <i>a<b/>holiday</i><par n="house__n"/></e>
<e lm="foreign language"><i>foreign<b/>language</i><par n="house__n"/></e>
<e lm="very strong wind"><i>very<b/>strong<b/>wind</i><par n="house__n"/></e>
<e lm="shortly after">   <i>shortly<b/>after</i><par n="at__pr"/></e>
<e r="RL" lm="shortly after"><i>shortly<b/>after</i><par n="after__cnjadv"/></e>
<e r="LR" lm="that's why"><p><l>that's<b/>why</l><r>that<b/>is<b/>why</r></p><par n="after__cnjadv"/></e>
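
To get a first overview of how many such entries exist, something like the following could be used. This is only a minimal sketch: it assumes that a multiword entry is any `<e>` containing a `<b/>` blank, it runs on a small inline sample built from the examples above (plus one single-word entry), and the function name `multiword_entries` is made up for illustration. For the real file, `ET.parse("apertium-eng.eng.dix").getroot()` could replace the `ET.fromstring` call.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in monodix format; the real dictionary is assumed
# to use the same <e>/<i>/<p>/<l>/<r>/<b/> structure.
SAMPLE = """<dictionary><section id="main" type="standard">
<e lm="a holiday"><i>a<b/>holiday</i><par n="house__n"/></e>
<e lm="house"><i>house</i><par n="house__n"/></e>
<e r="LR" lm="that's why"><p><l>that's<b/>why</l><r>that<b/>is<b/>why</r></p><par n="after__cnjadv"/></e>
</section></dictionary>"""


def multiword_entries(root):
    """Yield (lemma, paradigm names) for every <e> that contains a <b/> blank."""
    for e in root.iter("e"):
        # A <b/> anywhere below the entry marks a space inside the surface form.
        if e.find(".//b") is not None:
            pars = [p.get("n") for p in e.findall("par")]
            yield e.get("lm", ""), pars


root = ET.fromstring(SAMPLE)
found = list(multiword_entries(root))
for lemma, pars in found:
    print(lemma, pars)
```

Note that `root.iter("e")` also visits entries inside `<pardef>` elements; those have no `lm` attribute and would show up with an empty lemma if they ever contained a blank.
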
@mr-martian
Contributor

I think a lot of these phrases are originally from eng-deu. I can do some grepping in an hour or two and verify this.

@flammie
Member Author

flammie commented Jul 7, 2020

Yeah, I can envision many of them being added because of certain specific languages, maybe some were even added from Finnish too, but we need some future-proof solution for language-specific hacks in the monodix, e.g. -separables.

I listed a handful of examples but there're dozens if not hundreds in -eng, not all of them equally questionable...

@mr-martian
Contributor

See #35 for the ~900 that are only in eng-deu.

The biggest users of -eng multiwords seem to be fin-eng, isl-eng, and eng-deu.

@flammie
Member Author

flammie commented Jul 8, 2020

> See #35 for the ~900 that are only in eng-deu.

These seem good, but I guess they should be reviewed by a native speaker ;-)

> The biggest users of -eng multiwords seem to be fin-eng, isl-eng, and eng-deu.

A large stash of the eng-fin ones were probably added semi-automatically from untrimmed debug output or other questionable sources and can simply be deleted... but if you have some scripts in place to easily generate a list, I could have a look.

@mr-martian
Contributor

eng_mwe_loc.txt

Here are the multiwords not affected by #35 and the bidixes they appear in.
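
A rough sketch of how a list like this could be produced (not the script actually used for the attached file, whose format is not shown here). It assumes the usual bidix layout `<e><p><l>…</l><r>…</r></p></e>`, that the English side is `<l>` in eng-deu and `<r>` in fin-eng, and it uses made-up toy entries; `side_lemma` and `locate` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy bidix fragments, invented for illustration only.
ENG_DEU = """<dictionary><section id="main" type="standard">
<e><p><l>foreign<b/>language<s n="n"/></l><r>Fremdsprache<s n="n"/></r></p></e>
<e><p><l>house<s n="n"/></l><r>Haus<s n="n"/></r></p></e>
</section></dictionary>"""

FIN_ENG = """<dictionary><section id="main" type="standard">
<e><p><l>vieras<b/>kieli<s n="n"/></l><r>foreign<b/>language<s n="n"/></r></p></e>
<e><p><l>loma<s n="n"/></l><r>a<b/>holiday<s n="n"/></r></p></e>
</section></dictionary>"""


def side_lemma(side):
    """Rebuild the lemma of an <l>/<r> element, turning each <b/> into a space."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "b":
            parts.append(" ")
        # Text following a child (<b/> or <s/>) is stored in its tail.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def locate(multiwords, bidixes):
    """Map each multiword lemma to the sorted names of the bidixes containing it.

    bidixes: name -> (xml_text, tag of the English side, "l" or "r").
    """
    where = defaultdict(set)
    for name, (xml_text, side) in bidixes.items():
        for element in ET.fromstring(xml_text).iter(side):
            lemma = side_lemma(element)
            if lemma in multiwords:
                where[lemma].add(name)
    return {lemma: sorted(names) for lemma, names in where.items()}


multiwords = {"foreign language", "a holiday", "very strong wind"}
bidixes = {"eng-deu": (ENG_DEU, "l"), "fin-eng": (FIN_ENG, "r")}
located = locate(multiwords, bidixes)
for lemma, names in sorted(located.items()):
    print(lemma, names)
```

Multiwords that no bidix uses (here "very strong wind") are simply absent from the result, which makes them easy candidates for review.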

@flammie
Member Author

flammie commented Jul 9, 2020

Hmm, yeah, it's a real mishmash of things, ranging from acceptable lexical units to random combinations of adjacent words...

I don't know if there's any good heuristic for deciding whether they go in the language-specific or the monolingual part, other than going through the list by hand. Maybe someone can come up with tactics?
