Help needed in updating the eng.tagged corpus #21

Open

AMR-KELEG opened this issue Jul 10, 2019 · 1 comment

AMR-KELEG commented Jul 10, 2019 (edited)

I have found that some tokens are marked as unknown (`*`) even though the compiled dictionary can analyse them.

These cases can be discovered easily, but I need help inspecting them manually.
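As a rough illustration of how such cases might be discovered, here is a minimal sketch. It assumes that `eng.tagged` holds one lexical unit per line in the form `^surface/analysis$`, that an analysis starting with `*` means "unknown", and that the closing `$` may be missing. The function name `find_inconsistent` is hypothetical and not part of any Apertium tool.

```python
import re
from collections import defaultdict

# Assumed line format: "^surface/analysis$" (the trailing "$" is optional here).
UNIT = re.compile(r"^\^(?P<surface>[^/]+)/(?P<analysis>.+?)\$?$")


def find_inconsistent(lines):
    """Return {surface: (unknown_line_numbers, analysed_line_numbers)} for
    surface forms that are marked unknown on some lines and analysed on others."""
    unknown = defaultdict(list)
    known = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        match = UNIT.match(line.strip())
        if match is None:
            continue  # not a lexical unit in the assumed format
        is_unknown = match.group("analysis").startswith("*")
        bucket = unknown if is_unknown else known
        bucket[match.group("surface")].append(number)
    # Keep only surface forms that occur in both groups.
    return {s: (unknown[s], known[s]) for s in unknown if s in known}


if __name__ == "__main__":
    # Illustrative sample, not real corpus content.
    sample = [
        "^bloody/*bloody",
        "^the/the<det><def><sp>$",
        "^bloody/bloody<adj><sint>$",
    ]
    for surface, (bad, good) in find_inconsistent(sample).items():
        print(surface, "unknown on", bad, "analysed on", good)
```

Running it over the real file would mean passing an open file handle instead of `sample`. The output only lists candidates; deciding which analysis is correct still needs manual inspection.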

The tagging does not seem to be straightforward. For example, the token bloody appears on lines 11 and 11145:
https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11
https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11145

| line | analysis |
| --- | --- |
| 11 | `^bloody/*bloody` |
| 11145 | `^bloody/bloody<adj><sint>$` |

What do you think is the best way to fix such cases?

Member

ftyers commented Jul 10, 2019

For the weighted automata project, the best approach is to ignore these errors: your code should simply discard/skip invalidly encoded words.
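A minimal sketch of what skipping could look like when reading the corpus. It assumes one lexical unit per line in the form `^surface/analysis$`, and it treats both malformed lines and analyses starting with `*` as "invalidly encoded". That interpretation, and the name `iter_valid_units`, are assumptions made for illustration rather than the project's actual code.

```python
def iter_valid_units(lines):
    """Yield (surface, analysis) pairs, silently skipping invalid lines.

    Assumption: a valid line looks like "^surface/analysis$" and its
    analysis does not start with "*" (the unknown-word marker).
    """
    for line in lines:
        unit = line.strip()
        # Skip lines that lack the "^...$" delimiters.
        if not (unit.startswith("^") and unit.endswith("$")):
            continue
        surface, sep, analysis = unit[1:-1].partition("/")
        # Skip units with no analysis or with an unknown-word analysis.
        if not sep or not analysis or analysis.startswith("*"):
            continue
        yield surface, analysis


if __name__ == "__main__":
    # Illustrative sample, not real corpus content.
    sample = [
        "^bloody/*bloody",
        "^bloody/bloody<adj><sint>$",
        "garbage",
    ]
    for surface, analysis in iter_valid_units(sample):
        print(surface, analysis)
```

Because the generator drops bad lines instead of raising, downstream counting or weighting code only ever sees well-formed, analysed tokens.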
