Help needed in updating the eng.tagged corpus #21

Open
AMR-KELEG opened this issue Jul 10, 2019 · 1 comment


AMR-KELEG commented Jul 10, 2019

I have found that some tokens are tagged as unknown (`*`) even though the compiled dictionary can analyse them.

These cases can be discovered easily, but I need help inspecting them manually.

Fixing the tagging does not seem straightforward. For example, the token bloody appears on lines 11 and 11145,
where it is tagged inconsistently:
https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11
https://github.com/apertium/apertium-eng/blob/master/texts/eng.tagged#L11145

| line | analysis |
| --- | --- |
| 11 | `^bloody/*bloody` |
| 11145 | `^bloody/bloody<adj><sint>$` |

What do you think is the best way to fix such cases?
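
To make "discovered easily" concrete, here is a minimal sketch of how such cases could be listed. It assumes each line of the corpus holds one lexical unit shaped like `^surface/analysis$`, as in the two lines quoted above, and that an analysis starting with `*` means "unknown". The function name `find_inconsistent` and the regular expression are illustrative, not part of any Apertium tool.

```python
import re
from collections import defaultdict

# Assumed shape of one lexical unit: ^surface/analysis$
# The closing $ is treated as optional because line 11 quoted above lacks it.
UNIT = re.compile(r"^\^([^/]+)/(.+?)\$?$")


def find_inconsistent(lines):
    """Return surfaces tagged as unknown on some lines and analysed on others.

    The result maps each such surface to the 1-based line numbers of its
    unknown and known occurrences.
    """
    seen = defaultdict(lambda: {"unknown": [], "known": []})
    for number, line in enumerate(lines, start=1):
        match = UNIT.match(line.strip())
        if match is None:
            continue  # not a lexical unit in the assumed format
        surface, analysis = match.groups()
        kind = "unknown" if analysis.startswith("*") else "known"
        seen[surface][kind].append(number)
    # Keep only surfaces that occur with both kinds of tagging.
    return {s: v for s, v in seen.items() if v["unknown"] and v["known"]}
```

Calling it on the lines of `texts/eng.tagged`, for instance `find_inconsistent(open("texts/eng.tagged", encoding="utf-8"))`, would give a list of candidates for manual inspection. It does not decide which tagging is the correct one.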

Member

ftyers commented Jul 10, 2019

For the weighted automata project, the best way is to just ignore these errors. Your code should just discard/skip invalidly encoded words.
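
A rough sketch of what skipping could look like, under the assumption that "invalidly encoded" covers both malformed lines and units whose analysis carries the unknown-word marker `*`. The one-unit-per-line format `^surface/analysis$` is inferred from the lines quoted in this issue, and `valid_units` is a hypothetical helper name.

```python
import re

# Strict pattern: the unit must be fully delimited (^ ... $) and its
# analysis must not start with the unknown-word marker '*'.
VALID_UNIT = re.compile(r"^\^([^/]+)/([^*].*)\$$")


def valid_units(lines):
    """Yield (surface, analysis) pairs, silently skipping everything else."""
    for line in lines:
        match = VALID_UNIT.match(line.strip())
        if match:
            yield match.group(1), match.group(2)
```

With this filter, `^bloody/*bloody` from line 11 is dropped, while `^bloody/bloody<adj><sint>$` from line 11145 is kept as `("bloody", "bloody<adj><sint>")`.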
