Can you recognize sentence or paragraph boundaries when tagging a large text field? #53

Closed
simonatdrg opened this issue Aug 31, 2016 · 2 comments

Comments

@simonatdrg

One large text field which we tag is yielding a lot of erroneous multi-word tags, due mostly to a large number of embedded newline characters. A simple (contrived) example of what we see:

I like my vitamin \n
A good time was had by all.

Since 'vitamin A' is in our tag dictionary, it will be tagged in this text if we use the standard tokenizer or the whitespace tokenizer. I've been playing around with adding a MappingCharFilter to the query analyzer, which substitutes an arbitrary non-space character for a newline (I'm using a Hebrew aleph) that can't occur in the English text or in our tag dictionary, followed by the standard tokenizer. This inserts a junk character between 'vitamin' and 'A', so no tag is found. However, this seems to be exquisitely sensitive to the presence or absence of spaces around the '\n', so I don't think it's robust enough.
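Roughly what I've been trying, as a minimal sketch expressed as a plain Lucene analyzer (in Solr this would be the equivalent charFilter/tokenizer entries in the field type; the class name and the aleph sentinel are arbitrary choices, and this doesn't cure the whitespace sensitivity mentioned above):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Query-time analyzer sketch: rewrites '\n' to a sentinel character that
 *  never occurs in the text or the tag dictionary, then tokenizes normally,
 *  so "vitamin \nA" can no longer match the dictionary entry "vitamin A". */
public class NewlineSentinelAnalyzer extends Analyzer {

  private static final NormalizeCharMap CHAR_MAP;
  static {
    NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
    b.add("\n", "\u05D0"); // Hebrew aleph as the "junk" boundary character
    CHAR_MAP = b.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters run before the tokenizer ever sees the text
    return new MappingCharFilter(CHAR_MAP, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    return new TokenStreamComponents(source);
  }
}
```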

In an ideal world, I'd like the tagger to be able to recognize a new (tagger-specific) Lucene token attribute, ENDHERE, which would signal to the FST that this token is a boundary/terminal and that it should not look beyond it when a partial tag has been discovered. Obviously one would need some way of attaching this attribute to a token (presumably by extending existing tokenizers and filters). I'm not a Lucene expert, so I have no idea whether this is even feasible, which is why I'm reaching out here.
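To make the idea concrete, here is a rough sketch of what such an attribute might look like (the names are made up, a custom tokenizer or filter would still have to set the flag, and the tagger's FST walk would of course also need to be changed to consult it, which is the part I can't judge):

```java
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

/** Hypothetical attribute marking a token as a boundary/terminal: the tagger
 *  would stop extending a partial tag past any token that has it set. */
interface EndHereAttribute extends Attribute {
  void setEndHere(boolean endHere);
  boolean isEndHere();
}

/** Default implementation, following Lucene's Attribute/AttributeImpl pattern.
 *  A custom TokenFilter would set the flag, e.g. on the last token before a newline. */
final class EndHereAttributeImpl extends AttributeImpl implements EndHereAttribute {
  private boolean endHere;

  @Override public void setEndHere(boolean endHere) { this.endHere = endHere; }
  @Override public boolean isEndHere() { return endHere; }

  @Override public void clear() { endHere = false; }

  @Override public void copyTo(AttributeImpl target) {
    ((EndHereAttribute) target).setEndHere(endHere);
  }

  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(EndHereAttribute.class, "endHere", endHere);
  }
}
```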

If all else fails I'll have to segment the text somehow upstream - there will probably be a performance hit (our workflow is all in Python), but there will be fewer constraints compared to working within the Lucene analysis framework.

Comments welcome - maybe someone has solved this problem already

@mubaldino
Member

Simon, that's an interesting and common situation, and one for which I don't see any generally useful pre-processing solution. We see line endings in all sorts of valid positions in the middle of a phrase or even a word (sentence wrap, word hyphenation + wrap, etc.).
If you really believe that in your data '\n' demarcates a valid phrase boundary 100% of the time, and you can easily split your text into lines, then that approach makes sense for your case.
In my case, I typically post-filter things that seem invalid.
However, I often prefer to keep the tagging mechanism as simple and as generic as possible, and to let the business logic of a particular app worry about the pre or post processing.
I'd certainly like to hear if David (author of STT) or others have thoughts.

-marc

@dsmiley
Member

dsmiley commented Aug 31, 2016

This is a duplicate of #25, so I'm going to close it. I don't think I have much more to add beyond what I already said there, part of which covers what Simon mentioned re: using Lucene attributes (amongst other things).
