Can you recognize sentence or paragraph boundaries when tagging a large text field? #53

Closed
simonatdrg opened this issue Aug 31, 2016 · 2 comments

Comments

@simonatdrg

One large text field which we tag is yielding a lot of erroneous multi-word tags, due mostly to a large number of embedded newline characters. A simple (contrived) example of what we see:

I like my vitamin \n
A good time was had by all.

Since 'vitamin A' is in our tag dictionary, it will be tagged in this text if we use the standard tokenizer or the whitespace tokenizer. I've been playing around with adding a MappingCharFilter to the query analyzer, which substitutes an arbitrary non-space character for a newline (I'm using a Hebrew aleph) that can't occur in the English text or in our tag dictionary, followed by the standard tokenizer. This inserts a junk character between 'vitamin' and 'A', so no tag is found. However, this seems to be exquisitely sensitive to the presence or absence of spaces around the '\n', so I don't think it's robust enough.
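Roughly what I've been trying, as a minimal sketch expressed as a plain Lucene analyzer (in Solr this would be the equivalent charFilter/tokenizer entries in the field type; the class name and the aleph sentinel are arbitrary choices, and this doesn't cure the whitespace sensitivity mentioned above):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Query-time analyzer sketch: rewrites '\n' to a sentinel character that
 *  never occurs in the text or the tag dictionary, then tokenizes normally,
 *  so "vitamin \nA" can no longer match the dictionary entry "vitamin A". */
public class NewlineSentinelAnalyzer extends Analyzer {

  private static final NormalizeCharMap CHAR_MAP;
  static {
    NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
    b.add("\n", "\u05D0"); // Hebrew aleph as the "junk" boundary character
    CHAR_MAP = b.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Char filters run before the tokenizer ever sees the text
    return new MappingCharFilter(CHAR_MAP, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    return new TokenStreamComponents(source);
  }
}
```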

In an ideal world, I'd like the tagger to be able to recognize a new (tagger-specific) Lucene token attribute, ENDHERE, which would signal to the FST that this token is a boundary/terminal and that it should not look beyond it when a partial tag has been discovered. Obviously one would need some way of attaching this attribute to a token (presumably by extending existing tokenizers and filters). I'm not a Lucene expert, so I have no idea whether this is even feasible, which is why I'm reaching out here.
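To make the idea concrete, here is a rough sketch of what such an attribute might look like (the names are made up, a custom tokenizer or filter would still have to set the flag, and the tagger's FST walk would of course also need to be changed to consult it, which is the part I can't judge):

```java
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

/** Hypothetical attribute marking a token as a boundary/terminal: the tagger
 *  would stop extending a partial tag past any token that has it set. */
interface EndHereAttribute extends Attribute {
  void setEndHere(boolean endHere);
  boolean isEndHere();
}

/** Default implementation, following Lucene's Attribute/AttributeImpl pattern.
 *  A custom TokenFilter would set the flag, e.g. on the last token before a newline. */
final class EndHereAttributeImpl extends AttributeImpl implements EndHereAttribute {
  private boolean endHere;

  @Override public void setEndHere(boolean endHere) { this.endHere = endHere; }
  @Override public boolean isEndHere() { return endHere; }

  @Override public void clear() { endHere = false; }

  @Override public void copyTo(AttributeImpl target) {
    ((EndHereAttribute) target).setEndHere(endHere);
  }

  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(EndHereAttribute.class, "endHere", endHere);
  }
}
```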

If all else fails I'll have to segment the text somehow upstream - there will probably be a performance hit (our workflow is all in Python), but there will be fewer constraints compared to working within the Lucene analysis framework.

Comments welcome - maybe someone has solved this problem already

@mubaldino
Member

Simon, that's an interesting and common situation, and one for which I don't see any generally useful pre-processing solution. We see line endings in all sorts of valid positions in the middle of a phrase or even a word (sentence wrap, word hyphenation + wrap, etc.).
If you really believe that in your data '\n' demarcates a valid phrase boundary 100% of the time, and you can easily split your text into lines, then that approach makes sense for your case.
In my case, I typically post-filter things that seem invalid.
However, I often prefer to keep the tagging mechanism as simple and as generic as possible, and to let the business logic of a particular app worry about the pre or post processing.
I'd certainly like to hear if David (author of STT) or others have thoughts.

-marc

@dsmiley
Member

dsmiley commented Aug 31, 2016

This is a duplicate of #25, so I'm going to close it. I don't think I have much more to add beyond what I already said there, part of which covers what Simon mentioned re: using Lucene attributes (amongst other things).
