4432 Fixed phrase search queries with § #4877
base: main
Conversation
Force-pushed from 5684f6a to 9b37bec.
Fun, fun, fun. Sounds like you got the right solution. Yes, let's add them.
LGTM, with the other couple of punctuation marks added (and tests as well).
Thanks!
Great!
Great work, @albertisfu!
Alberto, can you spin off an infrastructure issue for Diego to do on the index changes, please?
Infrastructure issue here: This PR can be merged without risk; however, the fix won't take effect until the infrastructure issue is resolved.
This PR fixes #4432
The issue originated in the `custom_word_delimiter_filter`, which was splitting terms like `§247` into `247` and `§247`. Initially, I removed `custom_word_delimiter_filter` from `search_analyzer_exact` as suggested in #4432. However, this change caused `docketNumber` proximity queries to fail. For example, a query like `21-1234` could not match `1:21-bk-1234`. The problem occurs because this query is rewritten as `docketNumber:"21-1234"~1`. Due to the `custom_word_delimiter_filter`, the `-` is treated as a separator, so internally ES transforms this query into `docketNumber:"21 1234"~1`, meaning a proximity query between the tokens `21` and `1234` with a maximum distance of 1.

Meanwhile, due to this same filter, `1:21-bk-1234` is indexed as `["1:21-bk-1234", "1", "21", "bk", "1234"]`, allowing the query `21-1234` to match `1:21-bk-1234`.
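For anyone who wants to verify this tokenization locally, the `_analyze` API shows how a `word_delimiter` chain splits a docket number. This is a minimal sketch against a local cluster; the tokenizer and filter options below are assumptions for illustration, not the project's exact `custom_word_delimiter_filter` definition:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster

# Ask Elasticsearch to tokenize a docket number with an ad-hoc filter chain.
# The word_delimiter options are illustrative only; the real
# custom_word_delimiter_filter may be configured differently.
resp = requests.post(
    f"{ES_URL}/_analyze",
    json={
        "tokenizer": "whitespace",
        "filter": [
            {
                "type": "word_delimiter",
                "preserve_original": True,  # assumption
            }
        ],
        "text": "1:21-bk-1234",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# Expected shape, per the description above:
# ['1:21-bk-1234', '1', '21', 'bk', '1234']
```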
Removing this filter completely caused the `docketNumber` query to fail because it was internally interpreted as `docketNumber:"21-1234"~1`, which is a fuzzy query for a single term with a maximum distance of 1. The problem is that the maximum edit distance allowed in fuzzy queries is 2, and matching `1:21-bk-1234` requires more than 2 changes.

Therefore, removing `custom_word_delimiter_filter` completely is not an option, either at search or indexing time. The alternative solution for the `§` issue is to add an exception so `§` is not considered a character to split tokens on. This can be done using the `word_delimiter` `type_table` parameter and mapping `§` as an ALPHANUM character to avoid splitting.
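For illustration, a `word_delimiter` filter carrying the `type_table` exception would look roughly like the sketch below. Only the `type_table` entry reflects the fix described here; the other options and the surrounding settings layout are assumptions, not the project's actual filter definition:

```python
# Sketch of a word_delimiter filter that keeps "§247" intact.
# Only the type_table line reflects the fix described above; the other
# options are placeholders and may not match custom_word_delimiter_filter.
custom_word_delimiter_filter = {
    "type": "word_delimiter",
    "preserve_original": True,  # assumption
    # Treat § as an alphanumeric character so tokens are not split on it.
    "type_table": ["§ => ALPHANUM"],
}

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "custom_word_delimiter_filter": custom_word_delimiter_filter,
            }
        }
    }
}
```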
"§247"
will match documents containing exactly§247
and ignore documents containing only247
.However, I also tested the reverse case:
"247"
will still match currently indexed documents containing only§247
since these documents were split and indexed as["§247", "247"]
.custom_word_delimiter_filter
change won't be split on§
, but existing indexed documents would require a full re-index to fix such queries.I also noticed that non-phrase queries like
I also noticed that non-phrase queries like `§247` still don't match documents containing `§247` after the filter tweak. This was a different issue: within `cleanup_main_query`, terms are split on special characters (with some exceptions). For non-phrase queries, `§247` was converted to `§ "247"`. To fix this, I added `§` as an exception in the `cleanup_main_query` regex.
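As a rough, self-contained illustration of the idea (not the actual `cleanup_main_query` code, whose regex is not reproduced here), a tokenizing pattern that treats `§` as part of a word could look like this:

```python
import re

# Hypothetical tokenizer: split on whitespace/punctuation, but treat "§"
# as a word character so "§247" survives as a single term. This mimics the
# exception added to the cleanup_main_query regex; it is not that code.
TOKEN_RE = re.compile(r"[§\w][§\w.:-]*")

def split_query_terms(query: str) -> list[str]:
    """Return query terms, keeping section symbols attached to their numbers."""
    return TOKEN_RE.findall(query)

print(split_query_terms("§247 lorem ipsum"))     # ['§247', 'lorem', 'ipsum']
print(split_query_terms("1:21-bk-1234 motion"))  # ['1:21-bk-1234', 'motion']
```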
More exceptions to add?

Since we're relying on adding special terms as exceptions during splitting, are there other special terms common in legal documents that we should add to the exception list? Would it make sense to avoid splitting on `$` and `%`? For example, keeping `$1000` or `%10` as single terms for searching?

To apply this fix in production, the following filter modifications should be applied to each index:
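The concrete per-index settings live in the infrastructure issue rather than in this excerpt. As a general, hedged sketch of why that follow-up is needed: analysis filters can only be changed while an index is closed, and previously indexed documents keep their old tokens until a re-index. The index name below is a placeholder:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: local dev cluster
INDEX = "some_index"              # placeholder; repeat for each affected index

# Abbreviated settings body carrying the updated filter (see the
# type_table sketch above); the real definitions may include more options.
updated_analysis = {
    "analysis": {
        "filter": {
            "custom_word_delimiter_filter": {
                "type": "word_delimiter",
                "type_table": ["§ => ALPHANUM"],  # the fix described above
            }
        }
    }
}

# Analysis settings can only be changed on a closed index.
requests.post(f"{ES_URL}/{INDEX}/_close")
requests.put(f"{ES_URL}/{INDEX}/_settings", json=updated_analysis)
requests.post(f"{ES_URL}/{INDEX}/_open")

# Documents indexed before this change keep their old tokens, so a full
# re-index is still required for existing data to benefit from the fix.
```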