4432 Fixed phrase search queries with § #4877

Open · albertisfu wants to merge 5 commits into main from 4432-fix-search-query-with-section-mark
Conversation

@albertisfu (Contributor) commented Jan 1, 2025

This PR fixes #4432

The issue originated in the custom_word_delimiter_filter, which was splitting terms like §247 into 247 and §247. Initially, I removed custom_word_delimiter_filter from search_analyzer_exact as suggested in #4432.

However, this change caused docketNumber proximity queries to fail. For example, a query like 21-1234 could no longer match 1:21-bk-1234. Normally this query is rewritten as docketNumber:"21-1234"~1, and thanks to the custom_word_delimiter_filter the - is treated as a separator, so internally ES transforms the query into docketNumber:"21 1234"~1, i.e., a proximity query between the tokens 21 and 1234 with a maximum distance of 1.
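
As a side note, this rewriting can be inspected with the validate API, which prints the Lucene query ES actually builds. This is only a sketch: {index_name} stands for any of our indices that has a docketNumber field, and the exact explanation string depends on the ES version and analyzer chain.

GET https://localhost:9200/{index_name}/_validate/query?explain=true

{
   "query":{
      "query_string":{
         "query":"docketNumber:\"21-1234\"~1"
      }
   }
}

With the custom_word_delimiter_filter in place, the explanation should show something close to docketNumber:"21 1234"~1.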

Meanwhile, due to this same filter, 1:21-bk-1234 is indexed as ["1:21-bk-1234", "1", "21", "bk", "1234"], allowing the query 21-1234 to match 1:21-bk-1234.
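
This tokenization can be reproduced with the _analyze API. The request below is only an approximation (it uses a bare whitespace tokenizer plus a word_delimiter filter with the same options, while the real analyzers include additional filters), but it shows the token lists described above:

POST https://localhost:9200/_analyze

{
   "tokenizer":"whitespace",
   "filter":[
      {
         "type":"word_delimiter",
         "split_on_numerics":false,
         "preserve_original":true
      }
   ],
   "text":["21-1234", "1:21-bk-1234"]
}

The first text yields the tokens 21-1234, 21, and 1234; the second yields 1:21-bk-1234, 1, 21, bk, and 1234, which is why the proximity query on 21 and 1234 can match the full docket number.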

Removing this filter completely caused the docketNumber query to fail because it was internally interpreted as docketNumber:"21-1234"~1, a fuzzy query over a single term with a maximum edit distance of 1. The problem is that fuzzy queries allow an edit distance of at most 2, and turning 21-1234 into 1:21-bk-1234 requires more than 2 edits.

Therefore, removing custom_word_delimiter_filter completely is not an option, either at search or indexing time. The alternative solution for the § issue is to add an exception so § is not considered a character to split tokens over. This can be done using the word_delimiter type_table parameter and mapping § as an Alphanum character to avoid splitting.
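
To illustrate the effect of the type_table mapping, the same kind of _analyze request can be run with the exception in place (again an approximation with a bare whitespace tokenizer; the real analyzer chains differ):

POST https://localhost:9200/_analyze

{
   "tokenizer":"whitespace",
   "filter":[
      {
         "type":"word_delimiter",
         "type_table":["§ => ALPHANUM"],
         "split_on_numerics":false,
         "preserve_original":true
      }
   ],
   "text":"§247"
}

Without the type_table entry this produces the tokens §247 and 247; with it, §247 is kept as a single token.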

This change resolves the issue described in #4432 at search time. Queries like "§247" will match documents containing exactly §247 and ignore documents containing only 247.

However, I also tested the reverse case:

  • A query for "247" will still match currently indexed documents containing only §247 since these documents were split and indexed as ["§247", "247"].
  • New documents indexed after applying the custom_word_delimiter_filter change won't be split on §, but existing indexed documents would require a full re-index to fix such queries.

I also noticed that non-phrase queries like §247 still don't match documents containing §247 after the filter tweak. This was a different issue: within cleanup_main_query, terms are split on special characters (with some exceptions). For non-phrase queries, §247 was converted to § "247". To fix this, I added § as an exception in the cleanup_main_query regex.

More Exceptions to add?

  • Since we're relying on adding special terms as exceptions during splitting, are there other special terms common in legal documents that we should add to the exception list?

  • Would it make sense to avoid splitting on $ and %? For example, keeping $1000 or %10 as single terms for searching?

To apply this fix in production, the following filter modifications should be applied to each index:

  • opinion_index
  • oral_arguments_percolator_vectors
  • oral_arguments_vectors
  • people_vectors
  • recap_percolator
  • recap_vectors
POST https://localhost:9200/{index_name}/_close

PUT https://localhost:9200/{index_name}/_settings

{
   "settings":{
      "analysis":{
         "filter":{
            "custom_word_delimiter_filter":{
               "type":"word_delimiter",
               "type_table":[
                  "§ => ALPHANUM",
                  "$ => ALPHANUM",
                  "% => ALPHANUM",
                  "¶ => ALPHANUM"
               ],
               "split_on_numerics":false,
               "preserve_original":true
            }
         }
      }
   }
}

POST https://localhost:9200/{index_name}/_open
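
After reopening each index, the updated filter can be spot-checked by name with _analyze (sketch only; it uses a whitespace tokenizer rather than the index's real tokenizer, but it is enough to confirm the filter settings took effect):

POST https://localhost:9200/{index_name}/_analyze

{
   "tokenizer":"whitespace",
   "filter":["custom_word_delimiter_filter"],
   "text":["§247", "$1000", "¶12", "1:21-bk-1234"]
}

§247, $1000, and ¶12 should each come back as a single token, while 1:21-bk-1234 should still be split into 1:21-bk-1234, 1, 21, bk, and 1234.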

@albertisfu force-pushed the 4432-fix-search-query-with-section-mark branch from 5684f6a to 9b37bec on January 1, 2025 at 18:25
@albertisfu marked this pull request as ready for review on January 1, 2025 at 20:12
@albertisfu requested a review from mlissner on January 1, 2025 at 20:12
@mlissner (Member) commented Jan 1, 2025

Fun, fun, fun. Sounds like you got the right solution.

Yes, let's add $ and % to the list too. I think the other one to add might be the pilcrow (¶). Thank you!

@mlissner (Member) left a comment


LGTM, with the other couple of punctuation marks added (and tests as well).

Thanks!

@albertisfu (Contributor, Author)

Great!
Added $, %, and ¶ as exceptions as well.
Tests have been updated.

@ERosendo (Contributor) left a comment


Great work, @albertisfu!

@mlissner (Member) commented Jan 3, 2025

Alberto, can you spin off an infrastructure issue for Diego to handle the index changes, please?

@albertisfu (Contributor, Author)
Infrastructure Issue here:
https://github.com/freelawproject/infrastructure/issues/221

This PR can be merged without risk; however, the fix won't take effect until the infrastructure issue is resolved.

Labels: None yet
Project status: In progress
Successfully merging this pull request may close: Can't search with "§" in query (#4432)
3 participants