4305 Fixed ES More Like This query #4735

albertisfu · 2024-11-26T18:04:32Z

After testing the MLT query against the production cluster, I found a couple of issues affecting it:

The MLT documentation doesn't mention it, but I found that routing is required in production to properly reference the existing document in the index.

"like":[
    {
        "_id":"o_1472349",
        "routing":1472349
    }
]

The routing corresponds to the cluster_id the opinion belongs to, so an additional DB query is required to retrieve it when performing the MLT query.
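As a rough illustration of the routing point above, here is a minimal sketch of how the `like` entries could be assembled once the cluster IDs are known. The helper name `routed_like_items` and the input shape (pairs of opinion ID and cluster ID) are assumptions made for this example, not the code in this PR.

```python
def routed_like_items(opinion_cluster_pairs):
    """Build MLT `like` entries that carry a routing value.

    `opinion_cluster_pairs` is a hypothetical iterable of
    (opinion_id, cluster_id) tuples, e.g. the result of the extra DB query.
    """
    return [
        {"_id": f"o_{opinion_id}", "routing": cluster_id}
        for opinion_id, cluster_id in opinion_cluster_pairs
    ]


# Mirrors the example entry shown above.
like = routed_like_items([(1472349, 1472349)])
```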

The fields used in the MLT query must be the exact versions. Otherwise the results are affected, and sometimes nothing matches at all. This is likely because the exact version doesn't remove duplicates or apply stemming, whereas the non-exact analysis can distort the comparison between documents.

Added the original parameters used in Solr for the MLT query:

"min_term_freq":5,
"max_query_terms":10,
"min_word_length":3,
"max_word_length":0,
"max_doc_freq":1000,
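To show where these parameters sit, here is a hedged sketch of a `more_like_this` query body. The function name `build_mlt_query` and the field name `text.exact` are placeholders for this example; the real field list and query builder in the codebase may differ.

```python
def build_mlt_query(like_items, fields):
    """Return a more_like_this query body using the Solr-era parameters."""
    return {
        "more_like_this": {
            "fields": list(fields),
            "like": list(like_items),
            "min_term_freq": 5,
            "max_query_terms": 10,
            "min_word_length": 3,
            # In Elasticsearch, 0 means no upper bound on word length.
            "max_word_length": 0,
            "max_doc_freq": 1000,
        }
    }


# "text.exact" is a hypothetical exact-version field name.
query = build_mlt_query(
    [{"_id": "o_1472349", "routing": 1472349}],
    ["text.exact"],
)
```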

Once this is merged we should clean up the MLT cache:

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
keys_mlt = r.keys("clusters-mlt-es")
if keys_mlt:
    r.delete(*keys_mlt)

Fixes: #4305


cl/opinion_page/utils.py

mlissner

LGTM. To @ERosendo for final review.

Thank you!

cl/opinion_page/utils.py

albertisfu · 2024-11-27T16:47:37Z

I’ve applied a fix related to COURTLISTENER-8QN.
The issue was that "related:" queries did not return a text snippet, causing case law search feeds involving a "related:" query to fail since the text field was missing.

To resolve this, I enabled child_highlighting in the child query associated with the MLT query.

Since the text field is considered in the MLT query and is also a highlighted field, the highlights in related documents now reflect the terms used to score the similarity.
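For readers unfamiliar with how highlights come back from a child query, the sketch below shows the general Elasticsearch shape: a `has_child` query wrapping the MLT query, with highlighting requested through `inner_hits`. This is only an assumption about what child_highlighting amounts to; the child type `opinion`, the field `text`, and the helper name are illustrative and not taken from this PR.

```python
def related_child_query(mlt_query, child_type="opinion", highlight_field="text"):
    """Wrap an MLT query in has_child and request highlights on the children.

    `child_type` and `highlight_field` are hypothetical defaults.
    """
    return {
        "has_child": {
            "type": child_type,
            "query": mlt_query,
            # inner_hits returns the matching child documents, and the
            # highlight section asks for snippets from the given field.
            "inner_hits": {
                "highlight": {"fields": {highlight_field: {}}},
            },
        }
    }


child_query = related_child_query(
    {"more_like_this": {"like": [{"_id": "o_1472349", "routing": 1472349}]}}
)
```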

ERosendo

Code looks good, but I have a concern about the new routing logic.

ERosendo · 2024-11-28T16:23:24Z

cl/lib/elasticsearch_utils.py

+    ] or [
+        {"_id": f"o_{pk}"} for pk in related_ids
+    ]  # Fall back in case IDs are not found in DB.


This fallback value creates the impression that IDs might exist in the Elasticsearch cluster but not in our database. While this method attempts to use these IDs when cluster pairs are unavailable, it raises concerns about scenarios where the related_ids list contains more IDs than matching clusters. In such cases, the unmatched IDs would be entirely ignored. Is this intended behavior, or should we consider including them without routing?

albertisfu · 2024-11-28

Yeah, I get your concern. The purpose of this fallback is only to prevent the MLT query from failing. If no documents are passed to the like parameter, the query will fail. So it is not meant to determine whether any of the IDs that were not found in the DB might return results. In the scenario where none of the opinions are found in the DB (for example, if the user provided wrong opinion IDs), we simply pass the original IDs to the query to keep it from failing.
Since routing is required, and considering that those opinion IDs don't exist in ES, even the fallback list of IDs will return no results. In this case, I think returning no results is better than displaying an error message due to a failed query.

That's why, for instance, if the user provides two IDs and only one of them is found in the database, it's better to search for that ID alone and ignore the other one, as it won't be found anyway due to the lack of routing.

I've improved the comment here to better explain its purpose.

Let me know what you think.

ERosendo · 2024-11-28

Got it! Thanks for the explanation. The new comment looks good.
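The behavior discussed in this thread can be summarized with a small sketch. It is not the actual `build_more_like_this_query` code: the function name and the `cluster_by_opinion` mapping (opinion ID to cluster ID, standing in for the DB lookup) are assumptions for illustration.

```python
def like_items_with_fallback(related_ids, cluster_by_opinion):
    """Return routed `like` entries, or unrouted ones if nothing was found.

    `cluster_by_opinion` is a hypothetical dict mapping opinion IDs found
    in the DB to their cluster IDs.
    """
    routed = [
        {"_id": f"o_{pk}", "routing": cluster_by_opinion[pk]}
        for pk in related_ids
        if pk in cluster_by_opinion
    ]
    # An empty `like` list makes the MLT query fail, so fall back to the
    # raw IDs. Without routing they are expected to match nothing.
    return routed or [{"_id": f"o_{pk}"} for pk in related_ids]


partial = like_items_with_fallback([1, 2], {1: 10})  # only ID 1 is in the DB
missing = like_items_with_fallback([1, 2], {})       # no IDs are in the DB
```

With one of two IDs found, only the routed entry is searched; with none found, the unrouted IDs keep the query valid and it simply returns no results.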

albertisfu added 2 commits November 26, 2024 11:11

fix(elasticsearch): Fixed ES MLT query

2741121

Fixes: #4305

fix(elasticsearch): Added a fallback to the MLT query in case the IDs…

d8b72b0

… are not found in the DB

albertisfu requested a review from mlissner November 26, 2024 18:04

semgrep-app bot reviewed Nov 26, 2024


cl/opinion_page/utils.py (resolved)

Merge branch 'main' into 4305-fix-es-mlt-query

2274a54

mlissner approved these changes Nov 26, 2024


cl/opinion_page/utils.py (outdated, resolved)

mlissner assigned ERosendo Nov 26, 2024

mlissner requested a review from ERosendo November 26, 2024 18:59

albertisfu added 3 commits November 26, 2024 13:17

fix(elasticsearch): Removed stray print

00885f3

Merge branch 'main' into 4305-fix-es-mlt-query

1d8d462

fix(elasticsearch): Enabled child highlighting for the related: query

fc3a2c7

ERosendo approved these changes Nov 28, 2024


ERosendo assigned albertisfu and unassigned ERosendo Nov 28, 2024

albertisfu added 2 commits November 28, 2024 11:39

fix(elasticsearch): Improved comment in build_more_like_this_query

62bdf18

Merge branch 'main' into 4305-fix-es-mlt-query

5ae668a

albertisfu merged commit 67b1fc2 into main Nov 28, 2024
15 checks passed

albertisfu deleted the 4305-fix-es-mlt-query branch November 28, 2024 18:58
