Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4305 Fixed ES More Like This query #4735

Merged
merged 8 commits into from
Nov 28, 2024
Merged

4305 Fixed ES More Like This query #4735

merged 8 commits into from
Nov 28, 2024

Conversation

albertisfu
Copy link
Contributor

After investigating the issue with the MLT query using the production cluster to test the queries, I found a couple of issues affecting this query:

  • The MLT documentation doesn't mention it, but I found that routing is required in production to properly reference the existing document in the index.
"like":[
      {
         "_id":"o_1472349",
         "routing":1472349
      }
  ]

The routing corresponds to the cluster_id the opinion belongs to so an additional DB query is required to retrieve it when performing the MLT query.

  • The fields used in the MLT query must be the exact version. Otherwise, they affect the query, and sometimes they don't match results at all. This is likely because the exact version doesn't remove duplicates or apply stemming, which can lead to improper analysis when comparing documents.

  • Added the original parameters used in Solr for the MLT query:

"min_term_freq":5,
 "max_query_terms":10,
 "min_word_length":3,
 "max_word_length":0,
 "max_doc_freq":1000,

Once this is merged we should clean up the MLT cache:

from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
keys_mlt = r.keys("clusters-mlt-es")
if keys_mlt:
    r.delete(*keys_mlt)

Fixes: #4305

@albertisfu albertisfu requested a review from mlissner November 26, 2024 18:04
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. To @ERosendo for final review.

Thank you!

cl/opinion_page/utils.py Outdated Show resolved Hide resolved
@albertisfu
Copy link
Contributor Author

I’ve applied a fix related to COURTLISTENER-8QN.
The issue was that "related:" queries did not return a text snippet, causing case law search feeds involving a "related:" query to fail since the text field was missing.

To resolve this, I enabled child_highlighting in the child query associated with the MLT query.

Since the text field is considered in the MLT query and is also a highlighted field, the highlights in related documents now reflect the terms used to score the similarity.

Screenshot 2024-11-27 at 10 45 04 a m

Copy link
Contributor

@ERosendo ERosendo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, but I have a concern about the new routing logic.

Comment on lines 197 to 199
] or [
{"_id": f"o_{pk}"} for pk in related_ids
] # Fall back in case IDs are not found in DB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fallback value creates the impression that IDs might exist in the Elasticsearch cluster but not in our database. While this method attempts to use these IDs when cluster pairs are unavailable, it raises concerns about scenarios where the related_ids list contains more IDs than matching clusters. In such cases, the unmatched IDs would be entirely ignored. Is this intended behavior, or should we consider including them without routing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I got it your concern. The purpose of this fallback is only to prevent the MLT query from failing. If no documents are passed to the like parameter, the query will fail. So this is not meant to determine if any of the IDs that were not found in DB might return results. In the scenario where none of the opinions are found in the DB (if the user provided a wrong Opinion IDs), we simply pass the original IDs to the query to avoid the query to fail.
Since routing is required, and considering that the opinions IDs don’t exist in ES, even using the fallback list of IDs will return no results. In this case, I think that's better (returning no results) than display an error message due to a failed query.

That’s why, for instance, if the user provides two IDs and only one of them is found in the database, it’s better to search for that ID alone and ignore the other one, as it won’t be found anyway due to the lack of routing.

I've improved the comment here to better explain its purpose.

Let me know what do you think.

Copy link
Contributor

@ERosendo ERosendo Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! thanks for the explanation. new comment looks good :shipit:

@ERosendo ERosendo assigned albertisfu and unassigned ERosendo Nov 28, 2024
@albertisfu albertisfu merged commit 67b1fc2 into main Nov 28, 2024
15 checks passed
@albertisfu albertisfu deleted the 4305-fix-es-mlt-query branch November 28, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

'Related Case Law' section is not being shown on the Opinion page when ES is enabled.
3 participants