-
-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
4305 Fixed ES More Like This query #4735
Conversation
… are not found in the DB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. To @ERosendo for final review.
Thank you!
I’ve applied a fix related to COURTLISTENER-8QN. To resolve this, I enabled Since the text field is considered in the MLT query and is also a highlighted field, the highlights in related documents now reflect the terms used to score the similarity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, but I have a concern about the new routing logic.
cl/lib/elasticsearch_utils.py
Outdated
] or [ | ||
{"_id": f"o_{pk}"} for pk in related_ids | ||
] # Fall back in case IDs are not found in DB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fallback value creates the impression that IDs
might exist in the Elasticsearch cluster but not in our database. While this method attempts to use these IDs when cluster pairs are unavailable, it raises concerns about scenarios where the related_ids
list contains more IDs
than matching clusters. In such cases, the unmatched IDs would be entirely ignored. Is this intended behavior, or should we consider including them without routing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I got it your concern. The purpose of this fallback is only to prevent the MLT query from failing. If no documents are passed to the like
parameter, the query will fail. So this is not meant to determine if any of the IDs that were not found in DB might return results. In the scenario where none of the opinions are found in the DB (if the user provided a wrong Opinion IDs), we simply pass the original IDs to the query to avoid the query to fail.
Since routing is required, and considering that the opinions IDs don’t exist in ES, even using the fallback list of IDs will return no results. In this case, I think that's better (returning no results) than display an error message due to a failed query.
That’s why, for instance, if the user provides two IDs and only one of them is found in the database, it’s better to search for that ID alone and ignore the other one, as it won’t be found anyway due to the lack of routing.
I've improved the comment here to better explain its purpose.
Let me know what do you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it! thanks for the explanation. new comment looks good
After investigating the issue with the MLT query using the production cluster to test the queries, I found a couple of issues affecting this query:
routing
is required in production to properly reference the existing document in the index.The routing corresponds to the
cluster_id
the opinion belongs to so an additional DB query is required to retrieve it when performing the MLT query.The fields used in the MLT query must be the
exact
version. Otherwise, they affect the query, and sometimes they don't match results at all. This is likely because the exact version doesn't remove duplicates or apply stemming, which can lead to improper analysis when comparing documents.Added the original parameters used in Solr for the MLT query:
Once this is merged we should clean up the MLT cache:
Fixes: #4305