-
-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
4438 Improved MLT and "Cited by" queries on the Opinion page. #4446
Conversation
…inions page - Make queries run in parallel - Simplify the MLT query - Add a custom timeout Fixes: #4438
…ns clusters - Show unable to retrieve related cluster on query timeout
{{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %} | ||
{% endfor %} | ||
{% else %} | ||
{{ related_cluster.caption|safe|v_wrapper }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.<span class="gray alt">related to</span> | ||
{% flag "o-es-active" %} | ||
{% for cluster in related_cluster %} | ||
{{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
Ignore this finding from template-unescaped-with-safe.
cl/opinion_page/utils.py
Outdated
related_search_query = related_search_query.extra( | ||
size=settings.RELATED_COUNT, track_total_hits=False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
cluster_related_query.sort({"_score": {"order": "desc"}}) | ||
.source(includes=["absolute_url", "caseName", "cluster_id"]) | ||
.extra(size=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
… on query timeout
search_query = ( | ||
cluster_cites_query.sort({"citeCount": {"order": "desc"}}) | ||
.source(includes=["absolute_url", "caseName", "dateFiled"]) | ||
.extra(size=5, track_total_hits=True) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
Looks great at a skim. I'll let Eduardo do the full review. Thank you for finding so many things to fix on this. Jeesh! |
cl/opinion_page/utils.py
Outdated
) -> tuple[ | ||
list[OpinionClusterDocument], | ||
list[int], | ||
dict[str, str], | ||
list[OpinionClusterDocument], | ||
int, | ||
bool, | ||
]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that using a data class instead of a tuple can make this code easier to read and understand. Tuples can get confusing when you have a lot of things in them, especially if you need to remember the right order. If you need to add something new to the tuple later, it can be a pain to change everything else. With a data class, it’s much simpler and less error-prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the dataclass is a good fit here. I've added it (RelatedCitingResults)
cl/opinion_page/utils.py
Outdated
from cl.lib.elasticsearch_utils import do_count_query | ||
from cl.lib.bot_detector import is_bot | ||
from cl.lib.elasticsearch_utils import ( | ||
build_es_main_query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like build_es_main_query
might be unused in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
cl/opinion_page/utils.py
Outdated
try: | ||
# Execute the MultiSearch request as needed based on available | ||
# cached results | ||
multi_search = MultiSearch() | ||
response_index = 0 | ||
related_index = citing_index = None | ||
if related_search_query: | ||
multi_search = multi_search.add(related_search_query) | ||
related_index = response_index | ||
response_index += 1 | ||
if cited_search_query: | ||
multi_search = multi_search.add(cited_search_query) | ||
citing_index = response_index | ||
multi_search.params( | ||
timeout=f"{settings.ELASTICSEARCH_FAST_QUERIES_TIMEOUT}s" | ||
) | ||
responses = multi_search.execute() if multi_search._searches else [] | ||
related_clusters: list[OpinionClusterDocument] = ( | ||
list(responses[related_index]) | ||
if related_index is not None | ||
else cached_related_clusters or [] | ||
) | ||
citing_clusters: list[OpinionClusterDocument] = ( | ||
list(responses[citing_index]) | ||
if citing_index is not None | ||
else cached_citing_results or [] | ||
) | ||
citing_cluster_count: int = ( | ||
responses[citing_index].hits.total.value | ||
if citing_index is not None | ||
else cached_citing_cluster_count or 0 | ||
) | ||
timeout_related = False if related_clusters else timeout_related | ||
timeout_cited = False if citing_clusters else timeout_cited |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially simplify the code within the try block by moving some logic outside of it. For example, we might be able to compute related_clusters
, citing_clusters
, citing_cluster_count
, and timeouts
after the try statement (using a dataclass
could be helpful for this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, using the dataclass helped to also simplify the code here. Refactor applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks good. I think we can merge it once my comments have been addressed and any conflicts resolved.
…sters_with_cache - Added test cases for timeout and connection errors when getting mlt and citing clusters
related_query = related_query.extra( | ||
size=settings.RELATED_COUNT, track_total_hits=False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob')
.
@ERosendo Thank you for your review and suggestions! Also added additional tests to verify the expected behaviour on Connection timeouts and Connection errors. |
LGTM 🚀 |
Thank you both! |
Suspect IssuesThis pull request was deployed and Sentry observed the following issues:
Did you find this useful? React with a 👍 or 👎 |
This PR addresses #4438 and other related issues:
es_get_citing_and_related_clusters_with_cache
, which retrieves related and/or cited clusters from the cache or ES independently.ELASTICSEARCH_FAST_QUERIES_TIMEOUT
setting, which defaults to 2 seconds.Time out:
Available results:
This query is faster because it avoids the
has_child
query. Instead, a plain query is used to return opinions, which should be quicker than a join query. Additionally, this query excludes opinions related to the current cluster and uses a collapse bycluster_id
to ensure only one opinion per cluster is returned, preventing duplicate clusters in the results. The query only returns the necessary fields for displaying results on the Opinion page.Although this query matches related opinions during testing, further tuning of the MLT query parameters in production may be required, as planned in 'Related Case Law' section is not being shown on the Opinion page when ES is enabled. #4305.
The MLT query for the Search frontend still uses the
has_child
query because nested opinions are needed for the frontend results. However, highlighting and other unnecessary clauses have been removed, fields have been simplified and the related cluster or clusters in the query are excluded from results.Additional issues fixed in this PR:
related:
query in the search frontend, such as:For instance: https://www.courtlistener.com/?q=related:1247437,9581751&stat_Precedential=on
This query returned a MultipleObjectsReturned error:
https://freelawproject.sentry.io/issues/5835257763/
This issue has been resolved by ensuring that if multiple sub-opinions belonging to the same cluster are queried, only one
OpinionCluster
is returned. If the opinions belong to different clusters, multiple clusters are returned.In that case, all related clusters are shown in the frontend: