4438 Improved MLT and "Cited by" queries on the Opinion page. #4446

Merged
mlissner merged 7 commits into main from 4438-improve-opinions-mlt-query
Sep 19, 2024

Contributor

@albertisfu albertisfu commented Sep 12, 2024

This PR addresses #4438 and other related issues:

  • Simplified the MLT query to use only the following fields to match related opinions:
    • "procedural_history"
    • "posture"
    • "syllabus"
    • "text"
  • Removed highlighting for the MLT query.
  • The MLT query and the "Cited by" query are now executed in parallel using the Multi-Search API. This required a full refactor of the previous methods. The two queries executed on the Opinion page are now handled by a single method, es_get_citing_and_related_clusters_with_cache, which retrieves related and/or cited clusters from the cache or ES independently.
  • Added a timeout for the Multi-Search query, defined by the ELASTICSEARCH_FAST_QUERIES_TIMEOUT setting, which defaults to 2 seconds.
  • If the query times out and no cached related or "cited by" clusters are available, a message is displayed on the Opinion page:

Time out:
Screenshot 2024-09-12 at 2 12 15 p m

Available results:
Screenshot 2024-09-12 at 2 10 30 p m

  • Improved the MLT query structure to make it more efficient on the Opinion Page, where nested opinions are not required. The query is as follows:
{
   "query":{
      "bool":{
         "filter":[
            {
               "match":{
                  "cluster_child":"opinion"
               }
            }
         ],
         "should":[
            {
               "bool":{
                  "must":[
                     {
                        "more_like_this":{
                           "fields":[
                              "procedural_history",
                              "posture",
                              "syllabus",
                              "text"
                           ],
                           "like":[
                              {
                                 "_id":"o_9883076"
                              },
                              {
                                 "_id":"o_10579993"
                              },
                              {
                                 "_id":"o_9883077"
                              },
                              {
                                 "_id":"o_95500"
                              }
                           ],
                           "min_term_freq":1,
                           "max_query_terms":12
                        }
                     }
                  ],
                  "must_not":[
                     {
                        "terms":{
                           "cluster_id":[
                              95500
                           ]
                        }
                     }
                  ]
               }
            }
         ],
         "minimum_should_match":1
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "collapse":{
      "field":"cluster_id"
   },
   "size":5,
   "track_total_hits":false,
   "_source":{
      "includes":[
         "absolute_url",
         "caseName",
         "cluster_id"
      ]
   }
}
  • This query should be faster because it avoids the has_child join query and instead uses a plain query that returns opinions directly. It also excludes opinions that belong to the current cluster and collapses on cluster_id so that only one opinion per cluster is returned, which prevents duplicate clusters in the results. Only the fields needed to display results on the Opinion page are returned.

    Although this query matched related opinions during testing, the MLT query parameters may need further tuning in production, as planned in #4305 ('Related Case Law' section is not being shown on the Opinion page when ES is enabled).
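
A minimal sketch of how the query body above could be assembled from a list of sub-opinion IDs. The helper name `build_related_opinions_query` and its parameters are hypothetical and not taken from this PR; only the resulting structure mirrors the JSON shown above.

```python
from typing import Any


def build_related_opinions_query(
    opinion_ids: list[int], cluster_ids: list[int], size: int = 5
) -> dict[str, Any]:
    """Build a flat more_like_this query body (hypothetical helper)."""
    mlt = {
        "more_like_this": {
            "fields": ["procedural_history", "posture", "syllabus", "text"],
            # Opinion documents are referenced by an "o_"-prefixed _id.
            "like": [{"_id": f"o_{pk}"} for pk in opinion_ids],
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
    return {
        "query": {
            "bool": {
                # Plain filter on the child type instead of a has_child join.
                "filter": [{"match": {"cluster_child": "opinion"}}],
                "should": [
                    {
                        "bool": {
                            "must": [mlt],
                            # Exclude opinions of the cluster being viewed.
                            "must_not": [{"terms": {"cluster_id": cluster_ids}}],
                        }
                    }
                ],
                "minimum_should_match": 1,
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
        # One hit per cluster, so no duplicate clusters are shown.
        "collapse": {"field": "cluster_id"},
        "size": size,
        "track_total_hits": False,
        "_source": {"includes": ["absolute_url", "caseName", "cluster_id"]},
    }


body = build_related_opinions_query([9883076, 10579993, 9883077, 95500], [95500])
```

Building the body as a plain dict keeps the sketch independent of any client library; the PR itself composes the query with the project's Elasticsearch query objects.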

  • The MLT query for the Search frontend still uses the has_child query because nested opinions are needed for the frontend results. However, highlighting and other unnecessary clauses have been removed, the fields have been simplified, and the related cluster or clusters named in the query are excluded from the results.

  • Additional issues fixed in this PR:

This issue has been resolved by ensuring that if multiple sub-opinions belonging to the same cluster are queried, only one OpinionCluster is returned. If the opinions belong to different clusters, multiple clusters are returned.

In that case, all related clusters are shown in the frontend:
Screenshot 2024-09-12 at 3 19 15 p m
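
The rule described above (one cluster per group of sub-opinions, several clusters when the opinions span different clusters) can be illustrated with a small sketch. The function name and the IDs below are made up for illustration and are not taken from the PR.

```python
def unique_cluster_ids(opinion_to_cluster: dict[int, int]) -> list[int]:
    """Return each cluster ID once, in first-seen order (hypothetical helper)."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(opinion_to_cluster.values()))


# Three sub-opinions of the same cluster resolve to a single cluster.
same_cluster = unique_cluster_ids({1: 10, 2: 10, 3: 10})
# Opinions from different clusters resolve to several clusters.
mixed_clusters = unique_cluster_ids({1: 10, 2: 20, 3: 10})
```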

…inions page

- Make queries run in parallel
- Simplify the MLT query
- Add a custom timeout

Fixes: #4438
…ns clusters

- Show unable to retrieve related cluster on query timeout
@albertisfu albertisfu changed the title 4438 Improved MLT and 'Cited by' queries on the Opinion page. 4438 Improved MLT and "Cited by" queries on the Opinion page. Sep 12, 2024
{{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %}
{% endfor %}
{% else %}
{{ related_cluster.caption|safe|v_wrapper }}

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

<span class="gray alt">related to</span>
{% flag "o-es-active" %}
{% for cluster in related_cluster %}
{{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %}

Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.

Ignore this finding from template-unescaped-with-safe.

Comment on lines 272 to 274
related_search_query = related_search_query.extra(
size=settings.RELATED_COUNT, track_total_hits=False
)
@semgrep-app semgrep-app bot Sep 12, 2024

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

Comment on lines +194 to +196
cluster_related_query.sort({"_score": {"order": "desc"}})
.source(includes=["absolute_url", "caseName", "cluster_id"])
.extra(size=5)

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

Comment on lines +158 to +162
search_query = (
cluster_cites_query.sort({"citeCount": {"order": "desc"}})
.source(includes=["absolute_url", "caseName", "dateFiled"])
.extra(size=5, track_total_hits=True)
)

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu albertisfu marked this pull request as ready for review September 12, 2024 20:26
@albertisfu albertisfu requested a review from mlissner September 12, 2024 20:26
@mlissner
Member

Looks great at a skim. I'll let Eduardo do the full review. Thank you for finding so many things to fix on this. Jeesh!

Comment on lines 205 to 212
) -> tuple[
list[OpinionClusterDocument],
list[int],
dict[str, str],
list[OpinionClusterDocument],
int,
bool,
]:
Contributor

I believe that using a data class instead of a tuple can make this code easier to read and understand. Tuples can get confusing when you have a lot of things in them, especially if you need to remember the right order. If you need to add something new to the tuple later, it can be a pain to change everything else. With a data class, it’s much simpler and less error-prone.

Contributor Author

Yeah the dataclass is a good fit here. I've added it (RelatedCitingResults)
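
For readers following the thread, a minimal sketch of what a dataclass like RelatedCitingResults could look like. Only the class name comes from the comment above; the field names, defaults, and the use of plain lists in place of OpinionClusterDocument are assumptions inferred from the six-element tuple in the reviewed signature.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RelatedCitingResults:
    """Named container replacing the positional tuple (field names assumed)."""

    # Stand-ins for list[OpinionClusterDocument] in the real code.
    related_clusters: list[Any] = field(default_factory=list)
    sub_opinion_pks: list[int] = field(default_factory=list)
    url_search_params: dict[str, str] = field(default_factory=dict)
    citing_clusters: list[Any] = field(default_factory=list)
    citing_cluster_count: int = 0
    timeout: bool = False


# Callers set and read fields by name instead of remembering tuple positions.
results = RelatedCitingResults(citing_cluster_count=3)
results.timeout = True
```

Adding a new field later only touches the class definition and the call sites that care about it, which is the maintenance benefit raised in the review.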

from cl.lib.elasticsearch_utils import do_count_query
from cl.lib.bot_detector import is_bot
from cl.lib.elasticsearch_utils import (
build_es_main_query,
Contributor

It seems like build_es_main_query might be unused in this file.

Contributor Author

removed

Comment on lines 277 to 310
try:
    # Execute the MultiSearch request as needed based on available
    # cached results
    multi_search = MultiSearch()
    response_index = 0
    related_index = citing_index = None
    if related_search_query:
        multi_search = multi_search.add(related_search_query)
        related_index = response_index
        response_index += 1
    if cited_search_query:
        multi_search = multi_search.add(cited_search_query)
        citing_index = response_index
    # params() returns a modified copy, so the result is reassigned
    multi_search = multi_search.params(
        timeout=f"{settings.ELASTICSEARCH_FAST_QUERIES_TIMEOUT}s"
    )
    responses = multi_search.execute() if multi_search._searches else []
    related_clusters: list[OpinionClusterDocument] = (
        list(responses[related_index])
        if related_index is not None
        else cached_related_clusters or []
    )
    citing_clusters: list[OpinionClusterDocument] = (
        list(responses[citing_index])
        if citing_index is not None
        else cached_citing_results or []
    )
    citing_cluster_count: int = (
        responses[citing_index].hits.total.value
        if citing_index is not None
        else cached_citing_cluster_count or 0
    )
    timeout_related = False if related_clusters else timeout_related
    timeout_cited = False if citing_clusters else timeout_cited
Contributor

We could potentially simplify the code within the try block by moving some logic outside of it. For example, we might be able to compute related_clusters, citing_clusters, citing_cluster_count, and timeouts after the try statement (using a dataclass could be helpful for this).

Contributor Author

Yeah, using the dataclass helped to also simplify the code here. Refactor applied.
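
One way the post-processing could live outside the try block is sketched below. It is not the refactor that was applied in the PR: `pick_results` is a hypothetical helper, plain lists stand in for Multi-Search responses, and the bounds check for an empty response list is an added assumption.

```python
from typing import Any, Optional


def pick_results(
    responses: list[list[Any]],
    index: Optional[int],
    cached: Optional[list[Any]],
) -> list[Any]:
    """Return fresh hits if the query was sent and answered, else the cache."""
    # index is None when the query was skipped because a cached value existed.
    if index is not None and index < len(responses):
        return list(responses[index])
    return cached or []


# Only the "cited by" query was sent; related clusters came from the cache.
responses = [["citing-1", "citing-2"]]
related = pick_results(responses, None, ["cached-related"])
citing = pick_results(responses, 0, None)
# A failed request leaves no responses, so the helper falls back to [].
after_failure = pick_results([], 0, None)
```

With a helper of this shape, the try block only needs to build and execute the request, and the mapping from response positions to results happens afterwards.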

Contributor

@ERosendo ERosendo left a comment

The PR looks good. I think we can merge it once my comments have been addressed and any conflicts resolved.

…sters_with_cache

- Added test cases for timeout and connection errors when getting mlt and citing clusters
Comment on lines +272 to +274
related_query = related_query.extra(
size=settings.RELATED_COUNT, track_total_hits=False
)

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

@albertisfu
Contributor Author

albertisfu commented Sep 19, 2024

@ERosendo Thank you for your review and suggestions!
I have applied the suggested refactor accordingly.

Also added additional tests to verify the expected behaviour on Connection timeouts and Connection errors.
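
A rough sketch of how timeout and connection-error behaviour of this kind could be exercised with unittest.mock. This is not the test code added in the PR: `fetch_with_fallback` is a hypothetical helper, and Python's built-in TimeoutError and ConnectionError stand in for the Elasticsearch client's exception classes.

```python
from unittest import mock


def fetch_with_fallback(execute, cached):
    """Run `execute`; on a timeout or connection failure fall back to cache.

    Returns a (results, timed_out) pair.
    """
    try:
        return execute(), False
    except (TimeoutError, ConnectionError):
        # Flag the timeout only when nothing cached can be shown instead.
        return (cached or []), not cached


# A mock whose call raises simulates the query timing out.
failing = mock.Mock(side_effect=TimeoutError("simulated timeout"))
results, timed_out = fetch_with_fallback(failing, cached=None)
cached_results, cached_timed_out = fetch_with_fallback(failing, cached=["c1"])
```

The two calls cover the cases described in the PR summary: a timeout with no cached clusters (message shown) and a timeout with cached clusters available (results shown).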

@ERosendo
Contributor

LGTM :shipit: 🚀

@mlissner mlissner merged commit e426297 into main Sep 19, 2024
13 checks passed
@mlissner mlissner deleted the 4438-improve-opinions-mlt-query branch September 19, 2024 14:14
@mlissner
Member

Thank you both!

sentry-io bot commented Sep 20, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ OpinionCluster.MultipleObjectsReturned: get() returned more than one OpinionCluster -- it returned 3! / View Issue
  • ‼️ NameError: name 'request' is not defined /opinion/{pk}/{slug}/summaries/ View Issue
  • ‼️ NameError: name 'urlencode' is not defined /opinion/{pk}/{slug}/summaries/ View Issue
  • ‼️ TypeError: fetch_related_clusters() missing 2 required positional arguments: 'cluster' and 'request' /opinion/{pk}/{_}/ View Issue
  • ‼️ NameError: name 'datetime' is not defined /opinion/{pk}/{_}/ View Issue
