4438 Improved MLT and "Cited by" queries on the Opinion page. #4446

albertisfu · 2024-09-12T16:37:20Z

This PR addresses #4438 and other related issues:

Simplified the MLT query to use only the following fields to match related opinions:
- "procedural_history"
- "posture"
- "syllabus"
- "text"
Removed highlighting for the MLT query.
The MLT query and the "Cited by" query are now executed in parallel using the Multi-Search API. This required a full refactor of the previous methods. The two queries executed on the Opinion page are now handled by a single method, es_get_citing_and_related_clusters_with_cache, which retrieves related and/or cited clusters from the cache or ES independently.
Added a timeout for the Multi-Search query, defined by the ELASTICSEARCH_FAST_QUERIES_TIMEOUT setting, which defaults to 2 seconds.
If the query times out and no cached related or "cited by" clusters are available, a message is displayed on the Opinion page:

Time out:

Available results:

Improved the MLT query structure to make it more efficient on the Opinion page, where nested opinions are not required. The query is as follows:

{
   "query":{
      "bool":{
         "filter":[
            {
               "match":{
                  "cluster_child":"opinion"
               }
            }
         ],
         "should":[
            {
               "bool":{
                  "must":[
                     {
                        "more_like_this":{
                           "fields":[
                              "procedural_history",
                              "posture",
                              "syllabus",
                              "text"
                           ],
                           "like":[
                              {
                                 "_id":"o_9883076"
                              },
                              {
                                 "_id":"o_10579993"
                              },
                              {
                                 "_id":"o_9883077"
                              },
                              {
                                 "_id":"o_95500"
                              }
                           ],
                           "min_term_freq":1,
                           "max_query_terms":12
                        }
                     }
                  ],
                  "must_not":[
                     {
                        "terms":{
                           "cluster_id":[
                              95500
                           ]
                        }
                     }
                  ]
               }
            }
         ],
         "minimum_should_match":1
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "collapse":{
      "field":"cluster_id"
   },
   "size":5,
   "track_total_hits":false,
   "_source":{
      "includes":[
         "absolute_url",
         "caseName",
         "cluster_id"
      ]
   }
}

This query is faster because it avoids the has_child query. Instead, a plain query is used to return opinions, which should be quicker than a join query. Additionally, this query excludes opinions related to the current cluster and uses a collapse by cluster_id to ensure only one opinion per cluster is returned, preventing duplicate clusters in the results. The query only returns the necessary fields for displaying results on the Opinion page.
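As an illustration only, the request body above could be assembled by a small helper like the one below. The function name `build_related_opinions_query` and its parameters are hypothetical; judging by the diff snippets in this PR, the real code builds the query with query objects rather than a raw dict, so treat this as a sketch of the structure, not the implementation.

```python
from typing import Any


def build_related_opinions_query(
    opinion_ids: list[int],
    cluster_id: int,
    size: int = 5,
) -> dict[str, Any]:
    """Assemble the MLT request body shown above as a plain dict (hypothetical helper)."""
    mlt = {
        "more_like_this": {
            "fields": ["procedural_history", "posture", "syllabus", "text"],
            # Opinion document IDs carry an "o_" prefix in the query above.
            "like": [{"_id": f"o_{pk}"} for pk in opinion_ids],
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
    return {
        "query": {
            "bool": {
                # Plain filter on opinion documents; no has_child join.
                "filter": [{"match": {"cluster_child": "opinion"}}],
                "should": [
                    {
                        "bool": {
                            "must": [mlt],
                            # Exclude opinions of the cluster being viewed.
                            "must_not": [
                                {"terms": {"cluster_id": [cluster_id]}}
                            ],
                        }
                    }
                ],
                "minimum_should_match": 1,
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
        # One hit per cluster, so no duplicate clusters in the results.
        "collapse": {"field": "cluster_id"},
        "size": size,
        "track_total_hits": False,
        "_source": {"includes": ["absolute_url", "caseName", "cluster_id"]},
    }


body = build_related_opinions_query(
    [9883076, 10579993, 9883077, 95500], cluster_id=95500
)
```

Keeping `must_not` and `collapse` in the same body is what lets a single flat query both skip the current cluster and return at most one opinion per related cluster.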

Although this query matched related opinions during testing, the MLT query parameters may need further tuning in production, as planned in #4305 ('Related Case Law' section is not being shown on the Opinion page when ES is enabled).
The MLT query for the Search frontend still uses the has_child query because nested opinions are needed for the frontend results. However, highlighting and other unnecessary clauses have been removed, fields have been simplified and the related cluster or clusters in the query are excluded from results.
Additional issues fixed in this PR:
- I identified a problem in clusters with more than one sub-opinion, and when a user passed multiple opinion IDs to the related: query in the search frontend. For instance:
  https://www.courtlistener.com/?q=related:1247437,9581751&stat_Precedential=on
  This query returned a MultipleObjectsReturned error:
  https://freelawproject.sentry.io/issues/5835257763/

This issue has been resolved by ensuring that if multiple sub-opinions belonging to the same cluster are queried, only one OpinionCluster is returned. If the opinions belong to different clusters, multiple clusters are returned, and all of them are shown in the frontend.
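The deduplication rule described above can be sketched in plain Python. This is a simplified stand-in, not the PR's code: the real fix operates on Django models, while here a dict maps opinion IDs to cluster IDs, and the name `clusters_for_opinions` and the sample data are hypothetical.

```python
def clusters_for_opinions(
    opinion_ids: list[int],
    opinion_to_cluster: dict[int, int],
) -> list[int]:
    """Return the distinct cluster IDs for the given opinions, in first-seen order."""
    seen: set[int] = set()
    cluster_ids: list[int] = []
    for opinion_id in opinion_ids:
        cluster_id = opinion_to_cluster.get(opinion_id)
        # Skip unknown opinions and clusters that were already collected.
        if cluster_id is None or cluster_id in seen:
            continue
        seen.add(cluster_id)
        cluster_ids.append(cluster_id)
    return cluster_ids


# Hypothetical data: opinions 1 and 2 are sub-opinions of cluster 100.
opinion_to_cluster = {1: 100, 2: 100, 3: 200}
same_cluster = clusters_for_opinions([1, 2], opinion_to_cluster)
different_clusters = clusters_for_opinions([1, 3], opinion_to_cluster)
```

Returning a list rather than a single object is the point: a lookup that expects exactly one cluster is what fails when sub-opinions or multiple IDs are involved.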

semgrep-app · 2024-09-12T16:40:26Z

cl/search/templates/search.html

+                                {{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %}
+                              {% endfor %}
+                            {% else %}
+                              {{ related_cluster.caption|safe|v_wrapper }}


Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
_{Ignore this finding from template-unescaped-with-safe.}

semgrep-app · 2024-09-12T16:40:27Z

cl/search/templates/search.html

+                            <span class="gray alt">related to</span>
+                            {% flag "o-es-active" %}
+                              {% for cluster in related_cluster %}
+                                {{ cluster.caption|safe|v_wrapper }}{% if not forloop.last %}, {% endif %}


Detected a segment of a Flask template where autoescaping is explicitly disabled with '| safe' filter. This allows rendering of raw HTML in this segment. Ensure no user data is rendered here, otherwise this is a cross-site scripting (XSS) vulnerability.
_{Ignore this finding from template-unescaped-with-safe.}

semgrep-app · 2024-09-12T16:40:28Z

cl/opinion_page/utils.py

+        related_search_query = related_search_query.extra(
+            size=settings.RELATED_COUNT, track_total_hits=False
+        )


'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').
_{Ignore this finding from avoid-query-set-extra.}

semgrep-app · 2024-09-12T16:40:29Z

cl/opinion_page/utils.py

+        cluster_related_query.sort({"_score": {"order": "desc"}})
+        .source(includes=["absolute_url", "caseName", "cluster_id"])
+        .extra(size=5)


'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').
_{Ignore this finding from avoid-query-set-extra.}

semgrep-app · 2024-09-12T19:18:07Z

cl/opinion_page/utils.py

+    search_query = (
+        cluster_cites_query.sort({"citeCount": {"order": "desc"}})
+        .source(includes=["absolute_url", "caseName", "dateFiled"])
+        .extra(size=5, track_total_hits=True)
+    )


'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').
_{Ignore this finding from avoid-query-set-extra.}

mlissner · 2024-09-13T06:00:26Z

Looks great at a skim. I'll let Eduardo do the full review. Thank you for finding so many things to fix on this. Jeesh!

ERosendo · 2024-09-17T14:14:35Z

cl/opinion_page/utils.py

+) -> tuple[
+    list[OpinionClusterDocument],
+    list[int],
+    dict[str, str],
+    list[OpinionClusterDocument],
+    int,
+    bool,
+]:


I believe that using a data class instead of a tuple can make this code easier to read and understand. Tuples can get confusing when you have a lot of things in them, especially if you need to remember the right order. If you need to add something new to the tuple later, it can be a pain to change everything else. With a data class, it’s much simpler and less error-prone.

Yeah the dataclass is a good fit here. I've added it (RelatedCitingResults)
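For readers following the suggestion, here is a rough sketch of what such a dataclass might look like. Only the class name `RelatedCitingResults` comes from this thread; every field name below is a guess mapped onto the tuple's type hints, and `Any` stands in for `OpinionClusterDocument` to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RelatedCitingResults:
    # Field names are hypothetical; only the types follow the original tuple.
    related_clusters: list[Any] = field(default_factory=list)
    sub_opinion_pks: list[int] = field(default_factory=list)
    url_search_params: dict[str, str] = field(default_factory=dict)
    citing_clusters: list[Any] = field(default_factory=list)
    citing_cluster_count: int = 0
    timeout: bool = False


# Callers read values by name instead of by tuple position.
results = RelatedCitingResults(citing_cluster_count=3)
results.sub_opinion_pks.append(9883076)
```

With defaults on every field, adding a new value later does not force every call site to change, which is the maintenance concern raised in the review comment.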

ERosendo · 2024-09-17T15:21:46Z

cl/opinion_page/utils.py

-from cl.lib.elasticsearch_utils import do_count_query
+from cl.lib.bot_detector import is_bot
+from cl.lib.elasticsearch_utils import (
+    build_es_main_query,


It seems like build_es_main_query might be unused in this file.

ERosendo · 2024-09-17T16:57:54Z

cl/opinion_page/utils.py

+    try:
+        # Execute the MultiSearch request as needed based on available
+        # cached results
+        multi_search = MultiSearch()
+        response_index = 0
+        related_index = citing_index = None
+        if related_search_query:
+            multi_search = multi_search.add(related_search_query)
+            related_index = response_index
+            response_index += 1
+        if cited_search_query:
+            multi_search = multi_search.add(cited_search_query)
+            citing_index = response_index
+        multi_search.params(
+            timeout=f"{settings.ELASTICSEARCH_FAST_QUERIES_TIMEOUT}s"
+        )
+        responses = multi_search.execute() if multi_search._searches else []
+        related_clusters: list[OpinionClusterDocument] = (
+            list(responses[related_index])
+            if related_index is not None
+            else cached_related_clusters or []
+        )
+        citing_clusters: list[OpinionClusterDocument] = (
+            list(responses[citing_index])
+            if citing_index is not None
+            else cached_citing_results or []
+        )
+        citing_cluster_count: int = (
+            responses[citing_index].hits.total.value
+            if citing_index is not None
+            else cached_citing_cluster_count or 0
+        )
+        timeout_related = False if related_clusters else timeout_related
+        timeout_cited = False if citing_clusters else timeout_cited


We could potentially simplify the code within the try block by moving some logic outside of it. For example, we might be able to compute related_clusters, citing_clusters, citing_cluster_count, and timeouts after the try statement (using a dataclass could be helpful for this).

Yeah, using the dataclass helped to also simplify the code here. Refactor applied.
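One way to picture the simplification discussed here: keep only the network call inside the `try`, and key the responses by name so no positional index bookkeeping is needed. The sketch below is not the refactor that was merged; `run_pending_queries` and `fake_execute` are hypothetical, the queries are placeholder strings, and the built-in `TimeoutError` stands in for whatever exception the ES client raises.

```python
from typing import Any, Callable


def run_pending_queries(
    pending: dict[str, Any],
    execute: Callable[[list[Any]], list[Any]],
) -> tuple[dict[str, Any], bool]:
    """Run only the queries that missed the cache.

    Returns (responses keyed by query name, timed_out flag).
    """
    if not pending:
        return {}, False
    names = list(pending)
    try:
        # The only statement that can fail on a slow or unreachable cluster.
        responses = execute([pending[name] for name in names])
    except TimeoutError:
        return {}, True
    return dict(zip(names, responses)), False


def fake_execute(queries: list[Any]) -> list[Any]:
    # Stand-in for a multi-search call: one response per query, in order.
    return [[f"hit for {query}"] for query in queries]


cached_related = None  # cache miss: must be queried
cached_citing = ["cached citing cluster"]  # cache hit: no query needed

pending: dict[str, Any] = {}
if cached_related is None:
    pending["related"] = "mlt-query"
if cached_citing is None:
    pending["citing"] = "cited-by-query"

responses, timed_out = run_pending_queries(pending, fake_execute)
# Fall back to cached values for anything that was not queried.
related = responses.get("related", cached_related or [])
citing = responses.get("citing", cached_citing or [])
```

Everything after the call is plain dictionary lookups, so the fallback to cached values lives outside the error-handling path.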

ERosendo

The PR looks good. I think we can merge it once my comments have been addressed and any conflicts resolved.

semgrep-app · 2024-09-19T00:06:41Z

cl/opinion_page/utils.py

+        related_query = related_query.extra(
+            size=settings.RELATED_COUNT, track_total_hits=False
+        )


'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').
_{Ignore this finding from avoid-query-set-extra.}

albertisfu · 2024-09-19T00:25:54Z

@ERosendo Thank you for your review and suggestions!
I have applied the suggested refactor accordingly.

Also added additional tests to verify the expected behaviour on Connection timeouts and Connection errors.

ERosendo · 2024-09-19T12:03:51Z

LGTM 🚀

mlissner · 2024-09-19T14:14:57Z

Thank you both!

sentry-io · 2024-09-20T15:09:26Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ OpinionCluster.MultipleObjectsReturned: get() returned more than one OpinionCluster -- it returned 3! / View Issue
‼️ NameError: name 'request' is not defined /opinion/{pk}/{slug}/summaries/ View Issue
‼️ NameError: name 'urlencode' is not defined /opinion/{pk}/{slug}/summaries/ View Issue
‼️ TypeError: fetch_related_clusters() missing 2 required positional arguments: 'cluster' and 'request' /opinion/{pk}/{_}/ View Issue
‼️ NameError: name 'datetime' is not defined /opinion/{pk}/{_}/ View Issue

albertisfu added 2 commits September 11, 2024 18:40

fix(elasticsearch): Improved citing and related clusters query for Op…

5f19338

…inions page - Make queries run in parallel - Simplify the MLT query - Add a custom timeout Fixes: #4438

fix(elasticsearch): Improve the MLT query to support multi sub-opinio…

97b5c74

…ns clusters - Show unable to retrieve related cluster on query timeout

albertisfu changed the title ~~4438 Improved MLT and 'Cited by' queries on the Opinion page.~~ 4438 Improved MLT and "Cited by" queries on the Opinion page. Sep 12, 2024

Merge branch 'main' into 4438-improve-opinions-mlt-query

6b70f51

semgrep-app bot reviewed Sep 12, 2024

fix(elasticsearch): Show unable to retrieve cited by clusters message…

f726896

… on query timeout

semgrep-app bot reviewed Sep 12, 2024

fix(elasticsearch): Fixed es_get_citing_and_related_clusters_with_cac…

bda35db

…he type hints

albertisfu marked this pull request as ready for review September 12, 2024 20:26

albertisfu requested a review from mlissner September 12, 2024 20:26

ERosendo reviewed Sep 17, 2024

ERosendo approved these changes Sep 17, 2024

albertisfu added 2 commits September 18, 2024 14:14

Merge branch 'main' into 4438-improve-opinions-mlt-query

8a5e28a

fix(elasticsearch): Refactor applied to es_get_citing_and_related_clu…

f6e7e00

…sters_with_cache - Added test cases for timeout and connection errors when getting mlt and citing clusters

semgrep-app bot reviewed Sep 19, 2024

mlissner merged commit e426297 into main Sep 19, 2024
13 checks passed

mlissner deleted the 4438-improve-opinions-mlt-query branch September 19, 2024 14:14
