Match opinions based on pincite #4323

Open
albertisfu opened this issue Aug 19, 2024 · 7 comments
@albertisfu
Contributor

This is a follow-up to #4211, where we discussed potential improvements for matching the correct opinion when resolving citations to create OpinionCited instances.

Currently, within es_reverse_match, if more than one of the Opinions found belongs to the same cluster, the first one matched is used to create the OpinionCited instance.

This can be improved in one of two ways:

Now that we have opinion ordering, it makes sense to choose the first opinion by order.

But that could be improved because sometimes a citation will have what's called a pincite, which is a citation to a particular page. If we know the pincite, we should figure out which of the sub-opinions it refers to based on the page numbers for each citation, and use that sub-opinion. (But this is hard.)

Currently, ordering keys in Opinions are empty, so this improvement might have to wait until the ordering is populated.

@mlissner mlissner changed the title from "Enhance sub-Opinions matching logic within es_reverse_match" to "Match opinions based on pincite" Dec 16, 2024
@mlissner
Member

This is a bit of a complicated bug, so here's the TLDR:

  1. Eyecite finds a citation.
  2. We look for clusters with that citation.
  3. We find one, but it has three sub-opinions.
  4. Currently we just match the first one to be indexed.
  5. Instead, we should match the first one by ordering_key, or if there's a pincite, we should match the opinion where that pincite occurs.

@flooie flooie moved this to To Do in Case Law Sprint Dec 17, 2024
@grossir grossir moved this from To Do to In progress in Case Law Sprint Dec 18, 2024
@grossir
Contributor

grossir commented Dec 19, 2024

I think the scenarios proposed look like this:

  • We are resolving a pincite

    • Does the opinion "content" have page numbers available?
      • Yes: match the opinion in the cluster that has the proper page number
      • No: does the cluster have an ordering key?
        • Yes: use the first ordering key
        • No: use the first indexed (we do this currently)
  • We are resolving a general citation

    • Does the cluster have an ordering key?
      • Yes: use the first ordering key
      • No: use the first indexed (we do this currently)
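
The decision tree above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SubOpinion`, its `pages` field, and `pick_opinion` are hypothetical names rather than CourtListener code, and the input list is assumed to be in indexing order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubOpinion:
    """Hypothetical stand-in for one opinion inside a cluster."""
    id: int
    ordering_key: Optional[int] = None
    # Page numbers marked up in the opinion content; empty if none are known.
    pages: frozenset = frozenset()


def pick_opinion(opinions: List[SubOpinion], pincite_page: Optional[int] = None) -> SubOpinion:
    """Pick the opinion to cite; `opinions` is assumed to be in indexing order."""
    if pincite_page is not None:
        on_page = [o for o in opinions if pincite_page in o.pages]
        if on_page:
            # If several opinions share the page, this sketch takes the first.
            return on_page[0]
    ordered = [o for o in opinions if o.ordering_key is not None]
    if ordered:
        return min(ordered, key=lambda o: o.ordering_key)
    return opinions[0]  # current behavior: first indexed
```

Both branches fall through to the same ordering-key and first-indexed fallbacks, so a general citation is simply the case where `pincite_page` is `None`.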

For the scenarios where we can't resolve the pincite I think we should match the "combined" opinion if it exists, instead of the first in order.

Looking at a particular cluster with 4 opinions: the first 3 have ordering keys and types lead, concurrence, and dissent; the last and oldest is a "combined" opinion with a null ordering key. So, if the pincite cannot be resolved, or in the case of a general citation, shouldn't we point to the combined opinion, which is the "whole" decision, instead of a fragment?

Even if we decide against this, there would be corrections to do. In that same cluster, 3297 opinions are citing the combined opinion; and only 1 is citing the "lead" opinion.

Jumping into the task itself:

Case when we can identify page numbers in the opinion

> if there's a pincite, we should match the opinion where that pincite occurs.

This is possible for opinions that come from a HTML / XML source

  • html_lawbox : <span class=\"star-pagination\">*115</span> . Example
  • xml_harvard: <page-number citation-index=\"1\" label=\"116\">*116</page-number> . Example
  • html_columbia: <span class=\"star-pagination\">*Page 319</span> . Example
  • html: This should be possible for some courts that mark up their page numbers, but we would need to check on a case-by-case basis and it would take more time to implement
  • plain_text: I think it wouldn't be possible to do cleanly, since if the page numbers are not clearly marked up, they could collide with random numbers in the opinion's text

Once we have the pincite page number, we test for its presence in any of the HTML fields, and it's a match if it exists.
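
A minimal sketch of that presence test, assuming the stored markup uses plain (unescaped) quotes and looks like the three samples listed above. The regexes and the names `pages_in_markup` and `has_page` are illustrative, not existing CourtListener helpers.

```python
import re

# Patterns modelled on the html_lawbox / html_columbia and xml_harvard samples.
STAR_PAGINATION = re.compile(
    r'<span class="star-pagination">\s*\*(?:Page\s+)?(\d+)\s*</span>'
)
HARVARD_PAGE = re.compile(r'<page-number\b[^>]*\blabel="(\d+)"')


def pages_in_markup(markup: str) -> set:
    """Return the page numbers explicitly marked up in one content field."""
    found = STAR_PAGINATION.findall(markup) + HARVARD_PAGE.findall(markup)
    return {int(page) for page in found}


def has_page(markup: str, pincite_page: int) -> bool:
    """True only if the pincite page appears as a marked-up page number."""
    return pincite_page in pages_in_markup(markup)
```

Matching only on the pagination markup, rather than on any occurrence of the number, avoids the collision with random numbers in the text that makes plain_text hard.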

We have 363 895 clusters where a pincite citing into them would be resolvable, around 3.7% of the clusters in the DB

It seems eyecite already identifies pincites...

Matching on ordering key

> Currently we just match the first one to be indexed.
> Instead, we should match the first one by ordering_key

This can be done easily, but I am unsure if it's the correct choice. If we do it, should we back-correct the OpinionsCited table for cases like the example above?

As of this writing, 430 596 clusters (4.38% of all clusters) have at least one opinion with a non-null ordering_key.


Queries for the stats:

-- clusters with more than 1 opinion, and 1 rich structured field per opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having count(*) > 1 and bool_and(xml_harvard <> '' or html_lawbox <> '' or html_columbia <> '')) a;
 count  
--------
 363895


-- clusters with at least 1 ordering key
courtlistener=> select count(distinct(cluster_id)) from search_opinion where ordering_key is not null;
 count  
--------
 430596
(1 row)

courtlistener=> select count(distinct(cluster_id)) from search_opinion;
  count  
---------
 9823322
(1 row)

-- clusters that do not have a combined opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having bool_and(ordering_key is not null)) a;
 count  
--------
 219875
(1 row)

Some extra thoughts

Assuming we can resolve the pincites, we should update the html_with_citations of the citing opinion to hyperlink (probably with an HTML fragment) the page number; so that when followed, the reader is autoscrolled into the proper page.


Also, resolving pincites suggests some model changes. We could add a field to OpinionsCited with the actual page number.
This would require deleting the "depth" field. However, the information on that field would not be lost, since it could be re-computed via aggregating the same table over citing_opinion and cited_opinion. Having the proper pincite as a DB field would allow finer grain analysis without losing the "depth" information.

class OpinionsCited(models.Model):
    citing_opinion = models.ForeignKey(
        Opinion, related_name="cited_opinions", on_delete=models.CASCADE
    )
    cited_opinion = models.ForeignKey(
        Opinion, related_name="citing_opinions", on_delete=models.CASCADE
    )
    pincite = models.IntegerField(
        help_text="The page cited"
    )

--- computing depth
SELECT citing_opinion_id, cited_opinion_id, count(*) AS depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id;

What's more, we could even leave the "depth" on the model, as a "depth" of pincites, which I imagine happens if the same opinion part is cited multiple times

--- computing depth
SELECT citing_opinion_id, cited_opinion_id, sum(depth) AS depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id;
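
To illustrate why "depth" would not be lost, here is a small sketch of the first aggregation in plain Python. The rows are made-up data, and the one-row-per-pincite layout (with `None` for a citation without a pincite) is an assumption about how the proposed table might look, not the existing schema.

```python
from collections import Counter

# Hypothetical rows: (citing_opinion_id, cited_opinion_id, pincite page or None)
rows = [
    (1, 2, 198),
    (1, 2, 201),
    (1, 2, None),  # a general citation with no pincite
    (3, 2, 198),
]


def depth_by_pair(rows):
    """Count rows per (citing, cited) pair, mirroring count(*) with GROUP BY."""
    return Counter((citing, cited) for citing, cited, _page in rows)


depths = depth_by_pair(rows)
```

Here `depths[(1, 2)]` is 3 and `depths[(3, 2)]` is 1, which is the same information the current "depth" column carries for each pair.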

@mlissner
Member

> For the scenarios where we can't resolve the pincite I think we should match the "combined" opinion if it exists, instead of the first in order.

Hm, @flooie might have an opinion here, but I think if we have sub-opinions (plural) as well as a combined opinion, we should just match to the first one when we can't resolve the pincite. I think it's generally the most important decision in the cluster and the one that's assumed.

> Even if we decide against this, there would be corrections to do. In that same cluster, 3297 opinions are citing the combined opinion; and only 1 is citing the "lead" opinion.

When we re-run the citation finder, it'll nuke existing citations and replace them with better ones. It's designed that way.

> Once we have the pincite page number, we test for its presence in any of the HTML fields, and it's a match if it exists.

We wouldn't want to be looking in the HTML to do matches, BUT if we're going to do pincites, we should fix #4843 first. I think it'd give us an efficient way to do this.

> We should update the html_with_citations of the citing opinion to hyperlink (probably with an HTML fragment) the page number; so that when followed, the reader is autoscrolled into the proper page.

Yes!

> This would require deleting the "depth" field.

Hm, that doesn't seem worth it, but could we just not store the pincite in the DB? Our destination is:

  1. A link from opinion A to the pin-cited opinion B (this goes in the DB)
  2. An anchor fragment (e.g. #page-22). Maybe we just put that in the HTML and that's good enough?

I noted on the pincite sub-issue that it would be hard to do. Up to Bill if it's worth it now or something we should do later. It's pretty tough.

@grossir grossir moved this from In progress to Blocked in Case Law Sprint Dec 20, 2024
@flooie
Contributor

flooie commented Jan 2, 2025

Perhaps this is obvious, but I would just point out that pin-cites alone are not sufficient to identify which sub-opinion is being cited.

If someone searched for 58 U.S. 596 at 600, our system would fail unless we had either the author or the text to disambiguate which opinion is meant. If you look at the image below, you can see that page 600 contains the end of the majority, the dissent, and the concurrence.

Image
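
A small sketch of this ambiguity, assuming we knew each sub-opinion's first and last page. The ranges below are hypothetical and only loosely modelled on the situation described; they are not the real page ranges of 58 U.S. 596, and `candidates_for_page` is an illustrative name.

```python
def candidates_for_page(page_ranges, pincite_page):
    """Return the sub-opinions whose (first, last) page range contains the pincite."""
    return [
        opinion
        for opinion, (first, last) in page_ranges.items()
        if first <= pincite_page <= last
    ]


# Hypothetical ranges in which three sub-opinions all touch page 600.
ranges = {
    "majority": (596, 600),
    "concurrence": (600, 600),
    "dissent": (600, 602),
}

ambiguous = candidates_for_page(ranges, 600)
unambiguous = candidates_for_page(ranges, 598)
```

With these ranges a pincite to page 600 yields three candidates, so the page alone cannot pick a sub-opinion, while a pincite to page 598 yields only the majority.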

> Hm, @flooie might have an opinion here, but I think if we have sub-opinions (plural) as well as a combined opinion, we should just match to the first one when we can't resolve the pincite. I think it's generally the most important decision in the cluster and the one that's assumed.

I agree

> We wouldn't want to be looking in the HTML to do matches, BUT if we're going to do pincites, we should fix #4843 first. I think it'd give us an efficient way to do this.

Fixing #4843 only allows us to pincite to the cluster in a safer way; it doesn't help us pincite to sub-opinions, as highlighted above.

> An anchor fragment (e.g. #page-22). Maybe we just put that in the HTML and that's good enough?

We should already be generating #p22 anchor tags. The javascript standardizes most (hopefully all) citations so that each is linkable in the new design.
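
For illustration, building the link target on the citing side might look like the sketch below. The `p` prefix follows the `#p22` form mentioned above, but the exact anchor format is an assumption (other URLs in this thread use a bare number such as `#198`), and `with_page_fragment` is a hypothetical helper.

```python
from urllib.parse import urldefrag


def with_page_fragment(url: str, page: int, prefix: str = "p") -> str:
    """Replace any existing fragment on `url` with a page anchor."""
    base, _old_fragment = urldefrag(url)
    return f"{base}#{prefix}{page}"


link = with_page_fragment(
    "https://www.courtlistener.com/opinion/1588427/in-re-allstate-county-mut-ins-co/",
    198,
)
```

Keeping the prefix as a parameter leaves room for whichever anchor convention the rendered opinion pages actually use.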

I'm not sure anyone mentioned the fact that parallel citations are also going to make things trickier.

@grossir
Contributor

grossir commented Jan 2, 2025

Some thoughts after talking with Bill and looking for examples

  • The "page" anchor tags are already generated for marked up opinions (example), we would need to generate the proper anchor on the citing opinion
  • We aren't considering "paragraph" pincites, like "Ward at ¶ 30". I think these hold no ambiguity, since they are not page numbers, but I am not sure how common they are.
    - Example of an ohioctapp opinion citing other opinions using the paragraphs.
    - Example of an az opinion written with marked up paragraphs (that we do not enrich in our HTML)
  • I found an html_with_citations issue that I did not see mapped on the parent issue: id. citations sometimes put too much text inside the <a> tag #4882
  • I agree with Bill that some page-number pincites would be ambiguous when multiple opinions are on the same page; but some are not (see 2 examples below), given that the pincite points to a page that belongs to a single opinion

| Type | Pincite | Comment | Citing op | Cited op |
| --- | --- | --- | --- | --- |
| Pincite to non-majority opinion | See S. Bell Tel. & Tel. Co. v. Pub. Serv. Comm'n, 270 S.C. 590, 610, 244 S.E.2d 278, 288 (1978) (Ness, J., concurring in part and dissenting in part) | The "in part" opinion begins on page 605, so this pincite would actually be resolvable. Also, note that there are 2 pincites: the hyperlinked one does not correspond to the numbering we actually have on display | https://www.courtlistener.com/opinion/5065237/in-re-application-of-blue-granite/?q=%22Roe+at%22+dissenting&type=o&order_by=dateFiled+desc&stat_Published=on | https://www.courtlistener.com/opinion/1338206/sou-bell-tel-tel-co-v-pub-ser-comm/#610 |
| Pincite to page | Aros v. Beneficial Ariz., Inc., 194 Ariz. 62, 66 (1999). | We display the opinion in the parallel citation format, not how it was cited, thus the fragment wouldn't work | https://www.courtlistener.com/opinion/9491968/planned-parenthood-v-kristin-mayeshazelrigg/?q=%22Roe+at%22+dissenting&type=o&order_by=dateFiled+desc&stat_Published=on | https://www.courtlistener.com/opinion/1187886/leonard-h-v-beneficial-arizona-inc/ |
| Pincite to paragraph | Medina, 2011-Ohio-3990, at ¶ 13 (8th Dist.) | | https://www.courtlistener.com/opinion/10014124/camacho-v-rose-mary-johanna-graselli-rehab-inc/?q=%22Roe+at%22&type=o&order_by=dateFiled+desc&stat_Published=on | https://www.courtlistener.com/opinion/2704393/medina-v-medina-gen-hosp/ |
| Pincite to non-majority opinion | See, e.g., In re Allstate Cty. Mut. Ins., 85 S.W.3d at 198 | Would be resolvable, dissent begins at 197 | https://www.courtlistener.com/opinion/4635540/barbara-technologies-corporation-v-state-farm-lloyds/?q=Barbara+Techs.+Corp.+v.+State+Farm+Lloyds | https://www.courtlistener.com/opinion/1588427/in-re-allstate-county-mut-ins-co/#198 |
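
The page and paragraph cases in these examples could be told apart with something like the sketch below. It assumes the pincite text has already been isolated as a short string such as "at 198" or "at ¶ 13"; how eyecite actually exposes pincites is not assumed here, and `classify_pincite` is a hypothetical helper.

```python
import re

PARAGRAPH_PIN = re.compile(r"¶+\s*(\d+)")
PAGE_PIN = re.compile(r"\s*(?:at\s+)?(\d+)\s*")


def classify_pincite(text: str):
    """Return ("paragraph", n), ("page", n), or None if neither form is recognized."""
    paragraph = PARAGRAPH_PIN.search(text)
    if paragraph:
        return ("paragraph", int(paragraph.group(1)))
    page = PAGE_PIN.fullmatch(text)
    if page:
        return ("page", int(page.group(1)))
    return None
```

Only the "page" results would need the page-range disambiguation discussed in this thread; "paragraph" results could be handled separately or skipped.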

@flooie
Contributor

flooie commented Jan 2, 2025

I think we should table this and just link to the first ordered opinion.

I think we first need to improve eyecite more, as well as think about changes to the citation and/or other models.

@flooie flooie moved this from Blocked to General Backlog in Case Law Sprint Jan 2, 2025
@mlissner
Member

mlissner commented Jan 3, 2025

Sounds good, thanks Bill and everybody else for the analysis! We'll get to this at some later point.

@flooie flooie moved this from General Backlog to Future... in Case Law Sprint Jan 14, 2025