
feat(recap.mergers): Update PACER attachment processing #4665

Merged
11 commits, merged Dec 14, 2024

Conversation

flooie
Contributor

@flooie flooie commented Nov 6, 2024

@ERosendo

This change attempts to address the black hole that is doppelgänger criminal attachments.

In cases where we have "doppelgänger" dockets, it's possible that the pacer case id is filtering out the docket that is available for the attachment.

@flooie flooie requested a review from ERosendo November 6, 2024 21:58
@flooie
Contributor Author

flooie commented Nov 6, 2024

Should fix issue #4664

cl/recap/mergers.py (outdated review thread, resolved)
@flooie flooie requested a review from grossir November 7, 2024 12:44
except RECAPDocument.DoesNotExist as exc:
    # In cases where we have "doppelgänger" dockets, drop the pacer
    # case id and check if the docket exists once more.
    if params.get("pacer_case_id"):
Contributor

@grossir grossir Nov 7, 2024


I think the params dict key you want to look for is actually "docket_entry__docket__pacer_case_id", as in line 1641 in this same function. search_recapdocument does not have a field for pacer_case_id, so that key would never be found in the params dict:

                                             Table "public.search_recapdocument"
         Column          |           Type           | Collation | Nullable |                     Default                      
-------------------------+--------------------------+-----------+----------+--------------------------------------------------
 id                      | integer                  |           | not null | nextval('search_recapdocument_id_seq'::regclass)
 date_created            | timestamp with time zone |           | not null | 
 date_modified           | timestamp with time zone |           | not null | 
 date_upload             | timestamp with time zone |           |          | 
 document_type           | integer                  |           | not null | 
 document_number         | character varying(32)    |           | not null | 
 attachment_number       | smallint                 |           |          | 
 pacer_doc_id            | character varying(64)    |           | not null | 
 is_available            | boolean                  |           |          | 
 sha1                    | character varying(40)    |           | not null | 
 filepath_local          | character varying(1000)  |           | not null | 
 filepath_ia             | character varying(1000)  |           | not null | 
 docket_entry_id         | integer                  |           | not null | 
 description             | text                     |           | not null | 
 ocr_status              | smallint                 |           |          | 
 plain_text              | text                     |           | not null | 
 page_count              | integer                  |           |          | 
 is_free_on_pacer        | boolean                  |           |          | 
 ia_upload_failure_count | smallint                 |           |          | 
 file_size               | integer                  |           |          | 
 thumbnail               | character varying(100)   |           |          | 
 thumbnail_status        | smallint                 |           | not null | 
 is_sealed               | boolean                  |           |          | 
 acms_document_guid      | character varying(64)    |           | not null | 
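
A minimal sketch of the suggested correction, assuming the params shape from the diff above (import path assumed):

from cl.search.models import RECAPDocument  # import path assumed

try:
    rd = RECAPDocument.objects.get(**params)
except RECAPDocument.DoesNotExist:
    # Drop the case-id filter and retry. The key is the related-field
    # lookup, since search_recapdocument has no pacer_case_id column.
    if params.pop("docket_entry__docket__pacer_case_id", None):
        rd = RECAPDocument.objects.get(**params)
    else:
        raise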

Contributor

@albertisfu albertisfu left a comment


I did some tests using the doppelgänger oknd case with pacer_case_id 27765 and 27767, for document 811. Here are my findings:

  • The approach of removing the pacer_case_id from the RD lookup params will help only when the main RD related to the attachment page being uploaded is not present in the docket related to the pacer_case_id used for the attachment page upload.

For instance, if the attachment page upload is sent with the following params:

upload_type=2
court=oknd
pacer_case_id=27767

And considering the attachment page has the following metadata:

document_number=811
pacer_case_id=27767
pacer_doc_id=14703438821
attachments:
  attachment_number=1
  pacer_doc_id=14713438822

If the docket with pacer_case_id 27767 doesn't have a main RD with pacer_doc_id=14703438821, but the docket with pacer_case_id 27765 does have it, then removing the pacer_case_id from the lookup will match main RECAPDocument 14703438821 from the 27765 docket and merge it there.

However, I see an issue here:

  • If there are more than two dockets for the doppelgänger case and the main RECAPDocument for the provided pacer_doc_id exists in more than one of them, this lookup will raise a RECAPDocument.MultipleObjectsReturned exception.
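
For illustration, a minimal sketch of that failure mode, with the lookup shape assumed from the diff above:

from cl.search.models import RECAPDocument  # import path assumed

try:
    # With pacer_case_id dropped, this matches every doppelgänger
    # docket in the court that carries the same document.
    rd = RECAPDocument.objects.get(
        pacer_doc_id="14703438821",
        docket_entry__docket__court_id="oknd",
    )
except RECAPDocument.MultipleObjectsReturned:
    # Raised as soon as two or more dockets have the main RD.
    ...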

So it seems that the main issue is that attachment page uploads for doppelgänger case documents (at least for this court) always belong to the same pacer_case_id, even if you clicked the document from a different pacer_case_id. You can confirm this as follows:

Go to case 27765:
https://ecf.oknd.uscourts.gov/cgi-bin/DktRpt.pl?27765
Request documents from 811 to 811 and run the report.
Click document 811.
On the attachment page, inspect Document Number 811. You will see:
<a href="https://ecf.oknd.uscourts.gov/doc1/14713438821" onclick="goDLS('/doc1/14713438821','27767','4230','','1','1','','','');return(false);">811</a>

The link for this document contains the pacer_case_id 27767, which I believe the extension uses to upload the attachment page. (@ERosendo could help to confirm this).

So it seems that, at least in this court, we won't get attachment pages for case_ids other than the latest pacer_case_id.
We can confirm this by looking for attachment page uploads for 27765:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27765&upload_type=2
None can be found.

While there are attachment uploads for 27767:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27767&upload_type=2

Unless this issue can be solved from the extension, I think we'll need to apply a different approach. For instance, I believe we can look up RECAPDocuments with the provided pacer_doc_id and court and, if more than one is found, merge the attachment page into all of them. If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to apply that approach only there. Otherwise, we'd need to apply it every time, assuming multiple documents with the same pacer_doc_id and court are only possible on doppelgänger dockets.
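
A rough sketch of that fallback; merge_attachment_page here is a placeholder for the existing merge logic, not an actual helper:

from cl.search.models import RECAPDocument  # import path assumed

rds = RECAPDocument.objects.filter(
    pacer_doc_id=pacer_doc_id,
    docket_entry__docket__court_id=court_id,
)
# Rather than failing when the document exists on several dockets,
# merge the attachment page into every match.
for rd in rds:
    merge_attachment_page(rd, attachment_data)  # placeholder helper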

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

@mlissner
Member

If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there.

Not that I know of, unfortunately.

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

@albertisfu
Contributor

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

Sure, just regarding this comment so we can get the right size:

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

Also, I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

@mlissner
Member

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

I think PRs always take priority, except over P0s, which I see as "something is burning".

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.

@mlissner
Member

mlissner commented Nov 13, 2024

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

Would you have space for both solutions in this sprint?

@albertisfu
Contributor

I think PR's always take priority, except over P0's, which I see as "something is burning".

Correct; it's just that in this case the required solution would be completely different from the one in this PR.

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id, and just make the copy? Seems easy enough, I suppose and I think it's a good step for doppelgänger.

Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that, as a first step, we always ignore the pacer_case_id if we encounter a RECAPDocument.MultipleObjectsReturned. Then, we check whether the RDs belong to different dockets and merge the attachment pages and PDFs across all of them.

If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates.
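
A rough sketch of this two-step idea (helper names are placeholders):

from cl.search.models import RECAPDocument  # import path assumed

try:
    rd = RECAPDocument.objects.get(**params)
except RECAPDocument.MultipleObjectsReturned:
    params.pop("docket_entry__docket__pacer_case_id", None)
    rds = RECAPDocument.objects.filter(**params)
    docket_ids = {rd.docket_entry.docket_id for rd in rds}
    if len(docket_ids) > 1:
        # Subdocket scenario: merge into every related docket.
        for rd in rds:
            merge_into_docket(rd, upload_data)  # placeholder helper
    else:
        # Duplicates within a single docket: keep the best RD and
        # clean up the rest, as the current logic already does.
        rd = select_best_rd_and_clean_duplicates(rds)  # placeholder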

This approach would essentially make the pacer_case_id parameter useless. That’s why I was concerned about finding a way to detect potential doppelgängers and only apply the logic in those cases. So now my question is: Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases.

The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases.

So I think it’s best to have a separate issue and PR for the PDFs.

I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachment and PDF uploads.

Would you have space for both solutions in this sprint?

I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.

@mlissner
Member

Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

I wondered the same thing. I think it's OK; we haven't seen anything like that, as far as I know.

Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too!

Thank you.

@albertisfu
Contributor

I’ve applied the approach discussed above. Initially, I considered implementing it within the merge_attachment_page_data method. However, I realized it would be better to apply it outside of that method, for a few reasons:

  • When uploading an attachment page, a ProcessingQueue and a PacerHtmlFile are generated for every upload. These instances are useful for debugging purposes and for reprocessing content, if needed. One solution would be to create additional PQs within merge_attachment_page_data for each attachment page requiring merging in doppelgänger cases. However, this doesn’t seem like the right place to handle this logic. It would also necessitate processing the attachments on each docket and relating the PQs to their respective dockets and entries at the end, which would significantly increase the complexity of merge_attachment_page_data.
  • As far as I recall, doppelgänger cases can only be found in district or bankruptcy cases? However, merge_attachment_page_data also serves ACMS and Appellate cases. Adding logic to handle doppelgänger scenarios only for district or bankruptcy cases would require additional conditionals that increase the method's complexity.

I found it more appropriate to apply this new logic as a preliminary step to process_recap_attachment(). To achieve this, I created a new method called look_for_doppelganger_rds_and_process_recap_attachment, which is called by process_recap_upload(). This method extracts the pacer_doc_id from the attachment page and uses it, along with the court, to search for other RECAPDocument instances with the same pacer_doc_id in the court. These matches indicate they belong to a doppelgänger case. The logic for this step is encapsulated in the look_for_doppelganger_rds method.

The filtering logic:

main_rds = (
    RECAPDocument.objects.select_related("docket_entry__docket")
    .filter(
        pacer_doc_id=pacer_doc_id,
        docket_entry__docket__court=court,
    )
    .order_by("docket_entry__docket__pacer_case_id")
    .distinct("docket_entry__docket__pacer_case_id")
    .only(
        "pacer_doc_id",
        "docket_entry__docket__pacer_case_id",
        "docket_entry__docket__court_id",
    )
)
  • RECAPDocument instances are filtered by pacer_doc_id within the same court, keeping only one per distinct pacer_case_id. This ensures we avoid potential duplicate RDs within the same case.

The identified RECAPDocument instances are then used to create additional PQs as needed for each document. Finally, the original and additional PQs are processed as usual by process_recap_attachment using the same court and pacer_doc_id, but with distinct pacer_case_id found for each doppelgänger case. Each PQ is marked as successful or failed independently, which is helpful for debugging purposes.
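
A condensed sketch of that fan-out; ProcessingQueue field names are assumed from the model, and the task invocation is simplified:

from cl.recap.models import ProcessingQueue  # import path assumed

pqs_to_process = [pq]  # the original upload's PQ
for rd in main_rds.exclude(
    docket_entry__docket__pacer_case_id=pq.pacer_case_id
):
    # One extra PQ per subdocket, so each merge can be marked
    # successful or failed independently.
    pqs_to_process.append(
        ProcessingQueue.objects.create(
            uploader_id=pq.uploader_id,
            court_id=pq.court_id,
            upload_type=pq.upload_type,
            pacer_case_id=rd.docket_entry.docket.pacer_case_id,
            pacer_doc_id=pq.pacer_doc_id,
            filepath_local=pq.filepath_local,
        )
    )
for sub_pq in pqs_to_process:
    process_recap_attachment(sub_pq.pk)  # invocation simplified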

If this approach seems appropriate to you, we’ll need to apply something similar for:

  • RECAP email attachment pages
  • Fetch_attachment_page
  • PDF uploads
  • Docket reports containing docket entries with nested attachment data.

If you agree, we can open related issues for these.

Let me know what you think.

@s-taube s-taube requested review from mlissner and removed request for ERosendo December 9, 2024 17:09
@s-taube s-taube assigned mlissner and unassigned albertisfu Dec 9, 2024
Member

@mlissner mlissner left a comment


This is cool, and I think your approach is great. It will make a lot of folks happy to get this and the PDF merging done (using the same approach).

Nothing too major, but I think a slight tweak to how you split up the logic will help.

Thank you!

cl/recap/mergers.py (outdated review thread, resolved)
cl/recap/tasks.py (outdated review thread, resolved)
@albertisfu
Contributor

Thanks! I've applied the suggested changes, which have indeed streamlined the lookups and the creation of additional PQs into a single method, find_subdocket_att_page_rds, and renamed the doppelgänger terminology to subdockets.

Just wanted to confirm one open question from above: subdockets can't be found in Appellate or ACMS uploads, right? This approach only applies to bankruptcy and district uploads.

And for the other uploads that would require this logic to be applied:

  • RECAP email entries and attachment pages
  • Extension PDF uploads
  • Fetch attachment page
  • Fetch PDF documents
  • Docket reports containing docket entries with nested attachment data.

Is it OK to open one issue for each of these?

Member

@mlissner mlissner left a comment


Got it, thank you!

@mlissner mlissner enabled auto-merge December 14, 2024 00:03
@mlissner
Member

Auto-merge enabled.

Just wanted to confirm one open question from above: subdockets can't be found in Appellate or ACMS uploads, right? This approach only applies to bankruptcy and district uploads.

Confirmed, as best as I know.

is it ok to open one issue for each of these?

Yes, please. Can you add them to the Sprint project board and put a size on each of them, please?
