
feat(recap.mergers): Update PACER attachment processing #4665

Merged
11 commits, merged Dec 14, 2024

Conversation

flooie
Contributor

@flooie flooie commented Nov 6, 2024

@ERosendo

This change attempts to address the black hole that is doppelgänger criminal attachments.

In cases where we have "doppelgänger" dockets, it's possible that the pacer case id is filtering out the docket that is available for the attachment.

@flooie flooie requested a review from ERosendo November 6, 2024 21:58
@flooie
Contributor Author

flooie commented Nov 6, 2024

Should fix issue #4664

cl/recap/mergers.py (outdated review thread, resolved)
@flooie flooie requested a review from grossir November 7, 2024 12:44
except RECAPDocument.DoesNotExist as exc:
    # In cases where we have "doppelgänger" dockets, drop the pacer
    # case id and check if the docket exists once more.
    if params.get("pacer_case_id"):
Contributor

@grossir grossir Nov 7, 2024


I think the params dict key you want to look for is actually "docket_entry__docket__pacer_case_id", as in line 1641 in this same function. search_recapdocument does not have a field for pacer_case_id, so that key would never be found in the params dict:

                                             Table "public.search_recapdocument"
         Column          |           Type           | Collation | Nullable |                     Default                      
-------------------------+--------------------------+-----------+----------+--------------------------------------------------
 id                      | integer                  |           | not null | nextval('search_recapdocument_id_seq'::regclass)
 date_created            | timestamp with time zone |           | not null | 
 date_modified           | timestamp with time zone |           | not null | 
 date_upload             | timestamp with time zone |           |          | 
 document_type           | integer                  |           | not null | 
 document_number         | character varying(32)    |           | not null | 
 attachment_number       | smallint                 |           |          | 
 pacer_doc_id            | character varying(64)    |           | not null | 
 is_available            | boolean                  |           |          | 
 sha1                    | character varying(40)    |           | not null | 
 filepath_local          | character varying(1000)  |           | not null | 
 filepath_ia             | character varying(1000)  |           | not null | 
 docket_entry_id         | integer                  |           | not null | 
 description             | text                     |           | not null | 
 ocr_status              | smallint                 |           |          | 
 plain_text              | text                     |           | not null | 
 page_count              | integer                  |           |          | 
 is_free_on_pacer        | boolean                  |           |          | 
 ia_upload_failure_count | smallint                 |           |          | 
 file_size               | integer                  |           |          | 
 thumbnail               | character varying(100)   |           |          | 
 thumbnail_status        | smallint                 |           | not null | 
 is_sealed               | boolean                  |           |          | 
 acms_document_guid      | character varying(64)    |           | not null | 
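
A minimal sketch of the suggested correction, assuming the params shape from the diff above (import path assumed):

from cl.search.models import RECAPDocument  # import path assumed

try:
    rd = RECAPDocument.objects.get(**params)
except RECAPDocument.DoesNotExist:
    # Drop the case-id filter and retry. The key is the related-field
    # lookup, since search_recapdocument has no pacer_case_id column.
    if params.pop("docket_entry__docket__pacer_case_id", None):
        rd = RECAPDocument.objects.get(**params)
    else:
        raise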

Contributor

@albertisfu albertisfu left a comment


I did some tests using the doppelgänger oknd case with pacer_case_id 27765 and 27767, for document 811. Here are my findings:

  • The approach of removing the pacer_case_id from the RD lookup params will help only when the main RD related to the attachment page being uploaded is not present in the docket related to the pacer_case_id used for the attachment page upload.

For instance, if the attachment page upload is sent with the following params:

upload_type=2
court=oknd
pacer_case_id=27767

And considering the attachment page has the following metadata:

document_number=811
pacer_case_id=27767
pacer_doc_id=14703438821
attachments:
  attachment_number=1
  pacer_doc_id=14713438822

If the docket with pacer_case_id 27767 doesn't have a main RD with pacer_doc_id=14703438821, but the docket with pacer_case_id 27765 does have it, then removing the pacer_case_id from the lookup will match main RECAPDocument 14703438821 from the 27765 docket and merge it there.

However, I see an issue here:

  • If there are more than two dockets for the doppelgänger case and the main RECAPDocument for the provided pacer_doc_id exists in more than one of them, this lookup will raise a RECAPDocument.MultipleObjectsReturned exception.
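
For illustration, a minimal sketch of that failure mode, with the lookup shape assumed from the diff above:

from cl.search.models import RECAPDocument  # import path assumed

try:
    # With pacer_case_id dropped, this matches every doppelgänger
    # docket in the court that carries the same document.
    rd = RECAPDocument.objects.get(
        pacer_doc_id="14703438821",
        docket_entry__docket__court_id="oknd",
    )
except RECAPDocument.MultipleObjectsReturned:
    # Raised as soon as two or more dockets have the main RD.
    ...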

So it seems that the main issue is that attachment page uploads for doppelgänger case documents (at least for this court) always belong to the same pacer_case_id, even if you clicked the document from a different pacer_case_id. You can confirm this as follows:

Go to case 27765:
https://ecf.oknd.uscourts.gov/cgi-bin/DktRpt.pl?27765
Request documents from 811 to 811 and run the report.
Click document 811.
On the attachment page, inspect Document Number 811. You will see:
<a href="https://ecf.oknd.uscourts.gov/doc1/14713438821" onclick="goDLS('/doc1/14713438821','27767','4230','','1','1','','','');return(false);">811</a>

The link for this document contains the pacer_case_id 27767, which I believe the extension uses to upload the attachment page. (@ERosendo could help to confirm this).

So it seems that, at least in this court, we won't get attachment pages for case_ids other than the latest pacer_case_id.
We can confirm this by looking for attachment page uploads for 27765:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27765&upload_type=2
None can be found.

While there are attachment uploads for 27767:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27767&upload_type=2

Unless this issue can be solved from the extension, I think we'll need to apply a different approach. For instance, I believe we can look up RECAPDocuments with the provided pacer_doc_id and court and, if more than one is found, merge the attachment page into all of them. If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to apply that approach only there. Otherwise, we'd need to apply it every time, assuming multiple documents with the same pacer_doc_id and court are only possible on doppelgänger dockets.
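
A rough sketch of that fallback; merge_attachment_page here is a placeholder for the existing merge logic, not an actual helper:

from cl.search.models import RECAPDocument  # import path assumed

rds = RECAPDocument.objects.filter(
    pacer_doc_id=pacer_doc_id,
    docket_entry__docket__court_id=court_id,
)
# Rather than failing when the document exists on several dockets,
# merge the attachment page into every match.
for rd in rds:
    merge_attachment_page(rd, attachment_data)  # placeholder helper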

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

@mlissner
Member

If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there.

Not that I know of, unfortunately.

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

@albertisfu
Contributor

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

Sure, just regarding this comment so we can get the right size:

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

Also, I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

@mlissner
Member

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

I think PRs always take priority, except over P0s, which I see as "something is burning".

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.

@mlissner
Member

mlissner commented Nov 13, 2024

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

Would you have space for both solutions in this sprint?

@albertisfu
Contributor

I think PR's always take priority, except over P0's, which I see as "something is burning".

Correct; it's just that in this case the required solution would be completely different from the one in this PR.

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id, and just make the copy? Seems easy enough, I suppose and I think it's a good step for doppelgänger.

Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that, as a first step, we always ignore the pacer_case_id if we encounter a RECAPDocument.MultipleObjectsReturned. Then, we check whether the RDs belong to different dockets and merge the attachment pages and PDFs across all of them.

If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates.
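
A rough sketch of this two-step idea (helper names are placeholders):

from cl.search.models import RECAPDocument  # import path assumed

try:
    rd = RECAPDocument.objects.get(**params)
except RECAPDocument.MultipleObjectsReturned:
    params.pop("docket_entry__docket__pacer_case_id", None)
    rds = RECAPDocument.objects.filter(**params)
    docket_ids = {rd.docket_entry.docket_id for rd in rds}
    if len(docket_ids) > 1:
        # Subdocket scenario: merge into every related docket.
        for rd in rds:
            merge_into_docket(rd, upload_data)  # placeholder helper
    else:
        # Duplicates within a single docket: keep the best RD and
        # clean up the rest, as the current logic already does.
        rd = select_best_rd_and_clean_duplicates(rds)  # placeholder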

This approach would essentially make the pacer_case_id parameter useless. That’s why I was concerned about finding a way to detect potential doppelgängers and only apply the logic in those cases. So now my question is: Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases.

The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases.

So I think it’s best to have a separate issue and PR for the PDFs.

I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachment and PDF uploads.

Would you have space for both solutions in this sprint?

I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.

@mlissner
Member

Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

I wondered the same thing. I think it's OK; we haven't seen anything like that, as far as I know.

Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too!

Thank you.

@albertisfu
Contributor

I’ve applied the approach discussed above. Initially, I considered implementing it within the merge_attachment_page_data method. However, I realized it would be better to apply it outside of that method, for a few reasons:

  • When uploading an attachment page, a ProcessingQueue and a PacerHtmlFile are generated for every upload. These instances are useful for debugging purposes and for reprocessing content, if needed. One solution would be to create additional PQs within merge_attachment_page_data for each attachment page requiring merging in doppelgänger cases. However, this doesn’t seem like the right place to handle this logic. It would also necessitate processing the attachments on each docket and relating the PQs to their respective dockets and entries at the end, which would significantly increase the complexity of merge_attachment_page_data.
  • As far as I recall, doppelgänger cases can only be found in district or bankruptcy cases? However, merge_attachment_page_data also serves ACMS and Appellate cases. Adding logic to handle doppelgänger scenarios only for district or bankruptcy cases would require additional conditionals that increase the method's complexity.

I found it more appropriate to apply this new logic as a preliminary step to process_recap_attachment(). To achieve this, I created a new method called look_for_doppelganger_rds_and_process_recap_attachment, which is called by process_recap_upload(). This method extracts the pacer_doc_id from the attachment page and uses it, along with the court, to search for other RECAPDocument instances with the same pacer_doc_id in the court. These matches indicate they belong to a doppelgänger case. The logic for this step is encapsulated in the look_for_doppelganger_rds method.

The filtering logic:

main_rds = (
    RECAPDocument.objects.select_related("docket_entry__docket")
    .filter(
        pacer_doc_id=pacer_doc_id,
        docket_entry__docket__court=court,
    )
    .order_by("docket_entry__docket__pacer_case_id")
    .distinct("docket_entry__docket__pacer_case_id")
    .only(
        "pacer_doc_id",
        "docket_entry__docket__pacer_case_id",
        "docket_entry__docket__court_id",
    )
)
  • RECAPDocument instances are filtered by pacer_doc_id within the same court, keeping only one per distinct pacer_case_id. This ensures we avoid potential duplicate RDs within the same case.

The identified RECAPDocument instances are then used to create additional PQs as needed for each document. Finally, the original and additional PQs are processed as usual by process_recap_attachment using the same court and pacer_doc_id, but with distinct pacer_case_id found for each doppelgänger case. Each PQ is marked as successful or failed independently, which is helpful for debugging purposes.
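
A condensed sketch of that fan-out; ProcessingQueue field names are assumed from the model, and the task invocation is simplified:

from cl.recap.models import ProcessingQueue  # import path assumed

pqs_to_process = [pq]  # the original upload's PQ
for rd in main_rds.exclude(
    docket_entry__docket__pacer_case_id=pq.pacer_case_id
):
    # One extra PQ per subdocket, so each merge can be marked
    # successful or failed independently.
    pqs_to_process.append(
        ProcessingQueue.objects.create(
            uploader_id=pq.uploader_id,
            court_id=pq.court_id,
            upload_type=pq.upload_type,
            pacer_case_id=rd.docket_entry.docket.pacer_case_id,
            pacer_doc_id=pq.pacer_doc_id,
            filepath_local=pq.filepath_local,
        )
    )
for sub_pq in pqs_to_process:
    process_recap_attachment(sub_pq.pk)  # invocation simplified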

If this approach seems appropriate to you, we’ll need to apply something similar for:

  • RECAP email attachment pages
  • Fetch_attachment_page
  • PDF uploads
  • Docket reports containing docket entries with nested attachment data.

If you agree, we can open related issues for these.

Let me know what you think.

@s-taube s-taube requested review from mlissner and removed request for ERosendo December 9, 2024 17:09
@s-taube s-taube assigned mlissner and unassigned albertisfu Dec 9, 2024
Member

@mlissner mlissner left a comment


This is cool, and I think your approach is great. It will make a lot of folks happy to get this and the PDF merging done (using the same approach).

Nothing too major, but I think a slight tweak to how you split up the logic will help.

Thank you!

cl/recap/mergers.py (outdated review thread, resolved)
cl/recap/tasks.py (outdated review thread, resolved)
@albertisfu
Contributor

Thanks! I've applied the suggested changes, which have indeed streamlined the lookups and the creation of additional PQs into a single method, find_subdocket_att_page_rds, and renamed the doppelgänger terminology to subdockets.

Just wanted to confirm one open question from above: subdockets can't be found in Appellate or ACMS uploads, right? This approach only applies to bankruptcy and district uploads.

And for the other uploads that would require this logic to be applied:

  • RECAP email entries and attachment pages
  • Extension PDF uploads
  • Fetch attachment page
  • Fetch PDF documents
  • Docket reports containing docket entries with nested attachment data.

Is it OK to open one issue for each of these?

Member

@mlissner mlissner left a comment


Got it, thank you!

@mlissner mlissner enabled auto-merge December 14, 2024 00:03
@mlissner
Member

Auto-merge enabled.

Just wanted to confirm one open question from above: subdockets can't be found in Appellate or ACMS uploads, right? This approach only applies to bankruptcy and district uploads.

Confirmed, as best as I know.

is it ok to open one issue for each of these?

Yes, please. Can you add them to the Sprint project board and put a size on each of them, please?
