feat(recap.mergers): Update PACER attachment processing #4665
Should fix issue #4664
This should fix tests
cl/recap/mergers.py
except RECAPDocument.DoesNotExist as exc:
    # In cases where we have "doppelgänger" dockets drop pacer
    # case id and check if the docket exists once more.
    if params.get("pacer_case_id"):
I think the params dict key you want to look for is actually "docket_entry__docket__pacer_case_id", as in line 1641 in this same function. search_recapdocument does not have a field for pacer_case_id, so this key on the params dict would never be found:
Table "public.search_recapdocument"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+--------------------------------------------------
id | integer | | not null | nextval('search_recapdocument_id_seq'::regclass)
date_created | timestamp with time zone | | not null |
date_modified | timestamp with time zone | | not null |
date_upload | timestamp with time zone | | |
document_type | integer | | not null |
document_number | character varying(32) | | not null |
attachment_number | smallint | | |
pacer_doc_id | character varying(64) | | not null |
is_available | boolean | | |
sha1 | character varying(40) | | not null |
filepath_local | character varying(1000) | | not null |
filepath_ia | character varying(1000) | | not null |
docket_entry_id | integer | | not null |
description | text | | not null |
ocr_status | smallint | | |
plain_text | text | | not null |
page_count | integer | | |
is_free_on_pacer | boolean | | |
ia_upload_failure_count | smallint | | |
file_size | integer | | |
thumbnail | character varying(100) | | |
thumbnail_status | smallint | | not null |
is_sealed | boolean | | |
acms_document_guid | character varying(64) | | not null |
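To make the point about the key name concrete, here is a minimal sketch in plain Python (no Django). The helper name `drop_case_id_filter` and the `docket_entry__docket__court_id` key are assumptions for illustration only; the only key taken from the review comment is `docket_entry__docket__pacer_case_id`.

```python
# Hypothetical helper: return a copy of the lookup params without the
# docket-level pacer_case_id filter, so the lookup can be retried.
CASE_ID_KEY = "docket_entry__docket__pacer_case_id"


def drop_case_id_filter(params: dict) -> dict:
    return {k: v for k, v in params.items() if k != CASE_ID_KEY}


params = {
    "pacer_doc_id": "14703438821",
    "docket_entry__docket__court_id": "oknd",  # assumed key name
    CASE_ID_KEY: "27767",
}

# The bare "pacer_case_id" key is never present, so a check on it is always falsy.
bare_key_present = params.get("pacer_case_id") is not None

relaxed = drop_case_id_filter(params)
```

Checking for the spanning key instead of the bare one is what lets the fallback branch actually run.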
I did some tests using the doppelgänger oknd case with pacer_case_id 27765 and 27767, for document 811. Here are my findings:
- The approach of removing the pacer_case_id from the RD lookup params will help only when the main RD related to the attachment page being uploaded is not present in the docket related to the pacer_case_id used for the attachment page upload.
For instance, if the attachment page upload is sent with the following params:
upload_type=2
court=oknd
pacer_case_id=27767
And considering the attachment page has the following metadata:
document_number=811
pacer_case_id=27767
pacer_doc_id=14703438821
attachments:
attachment_number=1
pacer_doc_id=14713438822
If the docket with pacer_case_id 27767 doesn't have a main RD with pacer_doc_id=14703438821, but the docket with pacer_case_id 27765 does have it, then removing the pacer_case_id from the lookup will match main RECAPDocument 14703438821 from the 27765 docket and merge it there.
However, I see an issue here:
- If there are more than 2 dockets for the doppelgänger case and the main RECAPDocument for the provided pacer_doc_id exists in more than one of them, this lookup will raise a RECAPDocument.MultipleObjectsReturned exception.
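The failure mode can be reproduced without a database. The sketch below is only an illustration: `FakeRD`, `get_one`, and `lookup_with_fallback` are hypothetical stand-ins that mimic the "exactly one match or raise" behavior of a Django `.get()` call, and the third docket (27766) is invented for the example.

```python
from dataclasses import dataclass


class DoesNotExist(Exception):
    """Stand-in for RECAPDocument.DoesNotExist."""


class MultipleObjectsReturned(Exception):
    """Stand-in for RECAPDocument.MultipleObjectsReturned."""


@dataclass(frozen=True)
class FakeRD:
    pacer_doc_id: str
    pacer_case_id: str


def get_one(rds, **filters):
    # Return the single record matching all filters, or raise.
    matches = [
        rd for rd in rds
        if all(getattr(rd, k) == v for k, v in filters.items())
    ]
    if not matches:
        raise DoesNotExist(filters)
    if len(matches) > 1:
        raise MultipleObjectsReturned(filters)
    return matches[0]


def lookup_with_fallback(rds, pacer_doc_id, pacer_case_id):
    # First try with the case id; on a miss, retry without it.
    try:
        return get_one(rds, pacer_doc_id=pacer_doc_id, pacer_case_id=pacer_case_id)
    except DoesNotExist:
        return get_one(rds, pacer_doc_id=pacer_doc_id)


# One sibling docket holds the main RD: the fallback finds it.
one_sibling = [FakeRD("14703438821", "27765")]
# Two sibling dockets hold it (27766 is hypothetical): the fallback is ambiguous.
two_siblings = one_sibling + [FakeRD("14703438821", "27766")]
```

With `one_sibling`, an upload for 27767 lands on the 27765 docket; with `two_siblings`, the same call raises `MultipleObjectsReturned`.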
So it seems that the main issue is that attachment page uploads for doppelgänger case documents (at least for this court) always belong to the same pacer_case_id, even if you clicked the document from a different pacer_case_id. You can confirm this by:
Going to case 27765:
https://ecf.oknd.uscourts.gov/cgi-bin/DktRpt.pl?27765
Request documents from 811 to 811 and run the report.
Click document 811.
On the attachment page, inspect Document Number 811. You will see:
<a href="https://ecf.oknd.uscourts.gov/doc1/14713438821" onclick="goDLS('/doc1/14713438821','27767','4230','','1','1','','','');return(false);">811</a>
The link for this document contains the pacer_case_id 27767, which I believe the extension uses to upload the attachment page. (@ERosendo could help to confirm this).
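For anyone who wants to check other links quickly, here is a small sketch that pulls the case id out of an `onclick` value like the one above. It assumes the second quoted argument of `goDLS(...)` is the case id, which matches this example but is not confirmed here for every court; the function name is hypothetical.

```python
import re
from typing import Optional

# Capture the first two single-quoted arguments of goDLS(...).
GODLS_RE = re.compile(r"goDLS\('(?P<path>[^']*)','(?P<case_id>[^']*)'")


def case_id_from_onclick(onclick: str) -> Optional[str]:
    """Return the second goDLS argument (assumed to be the case id), or None."""
    match = GODLS_RE.search(onclick)
    return match.group("case_id") if match else None


onclick = (
    "goDLS('/doc1/14713438821','27767','4230','','1','1','','','');"
    "return(false);"
)
extracted = case_id_from_onclick(onclick)
```

Here the link was reached from case 27765, yet the extracted value is the one embedded in the link.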
So it seems, at least in this court, we won't get attachment pages for any case_id other than the latest pacer_case_id.
We can confirm this by looking for attachment page uploads for 27765:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27765&upload_type=2
None can be found.
While there are attachment uploads for 27767:
https://www.courtlistener.com/admin/recap/processingqueue/?court=oknd&pacer_case_id=27767&upload_type=2
Unless this issue can be solved from the extension, I think we'll need to apply a different approach. For instance, I believe we can look up RECAPDocuments with the provided pacer_doc_id and court, and if more than one is found, merge the attachment page into all of them. If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there. Otherwise, we'd need to apply it every time, assuming multiple documents with the same pacer_doc_id and court are only possible on doppelgänger dockets.
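A rough sketch of that idea, using in-memory records instead of the ORM. Everything here (`FakeRD`, `merge_attachment_page`, the attachment dict shape, and the 27766 docket) is hypothetical and only meant to show the shape of "match on court and pacer_doc_id, then merge into every hit"; it is not the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRD:
    court: str
    pacer_case_id: str
    pacer_doc_id: str
    attachments: list = field(default_factory=list)


def merge_attachment_page(rds, court, pacer_doc_id, attachments):
    """Merge attachments into every main RD matching court + pacer_doc_id.

    The pacer_case_id from the upload is deliberately ignored.
    """
    targets = [
        rd for rd in rds
        if rd.court == court and rd.pacer_doc_id == pacer_doc_id
    ]
    for rd in targets:
        for att in attachments:
            if att not in rd.attachments:  # keep repeated uploads idempotent
                rd.attachments.append(att)
    return targets


rds = [
    FakeRD("oknd", "27765", "14703438821"),
    FakeRD("oknd", "27766", "14703438821"),  # hypothetical third docket
    FakeRD("oknd", "27767", "99999999999"),  # unrelated document
]
page = [{"attachment_number": 1, "pacer_doc_id": "14713438822"}]
merged = merge_attachment_page(rds, "oknd", "14703438821", page)
```

Merging into every match sidesteps the `MultipleObjectsReturned` case, at the cost of relying on the assumption stated above that duplicate pacer_doc_id values within a court only occur on doppelgänger dockets.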
Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.
Not that I know of, unfortunately. I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?
Sure, just regarding this comment so we can get the right size:
Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed? Also, I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?
I think PRs always take priority, except over P0s, which I see as "something is burning".
I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id, and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.
What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done. Would you have space for both solutions in this sprint?
Correct, just in this case the required solution would be completely different from the one in the PR.
Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that as a first step, we always ignore the If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates. This approach would essentially make the
The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases. The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases. So I think it’s best to have a separate issue and PR for the PDFs. I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachments and PDF uploads.
I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.
I wondered the same thing. I think it's OK, and we haven't seen anything like that as far as I know. Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too! Thank you.
I’ve applied the approach discussed above. Initially, I considered applying the approach within the
I found it more appropriate to apply this new logic as a preliminary step to The filtering logic:
The identified If this approach seems appropriate to you, we’ll need to apply something similar for:
If you agree, we can open related issues for this. Let me know what you think.
This is cool, and I think your approach is great. It will make a lot of folks happy to get this and the PDF merging done (using the same approach).
Nothing too major, but I think a slight tweak to how you split up the logic will help.
Thank you!
Thanks! I've applied the suggested changes, which have indeed streamlined the lookups and the creation of additional PQs in one method. I just wanted to confirm that we're OK with one question above: subdockets can't be found in Appellate or ACMS uploads? I ask because this approach only applies to bankruptcy and district uploads. And for the other uploads that would require this logic to be applied:
Is it OK to open one issue for each of these?
Got it, thank you!
Auto-merge enabled.
Confirmed, as best as I know.
Yes, please. Can you add them to the Sprint project board and put a size on each of them, please?
@ERosendo
This change attempts to address the black hole that is doppelgänger criminal attachments.
In cases where we have "doppelgänger" dockets, it's possible that the pacer case id is filtering out the docket that is available for the attachment.