4081 Get latest cases from iquery pages #4090
Conversation
🔍 Existing Issues For Review: Your pull request is modifying functions with pre-existing issues in 📄 cl/corpus_importer/tasks.py
… Celery visibility_timeout.
… probings configurable - Fix IQUERY_PROBE_ITERATIONS to reach a 256 distance by default
I was also testing this, and while checking it I found something unusual, which I'm not sure is normal. The probe detected the latest case; however, when the signal retrieved everything from the previous one, I reviewed the case in PACER and found that the two appear to be the same case. However, they do have different URLs:
Inspecting the HTML, both show the same content. So, I am wondering if it's the same case or not. If it's not the same case, we need to change the match method to avoid overriding the case with a different one.
Yes, it's pretty normal for docket numbers to be the same on several criminal dockets, due to #2185. The matching algorithm should match on the `pacer_case_id` + docket number and then broaden to just the docket number, no?
OK, I wound up making quite a few suggestions to try to make the code easier to understand, mostly.
Some of the suggestions simplify things, some just rename variables.
Did you consider also using the throttle decorator as a protection against too many tasks?
I think we can remove the batch protection that you've got and do the solution you proposed that avoids triggering iquery signals except from probes until all our probes are caught up.
Thank you! This is quite close for something so tricky.
Thanks for your comments and suggestions; I'll be applying them. I just left a comment on the one regarding how to abort the probing, since it seems it's better to count the missed probes before aborting.
Yes, I reviewed the method, and this is the order it uses. Here is the code for that part:
So the lookup order is:
In this case, the case was not matched by the first three lookups, but it was in the fourth. So, I think we need to update that lookup so it is only applied when there is no existing `pacer_case_id` on the matched docket. Something like:
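A rough, hypothetical sketch of that idea (assuming Django-style lookups against the `Docket` model's `pacer_case_id` and `docket_number_core` fields; the helper name and structure here are made up, not the project's actual matching code):

```python
from django.db.models import Q

from cl.search.models import Docket  # import path assumed


def candidate_dockets(court_id: str, pacer_case_id: str, docket_number_core: str):
    """Return candidate dockets, broadening to a docket-number-only match only
    when the existing docket has no pacer_case_id, so we don't override an
    unrelated case that happens to share the docket number."""
    base = Docket.objects.filter(court_id=court_id)
    lookups = [
        # Most specific: both identifiers match.
        base.filter(
            pacer_case_id=pacer_case_id, docket_number_core=docket_number_core
        ),
        # Next: the PACER case ID alone.
        base.filter(pacer_case_id=pacer_case_id),
        # Broadest: docket number only, restricted to dockets without a
        # pacer_case_id of their own.
        base.filter(docket_number_core=docket_number_core).filter(
            Q(pacer_case_id=None) | Q(pacer_case_id="")
        ),
    ]
    for queryset in lookups:
        if queryset.exists():
            return queryset
    return base.none()
```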
Great, I'll remove the batch protection, and I'll apply the required logic to control, via a setting, when to start hearing global signals once probes are caught up.
Well, the problem with the throttle decorator is that it doesn't limit the number of tasks sent to the queue; it just reschedules them when the rate is above the limit. So it won't solve the issue of overwhelming Redis with too many tasks. I considered another option. However, if we want to ensure we can handle that scenario, we could use a lock with an expiration. If we decide to go this route, considering infrastructure limits, is there a recommended maximum process lifetime to avoid this issue? This way, I can consider a lock expiration equal to this duration and work backward.
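For illustration, a minimal sketch of a Redis lock with an expiration (using redis-py's built-in `Lock`; key names and timeouts here are invented, not the PR's implementation), so a dying pod can't hold the lock forever:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def update_final_case_id_with_lock(court_id: str, new_case_id: int) -> bool:
    lock = r.lock(
        f"iquery:lock:{court_id}",  # hypothetical key name
        timeout=60 * 10,            # auto-release after 10 minutes if the holder dies
        blocking_timeout=60 * 5,    # give up waiting for the lock after 5 minutes
    )
    if not lock.acquire():
        return False  # couldn't get the lock in time; caller can retry or re-enqueue
    try:
        current = int(r.get(f"iquery:final:{court_id}") or 0)
        if new_case_id > current:
            r.set(f"iquery:final:{court_id}", new_case_id)
        return True
    finally:
        lock.release()
```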
Makes sense.
Seems pretty unlikely, actually. I think adding new cases is a pretty manual thing. Even in that case, I think we'd just want to queue them up and let it process, so long as we don't overwhelm the court by hitting it too rapidly. When we've had memory problems in the past, that was usually millions of queued tasks. With that in mind, do you think we need the throttling stuff or do you think we might be OK without it?
Got it. Yeah, it sounds unlikely that once we catch up with the court we'd have a task overflow due to scheduling a huge number of tasks, so no throttling will be required to schedule tasks. I'm thinking that the only scenario where we'd need throttling is to avoid exceeding the limit of 1 request per second per court. Imagine the following scenario: an upload arrives at 13:00:00 and its sweep schedules requests for 13:00:01 onward. Then we receive another upload at 13:00:03 with a greater ID, whose sweep schedules more requests for the same court. That means at 13:00:04 and 13:00:05, the court rate of 1 request per second will be exceeded. Here, I think some throttling when scheduling would help.
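A rough sketch of that scheduling idea (hypothetical Redis key and assumed task arguments; a real version would also need the locking discussed above to avoid races): track the next free second per court so overlapping sweeps stay at 1 request per second.

```python
import time

import redis

from cl.corpus_importer.tasks import make_docket_by_iquery  # import path assumed

r = redis.Redis()


def schedule_at_court_rate(court_id: str, pacer_case_ids: list[int]) -> None:
    now = int(time.time())
    # Earliest timestamp at which this court may be hit again (hypothetical key).
    next_free = int(r.get(f"iquery:next_free:{court_id}") or now)
    start = max(now, next_free)
    for i, case_id in enumerate(pacer_case_ids):
        countdown = (start - now) + i  # one second apart, per court
        make_docket_by_iquery.apply_async(
            args=(court_id, str(case_id)),  # argument order assumed
            countdown=countdown,
        )
    r.set(f"iquery:next_free:{court_id}", start + len(pacer_case_ids))
```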
Yep, let's do it then. Good point.
- Fixed time.sleep mock
The code looks good. We can merge the PR after addressing the comments.
…uery page scraper daemon and signal.
…emon and probe task.
- So it's possible to detect when highest_known_pacer_case_id might be wrong.
…iration time in seconds.
Thanks @ERosendo, I've applied your suggestions. A couple of additional improvements I've applied:
By running this command once the PR is merged, since the tweak in the command to update the keys in Redis is included in this PR:
It is only necessary to confirm the values were set correctly. To prevent the daemon from running for now, I added a setting for that.
Finally, confirming other settings that should be set as env vars. Before merging it:
After conditions are met:
I made one tweak to add a longer docstring. Otherwise, looks good. Merging. Thanks for the discussion and tricky work on this. Eduardo, thanks for the review.
Alberto, I'm realizing I'm the blocker and the bottleneck for this. Do you think you could work with Ramiro to launch it? I don't think there's much left to do, IIRC, except for setting up a daemon and some variables, right?
Of course. I'll work with Ramiro to get this done. Just one question: we'll need to run a command first in order to update the latest IDs stored in Redis.
Do you remember the date the last iquery scrape finished? I remember it should be around May 29, but I also recall we got blocked by some courts and ran another scrape that should have finished a few days later.
@blancoramiro when you have a moment, please let me know so we can get this one launched. First, we need to run the following script to set the initial IDs:
And then we can confirm the IDs were properly set by:
Then we need to add the following env vars. And finally, we need to deploy the daemon. The Celery queue where these tasks will be scheduled is defined in an environment variable. The average number of tasks to be scheduled is around 200 every 5 minutes, so I think this task doesn't require its own workers. Thank you!
@mlissner I confirmed that the steps needed to get the iquery scraper running are as described in the previous comment: #4090 (comment). Since we agreed on where to start scraping from, set the provided environment variables:
Also, assign a queue for these tasks. Finally, deploy the daemon. Let me know if you have any questions.
This PR introduces two mechanisms, as described in #4081, to keep the latest cases from iQuery pages up to date.

The first mechanism is the `iquery_pages_probing_daemon`, which iterates over all the `district_or_bankruptcy_pacer_courts` (excluding `["uscfc", "arb", "cit"]`) and schedules `iquery_pages_probing` tasks for each court on every iteration. Between each court, it will wait for `IQUERY_PROBE_WAIT / len(court_ids)`, which by default is 5 minutes. This means that approximately every 5 minutes, a new `iquery_pages_probing` task will be scheduled (only if the previous court task has already finished and workers can keep up with all the tasks).
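Roughly, the daemon loop looks like this (a sketch, assuming `IQUERY_PROBE_WAIT` is in seconds and the probing task takes the court ID; not the exact command code):

```python
import time

from django.conf import settings

from cl.corpus_importer.tasks import iquery_pages_probing  # import path assumed


def run_probing_daemon(court_ids: list[str]) -> None:
    wait_between_courts = settings.IQUERY_PROBE_WAIT / len(court_ids)
    while True:
        for court_id in court_ids:
            # The real daemon also checks the iquery.probing.enqueued semaphore
            # and any court_wait before enqueueing another probe for a court.
            iquery_pages_probing.delay(court_id)
            time.sleep(wait_between_courts)
```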
**The `iquery_pages_probing` Task**

The task works as follows:
- It starts from the latest `iquery_pacer_case_id_final` stored in Redis.
- It performs up to `IQUERY_PROBE_ITERATIONS` (default 10) iterations following a geometric binary sequence. For example, if the initial `iquery_pacer_case_id_final` is 0, the probing sequence will be: 1, 2, 4, 8, 16, 32, 64, 128, 256.
- These offsets are applied to the latest `pacer_case_id`. For example, if the starting ID is 1,000, the pattern will be: 1001, 1002, 1004, 1008, 1016, 1032, 1064, 1128, 1256.
- If previous probing cycles had no hits (`court_probe_cycle_no_hits > 1`), a random 5% jitter will be added to the geometric binary sequence. For instance, if the last value in the sequence is 256, the jitter can be between 1-13 and will be appended to each value in the sequence. If the jitter is 6: 1001+6, 1002+6, 1004+6, 1008+6, 1016+6, 1032+6, 1064+6, 1128+6, 1256+6 (see the sketch after this list).
- Each `pacer_case_id` in the sequence is requested from the court's iQuery page on PACER.
- If there are no hits for the `pacer_case_id`s in the sequence, the probing will be aborted.
- The `from_iquery_scrape` parameter is set to `false` in the last hit.
- The `iquery.probing.enqueued` semaphore will be cleaned up when finishing the task so other probing tasks can be scheduled in the next iteration.
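A small sketch of the probing offsets and jitter described above (a hypothetical helper; how the count maps to `IQUERY_PROBE_ITERATIONS` in the actual task may differ):

```python
import random


def probe_case_ids(latest_case_id: int, n_probes: int = 9, jitter: bool = False) -> list[int]:
    """Doubling offsets from the latest known pacer_case_id: +1, +2, +4, ...
    With n_probes=9 the farthest probe is latest_case_id + 256, as in the example."""
    offsets = [2**i for i in range(n_probes)]
    # ~5% jitter relative to the largest offset, e.g. 1-13 when the max is 256.
    extra = random.randint(1, max(1, round(offsets[-1] * 0.05))) if jitter else 0
    return [latest_case_id + offset + extra for offset in offsets]


# probe_case_ids(1000) -> [1001, 1002, 1004, 1008, 1016, 1032, 1064, 1128, 1256]
```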
**Behavior on Common Errors**

The `iquery_pages_probing` task is designed to retry common errors for each probe instead of retrying the whole task from scratch:

- `query_iquery_page` is used to request a single case iQuery page and can be retried independently up to 3 times in case of a `Timeout` or `PacerLoginException` (see the sketch after this list).
- On a `Timeout`, the main task will trigger a wait of `IQUERY_COURT_TIMEOUT_WAIT` (default 10 seconds) before requesting the next `pacer_case_id`. If 3 `Timeout`s are raised at the task level, it will be aborted and a `court_wait` equal to `IQUERY_COURT_BLOCKED_WAIT` (default 10 minutes) will be set, so no other probing task for the court will be scheduled for 10 minutes.
- An `HTTPError` when requesting a case iQuery page means that a non-200 status code was returned, likely indicating a block from the court (I saw two status codes returned by a court that blocked my IP: 404 and 403). If this occurs, a `court_wait` equal to `IQUERY_COURT_BLOCKED_WAIT` will be set and the task will be aborted.
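A simplified, schematic sketch of the per-probe retry idea (the PR wires this through Celery retries rather than a plain loop; the `fetch` callable stands in for a wrapper around `query_iquery_page`):

```python
import time

from requests.exceptions import Timeout

MAX_PROBE_RETRIES = 3  # mirrors the "up to 3 times" behavior described above


def fetch_with_retries(fetch, wait_seconds: int = 10):
    """Retry a single iQuery page fetch on Timeout, waiting between attempts
    (analogous to IQUERY_COURT_TIMEOUT_WAIT), and re-raise after the last try
    so the caller can abort the probing and set the court_wait."""
    for attempt in range(1, MAX_PROBE_RETRIES + 1):
        try:
            return fetch()
        except Timeout:
            if attempt == MAX_PROBE_RETRIES:
                raise
            time.sleep(wait_seconds)
```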
**The iQuery Sweep Signal: `handle_update_latest_case_id_and_schedule_iquery_sweep`**

This is the second mechanism required to complete the probing-scraping process, and it works as follows:

- Docket saves coming from the `iquery_pages_probing` task won't normally schedule a sweep, as these dockets won't have a `pacer_case_id_final` greater than the one currently stored in Redis.
- The docket must have a `pacer_case_id`.
- The docket's court must be in `Court.federal_courts.district_or_bankruptcy_pacer_courts().exclude(pk__in=["uscfc", "arb", "cit"])`.
If the docket meets the previous conditions, the `update_latest_case_id_and_schedule_iquery_sweep` method is called once the transaction is committed, to avoid any errors in the following process that could prevent the docket from being properly saved or delay the save process.
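A hedged sketch of that signal wiring (receiver and helper names follow this description, but import paths, arguments, and the surrounding checks are assumed):

```python
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver

from cl.corpus_importer.signals import (  # import path assumed
    update_latest_case_id_and_schedule_iquery_sweep,
)
from cl.search.models import Docket  # import path assumed


@receiver(post_save, sender=Docket)
def handle_update_latest_case_id_and_schedule_iquery_sweep(sender, instance, **kwargs):
    if not instance.pacer_case_id:
        return
    # ... plus the court check and the from_iquery_scrape check described above ...
    transaction.on_commit(
        # Defer the sweep until the docket's transaction commits, so an error
        # here can't prevent or delay the docket save itself.
        lambda: update_latest_case_id_and_schedule_iquery_sweep(instance)
    )
```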
**`update_latest_case_id_and_schedule_iquery_sweep`**

This method is executed after the docket transaction that triggered the signal is committed.

The method's process is wrapped within an atomic Redis lock (`acquire_atomic_redis_lock`), which uses a Lua script to make other processes wait until the lock is released, avoiding race conditions when getting and updating `iquery_pacer_case_id_final` and `iquery_pacer_case_id_status`. This prevents duplicate processing of `pacer_case_id`s that are in progress or already scheduled.

- It checks whether the `incoming_pacer_case_id` is greater than the current `iquery_pacer_case_id_final`. If so, `iquery_pacer_case_id_final` is updated with the new value.
- It schedules a `make_docket_by_iquery` task for each `pacer_case_id` that needs to be scraped, which are the `pacer_case_id`s between `iquery_pacer_case_id_status` and the updated `iquery_pacer_case_id_final`.
- Each `make_docket_by_iquery` task is scheduled with a delay (countdown) of 1 second from the previous one, to maintain the rate of requesting 1 case iQuery page per second per court.
- The maximum number of tasks scheduled in a batch is `IQUERY_SWEEP_BATCH_SIZE` (default 10,800), which is half of the Celery `visibility_timeout` (21,600 seconds), to avoid a runaway of Celery tasks. If `IQUERY_SWEEP_BATCH_SIZE` is reached, a new batch will schedule the remaining tasks starting from a countdown of 0 seconds (see the sketch after this list).
- Finally, `iquery_pacer_case_id_status` is updated with the latest `pacer_case_id` scheduled, and the atomic Redis lock is released (`release_atomic_redis_lock`).
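A sketch of that scheduling loop (task arguments and helper shape assumed; in the PR this runs while holding the Redis lock described above):

```python
from django.conf import settings

from cl.corpus_importer.tasks import make_docket_by_iquery  # import path assumed


def schedule_sweep(court_id: str, id_status: int, id_final: int) -> int:
    """Schedule one make_docket_by_iquery task per pending pacer_case_id, one
    second apart, restarting the countdown at 0 whenever a batch reaches
    IQUERY_SWEEP_BATCH_SIZE so countdowns stay under the visibility_timeout."""
    countdown = 0
    for case_id in range(id_status + 1, id_final + 1):
        if countdown >= settings.IQUERY_SWEEP_BATCH_SIZE:
            countdown = 0  # start a new batch
        make_docket_by_iquery.apply_async(
            args=(court_id, str(case_id)),  # argument order assumed
            countdown=countdown,
        )
        countdown += 1
    return id_final  # becomes the new iquery_pacer_case_id_status
```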
**Notes, Questions, and Concerns**

- We need to set an initial `iquery_pacer_case_id_status` for all courts. This could be set from the latest `pacer_case_id` stored for each court as of the date the last whole iQuery scrape finished.
- `acquire_atomic_redis_lock` allows other threads to wait long enough until the schedule is completed. However, it's not ideal to make other processes wait too long for the lock to be released. If a pod with a sweep signal waiting for the lock to be released dies, the signal will be lost, and if it contained a `pacer_case_id` greater than the current one stored, it will take more time to catch up. What is the recommended maximum wait time for a thread, considering pod lifetime?

Let me know what you think.