Make a script to get the docs we already know are >= 1000 pages long #4839
@ERosendo, @albertisfu and I talked extensively about this yesterday, and we finally decided we should probably divide the script into two stages: a first stage where we fetch all the relevant docs from PACER, and a second stage where we process them. This helps with managing the last rounds of the round-robin process, because we checked and the court with the highest number of big docs has over 270 more big docs than the next court, so that one court will be the only one being queried in the last ~270 rounds.

First stage

So far we were fetching and processing each document in a similar way to
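A rough sketch of how that first-stage round-robin queueing could look, just to illustrate the per-court throttling. The helpers `create_fetch_request` and `previous_request_completed` and the per-court doc lists are hypothetical placeholders, not the actual implementation:

```python
import time
from itertools import zip_longest


# Hypothetical helpers; names are illustrative, not the project's real API.
def create_fetch_request(doc_id: str) -> str:
    """Enqueue a PACER fetch for one document and return a request id."""
    ...


def previous_request_completed(request_id: str) -> bool:
    """Return True once the fetch request has finished (success or failure)."""
    ...


def round_robin_fetch(big_docs_by_court: dict[str, list[str]]) -> list[str]:
    """Fetch one doc per court per round so no single court is hammered.

    `big_docs_by_court` maps a court id to the doc ids known to be >= 1000
    pages. Returns the ids of the docs whose fetch requests were created.
    """
    fetched: list[str] = []
    last_request_by_court: dict[str, str] = {}
    # zip_longest interleaves the courts: round 1 takes the first doc of every
    # court, round 2 the second, and so on. In the final ~270 rounds only the
    # largest court still has docs left, so it is the only one being queried.
    for round_docs in zip_longest(*big_docs_by_court.values()):
        for court_id, doc_id in zip(big_docs_by_court, round_docs):
            if doc_id is None:
                continue  # This court has no docs left in this round.
            prev = last_request_by_court.get(court_id)
            while prev and not previous_request_completed(prev):
                # Throttle: don't create a new request for a court until the
                # one before it is complete.
                time.sleep(30)
            last_request_by_court[court_id] = create_fetch_request(doc_id)
            fetched.append(doc_id)
    return fetched
```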
Second stage

After we've fetched all the relevant docs, we still need to process them. This could take a while, but there's no problem with adding as many workers as possible, since we're no longer interacting with PACER. This means we just have to identify the docs that were successfully fetched from the list in cache, add the tasks to process those docs, and mark their

@mlissner what do you think?
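And a minimal sketch of the second-stage dispatch, assuming the first stage stores the ids of successfully fetched docs under a cache key and that processing runs as Celery tasks. The cache key and the `process_big_doc` task are made-up placeholders:

```python
from celery import shared_task


@shared_task
def process_big_doc(doc_id: str) -> None:
    """Process one already-fetched document (extraction, page count, etc.)."""
    ...


def enqueue_processing(cache) -> int:
    """Queue a processing task for every doc the first stage fetched OK.

    `cache` would be e.g. Django's cache backend. No PACER interaction
    happens here, so there's no need to throttle: as many workers as are
    available can work through the queue in parallel.
    """
    fetched_ok = cache.get("big_docs_fetched_ok", [])
    for doc_id in fetched_ok:
        process_big_doc.delay(doc_id)
    return len(fetched_ok)
```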
So the idea is, generally, not to create more PACERFetchQueue requests until the one before is complete for each court, right? That seems like a sensible way to go about throttling. Two thoughts so far. First, I think you forgot to explain one of the branches of step 5 (see ??? below). My understanding is:
Second, rather than doing the