
Make a script to get the docs we already know are >= 1000 pages long #4839

Open
elisa-a-v opened this issue Dec 18, 2024 · 2 comments · May be fixed by #4901
@elisa-a-v
Contributor

No description provided.

@mlissner mlissner moved this to Backlog Dec 23 - Jan 3 (🔍) in Sprint (Web Team) Dec 18, 2024
@mlissner mlissner moved this from Backlog Dec 23 - Jan 3 (🔍) to Backlog Dec 23 - Jan 3 Final (🌲) in Sprint (Web Team) Dec 20, 2024
@mlissner mlissner moved this from Backlog Dec 23 - Jan 10 (🎉) to To Do in Sprint (Web Team) Dec 23, 2024
@elisa-a-v elisa-a-v moved this from To Do to In progress in Sprint (Web Team) Jan 2, 2025
@elisa-a-v
Contributor Author

@ERosendo, @albertisfu and I talked extensively about this yesterday, and we decided we should probably divide the script into two stages: a first stage where we fetch all relevant docs from PACER, and a second stage where we process them. This helps manage the last rounds of the round-robin process, because we checked and the court with the highest number of big docs has over 270 more of them than the next court, so that one court will be the only one queried in the last ~270 rounds.

First stage

So far we were fetching and processing each document in a similar way to what do_pacer_fetch does, which meant waiting for a document to be fully processed before a worker became available again. Given how little control we have over how long that takes, it wasn't easy to make sure we didn't overload a single court. Restricting the task that fetches the docs from PACER to a single queue with a single worker would have solved this, but since that wasn't so easy either, we opted for a different approach:

  1. We first identify which docs need to be fetched. This is a simple query that filters RECAPDocument instances that are not available in our DB and that have 1000 pages or more.
  2. We then do the round-robin to fetch docs from PACER. Fetching a doc from PACER normally involves three tasks (fetch, process, finalize), but we're only doing the first one here because that's the only one that needs to be throttled. The other tasks have a much wider range of possible execution times, which made managing the round-robin process harder with more than one worker.
  3. After adding the task for each doc, we store the RECAPDocument id in cache so we know which docs need to be processed in the second stage. This also serves as a recovery mechanism in case the command is interrupted for some reason.
  4. After each round, we keep track of the court id of the last doc fetched, as well as the id of the PacerFetchQueue created for that doc, in local variables.
  5. Before starting the next round, we check the last court processed, and if the court for the first doc in the new round is the same, we check that PacerFetchQueue's status. If it's still in progress, we wait a few seconds and try again (with exponential back-off and a max number of retries, so we don't get stuck waiting on a PacerFetchQueue that errored out and never had its status updated, which is a known issue). If, on the other hand, the PacerFetchQueue has a successful status, we check when it was last updated: if that was less than 2 seconds ago, we wait; otherwise we add the new fetch task to the queue. A rough sketch of this loop is below the list.
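
Roughly, the first-stage loop could look something like this. All the helper names here (get_big_docs, round_robin_by_court, enqueue_fetch, fetch_queue_status, cache_doc_id) are illustrative placeholders, not the actual code:

import time

MAX_RETRIES = 5   # give up waiting on a stuck PacerFetchQueue after this many checks
MIN_SPACING = 2   # seconds to keep between requests to the same court

last_court_id = None
last_fq_id = None

for doc in round_robin_by_court(get_big_docs()):  # hypothetical helpers
    if doc.court_id == last_court_id and last_fq_id is not None:
        # Same court as the previous fetch: wait until that PacerFetchQueue
        # is no longer in progress, backing off exponentially, and give up
        # after MAX_RETRIES so a stale status can't block us forever.
        delay = MIN_SPACING
        for _ in range(MAX_RETRIES):
            status, seconds_since_update = fetch_queue_status(last_fq_id)
            if status != "in_progress":
                break
            time.sleep(delay)
            delay *= 2  # exponential back-off
        # Even after a successful fetch, keep MIN_SPACING seconds between hits.
        if seconds_since_update < MIN_SPACING:
            time.sleep(MIN_SPACING - seconds_since_update)

    last_fq_id = enqueue_fetch(doc)  # fetch-only task, single queue and worker
    cache_doc_id(doc.pk)             # remembered for the second stage
    last_court_id = doc.court_id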

Second stage

After we've fetched all relevant docs, we still need to process them. This could take a while, but there's no issue with adding as many workers as possible since we're no longer interacting with PACER. We just have to identify the docs that were successfully fetched from the list in cache, then add the tasks that process those docs and mark their PacerFetchQueue successful. The order of execution for these tasks is now irrelevant, and the rate is only limited by our own resources, so this part should be pretty straightforward. We don't even need to restrict this to a single queue, though it's probably best not to use queues shared with other services so we don't interfere with them.
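
A minimal sketch of that second stage, again with placeholder names (cached_doc_ids, fetch_queue_id_for, process_doc, mark_fq_successful, and the big_docs queue are all illustrative, not existing code):

from celery import chain

for doc_id in cached_doc_ids():          # ids stored during the first stage
    fq_id = fetch_queue_id_for(doc_id)   # hypothetical lookup of the PacerFetchQueue
    if fq_id is None:
        continue  # the fetch never succeeded; leave it for a retry pass
    # No PACER interaction here, so these can fan out across as many workers
    # as we like, on a queue other services don't use.
    chain(
        process_doc.si(doc_id),
        mark_fq_successful.si(fq_id),
    ).apply_async(queue="big_docs")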

@mlissner what do you think?

@mlissner
Member

So the idea is, generally, not to create more PacerFetchQueue requests for a given court until the previous one is complete, right? Seems like a sensible way to go about throttling.

Two thoughts so far. First, I think you forgot to explain one of the branches of step 5 (see ??? below).

My understanding is:

last_court = None
things_to_download = [list of IDs, ready for court-based round-robin]
for thing in things_to_download:
    if thing.court == last_court:
        time_elapsed_since_last_scrape = get_timing_for_previous_fetch_for_court(thing.court)
        if time_elapsed_since_last_scrape < 2:
            sleep(2)
        else:
            fetch(thing)
    else:
        # ???

Second, rather than doing the sleep(2), which pauses the whole loop, could you just skip that and add a sleep(2) at the end of each round-robin loop?
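
A rough sketch of that alternative, reusing the placeholder names from the pseudocode above plus a hypothetical rounds() helper that yields one doc per court per round:

for court_round in rounds(things_to_download):
    for thing in court_round:
        fetch(thing)
    sleep(2)  # one pause per round instead of sleeping inside the per-court check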
