Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change time range for incomplete jobs #1506

Merged
merged 2 commits into from
Dec 30, 2024
Merged

change time range for incomplete jobs #1506

merged 2 commits into from
Dec 30, 2024

Conversation

terrazoon
Copy link
Contributor

@terrazoon terrazoon commented Dec 27, 2024

Description

During load testing we found that the scheduled check_job_status tasks assumes that any job that started between 30 and 35 minutes ago and has not finished is 'incomplete' and it will kick off and try to complete this job from what it thinks was the last place the job ran.

In practice what happened during the load test was that the job was almost finished (it had been running for 30 minutes and sent more than 9000 of the 10000 messages), but that the check_job_status declared the job incomplete, somehow determined that it should resume from row 5525, and then started trying to save 4000+ notifications to the db that were already there, resulting in IntegrityErrors and multiple retries and a long period of chaos where the db was being hammered, etc.

So for the short term, avoid changing the logic but change the definition of "too long" from 30 minutes to 4 hours. This gives us time to send over 50k messages, which is more than we currently support in in one job. This should allow all jobs to finish before the check_job_status() task jumps in and declares them incomplete.

Security Considerations

N/A

@terrazoon terrazoon self-assigned this Dec 27, 2024
@terrazoon terrazoon linked an issue Dec 27, 2024 that may be closed by this pull request
Copy link
Contributor

@ccostino ccostino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @terrazoon, this is a great catch!

As noted in Slack, we should get an issue written up for understanding the entire current set up of the incomplete job checks - and whether we should even be checking at the job level or leave it strictly at message send checks - and figuring out how we want it to work going forward.

This'll at least get things in a much better state for the time being in production while we embark on this next leg of the journey. 🙂

@ccostino ccostino merged commit cde3a6b into main Dec 30, 2024
7 checks passed
@ccostino ccostino deleted the notify-api-1505 branch December 30, 2024 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adjust timing in check_job_status
3 participants