Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix database inserts for notifications #1466

Open
ccostino opened this issue Dec 6, 2024 · 6 comments
Open

Fix database inserts for notifications #1466

ccostino opened this issue Dec 6, 2024 · 6 comments
Assignees
Labels
engineering python Pull requests that update Python code

Comments

@ccostino
Copy link
Contributor

ccostino commented Dec 6, 2024

After the several large batches we had go through this past week, we're still seeing a host of database insertion errors that are happening as a batch gets fully processed by the app.

Initially these come through as Celery retry errors that are the result of psycopg2 IntegrityError exceptions, and eventually they ballon into one-off IntegrityError exceptions outside of the retries themselves. We need to get to the bottom of this and examine when and why we're doing these database inserts, and why we're still running into a race condition after our recent fixes where the app still thinks it needs to insert a new record when it doesn't.

This is resulting in massive application churn and CPU/memory waste, not to mention contributing to all of the New Relic false positives.

If there's no way around this scenario and we still need to try to perform these inserts, then we need to do the work to convert these insert-only statements into insert or update (upsert) statements and handle each scenario accordingly.

@terrazoon
Copy link
Contributor

Look at this, maybe the problem is caused by the retries:

@notify_celery.task(bind=True, name="save-sms", max_retries=5, default_retry_delay=300)
def save_sms(self, service_id, notification_id, encrypted_notification, sender_id=None):

@terrazoon
Copy link
Contributor

Instead of 5 retries, are we really getting 25 retries?

save_api_sms has 5 retries but calls save_sms which has 5 retries by itself:

@notify_celery.task(
bind=True, name="save-api-sms", max_retries=5, default_retry_delay=300
)
def save_api_sms(self, encrypted_notification):
save_api_email_or_sms(self, encrypted_notification)

@xlorepdarkhelm xlorepdarkhelm self-assigned this Dec 12, 2024
@xlorepdarkhelm xlorepdarkhelm moved this from ⬇ Up-Next to 🏗 In progress (WIP: ≤ 3 per person) in Notify.gov product board Dec 12, 2024
@xlorepdarkhelm
Copy link
Contributor

So these IntegrityErrors... they are because the rows actually exist in the DB, right?

What if... we simply suppress them and have the job act like the task worked?

@terrazoon
Copy link
Contributor

We are catching these errors and logging them as is, but it's hammering the db.

@ccostino
Copy link
Contributor Author

Yeah if these don't need to happen in the first place, let's solve the root cause of that and prevent the condition to begin with instead of just suppressing the error; if this is ultimately trying to do a thing that doesn't need to be done, then we shouldn't do the thing in the first place.

@ccostino
Copy link
Contributor Author

Noting here that two PRs have been merged that should help take care of this:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
engineering python Pull requests that update Python code
Projects
Status: 🏗 In progress (WIP: ≤ 3 per person)
Development

No branches or pull requests

3 participants