
Sidekiq jobs are getting run twice, leading to database locks #2333

Closed
2 of 4 tasks
maxkadel opened this issue Apr 3, 2024 · 8 comments · Fixed by pulibrary/princeton_ansible#4840
Labels
bug: The application does not work as expected because of a defect

Comments

maxkadel (Contributor) commented Apr 3, 2024

For more detailed notes, see the Datadog notebook.

Ensure Sidekiq jobs are not created twice with the same job ID. This is very likely a Redis latency issue.

History

We have a monitoring alert that is going off repeatedly for bibdata, saying that Postgres queries are taking a very long time, sometimes as much as 15 minutes (at which point they are probably being killed by Postgres rather than finishing).

The Postgres query that is taking so long is:

SELECT dump_types.* FROM dump_types WHERE dump_types.constant = ? LIMIT ?

This query appears to be issued from the AlmaDumpTransferJob.

It seems possible that this job is somehow getting called twice with different GlobalIDs, causing a database lock. If so, it could be a Redis latency issue or a Sidekiq thread-management issue (again, see the Datadog notebook).
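For context, the query shape above is what ActiveRecord emits for a find_by against the constant column. A hypothetical sketch (the class usage and value below are assumptions, not the actual bibdata code):

```ruby
# Hypothetical sketch -- not the actual AlmaDumpTransferJob code.
# A lookup like this through ActiveRecord produces the slow query above:
#   SELECT dump_types.* FROM dump_types WHERE dump_types.constant = ? LIMIT ?
dump_type = DumpType.find_by(constant: "ALL_RECORDS") # constant value is illustrative

# If two copies of the job do the same work concurrently and each wraps this
# lookup plus later writes in a transaction, the second copy can sit waiting on
# row locks held by the first, which would look like a 15-minute SELECT.
```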

Acceptance Criteria

  • Reconfigure staging to use the central staging Redis server
  • Reconfigure qa to use the central staging Redis server
  • Reconfigure production to use the central production Redis server (we need to make sure to flush existing jobs and stop the workers before switching production over)
  • Check journalctl -u bibdata-workers.service --grep AlmaDumpTransferJob on the worker box to confirm this issue is no longer happening (a complementary console check is sketched after this list)
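To complement the journalctl check in the last item, a console sketch like the following (public Sidekiq API; the queue name is an assumption) could surface duplicate AlmaDumpTransferJob enqueues directly:

```ruby
# Sketch: look for duplicated AlmaDumpTransferJob entries from a Rails console
# on the worker box. The queue name "default" is an assumption.
require "sidekiq/api"

jobs = Sidekiq::Queue.new("default").select { |j| j.display_class == "AlmaDumpTransferJob" }

# The same job ID appearing more than once, or the same arguments enqueued
# twice, would both support the "run twice" hypothesis.
dup_jids = jobs.group_by(&:jid).select { |_, group| group.size > 1 }.keys
dup_args = jobs.group_by(&:display_args).select { |_, group| group.size > 1 }.keys
puts({ duplicate_jids: dup_jids, duplicate_args: dup_args }.inspect)
```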
maxkadel added the bug label Apr 3, 2024
maxkadel changed the title from "AlmaDumpTransferJob generates long-running / locking postgres queries" to "Sidekiq jobs are getting run twice, leading to database locks" Apr 3, 2024
maxkadel (Contributor, Author) commented Apr 4, 2024

A Sidekiq thread on migrating to a new Redis suggests using Redis replication.

To migrate, you set up the new server as a replica of the old primary, let it replicate, shut down the workers, shut down the old primary, promote the new replica to primary, and start the workers back up.
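A rough sketch of that sequence using the redis gem (host names are placeholders; in practice this would more likely be done with redis-cli or Ansible than from Ruby):

```ruby
# Rough sketch of the replication-based cutover; host names are placeholders.
require "redis"

new_redis = Redis.new(host: "new-redis.example.edu", port: 6379)

# 1. Make the new server a replica of the old primary and let it sync.
new_redis.slaveof("old-redis.example.edu", 6379)

# 2. Stop the Sidekiq workers, wait for replication to catch up, stop the old primary.

# 3. Promote the new server to primary (equivalent to SLAVEOF NO ONE).
new_redis.slaveof("no", "one")

# 4. Point the workers' Redis URL at the new server and start them back up.
```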

Beck-Davis (Contributor) commented Apr 4, 2024

  • change BIBDATA_REDIS_URL in group_vars
  • ssh to each worker box & make sure no jobs are running/queued
  • stop sidekiq workers
  • run entire bibdata playbook from branch with site-config flag
  • restart sidekiq workers on the boxes
  • make sure sidekiq is still running
  • run a background job on bibdata & check how many jobs are run (see the sketch after this list)
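For that last step, a minimal console sketch (SomeTestJob is a placeholder for any inexpensive existing job):

```ruby
# Sketch: enqueue one job and confirm it is processed exactly once.
# SomeTestJob is a placeholder; substitute any cheap existing job.
require "sidekiq/api"

before = Sidekiq::Stats.new.processed
SomeTestJob.perform_later          # .perform_async for a plain Sidekiq worker class
sleep 30                           # give the workers time to pick it up
after = Sidekiq::Stats.new.processed

puts "processed delta: #{after - before}"   # expect 1; 2 would reproduce the bug
```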

hackartisan (Member):

Please make sure to use the central redis var described in https://github.com/pulibrary/princeton_ansible/blob/main/roles/redis/README.md

maxkadel (Contributor, Author) commented Apr 5, 2024

Thanks @hackartisan! I had missed this.

maxkadel (Contributor, Author) commented Apr 9, 2024

Is Sidekiq connecting to Redis twice?

INFO: Sidekiq Pro 7.2.0, commercially licensed. Thanks for your support!
Apr 08 06:27:53 bibdata-alma-worker-staging1 sidekiq[865]: 2024-04-08T10:27:53.704Z pid=865 tid=49l INFO: Sidekiq 7.2.2 connecting to Redis with options {:size=>10, :pool_name=>"internal", :url=>"redis://lib-redis-staging1.princeton.ed>
Apr 08 06:27:53 bibdata-alma-worker-staging1 sidekiq[865]: 2024-04-08T10:27:53.714Z pid=865 tid=49l INFO: Sidekiq 7.2.2 connecting to Redis with options {:size=>2, :pool_name=>"default", :url=>"redis://lib-redis-staging1.princeton.edu:>
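Both lines above come from the same pid (865) and name different pools (internal vs. default), so this looks like one process opening two connection pools rather than two Sidekiq processes polling the same Redis. One way to check how many worker processes are actually registered is the Sidekiq process set:

```ruby
# Sketch: list the Sidekiq worker processes registered in Redis (public Sidekiq API).
require "sidekiq/api"

Sidekiq::ProcessSet.new.each do |process|
  puts [process["hostname"], process["pid"], process["concurrency"], process["queues"]].inspect
end
```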

maxkadel (Contributor, Author):

The production Redis server is still on Redis 6.0, which is too old for our current version of Sidekiq. See the Princeton Ansible ticket to upgrade this server.
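(Sidekiq 7 requires Redis 6.2 or newer.) A quick way to confirm what version a given server is running, assuming the redis gem is available and substituting the real URL:

```ruby
# Sketch: check the server version behind a Redis URL; Redis#info returns a
# hash that includes "redis_version". The env var name mirrors the Ansible
# variable and is an assumption -- substitute the actual URL if it differs.
require "redis"

url = ENV.fetch("BIBDATA_REDIS_URL", "redis://localhost:6379")
puts Redis.new(url: url).info["redis_version"]
```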

kevinreiss (Member):

We should talk to Ops about how to transition the production environment to the same version as staging, or discuss alternative plans.

maxkadel (Contributor, Author):

This may be related to #1959 (a newer Honeybadger error); it may be addressed by a Postgres configuration change.

christinach added and then removed the template-update label Apr 23, 2024