PG server won't start after crash #4042

Open
2 of 5 tasks
zhangluva opened this issue Dec 5, 2024 · 0 comments

Please ensure you do the following when reporting a bug:

  • Provide a concise description of what the bug is.

  • Provide information about your environment.

  • Provide clear steps to reproduce the bug.
    We don't know how to reproduce this. The same system and configuration are deployed to multiple clusters, and only one has the problem. The database is used by another OTF application; the only difference is the load pattern.

  • Attach applicable logs. Please do not attach screenshots showing logs unless you are unable to copy and paste the log data.

  • Ensure any code / output examples are properly formatted for legibility.

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log directory on your Postgres instance.

An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.

If you are looking for general support, please view the support page for where you can ask questions.

Thanks for reporting the issue, we're looking forward to helping you!

Overview

When the primary server crashes, failover happens correctly, but the old primary won't start and reports the following error. Killing the pod clears the error, and the database container then starts correctly.

2024-12-04 15:35:46.387 | 2024-12-04 20:35:46.385 UTC [671109] HINT:  Terminate any old server processes associated with data directory "/pgdata/pg15". 
2024-12-04 15:35:46.387 | 2024-12-04 20:35:46,386 INFO:  stderr=2024-12-04 20:35:46.385 UTC [671109] FATAL:  pre-existing shared memory block (key 524289, ID 1) is still in use
2024-12-04 15:35:46.387 | 2024-12-04 20:35:46,386 INFO:  stdout= 
2024-12-04 15:35:46.386 | 2024-12-04 20:35:46,386 ERROR: Crash recovery finished with code=1 
2024-12-04 15:35:46.362 | 2024-12-04 20:35:46,362 INFO: doing crash recovery in a single user mode
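
To make the failure easier to spot across many restarts, the excerpt above can be scanned for the shared memory FATAL line. This is only a minimal sketch: the pattern is copied from the wording in the log lines above, and the helper name `find_stale_shmem` is hypothetical, not part of PGO, Patroni, or PostgreSQL.

```python
import re

# Pattern taken from the FATAL line quoted in this report; it is an
# assumption that other occurrences use the same wording.
SHMEM_FATAL = re.compile(
    r"FATAL:\s+pre-existing shared memory block "
    r"\(key (\d+), ID (\d+)\) is still in use"
)


def find_stale_shmem(lines):
    """Return a (key, shmid) tuple for every matching FATAL line."""
    hits = []
    for line in lines:
        match = SHMEM_FATAL.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Lines copied from the excerpt above, in chronological order.
sample = [
    "2024-12-04 20:35:46,362 INFO: doing crash recovery in a single user mode",
    "2024-12-04 20:35:46,386 ERROR: Crash recovery finished with code=1",
    "2024-12-04 20:35:46,386 INFO:  stderr=2024-12-04 20:35:46.385 UTC [671109] "
    "FATAL:  pre-existing shared memory block (key 524289, ID 1) is still in use",
]

print(find_stale_shmem(sample))
```

The extracted key and ID identify the shared memory segment that the old server process was still holding when crash recovery ran.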

Environment

Kubernetes version: 1.29.10
PGO version: 5.7.0
PG server version: 15.8 registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.8-2
Server resources:

  • cpu requests/limits: 2000m/3000m

  • memory requests/limits: 2500/3000

Please provide the following details:

  • Platform: EKS

  • Platform Version: 1.29.10

  • PGO Image Tag: ubi8-5.7.0-0

  • Postgres Version: ubi8-15.8-2

  • Storage: 10G PVC on GP3

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. Run ...
  2. Do ...
  3. Try ...

EXPECTED

  1. Provide the behavior that you expected.
    If the database process keeps restarting, the liveness probe should fail so that the pod gets restarted.
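
The expected behavior could be approximated by a check like the one below. It is a rough sketch, not how PGO or Patroni implement their probes: the function name `should_fail_liveness`, the `threshold` parameter, and the idea of feeding it recent log lines are all assumptions made for illustration. Only the "Crash recovery finished with code=" wording comes from the logs in this report.

```python
import re

# Wording taken from the ERROR line in this report's log excerpt.
RECOVERY_RESULT = re.compile(r"Crash recovery finished with code=(-?\d+)")


def should_fail_liveness(recent_lines, threshold=3):
    """Return True once `threshold` failed crash recoveries are seen.

    `recent_lines` is assumed to be a recent window of log lines; how
    that window is collected is outside the scope of this sketch.
    """
    failures = 0
    for line in recent_lines:
        match = RECOVERY_RESULT.search(line)
        if match and int(match.group(1)) != 0:
            failures += 1
            if failures >= threshold:
                return True
    return False


failed = "2024-12-04 20:35:46,386 ERROR: Crash recovery finished with code=1"
other = "2024-12-04 20:35:46,362 INFO: doing crash recovery in a single user mode"

print(should_fail_liveness([other, failed, other, failed]))
print(should_fail_liveness([other, failed, other, failed, other, failed]))
```

A probe built this way would report failure only after repeated failed recoveries, so a single slow recovery would not restart the pod.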

ACTUAL

  1. Describe what actually happens
    The database process keeps restarting and the pod never reaches the ready state (4/5 ready). If the new primary crashes during this time, the cluster becomes unavailable.

Logs

Please provide appropriate log output or any configuration files that may help troubleshoot the issue. DO NOT include sensitive information, such as passwords.

Additional Information

Please provide any additional information that may be helpful.
