Vault postgres engine is slow when renewing leases and generating new credentials #29235

Open
jsanant opened this issue Dec 19, 2024 · 0 comments

jsanant commented Dec 19, 2024

Describe the bug

We started noticing lease renewals taking a long time and eventually erroring out with context deadline exceeded; rotation of Postgres credentials also failed.

This happens intermittently: a couple of minutes later the renewals go through and new credentials are generated again.

When renewing the lease, the request is canceled with this error:

failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
	* context canceled

To Reproduce
Steps to reproduce the behavior:

  1. Run vault write postgres/creds/my-role
  2. Run vault lease renew postgres/creds/my-role/j2hvxTOF2Kufh3Be0rMa9fX5
  3. See error failed to read lease entry postgres/creds/my-role/j2hvxTOF2Kufh3Be0rMa9fX5: context canceled
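To get a rough measure of the intermittent slowness, a timing loop along the following lines can be used. This is only a sketch: it assumes a database secrets engine mounted at postgres/ with a role named my-role, that VAULT_ADDR and VAULT_TOKEN are already set, and it uses vault read, which is the usual way to request dynamic credentials:

# Time 10 credential generations, 5 seconds apart, to catch the slow ones.
for i in $(seq 1 10); do
  start=$(date +%s)
  vault read -format=json postgres/creds/my-role > /dev/null
  echo "attempt $i: $(( $(date +%s) - start )) s"
  sleep 5
done

Renewals can be timed the same way by swapping in vault lease renew with a lease ID taken from one of the reads.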

Expected behavior
The request should go through; renewals and rotation should work without any timeouts.

Environment:

  • Vault Server Version:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    3
Threshold                2
Version                  1.10.3
Storage Type             raft
Cluster Name             demo-vault
Cluster ID               <redacted>
HA Enabled               true
HA Cluster               https://vault-2.vault-internal:8201
HA Mode                  active
Active Since             2024-12-18T15:19:51.305456306Z
Raft Committed Index     6166547
Raft Applied Index       6166547
  • Vault CLI Version:
Vault v1.10.3 (af866591ee60485f05d6e32dd63dde93df686dfb)
  • Server Operating System/Architecture:
Kubernetes version: v1.26.15
OS: Ubuntu 20.04.6 LTS

Vault server configuration file(s):

disable_mlock = true
ui = true

default_max_request_duration = "180s"

listener "tcp" {
  address = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_client_ca_file = "/certs/vault-ca.crt"
  tls_cert_file = "/certs/vault.crt"
  tls_key_file = "/certs/vault.pem"
}

telemetry {
  dogstatsd_addr = "<redacted>:8125"
  dogstatsd_tags = ["env:stage"]
  disable_hostname = false
}

storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "https://vault-0.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
  retry_join {
    leader_api_addr = "https://vault-1.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
  retry_join {
    leader_api_addr = "https://vault-2.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
}

seal "awskms" {
 region     = "us-east-1"
 kms_key_id = "<redacted>"
}

service_registration "kubernetes" {}
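Two different timeouts are in play here: the client-side VAULT_CLIENT_TIMEOUT (how long the CLI waits, 60s by default) and the server-side default_max_request_duration above (the cap after which Vault cancels the request's context, 180s here). The sketch below just prints both so the two limits can be compared when a renewal times out; the exact field name and units in the sanitized config output are assumptions and may differ by version:

# Client-side limit (CLI environment variable, defaults to 60s if unset).
echo "client timeout: ${VAULT_CLIENT_TIMEOUT:-60s (default)}"

# Server-side limit, as reported by the sanitized config endpoint.
vault read -format=json sys/config/state/sanitized \
  | jq '.data.default_max_request_duration'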

Additional context

Things we have already validated:

  • We have checked the RDS instances to see if they are overloaded, and they are not. The instance types range from small to large, and the maximum number of connections is not being reached.
  • We analyzed the IOPS of the underlying SSD disk used for Raft storage and did not find any anomalies. The Raft DB size is around 850 MB.
  • We verified the K8S cluster as well, and there are no bottlenecks when communicating with the RDS instances.
  • We are also able to connect to the RDS instances from the Vault pods.
  • We also increased CPU and memory requests to rule out a resource crunch.
  • We have ~350 connections and ~500 roles in the Postgres engine.
  • We have also set VAULT_CLIENT_TIMEOUT to 300s in all Vault pods.
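Since dogstatsd telemetry is already configured, we can also capture a metrics snapshot from the active node while a renewal is slow, to see where the time goes. A rough sketch, assuming the standard sys/metrics endpoint is reachable with the CA and token below (metric names vary somewhat between Vault versions):

# Pull a Prometheus-format snapshot from the active node (vault-2) and keep
# only the expiration-manager and raft-related series.
curl -s --cacert /certs/vault-ca.crt \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  "https://vault-2.vault-internal:8200/v1/sys/metrics?format=prometheus" \
  | grep -E "vault_expire|vault_raft"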
jsanant changed the title from "Vault postgres engine is slow when renewing leases and rotating credentials" to "Vault postgres engine is slow when renewing leases and generating new credentials" on Dec 19, 2024