Vault postgres engine is slow when renewing leases and generating new credentials #29235

Open
jsanant opened this issue Dec 19, 2024 · 0 comments

jsanant commented Dec 19, 2024

Describe the bug

We started noticing lease renewals taking a long time and eventually erroring out with context deadline exceeded; rotation of Postgres credentials also failed.

This happens intermittently: a couple of minutes later the renewals go through and new credentials are generated again.

When renewing the lease, the request is canceled with this error:

failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
	* context canceled

To Reproduce
Steps to reproduce the behavior:

  1. Run vault write postgres/creds/my-role
  2. Run vault lease renew postgres/creds/my-role/j2hvxTOF2Kufh3Be0rMa9fX5
  3. See error failed to read lease entry postgres/creds/my-role/j2hvxTOF2Kufh3Be0rMa9fX5: context canceled
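To get a rough measure of the intermittent slowness, a timing loop along the following lines can be used. This is only a sketch: it assumes a database secrets engine mounted at postgres/ with a role named my-role, that VAULT_ADDR and VAULT_TOKEN are already set, and it uses vault read, which is the usual way to request dynamic credentials:

# Time 10 credential generations, 5 seconds apart, to catch the slow ones.
for i in $(seq 1 10); do
  start=$(date +%s)
  vault read -format=json postgres/creds/my-role > /dev/null
  echo "attempt $i: $(( $(date +%s) - start )) s"
  sleep 5
done

Renewals can be timed the same way by swapping in vault lease renew with a lease ID taken from one of the reads.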

Expected behavior
The request should go through; renewals and rotation should work without any timeouts.

Environment:

  • Vault Server Version:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    3
Threshold                2
Version                  1.10.3
Storage Type             raft
Cluster Name             demo-vault
Cluster ID               <redacted>
HA Enabled               true
HA Cluster               https://vault-2.vault-internal:8201
HA Mode                  active
Active Since             2024-12-18T15:19:51.305456306Z
Raft Committed Index     6166547
Raft Applied Index       6166547
  • Vault CLI Version:
Vault v1.10.3 (af866591ee60485f05d6e32dd63dde93df686dfb)
  • Server Operating System/Architecture:
Kubernetes version: v1.26.15
OS: Ubuntu 20.04.6 LTS

Vault server configuration file(s):

disable_mlock = true
ui = true

default_max_request_duration = "180s"

listener "tcp" {
  address = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_client_ca_file = "/certs/vault-ca.crt"
  tls_cert_file = "/certs/vault.crt"
  tls_key_file = "/certs/vault.pem"
}

telemetry {
  dogstatsd_addr = "<redacted>:8125"
  dogstatsd_tags = ["env:stage"]
  disable_hostname = false
}

storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "https://vault-0.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
  retry_join {
    leader_api_addr = "https://vault-1.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
  retry_join {
    leader_api_addr = "https://vault-2.vault-internal:8200"
    leader_ca_cert_file = "/certs/vault-ca.crt"
    leader_client_cert_file = "/certs/vault.crt"
    leader_client_key_file = "/certs/vault.pem"
  }
}

seal "awskms" {
 region     = "us-east-1"
 kms_key_id = "<redacted>"
}

service_registration "kubernetes" {}
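Two different timeouts are in play here: the client-side VAULT_CLIENT_TIMEOUT (how long the CLI waits, 60s by default) and the server-side default_max_request_duration above (the cap after which Vault cancels the request's context, 180s here). The sketch below just prints both so the two limits can be compared when a renewal times out; the exact field name and units in the sanitized config output are assumptions and may differ by version:

# Client-side limit (CLI environment variable, defaults to 60s if unset).
echo "client timeout: ${VAULT_CLIENT_TIMEOUT:-60s (default)}"

# Server-side limit, as reported by the sanitized config endpoint.
vault read -format=json sys/config/state/sanitized \
  | jq '.data.default_max_request_duration'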

Additional context

Things we have already validated:

  • We have checked the RDS instances to see if they are overloaded, and they are not. The instance types range from small to large, and the maximum number of connections is not being reached.
  • We analyzed the IOPS of the underlying SSD disk used for Raft storage and did not find any anomalies. The Raft DB size is around 850 MB.
  • We verified the K8S cluster as well, and there are no bottlenecks when communicating with the RDS instances.
  • We are also able to connect to the RDS instances from the Vault pods.
  • We also increased CPU and memory requests to rule out a resource crunch.
  • We have ~350 connections and ~500 roles in the Postgres engine.
  • We have also set VAULT_CLIENT_TIMEOUT to 300s in all Vault pods.
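Since dogstatsd telemetry is already configured, we can also capture a metrics snapshot from the active node while a renewal is slow, to see where the time goes. A rough sketch, assuming the standard sys/metrics endpoint is reachable with the CA and token below (metric names vary somewhat between Vault versions):

# Pull a Prometheus-format snapshot from the active node (vault-2) and keep
# only the expiration-manager and raft-related series.
curl -s --cacert /certs/vault-ca.crt \
  -H "X-Vault-Token: $VAULT_TOKEN" \
  "https://vault-2.vault-internal:8200/v1/sys/metrics?format=prometheus" \
  | grep -E "vault_expire|vault_raft"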
jsanant changed the title from "Vault postgres engine is slow when renewing leases and rotating credentials" to "Vault postgres engine is slow when renewing leases and generating new credentials" on Dec 19, 2024