K8SPSMDB-1168: pbm: use unique names for downloaded certificates #1662

Merged on Sep 25, 2024 (2 commits)

mkuf (Contributor) commented Sep 23, 2024

CHANGE DESCRIPTION

Problem:
When the operator runs with watchAllNamespaces: true and psmdb clusters in multiple namespaces share the same name, backups may enter a failed state with the following error:

Error:                 create pbm object: create PBM connection to psmdb-db-rs0-0.psmdb-db-rs0.ns1.svc.cluster.local:27017,psmdb-db-rs0-1.psmdb-db-rs0.ns1.svc.cluster.local:27017,psmdb-db-rs0-2.psmdb-db-rs0.ns1.svc.cluster.local:27017: create mongo connection: connect: failed to find CERTIFICATE

According to the pbm agent, the backups completed successfully, but the operator marked them as failed because of the certificate error.

Cause:
The operator saves each cluster's certificates in its own filesystem as /tmp/<cluster-name>-ca.crt and /tmp/<cluster-name>-tls.pem.
If the same cluster name is used in several namespaces, those clusters all read and write the same two files, so a file is not guaranteed to contain the right cluster's certificate at the moment a connection is created.

Clusters would be something like this:

namespace   clustername
ns1         psmdb-db
ns2         psmdb-db
ns3         psmdb-db

This results in the following files in the operator pod:

$ ls -l /tmp/
total 24
-rw------- 1 daemon daemon 1090 Sep 23 09:07 psmdb-db-ca.crt
-rw------- 1 daemon daemon 4212 Sep 23 09:07 psmdb-db-tls.pem
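
The collision can be reproduced on paper with a small sketch. It is written in Python for brevity (the operator itself is written in Go), and `legacy_cert_paths` is a hypothetical helper that only mirrors the naming visible in the listing above; it is not the operator's code.

```python
from collections import defaultdict


def legacy_cert_paths(cluster_name: str) -> tuple[str, str]:
    """Hypothetical helper: paths derived from the cluster name only,
    following the file names seen in the operator pod."""
    return (f"/tmp/{cluster_name}-ca.crt", f"/tmp/{cluster_name}-tls.pem")


# The three clusters from the example above.
clusters = [("ns1", "psmdb-db"), ("ns2", "psmdb-db"), ("ns3", "psmdb-db")]

# Record which namespaces end up writing to each path.
owners = defaultdict(list)
for namespace, name in clusters:
    for path in legacy_cert_paths(name):
        owners[path].append(namespace)

for path, namespaces in sorted(owners.items()):
    print(path, "<-", ", ".join(namespaces))
```

All three namespaces map to the same two paths, so whichever cluster was reconciled last determines what the files contain.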

Solution:
This PR prefixes the certificate files written by the operator with the namespace of the cluster. Clusters that share a name in different namespaces then no longer share files, which should prevent the race condition.
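
A minimal sketch of the idea, in Python for brevity (the operator itself is written in Go). The helper name `cert_paths` and the exact `<namespace>-<cluster-name>` layout with a `-` separator are assumptions made for illustration; the description only states that the namespace is used as a prefix.

```python
def cert_paths(namespace: str, cluster_name: str, tmp_dir: str = "/tmp") -> tuple[str, str]:
    """Hypothetical helper: build per-cluster certificate paths that
    include the namespace, so equal cluster names do not collide."""
    # Assumed layout: <tmp_dir>/<namespace>-<cluster-name>-{ca.crt,tls.pem}
    prefix = f"{tmp_dir}/{namespace}-{cluster_name}"
    return (f"{prefix}-ca.crt", f"{prefix}-tls.pem")


# Same cluster name in three namespaces, as in the problem description.
clusters = [("ns1", "psmdb-db"), ("ns2", "psmdb-db"), ("ns3", "psmdb-db")]

all_paths = [path for ns, name in clusters for path in cert_paths(ns, name)]
for path in all_paths:
    print(path)
```

With this naming, each of the three clusters gets its own pair of files, six distinct paths in total.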

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

CLAassistant commented Sep 23, 2024

CLA assistant check
All committers have signed the CLA.

@egegunes egegunes self-assigned this Sep 24, 2024
@egegunes egegunes changed the title pbm: use unique names for downloaded certificates K8SPSMDB-1168: pbm: use unique names for downloaded certificates Sep 25, 2024
JNKPercona (Collaborator) commented

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
custom-tls passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 failure
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We ran 50 out of 50 tests.

commit: efc4989
image: perconalab/percona-server-mongodb-operator:PR-1662-efc49898

@hors hors merged commit cd107fc into percona:main Sep 25, 2024
13 of 14 checks passed
hors (Collaborator) commented Sep 25, 2024

Hi @mkuf, thank you for your contribution. This fix will be available in the next psmdb operator release.
