PBM-1391: Enabling PITR after physical restore causes PSMDB to crash #1019

Open: boris-ilijic wants to merge 4 commits into dev from PBM-1391-crash-after-physical-restore

boris-ilijic commented Sep 23, 2024

PR for https://perconadev.atlassian.net/browse/PBM-1391

After a physical restore with PITR, all PSMDB secondary nodes fail with the following fatal error:

{"t":{"$date":"2024-09-12T06:12:55.515+00:00"},"s":"F",  "c":"ASSERT",   "id":23081,   "ctx":"ReplWriterWorker-6","msg":"Invariant failure","attr":{"expr":"!needsRenaming || allowRenameOutOfTheWay","msg":"Current collection name: (None), UUID: 0ea5a363-872c-4380-87c6-ff426f0b0a85. Future collection name: admin.pbmPITR","file":"src/mongo/db/catalog/create_collection.cpp","line":835}}
{"t":{"$date":"2024-09-12T06:12:55.515+00:00"},"s":"F",  "c":"ASSERT",   "id":23082,   "ctx":"ReplWriterWorker-6","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2024-09-12T06:12:55.515+00:00"},"s":"F",  "c":"CONTROL",  "id":6384300, "ctx":"ReplWriterWorker-6","msg":"Writing fatal message","attr":{"message":"\n"}}
{"t":{"$date":"2024-09-12T06:12:55.515+00:00"},"s":"F",  "c":"CONTROL",  "id":6384300, "ctx":"ReplWriterWorker-6","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

The crash occurs because the Primary and the Secondaries execute the DDL operations in a different order:

  • Primary: PITR apply -> dropping PBM collections
  • Secondary: dropping PBM collections -> applying PITR (during sync from Primary)

This PR fixes the problem by omitting the DDL operation (the drop on the Primary) on PBM's collections during the physical restore phase and replacing it with a delete-all operation.
The PR also creates all remaining PBM system collections at the first start of the PBM Agent, to reduce the number of DDL operations within PITR.
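
The two changes described above can be sketched roughly as follows. This is an illustration only, not the PR's actual Go code: the function names and the `PBM_COLLECTIONS` tuple are hypothetical, and `db` is assumed to be any object that behaves like a `pymongo` `Database` (`list_collection_names()`, `create_collection(name)`, and `db[name].delete_many({})`).

```python
# Hypothetical subset of PBM's system collections; only admin.pbmPITR
# is named in this PR, the real list is longer.
PBM_COLLECTIONS = ("pbmPITR",)


def ensure_collections(db, names=PBM_COLLECTIONS):
    """Create any missing collections up front (e.g. at agent start).

    Returns the names that were actually created, so a second call
    on the same database returns an empty list.
    """
    existing = set(db.list_collection_names())
    created = []
    for name in names:
        if name not in existing:
            db.create_collection(name)  # the only DDL, done once at startup
            created.append(name)
    return created


def clear_collections(db, names=PBM_COLLECTIONS):
    """Empty the collections with a delete-all instead of dropping them.

    delete_many({}) removes every document but leaves the collection,
    and therefore its UUID, in place.
    """
    for name in names:
        db[name].delete_many({})
```

With collections created ahead of time and cleared rather than dropped, the restore path in this sketch issues no create or drop for these collections.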

... with PITR

After a PITR physical restore, data was inconsistent between the Primary
and Secondary nodes.
The reason was that the PITR oplog apply and the collection drop are
executed in reverse order:
- on Primary: [PITR oplog apply] -> [dropping PBM databases]
- on each Secondary: [dropping PBM databases] -> [catch up from Primary
including oplog apply]

Not using DDL operations (drop in this case) for PBM's system
collections fixes the problem.
That ensures that the collections will not be created during PITR,
which eliminates the possible problem of having different UUIDs.
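
The UUID point can be illustrated with a toy model. This is not MongoDB's catalog code; `ToyCatalog` and its methods are hypothetical and only mimic one property: a collection receives a fresh UUID when it is created, so drop followed by re-creation changes the UUID, while deleting all documents does not.

```python
import uuid


class ToyCatalog:
    """Toy stand-in for one node's catalog: name -> (uuid, documents)."""

    def __init__(self):
        self._colls = {}

    def create(self, name):
        # A newly created collection gets a fresh UUID; an existing one is kept.
        if name not in self._colls:
            self._colls[name] = (uuid.uuid4(), [])
        return self._colls[name][0]

    def insert(self, name, doc):
        self.create(name)  # implicit creation on first write
        self._colls[name][1].append(doc)

    def drop(self, name):
        # DDL: the collection and its UUID are gone.
        self._colls.pop(name, None)

    def delete_all(self, name):
        # DML: documents are removed, the collection and its UUID stay.
        if name in self._colls:
            self._colls[name][1].clear()

    def uuid_of(self, name):
        entry = self._colls.get(name)
        return entry[0] if entry else None


node = ToyCatalog()
original = node.create("admin.pbmPITR")
node.insert("admin.pbmPITR", {"chunk": 1})

node.delete_all("admin.pbmPITR")
uuid_kept = node.uuid_of("admin.pbmPITR") == original

node.drop("admin.pbmPITR")
node.insert("admin.pbmPITR", {"chunk": 2})  # re-created by the write
uuid_changed = node.uuid_of("admin.pbmPITR") != original
```

In this model `uuid_kept` and `uuid_changed` are both true, which is the reason a delete-all avoids re-creating a collection under a different UUID.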

boris-ilijic force-pushed the PBM-1391-crash-after-physical-restore branch from 4ddd467 to 717044e on September 24, 2024 06:03
boris-ilijic marked this pull request as ready for review on September 24, 2024 09:16