Add "2ha.sh" script, managing 2-node Canonical K8s HA AA clusters #692
Scenario overview:

* Canonical K8s cluster containing 2 nodes
* Dqlite data store (unable to obtain quorum)
* Primary node Dqlite files stored on DRBD
* synchronous block-level replication between the two nodes
* cluster monitoring and failover handled through Pacemaker

Script functionality:

* bootstrap the service
* wait for a DRBD primary to be elected
* detect the node role based on the DRBD status and Dqlite state
* have the replica wait for the primary to be ready before continuing
* recover Dqlite after failovers
* transfer and apply recovery files to secondary nodes
* transfer Dqlite files to DRBD and other backup locations, creating necessary symlinks
* install required packages
* purge all K8s data
* clear Pacemaker taints
* remove recovery data

"2ha.sh start_service" is intended to be used as part of a systemd unit that bootstraps the k8s services, coordinating with the other node and taking any necessary steps to recover Dqlite.
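To make the "wait for a DRBD primary" and "detect the node role" steps more concrete, here is a minimal sketch of how they might look in bash. It is not taken from `2ha.sh`. The function names, the `DRBD_STATUS_CMD` and `DRBD_RESOURCE` variables, and the `r0` resource name are all hypothetical. It also assumes the DRBD 9 style `drbdadm status <resource>` output, where the first line describes the local node and peers appear on later lines as `<peer> role:<Role>`. Only the DRBD half of role detection is covered; the Dqlite state checks are left out.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names below are hypothetical, not from 2ha.sh.

# Command used to query DRBD; overridable so the logic can be tested
# without a real DRBD setup.
DRBD_STATUS_CMD="${DRBD_STATUS_CMD:-drbdadm status}"
DRBD_RESOURCE="${DRBD_RESOURCE:-r0}"   # hypothetical resource name

# Print the raw DRBD status for the resource (empty output on failure).
drbd_status() {
    $DRBD_STATUS_CMD "$DRBD_RESOURCE" 2>/dev/null
}

# Succeed if any node (local or peer) currently reports the Primary role.
drbd_primary_elected() {
    local status
    status="$(drbd_status)" || status=""
    case "$status" in
        *role:Primary*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Print "primary", "replica" or "unknown" for the local node.
# Assumption: the first status line describes the local node.
get_node_role() {
    local status first_line
    status="$(drbd_status)" || status=""
    first_line="${status%%$'\n'*}"
    case "$first_line" in
        *role:Primary*)   echo "primary" ;;
        *role:Secondary*) echo "replica" ;;
        *)                echo "unknown" ;;
    esac
}

# Poll until a primary shows up or the timeout (in seconds) expires.
wait_for_drbd_primary() {
    local timeout="${1:-60}" interval="${2:-2}" elapsed=0
    until drbd_primary_elected; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for a DRBD primary" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

A startup path built on these helpers could call `wait_for_drbd_primary` first and then branch on the output of `get_node_role`, so that the replica only proceeds once the primary side is known.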
We're adding a guide that covers the 2-node A-A HA scenario.
Did an initial pass. Consider the rephrasing as suggestions and feel free to ignore them.
The script looks mostly fine. (You already know my opinion on large bash scripts: fine for now, but it should eventually be moved to Python or Go IMHO.)
Thanks for reviewing this PR! I'll address the comments right away.
I admit that OpenStack DevStack changed my perception of what a large bash script means, but I see your point.
57bd703 to aa01e0e
aa01e0e to 33ed437
LGTM
Amazing work on the 2-node HA set-up @petrutlucian94!
Please iterate over my polishing comments.
I am requesting changes because I would like to discuss the alternative solution with PostgreSQL.
30d0367 to 8d690c3
@louiseschmidtgen Thanks for reviewing the docs! I've addressed most comments and left a few questions.
A couple more small comments need to be addressed; afterwards we are good to go. Thank you @petrutlucian94!
Great work @petrutlucian94!
This PR also adds a "how-to" guide for the 2-node A-A HA setup.