Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

markReadOnlyNodesInOrc broke gracefulfailover process #566

Open
jianhaiqing opened this issue Jul 22, 2020 · 5 comments
Open

markReadOnlyNodesInOrc broke gracefulfailover process #566

jianhaiqing opened this issue Jul 22, 2020 · 5 comments

Comments

@jianhaiqing
Copy link
Contributor

jianhaiqing commented Jul 22, 2020

https://github.com/presslabs/mysql-operator/blob/28ef010d9c7e6acc2518b78ef44ae220040818f0/pkg/controller/orchestrator/orchestrator_reconcile.go#L515

As you can see the following logs from orchestrator, the writable function broke the gracefulfailover process, since master is set read_only as true, and wait for CandidateReplica catch up with given binlog-position.
Can we check if cluster is ready or not before set master writable ?

2020-07-22 08:34:18 INFO topology_recovery: will handle DeadMaster event on nodetest-mysql-1.nodetest-mysql.jonathan:3306
2020-07-22 08:34:18 INFO auditType:recover-dead-master instance:nodetest-mysql-1.nodetest-mysql.jonathan:3306 cluster:nodetest-mysql-1.nodetest-mysql.jonathan:3306 message:problem found; will recover
2020-07-22 08:34:18 INFO topology_recovery: Running 1 PreFailoverProcesses hooks
2020-07-22 08:34:18 INFO topology_recovery: Running PreFailoverProcesses hook 1 of 1: /usr/local/bin/orc-helper failover-in-progress 'nodetest.jonathan' '' || true
2020-07-22 08:34:18 INFO CommandRun(/usr/local/bin/orc-helper failover-in-progress 'nodetest.jonathan' '' || true,[])
2020-07-22 08:34:18 INFO CommandRun/running: bash /tmp/orchestrator-process-cmd-033210013
[martini] Completed 200 OK in 11.10911ms
[martini] Started GET /api/audit-recovery/nodetest.jonathan for 172.29.1.36:35640
[martini] Completed 200 OK in 6.968061ms
[martini] Started GET /api/cluster/nodetest.jonathan for 172.29.1.36:35640
[martini] Completed 200 OK in 13.440635ms
[martini] Started GET /api/master/nodetest.jonathan for 172.29.1.36:35640
[martini] Completed 200 OK in 12.028423ms
## failover process is broken by the logical. 
[martini] Started GET /api/set-writeable/nodetest-mysql-1.nodetest-mysql.jonathan/3306 for 172.29.1.36:35640

some discussion about the issue in orchestrator. openark/orchestrator#1213

@AMecea
Copy link
Contributor

AMecea commented Jul 22, 2020

Hi @jianhaiqing, thank you for investigating this and reporting it!

Cluster condition FailoverInProgress which is set using the Orchestrator pre failover hook should prevent this. Maybe the cluster is not updated in time during reconciliation, we need to investigate this a little bit more.

@jianhaiqing
Copy link
Contributor Author

As you know failover process of orchestrator is very fast. reconcile process and failover process are totally isolated, so you can't guarantee the executing sequence.
What's more, I think there's no need to set writable and readable for master and replica repeatedly, since cluster is ready.

func (ou *orcUpdater) markReadOnlyNodesInOrc(insts InstancesSet, master *orc.Instance) {
// If there is an in-progress failover, we will not interfere with readable/writable status on this iteration.
	fip := ou.cluster.GetClusterCondition(api.ClusterConditionFailoverInProgress)
	if fip != nil && fip.Status == core.ConditionTrue {
		ou.log.Info("cluster has a failover in progress, will delay setting readable/writeable status until failover is complete",
			"since", fip.LastTransitionTime)
		return
	}
	// If there is an graceful failover, we will not interfere with readable/writable status on this iteration.
	// so I think it's nice to check the cluster status before set-writable repeatedly.
	if ou.cluster.IsClusterReady() {
		return
	}
...
}

@jianhaiqing
Copy link
Contributor Author

jianhaiqing commented Jul 23, 2020

let me elaborate the executing sequence of these two process: orchestrator and orchestrator controller

1. orchestrator failover: topology before failover

master-A
 + replica-B
 + replica-C
  1. orchestrator will pick B as candidate master according to the binlog executed of B and C
  2. move C below B
  3. set A read_only true , and read binlog-position of A named as currentPosition,
  4. wait for B to reach currentPosition.
  5. stop replica , reset replica of B. at the moment, A and B is totally isolated.
  6. set B read_only=false
  7. change A to B ; At the moment, the topology of mysqlcluster is as follows
Master-B
 + replica-C
 + replica-A

2. orchestrator_controller, just repeatedly set master writable or replica read-only .

And pre-failover will set cluster in fail-over state, and suppose to prevent master and replica read_only is set
but, as you know, orchestrator and orchestrator_controller are totally isolated,
orchestrator_controller might just pass the failover check, and orchestrator failover is executed, and executed in phase 5, and orchestrator_controller continue to set master writable.Then the gtid-set of A is larger than that of B.

@AMecea
Copy link
Contributor

AMecea commented Jul 24, 2020

A solution can be to let the orchestrator set nodes in read-only as @dougfales proposes in #522. But then we can't make use of the read-only mode properly, because, in case of a failover the cluster will end up being writable and this is what we wanted to avoid making operator setting the read-only state.

Please let me know what is your propose.
Thank you!

@dougfales
Copy link
Contributor

Yes, @AMecea is right, our change does trade the cluster readOnly mode for a promise that the operator won't touch the readable/writable status of the master, during failover or any other time. It's a trade-off, but we didn't need the read-only mode, so it was worth it for us. I think we may have seen the scenario you describe @jianhaiqing, during some of our testing before we switched to the approach in #522 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants