fix: uneven load among clickhouse shards caused by retry error mechanism #357
Conversation
@nir3c thanks for the PR. Please check the failed tests.
proxy.go
Outdated
// comment s.host.dec() line to avoid double increment; issue #322
// s.host.dec()
@sigua-cs should I remove this comment, since I added the decrement on line 255?
Yes, you should, since your PR fixes this legacy quick & dirty (and broken) workaround.
done
Thanks for the PR.
LGTM! Great find.
currentHost := s.host
// decrement the current failed host counter and increment the new host
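For context, a minimal sketch of the bookkeeping this hunk performs; the host type, its methods, and swapHostOnRetry are illustrative stand-ins, not chproxy's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// host is a simplified stand-in for the proxy's host type.
type host struct {
	counter uint32 // in-flight requests
}

func (h *host) inc() { atomic.AddUint32(&h.counter, 1) }
func (h *host) dec() { atomic.AddUint32(&h.counter, ^uint32(0)) } // atomic -1

// swapHostOnRetry moves the in-flight accounting from the failed host to
// its replacement, so the final scope decrement hits the host that actually
// served the request instead of underflowing an untouched counter.
func swapHostOnRetry(failed, next *host) *host {
	failed.dec()
	next.inc()
	return next
}

func main() {
	failed := &host{counter: 1} // host that returned the 502
	next := &host{}             // replacement host picked by the balancer
	cur := swapHostOnRetry(failed, next)
	fmt.Println(failed.counter, cur.counter) // 0 1: accounting stays balanced
}
```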
It might also be a good idea to link the PR.
done
Thanks for the PR, and sorry for the long delay (we take a lot of days off in summer in France).
I left a few comments, but they're very minor. Please tell me if you want to fix them, so that we don't wait an extra two weeks before merging the PR.
Once the PR is merged, we will release a new version ASAP so that you can benefit from your fix.
proxyretry_test.go
Outdated
assert.Equal(t, 1, int(s.host.load()))
assert.Equal(t, 0, int(erroredHost.counter.load()))
assert.Equal(t, 5, int(erroredHost.penalty))
You should use the penaltySize variable instead of 5; it would make the test more readable.
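A minimal sketch of the suggested change; fakeHost, the test body, and the value of penaltySize are assumptions for illustration, not the real proxyretry_test.go:

```go
package proxy

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// penaltySize names the penalty applied to an errored host; the value 5
// is taken from the quoted assertion above.
const penaltySize = 5

// fakeHost stands in for the proxy's host type.
type fakeHost struct {
	counter uint32
	penalty uint32
}

func TestErroredHostPenalty(t *testing.T) {
	erroredHost := fakeHost{counter: 0, penalty: penaltySize}

	assert.Equal(t, 0, int(erroredHost.counter))
	// Asserting against the named constant instead of the literal 5 makes
	// the expected value's intent explicit.
	assert.Equal(t, penaltySize, int(erroredHost.penalty))
}
```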
done
1. Add a penaltySize constant for the test assertion of the errored host's load + penalty result, replacing the magic number.
2. Update comments.
Thanks for reviewing the code. I updated it based on your comments; feel free to review it again.
We should release a new version containing your fix on Monday.
Description
Fix uneven load among ClickHouse shards caused by the retry error mechanism (related to issues #325 and #322).
The issue happens when ChProxy sends an HTTP request to one of the shards with the retry mechanism enabled. When the shard responds with a 502, the scope's host is swapped with another one so the request can be retried. As a result, the initial host's counter is never decremented; instead, the new host's counter is decremented when the scope is finally released. If the new host's counter is 0 at that point, it is set to MaxUint32, because 0 - 1 wraps around to the maximum value of uint32.

Once this situation happens, the new host will not be able to receive any new requests: its counter value will always be greater than all other hosts' counters, and the balancing logic picks the next host based on the host's counter + penalty. The host's replica will likewise never be used, since its value will always be higher than all other replicas', as can be seen in the getReplica function.
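To make the underflow concrete, here is a minimal self-contained Go sketch; the bare uint32 counter and atomic decrement are assumptions that only mirror the behavior described above, not ChProxy's actual types:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

func main() {
	// A host's in-flight request counter, modeled as a bare uint32.
	var counter uint32

	// Atomic decrement: adding ^uint32(0) (i.e. -1 in two's complement)
	// wraps a zero counter around to MaxUint32 instead of going negative.
	atomic.AddUint32(&counter, ^uint32(0))

	// true: the host now looks maximally loaded and will never be picked
	// by a least-loaded balancing function.
	fmt.Println(counter == math.MaxUint32)
}
```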
Pull request type
Please check the type of change your PR introduces:
Checklist
Does this introduce a breaking change?
Further comments