fix: uneven load among clickhouse shards caused by retry error mechanism #357
Conversation
@nir3c thanks for the PR. Please check the failed tests.
proxy.go
Outdated
// comment s.host.dec() line to avoid double increment; issue #322
// s.host.dec()
@sigua-cs should I remove this comment, since I added the decrement on line 255?
Yes, you should, since your PR fixes this legacy quick & dirty (and broken) workaround.
done
Thanks for the PR.
LGTM! Great find.
currentHost := s.host
// decrement the current failed host counter and increment the new host
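For context, a minimal sketch of the bookkeeping this hunk performs; the host type, its methods, and swapHostOnRetry are illustrative stand-ins, not chproxy's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// host is a simplified stand-in for the proxy's host type.
type host struct {
	counter uint32 // in-flight requests
}

func (h *host) inc() { atomic.AddUint32(&h.counter, 1) }
func (h *host) dec() { atomic.AddUint32(&h.counter, ^uint32(0)) } // atomic -1

// swapHostOnRetry moves the in-flight accounting from the failed host to
// its replacement, so the final scope decrement hits the host that actually
// served the request instead of underflowing an untouched counter.
func swapHostOnRetry(failed, next *host) *host {
	failed.dec()
	next.inc()
	return next
}

func main() {
	failed := &host{counter: 1} // host that returned the 502
	next := &host{}             // replacement host picked by the balancer
	cur := swapHostOnRetry(failed, next)
	fmt.Println(failed.counter, cur.counter) // 0 1: accounting stays balanced
}
```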
It might also be a good idea to link the PR.
done
Thanks for the PR, and sorry for the long delay (we take a lot of days off in summer in France).
I left a few comments, but they're very minor. Please tell me if you want to fix them, so that we don't wait an extra two weeks before merging the PR.
Once the PR is merged, we will release a new version ASAP so that you can benefit from your fix.
proxyretry_test.go
Outdated
assert.Equal(t, 1, int(s.host.load()))
assert.Equal(t, 0, int(erroredHost.counter.load()))
assert.Equal(t, 5, int(erroredHost.penalty))
You should use the penaltySize variable instead of 5; it would make the test more readable.
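A minimal sketch of the suggested change; fakeHost, the test body, and the value of penaltySize are assumptions for illustration, not the real proxyretry_test.go:

```go
package proxy

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// penaltySize names the penalty applied to an errored host; the value 5
// is taken from the quoted assertion above.
const penaltySize = 5

// fakeHost stands in for the proxy's host type.
type fakeHost struct {
	counter uint32
	penalty uint32
}

func TestErroredHostPenalty(t *testing.T) {
	erroredHost := fakeHost{counter: 0, penalty: penaltySize}

	assert.Equal(t, 0, int(erroredHost.counter))
	// Asserting against the named constant instead of the literal 5 makes
	// the expected value's intent explicit.
	assert.Equal(t, penaltySize, int(erroredHost.penalty))
}
```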
done
1. Add a penaltySize constant for the test assertion of the errored host's load + penalty result, replacing the magic number.
2. Update comments.
Thanks for reviewing the code. I updated it based on your comments; feel free to review it again.
We should release a new version containing your fix on Monday.
Description
Fix uneven load among ClickHouse shards caused by the retry error mechanism (related to issues #325 and #322).
The issue happens when ChProxy sends an HTTP request to one of the shards with the retry mechanism enabled. When the shard responds with a 502, the scope's host is swapped with another one so the request can be retried. As a result, the initial host's counter is never decremented; instead, the new host's counter is decremented when the scope is finally released. If the new host's counter is 0 at that point, it is set to MaxUint32, because 0 - 1 wraps around to the maximum value of uint32.

Once this situation happens, the new host will not be able to receive any new requests: its counter value will always be greater than all other hosts' counters, and the balancing logic picks the next host based on the host's counter + penalty. The host's replica will likewise never be used, since its value will always be higher than all other replicas', as can be seen in the getReplica function.
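To make the underflow concrete, here is a minimal self-contained Go sketch; the bare uint32 counter and atomic decrement are assumptions that only mirror the behavior described above, not ChProxy's actual types:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

func main() {
	// A host's in-flight request counter, modeled as a bare uint32.
	var counter uint32

	// Atomic decrement: adding ^uint32(0) (i.e. -1 in two's complement)
	// wraps a zero counter around to MaxUint32 instead of going negative.
	atomic.AddUint32(&counter, ^uint32(0))

	// true: the host now looks maximally loaded and will never be picked
	// by a least-loaded balancing function.
	fmt.Println(counter == math.MaxUint32)
}
```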
Pull request type
Please check the type of change your PR introduces:
Checklist
Does this introduce a breaking change?
Further comments