[feature request] Optimizing Kruise-Manager Startup Performance in Large Clusters #1718

MichaelRren opened this issue Aug 30, 2024 · 3 comments

MichaelRren commented Aug 30, 2024

Why is this needed:

When managing a cluster with over 50,000 SidecarSets, restarting the kruise-manager poses a significant challenge. With the default settings of 3 reconciliation workers and a rate limiter set at 10 QPS, it can take over 30 minutes to resync all SidecarSets during startup. During this time, all deployments are effectively stalled because the reconciler is blocked.
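
For scale: with the limiter capped at 10 QPS, draining a queue of 50,000 objects takes at least 50,000 / 10 = 5,000 seconds (over 80 minutes) as a back-of-the-envelope lower bound, so the resync time grows linearly with the number of SidecarSets no matter how fast each individual reconcile is.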

Are there any solutions to prevent this issue? For example, could we implement a CreateFunc in predicate.Funcs that filters resources by their CreationTimestamp, so that objects created before the kruise-manager started are skipped during the initial resync and only SidecarSets with a non-empty partition are processed? A rough sketch of the idea follows.
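
A minimal sketch of that predicate, assuming controller-runtime's predicate.Funcs, the SidecarSet API type, and that "non-empty partition" can be approximated by the spec.updateStrategy.partition pointer being set (untested):

```go
package sidecarset

import (
	"time"

	appsv1alpha1 "github.com/openkruise/kruise/apis/apps/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// startupCreateFilter drops the Create events the informer replays for
// SidecarSets that already existed before this manager instance started,
// unless their update strategy still carries a partition (i.e. a rollout
// may still be in progress and needs to be driven forward).
func startupCreateFilter(managerStart time.Time) predicate.Funcs {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			ss, ok := e.Object.(*appsv1alpha1.SidecarSet)
			if !ok {
				return true
			}
			// Objects created after the manager started are genuinely new.
			if ss.CreationTimestamp.Time.After(managerStart) {
				return true
			}
			// Pre-existing objects are only reconciled on startup if a
			// partition is set.
			return ss.Spec.UpdateStrategy.Partition != nil
		},
	}
}
```

Such a filter could then be attached with the builder's WithEventFilter when the controller is constructed.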

This change would significantly reduce the startup time of kruise-manager in large clusters.


ABNER-1 commented Aug 30, 2024

Hello, @MichaelRren. The configuration must be set appropriately to avoid confusing our users, and I believe we should establish best practices to guide users in setting it properly.

For your scenario, I recommend increasing the number of reconciliation workers and raising the rate limiter's QPS; this adjustment should speed up the initial resync.
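
For reference, here is a rough sketch of where those knobs live in a controller-runtime based controller (using the pre-generics controller-runtime API; the values of 16 workers and 50 QPS / 500 burst are only illustrative, not tested recommendations, and this is not kruise-manager's actual wiring):

```go
package sidecarset

import (
	"time"

	"golang.org/x/time/rate"
	appsv1alpha1 "github.com/openkruise/kruise/apis/apps/v1alpha1"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupSidecarSetController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.SidecarSet{}).
		WithOptions(controller.Options{
			// More workers drain the startup queue in parallel
			// (the default reported in this issue is 3).
			MaxConcurrentReconciles: 16,
			// Keep the usual per-item exponential backoff, but raise the
			// overall bucket limiter above the default 10 QPS / 100 burst.
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 500)},
			),
		}).
		Complete(r)
}
```

Keep in mind that client-side API throttling (the QPS and Burst fields on the client's rest.Config) can also cap throughput, so it may need raising in step with the workqueue limiter.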

In 1.7, there is a patch: 'Optimizing Pod SidecarSet webhook and controller performance when lots of namespace-scoped SidecarSets exist' (#1547, @ls-2018). If your SidecarSets are namespace-scoped, this will also help you.


MichaelRren commented Aug 30, 2024

Thanks for your reply, @ABNER-1. Unfortunately, the patch you suggested doesn't suit our scenario, because in our clusters all the SidecarSets live in the same namespace.

For now, we have to adjust the number of workers and the rate limiter to handle this situation.


ABNER-1 commented Sep 11, 2024

Hi, @MichaelRren
Would you like to share the process and results before and after your parameter tuning?
I believe that would make an excellent blog post on best practices.
