[feature request] Optimizing Kruise-Manager Startup Performance in Large Clusters #1718

MichaelRren opened this issue Aug 30, 2024 · 3 comments

MichaelRren commented Aug 30, 2024

Why is this needed:

When managing a cluster with over 50,000 SidecarSets, restarting the kruise-manager poses a significant challenge. With the default settings of 3 reconciliation workers and a rate limiter set at 10 QPS, it can take over 30 minutes to resync all SidecarSets during startup. During this time, all deployments are effectively stalled because the reconciler is blocked.
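
For scale: with the limiter capped at 10 QPS, draining a queue of 50,000 objects takes at least 50,000 / 10 = 5,000 seconds (over 80 minutes) as a back-of-the-envelope lower bound, so the resync time grows linearly with the number of SidecarSets no matter how fast each individual reconcile is.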

Are there any solutions to prevent this issue? For example, could we implement a CreateFunc in predicate.Funcs that filters resources by their CreationTimestamp, so that objects created before the kruise-manager started are skipped during the initial resync and only SidecarSets with a non-empty partition are processed? A rough sketch of the idea follows.
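
A minimal sketch of that predicate, assuming controller-runtime's predicate.Funcs, the SidecarSet API type, and that "non-empty partition" can be approximated by the spec.updateStrategy.partition pointer being set (untested):

```go
package sidecarset

import (
	"time"

	appsv1alpha1 "github.com/openkruise/kruise/apis/apps/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// startupCreateFilter drops the Create events the informer replays for
// SidecarSets that already existed before this manager instance started,
// unless their update strategy still carries a partition (i.e. a rollout
// may still be in progress and needs to be driven forward).
func startupCreateFilter(managerStart time.Time) predicate.Funcs {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			ss, ok := e.Object.(*appsv1alpha1.SidecarSet)
			if !ok {
				return true
			}
			// Objects created after the manager started are genuinely new.
			if ss.CreationTimestamp.Time.After(managerStart) {
				return true
			}
			// Pre-existing objects are only reconciled on startup if a
			// partition is set.
			return ss.Spec.UpdateStrategy.Partition != nil
		},
	}
}
```

Such a filter could then be attached with the builder's WithEventFilter when the controller is constructed.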

This change would significantly reduce the startup time of kruise-manager in large clusters.


ABNER-1 commented Aug 30, 2024

Hello, @MichaelRren. The configuration must be set appropriately to avoid confusing our users, and I believe we should establish best practices to guide users in setting it properly.

For your scenario, I recommend increasing the number of reconciliation workers and raising the rate limiter's QPS; this adjustment should speed up the initial resync.
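
For reference, here is a rough sketch of where those knobs live in a controller-runtime based controller (using the pre-generics controller-runtime API; the values of 16 workers and 50 QPS / 500 burst are only illustrative, not tested recommendations, and this is not kruise-manager's actual wiring):

```go
package sidecarset

import (
	"time"

	"golang.org/x/time/rate"
	appsv1alpha1 "github.com/openkruise/kruise/apis/apps/v1alpha1"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupSidecarSetController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.SidecarSet{}).
		WithOptions(controller.Options{
			// More workers drain the startup queue in parallel
			// (the default reported in this issue is 3).
			MaxConcurrentReconciles: 16,
			// Keep the usual per-item exponential backoff, but raise the
			// overall bucket limiter above the default 10 QPS / 100 burst.
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(50), 500)},
			),
		}).
		Complete(r)
}
```

Keep in mind that client-side API throttling (the QPS and Burst fields on the client's rest.Config) can also cap throughput, so it may need raising in step with the workqueue limiter.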

In 1.7, there is a patch: 'Optimizing Pod SidecarSet webhook and controller performance when lots of namespace-scoped SidecarSets exist' (#1547, @ls-2018). If your SidecarSets are namespace-scoped, this will also help you.


MichaelRren commented Aug 30, 2024

Thanks for your reply, @ABNER-1. Unfortunately, the patch you suggested doesn't suit our scenario, because in our clusters all the SidecarSets live in the same namespace.

For now, we have to adjust the number of workers and the rate limiter to handle this situation.


ABNER-1 commented Sep 11, 2024

Hi, @MichaelRren
Would you like to share the process and results before and after your parameter tuning?
I believe that would make an excellent blog post on best practices.
