NR-250703: cluster-autoscaler improvements for GCP #36

sachin-shankar · 2024-04-18T10:23:07Z

Add support for label-based auto-discovery of GCP Managed Instance Groups (MIGs); labels are the GCP counterpart of tags in Azure. Discovery also determines the min and max sizes of each MIG pool.

Add unit-tests for all the new code added.

Add GCP auto-discovery documentation.

Auto-Discovery Setup

To run a cluster-autoscaler which auto-discovers instance groups, use the --node-group-auto-discovery flag. There are 2 auto-discovery options to choose from.

NOTE - Only one of the 2 options can be used when configuring the --node-group-auto-discovery flag for cluster-autoscaler.

Auto-Discovery by Labels

For example, --node-group-auto-discovery=label:cluster-autoscaler-enabled=true,cluster-autoscaler-name=<YOUR CLUSTER NAME> finds all instance groups whose instance templates carry both labels with the given values.

NOTE

It is recommended to add a second label such as cluster-autoscaler-name=<YOUR CLUSTER NAME> when cluster-autoscaler-enabled=true is used across many clusters, so that Instance Groups belonging to other clusters are not picked up as node groups.
No --nodes flags are passed to cluster-autoscaler, because the node groups are discovered automatically from their labels.
No min/max values are passed on the command line with this option. cluster-autoscaler reads the "min" and "max" labels on the Instance Group resource in GCP and keeps the desired number of nodes within those limits.
If the Instance Group resource has no min/max labels, cluster-autoscaler falls back to the default min/max values of 0 and 1000 respectively.
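The label rules above can be illustrated with a small Python sketch. This is not the Go code from this PR: the function names (parse_label_spec, matches, size_limits) are hypothetical, labels are modelled as a plain dict regardless of which GCP resource they are read from, and rejecting max < min is an assumption that the notes do not state.

```python
from __future__ import annotations

# Defaults taken from the notes above; everything else is illustrative.
DEFAULT_MIN, DEFAULT_MAX = 0, 1000


def parse_label_spec(spec: str) -> dict[str, str]:
    """Parse 'label:k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    kind, _, body = spec.partition(":")
    if kind != "label" or not body:
        raise ValueError(f"not a label auto-discovery spec: {spec!r}")
    wanted: dict[str, str] = {}
    for pair in body.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        wanted[key] = value
    return wanted


def matches(wanted: dict[str, str], labels: dict[str, str]) -> bool:
    """True only if every requested label is present with the same value."""
    return all(labels.get(key) == value for key, value in wanted.items())


def size_limits(labels: dict[str, str]) -> tuple[int, int]:
    """Read the 'min' and 'max' labels, falling back to the defaults."""
    lo = int(labels.get("min", DEFAULT_MIN))
    hi = int(labels.get("max", DEFAULT_MAX))
    if hi < lo:  # assumption: an inverted range is treated as an error
        raise ValueError(f"max ({hi}) is below min ({lo})")
    return lo, hi


if __name__ == "__main__":
    wanted = parse_label_spec(
        "label:cluster-autoscaler-enabled=true,cluster-autoscaler-name=demo"
    )
    labels = {
        "cluster-autoscaler-enabled": "true",
        "cluster-autoscaler-name": "demo",
        "min": "1",
        "max": "5",
    }
    print(matches(wanted, labels), size_limits(labels))
```

Requiring every label to match is what lets the second label (cluster-autoscaler-name) separate clusters that all set cluster-autoscaler-enabled=true.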

Auto-Discovery by NamePrefix

For example, --node-group-auto-discovery=mig:namePrefix=test-lemon-peel-mp,min=2,max=10 uses a regular expression internally to find all instance groups whose names begin with test-lemon-peel-mp, and sets their minimum and maximum number of nodes to 2 and 10 respectively.

NOTE

The min and max key/value pairs are mandatory with this option, and max must be greater than min; no defaults are applied.
To add instance groups that do not share a name prefix, pass the --node-group-auto-discovery flag multiple times. For example:

--node-group-auto-discovery=mig:namePrefix=test-lemon-peel-mp,min=2,max=10
--node-group-auto-discovery=mig:namePrefix=confab-nodes,min=2,max=10

The name prefixes must be configured statically before the cluster-autoscaler container starts, which makes this option less flexible.
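As a rough illustration of the namePrefix option, the Python sketch below parses one or more mig: specs, enforces max > min, and matches instance group names against an anchored regular expression. MigSpec, parse_mig_spec and discover are hypothetical names, and letting the first matching spec win is an assumption; the actual Go implementation may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MigSpec:
    name_prefix: str
    min_size: int
    max_size: int

    def matches(self, mig_name: str) -> bool:
        # re.match anchors at the start, so only names that begin with
        # the (escaped) prefix are accepted.
        return re.match(re.escape(self.name_prefix), mig_name) is not None


def parse_mig_spec(spec: str) -> MigSpec:
    """Parse 'mig:namePrefix=<prefix>,min=<n>,max=<m>'."""
    kind, _, body = spec.partition(":")
    if kind != "mig" or not body:
        raise ValueError(f"not a mig auto-discovery spec: {spec!r}")
    fields: dict[str, str] = {}
    for pair in body.split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        fields[key] = value
    missing = {"namePrefix", "min", "max"} - fields.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")  # no defaults
    lo, hi = int(fields["min"]), int(fields["max"])
    if hi <= lo:
        raise ValueError("max must be greater than min")
    return MigSpec(fields["namePrefix"], lo, hi)


def discover(specs: list[MigSpec], mig_names: list[str]) -> dict[str, tuple[int, int]]:
    """Map each matching MIG name to the limits of the first spec it matches."""
    found: dict[str, tuple[int, int]] = {}
    for name in mig_names:
        for spec in specs:
            if spec.matches(name):
                found[name] = (spec.min_size, spec.max_size)
                break
    return found


if __name__ == "__main__":
    specs = [
        parse_mig_spec("mig:namePrefix=test-lemon-peel-mp,min=2,max=10"),
        parse_mig_spec("mig:namePrefix=confab-nodes,min=2,max=10"),
    ]
    print(discover(specs, ["test-lemon-peel-mp-a", "confab-nodes-1", "other"]))
```

Each repeated flag becomes one MigSpec, which mirrors passing --node-group-auto-discovery several times for groups without a common prefix.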


The grouping should be made by the schedulability equivalence meaning we can introduce optimizations to the binpacking. Introduce a benchmark that estimates capacity needed for 51k pods, which can be grouped to two equivalence groups 50k and 1k.


BigDarkClown and others added 30 commits January 11, 2024 16:45

Taint utils taking multiple taints

c4063ef

Merge pull request kubernetes#6378 from BigDarkClown/multitaint

838ba52

Taint utils taking multiple taints

Merge pull request kubernetes#5729 from vadasambar/feat/3071/scale-do…

ffb54c8

…wn-after-add-per-ng-poc feat: support `--scale-down-delay-after-*` per nodegroup

docs: precise AWS IAM policy example

d3d079e

Merge branch 'kubernetes:master' into feature/node_group_healthy_metrics

23843ad

feat:add node group health and back off metrics

ae0ab53

Merge pull request kubernetes#6363 from ElanHasson/patch-1

13c5875

Rancher: Fix error messages and expose underlying error.

Fix unit tests

6415be0

Add support for edge zones in Azure provider

e8ca5fd

Use exponential buckets for function_duration_seconds

aa3bab1

Existing bucketing is inconsistent. Specifically, the second to last bucket is [100, 1000), which is huge and doesn't allow to differentiate between something that took 2m (120s) and something that took 15m (900s).

Merge pull request kubernetes#6453 from x13n/master

d31e1cf

Use exponential buckets for function_duration_seconds

Merge pull request kubernetes#6336 from qianlei90/fix-kwok-provider

df0ce2d

fix(kwok): prevent quitting when scaling down node group

Only log when we fail to get NodeGroup for a node.

a47ef89

just moving controller_fetcher to use it out of the recommender

3a3b388

check targetRef of VPA against pod ownerRef

2bba2ba

Move to table-based tests.

e8e3ad0

Merge pull request kubernetes#6391 from voelzmo/fix/vpa-e2e-tests

ed25db1

Fix VPA e2e test failures

Allow draining when DaemonSet kind has custom API Group

e6f9ba1

use ParseGroupVersion in Drinable method

ea26159

feat:add node group health and back off metrics

5773f50

feat:add node group health and back off metrics

68e661f

Merge branch 'feature/node_group_healthy_metrics' of https://github.c…

4b9d4b1

…om/guopeng0/autoscaler into feature/node_group_healthy_metrics

Merge pull request kubernetes#6412 from shamil/validate_api_version_i…

779c1ba

…n_ds_v2 Allow draining when DaemonSet kind has custom API Group

Merge pull request kubernetes#6396 from guopeng0/feature/node_group_h…

a2f8902

…ealthy_metrics feat:add node group health and back off metrics

doc: cluster-autoscaler: Oracle provider: Add small security note

486184c

Bump CA Chart image to v1.29

3db3d22

prefer statefulset in TestGetMatchingVpa to avoid double ownerref RS/…

3c47994

…Deployment

TestGetMatchingVpa add pod matching selector but not matching ownerRef

a5cae3b

Introduce GceInstance that extends cloudprovider.Instance with Numeri…

0673eb0

…cId.

Consider atomic nodes

1af8021

mewa and others added 30 commits March 26, 2024 16:26

Add chart versions

6e6622f

Add script to update required chart versions in README

c4e0e58

Add chart version column in version matrix

c6f1d2b

Move cluster-autoscaler update-chart-version-readme script to /hack

89595cb

Only check recent revisions when updating README

6684448

Update min cluster-autoscaler chart for Kubernetes 1.29

297295a

Remove unused NodeInfoProcessor

9223c7e

Fix broken link in README.md to point to equinixmetal readme

3bd6e99

Merge pull request kubernetes#6662 from azylinski/rm-NodeInfoProcessor

7184d23

Remove unused NodeInfoProcessor

CA: Before we perform go test, synchronizing go vendor

7254888

Signed-off-by: Yuki Iwai <[email protected]>

Merge pull request kubernetes#6668 from tenzen-y/sync-vendor

74446d4

CA: Before we perform go test, synchronizing go modules

Update CAPI docs

a4760f6

Add a link to the sample manifest and update the image used in the example. Signed-off-by: Lennart Jern <[email protected]>

Bump golang in /vertical-pod-autoscaler/pkg/updater

315fabb

Bumps golang from 1.22.1 to 1.22.2. --- updated-dependencies: - dependency-name: golang dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>

Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

1f9035c

Bumps golang from 1.22.1 to 1.22.2. --- updated-dependencies: - dependency-name: golang dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>

Bump golang in /vertical-pod-autoscaler/pkg/recommender

4109085

Bumps golang from 1.22.1 to 1.22.2. --- updated-dependencies: - dependency-name: golang dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]>

Merge pull request kubernetes#6678 from Nordix/lentzi90/capi-docs

ec29b02

Update CAPI docs

Merge pull request kubernetes#6686 from kubernetes/dependabot/docker/…

cf64094

…vertical-pod-autoscaler/pkg/recommender/golang-1.22.2 Bump golang from 1.22.1 to 1.22.2 in /vertical-pod-autoscaler/pkg/recommender

Merge pull request kubernetes#6683 from kubernetes/dependabot/docker/…

609fb71

…vertical-pod-autoscaler/pkg/updater/golang-1.22.2 Bump golang from 1.22.1 to 1.22.2 in /vertical-pod-autoscaler/pkg/updater

Introduce binbacking optimization for similar pods.

5aa6b2c

The optimization uses the fact that pods which are equivalent do not need to be check multiple times against already filled nodes. This changes the time complexity from O(pods*nodes) to O(pods).

Merge pull request kubernetes#6667 from kisieland/refactor-estimation

425b91e

Refactor estimation

CA: Fix apis vendoring

c11cc43

Merge pull request kubernetes#6448 from blanchardma/docs/fix-aws-iam-…

bbe242e

…policy-example docs: precise AWS IAM policy example

Merge pull request kubernetes#6655 from laoj2/vpa-release-1.1

4294709

Bump default VPA version to 1.1.0

Merge pull request kubernetes#6695 from pmendelski/fix-autoscaler-vendor

3780203

CA: Fix apis vendoring

Copyright boilerplate

3a078ec

Merge pull request kubernetes#6663 from aayushrangwala/patch-1

0db7e54

Fix broken link in README.md to point to equinixmetal readme

Merge pull request kubernetes#6541 from mewa/master

a87d7ac

Include helm chart version in cluster-autoscaler version matrix

Merge pull request kubernetes#6684 from kubernetes/dependabot/docker/…

8273c9c

…vertical-pod-autoscaler/pkg/admission-controller/golang-1.22.2 Bump golang from 1.22.1 to 1.22.2 in /vertical-pod-autoscaler/pkg/admission-controller
