Hello,
This is an edge case we got stuck in very recently. During some increased testing we reached our default vCPU quota in the GCP project, but because we didn't have alerts configured yet, it went unnoticed for a while, until we realized that the Nomad autoscaler wasn't scaling down nodes in the cluster.
The scenario:
gce-mig is trying to scale up, but the vCPU quota in the project has been reached. The instance group has errors of this type:
QUOTA_EXCEEDED dev-nomad-client-zpgq europe-west1-b Creating Jan 25, 2024, 3:38:35 PM UTC+01:00 Instance 'dev-nomad-client-zpgq' creation failed: Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.
The Nomad autoscaler periodically checks the MIG but goes no further because the MIG is not ready (it is stuck trying to scale up):
[TRACE] policy_manager.policy_handler: target is not ready: policy_id=c00c0934-a44e-c3eb-5200-e71e44255633
In our case we requested a vCPU quota increase and that solved the problem. The MIG finished the scale-up it had started many hours earlier, then the Nomad autoscaler saw the MIG as ready and started to function properly again (in our case by scaling down, as the load had decreased a lot).
I would like to know if this could have been handled better. Maybe the autoscaler could check the MIG errors, or force a scaling event (not sure if that's possible).
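As a rough illustration of the "check the MIG errors" idea, here is a minimal Python sketch that filters a MIG's error list for quota failures. It assumes the input is the JSON printed by `gcloud compute instance-groups managed list-errors <mig> --zone <zone> --format=json`, with each entry carrying `error.code`, `error.message` and `instanceActionDetails`. The function name `quota_errors` and the sample payload are hypothetical; the sample reuses the message from this report, and a real payload would normally hold a full instance URL rather than a bare name. This is not how the autoscaler works today, only one way an external check could look.

```python
import json


def quota_errors(list_errors_json: str) -> list[dict]:
    """Return a summary of every MIG error whose code is QUOTA_EXCEEDED.

    Assumes the input is a JSON array shaped like the output of
    `gcloud compute instance-groups managed list-errors ... --format=json`.
    """
    hits = []
    for entry in json.loads(list_errors_json):
        err = entry.get("error", {})
        if err.get("code") != "QUOTA_EXCEEDED":
            continue
        details = entry.get("instanceActionDetails", {})
        hits.append(
            {
                "instance": details.get("instance", ""),
                "action": details.get("action", ""),
                "message": err.get("message", ""),
            }
        )
    return hits


# Hypothetical sample payload built from the error quoted in this issue.
SAMPLE = json.dumps(
    [
        {
            "error": {
                "code": "QUOTA_EXCEEDED",
                "message": "Instance 'dev-nomad-client-zpgq' creation failed: "
                "Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.",
            },
            "instanceActionDetails": {
                "action": "CREATING",
                "instance": "dev-nomad-client-zpgq",
            },
        }
    ]
)

found = quota_errors(SAMPLE)
for hit in found:
    print(f"{hit['instance']}: {hit['message']}")
```

A check like this could run on a schedule and raise an alert, which would have surfaced the stuck scale-up well before the missing scale-downs were noticed.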
No, unfortunately I don't. I should have tried making a new scale call to the MIG myself to see what would happen, but I didn't think of that :/ Possibly an explicit call to scale down could have overridden the scale-up it was stuck on.
Hello again,
We ran into this same problem today, and this time I tried changing the MIG size in GCP and it worked. After I lowered the size of the group, the MIG got into a ready state again and the Nomad autoscaler recovered.
So GCP is also being sneaky here: the MIG stays stuck, constantly trying to scale up, even after the CPU quota is hit.
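For anyone who wants to script the manual workaround, below is a minimal Python sketch that only builds the `gcloud compute instance-groups managed resize` command and does not run it. Resizing down to the number of instances that are actually running is an assumption made for this sketch; the comment above only says the size was lowered. The function name `resize_command`, the MIG name `dev-nomad-client` and the instance count are hypothetical, and a regional MIG would need `--region` instead of `--zone`.

```python
def resize_command(mig_name: str, zone: str, running: int, min_size: int = 0) -> list[str]:
    """Build a gcloud command that sets the MIG target size to the number of
    instances that actually exist, so the pending scale-up is dropped.

    The command is returned as an argument list and is not executed here.
    """
    if running < 0 or min_size < 0:
        raise ValueError("instance counts must not be negative")
    # Never go below the floor the group is expected to keep (assumption).
    size = max(running, min_size)
    return [
        "gcloud", "compute", "instance-groups", "managed", "resize",
        mig_name,
        f"--size={size}",
        f"--zone={zone}",
    ]


# Hypothetical values: the MIG name and instance count are not from this issue.
cmd = resize_command("dev-nomad-client", "europe-west1-b", running=12, min_size=3)
print(" ".join(cmd))
```

Returning the argument list keeps the sketch free of side effects; it can be passed to `subprocess.run(cmd, check=True)` once the target size has been reviewed.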