-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spread reconfigure in one rack #116
base: master
Are you sure you want to change the base?
Conversation
I would suggest to calculate the wait time instead of iterating. E.g. something like this: func waitForTicker(hostname string, interval time.Duration) {
var (
index int
err error
)
index, err = strconv.Atoi(hostname[len(hostname)-1:])
if err != nil {
index = 1
log.Warn("unable to parse leaf number from hostname, not spreading switch reloads", "hostname", hostname, "error", err)
return
}
// Get the next time the ticker should tick
now := time.Now()
next := now.Truncate(interval).Add(interval)
// If the index is odd, add half the interval to the next time
if index%2 != 0 {
next = next.Add(interval / 2)
}
// If the wait is longer than the interval, we can start earlier
if wait > interval {
wait = wait - interval
}
wait := next.Sub(now)
log.Info("Waiting", wait, "until", next, "to start trigger")
time.Sleep(wait)
} |
As an addition I would also skip the reconfiguration if it had just finished, to give BGP more time to propagate it's changes: diff --git a/cmd/internal/core/reconfigure-switch.go b/cmd/internal/core/reconfigure-switch.go
index 27fd446..998a8f8 100644
--- a/cmd/internal/core/reconfigure-switch.go
+++ b/cmd/internal/core/reconfigure-switch.go
@@ -18,11 +18,18 @@ import (
// ReconfigureSwitch reconfigures the switch.
func (c *Core) ReconfigureSwitch() {
+ var last time.Time
+
t := time.NewTicker(c.reconfigureSwitchInterval)
host, _ := os.Hostname()
for range t.C {
c.log.Info("trigger reconfiguration")
start := time.Now()
+ if start.Sub(last) < c.reconfigureSwitchInterval {
+ c.log.Info("skiping reconfiguration because of last reconfiguration was too recent")
+ continue
+ }
+
err := c.reconfigureSwitch(host)
elapsed := time.Since(start)
c.log.Info("reconfiguration took", "elapsed", elapsed)
@@ -48,6 +55,8 @@ func (c *Core) ReconfigureSwitch() {
c.log.Error("notification about switch reconfiguration failed", "error", err)
c.metrics.CountError("reconfiguration-notification")
}
+
+ last = time.Now()
}
} |
Now switched to cron based scheduling, will implement this logic accordingly |
References metal-stack/metal-roles#254 |
I think we decided not to do it because of too many negative implications. Vote for closing. |
We sporadically see network spikes which correlates with configuration reloads which happen at the same time in one rack.
With this, the start time of the reconfiguration is shifted by the half duration of the reconfiguration interval based on the hostname suffix of the switch. Switches which do not have a number as suffix are not spreading.
with this we have a 10 second spread with a 20 sec reconciliation interval
Keep this PR open, but as draft because we do not have the actual need for this.