uploader: Improve upload retry policy #61
Conversation
Force-pushed from 4c56dc9 to 4a43662
Force-pushed from 4f56725 to d509819
Force-pushed from c37cc5a to b4c376a
Have 2 tiers of retries:
- a shorter cycle when trying to upload to primary or backup storage, where we retry a couple of times for up to 1m
- a longer cycle wrapping those 2, to sustain a potentially long-running crisis; we wait longer between each retry and keep trying for up to 1h

The first loop solves for transient errors when saving to primary or backup storage, while the second loop solves for longer-running incidents, for which it's probably better to keep trying for a long time instead of dropping the process and losing the recording.
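Not the PR's actual code, but a minimal sketch of how the two tiers could be wired together with github.com/cenkalti/backoff/v4 (which the diff below appears to use, given the backoff.BackOff return type). The helper expBackOff, the writePrimary/writeBackup callbacks, and the exact intervals are illustrative assumptions:

```go
package uploader

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// expBackOff builds an exponential backoff capped at a total elapsed time.
// Hypothetical helper name; the repo uses its own constructor.
func expBackOff(initial, maxInterval, maxElapsed time.Duration) backoff.BackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = initial
	b.MaxInterval = maxInterval
	b.MaxElapsedTime = maxElapsed
	return b
}

// uploadWithRetries shows the two tiers: a short inner cycle per storage
// target, wrapped by an outer cycle that keeps trying for up to an hour.
func uploadWithRetries(writePrimary, writeBackup func() error) error {
	outer := expBackOff(1*time.Minute, 10*time.Minute, 1*time.Hour)
	return backoff.Retry(func() error {
		// Inner tier: a few quick attempts against primary storage.
		if err := backoff.Retry(writePrimary, expBackOff(5*time.Second, 10*time.Second, 30*time.Second)); err == nil {
			return nil
		}
		// Primary failed; fall back to backup storage with its own short cycle.
		if err := backoff.Retry(writeBackup, expBackOff(5*time.Second, 10*time.Second, 30*time.Second)); err != nil {
			return fmt.Errorf("primary and backup upload failed: %w", err)
		}
		return nil
	}, outer)
}
```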
Force-pushed from ce8d72c to 8e1577c
core/uploader.go
}

func SingleRequestRetryBackoff() backoff.BackOff {
	return newExponentialBackOffExecutor(5*time.Second, 10*time.Second, 30*time.Second)
}

const segmentWriteTimeout = 5 * time.Minute
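newExponentialBackOffExecutor itself isn't shown in this hunk; a plausible shape for it, inferred from the call site above and assuming github.com/cenkalti/backoff/v4 (a sketch, not the repo's real implementation):

```go
package core

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// Plausible body for the helper called above; inferred, not copied from the repo.
func newExponentialBackOffExecutor(initial, maxInterval, maxElapsed time.Duration) *backoff.ExponentialBackOff {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = initial
	b.MaxInterval = maxInterval
	b.MaxElapsedTime = maxElapsed // stop retrying once this much time has passed in total
	return b
}
```

A caller would then pass the result to backoff.Retry, e.g. backoff.Retry(uploadOnce, SingleRequestRetryBackoff()).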
Reminder that we need to lower this value.
Should take this into consideration. Looks like Storj writes are really slow.
https://eu-metrics-monitoring.livepeer.live/grafana/d/JZNaFMv4z/vod-monitoring?orgId=1&refresh=5s&from=now-24h&to=now&viewPanel=7
I've added a timetaken field to the "succeeded" log line so that we can get an idea of how long these uploads are taking. I'll then lower it in a further PR, does that sound ok?
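Roughly what that looks like, as a sketch using the standard library logger (the repo's actual logging call, field names, and wrapper function are assumptions here):

```go
package core

import (
	"context"
	"log"
	"time"
)

// uploadAndLog wraps a hypothetical single-upload call and records how long it took.
func uploadAndLog(ctx context.Context, upload func(context.Context) error, uri string) error {
	start := time.Now()
	if err := upload(ctx); err != nil {
		return err
	}
	// The "succeeded" log line, now with a time-taken field.
	log.Printf("Uploaded segment uri=%s timetaken=%s", uri, time.Since(start))
	return nil
}
```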
WDYT of making this an env var config, like the backup storage? So at least we can tune it better without a new code change
ah good idea 👍
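A sketch of the env-var idea, with a made-up variable name (SEGMENT_WRITE_TIMEOUT and the helper are illustrative; the actual name and default would be decided in the follow-up):

```go
package core

import (
	"os"
	"time"
)

// segmentWriteTimeoutFromEnv returns the segment write timeout, overridable via
// an environment variable. The variable name is hypothetical.
func segmentWriteTimeoutFromEnv() time.Duration {
	const defaultTimeout = 5 * time.Minute
	if v := os.Getenv("SEGMENT_WRITE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return defaultTimeout
}
```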
@victorges I've just tweaked the outer retries down a bit, does that look ok to you?
LGTM!
Implement 2 tiers of retries:
- a shorter cycle, when trying to upload to primary or backup storage, where we retry a couple of times for up to 30s
- a longer cycle wrapping those 2, to sustain a potentially long-running crisis; we wait longer between each retry and keep trying for up to 1h

The first loop solves for transient errors when saving to primary or backup storage, while the second loop solves for longer-running incidents, for which it's probably better to keep trying for a long time instead of dropping the process and losing the segment.
The inner retry loop also avoids waiting too long before falling back to the backup storage. We need to avoid waiting too long, otherwise the recordings could start to get processed while segments were still being uploaded to storage.
In order to be able to hold the upload attempt for so long, I also changed the logic to save the segment to a temporary local file instead of keeping it in memory. This keeps memory usage low in case we have a long-running incident.
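A minimal sketch of that temporary-file buffering, assuming os.CreateTemp; the helper name and cleanup strategy are illustrative, not the PR's actual code:

```go
package uploader

import (
	"io"
	"os"
)

// bufferToTempFile copies the incoming segment to a temp file on disk so that
// long retry cycles don't have to hold the whole segment in memory. It returns
// the file positioned at the start, ready to be re-read on each attempt.
func bufferToTempFile(input io.Reader) (*os.File, error) {
	f, err := os.CreateTemp(os.TempDir(), "upload-*")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(f, input); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	// Rewind so each upload attempt can read the segment from the beginning.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	return f, nil
}
```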